Introduction

In data warehousing, schema design is critical to ensuring efficient data retrieval and analysis.

Two common schema models used in designing data warehouses are the Star Schema and the Snowflake Schema.

Let's dive deep into these two schemas, compare them, and explore other relevant topics that connect to it.

💡

You might wanna give a read to Principles of Database Design before diving into this one.

Star Schema

A Star Schema is a simple database schema used in data warehousing. It consists of one or more fact tables referencing any number of dimension tables. The fact table contains the primary data to be analyzed, and the dimension tables provide descriptive attributes related to the facts.

Structure

Fact Tables

Purpose: Contain measurable, quantitative data. Typically very large and store historical data.
Examples: Sales amounts, transaction quantities, profit margins.

Dimension Tables

Purpose: Contain descriptive attributes related to fact data. Usually smaller and support filtering, grouping, and labeling.
Examples: Product details, time periods, store locations.

Consider a retail sales data warehouse. The schema might look like this:

Fact Table: Sales

sale_id	product_id	time_id	store_id	amount	quantity
1	101	202101	301	500	5
2	102	202101	302	300	3

Dimension Table: Product

product_id	product_name	category
101	Laptop	Electronics
102	Smartphone	Electronics

Dimension Table: Time

time_id	date	month	quarter	year
202101	2021-01-15	Jan	Q1	2021

Dimension Table: Store

store_id	store_name	location
301	Store A	New York
302	Store B	Los Angeles

Advantages

Simplicity: Easy to understand and design.
Query Performance: Optimized for read-heavy operations due to fewer joins.
Denormalization: Reduces the number of joins needed for queries, improving performance.

Disadvantages

Redundancy: Denormalization leads to data redundancy, increasing storage requirements.
Maintenance: Updates and maintenance can be more challenging due to redundancy. Any change in the data might require updates in multiple places.

Snowflake Schema

A Snowflake Schema is a more complex database schema used in data warehousing. It is a normalized form of the star schema where dimension tables are further split into sub-dimension tables. This design reduces redundancy but increases the complexity of the database.

Structure

Fact Table: Central table containing quantitative data.
Normalized Dimension Tables: Dimension tables are normalized, splitting data into additional tables.

Using the same retail sales data warehouse, a snowflake schema might look like this:

Fact Table: Sales

sale_id	product_id	time_id	store_id	amount	quantity
1	101	202101	301	500	5
2	102	202101	302	300	3

Dimension Table: Product

product_id	product_name	category_id
101	Laptop	1
102	Smartphone	1

Sub-Dimension Table: Category

category_id	category_name
1	Electronics

Dimension Table: Time

time_id	date	month_id
202101	2021-01-15	1

Sub-Dimension Table: Month

month_id	month_name	quarter	year
1	Jan	Q1	2021

Dimension Table: Store

store_id	store_name	location_id
301	Store A	1
302	Store B	2

Sub-Dimension Table: Location

location_id	location_name
1	New York
2	Los Angeles

Advantages

Normalization: Reduces data redundancy and storage requirements by normalizing the dimension tables.
Data Integrity: Easier to maintain data integrity with normalized tables, reducing the risk of anomalies.
Scalability: More scalable for handling complex queries and large datasets due to the normalized structure.

Disadvantages

Complexity: More complex design and querying due to additional joins. Understanding and managing the schema requires more effort.
Performance: Potentially slower query performance due to the need for more joins to retrieve data.
Query Complexity: Requires more complex SQL queries, making it harder to write and maintain.

When to Use Star Schema vs. Snowflake Schema

Use Star Schema When

Query Performance is Critical: If the primary use case involves read-heavy operations and fast query performance, the star schema's denormalization can be advantageous.
Simplicity is Preferred: If easily understandable and simplicity in design are important, the star schema is more straightforward.
Reporting Needs: When the primary goal is to generate simple and quick reports, the star schema is often preferred due to its simplicity.

Use Snowflake Schema When

Data Integrity is Crucial: If maintaining data integrity and reducing redundancy are top priorities, the snowflake schema’s normalization helps achieve these goals.
Storage Efficiency: When minimizing storage requirements is necessary, the normalized structure of the snowflake schema is beneficial.
Complex Queries: When the database needs to support complex queries that require multiple joins, the snowflake schema's structure can be advantageous despite the increased complexity.

Other Relevant Topics in Schema Design

In addition to understanding the Star and Snowflake schemas, it’s essential to understand other key concepts that improve the efficiency, integrity, and scalability of data warehouses.

Let's dive into Surrogate Keys, Slowly Changing Dimensions (SCD), Conformed Dimensions, Data Vault Modeling, and the differences between Data Marts and Data Warehouses.

Surrogate Keys

Surrogate keys are artificial or synthetic keys used as primary keys in database tables. Unlike natural keys, product from the actual data (e.g., a social security number), surrogate keys are typically numeric and auto-incremented, having no business meaning.

Benefits

Simplify Joins: Surrogate keys provide a simple, consistent key for joining tables.
Improve Performance: Numeric surrogate keys are generally faster for database operations than complex natural keys.
Ensure Uniqueness: Surrogate keys guarantee unique identifiers for each record, avoiding issues with natural keys that might not be unique.
Decoupling Data: They allow changes in business logic or source data without affecting database schema.

Let's create a Product Dimension Table with a Surrogate Key:

CREATE TABLE product (
    product_id INT AUTO_INCREMENT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(100)
);

-- Inserting data into the Product table
INSERT INTO product (product_name, category)
VALUES ('Laptop', 'Electronics'), ('Smartphone', 'Electronics');

-- Querying the Product table
SELECT * FROM product;

product_id	product_name	category
1	Laptop	Electronics
2	Smartphone	Electronics

in this case, AUTO_INCREMENT ensures that each new record gets a unique, automatically incremented integer value.

Why Use Surrogate Keys Instead of Natural Keys?

Consistency and Simplicity

Imagine you use product_name as a natural key. If the product name changes (e.g., 'Laptop' to 'Notebook'), you must update this key in all related tables, which can be cumbersome and error-prone.

With a surrogate key like product_id, you can change product_name without affecting the key. The key remains consistent, simplifying data management.

Performance

Natural keys can be long strings or combinations of multiple columns, making them less efficient for indexing and joining tables. For example, product_name might require more storage and processing power.

Surrogate keys are typically single-column, numeric, and small in size, making them faster for indexing and joining.

Uniqueness

Natural keys come from business data may not always be unique. For instance, if you use name and address as a key, two users could share the same name and address, causing issues.

Surrogate keys are always unique, ensuring that each record is distinctly identifiable.

Data Independence

Natural keys tie your database schema directly to your business logic. Changes in business rules can dictate changes in the schema, causing maintenance overhead.

Surrogate keys decouple the database schema from business rules. Changes in business logic don't impact the surrogate key structure, making the schema more stable and easier to maintain.

Slowly Changing Dimensions (SCD)

Slowly Changing Dimensions (SCD) are techniques used to manage and track changes in dimension data over time. They ensure that historical data remains intact while updates are recorded.

Types

Type 1 (Overwrite): Simply overwrites the old data with new data. This method does not maintain any history of previous values. Serves well when historical changes are not critical for analysis.
Type 2 (Row Versioning): Tracks historical changes by creating new rows with versioning information. Each row has start and end dates to clarify the validity period, and a flag to indicate the current active record.
Type 3 (Column Versioning): Tracks changes by adding new columns to store previous values and change dates. This method maintains limited history within the same row.

Type 1: Overwriting Old Data

Let's say we have a product table where we want to update the category of a product.

Initial State:

CREATE TABLE product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(100)
);

-- Inserting initial product data
INSERT INTO product (product_id, product_name, category)
VALUES (1, 'Laptop', 'Electronics'), (2, 'Smartphone', 'Electronics');

Update Using Type 1 SCD:

-- Updating product category using Type 1 SCD
UPDATE product
SET category = 'Computers'
WHERE product_name = 'Laptop';

-- Query result
SELECT * FROM product;

product_id	product_name	category
1	Laptop	Computers
2	Smartphone	Electronics

The category of 'Laptop' is updated from 'Electronics' to 'Computers'.
The old category value is lost and not stored anywhere in the table.

Type 2: Adding New Rows

Initial State:

CREATE TABLE product_scd (
    product_id INT,
    product_name VARCHAR(100),
    category VARCHAR(100),
    start_date DATE,
    end_date DATE,
    current_flag CHAR(1),
    PRIMARY KEY (product_id, start_date)
);

-- Inserting initial product data
INSERT INTO product_scd (product_id, product_name, category, start_date, end_date, current_flag)
VALUES (1, 'Laptop', 'Electronics', '2021-01-01', '2021-06-30', 'N');

Adding a New Version Using Type 2 SCD:

-- Adding new version of the product
INSERT INTO product_scd (product_id, product_name, category, start_date, end_date, current_flag)
VALUES (1, 'Laptop', 'Computers', '2021-07-01', NULL, 'Y');

-- Query result
SELECT * FROM product_scd;

Result:

product_id	product_name	category	start_date	end_date	current_flag
1	Laptop	Electronics	2021-01-01	2021-06-30	N
1	Laptop	Computers	2021-07-01	NULL	Y

The old category 'Electronics' is retained in a separate row with an end date.
A new row is added for the category 'Computers' with the start date of '2021-07-01'.
The current_flag indicates the active record.

Type 3: Adding New Columns

Initial State:

CREATE TABLE product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(100)
);

-- Inserting initial product data
INSERT INTO product (product_id, product_name, category)
VALUES (1, 'Laptop', 'Electronics'), (2, 'Smartphone', 'Electronics');

Adding Columns for Type 3 SCD

ALTER TABLE product
ADD COLUMN old_category VARCHAR(100),
ADD COLUMN change_date DATE;

Update Using Type 3 SCD

-- Updating product category using Type 3 SCD
UPDATE product
SET old_category = category,
    category = 'Computers',
    change_date = '2024-05-27'
WHERE product_name = 'Laptop';

-- Query result
SELECT * FROM product;

product_id	product_name	category	old_category	change_date
1	Laptop	Computers	Electronics	2021-07-01
2	Smartphone	Electronics	NULL	NULL

The old_category column stores the previous category value 'Electronics'.
The change_date column records the date of the change.
The current category is updated to 'Computers'.

When to Use Each Type of SCD

Type 1 SCD

Use when historical data changes are not important.
Good for attributes where only the current value is necessary (e.g., correcting spelling errors, fixing inaccurate data).

Type 2 SCD

Use when tracking historical changes is critical.
Ideal for dimensions where maintaining a complete history is important (e.g., tracking employee roles over time, changes in customer addresses).

Type 3 SCD

Use when limited history is enough.
Useful for attributes where only a few previous values need to be tracked (e.g., previous year’s sales target, last two job titles).

Conformed Dimensions

Conformed dimensions are shared dimensions across multiple fact tables or data marts, ensuring consistency in reporting and analysis. These dimensions are designed to be reusable and standardized across the entire data warehouse.

Importance

Consistency: Conformed dimensions provide a uniform way to represent key business entities across different parts of the data warehouse, ensuring that reports and analyses are consistent and comparable.
Reusability: These dimensions can be reused across multiple fact tables, reducing redundancy and ensuring that the same logic and data structures are applied universally.
Simplified Maintenance: Maintaining and updating dimensions becomes easier when they are conformed. Changes made to a conformed dimension are automatically reflected across all the fact tables that use it.

Considering a scenario where a retail company needs to analyze both sales and inventory data. To maintain consistency, the company uses a shared date dimension across both the sales and inventory fact tables.

The date_dimension table is a shared dimension that contains information about dates. This table can be used across several fact tables to ensure consistent reporting on date-related data.

CREATE TABLE date_dimension (
    date_id INT PRIMARY KEY,
    date DATE,
    month VARCHAR(20),
    quarter VARCHAR(20),
    year INT
);

-- Inserting data into the Date Dimension table
INSERT INTO date_dimension (date_id, date, month, quarter, year)
VALUES (1, '2021-01-01', 'January', 'Q1', 2021);

-- Querying the Date Dimension table
SELECT * FROM date_dimension;

date_id	date	month	quarter	year
1	2021-01-01	January	Q1	2021

The date_dimension table provides a standardized way to represent dates, including the month, quarter, and year.

To demonstrate the use of conformed dimensions, let's create two fact tables: sales_fact and inventory_fact. Both tables will reference the date_dimension table using the date_id column.

-- Sales Fact Table
CREATE TABLE sales_fact (
    sale_id INT PRIMARY KEY,
    product_id INT,
    date_id INT,
    amount DECIMAL(10, 2),
    FOREIGN KEY (date_id) REFERENCES date_dimension(date_id)
);

-- Inventory Fact Table
CREATE TABLE inventory_fact (
    inventory_id INT PRIMARY KEY,
    product_id INT,
    date_id INT,
    stock_level INT,
    FOREIGN KEY (date_id) REFERENCES date_dimension(date_id)
);

-- Inserting data into the Sales Fact table
INSERT INTO sales_fact (sale_id, product_id, date_id, amount)
VALUES (1, 101, 1, 500.00);

-- Inserting data into the Inventory Fact table
INSERT INTO inventory_fact (inventory_id, product_id, date_id, stock_level)
VALUES (1, 101, 1, 100);

-- Querying the Sales Fact table
SELECT * FROM sales_fact;

-- Querying the Inventory Fact table
SELECT * FROM inventory_fact;

Sales Fact Table Result:

sale_id	product_id	date_id	amount
1	101	1	500.00

Inventory Fact Table Result:

inventory_id	product_id	date_id	stock_level
1	101	1	100

Date Dimension Table:
- Contains the date_id, date, month, quarter, and year.
- Standardized date information can be referenced by multiple fact tables.
Sales Fact Table:
- Contains sales data including sale_id, product_id, date_id, and amount.
- The date_id column references the date_id in the date_dimension table to ensure consistency.
Inventory Fact Table:
- Contains inventory data including inventory_id, product_id, date_id, and stock_level.
- The date_id column also references the date_id in the date_dimension table.

By using a conformed date dimension, both the sales_fact and inventory_fact tables can share and consistently reference date information. This approach ensures that any analysis involving dates is based on the same standardized data, facilitating accurate and consistent reporting.

Application

Consistency Across Reports: When generating reports that combine sales and inventory data, using conformed dimensions ensures that the date information aligns perfectly. For example, sales and inventory levels can be accurately compared on a monthly or quarterly basis without divergence due to different date definitions.

Simplified Data Management: Maintaining a single date dimension table reduces redundancy and simplifies data management. Any update to the date dimension (e.g., adding fiscal year information) is automatically reflected in all related fact tables.

Easy to Expand: As the data warehouse grows, new fact tables can be easily added while still referencing existing conformed dimensions. This extensibility allows for scalable data warehousing solutions.

Data Vault Modeling is a database modeling technique specifically designed for data warehouses. It focuses on providing a flexible, scalable, and auditable structure that can handle large volumes of data from multiple sources. Data Vault combines the best aspects of third normal form (3NF) and star schema modeling to address the needs of complex data environments.

Key Concepts

Hubs

Hubs represent core business entities (e.g., customers, products) and contain a unique list of business keys.

Attributes: Hub tables typically have surrogate keys, business keys, and metadata columns such as load dates and record sources.
Purpose: They serve as the central point of reference for all other related data.

CREATE TABLE hub_customer (
    customer_hub_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_business_key VARCHAR(100) NOT NULL,
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL
);

-- Inserting data into Hub table
INSERT INTO hub_customer (customer_business_key, record_source)
VALUES ('CUST123', 'source_system_1');

Links

Links represent relationships between hubs and define how entities are related.

Attributes: Link tables contain surrogate keys, foreign keys to the related hubs, and metadata columns such as load dates and record sources.
Purpose: They capture associations and many-to-many relationships between hubs.

CREATE TABLE link_customer_order (
    customer_order_link_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_hub_id INT NOT NULL,
    order_hub_id INT NOT NULL,
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL,
    FOREIGN KEY (customer_hub_id) REFERENCES hub_customer(customer_hub_id),
    FOREIGN KEY (order_hub_id) REFERENCES hub_order(order_hub_id)
);

-- Inserting data into Link table
INSERT INTO link_customer_order (customer_hub_id, order_hub_id, record_source)
VALUES (1, 1, 'source_system_1');

Satellites

Satellites store the descriptive attributes of hubs and links, providing context and details.

Attributes: Satellite tables contain surrogate keys, foreign keys to the hubs or links they describe, and metadata columns such as load dates and record sources.
Purpose: They allow for the capture of historical data and changes over time, supporting auditability and traceability.

CREATE TABLE sat_customer_details (
    customer_details_sat_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_hub_id INT NOT NULL,
    customer_name VARCHAR(100),
    customer_address VARCHAR(255),
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL,
    FOREIGN KEY (customer_hub_id) REFERENCES hub_customer(customer_hub_id)
);

-- Inserting data into Satellite table
INSERT INTO sat_customer_details (customer_hub_id, customer_name, customer_address, record_source)
VALUES (1, 'John Doe', '123 Main St', 'source_system_1');

Advantages

Scalability

Data Vault's modular structure makes it highly scalable. Adding new sources or entities involves creating new hubs, links, and satellites without disrupting existing structures.

Flexibility

The separation of business keys (hubs) and descriptive attributes (satellites) allows for flexible schema changes. New attributes can be added to satellites without affecting the core structure.

Auditability

Data Vault supports data lineage and historical tracking. Every record is timestamped and includes a record source, making it easy to trace the origin and changes of data over time.

Consistency

By separating relationships (links) from entities (hubs) and their attributes (satellites), Data Vault ensures consistent representation of data, reducing redundancy and anomalies.

Use Cases

Complex Data Environments

Data Vault is ideal for environments with multiple data sources and complex relationships. For example, a financial institution integrating data from various systems (CRM, transaction systems, customer support) can use Data Vault to ensure consistent and traceable data integration.

Historical Data Tracking

Organizations needing to maintain a detailed history of changes in their data will benefit from Data Vault's approach. For instance, healthcare providers tracking patient information over time can use Data Vault to maintain accurate and auditable records.

Regulatory Compliance

Industries with stringent regulatory requirements (e.g., finance, healthcare) can leverage Data Vault's auditability to meet compliance standards. The ability to trace data back to its source and understand its changes over time is crucial for audits and reporting.

Data Mart vs. Data Warehouse

Data Mart

A Data Mart is a subset of a data warehouse, focused on a specific business line or team. It is designed for quick access and specific analysis, adjusted to the needs of a particular department.

Data Warehouse

A Data Warehouse is a central repository of integrated data from multiple sources. It supports detailed analysis and reporting, serving the entire organization.

Differences

Feature	Data Mart	Data Warehouse
Scope	Department-specific or line of business	Organization-wide
Data Integration	Limited to specific business areas	Integrates data from multiple business areas
Complexity	Less complex, smaller in size	More complex, larger in size
Implementation Time	Faster to implement	Longer implementation time
Data Granularity	More detailed, granular data	Summarized, aggregated data
Cost	Lower cost	Higher cost

Data Warehouse Schema Example

Fact Table: Sales

sale_id	product_id	date_id	store_id	amount	quantity
1	101	1	301	500	5
2	102	1	302	300	3

Dimension Tables:

Product

product_id	product_name	category
101	Laptop	Electronics
102	Smartphone	Electronics

Date

date_id	date	month	quarter	year
1	2021-01-01	January	Q1	2021

Store

store_id	store_name	location
301	Store A	New York
302	Store B	Los Angeles

Data Mart Schema Example

Fact Table: Marketing Campaigns

campaign_id	product_id	date_id	impressions	clicks
1	101	1	10000	500
2	102	1	15000	700

Dimension Tables:

Product

product_id	product_name	category
101	Laptop	Electronics
102	Smartphone	Electronics

Date

date_id	date	month	quarter	year
1	2021-01-01	January	Q1	2021

Conclusion

Understanding and implementing these schema design principles is key for building an efficient and scalable data warehouse.

By choosing the appropriate schema model (Star or Snowflake), using surrogate keys, managing slowly changing dimensions, maintaining conformed dimensions, and differentiating between data marts and data warehouses, you can ensure that your data infrastructure supports strong, accurate, and performant data analysis.

These practices will help you make data-driven decisions with confidence, improve data integrity, and optimize query performance, leading to more improved business intelligence and analytics.

Data Warehouse Modeling

A Detailed Exploration

Table of contents

Introduction

Star Schema

Structure

Fact Tables

Dimension Tables

Advantages

Disadvantages

Snowflake Schema

Structure

Advantages

Disadvantages

When to Use Star Schema vs. Snowflake Schema

Other Relevant Topics in Schema Design

Surrogate Keys

Benefits

Why Use Surrogate Keys Instead of Natural Keys?

Consistency and Simplicity

Performance

Uniqueness

Data Independence

Slowly Changing Dimensions (SCD)

Types

When to Use Each Type of SCD

Conformed Dimensions

Importance

Application

Key Concepts

Advantages

Use Cases

Data Mart vs. Data Warehouse

Data Mart

Data Warehouse

Differences

Data Warehouse Schema Example

Data Mart Schema Example

Conclusion