Introduction
In data warehousing, schema design is critical to ensuring efficient data retrieval and analysis.
Two common schema models used in designing data warehouses are the Star Schema and the Snowflake Schema.
Let's dive deep into these two schemas, compare them, and explore related topics that build on them.
Star Schema
A Star Schema is a simple database schema used in data warehousing. It consists of one or more fact tables referencing any number of dimension tables. The fact table contains the primary data to be analyzed, and the dimension tables provide descriptive attributes related to the facts.
Structure
Fact Tables
Purpose: Contain measurable, quantitative data. Typically very large and store historical data.
Examples: Sales amounts, transaction quantities, profit margins.
Dimension Tables
Purpose: Contain descriptive attributes related to fact data. Usually smaller and support filtering, grouping, and labeling.
Examples: Product details, time periods, store locations.
Consider a retail sales data warehouse. The schema might look like this:
Fact Table: Sales
sale_id | product_id | time_id | store_id | amount | quantity |
1 | 101 | 202101 | 301 | 500 | 5 |
2 | 102 | 202101 | 302 | 300 | 3 |
Dimension Table: Product
product_id | product_name | category |
101 | Laptop | Electronics |
102 | Smartphone | Electronics |
Dimension Table: Time
time_id | date | month | quarter | year |
202101 | 2021-01-15 | Jan | Q1 | 2021 |
Dimension Table: Store
store_id | store_name | location |
301 | Store A | New York |
302 | Store B | Los Angeles |
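A typical report against this schema needs only one join per dimension. The sketch below uses the example tables above and totals sales by product category and month (note that time may need quoting as a table name in some databases):
-- Total sales by product category and month: one direct join per dimension
SELECT p.category, t.month, SUM(s.amount) AS total_sales
FROM sales s
JOIN product p ON s.product_id = p.product_id
JOIN time t ON s.time_id = t.time_id
GROUP BY p.category, t.month;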
Advantages
Simplicity: Easy to understand and design.
Query Performance: Optimized for read-heavy operations due to fewer joins.
Denormalization: Reduces the number of joins needed for queries, improving performance.
Disadvantages
Redundancy: Denormalization leads to data redundancy, increasing storage requirements.
Maintenance: Updates and maintenance can be more challenging due to redundancy. Any change in the data might require updates in multiple places.
Snowflake Schema
A Snowflake Schema is a more complex database schema used in data warehousing. It is a normalized form of the star schema where dimension tables are further split into sub-dimension tables. This design reduces redundancy but increases the complexity of the database.
Structure
Fact Table: Central table containing quantitative data.
Normalized Dimension Tables: Dimension tables are normalized, splitting data into additional tables.
Using the same retail sales data warehouse, a snowflake schema might look like this:
Fact Table: Sales
sale_id | product_id | time_id | store_id | amount | quantity |
1 | 101 | 202101 | 301 | 500 | 5 |
2 | 102 | 202101 | 302 | 300 | 3 |
Dimension Table: Product
product_id | product_name | category_id |
101 | Laptop | 1 |
102 | Smartphone | 1 |
Sub-Dimension Table: Category
category_id | category_name |
1 | Electronics |
Dimension Table: Time
time_id | date | month_id |
202101 | 2021-01-15 | 1 |
Sub-Dimension Table: Month
month_id | month_name | quarter | year |
1 | Jan | Q1 | 2021 |
Dimension Table: Store
store_id | store_name | location_id |
301 | Store A | 1 |
302 | Store B | 2 |
Sub-Dimension Table: Location
location_id | location_name |
1 | New York |
2 | Los Angeles |
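The same report against the snowflake version needs an extra join for each normalized level (product to category, time to month). A sketch against the example tables above (again, time and month may need quoting as identifiers in some databases):
-- Total sales by category and month: two additional joins compared to the star schema
SELECT c.category_name, m.month_name, SUM(s.amount) AS total_sales
FROM sales s
JOIN product p ON s.product_id = p.product_id
JOIN category c ON p.category_id = c.category_id
JOIN time t ON s.time_id = t.time_id
JOIN month m ON t.month_id = m.month_id
GROUP BY c.category_name, m.month_name;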
Advantages
Normalization: Reduces data redundancy and storage requirements by normalizing the dimension tables.
Data Integrity: Easier to maintain data integrity with normalized tables, reducing the risk of anomalies.
Scalability: More scalable for handling complex queries and large datasets due to the normalized structure.
Disadvantages
Complexity: More complex design and querying due to additional joins. Understanding and managing the schema requires more effort.
Performance: Potentially slower query performance due to the need for more joins to retrieve data.
Query Complexity: Requires more complex SQL queries, making it harder to write and maintain.
When to Use Star Schema vs. Snowflake Schema
Use Star Schema When
Query Performance is Critical: If the primary use case involves read-heavy operations and fast query performance, the star schema's denormalization can be advantageous.
Simplicity is Preferred: If ease of understanding and simplicity of design are important, the star schema is the more straightforward choice.
Reporting Needs: When the primary goal is to generate simple and quick reports, the star schema is often preferred due to its simplicity.
Use Snowflake Schema When
Data Integrity is Crucial: If maintaining data integrity and reducing redundancy are top priorities, the snowflake schema’s normalization helps achieve these goals.
Storage Efficiency: When minimizing storage requirements is necessary, the normalized structure of the snowflake schema is beneficial.
Complex Queries: When the database needs to support complex queries that require multiple joins, the snowflake schema's structure can be advantageous despite the increased complexity.
Other Relevant Topics in Schema Design
In addition to understanding the Star and Snowflake schemas, it's essential to be familiar with other key concepts that improve the efficiency, integrity, and scalability of data warehouses.
Let's dive into Surrogate Keys, Slowly Changing Dimensions (SCD), Conformed Dimensions, Data Vault Modeling, and the differences between Data Marts and Data Warehouses.
Surrogate Keys
Surrogate keys are artificial or synthetic keys used as primary keys in database tables. Unlike natural keys, which are derived from the actual data (e.g., a social security number), surrogate keys are typically numeric and auto-incremented, having no business meaning.
Benefits
Simplify Joins: Surrogate keys provide a simple, consistent key for joining tables.
Improve Performance: Numeric surrogate keys are generally faster for database operations than complex natural keys.
Ensure Uniqueness: Surrogate keys guarantee unique identifiers for each record, avoiding issues with natural keys that might not be unique.
Decoupling Data: They allow changes in business logic or source data without affecting the database schema.
Let's create a Product Dimension Table with a Surrogate Key:
CREATE TABLE product (
product_id INT AUTO_INCREMENT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(100)
);
-- Inserting data into the Product table
INSERT INTO product (product_name, category)
VALUES ('Laptop', 'Electronics'), ('Smartphone', 'Electronics');
-- Querying the Product table
SELECT * FROM product;
product_id | product_name | category |
1 | Laptop | Electronics |
2 | Smartphone | Electronics |
In this case, AUTO_INCREMENT ensures that each new record gets a unique, automatically incremented integer value.
Why Use Surrogate Keys Instead of Natural Keys?
Consistency and Simplicity
Imagine you use product_name as a natural key. If the product name changes (e.g., 'Laptop' to 'Notebook'), you must update this key in all related tables, which can be cumbersome and error-prone.
With a surrogate key like product_id, you can change product_name without affecting the key. The key remains consistent, simplifying data management.
Performance
Natural keys can be long strings or combinations of multiple columns, making them less efficient for indexing and joining tables. For example, product_name might require more storage and processing power.
Surrogate keys are typically single-column, numeric, and small in size, making them faster for indexing and joining.
Uniqueness
Natural keys, which come from business data, may not always be unique. For instance, if you use name and address as a key, two users could share the same name and address, causing issues.
Surrogate keys are always unique, ensuring that each record is distinctly identifiable.
Data Independence
Natural keys tie your database schema directly to your business logic. Changes in business rules can dictate changes in the schema, causing maintenance overhead.
Surrogate keys decouple the database schema from business rules. Changes in business logic don't impact the surrogate key structure, making the schema more stable and easier to maintain.
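As a concrete contrast, here is a sketch of a hypothetical users table keyed on natural attributes versus one keyed on a surrogate; the table and column names are illustrative, not taken from the examples above:
-- Natural composite key: long, string-based, and tied to business data
CREATE TABLE users_natural (
    name VARCHAR(100),
    address VARCHAR(255),
    email VARCHAR(255),
    PRIMARY KEY (name, address)
);
-- Surrogate key: small, numeric, and independent of business data
CREATE TABLE users_surrogate (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100),
    address VARCHAR(255),
    email VARCHAR(255)
);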
Slowly Changing Dimensions (SCD)
Slowly Changing Dimensions (SCD) are techniques used to manage and track changes in dimension data over time. They ensure that historical data remains intact while updates are recorded.
Types
Type 1 (Overwrite): Simply overwrites the old data with new data. This method does not maintain any history of previous values. It works well when historical changes are not critical for analysis.
Type 2 (Row Versioning): Tracks historical changes by creating new rows with versioning information. Each row has start and end dates to clarify the validity period, and a flag to indicate the current active record.
Type 3 (Column Versioning): Tracks changes by adding new columns to store previous values and change dates. This method maintains limited history within the same row.
Type 1: Overwriting Old Data
Let's say we have a product table where we want to update the category of a product.
Initial State:
CREATE TABLE product (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(100)
);
-- Inserting initial product data
INSERT INTO product (product_id, product_name, category)
VALUES (1, 'Laptop', 'Electronics'), (2, 'Smartphone', 'Electronics');
Update Using Type 1 SCD:
-- Updating product category using Type 1 SCD
UPDATE product
SET category = 'Computers'
WHERE product_name = 'Laptop';
-- Query result
SELECT * FROM product;
product_id | product_name | category |
1 | Laptop | Computers |
2 | Smartphone | Electronics |
The category of 'Laptop' is updated from 'Electronics' to 'Computers'.
The old category value is lost and not stored anywhere in the table.
Type 2: Adding New Rows
Initial State:
CREATE TABLE product_scd (
product_id INT,
product_name VARCHAR(100),
category VARCHAR(100),
start_date DATE,
end_date DATE,
current_flag CHAR(1),
PRIMARY KEY (product_id, start_date)
);
-- Inserting initial product data
INSERT INTO product_scd (product_id, product_name, category, start_date, end_date, current_flag)
VALUES (1, 'Laptop', 'Electronics', '2021-01-01', NULL, 'Y');
Adding a New Version Using Type 2 SCD:
-- Expire the current record before inserting the new version
UPDATE product_scd
SET end_date = '2021-06-30', current_flag = 'N'
WHERE product_id = 1 AND current_flag = 'Y';
-- Adding the new version of the product
INSERT INTO product_scd (product_id, product_name, category, start_date, end_date, current_flag)
VALUES (1, 'Laptop', 'Computers', '2021-07-01', NULL, 'Y');
-- Query result
SELECT * FROM product_scd;
Result:
product_id | product_name | category | start_date | end_date | current_flag |
1 | Laptop | Electronics | 2021-01-01 | 2021-06-30 | N |
1 | Laptop | Computers | 2021-07-01 | NULL | Y |
The old category 'Electronics' is retained in a separate row with an end date.
A new row is added for the category 'Computers' with the start date of '2021-07-01'.
The current_flag column indicates the active record.
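Two queries that take advantage of this structure (sketched against the product_scd table above): one fetches the current version of each product, the other answers a point-in-time question:
-- Current version of every product
SELECT * FROM product_scd WHERE current_flag = 'Y';
-- Point-in-time lookup: which category was active on 2021-03-15?
SELECT * FROM product_scd
WHERE product_id = 1
  AND start_date <= '2021-03-15'
  AND (end_date IS NULL OR end_date >= '2021-03-15');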
Type 3: Adding New Columns
Initial State:
CREATE TABLE product (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(100)
);
-- Inserting initial product data
INSERT INTO product (product_id, product_name, category)
VALUES (1, 'Laptop', 'Electronics'), (2, 'Smartphone', 'Electronics');
Adding Columns for Type 3 SCD:
ALTER TABLE product
ADD COLUMN old_category VARCHAR(100),
ADD COLUMN change_date DATE;
Update Using Type 3 SCD:
-- Updating product category using Type 3 SCD
UPDATE product
SET old_category = category,
category = 'Computers',
change_date = '2021-07-01'
WHERE product_name = 'Laptop';
-- Query result
SELECT * FROM product;
product_id | product_name | category | old_category | change_date |
1 | Laptop | Computers | Electronics | 2021-07-01 |
2 | Smartphone | Electronics | NULL | NULL |
The old_category column stores the previous category value 'Electronics'.
The change_date column records the date of the change.
The current category is updated to 'Computers'.
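Reading the limited history back is a simple projection over the extra columns; for example, to list products whose category has changed (using the product table above):
-- Products with a recorded category change, showing old and new values
SELECT product_name, old_category, category, change_date
FROM product
WHERE old_category IS NOT NULL;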
When to Use Each Type of SCD
Type 1 SCD
Use when historical data changes are not important.
Good for attributes where only the current value is necessary (e.g., correcting spelling errors, fixing inaccurate data).
Type 2 SCD
Use when tracking historical changes is critical.
Ideal for dimensions where maintaining a complete history is important (e.g., tracking employee roles over time, changes in customer addresses).
Type 3 SCD
Use when limited history is enough.
Useful for attributes where only a few previous values need to be tracked (e.g., previous year’s sales target, last two job titles).
Conformed Dimensions
Conformed dimensions are shared dimensions across multiple fact tables or data marts, ensuring consistency in reporting and analysis. These dimensions are designed to be reusable and standardized across the entire data warehouse.
Importance
Consistency: Conformed dimensions provide a uniform way to represent key business entities across different parts of the data warehouse, ensuring that reports and analyses are consistent and comparable.
Reusability: These dimensions can be reused across multiple fact tables, reducing redundancy and ensuring that the same logic and data structures are applied universally.
Simplified Maintenance: Maintaining and updating dimensions becomes easier when they are conformed. Changes made to a conformed dimension are automatically reflected across all the fact tables that use it.
Consider a scenario where a retail company needs to analyze both sales and inventory data. To maintain consistency, the company uses a shared date dimension across both the sales and inventory fact tables.
The date_dimension table is a shared dimension that contains information about dates. This table can be used across several fact tables to ensure consistent reporting on date-related data.
CREATE TABLE date_dimension (
date_id INT PRIMARY KEY,
date DATE,
month VARCHAR(20),
quarter VARCHAR(20),
year INT
);
-- Inserting data into the Date Dimension table
INSERT INTO date_dimension (date_id, date, month, quarter, year)
VALUES (1, '2021-01-01', 'January', 'Q1', 2021);
-- Querying the Date Dimension table
SELECT * FROM date_dimension;
date_id | date | month | quarter | year |
1 | 2021-01-01 | January | Q1 | 2021 |
The date_dimension table provides a standardized way to represent dates, including the month, quarter, and year.
To demonstrate the use of conformed dimensions, let's create two fact tables: sales_fact and inventory_fact. Both tables will reference the date_dimension table using the date_id column.
-- Sales Fact Table
CREATE TABLE sales_fact (
sale_id INT PRIMARY KEY,
product_id INT,
date_id INT,
amount DECIMAL(10, 2),
FOREIGN KEY (date_id) REFERENCES date_dimension(date_id)
);
-- Inventory Fact Table
CREATE TABLE inventory_fact (
inventory_id INT PRIMARY KEY,
product_id INT,
date_id INT,
stock_level INT,
FOREIGN KEY (date_id) REFERENCES date_dimension(date_id)
);
-- Inserting data into the Sales Fact table
INSERT INTO sales_fact (sale_id, product_id, date_id, amount)
VALUES (1, 101, 1, 500.00);
-- Inserting data into the Inventory Fact table
INSERT INTO inventory_fact (inventory_id, product_id, date_id, stock_level)
VALUES (1, 101, 1, 100);
-- Querying the Sales Fact table
SELECT * FROM sales_fact;
-- Querying the Inventory Fact table
SELECT * FROM inventory_fact;
Sales Fact Table Result:
sale_id | product_id | date_id | amount |
1 | 101 | 1 | 500.00 |
Inventory Fact Table Result:
inventory_id | product_id | date_id | stock_level |
1 | 101 | 1 | 100 |
Date Dimension Table:
Contains the date_id, date, month, quarter, and year. Standardized date information can be referenced by multiple fact tables.
Sales Fact Table:
Contains sales data including sale_id, product_id, date_id, and amount. The date_id column references the date_id in the date_dimension table to ensure consistency.
Inventory Fact Table:
Contains inventory data including inventory_id, product_id, date_id, and stock_level. The date_id column also references the date_id in the date_dimension table.
By using a conformed date dimension, both the sales_fact and inventory_fact tables can share and consistently reference date information. This approach ensures that any analysis involving dates is based on the same standardized data, facilitating accurate and consistent reporting.
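A sketch of such a combined report against the tables above: each fact table is pre-aggregated by date_id before joining to the conformed date dimension, which avoids double counting when both facts have multiple rows per date:
-- Monthly sales and stock levels from two fact tables sharing one date dimension
SELECT d.month,
       SUM(COALESCE(s.total_sales, 0)) AS total_sales,
       SUM(COALESCE(i.total_stock, 0)) AS total_stock
FROM date_dimension d
LEFT JOIN (
    SELECT date_id, SUM(amount) AS total_sales
    FROM sales_fact
    GROUP BY date_id
) s ON s.date_id = d.date_id
LEFT JOIN (
    SELECT date_id, SUM(stock_level) AS total_stock
    FROM inventory_fact
    GROUP BY date_id
) i ON i.date_id = d.date_id
GROUP BY d.month;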
Application
Consistency Across Reports: When generating reports that combine sales and inventory data, using conformed dimensions ensures that the date information aligns perfectly. For example, sales and inventory levels can be accurately compared on a monthly or quarterly basis without divergence due to different date definitions.
Simplified Data Management: Maintaining a single date dimension table reduces redundancy and simplifies data management. Any update to the date dimension (e.g., adding fiscal year information) is automatically reflected in all related fact tables.
Easy to Expand: As the data warehouse grows, new fact tables can be easily added while still referencing existing conformed dimensions. This extensibility allows for scalable data warehousing solutions.
Data Vault Modeling
Data Vault Modeling is a database modeling technique specifically designed for data warehouses. It focuses on providing a flexible, scalable, and auditable structure that can handle large volumes of data from multiple sources. Data Vault combines the best aspects of third normal form (3NF) and star schema modeling to address the needs of complex data environments.
Key Concepts
Hubs
Hubs represent core business entities (e.g., customers, products) and contain a unique list of business keys.
Attributes: Hub tables typically have surrogate keys, business keys, and metadata columns such as load dates and record sources.
Purpose: They serve as the central point of reference for all other related data.
CREATE TABLE hub_customer (
customer_hub_id INT AUTO_INCREMENT PRIMARY KEY,
customer_business_key VARCHAR(100) NOT NULL,
load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR(50) NOT NULL
);
-- Inserting data into Hub table
INSERT INTO hub_customer (customer_business_key, record_source)
VALUES ('CUST123', 'source_system_1');
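The link example below relates customers to orders, so it assumes a second hub for the order entity exists. A minimal sketch of that hub, mirroring hub_customer (the names and the 'ORD456' key are illustrative):
-- Hypothetical order hub, mirroring the structure of hub_customer
CREATE TABLE hub_order (
    order_hub_id INT AUTO_INCREMENT PRIMARY KEY,
    order_business_key VARCHAR(100) NOT NULL,
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL
);
-- Inserting data into the Order Hub table
INSERT INTO hub_order (order_business_key, record_source)
VALUES ('ORD456', 'source_system_1');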
Links
Links represent relationships between hubs and define how entities are related.
Attributes: Link tables contain surrogate keys, foreign keys to the related hubs, and metadata columns such as load dates and record sources.
Purpose: They capture associations and many-to-many relationships between hubs.
CREATE TABLE link_customer_order (
customer_order_link_id INT AUTO_INCREMENT PRIMARY KEY,
customer_hub_id INT NOT NULL,
order_hub_id INT NOT NULL,
load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR(50) NOT NULL,
FOREIGN KEY (customer_hub_id) REFERENCES hub_customer(customer_hub_id),
FOREIGN KEY (order_hub_id) REFERENCES hub_order(order_hub_id)
);
-- Inserting data into Link table
INSERT INTO link_customer_order (customer_hub_id, order_hub_id, record_source)
VALUES (1, 1, 'source_system_1');
Satellites
Satellites store the descriptive attributes of hubs and links, providing context and details.
Attributes: Satellite tables contain surrogate keys, foreign keys to the hubs or links they describe, and metadata columns such as load dates and record sources.
Purpose: They allow for the capture of historical data and changes over time, supporting auditability and traceability.
CREATE TABLE sat_customer_details (
customer_details_sat_id INT AUTO_INCREMENT PRIMARY KEY,
customer_hub_id INT NOT NULL,
customer_name VARCHAR(100),
customer_address VARCHAR(255),
load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
record_source VARCHAR(50) NOT NULL,
FOREIGN KEY (customer_hub_id) REFERENCES hub_customer(customer_hub_id)
);
-- Inserting data into Satellite table
INSERT INTO sat_customer_details (customer_hub_id, customer_name, customer_address, record_source)
VALUES (1, 'John Doe', '123 Main St', 'source_system_1');
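To show how the pieces come back together, here is a query sketch that reconstructs a simple customer-order view from the hub, link, and satellite tables above; real queries would typically also filter each satellite to the most recent load_date per hub key:
-- Reassembling a business view from hub, link, and satellite tables
SELECT h.customer_business_key,
       s.customer_name,
       s.customer_address,
       l.order_hub_id
FROM hub_customer h
JOIN sat_customer_details s ON s.customer_hub_id = h.customer_hub_id
JOIN link_customer_order l ON l.customer_hub_id = h.customer_hub_id;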
Advantages
Scalability
Data Vault's modular structure makes it highly scalable. Adding new sources or entities involves creating new hubs, links, and satellites without disrupting existing structures.
Flexibility
The separation of business keys (hubs) and descriptive attributes (satellites) allows for flexible schema changes. New attributes can be added to satellites without affecting the core structure.
Auditability
Data Vault supports data lineage and historical tracking. Every record is timestamped and includes a record source, making it easy to trace the origin and changes of data over time.
Consistency
By separating relationships (links) from entities (hubs) and their attributes (satellites), Data Vault ensures consistent representation of data, reducing redundancy and anomalies.
Use Cases
Complex Data Environments
Data Vault is ideal for environments with multiple data sources and complex relationships. For example, a financial institution integrating data from various systems (CRM, transaction systems, customer support) can use Data Vault to ensure consistent and traceable data integration.
Historical Data Tracking
Organizations needing to maintain a detailed history of changes in their data will benefit from Data Vault's approach. For instance, healthcare providers tracking patient information over time can use Data Vault to maintain accurate and auditable records.
Regulatory Compliance
Industries with stringent regulatory requirements (e.g., finance, healthcare) can leverage Data Vault's auditability to meet compliance standards. The ability to trace data back to its source and understand its changes over time is crucial for audits and reporting.
Data Mart vs. Data Warehouse
Data Mart
A Data Mart is a subset of a data warehouse, focused on a specific business line or team. It is designed for quick access and specific analysis, tailored to the needs of a particular department.
Data Warehouse
A Data Warehouse is a central repository of integrated data from multiple sources. It supports detailed analysis and reporting, serving the entire organization.
Differences
Feature | Data Mart | Data Warehouse |
Scope | Department-specific or line of business | Organization-wide |
Data Integration | Limited to specific business areas | Integrates data from multiple business areas |
Complexity | Less complex, smaller in size | More complex, larger in size |
Implementation Time | Faster to implement | Longer implementation time |
Data Granularity | Often summarized, aggregated data | Detailed, granular data |
Cost | Lower cost | Higher cost |
Data Warehouse Schema Example
Fact Table: Sales
sale_id | product_id | date_id | store_id | amount | quantity |
1 | 101 | 1 | 301 | 500 | 5 |
2 | 102 | 1 | 302 | 300 | 3 |
Dimension Tables:
- Product
product_id | product_name | category |
101 | Laptop | Electronics |
102 | Smartphone | Electronics |
- Date
date_id | date | month | quarter | year |
1 | 2021-01-01 | January | Q1 | 2021 |
- Store
store_id | store_name | location |
301 | Store A | New York |
302 | Store B | Los Angeles |
Data Mart Schema Example
Fact Table: Marketing Campaigns
campaign_id | product_id | date_id | impressions | clicks |
1 | 101 | 1 | 10000 | 500 |
2 | 102 | 1 | 15000 | 700 |
Dimension Tables:
- Product
product_id | product_name | category |
101 | Laptop | Electronics |
102 | Smartphone | Electronics |
- Date
date_id | date | month | quarter | year |
1 | 2021-01-01 | January | Q1 | 2021 |
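In practice, a data mart like this is usually populated from the warehouse rather than loaded separately. A hedged sketch (CREATE TABLE ... AS SELECT syntax varies by database; the sales, product, and date_dimension names follow earlier examples, and marketing_monthly_sales is an illustrative name):
-- Illustrative only: building a department-level aggregate from warehouse tables
CREATE TABLE marketing_monthly_sales AS
SELECT d.year, d.month, p.category, SUM(s.amount) AS total_sales
FROM sales s
JOIN product p ON s.product_id = p.product_id
JOIN date_dimension d ON s.date_id = d.date_id
GROUP BY d.year, d.month, p.category;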
Conclusion
Understanding and implementing these schema design principles is key for building an efficient and scalable data warehouse.
By choosing the appropriate schema model (Star or Snowflake), using surrogate keys, managing slowly changing dimensions, maintaining conformed dimensions, and differentiating between data marts and data warehouses, you can ensure that your data infrastructure supports strong, accurate, and performant data analysis.
These practices will help you make data-driven decisions with confidence, improve data integrity, and optimize query performance, leading to better business intelligence and analytics.