Data Warehouse Modeling

Data Warehouse Modeling

A Detailed Exploration

Introduction

In data warehousing, schema design is critical to ensuring efficient data retrieval and analysis.

Two common schema models used in designing data warehouses are the Star Schema and the Snowflake Schema.

Let's dive deep into these two schemas, compare them, and explore other relevant topics that connect to it.

💡
You might wanna give a read to Principles of Database Design before diving into this one.

Star Schema

A Star Schema is a simple database schema used in data warehousing. It consists of one or more fact tables referencing any number of dimension tables. The fact table contains the primary data to be analyzed, and the dimension tables provide descriptive attributes related to the facts.

Structure

Fact Tables

  • Purpose: Contain measurable, quantitative data. Typically very large and store historical data.

  • Examples: Sales amounts, transaction quantities, profit margins.

Dimension Tables

  • Purpose: Contain descriptive attributes related to fact data. Usually smaller and support filtering, grouping, and labeling.

  • Examples: Product details, time periods, store locations.

Consider a retail sales data warehouse. The schema might look like this:

Fact Table: Sales

sale_idproduct_idtime_idstore_idamountquantity
11012021013015005
21022021013023003

Dimension Table: Product

product_idproduct_namecategory
101LaptopElectronics
102SmartphoneElectronics

Dimension Table: Time

time_iddatemonthquarteryear
2021012021-01-15JanQ12021

Dimension Table: Store

store_idstore_namelocation
301Store ANew York
302Store BLos Angeles

Advantages

  1. Simplicity: Easy to understand and design.

  2. Query Performance: Optimized for read-heavy operations due to fewer joins.

  3. Denormalization: Reduces the number of joins needed for queries, improving performance.

Disadvantages

  1. Redundancy: Denormalization leads to data redundancy, increasing storage requirements.

  2. Maintenance: Updates and maintenance can be more challenging due to redundancy. Any change in the data might require updates in multiple places.

Snowflake Schema

A Snowflake Schema is a more complex database schema used in data warehousing. It is a normalized form of the star schema where dimension tables are further split into sub-dimension tables. This design reduces redundancy but increases the complexity of the database.

Structure

  • Fact Table: Central table containing quantitative data.

  • Normalized Dimension Tables: Dimension tables are normalized, splitting data into additional tables.

Using the same retail sales data warehouse, a snowflake schema might look like this:

Fact Table: Sales

sale_idproduct_idtime_idstore_idamountquantity
11012021013015005
21022021013023003

Dimension Table: Product

product_idproduct_namecategory_id
101Laptop1
102Smartphone1

Sub-Dimension Table: Category

category_idcategory_name
1Electronics

Dimension Table: Time

time_iddatemonth_id
2021012021-01-151

Sub-Dimension Table: Month

month_idmonth_namequarteryear
1JanQ12021

Dimension Table: Store

store_idstore_namelocation_id
301Store A1
302Store B2

Sub-Dimension Table: Location

location_idlocation_name
1New York
2Los Angeles

Advantages

  1. Normalization: Reduces data redundancy and storage requirements by normalizing the dimension tables.

  2. Data Integrity: Easier to maintain data integrity with normalized tables, reducing the risk of anomalies.

  3. Scalability: More scalable for handling complex queries and large datasets due to the normalized structure.

Disadvantages

  1. Complexity: More complex design and querying due to additional joins. Understanding and managing the schema requires more effort.

  2. Performance: Potentially slower query performance due to the need for more joins to retrieve data.

  3. Query Complexity: Requires more complex SQL queries, making it harder to write and maintain.

When to Use Star Schema vs. Snowflake Schema

Use Star Schema When

  1. Query Performance is Critical: If the primary use case involves read-heavy operations and fast query performance, the star schema's denormalization can be advantageous.

  2. Simplicity is Preferred: If easily understandable and simplicity in design are important, the star schema is more straightforward.

  3. Reporting Needs: When the primary goal is to generate simple and quick reports, the star schema is often preferred due to its simplicity.

Use Snowflake Schema When

  1. Data Integrity is Crucial: If maintaining data integrity and reducing redundancy are top priorities, the snowflake schema’s normalization helps achieve these goals.

  2. Storage Efficiency: When minimizing storage requirements is necessary, the normalized structure of the snowflake schema is beneficial.

  3. Complex Queries: When the database needs to support complex queries that require multiple joins, the snowflake schema's structure can be advantageous despite the increased complexity.

Other Relevant Topics in Schema Design

In addition to understanding the Star and Snowflake schemas, it’s essential to understand other key concepts that improve the efficiency, integrity, and scalability of data warehouses.

Let's dive into Surrogate Keys, Slowly Changing Dimensions (SCD), Conformed Dimensions, Data Vault Modeling, and the differences between Data Marts and Data Warehouses.

Surrogate Keys

Surrogate keys are artificial or synthetic keys used as primary keys in database tables. Unlike natural keys, product from the actual data (e.g., a social security number), surrogate keys are typically numeric and auto-incremented, having no business meaning.

Benefits

  1. Simplify Joins: Surrogate keys provide a simple, consistent key for joining tables.

  2. Improve Performance: Numeric surrogate keys are generally faster for database operations than complex natural keys.

  3. Ensure Uniqueness: Surrogate keys guarantee unique identifiers for each record, avoiding issues with natural keys that might not be unique.

  4. Decoupling Data: They allow changes in business logic or source data without affecting database schema.

Let's create a Product Dimension Table with a Surrogate Key:

CREATE TABLE product (
    product_id INT AUTO_INCREMENT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(100)
);

-- Inserting data into the Product table
INSERT INTO product (product_name, category)
VALUES ('Laptop', 'Electronics'), ('Smartphone', 'Electronics');

-- Querying the Product table
SELECT * FROM product;
product_idproduct_namecategory
1LaptopElectronics
2SmartphoneElectronics

in this case, AUTO_INCREMENT ensures that each new record gets a unique, automatically incremented integer value.

Why Use Surrogate Keys Instead of Natural Keys?

Consistency and Simplicity

Imagine you use product_name as a natural key. If the product name changes (e.g., 'Laptop' to 'Notebook'), you must update this key in all related tables, which can be cumbersome and error-prone.

With a surrogate key like product_id, you can change product_name without affecting the key. The key remains consistent, simplifying data management.

Performance

Natural keys can be long strings or combinations of multiple columns, making them less efficient for indexing and joining tables. For example, product_name might require more storage and processing power.

Surrogate keys are typically single-column, numeric, and small in size, making them faster for indexing and joining.

Uniqueness

Natural keys come from business data may not always be unique. For instance, if you use name and address as a key, two users could share the same name and address, causing issues.

Surrogate keys are always unique, ensuring that each record is distinctly identifiable.

Data Independence

Natural keys tie your database schema directly to your business logic. Changes in business rules can dictate changes in the schema, causing maintenance overhead.

Surrogate keys decouple the database schema from business rules. Changes in business logic don't impact the surrogate key structure, making the schema more stable and easier to maintain.

Slowly Changing Dimensions (SCD)

Slowly Changing Dimensions (SCD) are techniques used to manage and track changes in dimension data over time. They ensure that historical data remains intact while updates are recorded.

Types

  1. Type 1 (Overwrite): Simply overwrites the old data with new data. This method does not maintain any history of previous values. Serves well when historical changes are not critical for analysis.

  2. Type 2 (Row Versioning): Tracks historical changes by creating new rows with versioning information. Each row has start and end dates to clarify the validity period, and a flag to indicate the current active record.

  3. Type 3 (Column Versioning): Tracks changes by adding new columns to store previous values and change dates. This method maintains limited history within the same row.

Type 1: Overwriting Old Data

Let's say we have a product table where we want to update the category of a product.

Initial State:

CREATE TABLE product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(100)
);

-- Inserting initial product data
INSERT INTO product (product_id, product_name, category)
VALUES (1, 'Laptop', 'Electronics'), (2, 'Smartphone', 'Electronics');

Update Using Type 1 SCD:

-- Updating product category using Type 1 SCD
UPDATE product
SET category = 'Computers'
WHERE product_name = 'Laptop';

-- Query result
SELECT * FROM product;
product_idproduct_namecategory
1LaptopComputers
2SmartphoneElectronics
  • The category of 'Laptop' is updated from 'Electronics' to 'Computers'.

  • The old category value is lost and not stored anywhere in the table.

Type 2: Adding New Rows

Initial State:

CREATE TABLE product_scd (
    product_id INT,
    product_name VARCHAR(100),
    category VARCHAR(100),
    start_date DATE,
    end_date DATE,
    current_flag CHAR(1),
    PRIMARY KEY (product_id, start_date)
);

-- Inserting initial product data
INSERT INTO product_scd (product_id, product_name, category, start_date, end_date, current_flag)
VALUES (1, 'Laptop', 'Electronics', '2021-01-01', '2021-06-30', 'N');

Adding a New Version Using Type 2 SCD:

-- Adding new version of the product
INSERT INTO product_scd (product_id, product_name, category, start_date, end_date, current_flag)
VALUES (1, 'Laptop', 'Computers', '2021-07-01', NULL, 'Y');

-- Query result
SELECT * FROM product_scd;

Result:

product_idproduct_namecategorystart_dateend_datecurrent_flag
1LaptopElectronics2021-01-012021-06-30N
1LaptopComputers2021-07-01NULLY
  • The old category 'Electronics' is retained in a separate row with an end date.

  • A new row is added for the category 'Computers' with the start date of '2021-07-01'.

  • The current_flag indicates the active record.

Type 3: Adding New Columns

Initial State:

CREATE TABLE product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(100)
);

-- Inserting initial product data
INSERT INTO product (product_id, product_name, category)
VALUES (1, 'Laptop', 'Electronics'), (2, 'Smartphone', 'Electronics');

Adding Columns for Type 3 SCD

ALTER TABLE product
ADD COLUMN old_category VARCHAR(100),
ADD COLUMN change_date DATE;

Update Using Type 3 SCD

-- Updating product category using Type 3 SCD
UPDATE product
SET old_category = category,
    category = 'Computers',
    change_date = '2024-05-27'
WHERE product_name = 'Laptop';

-- Query result
SELECT * FROM product;
product_idproduct_namecategoryold_categorychange_date
1LaptopComputersElectronics2021-07-01
2SmartphoneElectronicsNULLNULL
  • The old_category column stores the previous category value 'Electronics'.

  • The change_date column records the date of the change.

  • The current category is updated to 'Computers'.

When to Use Each Type of SCD

Type 1 SCD

  • Use when historical data changes are not important.

  • Good for attributes where only the current value is necessary (e.g., correcting spelling errors, fixing inaccurate data).

Type 2 SCD

  • Use when tracking historical changes is critical.

  • Ideal for dimensions where maintaining a complete history is important (e.g., tracking employee roles over time, changes in customer addresses).

Type 3 SCD

  • Use when limited history is enough.

  • Useful for attributes where only a few previous values need to be tracked (e.g., previous year’s sales target, last two job titles).

Conformed Dimensions

Conformed dimensions are shared dimensions across multiple fact tables or data marts, ensuring consistency in reporting and analysis. These dimensions are designed to be reusable and standardized across the entire data warehouse.

Importance

  1. Consistency: Conformed dimensions provide a uniform way to represent key business entities across different parts of the data warehouse, ensuring that reports and analyses are consistent and comparable.

  2. Reusability: These dimensions can be reused across multiple fact tables, reducing redundancy and ensuring that the same logic and data structures are applied universally.

  3. Simplified Maintenance: Maintaining and updating dimensions becomes easier when they are conformed. Changes made to a conformed dimension are automatically reflected across all the fact tables that use it.

Considering a scenario where a retail company needs to analyze both sales and inventory data. To maintain consistency, the company uses a shared date dimension across both the sales and inventory fact tables.

The date_dimension table is a shared dimension that contains information about dates. This table can be used across several fact tables to ensure consistent reporting on date-related data.

CREATE TABLE date_dimension (
    date_id INT PRIMARY KEY,
    date DATE,
    month VARCHAR(20),
    quarter VARCHAR(20),
    year INT
);

-- Inserting data into the Date Dimension table
INSERT INTO date_dimension (date_id, date, month, quarter, year)
VALUES (1, '2021-01-01', 'January', 'Q1', 2021);

-- Querying the Date Dimension table
SELECT * FROM date_dimension;
date_iddatemonthquarteryear
12021-01-01JanuaryQ12021

The date_dimension table provides a standardized way to represent dates, including the month, quarter, and year.

To demonstrate the use of conformed dimensions, let's create two fact tables: sales_fact and inventory_fact. Both tables will reference the date_dimension table using the date_id column.

-- Sales Fact Table
CREATE TABLE sales_fact (
    sale_id INT PRIMARY KEY,
    product_id INT,
    date_id INT,
    amount DECIMAL(10, 2),
    FOREIGN KEY (date_id) REFERENCES date_dimension(date_id)
);

-- Inventory Fact Table
CREATE TABLE inventory_fact (
    inventory_id INT PRIMARY KEY,
    product_id INT,
    date_id INT,
    stock_level INT,
    FOREIGN KEY (date_id) REFERENCES date_dimension(date_id)
);

-- Inserting data into the Sales Fact table
INSERT INTO sales_fact (sale_id, product_id, date_id, amount)
VALUES (1, 101, 1, 500.00);

-- Inserting data into the Inventory Fact table
INSERT INTO inventory_fact (inventory_id, product_id, date_id, stock_level)
VALUES (1, 101, 1, 100);

-- Querying the Sales Fact table
SELECT * FROM sales_fact;

-- Querying the Inventory Fact table
SELECT * FROM inventory_fact;

Sales Fact Table Result:

sale_idproduct_iddate_idamount
11011500.00

Inventory Fact Table Result:

inventory_idproduct_iddate_idstock_level
11011100
  1. Date Dimension Table:

    • Contains the date_id, date, month, quarter, and year.

    • Standardized date information can be referenced by multiple fact tables.

  2. Sales Fact Table:

    • Contains sales data including sale_id, product_id, date_id, and amount.

    • The date_id column references the date_id in the date_dimension table to ensure consistency.

  3. Inventory Fact Table:

    • Contains inventory data including inventory_id, product_id, date_id, and stock_level.

    • The date_id column also references the date_id in the date_dimension table.

By using a conformed date dimension, both the sales_fact and inventory_fact tables can share and consistently reference date information. This approach ensures that any analysis involving dates is based on the same standardized data, facilitating accurate and consistent reporting.

Application

Consistency Across Reports: When generating reports that combine sales and inventory data, using conformed dimensions ensures that the date information aligns perfectly. For example, sales and inventory levels can be accurately compared on a monthly or quarterly basis without divergence due to different date definitions.

Simplified Data Management: Maintaining a single date dimension table reduces redundancy and simplifies data management. Any update to the date dimension (e.g., adding fiscal year information) is automatically reflected in all related fact tables.

Easy to Expand: As the data warehouse grows, new fact tables can be easily added while still referencing existing conformed dimensions. This extensibility allows for scalable data warehousing solutions.

Data Vault Modeling is a database modeling technique specifically designed for data warehouses. It focuses on providing a flexible, scalable, and auditable structure that can handle large volumes of data from multiple sources. Data Vault combines the best aspects of third normal form (3NF) and star schema modeling to address the needs of complex data environments.

Key Concepts

Hubs

Hubs represent core business entities (e.g., customers, products) and contain a unique list of business keys.

  • Attributes: Hub tables typically have surrogate keys, business keys, and metadata columns such as load dates and record sources.

  • Purpose: They serve as the central point of reference for all other related data.

CREATE TABLE hub_customer (
    customer_hub_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_business_key VARCHAR(100) NOT NULL,
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL
);

-- Inserting data into Hub table
INSERT INTO hub_customer (customer_business_key, record_source)
VALUES ('CUST123', 'source_system_1');

Links

Links represent relationships between hubs and define how entities are related.

  • Attributes: Link tables contain surrogate keys, foreign keys to the related hubs, and metadata columns such as load dates and record sources.

  • Purpose: They capture associations and many-to-many relationships between hubs.

CREATE TABLE link_customer_order (
    customer_order_link_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_hub_id INT NOT NULL,
    order_hub_id INT NOT NULL,
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL,
    FOREIGN KEY (customer_hub_id) REFERENCES hub_customer(customer_hub_id),
    FOREIGN KEY (order_hub_id) REFERENCES hub_order(order_hub_id)
);

-- Inserting data into Link table
INSERT INTO link_customer_order (customer_hub_id, order_hub_id, record_source)
VALUES (1, 1, 'source_system_1');

Satellites

Satellites store the descriptive attributes of hubs and links, providing context and details.

  • Attributes: Satellite tables contain surrogate keys, foreign keys to the hubs or links they describe, and metadata columns such as load dates and record sources.

  • Purpose: They allow for the capture of historical data and changes over time, supporting auditability and traceability.

CREATE TABLE sat_customer_details (
    customer_details_sat_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_hub_id INT NOT NULL,
    customer_name VARCHAR(100),
    customer_address VARCHAR(255),
    load_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    record_source VARCHAR(50) NOT NULL,
    FOREIGN KEY (customer_hub_id) REFERENCES hub_customer(customer_hub_id)
);

-- Inserting data into Satellite table
INSERT INTO sat_customer_details (customer_hub_id, customer_name, customer_address, record_source)
VALUES (1, 'John Doe', '123 Main St', 'source_system_1');

Advantages

Scalability

Data Vault's modular structure makes it highly scalable. Adding new sources or entities involves creating new hubs, links, and satellites without disrupting existing structures.

Flexibility

The separation of business keys (hubs) and descriptive attributes (satellites) allows for flexible schema changes. New attributes can be added to satellites without affecting the core structure.

Auditability

Data Vault supports data lineage and historical tracking. Every record is timestamped and includes a record source, making it easy to trace the origin and changes of data over time.

Consistency

By separating relationships (links) from entities (hubs) and their attributes (satellites), Data Vault ensures consistent representation of data, reducing redundancy and anomalies.

Use Cases

Complex Data Environments

Data Vault is ideal for environments with multiple data sources and complex relationships. For example, a financial institution integrating data from various systems (CRM, transaction systems, customer support) can use Data Vault to ensure consistent and traceable data integration.

Historical Data Tracking

Organizations needing to maintain a detailed history of changes in their data will benefit from Data Vault's approach. For instance, healthcare providers tracking patient information over time can use Data Vault to maintain accurate and auditable records.

Regulatory Compliance

Industries with stringent regulatory requirements (e.g., finance, healthcare) can leverage Data Vault's auditability to meet compliance standards. The ability to trace data back to its source and understand its changes over time is crucial for audits and reporting.

Data Mart vs. Data Warehouse

Data Mart

A Data Mart is a subset of a data warehouse, focused on a specific business line or team. It is designed for quick access and specific analysis, adjusted to the needs of a particular department.

Data Warehouse

A Data Warehouse is a central repository of integrated data from multiple sources. It supports detailed analysis and reporting, serving the entire organization.

Differences

FeatureData MartData Warehouse
ScopeDepartment-specific or line of businessOrganization-wide
Data IntegrationLimited to specific business areasIntegrates data from multiple business areas
ComplexityLess complex, smaller in sizeMore complex, larger in size
Implementation TimeFaster to implementLonger implementation time
Data GranularityMore detailed, granular dataSummarized, aggregated data
CostLower costHigher cost

Data Warehouse Schema Example

Fact Table: Sales

sale_idproduct_iddate_idstore_idamountquantity
110113015005
210213023003

Dimension Tables:

  • Product
product_idproduct_namecategory
101LaptopElectronics
102SmartphoneElectronics
  • Date
date_iddatemonthquarteryear
12021-01-01JanuaryQ12021
  • Store
store_idstore_namelocation
301Store ANew York
302Store BLos Angeles

Data Mart Schema Example

Fact Table: Marketing Campaigns

campaign_idproduct_iddate_idimpressionsclicks
1101110000500
2102115000700

Dimension Tables:

  • Product
product_idproduct_namecategory
101LaptopElectronics
102SmartphoneElectronics
  • Date
date_iddatemonthquarteryear
12021-01-01JanuaryQ12021

Conclusion

Understanding and implementing these schema design principles is key for building an efficient and scalable data warehouse.

By choosing the appropriate schema model (Star or Snowflake), using surrogate keys, managing slowly changing dimensions, maintaining conformed dimensions, and differentiating between data marts and data warehouses, you can ensure that your data infrastructure supports strong, accurate, and performant data analysis.

These practices will help you make data-driven decisions with confidence, improve data integrity, and optimize query performance, leading to more improved business intelligence and analytics.