SQL: Window Functions, CTEs and Joins

Some advanced topics

Introduction

Mastering SQL means diving into the advanced concepts that let you query, manage, and optimize databases effectively.

We'll explore window functions, Common Table Expressions (CTEs), and complex joins.

These advanced SQL features are tools that can improve your database querying capabilities.

Window Functions

Window functions are a powerful feature in SQL that allow you to perform calculations across a set of table rows related to the current row. Unlike aggregate functions, which return a single result for a group of rows, window functions return a value for each row in the result set.

Syntax

The basic syntax for a window function is:

function_name (expression) OVER (
    [PARTITION BY partition_expression]
    [ORDER BY sort_expression]
    [frame_clause]
)
  • function_name: The name of the window function (e.g., ROW_NUMBER, RANK, SUM).

  • expression: The column or expression to be calculated.

  • PARTITION BY: Divides the result set into partitions to which the window function is applied.

  • ORDER BY: Specifies the order of rows within each partition.

  • frame_clause: Defines the frame, i.e. the subset of rows within the partition (relative to the current row) over which the function is evaluated.

Types of Window Functions

  • Ranking Functions: Assign a rank to each row (e.g., ROW_NUMBER, RANK, DENSE_RANK).

  • Aggregate Functions: Perform calculations on a set of values (e.g., SUM, AVG, MIN, MAX).

  • Value Functions: Provide access to a row's data (e.g., LEAD, LAG, FIRST_VALUE, LAST_VALUE).
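
The value functions are the only category not demonstrated later in this article, so here is a minimal sketch of LAG and LEAD. It assumes SQLite (3.25 or newer) driven through Python's sqlite3 module as a stand-in engine, and it uses a trimmed-down version of the employees table; neither the engine nor the trimmed schema comes from the article itself.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, first_name TEXT, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "John", 1, 50000), (2, "Jane", 1, 55000), (6, "Alan", 1, 50000)],
)

# LAG reads the previous row and LEAD the next row within the partition.
# Both return NULL (None in Python) when no such row exists.
rows = conn.execute(
    """
    SELECT
        first_name,
        salary,
        LAG(salary)  OVER (PARTITION BY department_id ORDER BY employee_id) AS prev_salary,
        LEAD(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS next_salary
    FROM employees
    ORDER BY employee_id
    """
).fetchall()

for row in rows:
    print(row)
```

The first row has no predecessor and the last row has no successor, which is why their prev_salary and next_salary come back as None.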

ROW_NUMBER

The ROW_NUMBER function assigns a unique sequential integer to rows within a partition, starting at 1.

Let's say we have an employees table with the following data:

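| employee_id | first_name | last_name | department_id | salary |
| --- | --- | --- | --- | --- |
| 1 | John | Doe | 1 | 50000 |
| 2 | Jane | Smith | 1 | 55000 |
| 3 | Mary | Johnson | 2 | 60000 |
| 4 | Mike | Brown | 2 | 62000 |
| 5 | Emily | Davis | 3 | 48000 |
| 6 | Alan | White | 1 | 50000 |
| 7 | Sarah | Green | 2 | 62000 |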

We want to assign a row number to each employee within their department, ordered by their salary.

SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num
FROM employees;

Result:

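| employee_id | first_name | last_name | department_id | salary | row_num |
| --- | --- | --- | --- | --- | --- |
| 2 | Jane | Smith | 1 | 55000 | 1 |
| 1 | John | Doe | 1 | 50000 | 2 |
| 6 | Alan | White | 1 | 50000 | 3 |
| 4 | Mike | Brown | 2 | 62000 | 1 |
| 7 | Sarah | Green | 2 | 62000 | 2 |
| 3 | Mary | Johnson | 2 | 60000 | 3 |
| 5 | Emily | Davis | 3 | 48000 | 1 |
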
  • PARTITION BY department_id: Divides the result set into partitions by department.

  • ORDER BY salary DESC: Orders rows within each partition by salary in descending order.

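  • ROW_NUMBER(): Assigns a unique row number to each row within the partition. Rows with equal salaries (such as John Doe and Alan White) are numbered in an arbitrary order unless the ORDER BY includes a tie-breaker such as employee_id.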
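
A common use of ROW_NUMBER is picking the top row per group. The sketch below is one way to do it, assuming SQLite (3.25 or newer) through Python's sqlite3 module and a trimmed employees table; the employee_id tie-breaker is an addition for determinism, not part of the query above.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, first_name TEXT, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", 1, 50000), (2, "Jane", 1, 55000), (3, "Mary", 2, 60000),
        (4, "Mike", 2, 62000), (5, "Emily", 3, 48000), (6, "Alan", 1, 50000),
        (7, "Sarah", 2, 62000),
    ],
)

# Number the rows per department, highest salary first. employee_id breaks
# salary ties so the result is deterministic. The outer query keeps row 1.
top_earners = conn.execute(
    """
    SELECT department_id, first_name, salary
    FROM (
        SELECT
            department_id,
            first_name,
            salary,
            ROW_NUMBER() OVER (
                PARTITION BY department_id
                ORDER BY salary DESC, employee_id
            ) AS row_num
        FROM employees
    ) AS ranked
    WHERE row_num = 1
    ORDER BY department_id
    """
).fetchall()

for row in top_earners:
    print(row)
```

The window function has to live in a subquery because WHERE is evaluated before window functions, so row_num cannot be filtered in the same SELECT that computes it.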

RANK

The RANK function assigns a rank to each row within a partition, with gaps in the rank values for ties.

Example:

Using the same employees table, we want to rank employees within their department based on their salary.

SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_id,
    salary,
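    RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;

Result:

| employee_id | first_name | last_name | department_id | salary | salary_rank |
| --- | --- | --- | --- | --- | --- |
| 2 | Jane | Smith | 1 | 55000 | 1 |
| 1 | John | Doe | 1 | 50000 | 2 |
| 6 | Alan | White | 1 | 50000 | 2 |
| 4 | Mike | Brown | 2 | 62000 | 1 |
| 7 | Sarah | Green | 2 | 62000 | 1 |
| 3 | Mary | Johnson | 2 | 60000 | 3 |
| 5 | Emily | Davis | 3 | 48000 | 1 |
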
  • PARTITION BY department_id: Divides the result set into partitions by department.

  • ORDER BY salary DESC: Orders rows within each partition by salary in descending order.

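  • RANK(): Assigns a rank to each row within the partition. Employees with the same salary receive the same rank, and the following rank values are skipped. In department 2, Mike Brown and Sarah Green share rank 1, so Mary Johnson receives rank 3; in department 1, the next rank after the two employees tied at rank 2 would be 4.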
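
DENSE_RANK is listed among the ranking functions but not shown, so the following sketch puts it next to RANK on department 2, where two salaries tie. It assumes SQLite (3.25 or newer) through Python's sqlite3 module and a trimmed employees table.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, first_name TEXT, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(3, "Mary", 2, 60000), (4, "Mike", 2, 62000), (7, "Sarah", 2, 62000)],
)

# RANK leaves a gap after ties; DENSE_RANK continues with the next integer.
rows = conn.execute(
    """
    SELECT
        first_name,
        salary,
        RANK()       OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_dense_rank
    FROM employees
    ORDER BY salary DESC, employee_id
    """
).fetchall()

for row in rows:
    print(row)
```

Mary is ranked 3 by RANK but 2 by DENSE_RANK, which is the whole difference between the two functions.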

SUM

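Used as a window function with an ORDER BY in the OVER clause, SUM calculates a running (cumulative) total of a column within each partition. Without ORDER BY, it returns the partition's overall total on every row.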

Using the same employees table, we want to calculate the cumulative salary for each employee within their department.

SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_id,
    salary,
    SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;

Result:

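| employee_id | first_name | last_name | department_id | salary | cumulative_salary |
| --- | --- | --- | --- | --- | --- |
| 1 | John | Doe | 1 | 50000 | 50000 |
| 2 | Jane | Smith | 1 | 55000 | 105000 |
| 6 | Alan | White | 1 | 50000 | 155000 |
| 3 | Mary | Johnson | 2 | 60000 | 60000 |
| 4 | Mike | Brown | 2 | 62000 | 122000 |
| 7 | Sarah | Green | 2 | 62000 | 184000 |
| 5 | Emily | Davis | 3 | 48000 | 48000 |
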
  • PARTITION BY department_id: Divides the result set into partitions by department.

  • ORDER BY employee_id: Orders rows within each partition by employee ID.

  • SUM(salary): Because the window has an ORDER BY, sums the salaries from the first row of the partition up to the current row, producing a running total within each partition.
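
The frame_clause from the syntax section matters as soon as the ORDER BY column contains ties. The sketch below contrasts the default frame (RANGE, which includes all peers of the current row) with an explicit ROWS frame. It assumes SQLite (3.25 or newer) through Python's sqlite3 module and only the department 1 rows; the employee_id tie-breaker in the ROWS window is an addition that keeps the output deterministic.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 50000), (2, "Jane", 55000), (6, "Alan", 50000)],
)

rows = conn.execute(
    """
    SELECT
        first_name,
        salary,
        -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
        -- Rows with the same salary are peers and are summed together.
        SUM(salary) OVER (ORDER BY salary) AS range_total,
        -- Explicit ROWS frame: strictly one physical row at a time.
        SUM(salary) OVER (
            ORDER BY salary, employee_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_total
    FROM employees
    ORDER BY salary, employee_id
    """
).fetchall()

for row in rows:
    print(row)
```

John and Alan both earn 50000, so the default frame reports 100000 for each of them, while the ROWS frame steps through 50000, 100000, 155000.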

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) provide a way to define temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.

CTEs make complex queries more readable and easier to manage, allowing you to break down queries into simpler, more manageable parts.

Syntax

The basic syntax for a CTE is:

WITH cte_name AS (
    -- CTE Query
    SELECT ...
)
-- Main Query
SELECT ...
FROM cte_name;
  • WITH cte_name AS: Defines the CTE.

  • CTE Query: The query that defines the temporary result set.

  • Main Query: The query that references the CTE.

Types of CTEs

  • Non-recursive CTEs: Name an intermediate result set that the main query, or a later CTE in the same WITH clause, can reference.

  • Recursive CTEs: Used for hierarchical or tree-structured data.

Simple CTE

A simple (non-recursive) CTE can, for example, select employee names together with their departments.

Let's say we have an employees table and a departments table with the following data:

employees:

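| employee_id | first_name | last_name | department_id |
| --- | --- | --- | --- |
| 1 | John | Doe | 1 |
| 2 | Jane | Smith | 1 |
| 3 | Mary | Johnson | 2 |
| 4 | Mike | Brown | 2 |
| 5 | Emily | Davis | 3 |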

departments:

| department_id | department_name |
| --- | --- |
| 1 | HR |
| 2 | IT |
| 3 | Finance |

We want to create a CTE to select employee names and their department names.

WITH EmployeeDepartments AS (
    SELECT 
        e.employee_id, 
        e.first_name, 
        e.last_name, 
        d.department_name
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
)
SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_name
FROM EmployeeDepartments;

Result:

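| employee_id | first_name | last_name | department_name |
| --- | --- | --- | --- |
| 1 | John | Doe | HR |
| 2 | Jane | Smith | HR |
| 3 | Mary | Johnson | IT |
| 4 | Mike | Brown | IT |
| 5 | Emily | Davis | Finance |
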
  • WITH EmployeeDepartments AS: Defines the CTE named EmployeeDepartments.

  • CTE Query: Joins the employees and departments tables to create a temporary result set.

  • Main Query: Selects data from the EmployeeDepartments CTE.
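
CTEs become more useful when several are chained in one WITH clause, each building on the previous one. The sketch below reuses the EmployeeDepartments idea and adds a second, hypothetical CTE named department_headcount. It assumes SQLite through Python's sqlite3 module as a stand-in engine and a trimmed employees table.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, department_id INTEGER)"
)
conn.execute(
    "CREATE TABLE departments (department_id INTEGER, department_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 1), (2, "Jane", 1), (3, "Mary", 2), (4, "Mike", 2), (5, "Emily", 3)],
)
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "HR"), (2, "IT"), (3, "Finance")],
)

# The second CTE (department_headcount, a name chosen for this sketch)
# reads from the first one, so each step stays small and readable.
rows = conn.execute(
    """
    WITH employee_departments AS (
        SELECT e.employee_id, d.department_name
        FROM employees e
        JOIN departments d ON e.department_id = d.department_id
    ),
    department_headcount AS (
        SELECT department_name, COUNT(*) AS headcount
        FROM employee_departments
        GROUP BY department_name
    )
    SELECT department_name, headcount
    FROM department_headcount
    ORDER BY department_name
    """
).fetchall()

for row in rows:
    print(row)
```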

Recursive CTE

A recursive CTE is useful for hierarchical data, such as organizational charts or tree structures.

Example:

Let's say we have an employees table with a manager_id column that references the employee_id of the manager.

employees:

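| employee_id | first_name | last_name | manager_id |
| --- | --- | --- | --- |
| 1 | John | Doe | NULL |
| 2 | Jane | Smith | 1 |
| 3 | Mary | Johnson | 1 |
| 4 | Mike | Brown | 2 |
| 5 | Emily | Davis | 2 |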

We want to create a hierarchical list of employees and their managers.

WITH RECURSIVE EmployeeHierarchy AS (
    SELECT 
        employee_id, 
        first_name, 
        last_name, 
        manager_id,
        1 AS level
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT 
        e.employee_id, 
        e.first_name, 
        e.last_name, 
        e.manager_id,
        eh.level + 1
    FROM employees e
    INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.employee_id
)
SELECT 
    employee_id, 
    first_name, 
    last_name, 
    manager_id,
    level
FROM EmployeeHierarchy;

Result:

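| employee_id | first_name | last_name | manager_id | level |
| --- | --- | --- | --- | --- |
| 1 | John | Doe | NULL | 1 |
| 2 | Jane | Smith | 1 | 2 |
| 3 | Mary | Johnson | 1 | 2 |
| 4 | Mike | Brown | 2 | 3 |
| 5 | Emily | Davis | 2 | 3 |
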
  • WITH RECURSIVE EmployeeHierarchy AS: Defines the recursive CTE named EmployeeHierarchy. (SQL Server omits the RECURSIVE keyword and uses plain WITH.)

  • Anchor Member (First Part): Selects employees with no manager (top-level employees) and assigns a level of 1.

  • Recursive Member (Second Part): Joins the employees table with the CTE to find employees managed by the current level of employees and increments the level.

  • Main Query: Selects data from the EmployeeHierarchy CTE, displaying the hierarchy.
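
A small extension of the same pattern carries a path string through the recursion, which makes the reporting chain visible in each row. This is a sketch under assumptions: SQLite through Python's sqlite3 module as the engine, a trimmed employees table, and a path column that is not part of the query above.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", None), (2, "Jane", 1), (3, "Mary", 1), (4, "Mike", 2), (5, "Emily", 2)],
)

rows = conn.execute(
    """
    WITH RECURSIVE employee_hierarchy AS (
        -- Anchor member: employees without a manager start the path.
        SELECT employee_id, first_name AS path, 1 AS level
        FROM employees
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: append each report to the manager's path.
        SELECT e.employee_id, eh.path || ' > ' || e.first_name, eh.level + 1
        FROM employees e
        INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
    )
    SELECT path, level
    FROM employee_hierarchy
    ORDER BY path
    """
).fetchall()

for row in rows:
    print(row)
```

Sorting by path lists every employee directly under their chain of managers, which approximates a depth-first view of the organization chart.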

Joins

Joins are used to combine rows from two or more tables based on a related column. Complex joins involve multiple tables and advanced conditions to retrieve more intricate datasets.

Understanding complex joins is important for querying normalized databases and extracting meaningful insights from related tables.

Types of Joins

  1. Inner Join

  2. Left Join

  3. Right Join

  4. Full Outer Join

  5. Cross Join

  6. Self Join

Inner Join

An inner join returns only the rows that have matching values in both tables.

Let's say we have the following employees and departments tables:

employees

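| employee_id | first_name | last_name | department_id |
| --- | --- | --- | --- |
| 1 | John | Doe | 1 |
| 2 | Jane | Smith | 1 |
| 3 | Mary | Johnson | 2 |
| 4 | Mike | Brown | 2 |
| 5 | Emily | Davis | 3 |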

departments

| department_id | department_name |
| --- | --- |
| 1 | HR |
| 2 | IT |

We want to list all employees and their corresponding department names.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

Result:

| employee_id | first_name | last_name | department_name |
| --- | --- | --- | --- |
| 1 | John | Doe | HR |
| 2 | Jane | Smith | HR |
| 3 | Mary | Johnson | IT |
| 4 | Mike | Brown | IT |

  • INNER JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables.

  • Employees without a matching department in the departments table are excluded from the result.

Left Join

A left join returns all rows from the left table and the matched rows from the right table. For left-table rows without a match, the right table's columns are NULL.

Using the same employees and departments tables, we want to list all employees and their department names, including those without a department.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
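
Result:

| employee_id | first_name | last_name | department_name |
| --- | --- | --- | --- |
| 1 | John | Doe | HR |
| 2 | Jane | Smith | HR |
| 3 | Mary | Johnson | IT |
| 4 | Mike | Brown | IT |
| 5 | Emily | Davis | NULL |
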
  • LEFT JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including employees without a matching department.

  • The row for Emily Davis, who doesn't have a matching department, is included with a NULL department_name.
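
The NULLs produced by a left join can also be used as a filter: keeping only the rows where the right side is NULL returns the left rows that have no match (often called an anti-join). A minimal sketch, assuming SQLite through Python's sqlite3 module as a stand-in engine:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, first_name TEXT, last_name TEXT, department_id INTEGER)"
)
conn.execute(
    "CREATE TABLE departments (department_id INTEGER, department_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", 1), (2, "Jane", "Smith", 1), (3, "Mary", "Johnson", 2),
        (4, "Mike", "Brown", 2), (5, "Emily", "Davis", 3),
    ],
)
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "HR"), (2, "IT")])

# Keep only employees whose department_id found no row in departments.
unmatched = conn.execute(
    """
    SELECT e.employee_id, e.first_name, e.last_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.department_id
    WHERE d.department_id IS NULL
    """
).fetchall()

print(unmatched)
```

The filter tests the join key of the right table rather than department_name, because a name column could legitimately be NULL in a matched row.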

Right Join

A right join returns all rows from the right table and the matched rows from the left table. For right-table rows without a match, the left table's columns are NULL.

Using the same employees table, and assuming the departments table also contains a Finance department whose department_id no employee references, we want to list all departments and their employees, including departments without employees.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
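
Result:

| employee_id | first_name | last_name | department_name |
| --- | --- | --- | --- |
| 1 | John | Doe | HR |
| 2 | Jane | Smith | HR |
| 3 | Mary | Johnson | IT |
| 4 | Mike | Brown | IT |
| NULL | NULL | NULL | Finance |
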
  • RIGHT JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including departments without matching employees.

  • The row for the Finance department, which doesn't have matching employees, is included with NULL values for employee details.
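
Any right join can be written as a left join with the two tables swapped, which helps on engines that lack RIGHT JOIN (SQLite before 3.39, for instance). The sketch below assumes SQLite through Python's sqlite3 module and a hypothetical Finance department with department_id 4 that no employee references; that id is chosen for illustration only.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, department_id INTEGER)"
)
conn.execute(
    "CREATE TABLE departments (department_id INTEGER, department_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 1), (2, "Jane", 1), (3, "Mary", 2), (4, "Mike", 2), (5, "Emily", 3)],
)
# Hypothetical data: Finance (id 4) has no employees.
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "HR"), (2, "IT"), (4, "Finance")],
)

# Equivalent to: employees e RIGHT JOIN departments d ON ...
rows = conn.execute(
    """
    SELECT e.employee_id, e.first_name, d.department_name
    FROM departments d
    LEFT JOIN employees e ON e.department_id = d.department_id
    ORDER BY d.department_id, e.employee_id
    """
).fetchall()

for row in rows:
    print(row)
```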

Full Outer Join

A full outer join returns all rows from both tables. Matching rows are combined, and rows without a match on either side are kept, with NULLs filling the other table's columns.

Using the same employees table, and assuming the departments table also contains a Finance department whose department_id no employee references, we want to list all employees and all departments, including those without matches.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
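
Result:

| employee_id | first_name | last_name | department_name |
| --- | --- | --- | --- |
| 1 | John | Doe | HR |
| 2 | Jane | Smith | HR |
| 3 | Mary | Johnson | IT |
| 4 | Mike | Brown | IT |
| 5 | Emily | Davis | NULL |
| NULL | NULL | NULL | Finance |
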
  • FULL OUTER JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including unmatched rows from both tables.

  • Rows for Emily Davis and the Finance department, which don't have matching entries, are included with NULL values for the missing details.
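
Some engines do not support FULL OUTER JOIN (MySQL, and SQLite before 3.39). One common workaround, sketched here, is a left join plus the unmatched rows of the opposite left join, combined with UNION ALL. The sketch assumes SQLite through Python's sqlite3 module and a hypothetical Finance department with department_id 4.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, department_id INTEGER)"
)
conn.execute(
    "CREATE TABLE departments (department_id INTEGER, department_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 1), (2, "Jane", 1), (3, "Mary", 2), (4, "Mike", 2), (5, "Emily", 3)],
)
# Hypothetical data: Finance (id 4) has no employees.
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "HR"), (2, "IT"), (4, "Finance")],
)

rows = conn.execute(
    """
    SELECT employee_id, first_name, department_name
    FROM (
        -- Part 1: every employee, with or without a department.
        SELECT e.employee_id, e.first_name, d.department_name
        FROM employees e
        LEFT JOIN departments d ON e.department_id = d.department_id
        UNION ALL
        -- Part 2: only the departments that matched no employee.
        SELECT NULL, NULL, d.department_name
        FROM departments d
        LEFT JOIN employees e ON e.department_id = d.department_id
        WHERE e.employee_id IS NULL
    ) AS combined
    ORDER BY employee_id IS NULL, employee_id
    """
).fetchall()

for row in rows:
    print(row)
```

UNION ALL is safe here because the second part is filtered down to rows the first part cannot contain, so no duplicates need to be removed.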

Cross Join

A cross join returns the Cartesian product of the two tables, i.e., it returns all possible combinations of rows.

Example:

Using an employees table and a projects table, we want to list all combinations of employees and projects.

employees

| employee_id | first_name | last_name |
| --- | --- | --- |
| 1 | John | Doe |
| 2 | Jane | Smith |

projects

| project_id | project_name |
| --- | --- |
| 1 | Project Alpha |
| 2 | Project Beta |

SELECT 
    e.first_name, 
    e.last_name, 
    p.project_name
FROM employees e
CROSS JOIN projects p;

Result:

| first_name | last_name | project_name |
| --- | --- | --- |
| John | Doe | Project Alpha |
| John | Doe | Project Beta |
| Jane | Smith | Project Alpha |
| Jane | Smith | Project Beta |

  • CROSS JOIN projects p: Returns every combination of rows from the employees and projects tables, producing a Cartesian product.

Self Join

A self join is a regular join, but the table is joined with itself.

Using the employees table, we want to find each employee and their manager.

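| employee_id | first_name | last_name | manager_id |
| --- | --- | --- | --- |
| 1 | John | Doe | NULL |
| 2 | Jane | Smith | 1 |
| 3 | Mary | Johnson | 1 |
| 4 | Mike | Brown | 2 |
| 5 | Emily | Davis | 2 |
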
SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    m.first_name AS manager_first_name, 
    m.last_name AS manager_last_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
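
Result:

| employee_id | first_name | last_name | manager_first_name | manager_last_name |
| --- | --- | --- | --- | --- |
| 1 | John | Doe | NULL | NULL |
| 2 | Jane | Smith | John | Doe |
| 3 | Mary | Johnson | John | Doe |
| 4 | Mike | Brown | Jane | Smith |
| 5 | Emily | Davis | Jane | Smith |
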
  • LEFT JOIN employees m ON e.manager_id = m.employee_id: Joins the employees table with itself to match employees with their managers.
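
The same self join can feed an aggregate, for instance to count direct reports per manager. This is a sketch assuming SQLite through Python's sqlite3 module and a trimmed employees table; it uses an inner join so that employees who manage nobody are left out.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", None), (2, "Jane", 1), (3, "Mary", 1), (4, "Mike", 2), (5, "Emily", 2)],
)

# Alias e is the report, alias m is the manager; both read the same table.
rows = conn.execute(
    """
    SELECT m.first_name AS manager, COUNT(*) AS direct_reports
    FROM employees e
    INNER JOIN employees m ON e.manager_id = m.employee_id
    GROUP BY m.employee_id, m.first_name
    ORDER BY m.employee_id
    """
).fetchall()

for row in rows:
    print(row)
```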

Final Consideration: Applying Advanced SQL Concepts

As a data engineer, you need to know when and how to apply these concepts to keep queries efficient and analysis straightforward. The sections below outline when window functions, CTEs, and joins fit real-world scenarios:

Window Functions

When to Use:

  • Analytics and Reporting: Use window functions to calculate running totals, moving averages, ranks, and other analytics without complex subqueries. They are particularly useful in generating reports where metrics need to be calculated across partitions of data.

  • Time Series Analysis: When analyzing time-series data, window functions can help compute metrics like cumulative sums, moving averages, and lagged values.

  • Financial Calculations: Financial analysis often requires complex calculations that window functions can simplify, such as cumulative returns or ranks of financial instruments.

Application:

  • Sales Analytics: Calculate monthly sales growth percentages, rank products by sales within each category, and compute cumulative sales over time.

  • Customer Insights: Determine customer rankings based on purchase behavior, calculate running totals of transactions, and analyze trends over time.

Example:

You want a running total of sales, accumulated in date order:

SELECT 
    sales_date,
    amount,
    SUM(amount) OVER (ORDER BY sales_date) AS running_total
FROM sales;
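
The query above accumulates row by row. For a running total per month, the sales first need to be grouped by month and the window function applied to the grouped result. The sketch below assumes SQLite (3.25 or newer) through Python's sqlite3 module, ISO-formatted text dates, and made-up sample amounts; strftime is SQLite's way of extracting the month and will differ on other engines.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_date TEXT, amount INTEGER)")
# Illustrative sample data, not taken from the article.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-05", 100), ("2024-01-20", 50), ("2024-02-03", 200), ("2024-03-15", 25)],
)

# Step 1 (subquery): total per month. Step 2 (outer query): running total.
rows = conn.execute(
    """
    SELECT
        month,
        monthly_total,
        SUM(monthly_total) OVER (ORDER BY month) AS running_total
    FROM (
        SELECT strftime('%Y-%m', sales_date) AS month, SUM(amount) AS monthly_total
        FROM sales
        GROUP BY strftime('%Y-%m', sales_date)
    ) AS monthly
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```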

Common Table Expressions (CTEs)

When to Use:

  • Breaking Down Complex Queries: Use CTEs to split complex queries into simpler parts, making them easier to read, write, and maintain.

  • Hierarchical Data: Recursive CTEs are ideal for querying hierarchical or tree-structured data, such as organizational charts, bill of materials, and family trees.

  • Data Preparation: Prepare data for analysis by filtering, aggregating, or transforming it in a structured and readable way.

Application:

  • Data Transformation: Use CTEs to transform raw data into a format suitable for reporting or further analysis. For instance, aggregate daily sales data into monthly totals before further processing.

  • Hierarchical Queries: Generate reports that require hierarchical data processing, such as organizational hierarchies or project dependencies.

Example:

Let's say you have a table sales that contains daily sales data with columns date, product_id, and sales_amount. You want to aggregate this data into monthly totals before further processing or reporting.

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', date) AS month,
        product_id,
        SUM(sales_amount) AS total_sales
    FROM
        sales
    GROUP BY
        DATE_TRUNC('month', date),
        product_id
)
SELECT
    month,
    product_id,
    total_sales
FROM
    monthly_sales
ORDER BY
    month,
    product_id;
  • We use a CTE named monthly_sales to aggregate the daily sales data into monthly totals.

  • Within the CTE, we truncate the date column to the month using the DATE_TRUNC function (available in PostgreSQL; other engines offer different date functions), group by the truncated date and product_id, and calculate the sum of sales_amount.

  • Then, in the main query, we select the aggregated monthly sales data from the CTE and order it by month and product_id.
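
Once the monthly totals live in a CTE, a window function can build on them, for example to compute the monthly sales growth percentage mentioned earlier. This is a sketch under assumptions: SQLite (3.25 or newer) through Python's sqlite3 module, strftime in place of DATE_TRUNC (which SQLite lacks), a single product, and made-up sample amounts.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for your real engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, product_id INTEGER, sales_amount INTEGER)")
# Illustrative sample data for one product, not taken from the article.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-10", 1, 100), ("2024-01-25", 1, 100),
        ("2024-02-11", 1, 250), ("2024-03-02", 1, 200),
    ],
)

rows = conn.execute(
    """
    WITH monthly_sales AS (
        SELECT
            strftime('%Y-%m', date) AS month,
            SUM(sales_amount) AS total_sales
        FROM sales
        GROUP BY strftime('%Y-%m', date)
    )
    SELECT
        month,
        total_sales,
        -- Growth versus the previous month; NULL for the first month.
        ROUND(
            100.0 * (total_sales - LAG(total_sales) OVER (ORDER BY month))
            / LAG(total_sales) OVER (ORDER BY month),
            1
        ) AS growth_pct
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
```

The first month has no predecessor, so LAG returns NULL and the growth percentage stays NULL instead of raising a division error.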

Joins

When to Use:

  • Combining Data from Multiple Tables: Use joins to merge data from multiple related tables, especially in normalized databases where data is split into several tables to reduce redundancy.

  • Reporting and Analytics: Generate detailed reports that require data from different sources, such as combining sales, customer, and product data.

  • Data Integration: Integrate data from different systems or databases to create a unified view, such as merging customer data from CRM and billing systems.

Application:

  • Business Intelligence: Combine various data sources to generate detailed business intelligence reports, such as combining sales data with customer demographics and product information.

  • Data Warehousing: Integrate data from multiple operational databases into a data warehouse for analysis and reporting.

Example:

You want to generate a sales report with customer and product details:

SELECT 
    s.sale_id,
    s.sale_date,
    c.customer_name,
    p.product_name,
    s.amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;

Conclusion

In the real world, as a data engineer, the choice of SQL techniques depends on the specific requirements of the task at hand. Here are some guidelines:

  • Use Window Functions when you need to perform calculations across a set of rows related to the current row, especially for analytics and reporting.

  • Use CTEs to simplify complex queries, handle hierarchical data, and prepare data for analysis in a structured and readable manner.

  • Use Joins to combine data from multiple tables, especially in normalized databases, and generate detailed reports.

By mastering these advanced SQL concepts, you can design efficient, scalable, and maintainable database queries that meet the needs of modern data analysis and reporting.