Introduction

Mastering SQL means going beyond basic queries and diving into advanced concepts that let you query, manage, and optimize databases effectively.

We'll explore window functions, Common Table Expressions (CTEs), and complex joins.

Each of these features lets you express in one readable query what would otherwise take nested subqueries or extra processing outside the database.

Window Functions

Window functions are a powerful feature in SQL that allow you to perform calculations across a set of table rows related to the current row. Unlike aggregate functions, which return a single result for a group of rows, window functions return a value for each row in the result set.

Syntax

The basic syntax for a window function is:

function_name (expression) OVER (
    [PARTITION BY partition_expression]
    [ORDER BY sort_expression]
    [frame_clause]
)

function_name: The name of the window function (e.g., ROW_NUMBER, RANK, SUM).
expression: The column or expression to be calculated.
PARTITION BY: Divides the result set into partitions to which the window function is applied.
ORDER BY: Specifies the order of rows within each partition.
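frame_clause: Defines the subset of rows within each partition, relative to the current row, that the function operates on (e.g., ROWS BETWEEN 2 PRECEDING AND CURRENT ROW).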

Types of Window Functions

Ranking Functions: Assign a rank to each row (e.g., ROW_NUMBER, RANK, DENSE_RANK).
Aggregate Functions: Perform calculations on a set of values (e.g., SUM, AVG, MIN, MAX).
Value Functions: Provide access to values from other rows in the window, such as the previous, next, first, or last row (e.g., LEAD, LAG, FIRST_VALUE, LAST_VALUE).
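
As a minimal sketch of the value functions, the query below uses LAG and LEAD to place the previous and next value on each row. The monthly_revenue table and its rows are made up for illustration, and the syntax assumes a database with window function support (for example PostgreSQL, MySQL 8+, or SQLite 3.25+).

```sql
-- Hypothetical sample data, not one of the article's tables.
CREATE TABLE monthly_revenue (
    month_no INTEGER PRIMARY KEY,
    revenue  INTEGER
);

INSERT INTO monthly_revenue (month_no, revenue) VALUES
    (1, 100),
    (2, 120),
    (3, 90);

SELECT
    month_no,
    revenue,
    LAG(revenue)  OVER (ORDER BY month_no) AS previous_revenue, -- value from the row before
    LEAD(revenue) OVER (ORDER BY month_no) AS next_revenue      -- value from the row after
FROM monthly_revenue
ORDER BY month_no;

-- Expected rows:
-- month_no | revenue | previous_revenue | next_revenue
-- 1        | 100     | NULL             | 120
-- 2        | 120     | 100              | 90
-- 3        | 90      | 120              | NULL
```

The first row has no predecessor and the last row has no successor, so LAG and LEAD return NULL there.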

ROW_NUMBER

The ROW_NUMBER function assigns a unique sequential integer to rows within a partition, starting at 1.

Let's say we have an employees table with the following data:

employee_id	first_name	last_name	department_id	salary
1	John	Doe	1	50000
2	Jane	Smith	1	55000
3	Mary	Johnson	2	60000
4	Mike	Brown	2	62000
5	Emily	Davis	3	48000
6	Alan	White	1	50000
7	Sarah	Green	2	62000

We want to assign a row number to each employee within their department, ordered by their salary.

SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num
FROM employees;

Result:

employee_id	first_name	last_name	department_id	salary	row_num
2	Jane	Smith	1	55000	1
1	John	Doe	1	50000	2
6	Alan	White	1	50000	3
4	Mike	Brown	2	62000	1
7	Sarah	Green	2	62000	2
3	Mary	Johnson	2	60000	3
5	Emily	Davis	3	48000	1

PARTITION BY department_id: Divides the result set into partitions by department.
ORDER BY salary DESC: Orders rows within each partition by salary in descending order.
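ROW_NUMBER(): Assigns a unique row number to each row within the partition. Rows with equal salaries (John and Alan, or Mike and Sarah) still get distinct numbers, but their relative order is not guaranteed unless the ORDER BY includes a tiebreaker such as employee_id.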
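
A common use of ROW_NUMBER is picking the top row per group. The sketch below, which assumes the same employees data and a database with window function support, keeps the highest-paid employee in each department and adds employee_id as a tiebreaker so the result is deterministic. The tiebreaker and the subquery alias ranked are choices made for this example.

```sql
CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    department_id INTEGER,
    salary        INTEGER
);

INSERT INTO employees (employee_id, first_name, last_name, department_id, salary) VALUES
    (1, 'John',  'Doe',     1, 50000),
    (2, 'Jane',  'Smith',   1, 55000),
    (3, 'Mary',  'Johnson', 2, 60000),
    (4, 'Mike',  'Brown',   2, 62000),
    (5, 'Emily', 'Davis',   3, 48000),
    (6, 'Alan',  'White',   1, 50000),
    (7, 'Sarah', 'Green',   2, 62000);

SELECT department_id, employee_id, first_name, salary
FROM (
    SELECT
        e.*,
        ROW_NUMBER() OVER (
            PARTITION BY department_id
            ORDER BY salary DESC, employee_id -- employee_id breaks salary ties
        ) AS row_num
    FROM employees e
) ranked
WHERE row_num = 1
ORDER BY department_id;

-- Expected rows:
-- department_id | employee_id | first_name | salary
-- 1             | 2           | Jane       | 55000
-- 2             | 4           | Mike       | 62000
-- 3             | 5           | Emily      | 48000
```

Mike and Sarah both earn 62000, so without the tiebreaker either of them could be returned for department 2.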

RANK

The RANK function assigns a rank to each row within a partition, with gaps in the rank values for ties.

Example:

Using the same employees table, we want to rank employees within their department based on their salary.

SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_id,
    salary,
    RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;

Result:

employee_id	first_name	last_name	department_id	salary	salary_rank
2	Jane	Smith	1	55000	1
1	John	Doe	1	50000	2
6	Alan	White	1	50000	2
4	Mike	Brown	2	62000	1
7	Sarah	Green	2	62000	1
3	Mary	Johnson	2	60000	3
5	Emily	Davis	3	48000	1

PARTITION BY department_id: Divides the result set into partitions by department.
ORDER BY salary DESC: Orders rows within each partition by salary in descending order.
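RANK(): Assigns a rank to each row within the partition. Employees with the same salary receive the same rank, and the rank values the tied rows would otherwise have occupied are skipped. In department 2, Mike and Sarah share rank 1, so Mary gets rank 3. In department 1, John and Alan share rank 2, so a fourth employee would get rank 4.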
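
To see the gaps more clearly, it can help to put RANK next to DENSE_RANK, which gives tied rows the same rank but does not skip values afterwards. This is a small sketch on the same employees data. The column aliases are chosen for this example.

```sql
CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    department_id INTEGER,
    salary        INTEGER
);

INSERT INTO employees (employee_id, first_name, last_name, department_id, salary) VALUES
    (1, 'John',  'Doe',     1, 50000),
    (2, 'Jane',  'Smith',   1, 55000),
    (3, 'Mary',  'Johnson', 2, 60000),
    (4, 'Mike',  'Brown',   2, 62000),
    (5, 'Emily', 'Davis',   3, 48000),
    (6, 'Alan',  'White',   1, 50000),
    (7, 'Sarah', 'Green',   2, 62000);

SELECT
    employee_id,
    first_name,
    department_id,
    salary,
    RANK()       OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dense_salary_rank
FROM employees
ORDER BY department_id, salary DESC, employee_id;

-- Expected rows:
-- employee_id | first_name | department_id | salary | salary_rank | dense_salary_rank
-- 2           | Jane       | 1             | 55000  | 1           | 1
-- 1           | John       | 1             | 50000  | 2           | 2
-- 6           | Alan       | 1             | 50000  | 2           | 2
-- 4           | Mike       | 2             | 62000  | 1           | 1
-- 7           | Sarah      | 2             | 62000  | 1           | 1
-- 3           | Mary       | 2             | 60000  | 3           | 2
-- 5           | Emily      | 3             | 48000  | 1           | 1
```

The only row where the two functions differ is Mary: RANK skips to 3 after the two-way tie, while DENSE_RANK continues with 2.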

SUM

When SUM is used as a window function with an ORDER BY clause, it calculates a running (cumulative) total of a column within each partition.

Using the same employees table, we want to calculate the cumulative salary for each employee within their department.

SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_id,
    salary,
    SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;

Result:

employee_id	first_name	last_name	department_id	salary	cumulative_salary
1	John	Doe	1	50000	50000
2	Jane	Smith	1	55000	105000
6	Alan	White	1	50000	155000
3	Mary	Johnson	2	60000	60000
4	Mike	Brown	2	62000	122000
7	Sarah	Green	2	62000	184000
5	Emily	Davis	3	48000	48000

PARTITION BY department_id: Divides the result set into partitions by department.
ORDER BY employee_id: Orders rows within each partition by employee ID.
SUM(salary): Adds up the salaries from the first row of the partition through the current row, which produces the cumulative total.
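
The running total above is unambiguous because employee_id is unique. When the ORDER BY column has ties, the default frame treats tied rows as peers and gives them the same total. The sketch below contrasts that default with an explicit ROWS frame. It loads only the department 1 rows to keep the output short, and the aliases range_total and rows_total are made up for this example.

```sql
CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    department_id INTEGER,
    salary        INTEGER
);

-- Department 1 only: John and Alan have the same salary.
INSERT INTO employees (employee_id, first_name, department_id, salary) VALUES
    (1, 'John', 1, 50000),
    (2, 'Jane', 1, 55000),
    (6, 'Alan', 1, 50000);

SELECT
    employee_id,
    first_name,
    salary,
    -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
    -- so rows with the same salary are summed together.
    SUM(salary) OVER (ORDER BY salary) AS range_total,
    -- Explicit ROWS frame with a tiebreaker: strictly row by row.
    SUM(salary) OVER (
        ORDER BY salary, employee_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_total
FROM employees
ORDER BY salary, employee_id;

-- Expected rows:
-- employee_id | first_name | salary | range_total | rows_total
-- 1           | John       | 50000  | 100000      | 50000
-- 6           | Alan       | 50000  | 100000      | 100000
-- 2           | Jane       | 55000  | 155000      | 155000
```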

Common Table Expressions (CTEs)

Common Table Expressions (CTEs) provide a way to define temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.

CTEs make complex queries more readable and easier to manage, allowing you to break down queries into simpler, more manageable parts.

Syntax

The basic syntax for a CTE is:

WITH cte_name AS (
    -- CTE Query
    SELECT ...
)
-- Main Query
SELECT ...
FROM cte_name;

WITH cte_name AS: Defines the CTE.
CTE Query: The query that defines the temporary result set.
Main Query: The query that references the CTE.

Types of CTEs

Non-recursive CTEs: Name an intermediate result set that the main query, or a later CTE, can reference like a table.
Recursive CTEs: Used for hierarchical or tree-structured data.

Simple CTE

The following example defines a simple CTE that selects employee names and their departments.

Let's say we have an employees table and a departments table with the following data:

employees:

employee_id	first_name	last_name	department_id
1	John	Doe	1
2	Jane	Smith	1
3	Mary	Johnson	2
4	Mike	Brown	2
5	Emily	Davis	3

departments:

department_id	department_name
1	HR
2	IT
3	Finance

We want to create a CTE to select employee names and their department names.

WITH EmployeeDepartments AS (
    SELECT 
        e.employee_id, 
        e.first_name, 
        e.last_name, 
        d.department_name
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
)
SELECT 
    employee_id, 
    first_name, 
    last_name, 
    department_name
FROM EmployeeDepartments;

Result:

employee_id	first_name	last_name	department_name
1	John	Doe	HR
2	Jane	Smith	HR
3	Mary	Johnson	IT
4	Mike	Brown	IT
5	Emily	Davis	Finance

WITH EmployeeDepartments AS: Defines the CTE named EmployeeDepartments.
CTE Query: Joins the employees and departments tables to create a temporary result set.
Main Query: Selects data from the EmployeeDepartments CTE.
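
A single WITH clause can define several CTEs, and a later CTE can build on an earlier one. As a sketch of that pattern on the same two tables, the query below first counts employees per department and then attaches the department names. The CTE names DepartmentHeadcount and NamedHeadcount are made up for this example.

```sql
CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(50)
);

CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    department_id INTEGER
);

INSERT INTO departments (department_id, department_name) VALUES
    (1, 'HR'),
    (2, 'IT'),
    (3, 'Finance');

INSERT INTO employees (employee_id, first_name, last_name, department_id) VALUES
    (1, 'John',  'Doe',     1),
    (2, 'Jane',  'Smith',   1),
    (3, 'Mary',  'Johnson', 2),
    (4, 'Mike',  'Brown',   2),
    (5, 'Emily', 'Davis',   3);

WITH DepartmentHeadcount AS (
    -- Step 1: count employees per department_id
    SELECT department_id, COUNT(*) AS headcount
    FROM employees
    GROUP BY department_id
),
NamedHeadcount AS (
    -- Step 2: reuse the first CTE and add the department name
    SELECT d.department_name, h.headcount
    FROM DepartmentHeadcount h
    JOIN departments d ON d.department_id = h.department_id
)
SELECT department_name, headcount
FROM NamedHeadcount
ORDER BY department_name;

-- Expected rows:
-- department_name | headcount
-- Finance         | 1
-- HR              | 2
-- IT              | 2
```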

Recursive CTE

A recursive CTE is useful for hierarchical data, such as organizational charts or tree structures.

Example:

Let's say we have an employees table with a manager_id column that references the employee_id of the manager.

employees:

employee_id	first_name	last_name	manager_id
1	John	Doe	NULL
2	Jane	Smith	1
3	Mary	Johnson	1
4	Mike	Brown	2
5	Emily	Davis	2

We want to create a hierarchical list of employees and their managers.

WITH RECURSIVE EmployeeHierarchy AS (
    SELECT 
        employee_id, 
        first_name, 
        last_name, 
        manager_id,
        1 AS level
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT 
        e.employee_id, 
        e.first_name, 
        e.last_name, 
        e.manager_id,
        eh.level + 1
    FROM employees e
    INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.employee_id
)
SELECT 
    employee_id, 
    first_name, 
    last_name, 
    manager_id,
    level
FROM EmployeeHierarchy;

Result:

employee_id	first_name	last_name	manager_id	level
1	John	Doe	NULL	1
2	Jane	Smith	1	2
3	Mary	Johnson	1	2
4	Mike	Brown	2	3
5	Emily	Davis	2	3

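WITH RECURSIVE EmployeeHierarchy AS: Defines the recursive CTE named EmployeeHierarchy. PostgreSQL and MySQL require the RECURSIVE keyword, while SQL Server uses plain WITH for recursive CTEs.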
Anchor Member (First Part): Selects employees with no manager (top-level employees) and assigns a level of 1.
Recursive Member (Second Part): Joins the employees table with the CTE to find employees managed by the current level of employees and increments the level.
Main Query: Selects data from the EmployeeHierarchy CTE, displaying the hierarchy.
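
The same technique also works in the other direction. As a sketch, the query below starts from one employee (employee_id 4, picked arbitrarily) and walks up the tree to list that employee's chain of managers. The CTE name ManagementChain and the column steps_up are made up for this example, and the WITH RECURSIVE syntax assumes PostgreSQL, MySQL 8+, or SQLite.

```sql
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    manager_id  INTEGER
);

INSERT INTO employees (employee_id, first_name, last_name, manager_id) VALUES
    (1, 'John',  'Doe',     NULL),
    (2, 'Jane',  'Smith',   1),
    (3, 'Mary',  'Johnson', 1),
    (4, 'Mike',  'Brown',   2),
    (5, 'Emily', 'Davis',   2);

WITH RECURSIVE ManagementChain AS (
    -- Anchor member: the employee we start from
    SELECT employee_id, first_name, manager_id, 0 AS steps_up
    FROM employees
    WHERE employee_id = 4
    UNION ALL
    -- Recursive member: the manager of the row found in the previous step
    SELECT m.employee_id, m.first_name, m.manager_id, mc.steps_up + 1
    FROM employees m
    INNER JOIN ManagementChain mc ON m.employee_id = mc.manager_id
)
SELECT employee_id, first_name, steps_up
FROM ManagementChain
ORDER BY steps_up;

-- Expected rows:
-- employee_id | first_name | steps_up
-- 4           | Mike       | 0
-- 2           | Jane       | 1
-- 1           | John       | 2
```

The recursion stops on its own at John, because his manager_id is NULL and the join finds no further row.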

Joins

Joins are used to combine rows from two or more tables based on a related column. Complex joins involve multiple tables and advanced conditions to retrieve more intricate datasets.

Understanding complex joins is important for querying normalized databases and extracting meaningful insights from related tables.

Types of Joins

Inner Join
Left Join
Right Join
Full Outer Join
Cross Join
Self Join

Inner Join

An inner join returns only the rows that have matching values in both tables.

Let's say we have the following employees and departments tables:

employees

employee_id	first_name	last_name	department_id
1	John	Doe	1
2	Jane	Smith	1
3	Mary	Johnson	2
4	Mike	Brown	2
5	Emily	Davis	3

departments

department_id	department_name
1	HR
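2	IT
4	Finance

Emily Davis has department_id 3, which has no row in departments, and Finance (department_id 4) has no employees. These two unmatched rows are what distinguish the join types in the examples that follow.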

We want to list all employees and their corresponding department names.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

employee_id	first_name	last_name	department_name
1	John	Doe	HR
2	Jane	Smith	HR
3	Mary	Johnson	IT
4	Mike	Brown	IT

INNER JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables.
Employees without a matching department in the departments table are excluded from the result.

Left Join

A left join returns all rows from the left table and the matched rows from the right table. Where a left-table row has no match, the right table's columns are NULL.

Using the same employees and departments tables, we want to list all employees and their department names, including those without a department.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;

employee_id	first_name	last_name	department_name
1	John	Doe	HR
2	Jane	Smith	HR
3	Mary	Johnson	IT
4	Mike	Brown	IT
5	Emily	Davis	NULL

LEFT JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including employees without a matching department.
The row for Emily Davis, who doesn't have a matching department, is included with a NULL department_name.
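
A left join combined with an IS NULL filter is a common way to find rows that have no match at all (sometimes called an anti-join). The sketch below lists employees whose department_id does not exist in departments. The sample rows, including a Finance department with id 4, are assumptions made so the block runs on its own.

```sql
CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(50)
);

CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    department_id INTEGER
);

INSERT INTO departments (department_id, department_name) VALUES
    (1, 'HR'),
    (2, 'IT'),
    (4, 'Finance'); -- assumed id; no employee references it

INSERT INTO employees (employee_id, first_name, last_name, department_id) VALUES
    (1, 'John',  'Doe',     1),
    (2, 'Jane',  'Smith',   1),
    (3, 'Mary',  'Johnson', 2),
    (4, 'Mike',  'Brown',   2),
    (5, 'Emily', 'Davis',   3);

SELECT e.employee_id, e.first_name, e.last_name, e.department_id
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id
WHERE d.department_id IS NULL; -- keep only employees with no matching department

-- Expected rows:
-- employee_id | first_name | last_name | department_id
-- 5           | Emily      | Davis     | 3
```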

Right Join

A right join returns all rows from the right table and the matched rows from the left table. Where a right-table row has no match, the left table's columns are NULL.

Using the same employees and departments tables, we want to list all departments and their employees, including departments without employees.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;

employee_id	first_name	last_name	department_name
1	John	Doe	HR
2	Jane	Smith	HR
3	Mary	Johnson	IT
4	Mike	Brown	IT
NULL	NULL	NULL	Finance

RIGHT JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including departments without matching employees.
The row for the Finance department, which doesn't have matching employees, is included with NULL values for employee details.
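
Any right join can be rewritten as a left join by swapping the order of the tables, which many people find easier to read. The sketch below should return the same departments and employees as the right join above, with departments as the left table. The sample rows, including the id 4 for Finance, are assumptions made so the block runs on its own.

```sql
CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(50)
);

CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    department_id INTEGER
);

INSERT INTO departments (department_id, department_name) VALUES
    (1, 'HR'),
    (2, 'IT'),
    (4, 'Finance'); -- assumed id; no employee references it

INSERT INTO employees (employee_id, first_name, last_name, department_id) VALUES
    (1, 'John',  'Doe',     1),
    (2, 'Jane',  'Smith',   1),
    (3, 'Mary',  'Johnson', 2),
    (4, 'Mike',  'Brown',   2),
    (5, 'Emily', 'Davis',   3);

-- departments is now the left table, so every department is kept
SELECT d.department_name, e.employee_id, e.first_name, e.last_name
FROM departments d
LEFT JOIN employees e ON e.department_id = d.department_id
ORDER BY d.department_id, e.employee_id;

-- Expected rows:
-- department_name | employee_id | first_name | last_name
-- HR              | 1           | John       | Doe
-- HR              | 2           | Jane       | Smith
-- IT              | 3           | Mary       | Johnson
-- IT              | 4           | Mike       | Brown
-- Finance         | NULL        | NULL       | NULL
```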

Full Outer Join

A full outer join returns all rows from both tables. Rows that match are combined, and rows without a match on either side are kept, with NULLs in the other table's columns.

Using the same employees and departments tables, we want to list all employees and all departments, including those without matches.

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;

employee_id	first_name	last_name	department_name
1	John	Doe	HR
2	Jane	Smith	HR
3	Mary	Johnson	IT
4	Mike	Brown	IT
5	Emily	Davis	NULL
NULL	NULL	NULL	Finance

FULL OUTER JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including unmatched rows from both tables.
Rows for Emily Davis and the Finance department, which don't have matching entries, are included with NULL values for the missing details.
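
Not every database supports FULL OUTER JOIN (MySQL, for instance, does not). One common workaround, sketched below, is to combine a left join with the unmatched rows from the other side using UNION ALL. The sample rows, including the id 4 for Finance, are assumptions made so the block runs on its own.

```sql
CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(50)
);

CREATE TABLE employees (
    employee_id   INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    department_id INTEGER
);

INSERT INTO departments (department_id, department_name) VALUES
    (1, 'HR'),
    (2, 'IT'),
    (4, 'Finance'); -- assumed id; no employee references it

INSERT INTO employees (employee_id, first_name, last_name, department_id) VALUES
    (1, 'John',  'Doe',     1),
    (2, 'Jane',  'Smith',   1),
    (3, 'Mary',  'Johnson', 2),
    (4, 'Mike',  'Brown',   2),
    (5, 'Emily', 'Davis',   3);

-- Part 1: every employee, with the department name when there is one
SELECT e.employee_id, e.first_name, e.last_name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id
UNION ALL
-- Part 2: only the departments that no employee matched
SELECT e.employee_id, e.first_name, e.last_name, d.department_name
FROM departments d
LEFT JOIN employees e ON e.department_id = d.department_id
WHERE e.employee_id IS NULL;

-- Expected: six rows in total. The five employees (Emily Davis with a NULL
-- department_name) plus one row with NULL employee columns for Finance.
```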

Cross Join

A cross join returns the Cartesian product of the two tables, i.e., it returns all possible combinations of rows.

Example:

Using an employees table and a projects table, we want to list all combinations of employees and projects.

employees

employee_id	first_name	last_name
1	John	Doe
2	Jane	Smith

projects

project_id	project_name
1	Project Alpha
2	Project Beta

SELECT 
    e.first_name, 
    e.last_name, 
    p.project_name
FROM employees e
CROSS JOIN projects p;

first_name	last_name	project_name
John	Doe	Project Alpha
John	Doe	Project Beta
Jane	Smith	Project Alpha
Jane	Smith	Project Beta

CROSS JOIN projects p: Returns every combination of rows from the employees and projects tables, producing a Cartesian product.

Self Join

A self join is a regular join in which a table is joined with itself, using different aliases to tell the two copies apart.

Using the employees table, we want to find each employee and their manager.

employee_id	first_name	last_name	manager_id
1	John	Doe	NULL
2	Jane	Smith	1
3	Mary	Johnson	1
4	Mike	Brown	2
5	Emily	Davis	2

SELECT 
    e.employee_id, 
    e.first_name, 
    e.last_name, 
    m.first_name AS manager_first_name, 
    m.last_name AS manager_last_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

employee_id	first_name	last_name	manager_first_name	manager_last_name
1	John	Doe	NULL	NULL
2	Jane	Smith	John	Doe
3	Mary	Johnson	John	Doe
4	Mike	Brown	Jane	Smith
5	Emily	Davis	Jane	Smith

LEFT JOIN employees m ON e.manager_id = m.employee_id: Joins the employees table with itself to match employees with their managers.
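
The same self join can be turned around to count how many people report to each employee. This is a sketch on the same data. The alias direct_reports is made up for this example.

```sql
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    manager_id  INTEGER
);

INSERT INTO employees (employee_id, first_name, last_name, manager_id) VALUES
    (1, 'John',  'Doe',     NULL),
    (2, 'Jane',  'Smith',   1),
    (3, 'Mary',  'Johnson', 1),
    (4, 'Mike',  'Brown',   2),
    (5, 'Emily', 'Davis',   2);

-- m is the potential manager, e is an employee who reports to m
SELECT
    m.employee_id,
    m.first_name,
    COUNT(e.employee_id) AS direct_reports -- counts only matched rows, so 0 when nobody reports
FROM employees m
LEFT JOIN employees e ON e.manager_id = m.employee_id
GROUP BY m.employee_id, m.first_name
ORDER BY m.employee_id;

-- Expected rows:
-- employee_id | first_name | direct_reports
-- 1           | John       | 2
-- 2           | Jane       | 2
-- 3           | Mary       | 0
-- 4           | Mike       | 0
-- 5           | Emily      | 0
```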

Final Consideration: Applying Advanced SQL Concepts

As a data engineer, you need to know when and how to apply these concepts to keep queries efficient and analysis straightforward. The sections below summarize when to use window functions, CTEs, and joins in real-world scenarios:

Window Functions

When to Use:

Analytics and Reporting: Use window functions to calculate running totals, moving averages, ranks, and other analytics without complex subqueries. They are particularly useful in generating reports where metrics need to be calculated across partitions of data.
Time Series Analysis: When analyzing time-series data, window functions can help compute metrics like cumulative sums, moving averages, and lagged values.
Financial Calculations: Financial analysis often requires complex calculations that window functions can simplify, such as cumulative returns or ranks of financial instruments.

Application:

Sales Analytics: Calculate monthly sales growth percentages, rank products by sales within each category, and compute cumulative sales over time.
Customer Insights: Determine customer rankings based on purchase behavior, calculate running totals of transactions, and analyze trends over time.

Example:

You want a running total of sales, ordered by date:

SELECT 
    sales_date,
    amount,
    SUM(amount) OVER (ORDER BY sales_date) AS running_total
FROM sales;

Common Table Expressions (CTEs)

When to Use:

Breaking Down Complex Queries: Use CTEs to split complex queries into simpler parts, making them easier to read, write, and maintain.
Hierarchical Data: Recursive CTEs are ideal for querying hierarchical or tree-structured data, such as organizational charts, bill of materials, and family trees.
Data Preparation: Prepare data for analysis by filtering, aggregating, or transforming it in a structured and readable way.

Application:

Data Transformation: Use CTEs to transform raw data into a format suitable for reporting or further analysis. For instance, aggregate daily sales data into monthly totals before further processing.
Hierarchical Queries: Generate reports that require hierarchical data processing, such as organizational hierarchies or project dependencies.

Example:

Let's say you have a table sales that contains daily sales data with columns date, product_id, and sales_amount. You want to aggregate this data into monthly totals before further processing or reporting.

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', date) AS month,
        product_id,
        SUM(sales_amount) AS total_sales
    FROM
        sales
    GROUP BY
        DATE_TRUNC('month', date),
        product_id
)
SELECT
    month,
    product_id,
    total_sales
FROM
    monthly_sales
ORDER BY
    month,
    product_id;

We use a CTE named monthly_sales to aggregate the daily sales data into monthly totals.
Within the CTE, we truncate the date column to the month using the DATE_TRUNC function (PostgreSQL syntax; other databases have their own equivalents), group by the truncated date and product_id, and calculate the sum of sales_amount.
Then, in the main query, we select the aggregated monthly sales data from the CTE and order it by month and product_id.
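
CTEs and window functions combine naturally: once the monthly totals exist as a CTE, a window function can compare each month with the previous one. The sketch below assumes PostgreSQL (because of DATE_TRUNC). The sample rows and the aliases previous_month_sales and change_from_previous are made up for this example.

```sql
CREATE TABLE sales (
    date         DATE,
    product_id   INTEGER,
    sales_amount INTEGER
);

-- Hypothetical daily sales for a single product
INSERT INTO sales (date, product_id, sales_amount) VALUES
    ('2024-01-05', 1, 100),
    ('2024-01-20', 1, 150),
    ('2024-02-03', 1, 300),
    ('2024-03-10', 1, 270);

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', date) AS month,
        product_id,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY DATE_TRUNC('month', date), product_id
)
SELECT
    month,
    product_id,
    total_sales,
    LAG(total_sales) OVER (PARTITION BY product_id ORDER BY month) AS previous_month_sales,
    total_sales
        - LAG(total_sales) OVER (PARTITION BY product_id ORDER BY month) AS change_from_previous
FROM monthly_sales
ORDER BY product_id, month;

-- Expected totals per month for product 1:
-- January:  total_sales 250, previous_month_sales NULL, change_from_previous NULL
-- February: total_sales 300, previous_month_sales 250,  change_from_previous 50
-- March:    total_sales 270, previous_month_sales 300,  change_from_previous -30
```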

Joins

When to Use:

Combining Data from Multiple Tables: Use joins to merge data from multiple related tables, especially in normalized databases where data is split into several tables to reduce redundancy.
Reporting and Analytics: Generate detailed reports that require data from different sources, such as combining sales, customer, and product data.
Data Integration: Integrate data from different systems or databases to create a unified view, such as merging customer data from CRM and billing systems.

Application:

Business Intelligence: Combine various data sources to generate detailed business intelligence reports, such as combining sales data with customer demographics and product information.
Data Warehousing: Integrate data from multiple operational databases into a data warehouse for analysis and reporting.

Example:

You want to generate a sales report with customer and product details:

SELECT 
    s.sale_id,
    s.sale_date,
    c.customer_name,
    p.product_name,
    s.amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;
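
Reports often need totals rather than individual sales. As a sketch of that variation, the query below sums sales per customer and uses a left join so that customers without any sales still appear. The table definitions and sample rows are assumptions made for illustration, reduced to the columns the query needs.

```sql
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(50)
);

CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount      INTEGER
);

-- Hypothetical data: Globex has not bought anything yet
INSERT INTO customers (customer_id, customer_name) VALUES
    (1, 'Acme'),
    (2, 'Globex');

INSERT INTO sales (sale_id, customer_id, amount) VALUES
    (1, 1, 100),
    (2, 1, 50);

SELECT
    c.customer_name,
    COUNT(s.sale_id)           AS sale_count,
    COALESCE(SUM(s.amount), 0) AS total_amount -- SUM over no rows is NULL, so default to 0
FROM customers c
LEFT JOIN sales s ON s.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
ORDER BY c.customer_id;

-- Expected rows:
-- customer_name | sale_count | total_amount
-- Acme          | 2          | 150
-- Globex        | 0          | 0
```

With an inner join instead of the left join, Globex would be missing from the report.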

Conclusion

In practice, the choice of SQL technique depends on the specific requirements of the task at hand. Here are some guidelines:

Use Window Functions when you need to perform calculations across a set of rows related to the current row, especially for analytics and reporting.
Use CTEs to simplify complex queries, handle hierarchical data, and prepare data for analysis in a structured and readable manner.
Use Joins to combine data from multiple tables, especially in normalized databases, and generate detailed reports.

By mastering these advanced SQL concepts, you can design efficient, scalable, and maintainable database queries that meet the needs of modern data analysis and reporting.

SQL: Window Functions, CTEs and Joins

Some advanced topics
