Introduction
Mastering SQL comes down to diving into the advanced concepts that let you query, manage, and optimize databases effectively.
We'll explore window functions, Common Table Expressions (CTEs), and complex joins.
These features let you express rankings, running totals, hierarchy lookups, and multi-table reports in a single readable query.
Window Functions
Window functions are a powerful feature in SQL that allow you to perform calculations across a set of table rows related to the current row. Unlike aggregate functions, which return a single result for a group of rows, window functions return a value for each row in the result set.
Syntax
The basic syntax for a window function is:
function_name (expression) OVER (
[PARTITION BY partition_expression]
[ORDER BY sort_expression]
[frame_clause]
)
function_name: The name of the window function (e.g., ROW_NUMBER, RANK, SUM).
expression: The column or expression to be calculated.
PARTITION BY: Divides the result set into partitions to which the window function is applied.
ORDER BY: Specifies the order of rows within each partition.
frame_clause: Defines the subset of rows within each partition (the window frame) that the function is evaluated over for the current row.
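To make the frame_clause concrete, here is a minimal sketch that runs a window query through Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the release that added window functions). The three-column employees table and the two_row_sum alias are made up for illustration.

```python
import sqlite3

# In-memory database; window functions need SQLite 3.25 or newer (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 1, 50000), (2, 1, 55000), (6, 1, 50000), (3, 2, 60000), (4, 2, 62000)],
)

rows = conn.execute("""
    SELECT
        employee_id,
        department_id,
        salary,
        -- Frame: only the previous row and the current row of the partition.
        SUM(salary) OVER (
            PARTITION BY department_id
            ORDER BY employee_id
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS two_row_sum
    FROM employees
    ORDER BY department_id, employee_id
""").fetchall()

for row in rows:
    print(row)
```

The first row of each partition has no preceding row, so its two_row_sum is just its own salary.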
Types of Window Functions
Ranking Functions: Assign a rank to each row (e.g., ROW_NUMBER, RANK, DENSE_RANK).
Aggregate Functions: Perform calculations on a set of values (e.g., SUM, AVG, MIN, MAX).
Value Functions: Provide access to a row's data (e.g., LEAD, LAG, FIRST_VALUE, LAST_VALUE).
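The value functions are the least intuitive of the three groups, so a small sketch may help. It uses Python's sqlite3 module (assuming SQLite 3.25 or newer) and a hypothetical daily_sales table; the table, its columns, and the aliases are not part of the examples in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute("CREATE TABLE daily_sales (day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 100), ("2024-01-02", 150), ("2024-01-03", 120)],
)

rows = conn.execute("""
    SELECT
        day,
        amount,
        LAG(amount)         OVER (ORDER BY day) AS prev_amount,   -- previous row
        LEAD(amount)        OVER (ORDER BY day) AS next_amount,   -- following row
        FIRST_VALUE(amount) OVER (ORDER BY day) AS first_amount   -- first row of the window
    FROM daily_sales
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```

LAG returns NULL (None in Python) on the first row and LEAD returns NULL on the last row, because there is no neighbouring row to read from.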
ROW_NUMBER
The ROW_NUMBER function assigns a unique sequential integer to rows within a partition, starting at 1.
Let's say we have an employees table with the following data:
employee_id | first_name | last_name | department_id | salary |
1 | John | Doe | 1 | 50000 |
2 | Jane | Smith | 1 | 55000 |
3 | Mary | Johnson | 2 | 60000 |
4 | Mike | Brown | 2 | 62000 |
5 | Emily | Davis | 3 | 48000 |
6 | Alan | White | 1 | 50000 |
7 | Sarah | Green | 2 | 62000 |
We want to assign a row number to each employee within their department, ordered by their salary.
SELECT
employee_id,
first_name,
last_name,
department_id,
salary,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS row_num
FROM employees;
Result:
employee_id | first_name | last_name | department_id | salary | row_num |
2 | Jane | Smith | 1 | 55000 | 1 |
1 | John | Doe | 1 | 50000 | 2 |
6 | Alan | White | 1 | 50000 | 3 |
4 | Mike | Brown | 2 | 62000 | 1 |
7 | Sarah | Green | 2 | 62000 | 2 |
3 | Mary | Johnson | 2 | 60000 | 3 |
5 | Emily | Davis | 3 | 48000 | 1 |
PARTITION BY department_id: Divides the result set into partitions by department.
ORDER BY salary DESC: Orders rows within each partition by salary in descending order.
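ROW_NUMBER(): Assigns a unique row number to each row within the partition. Rows with equal salaries (John Doe and Alan White, or Mike Brown and Sarah Green) are numbered in an arbitrary order unless the ORDER BY adds a tiebreaker such as employee_id.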
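A common use of ROW_NUMBER is picking one row per group. The sketch below, run through Python's sqlite3 module (assuming SQLite 3.25 or newer), selects the top earner in each department from the sample data. The employee_id tiebreaker and the numbered subquery alias are choices made for this illustration, so that tied salaries always resolve the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER, first_name TEXT, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "John", 1, 50000), (2, "Jane", 1, 55000), (3, "Mary", 2, 60000),
        (4, "Mike", 2, 62000), (5, "Emily", 3, 48000), (6, "Alan", 1, 50000),
        (7, "Sarah", 2, 62000),
    ],
)

rows = conn.execute("""
    SELECT department_id, employee_id, first_name, salary
    FROM (
        SELECT
            employee_id, first_name, department_id, salary,
            -- employee_id breaks salary ties, making the numbering deterministic.
            ROW_NUMBER() OVER (
                PARTITION BY department_id
                ORDER BY salary DESC, employee_id
            ) AS row_num
        FROM employees
    ) AS numbered
    WHERE row_num = 1
    ORDER BY department_id
""").fetchall()

for row in rows:
    print(row)
```

The window function is computed in a subquery because a WHERE clause cannot filter on a window function in the same SELECT.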
RANK
The RANK function assigns a rank to each row within a partition, with gaps in the rank values for ties.
Example:
Using the same employees table, we want to rank employees within their department based on their salary.
SELECT
employee_id,
first_name,
last_name,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;
Result:
employee_id | first_name | last_name | department_id | salary | salary_rank |
2 | Jane | Smith | 1 | 55000 | 1 |
1 | John | Doe | 1 | 50000 | 2 |
6 | Alan | White | 1 | 50000 | 2 |
4 | Mike | Brown | 2 | 62000 | 1 |
7 | Sarah | Green | 2 | 62000 | 1 |
3 | Mary | Johnson | 2 | 60000 | 3 |
5 | Emily | Davis | 3 | 48000 | 1 |
PARTITION BY department_id: Divides the result set into partitions by department.
ORDER BY salary DESC: Orders rows within each partition by salary in descending order.
RANK(): Assigns a rank to each row within the partition. If two employees have the same salary, they receive the same rank, and the next rank value is skipped (in department_id 1, rank 3 is skipped, so the next rank would be 4).
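The gaps are easiest to see next to DENSE_RANK, which handles ties the same way but never skips a value. A minimal comparison on the sample salaries, run through Python's sqlite3 module (assuming SQLite 3.25 or newer); the salary_rank and dense_salary_rank aliases are chosen for this sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, 1, 50000), (2, 1, 55000), (3, 2, 60000), (4, 2, 62000),
        (5, 3, 48000), (6, 1, 50000), (7, 2, 62000),
    ],
)

rows = conn.execute("""
    SELECT
        employee_id,
        department_id,
        salary,
        -- RANK leaves a gap after ties; DENSE_RANK does not.
        RANK()       OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dense_salary_rank
    FROM employees
    ORDER BY department_id, salary DESC, employee_id
""").fetchall()

for row in rows:
    print(row)
```

In department 2 the two 62000 salaries share rank 1; Mary Johnson then gets rank 3 from RANK but 2 from DENSE_RANK.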
SUM
Used as a window function with an ORDER BY, SUM calculates the cumulative (running) sum of a column within a partition.
Using the same employees table, we want to calculate the cumulative salary for each employee within their department.
SELECT
employee_id,
first_name,
last_name,
department_id,
salary,
SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;
Result:
employee_id | first_name | last_name | department_id | salary | cumulative_salary |
1 | John | Doe | 1 | 50000 | 50000 |
2 | Jane | Smith | 1 | 55000 | 105000 |
6 | Alan | White | 1 | 50000 | 155000 |
3 | Mary | Johnson | 2 | 60000 | 60000 |
4 | Mike | Brown | 2 | 62000 | 122000 |
7 | Sarah | Green | 2 | 62000 | 184000 |
5 | Emily | Davis | 3 | 48000 | 48000 |
PARTITION BY department_id: Divides the result set into partitions by department.
ORDER BY employee_id: Orders rows within each partition by employee ID.
SUM(salary): Calculates the cumulative sum of salaries within each partition, from the first row up to and including the current row.
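How SUM behaves depends on the ORDER BY and the frame. The sketch below (Python's sqlite3 module, assuming SQLite 3.25 or newer) contrasts three variants on department 1: no ORDER BY, ORDER BY with the default frame, and ORDER BY with an explicit ROWS frame. The aliases and the employee_id tiebreaker are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, 1, 50000), (2, 1, 55000), (3, 2, 60000), (4, 2, 62000),
        (5, 3, 48000), (6, 1, 50000), (7, 2, 62000),
    ],
)

rows = conn.execute("""
    SELECT
        employee_id,
        salary,
        -- No ORDER BY: the frame is the whole partition, so every row gets the total.
        SUM(salary) OVER (PARTITION BY department_id) AS department_total,
        -- ORDER BY with the default RANGE frame: tied salaries are summed together.
        SUM(salary) OVER (PARTITION BY department_id ORDER BY salary) AS range_running,
        -- Explicit ROWS frame plus a tiebreaker: a strict row-by-row running total.
        SUM(salary) OVER (
            PARTITION BY department_id
            ORDER BY salary, employee_id
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_running
    FROM employees
    WHERE department_id = 1
    ORDER BY salary, employee_id
""").fetchall()

for row in rows:
    print(row)
```

John Doe and Alan White both earn 50000, so the default frame gives both of them 100000, while the ROWS frame steps through 50000 and then 100000.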
Common Table Expressions (CTEs)
Common Table Expressions (CTEs) provide a way to define temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.
CTEs make complex queries more readable and easier to manage, allowing you to break down queries into simpler, more manageable parts.
Syntax
The basic syntax for a CTE is:
WITH cte_name AS (
-- CTE Query
SELECT ...
)
-- Main Query
SELECT ...
FROM cte_name;
WITH cte_name AS: Defines the CTE.
CTE Query: The query that defines the temporary result set.
Main Query: The query that references the CTE.
Types of CTEs
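Non-recursive CTEs: Name an ordinary query so the rest of the statement can reference it like a table; the CTE does not refer to itself.
Recursive CTEs: Reference themselves, which makes them suited to hierarchical or tree-structured data.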
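Several non-recursive CTEs can be chained in one WITH clause, each building on the previous one. As a rough sketch using Python's sqlite3 module, the query below first computes the average salary per department and then counts the employees who earn more than their department's average. The dept_avg and above_avg names are invented for this example, and the salaries match the sample employees table from the window function section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, 1, 50000), (2, 1, 55000), (3, 2, 60000), (4, 2, 62000),
        (5, 3, 48000), (6, 1, 50000), (7, 2, 62000),
    ],
)

rows = conn.execute("""
    WITH dept_avg AS (
        -- Step 1: average salary per department.
        SELECT department_id, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department_id
    ),
    above_avg AS (
        -- Step 2: employees earning more than their department's average.
        SELECT e.employee_id, e.department_id
        FROM employees e
        JOIN dept_avg a ON e.department_id = a.department_id
        WHERE e.salary > a.avg_salary
    )
    SELECT department_id, COUNT(*) AS employees_above_avg
    FROM above_avg
    GROUP BY department_id
    ORDER BY department_id
""").fetchall()

for row in rows:
    print(row)
```

Department 3 does not appear in the output because its only employee earns exactly the department average.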
Simple CTE
A simple CTE can select employee names together with their departments.
Let's say we have an employees table and a departments table with the following data:
employees:
employee_id | first_name | last_name | department_id |
1 | John | Doe | 1 |
2 | Jane | Smith | 1 |
3 | Mary | Johnson | 2 |
4 | Mike | Brown | 2 |
5 | Emily | Davis | 3 |
departments:
department_id | department_name |
1 | HR |
2 | IT |
3 | Finance |
We want to create a CTE to select employee names and their department names.
WITH EmployeeDepartments AS (
SELECT
e.employee_id,
e.first_name,
e.last_name,
d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id
)
SELECT
employee_id,
first_name,
last_name,
department_name
FROM EmployeeDepartments;
Result:
employee_id | first_name | last_name | department_name |
1 | John | Doe | HR |
2 | Jane | Smith | HR |
3 | Mary | Johnson | IT |
4 | Mike | Brown | IT |
5 | Emily | Davis | Finance |
WITH EmployeeDepartments AS: Defines the CTE named EmployeeDepartments.
CTE Query: Joins the employees and departments tables to create a temporary result set.
Main Query: Selects data from the EmployeeDepartments CTE.
Recursive CTE
A recursive CTE is useful for hierarchical data, such as organizational charts or tree structures.
Example:
Let's say we have an employees table with a manager_id column that references the employee_id of the manager.
employees:
employee_id | first_name | last_name | manager_id |
1 | John | Doe | NULL |
2 | Jane | Smith | 1 |
3 | Mary | Johnson | 1 |
4 | Mike | Brown | 2 |
5 | Emily | Davis | 2 |
We want to create a hierarchical list of employees and their managers.
WITH RECURSIVE EmployeeHierarchy AS (
SELECT
employee_id,
first_name,
last_name,
manager_id,
1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT
e.employee_id,
e.first_name,
e.last_name,
e.manager_id,
eh.level + 1
FROM employees e
INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.employee_id
)
SELECT
employee_id,
first_name,
last_name,
manager_id,
level
FROM EmployeeHierarchy;
Result:
employee_id | first_name | last_name | manager_id | level |
1 | John | Doe | NULL | 1 |
2 | Jane | Smith | 1 | 2 |
3 | Mary | Johnson | 1 | 2 |
4 | Mike | Brown | 2 | 3 |
5 | Emily | Davis | 2 | 3 |
WITH RECURSIVE EmployeeHierarchy AS: Defines the recursive CTE named EmployeeHierarchy. Some engines, such as SQL Server, omit the RECURSIVE keyword.
Anchor Member (First Part): Selects employees with no manager (top-level employees) and assigns a level of 1.
Recursive Member (Second Part): Joins the employees table with the CTE to find employees managed by the current level of employees and increments the level.
Main Query: Selects data from the EmployeeHierarchy CTE, displaying the hierarchy.
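The recursive member can carry more than a level counter. As an illustrative variation run through Python's sqlite3 module, the sketch below accumulates the reporting chain for each employee as text. The chain CTE name, the path column, and the ' > ' separator are all made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", None), (2, "Jane", 1), (3, "Mary", 1), (4, "Mike", 2), (5, "Emily", 2)],
)

rows = conn.execute("""
    WITH RECURSIVE chain AS (
        -- Anchor member: top-level employees start their own path.
        SELECT employee_id, first_name AS path
        FROM employees
        WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: append each report to the manager's path.
        SELECT e.employee_id, c.path || ' > ' || e.first_name
        FROM employees e
        JOIN chain c ON e.manager_id = c.employee_id
    )
    SELECT employee_id, path
    FROM chain
    ORDER BY employee_id
""").fetchall()

for row in rows:
    print(row)
```

The recursion stops on its own once a pass finds no further reports; data containing a cycle in manager_id would need an explicit depth limit.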
Joins
Joins are used to combine rows from two or more tables based on a related column. Complex joins involve multiple tables and advanced conditions to retrieve more intricate datasets.
Understanding complex joins is important for querying normalized databases and extracting meaningful insights from related tables.
Types of Joins
Inner Join
Left Join
Right Join
Full Outer Join
Cross Join
Self Join
Inner Join
An inner join returns only the rows that have matching values in both tables.
Let's say we have the following employees and departments tables:
employees
employee_id | first_name | last_name | department_id |
1 | John | Doe | 1 |
2 | Jane | Smith | 1 |
3 | Mary | Johnson | 2 |
4 | Mike | Brown | 2 |
5 | Emily | Davis | 3 |
departments
department_id | department_name |
1 | HR |
2 | IT |
We want to list all employees and their corresponding department names.
SELECT
e.employee_id,
e.first_name,
e.last_name,
d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
Result:
employee_id | first_name | last_name | department_name |
1 | John | Doe | HR |
2 | Jane | Smith | HR |
3 | Mary | Johnson | IT |
4 | Mike | Brown | IT |
INNER JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables.
Employees without a matching department in the departments table are excluded from the result.
Left Join
A left join returns all rows from the left table and the matched rows from the right table. For left-table rows with no match, the right table's columns are NULL.
Using the same employees and departments tables, we want to list all employees and their department names, including those without a matching department.
SELECT
e.employee_id,
e.first_name,
e.last_name,
d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
Result:
employee_id | first_name | last_name | department_name |
1 | John | Doe | HR |
2 | Jane | Smith | HR |
3 | Mary | Johnson | IT |
4 | Mike | Brown | IT |
5 | Emily | Davis | NULL |
LEFT JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including employees without a matching department.
The row for Emily Davis, who doesn't have a matching department, is included with a NULL department_name.
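The NULLs produced by a left join are also a handy filter. A minimal sketch of the resulting "anti-join" pattern, run through Python's sqlite3 module on the same sample rows, keeps only the employees whose department_id has no match in departments:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, department_id INTEGER)"
)
conn.execute("CREATE TABLE departments (department_id INTEGER, department_name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 1), (2, "Jane", 1), (3, "Mary", 2), (4, "Mike", 2), (5, "Emily", 3)],
)
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "HR"), (2, "IT")])

rows = conn.execute("""
    SELECT e.employee_id, e.first_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.department_id
    -- Unmatched employees have NULL in every departments column.
    WHERE d.department_id IS NULL
    ORDER BY e.employee_id
""").fetchall()

print(rows)
```

The filter has to go in the WHERE clause; putting it in the ON condition would change which rows match instead of removing the matched ones.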
Right Join
A right join returns all rows from the right table and the matched rows from the left table. For right-table rows with no match, the left table's columns are NULL.
Using the same employees table and a departments table that also contains a Finance department with no employees (for illustration, department_id 4), we want to list all departments and their employees, including departments without employees.
SELECT
e.employee_id,
e.first_name,
e.last_name,
d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
Result:
employee_id | first_name | last_name | department_name |
1 | John | Doe | HR |
2 | Jane | Smith | HR |
3 | Mary | Johnson | IT |
4 | Mike | Brown | IT |
NULL | NULL | NULL | Finance |
RIGHT JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including departments without matching employees.
The row for the Finance department, which doesn't have matching employees, is included with NULL values for employee details.
Full Outer Join
A full outer join returns all rows from both tables. Matching rows are combined, and rows without a match on either side are kept with NULLs in the other table's columns.
Using the same employees table and a departments table that also contains a Finance department with no employees (for illustration, department_id 4), we want to list all employees and all departments, including those without matches.
SELECT
e.employee_id,
e.first_name,
e.last_name,
d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
Result:
employee_id | first_name | last_name | department_name |
1 | John | Doe | HR |
2 | Jane | Smith | HR |
3 | Mary | Johnson | IT |
4 | Mike | Brown | IT |
5 | Emily | Davis | NULL |
NULL | NULL | NULL | Finance |
FULL OUTER JOIN departments d ON e.department_id = d.department_id: Combines rows from employees and departments where the department_id matches in both tables, including unmatched rows from both tables.
Rows for Emily Davis and the Finance department, which don't have matching entries, are included with NULL values for the missing details.
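Not every engine supports FULL OUTER JOIN (MySQL does not, and SQLite only added it in version 3.39). A common workaround, sketched here with Python's sqlite3 module, is a left join combined via UNION ALL with the unmatched rows from the other side. The sketch assumes a departments table in which Finance has department_id 4, a made-up value that matches no employee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, department_id INTEGER)"
)
conn.execute("CREATE TABLE departments (department_id INTEGER, department_name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 1), (2, "Jane", 1), (3, "Mary", 2), (4, "Mike", 2), (5, "Emily", 3)],
)
# department_id 4 for Finance is an assumption made for this illustration.
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)", [(1, "HR"), (2, "IT"), (4, "Finance")]
)

rows = conn.execute("""
    -- Part 1: every employee, with the department name when there is a match.
    SELECT e.employee_id, e.first_name, d.department_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.department_id
    UNION ALL
    -- Part 2: only the departments that matched no employee.
    SELECT NULL, NULL, d.department_name
    FROM departments d
    LEFT JOIN employees e ON e.department_id = d.department_id
    WHERE e.employee_id IS NULL
""").fetchall()

for row in rows:
    print(row)
```

UNION ALL is safe here because the second part is restricted to unmatched departments, so no row is produced twice.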
Cross Join
A cross join returns the Cartesian product of the two tables, i.e., it returns all possible combinations of rows.
Example:
Using an employees table and a projects table, we want to list all combinations of employees and projects.
employees
employee_id | first_name | last_name |
1 | John | Doe |
2 | Jane | Smith |
projects
project_id | project_name |
1 | Project Alpha |
2 | Project Beta |
SELECT
e.first_name,
e.last_name,
p.project_name
FROM employees e
CROSS JOIN projects p;
Result:
first_name | last_name | project_name |
John | Doe | Project Alpha |
John | Doe | Project Beta |
Jane | Smith | Project Alpha |
Jane | Smith | Project Beta |
CROSS JOIN projects p: Returns every combination of rows from the employees and projects tables, producing a Cartesian product.
Self Join
A self join is a regular join, but the table is joined with itself.
Using the employees table, we want to find each employee and their manager.
employee_id | first_name | last_name | manager_id |
1 | John | Doe | NULL |
2 | Jane | Smith | 1 |
3 | Mary | Johnson | 1 |
4 | Mike | Brown | 2 |
5 | Emily | Davis | 2 |
SELECT
e.employee_id,
e.first_name,
e.last_name,
m.first_name AS manager_first_name,
m.last_name AS manager_last_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
Result:
employee_id | first_name | last_name | manager_first_name | manager_last_name |
1 | John | Doe | NULL | NULL |
2 | Jane | Smith | John | Doe |
3 | Mary | Johnson | John | Doe |
4 | Mike | Brown | Jane | Smith |
5 | Emily | Davis | Jane | Smith |
LEFT JOIN employees m ON e.manager_id = m.employee_id: Joins the employees table with itself to match employees with their managers. Because it is a left join, John Doe, who has no manager, is kept with NULL manager columns.
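The same self join can be read from the manager's side. As a small sketch using Python's sqlite3 module, the query below counts direct reports per employee; the direct_reports alias is invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, first_name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", None), (2, "Jane", 1), (3, "Mary", 1), (4, "Mike", 2), (5, "Emily", 2)],
)

rows = conn.execute("""
    SELECT
        m.employee_id,
        m.first_name,
        -- COUNT(column) ignores NULLs, so employees with no reports get 0.
        COUNT(e.employee_id) AS direct_reports
    FROM employees m
    LEFT JOIN employees e ON e.manager_id = m.employee_id
    GROUP BY m.employee_id, m.first_name
    ORDER BY m.employee_id
""").fetchall()

for row in rows:
    print(row)
```

Here the alias m is the left table, so every employee appears once even if nobody reports to them.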
Final Consideration: Applying Advanced SQL Concepts
As a data engineer, it's crucial to understand when and how to apply advanced SQL concepts to optimize database performance and facilitate effective data analysis. The sections below cover when to use window functions, CTEs, and joins in real-world scenarios:
Window Functions
When to Use:
Analytics and Reporting: Use window functions to calculate running totals, moving averages, ranks, and other analytics without complex subqueries. They are particularly useful in generating reports where metrics need to be calculated across partitions of data.
Time Series Analysis: When analyzing time-series data, window functions can help compute metrics like cumulative sums, moving averages, and lagged values.
Financial Calculations: Financial analysis often requires complex calculations that window functions can simplify, such as cumulative returns or ranks of financial instruments.
Application:
Sales Analytics: Calculate monthly sales growth percentages, rank products by sales within each category, and compute cumulative sales over time.
Customer Insights: Determine customer rankings based on purchase behavior, calculate running totals of transactions, and analyze trends over time.
Example:
You want a running total of sales ordered by sale date:
SELECT
sales_date,
amount,
SUM(amount) OVER (ORDER BY sales_date) AS running_total
FROM sales;
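If the running total should advance once per month instead of once per row, one option is to aggregate first and apply the window function to the aggregate. The sketch below does this through Python's sqlite3 module (assuming SQLite 3.25 or newer). The sample rows are invented, and SQLite's strftime stands in for the month-truncation function of whichever engine is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sales_date TEXT, amount INTEGER)")
# Invented sample rows, stored as ISO-8601 date strings.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-05", 100), ("2024-01-20", 50), ("2024-02-03", 70), ("2024-03-15", 30)],
)

rows = conn.execute("""
    WITH monthly AS (
        -- Collapse daily rows into one row per month.
        SELECT strftime('%Y-%m', sales_date) AS month, SUM(amount) AS monthly_total
        FROM sales
        GROUP BY strftime('%Y-%m', sales_date)
    )
    SELECT
        month,
        monthly_total,
        -- Running total over the monthly rows.
        SUM(monthly_total) OVER (ORDER BY month) AS running_total
    FROM monthly
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

Aggregating in a CTE first keeps the window function simple and makes each step easy to inspect on its own.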
Common Table Expressions (CTEs)
When to Use:
Breaking Down Complex Queries: Use CTEs to split complex queries into simpler parts, making them easier to read, write, and maintain.
Hierarchical Data: Recursive CTEs are ideal for querying hierarchical or tree-structured data, such as organizational charts, bills of materials, and family trees.
Data Preparation: Prepare data for analysis by filtering, aggregating, or transforming it in a structured and readable way.
Application:
Data Transformation: Use CTEs to transform raw data into a format suitable for reporting or further analysis. For instance, aggregate daily sales data into monthly totals before further processing.
Hierarchical Queries: Generate reports that require hierarchical data processing, such as organizational hierarchies or project dependencies.
Example:
Let's say you have a table sales that contains daily sales data with columns date, product_id, and sales_amount. You want to aggregate this data into monthly totals before further processing or reporting.
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', date) AS month,
product_id,
SUM(sales_amount) AS total_sales
FROM
sales
GROUP BY
DATE_TRUNC('month', date),
product_id
)
SELECT
month,
product_id,
total_sales
FROM
monthly_sales
ORDER BY
month,
product_id;
We use a CTE named monthly_sales to aggregate the daily sales data into monthly totals.
Within the CTE, we truncate the date column to the month using the DATE_TRUNC function (PostgreSQL syntax; other engines provide their own date-truncation functions), group by the truncated date and product_id, and calculate the sum of sales_amount.
Then, in the main query, we select the aggregated monthly sales data from the CTE and order it by month and product_id.
Joins
When to Use:
Combining Data from Multiple Tables: Use joins to merge data from multiple related tables, especially in normalized databases where data is split into several tables to reduce redundancy.
Reporting and Analytics: Generate detailed reports that require data from different sources, such as combining sales, customer, and product data.
Data Integration: Integrate data from different systems or databases to create a unified view, such as merging customer data from CRM and billing systems.
Application:
Business Intelligence: Combine various data sources to generate detailed business intelligence reports, such as combining sales data with customer demographics and product information.
Data Warehousing: Integrate data from multiple operational databases into a data warehouse for analysis and reporting.
Example:
You want to generate a sales report with customer and product details:
SELECT
s.sale_id,
s.sale_date,
c.customer_name,
p.product_name,
s.amount
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;
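Such a report is often summarized before it is shared. The sketch below, run through Python's sqlite3 module, joins the three tables and then totals the amounts per customer and product. All rows and the total_amount alias are invented for illustration; only the table and column names come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, customer_name TEXT)")
conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT)")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER, sale_date TEXT, customer_id INTEGER, product_id INTEGER, amount INTEGER)"
)
# Invented sample data.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO products VALUES (?, ?)", [(10, "Widget"), (11, "Gadget")])
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2024-01-05", 1, 10, 100),
        (2, "2024-01-06", 1, 11, 50),
        (3, "2024-01-07", 2, 10, 70),
        (4, "2024-01-08", 1, 10, 25),
    ],
)

rows = conn.execute("""
    SELECT
        c.customer_name,
        p.product_name,
        SUM(s.amount) AS total_amount
    FROM sales s
    JOIN customers c ON s.customer_id = c.customer_id
    JOIN products p ON s.product_id = p.product_id
    GROUP BY c.customer_name, p.product_name
    ORDER BY c.customer_name, p.product_name
""").fetchall()

for row in rows:
    print(row)
```

Because both joins are inner joins, a sale whose customer_id or product_id has no match would silently drop out of the report; a LEFT JOIN keeps it.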
Conclusion
In the real world, as a data engineer, the choice of SQL techniques depends on the specific requirements of the task at hand. Here are some guidelines:
Use Window Functions when you need to perform calculations across a set of rows related to the current row, especially for analytics and reporting.
Use CTEs to simplify complex queries, handle hierarchical data, and prepare data for analysis in a structured and readable manner.
Use Joins to combine data from multiple tables, especially in normalized databases, and generate detailed reports.
By mastering these advanced SQL concepts, you can design efficient, scalable, and maintainable database queries that meet the needs of modern data analysis and reporting.