Unit 3 - Notes
Unit 3: Relational Operations
1. Aggregate Functions
Aggregate functions perform a calculation on a set of values and return a single scalar value. They are often used with the GROUP BY and HAVING clauses in SQL.
Common Aggregate Functions
COUNT(): Returns the number of rows that match a specified criterion.COUNT(*)counts all rows, whileCOUNT(column_name)counts non-NULL values.SUM(): Returns the total sum of a numeric column.AVG(): Returns the average value of a numeric column.MIN(): Returns the smallest value in the selected column.MAX(): Returns the largest value in the selected column.
GROUP BY and HAVING
GROUP BY: Groups rows that have the same values in specified columns into summary rows.HAVING: Used to filter groups created by theGROUP BYclause (since theWHEREclause cannot be used with aggregate functions).
SELECT department_id, COUNT(employee_id) AS total_employees, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY department_id
HAVING COUNT(employee_id) > 5;
2. SQL Joins
Joins are used to combine rows from two or more tables based on a related column between them.
Types of Joins
- INNER JOIN: Returns records that have matching values in both tables.
- LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table. The result is NULL from the right side if there is no match.
- RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.
- FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table. Unmatched rows will contain NULLs for columns from the other table.
- CROSS JOIN: Returns the Cartesian product of the two tables (every row in the left table is combined with every row in the right table).
- SELF JOIN: A regular join, but the table is joined with itself.
-- Example of a LEFT JOIN
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID;
3. Set Operators
Set operators combine the results of two or more SELECT statements into a single result set.
Requirement: The queries being combined must be union-compatible (same number of columns, and corresponding columns must have compatible data types).
UNION: Combines the result sets and removes duplicate rows.UNION ALL: Combines the result sets and retains duplicate rows. It is faster thanUNIONbecause it avoids the deduplication overhead.INTERSECT: Returns only the distinct rows that are output by both the left and right queries.EXCEPT(orMINUSin Oracle): Returns distinct rows from the left query that aren't output by the right query.
SELECT employee_id FROM Department_A
INTERSECT
SELECT employee_id FROM Department_B;
4. Views
A View is a virtual table based on the result-set of an SQL statement. It contains rows and columns just like a real table, but the data is derived dynamically (or stored in the case of materialized views).
Characteristics and Types
- Standard Views: Do not store data physically. They simplify complex queries, enhance security (by restricting access to specific rows/columns), and provide logical data independence.
- Updatable Views: In some cases, you can perform
INSERT,UPDATE, andDELETEoperations on a view. The view must typically map directly to a single base table without aggregate functions,GROUP BY, orDISTINCT. - Materialized Views: The result of the query is physically stored (cached) on disk. They must be refreshed periodically but offer significant performance benefits for expensive queries in data warehousing.
CREATE VIEW HighEarners AS
SELECT employee_id, first_name, salary
FROM employees
WHERE salary > 100000;
5. Subqueries
A subquery (inner query) is a query nested inside another query (outer query).
Types of Subqueries
- Single-row Subquery: Returns zero or one row. Used with single-row comparison operators (
=,<,>,<=,>=). - Multiple-row Subquery: Returns one or more rows. Used with multiple-row operators (
IN,ANY,ALL). - Correlated Subquery: A subquery that references columns from the outer query. It is evaluated once for each row processed by the outer query, making it resource-intensive.
-- Uncorrelated Subquery Example
SELECT first_name, salary FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Correlated Subquery Example using EXISTS
SELECT department_name FROM departments d
WHERE EXISTS (
SELECT 1 FROM employees e
WHERE e.department_id = d.department_id AND e.salary > 100000
);
6. Relational Algebra
Relational algebra is a procedural query language forming the theoretical foundation of SQL. It takes relations as input and produces relations as output.
Basic Operators
- Select (): Extracts rows that satisfy a specified condition. ()
- Project (): Extracts specified columns, discarding others and removing duplicates. ()
- Union (): Returns tuples present in either relation.
- Set Difference (): Returns tuples present in the first relation but not in the second.
- Cartesian Product (): Combines every tuple from the first relation with every tuple from the second.
- Rename (): Renames the output relation or its attributes.
Derived Operators
- Join (): Combines related tuples from two relations.
- Natural Join: Matches based on all attributes with the same name.
- Theta Join: Matches based on a specific condition ().
- Division (): Used for queries containing the phrase "for all". E.g., "Find sailors who have reserved all boats."
7. Window Functions
Window functions perform calculations across a set of table rows that are somehow related to the current row. Unlike aggregate functions, window functions do not cause rows to become grouped into a single output row; the rows retain their separate identities.
Syntax Components
OVER(): Defines the window.PARTITION BY: Divides the result set into partitions (similar toGROUP BY).ORDER BY: Orders the rows within each partition.
Common Window Functions
ROW_NUMBER(): Assigns a unique sequential integer to rows within a partition.RANK(): Assigns a rank. If there's a tie, the next rank is skipped (e.g., 1, 2, 2, 4).DENSE_RANK(): Assigns a rank without skipping numbers in case of ties (e.g., 1, 2, 2, 3).LEAD(): Accesses data from a subsequent row in the same result set.LAG(): Accesses data from a previous row in the same result set.
SELECT employee_id, department_id, salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as dept_salary_rank
FROM employees;
8. Hashing
Hashing is a technique to directly search the location of desired data on the disk without using index structure, utilizing a mathematical function (hash function).
Key Concepts
- Hash Function: Maps a search key to a specific bucket address.
- Bucket: A unit of storage (typically a disk block) that stores one or more records.
- Collision: Occurs when the hash function maps two different keys to the same bucket. Handled via chaining (overflow buckets) or open addressing.
Types of Hashing
- Static Hashing: The number of buckets is fixed. Prone to overflow if data grows, or wasted space if data shrinks.
- Dynamic (Extendible) Hashing: The hash table grows and shrinks dynamically. It uses a directory of pointers to buckets, and the directory size doubles when a bucket overflows, minimizing disk access and maintaining performance.
9. Indexing
Indexing is a data structure technique used to quickly locate and access data in a database. It reduces the number of disk accesses required to process a query.
Types of Indexes
- Primary Index: Created on the primary key of a sequentially ordered file. (Sparse index).
- Clustering Index: Created on a non-primary key that dictates the physical sorting order of the file. A table can only have one clustered index.
- Secondary (Non-Clustered) Index: Created on search keys that are not the ordering key. The physical data is not sorted by this key. A table can have multiple secondary indexes. (Dense index).
Tree-Based Indexing
- B-Trees (Balanced Trees): Every node contains key-pointer pairs. Search, insert, and delete take time.
- B+ Trees: The standard implementation for RDBMS indexing.
- Difference from B-Tree: Data pointers are stored only at the leaf nodes. Internal nodes only store routing keys.
- Advantage: Leaf nodes are linked in a linked list, making range queries (e.g.,
WHERE salary BETWEEN 40000 AND 60000) extremely fast and efficient.
10. Query Optimization
Query optimization is the process of selecting the most efficient query execution plan from the many possible plans generated by the DBMS.
Phases of Query Processing
- Parsing & Translation: SQL is parsed for syntax/semantics and translated into an internal representation (Relational Algebra).
- Optimization: The optimizer evaluates different execution plans.
- Execution: The query engine executes the chosen plan.
Optimization Approaches
- Heuristic (Rule-Based) Optimization: Uses a set of equivalence rules to transform the query into a more efficient form.
- Rule 1: Perform
Select() operations as early as possible to reduce the number of rows. - Rule 2: Perform
Project() operations early to reduce the number of columns. - Rule 3: Replace Cartesian products with Joins whenever possible.
- Rule 1: Perform
- Cost-Based Optimization: The DBMS estimates the "cost" (I/O operations, CPU usage, memory) of various execution plans using database statistics (table sizes, index cardinality, data distribution). It chooses the plan with the lowest overall estimated cost.
Execution Plans
Developers can view how the optimizer intends to run a query using commands like EXPLAIN or EXPLAIN PLAN. It shows whether a sequential scan (full table scan) or an index scan will be used, and which join algorithms (Nested Loop, Hash Join, Merge Join) have been selected.