Unit 3 - Notes

INT306 6 min read

Unit 3: Relational Operations

1. Relational Algebra

Relational Algebra is a procedural query language that acts as the theoretical foundation for relational databases and SQL. It consists of a set of operations that take one or two relations as input and produce a new relation as output.

Fundamental Operations

Select ( $\sigma$ ): Selects a subset of tuples (rows) that satisfy a specific predicate.
- Notation: $\sigma_{predicate}(R)$
- Example: $\sigma_{age > 20}(Student)$
Project ( $\pi$ ): Selects specific attributes (columns) from a relation, discarding the others. It automatically eliminates duplicates in the result.
- Notation: $\pi_{column\_list}(R)$
- Example: $\pi_{name, id}(Student)$
Rename ( $\rho$ ): Renames a relation or its attributes.
- Notation: $\rho_{new\_name}(R)$
Cartesian Product ( $\times$ ): Combines information of two different relations into one.
- Notation: $R \times S$
Union ( $\cup$ ): Returns all tuples present in either relation R or S (requires union compatibility).
Set Difference ( $-$ ): Returns tuples present in R but not in S.

Join Operations (Relational Algebra Context)

Natural Join ( $\bowtie$ ): Performs a selection on equality of common attributes on the Cartesian product.
Outer Joins: Extension of joins to keep information from one or both tables even if no matching tuple exists (⟕, ⟖, ⟗).

2. Aggregate Functions

Aggregate functions operate on a collection of values (a column or a group) and return a single scalar value.

Common Functions

Function	Description	Syntax Example
`COUNT()`	Returns the number of rows.	`SELECT COUNT(*) FROM Employees;`
`SUM()`	Returns the total sum of a numeric column.	`SELECT SUM(Salary) FROM Employees;`
`AVG()`	Returns the average value of a numeric column.	`SELECT AVG(Salary) FROM Employees;`
`MIN()`	Returns the smallest value.	`SELECT MIN(Age) FROM Students;`
`MAX()`	Returns the largest value.	`SELECT MAX(Price) FROM Products;`

GROUP BY and HAVING

GROUP BY: Arranges identical data into groups. It is often used with aggregate functions.
HAVING: Used to filter groups created by GROUP BY. (Note: WHERE filters rows before grouping; HAVING filters groups after aggregation).

SQL

SELECT Department, AVG(Salary) 
FROM Employees 
GROUP BY Department 
HAVING AVG(Salary) > 50000;

3. SQL Joins

Joins allow the retrieval of data from two or more tables based on a logical relationship between them.

![A set of four Venn diagrams arranged in a 2x2 grid illustrating SQL Join types.

Top-Left: "Inner...](/notes-images/Sem2/INT306/images/INT306_Unit3_img1_55f079c4.jpg)

Types of Joins

INNER JOIN: Returns records that have matching values in both tables.
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table. (Nulls for non-matches).
RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.
FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table.
CROSS JOIN: Produces the Cartesian product of the two tables (every row of Table A combined with every row of Table B).
SELF JOIN: A regular join, but the table is joined with itself.

SQL

-- Inner Join Example
SELECT Employees.Name, Departments.DeptName
FROM Employees
INNER JOIN Departments ON Employees.DeptID = Departments.DeptID;

4. Set Operators

Set operators combine the result sets of two or more SELECT statements.

Rules:
1. The number of columns in the SELECT statements must be the same.
2. The data types of corresponding columns must be compatible.
UNION: Combines results from two queries and removes duplicates.
UNION ALL: Combines results from two queries and includes duplicates.
INTERSECT: Returns only rows returned by both queries.
MINUS / EXCEPT: Returns rows from the first query that are not present in the second query.

5. Subqueries

A Subquery (or Nested Query) is a query within another SQL query and embedded within the WHERE or HAVING clause.

Types of Subqueries

Scalar Subquery: Returns a single value (one row, one column).
Multi-row Subquery: Returns multiple rows (used with operators like IN, ANY, ALL).
Correlated Subquery: The inner query relies on values from the outer query. It executes once for every row selected by the outer query.

SQL

-- Subquery Example: Find employees with salary greater than average
SELECT Name, Salary 
FROM Employees 
WHERE Salary > (SELECT AVG(Salary) FROM Employees);

6. Views

A View is a virtual table based on the result-set of an SQL statement. It does not store data itself; it stores the query definition.

Advantages:
- Security: Restrict user access to specific columns.
- Simplicity: Hide complex joins and logic from the end-user.
- Consistency: Provide a consistent view of data even if the underlying table structure changes.

SQL

CREATE VIEW HighPerformers AS
SELECT Name, Department
FROM Employees
WHERE PerformanceRating > 4;

7. Window Functions

Window functions perform calculations across a set of table rows that are somehow related to the current row. Unlike GROUP BY, window functions do not collapse rows into a single output row; the rows retain their separate identities.

Syntax

FUNCTION_NAME() OVER (PARTITION BY col1 ORDER BY col2)

Key Functions

ROW_NUMBER(): Assigns a unique integer to rows.
RANK(): Assigns a rank with gaps for ties (e.g., 1, 2, 2, 4).
DENSE_RANK(): Assigns a rank without gaps for ties (e.g., 1, 2, 2, 3).
NTILE(n): Divides rows into n buckets.
LEAD() / LAG(): Access data from a subsequent or preceding row.

8. Hashing

Hashing is used to locate data directly on the disk without using an index structure. It uses a Hash Function to map a search key to a specific bucket (storage unit).

Static Hashing

The number of buckets is fixed.
Hash Function ( $h$ ): Maps search key $K$ to address $B$ .
Collision: When two different keys map to the same bucket.
Collision Resolution:
- Open Addressing: Look for the next available slot (Linear Probing).
- Chaining: Use a linked list (overflow chain) at the bucket address.

Dynamic (Extendible) Hashing

Allows the directory structure (number of buckets) to grow or shrink as data is added or removed.
Uses a directory of pointers to buckets; the directory doubles in size when overflow occurs, splitting only the specific bucket that overflowed.

9. Indexing

Indexing creates auxiliary data structures to speed up data retrieval.

Types of Indices

Primary Index: Built on the primary key (ordered file).
Clustering Index: Built on a non-unique ordering key (physically orders the file).
Secondary Index: Built on non-ordering fields. The index is ordered, but the actual data file is not.
Dense vs. Sparse Index:
- Dense: An index entry exists for every search key value.
- Sparse: Index entries exist only for some values.

B+ Trees

The most common indexing structure in DBMS. It is a balanced tree where all data pointers are stored in leaf nodes.

![A detailed technical diagram of a B+ Tree structure.
Top Level: A "Root Node" containing keys 15, ... and pointers.
Middle Level: "Internal Nodes" connected by arrows from the root. Left node has keys [5, 10], Right node has keys [20, 25].
Bottom Level: "Leaf Nodes" containing actual data pointers. The leaf nodes should be connected horizontally by a linked-list arrow (pointing right) to show sequential access.
Visual Style: Rectangular nodes with subdivisions for keys and pointers. Distinct colors for Internal Nodes (light blue) vs Leaf Nodes (light green). Labels indicating "Root", "Internal Nodes", "Leaf Nodes", and "Data File Pointers".]

Properties of B+ Trees:
- Balanced (all leaf nodes are at the same depth).
- Leaf nodes are linked together (Linked List) to facilitate sequential access.
- Supports both Equality searches (=) and Range searches (>, <).

10. Query Optimization

Query optimization is the process of selecting the most efficient execution plan for a SQL query.

A flowchart diagram illustrating the Query Processing Architecture.
The flow starts at the top with ... — AI-generated image — may contain inaccuracies

Steps in Query Processing

Parsing and Translation: Checks syntax and converts the query into an internal representation (Relational Algebra).
Optimization: The DBMS generates multiple execution plans and chooses the one with the lowest cost (CPU, I/O, Memory).
Evaluation: The Execution Engine runs the chosen plan.

Optimization Techniques

Heuristic Optimization: Uses rules to improve performance.
- Rule: Perform Selection ( $\sigma$ ) and Projection ( $\pi$ ) as early as possible to reduce the size of intermediate results.
- Rule: Combine Cartesian Products with Selections to form Joins.
Cost-Based Optimization (CBO):
- Estimates the cost of different plans based on database statistics (table size, number of rows, index availability).
- Calculates cost units based on disk I/O operations.

Unit 2

Unit 4