Home / Sem 2 / INT306 / Unit 3 Subjective

Back to Subject

View Notes Practice MCQ

Unit3 - Subjective Questions

INT306 • Practice Questions with Detailed Answers

Explain the fundamental operations of Relational Algebra with their notations.

Relational Algebra is a procedural query language. The fundamental operations are:

Select ( $\sigma$ ): Selects tuples that satisfy a given predicate.
- Notation: $\sigma_p(r)$
Project ( $\pi$ ): Selects specified columns (attributes) and discards others.
- Notation: $\pi_{A1, A2, ...}(r)$
Union ( $\cup$ ): Returns a relation containing all tuples that appear in either or both of the two specified relations.
- Notation: $r \cup s$
Set Difference ( $-$ ): Returns a relation containing tuples that appear in the first relation but not the second.
- Notation: $r - s$
Cartesian Product ( $\times$ ): Combines information from two relations by pairing every tuple of the first relation with every tuple of the second.
- Notation: $r \times s$
Rename ( $\rho$ ): Renames the output relation or attributes.
- Notation: $\rho_x(E)$

Differentiate between WHERE and HAVING clauses in SQL with an example.

The primary differences between WHERE and HAVING are:

Execution Time:
- WHERE: Filters rows before any grouping or aggregation takes place.
- HAVING: Filters groups after the aggregation has been performed.
Usage with Aggregates:
- WHERE: Cannot contain aggregate functions (like SUM, COUNT) directly.
- HAVING: Is specifically designed to work with aggregate functions.

Example:
Consider a table Sales(Product, Amount).

To find rows where the amount is > 100:
SELECT * FROM Sales WHERE Amount > 100;
To find Products where the total sales are > 1000:
SELECT Product, SUM(Amount) FROM Sales GROUP BY Product HAVING SUM(Amount) > 1000;

Describe the different types of SQL Joins. Explain how an Outer Join differs from an Inner Join.

SQL Joins combine rows from two or more tables based on a related column.

Types of Joins:

Inner Join: Returns records that have matching values in both tables.
Left (Outer) Join: Returns all records from the left table, and the matched records from the right table. (Nulls for unmatched right side).
Right (Outer) Join: Returns all records from the right table, and the matched records from the left table.
Full (Outer) Join: Returns all records when there is a match in either left or right table.

Difference:

Inner Join: If a tuple in table A has no matching tuple in table B, that tuple is excluded from the result.
Outer Join: Preserves tuples even if they have no match. If a tuple in table A matches nothing in table B, the result includes the tuple from A with NULL values for the attributes of B.

What are Window Functions in SQL? How do RANK() and DENSE_RANK() differ?

Window Functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike regular aggregate functions, window functions do not cause rows to become grouped into a single output row; the rows retain their separate identities.

Syntax Example: SELECT profit, RANK() OVER (ORDER BY profit DESC) FROM Sales;

Difference between RANK() and DENSE_RANK():

RANK(): Assigns a rank to each row. If two rows have the same value, they get the same rank, but the next rank is skipped.
- Example Sequence: 1, 2, 2, 4
DENSE_RANK(): Assigns a rank to each row. If two rows have the same value, they get the same rank, and the next rank is not skipped.
- Example Sequence: 1, 2, 2, 3

Explain the structure of a B+ Tree and discuss its advantages over a B-Tree for database indexing.

Structure of a B+ Tree:
A B+ Tree is a balanced binary search tree used for indexing.

Internal Nodes: Store only keys and pointers to children. They do not store actual data pointers.
Leaf Nodes: Store all keys and the actual data pointers (or records).
Linked Leaves: All leaf nodes are linked together in a linked list format, allowing for efficient sequential access.

Advantages over B-Tree:

Height: Since internal nodes only store keys (not data), more keys fit in a single block (node). This leads to a larger fan-out and reduced tree height, reducing disk I/O.
Range Queries: B+ Trees are superior for range queries. Once the starting leaf is found, the system simply traverses the linked list of leaves. A B-Tree requires traversing up and down the tree structure.
Predictability: Search time is consistent as every search must traverse from root to leaf.

Define Views in SQL. What are the advantages of using views?

A View is a virtual table based on the result-set of an SQL statement. It contains rows and columns, just like a real table, but it does not store data physically on the disk (unless it is a Materialized View).

Syntax: CREATE VIEW ViewName AS SELECT ...

Advantages:

Security: specific rows or columns can be hidden from users by granting access only to the view, not the underlying table.
Simplicity: Complex queries (involving multiple joins or aggregations) can be encapsulated in a view, allowing users to select from the view simply.
Logical Data Independence: If the schema of the underlying table changes, the view definition can be updated without affecting the external applications that use the view.
Consistency: Users see a consistent, frozen definition of data logic.

Explain the concept of Correlated Subqueries with an example.

A Correlated Subquery is a subquery that uses values from the outer query. The subquery is executed once for each row processed by the outer query.

Mechanism:

The outer query selects a candidate row.
The value from this row is passed to the inner query.
The inner query executes.
The outer query evaluates the qualification based on the inner query result.

Example:
Find employees whose salary is greater than the average salary of their respective department.

sql
SELECT e1.emp_name, e1.salary
FROM Employee e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM Employee e2
WHERE e2.dept_id = e1.dept_id
);

Here, e1.dept_id is a reference to the outer table Employee e1.

Discuss Extendible Hashing (Dynamic Hashing). How does it handle bucket overflow?

Extendible Hashing is a dynamic hashing technique that handles database growth by splitting buckets and doubling the directory size rather than rehashing all data.

Components:

Directory: An array of pointers to buckets. The size is $2^d$ , where $d$ is the global depth.
Buckets: Store the data records. Each bucket has a local depth $d'$ .

Handling Overflow:
When a bucket overflows (becomes full):

Compare Depths:
- If Local Depth < Global Depth: The bucket is split into two. The local depth of the new buckets is incremented. The directory pointers are updated to point to the new buckets. No directory expansion is needed.
- If Local Depth == Global Depth: The directory size doubles (Global Depth increases by 1). The overflowing bucket splits, and pointers are redistributed. Other buckets remain untouched.

This approach ensures that only the specific overflowing area is reorganized, making it efficient for large databases.

Explain the Set Operators in SQL: UNION, INTERSECT, and EXCEPT. What are the requirements for their usage?

Set operators combine the results of two or more SELECT statements.

UNION: Combines the result sets of two queries and removes duplicates.
- UNION ALL preserves duplicates.
INTERSECT: Returns only the rows that appear in both result sets.
EXCEPT (or MINUS): Returns rows from the first query that are not present in the second query.

Requirements (Union Compatibility):
For these operators to work, the two queries must be union-compatible:

Same Number of Columns: Both queries must return the same number of attributes.
Compatible Data Types: The data types of the corresponding columns must be compatible (e.g., you cannot union a String column with an Integer column).

Derive the Division Operator ( $r \div s$ ) in Relational Algebra using fundamental operators. Provide an example scenario.

The Division operator ( $R \div S$ ) is used to answer queries like "Find entities in R that are associated with all entities in S".

Derivation:
Let $R(A, B)$ and $S(B)$ be relations.

R \div S = \pi_A(R) - \pi_A((\pi_A(R) \times S) - R)

Step-by-step logic:

Identify all candidate $A$ values: $\pi_A(R)$ .
Build all possible combinations of candidate $A$ values with every $B$ value in $S$ : $\pi_A(R) \times S$ .
Identify the combinations that do not exist in the original relation $R$ : $(\pi_A(R) \times S) - R$ .
Identify the $A$ values associated with these missing combinations: $\pi_A(...)$ .
Subtract these "incomplete" $A$ values from the list of all candidates.

Example: Find students (from table Enrolled) who have taken all courses listed in table RequiredCourses.

What is Query Optimization? Explain the difference between Heuristic and Cost-based optimization.

Query Optimization is the process of selecting the most efficient execution plan for a given query to minimize response time and resource usage.

Heuristic Optimization:
- Uses rule-based approaches to transform the query into a more efficient equivalent form.
- Common Rules: Perform Selection early (push down $\sigma$ ), perform Projections early (push down $\pi$ ), and reorder Joins so smaller relations are joined first.
- It does not calculate actual costs but relies on general "good practices."
Cost-Based Optimization (CBO):
- Estimates the cost of various execution plans based on database statistics (table size, index selectivity, row count).
- Assigns a "cost" (usually relative I/O or CPU time) to operations.
- Selects the plan with the lowest estimated cost.

Compare Clustered and Non-Clustered (Secondary) Indexes.

Clustered Index:

Data Organization: The physical order of rows on the disk is the same as the order of the index.
Count: There can be only one clustered index per table (because data can only be sorted physically in one way).
Speed: Faster for retrieving ranges of data or sorting.
Leaf Nodes: The leaf nodes contain the actual data pages.

Non-Clustered (Secondary) Index:

Data Organization: The logical order of the index does not match the physical order of the data on the disk.
Count: A table can have multiple non-clustered indexes.
Speed: generally slower than clustered for range queries because it requires pointer lookups.
Leaf Nodes: The leaf nodes contain pointers (or row IDs) to the data pages, not the data itself.

Explain the Aggregate Functions available in SQL and how NULL values are handled by them.

Aggregate Functions perform a calculation on a set of values and return a single scalar value.

COUNT(): Returns the number of rows.
SUM(): Returns the total sum of a numeric column.
AVG(): Returns the average value.
MIN(): Returns the smallest value.
MAX(): Returns the largest value.

Handling NULLs:

COUNT(*): Counts all rows, including those with NULLs.
COUNT(column): Counts only rows where column is NOT NULL.
SUM, AVG, MIN, MAX: Ignore NULL values completely. For example, the average of (10, NULL, 20) is calculated as $(10+20)/2 = 15$ , not divided by 3.

Describe the Hash Join algorithm. When is it preferred over Nested Loop Join?

Hash Join Algorithm:
Used for joining two relations, $R$ and $S$ , on a common attribute.

Build Phase: The algorithm scans the smaller relation (say $S$ ) and builds an in-memory hash table using the join attribute as the key.
Probe Phase: It then scans the larger relation ( $R$ ). For each tuple in $R$ , it hashes the join attribute and probes the hash table to find matches from $S$ .

Preference:

Hash Join is preferred when both tables are large and there are no existing indices on the join columns. It is generally faster than Nested Loop for large datasets.
Nested Loop Join is preferred if one table is very small, or if the join column is indexed.

Explain the difference between Static Hashing and Dynamic Hashing.

Static Hashing:

The number of buckets is fixed when the file is created.
Function: $h(k) = k \mod M$ (where $M$ is fixed).
Problem: As the database grows, buckets fill up, leading to overflow chains (collisions). Performance degrades to linear search in long chains.
Resizing: Requires rehashing the entire database into a larger file, which is expensive.

Dynamic Hashing:

The number of buckets grows or shrinks as records are added or deleted.
Techniques: Extendible Hashing, Linear Hashing.
Advantage: Maintains performance (approx. 1 disk access per retrieval) regardless of data volume growth.
Mechanism: Uses a directory or specific splitting rules to split only the overflowing bucket without reorganizing the whole data.

What is a Materialized View? How does it differ from a standard View in terms of maintenance and performance?

Materialized View:
A database object that contains the results of a query and physically stores the data on the disk (like a cached table).

Comparison:

Storage:
- Standard View: Virtual; stores only the query definition.
- Materialized View: Physical; stores the actual data rows.
Performance (Read):
- Standard View: Slower for complex queries because the underlying query runs every time the view is accessed.
- Materialized View: Very fast, as data is pre-computed.
Maintenance (Write):
- Standard View: Always up to date.
- Materialized View: Needs Refresh (updating). If the base tables change, the materialized view becomes stale until it is refreshed (immediately or periodically). This adds overhead to write operations.

List and explain the heuristic rules used for Query Optimization in Relational Algebra.

Heuristic rules transform a query tree into an equivalent, more efficient tree:

Push Selection Down ( $\\sigma$ ): Perform selection operations as early as possible (closest to the leaf nodes). This reduces the number of tuples passed to subsequent operations like joins.
Push Projection Down ( $\\pi$ ): Perform projection early to reduce the size of tuples (remove unused columns) before expensive operations like joins.
Reorder Joins: Join the most restrictive relations (resulting in the smallest intermediate relation) first.
Combine Operations: A sequence of unary operations (like a selection followed by another selection) can be combined into a single operation.
Evaluate Cartesian Products last: Never perform a cross product ( $R \times S$ ) if a Join ( $\\bowtie$ ) is possible, as cross products generate massive intermediate results.

Write the SQL syntax and explain the usage of GROUP BY with the HAVING clause using an example.

Usage: GROUP BY groups rows that have the same values in specified columns into summary rows. HAVING filters these groups based on a condition.

Syntax:
sql
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE condition
GROUP BY column_name
HAVING condition;

Example:
Assume a Students table with Department and Marks.
To find departments where the average marks are greater than 80:
sql
SELECT Department, AVG(Marks)
FROM Students
GROUP BY Department
HAVING AVG(Marks) > 80;

Here, the database first groups students by department, calculates the average, and then filters out departments with an average $\le 80$ .

What is the Natural Join? How is it expressed in Relational Algebra and SQL?

Natural Join ( $\bowtie$ ):
A natural join is a specific type of inner join that joins two tables based on all attributes that have the same name in both tables. It removes duplicate columns from the result.

Relational Algebra:
Notation: $R \bowtie S$
If $R(A, B, C)$ and $S(B, C, D)$ , the join condition is implicitly $R.B = S.B$ AND $R.C = S.C$ . The result schema is $(A, B, C, D)$ .

SQL:
sql
SELECT *
FROM R
NATURAL JOIN S;

Note: Use with caution in SQL, as adding a column with the same name to one table later can unintentionally change the join logic.

Explain the concept of Selectivity in the context of Query Optimization and Indexing.

Selectivity is a measure of the variety of unique values in a column, used by the query optimizer to choose indices.

Definition: Selectivity is the ratio of distinct values to the total number of rows.
$Selectivity = \frac{\text{Number of Distinct Values}}{\text{Total Number of Rows}}$
High Selectivity: Values are unique or nearly unique (e.g., Primary Key, Phone Number). Indexes are very effective here because a query filters out most rows.
Low Selectivity: Many repeated values (e.g., Gender: 'M', 'F'). Indexes are generally inefficient here because the database might have to read a large percentage of the table anyway, making a full table scan faster than index lookup.

Unit2 Unit4