Unit 5 - Notes

CSE205 6 min read

Unit 5: Heaps and Hashing

Part 1: Heaps

1. Introduction to Heaps

A Heap is a specialized tree-based data structure that satisfies the heap property. It is essentially an almost complete binary tree. Heaps are primarily used to implement Priority Queues and the HeapSort algorithm.

There are two main types of heaps:

Max-Heap: For any given node $I$ , the value of $I$ is greater than or equal to the values of its children. The largest element is at the root.
Min-Heap: For any given node $I$ , the value of $I$ is less than or equal to the values of its children. The smallest element is at the root.

Array Representation:
Because a heap is a complete binary tree, it can be efficiently represented using an array without pointers.
For a node at index $i$ (0-based index):

Parent: $\lfloor (i-1) / 2 \rfloor$
Left Child: $2i + 1$
Right Child: $2i + 2$

2. Insertion in a Heap

Inserting a new element involves maintaining the complete binary tree structure and restoring the heap property (Heapify Up).

Algorithm (Max-Heap):

Add the new element to the end of the array (the first available leaf position).
Compare the new element with its parent.
If the new element is greater than the parent, swap them.
Repeat step 2-3 until the element reaches the root or is smaller than its parent.

Time Complexity: $O(\log n)$ (where $n$ is the number of nodes).

3. Deletion in a Heap

Deletion usually refers to removing the root element (the maximum in a Max-Heap or minimum in a Min-Heap), as this is the primary operation of a priority queue.

Algorithm (Delete Max from Max-Heap):

Remove the root element (index 0).
Replace the root with the last element in the heap (the last leaf).
Decrease the heap size by 1.
Heapify Down (Percolate Down):
- Compare the new root with its children.
- If the root is smaller than the largest child, swap it with that child.
- Repeat until the node is a leaf or larger than both children.

Time Complexity: $O(\log n)$ .

4. HeapSort

HeapSort is a comparison-based sorting algorithm that uses a Binary Heap data structure. It sorts an array in two phases.

Phase 1: Build Heap
Convert the unsorted array into a Max-Heap. This is done by calling heapify on all non-leaf nodes starting from the last non-leaf node up to the root.

Time complexity to build heap: $O(n)$ .

Phase 2: Extraction
Repeatedly extract the maximum element from the heap to build the sorted array.

Swap the root (maximum value) with the last element of the heap.
Reduce the heap size by 1 (effectively "sorting" the max element at the end of the array).
Call heapify on the new root to restore the Max-Heap property.
Repeat until the heap size is 1.

Complexity Analysis:

Time Complexity: $O(n \log n)$ for Best, Average, and Worst cases.
Space Complexity: $O(1)$ (In-place sort).
Stability: Not Stable.

Part 2: Hashing Introduction

1. Basic Concepts

Hashing is a technique used to convert a range of key values into a range of indexes of an array. It allows for the efficient storage and retrieval of data.

Goal: To perform Insertion, Deletion, and Search operations in $O(1)$ time on average.
Hash Table: An array of size $M$ where data is stored.
Hash Function: A mathematical function $h(k)$ that maps a key $k$ to an index in the range $[0, M-1]$ .
Collision: Occurs when two distinct keys map to the same index, i.e., $h(k_1) = h(k_2)$ where $k_1 \neq k_2$ .

2. Hash Functions

A good hash function should:

Be easy to compute.
Distribute keys uniformly across the hash table to minimize collisions.

Common Methods:

Division Method: $h(k) = k \mod M$ . (Best if $M$ is a prime number).
Multiplication Method: $h(k) = \lfloor M(kA \mod 1) \rfloor$ , where $0 < A < 1$ .
Mid-Square Method: Square the key $k^2$ and extract the middle digits.
Folding Method: Divide the key into parts and add them together.

Part 3: Open Hashing (Separate Chaining)

Open Hashing, also known as Separate Chaining, is a collision resolution technique where the hash table is an array of pointers (or references) to linked lists.

Mechanism

The hash function calculates the index: $i = h(key)$ .
If the slot at Table[i] is empty, a new linked list is initiated, and the key is inserted.
If Table[i] is occupied (collision), the new key is appended to the linked list at that index.

Operations

Insert: Compute hash, insert at the head (or tail) of the linked list.
Search: Compute hash, traverse the linked list at that index.
Delete: Compute hash, search the list, and remove the node.

Pros and Cons

Pros:
- Simple to implement.
- The hash table never fills up (we can always add nodes to the lists).
- Less sensitive to the hash function's quality or load factor.
Cons:
- Requires extra memory for linked list pointers.
- Cache performance is poor due to non-contiguous memory allocation (linked lists).

Part 4: Closed Hashing (Open Addressing)

In Closed Hashing (Open Addressing), all elements are stored directly within the hash table array. No external data structures (like linked lists) are used.

If a collision occurs, the algorithm searches for the next available empty slot using a probing sequence.

Load Factor ( $\lambda$ ): Number of elements / Table Size. In open addressing, $\lambda$ must be $\le 1$ .

1. Linear Probing

If a collision occurs at index $h(k)$ , check the next index sequentially.

Formula:

Index = (h(k) + i) \pmod M

Where

i = 0, 1, 2, ...

(the probe number).

Process: Check $h(k)$ , then $h(k)+1$ , then $h(k)+2$ , wrapping around if necessary until an empty slot is found.
Issue - Primary Clustering: Continuous blocks of occupied cells form. Any key hashing into the cluster extends the cluster, increasing search time for future insertions.

2. Quadratic Probing

Instead of a constant step size (1), the step size increases quadratically to reduce primary clustering.

Formula:

Index = (h(k) + i^2) \pmod M

Where

i = 0, 1, 2, ...

Process: Check $h(k)$ , then $h(k)+1$ , then $h(k)+4$ , then $h(k)+9$ , etc.
Issue - Secondary Clustering: Keys hashing to the same initial position follow the exact same probe sequence.
Constraint: Table size should ideally be a prime number congruent to $3 \mod 4$ to ensure the probe hits at least half the table.

3. Double Hashing

Uses a second hash function to determine the step size if a collision occurs. This is the most efficient open addressing method.

Formula:

Index = (h_1(k) + i \cdot h_2(k)) \pmod M

$h_1(k)$ : Primary hash function.
$h_2(k)$ : Secondary hash function (must never yield 0).
$i$ : Probe number $(0, 1, 2...)$ .

Properties of $h_2(k)$ :

It must be different from $h_1(k)$ .
Values must be relatively prime to the table size to ensure the probe sequence visits all slots.
- Common choice: If $M$ is prime, $h_2(k) = R - (k \mod R)$ , where $R < M$ is a prime.

Advantage: Eliminates both primary and secondary clustering (uniform hashing).

Summary Comparison

Feature	Separate Chaining	Open Addressing
Memory	Uses extra pointers.	No pointers, fixed table.
Storage	Can store more elements than table size ( $n > M$ ).	Elements limited to table size ( $n \le M$ ).
Cache	Poor (pointer jumping).	Good (array locality).
Clustering	No clustering issues.	Susceptible to clustering.
Search Time	$O(1 + \lambda)$	Depends on $\lambda$ ; degrades as table fills.

Unit 4

Unit 6

Unit 5 - Notes

Table of Contents

Unit 5: Heaps and Hashing

Part 1: Heaps

1. Introduction to Heaps

2. Insertion in a Heap

3. Deletion in a Heap

4. HeapSort

Part 2: Hashing Introduction

1. Basic Concepts

2. Hash Functions

Part 3: Open Hashing (Separate Chaining)

Mechanism

Operations

Pros and Cons

Part 4: Closed Hashing (Open Addressing)

1. Linear Probing

2. Quadratic Probing

3. Double Hashing

Summary Comparison