Unit 2 - Notes

CSE408

Unit 2: String Matching Algorithms and Computational Geometry

1. Exhaustive Search

Exhaustive search is a general problem-solving approach based on generating all possible candidate solutions to a problem and then testing each candidate to determine whether it satisfies the problem's constraints or optimizes the objective function.

Characteristics: Simple to implement, guarantees finding a solution if one exists, but usually highly inefficient.
Time Complexity: Typically exponential ( $O(2^n)$ ) or factorial ( $O(n!)$ ), or polynomial of high degree, because every candidate is generated and tested. This makes it practical only for very small problem sizes.
Applications: Traveling Salesman Problem (generating all permutations), Knapsack Problem (generating all subsets). It serves as the foundation for Brute-Force algorithms.
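
As an illustration of the "generate all subsets" idea, here is a minimal Python sketch of an exhaustive 0/1 knapsack search. The function name `knapsack_exhaustive` and its return format are assumptions made for this example, not part of the notes.

```python
from itertools import combinations


def knapsack_exhaustive(weights, values, capacity):
    """Try every subset of item indices and keep the most valuable one that fits."""
    n = len(weights)
    best_value, best_subset = 0, ()
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total_weight = sum(weights[i] for i in subset)
            if total_weight > capacity:
                continue  # violates the constraint, discard this candidate
            total_value = sum(values[i] for i in subset)
            if total_value > best_value:
                best_value, best_subset = total_value, subset
    return best_value, best_subset
```

All $2^n$ subsets are examined, which is why the approach stops being usable after a few dozen items.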

2. Sequential Search and Brute-Force String Matching

These are direct applications of the exhaustive search paradigm to searching problems.

Sequential Search (Linear Search)

Concept: To find a target value within a list, sequentially check each element of the list until a match is found or the whole list has been searched.
Algorithm:
1. Start from the first element.
2. Compare the current element with the target.
3. If they match, return the index.
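4. If not, move to the next element and repeat from step 2.
5. If the end of the list is reached without a match, report that the target is not present.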
Time Complexity:
- Best Case: $O(1)$ (found at the first position)
- Worst Case: $O(n)$ (found at the last position or not present)
- Average Case: $O(n)$
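
The steps above can be sketched in Python as follows. Returning `-1` for "not found" is a convention chosen for this example.

```python
def sequential_search(items, target):
    """Return the index of the first element equal to target, or -1 if absent."""
    for index, item in enumerate(items):
        if item == target:
            return index  # match found, stop immediately
    return -1  # the whole list was scanned without a match
```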

Brute-Force String Matching

Problem: Given a text $T$ of length $n$ and a pattern $P$ of length $m$ ( $m \le n$ ), find all occurrences of $P$ in $T$ .
Concept: Align the pattern at every possible starting position in the text (from index $0$ to $n-m$ ) and check if every character matches.

3. String Matching Algorithms

3.1 Naive String-Matching Algorithm

This is the formalization of the Brute-Force string matching method.

How it works:
The algorithm slides the pattern over the text one shift at a time. For each shift $s$ (where $0 \le s \le n-m$ ), it compares the pattern characters $P[1..m]$ with the text characters $T[s+1 .. s+m]$ .

Pseudocode:

    NAIVE-STRING-MATCHER(T, P)
        n = T.length
        m = P.length
        for s = 0 to n - m
            if P[1..m] == T[s+1 .. s+m]
                print "Pattern occurs with shift" s

Time Complexity:
- Best Case: $O(n)$ when the first character of the pattern doesn't appear in the text at all.
- Worst Case: $O((n-m+1) \times m)$ or $O(nm)$ . Occurs when all characters of the text and pattern are the same, or when only the last character is different (e.g., $T = \text{"AAAAAAAAAB"}$ , $P = \text{"AAAB"}$ ).
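
A possible Python rendering of the pseudocode is sketched below. It uses 0-based indexing (so shift $s$ compares $P$ with $T[s .. s+m-1]$ ) and returns the list of matching shifts instead of printing them; both choices are assumptions of this example.

```python
def naive_string_matcher(text, pattern):
    """Return every shift s at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        j = 0
        # Compare character by character, stopping at the first mismatch.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            shifts.append(s)
    return shifts
```

On the worst-case input from above, `naive_string_matcher("AAAAAAAAAB", "AAAB")` compares almost the whole pattern at every shift before finding the single match.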

3.2 Rabin-Karp Algorithm

The Rabin-Karp algorithm improves upon the naive method by using hashing to find an exact match of a pattern string in a text.

How it works:
Instead of comparing strings character by character initially, it calculates a hash value for the pattern $P$ and a hash value for each substring of text $T$ of length $m$ . If the hash values match, it performs a full character-by-character comparison to ensure it's not a spurious hit (hash collision).
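Rolling Hash: To compute the hash of the next window efficiently, the algorithm uses a rolling hash function. The window is treated as a number in some base $d$ , taken modulo a prime $q$ . When the window shifts, the contribution of the old leading character is subtracted, the result is multiplied by $d$ , and the new trailing character is added, all modulo $q$ . Each update takes $O(1)$ time.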
Algorithm Steps:
1. Compute the hash value of the pattern $P$ .
2. Compute the hash value of the first window of text $T$ (length $m$ ).
3. Slide the window one character at a time.
4. If hashes match, compare the strings character by character.
5. Update the text hash using the rolling hash technique.
Time Complexity:
- Average and Best Case: $O(n + m)$ .
- Worst Case: $O(nm)$ , occurring when there are many hash collisions (spurious hits).
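
The following Python sketch shows one way to put these steps together. The base `256` and the modulus `1_000_000_007` are illustrative choices, not values prescribed by the notes; any base and large prime would serve.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return every shift at which pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)  # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):  # hashes of the pattern and of the first window
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes may be a spurious hit, so verify the characters.
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll: drop text[s], shift left by one digit, append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % modulus
    return shifts
```

Passing a tiny modulus such as `13` forces many collisions and shows that the verification step keeps the result correct even then.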

3.3 Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm exploits repeated sub-patterns inside the pattern itself (prefixes that reappear later in the pattern) to bring the worst-case running time down to $O(n + m)$ .

How it works:
Instead of shifting the pattern by exactly $1$ position upon a mismatch, KMP uses a precomputed table (LPS, the Longest Proper Prefix which is also a Suffix) to avoid re-comparing text characters that are already known to match a prefix of the pattern.
The LPS Array ( $\pi$ array):
For each index $i$ of the pattern, $LPS[i]$ stores the length of the longest proper prefix of $P[0..i]$ that is also a suffix of $P[0..i]$ . For example, for $P = \text{"AAAB"}$ the array is $[0, 1, 2, 0]$ .
Algorithm Steps:
1. Preprocess the pattern $P$ to calculate the LPS array. This takes $O(m)$ time.
2. Search the text $T$ . Keep track of matched characters $j$ .
3. If a mismatch occurs at $P[j]$ with $j > 0$ , we don't need to backtrack the text pointer $i$ . Instead, we update $j = LPS[j-1]$ and compare again, safely skipping redundant comparisons. If $j = 0$ , we simply advance $i$ .
Time Complexity:
- Preprocessing: $O(m)$
- Matching: $O(n)$
- Total Worst Case: $O(n + m)$ .
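
A compact Python sketch of both phases follows. The function names `compute_lps` and `kmp_search` are chosen for this example, and indexing is 0-based.

```python
def compute_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    lps = [0] * len(pattern)
    length = 0  # length of the currently matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]  # fall back to a shorter prefix
        else:
            i += 1  # lps[i] stays 0
    return lps


def kmp_search(text, pattern):
    """Return every shift at which pattern occurs in text (0-based)."""
    if not pattern:
        return []
    lps = compute_lps(pattern)
    shifts = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # the text pointer i never moves backwards
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            shifts.append(i - j + 1)
            j = lps[j - 1]  # keep going to find overlapping matches
    return shifts
```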

4. Computational Geometry

Computational geometry is the study of algorithms for problems stated in terms of geometric objects such as points, lines, and polygons.

4.1 Closest-Pair Problem

Problem: Given a set of $n$ points in a 2D plane, find the pair of points with the smallest Euclidean distance between them.
Brute-Force Approach: Calculate the distance between all possible pairs of points and find the minimum.
- Number of pairs: $n(n-1)/2$ .
- Time Complexity: $O(n^2)$ .
Divide and Conquer Approach:
1. Sort the points based on their x-coordinates.
2. Divide the points into two equal halves by a vertical line.
3. Recursively find the closest pair in the left half ( $d_L$ ) and the right half ( $d_R$ ).
4. Let $d = \min(d_L, d_R)$ .
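5. Check whether some pair with distance less than $d$ crosses the dividing line. Only points within a strip of width $2d$ centered at the dividing line need to be examined, and when the strip is scanned in order of y-coordinate, each point has to be compared with only a constant number of the points that follow it.
Time Complexity: $O(n \log n)$ overall, from the recurrence $T(n) = 2T(n/2) + O(n)$ , provided the strip points are obtained in y-sorted order without re-sorting at every level (for example by presorting or by merging the two halves).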
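
The divide-and-conquer steps can be sketched in Python as below. This is one possible arrangement, not the only one: the helper names are invented for the example, points are assumed to be `(x, y)` tuples, only the distance (not the pair itself) is returned, and the y-order is maintained by merging the two halves.

```python
from heapq import merge
from math import dist, inf


def brute_force_distance(points):
    """Smallest pairwise distance by checking all pairs, O(n^2)."""
    best = inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            best = min(best, dist(points[i], points[j]))
    return best


def _closest(px):
    """px is sorted by x. Returns (min distance, the same points sorted by y)."""
    n = len(px)
    if n <= 3:
        return brute_force_distance(px), sorted(px, key=lambda p: p[1])
    mid = n // 2
    mid_x = px[mid][0]  # x-coordinate of the dividing line
    d_left, left_y = _closest(px[:mid])
    d_right, right_y = _closest(px[mid:])
    d = min(d_left, d_right)
    by_y = list(merge(left_y, right_y, key=lambda p: p[1]))
    # Only points closer than d to the dividing line can form a better pair.
    strip = [p for p in by_y if abs(p[0] - mid_x) < d]
    for i, p in enumerate(strip):
        for q in strip[i + 1:i + 8]:  # a constant number of neighbours suffices
            if q[1] - p[1] >= d:
                break
            d = min(d, dist(p, q))
    return d, by_y


def closest_pair_distance(points):
    """Smallest Euclidean distance between any two of the given points."""
    return _closest(sorted(points))[0]
```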

4.2 Convex-Hull Problem

Problem: Given a set of points in a 2D plane, the convex hull is the smallest convex polygon that encloses all the points. Imagine stretching a rubber band around the outermost points; the shape the rubber band assumes is the convex hull.
Brute-Force Approach:
For every pair of points, check if all other points lie on the same side of the line connecting these two points. If they do, the edge connecting these two points is part of the convex hull.
- Time Complexity: $O(n^3)$ .
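
A minimal Python sketch of this brute-force test is given below. It assumes the points are distinct `(x, y)` tuples in general position (no three collinear); the helper `cross` and the function name are choices made for this example. The sign of the cross product tells on which side of the line a point lies.

```python
def cross(o, a, b):
    """Positive if o -> a -> b turns left, negative if right, 0 if collinear."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_edges_brute_force(points):
    """Return the pairs (p, q) whose connecting line has all other points on one side.
    Assumes distinct points with no three collinear."""
    edges = []
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            sides = [cross(p, q, r) for r in points if r != p and r != q]
            if all(s >= 0 for s in sides) or all(s <= 0 for s in sides):
                edges.append((p, q))
    return edges
```

There are $O(n^2)$ pairs and each is checked against the remaining points in $O(n)$ , which gives the cubic bound.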
Efficient Algorithms (Overview):
- Graham Scan: Finds the lowest point, sorts the remaining points by polar angle, and maintains a stack of points forming the convex boundary, discarding points that create "right turns". Time Complexity: $O(n \log n)$ .
- Jarvis March (Gift Wrapping): Starts with the leftmost point. The next point on the hull is the one that makes the smallest counterclockwise angle with the current direction, so that every other point lies to the left of the line from the current point to it. This repeats until the starting point is reached again. Time Complexity: $O(nh)$ , where $h$ is the number of points on the hull.
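
As a small illustration of the gift-wrapping idea, here is a hedged Python sketch of Jarvis March. The names `cross`, `squared_distance`, and `jarvis_march` are invented for the example; the hull is returned in counterclockwise order, and collinear points on a hull edge are skipped by preferring the farthest one.

```python
def cross(o, a, b):
    """Positive if o -> a -> b turns left, negative if right, 0 if collinear."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def jarvis_march(points):
    """Return the hull vertices in counterclockwise order, starting at the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    start = pts[0]  # leftmost point (lowest one on ties)
    hull = []
    current = start
    while True:
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:
            if p == current:
                continue
            turn = cross(current, candidate, p)
            farther = squared_distance(current, p) > squared_distance(current, candidate)
            # p lies to the right of current -> candidate, so wrap further.
            if turn < 0 or (turn == 0 and farther):
                candidate = p
        current = candidate
        if current == start:
            break
    return hull
```

Each of the $h$ hull vertices costs one $O(n)$ scan, matching the $O(nh)$ bound.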

4.3 Voronoi Diagrams

Definition: A Voronoi diagram partitions a plane into regions based on the distance to a specific set of points (called sites or generators). For each site, there is a corresponding region consisting of all points closer to that site than to any other.
Properties:
- The regions are convex polygons, possibly unbounded (the regions of sites on the convex hull extend to infinity).
- The boundary between two adjacent Voronoi regions is a segment of the perpendicular bisector of the line segment connecting the two sites.
- Voronoi vertices are points equidistant to three (or more) sites.
- The Delaunay Triangulation is the dual graph of a Voronoi diagram (connecting sites that share a Voronoi edge).
Applications: Nearest neighbor search, facility location (finding the largest empty circle among points), cell biology (modeling cell structures), and computer graphics.
Algorithms:
- Fortune's Algorithm: A sweep-line algorithm that constructs the Voronoi diagram in $O(n \log n)$ time.
- Divide and Conquer: Constructs the diagram recursively. Time Complexity: $O(n \log n)$ .
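
The definition of a Voronoi region can be made concrete with a brute-force sketch: a query point belongs to the region of whichever site is nearest. The Python below labels the cells of a small integer grid this way. It is only an illustration of the definition, not Fortune's algorithm or the divide-and-conquer construction, and the function names are assumptions of this example.

```python
from math import dist


def nearest_site(point, sites):
    """Index of the site closest to point (ties go to the lower index)."""
    return min(range(len(sites)), key=lambda i: dist(point, sites[i]))


def voronoi_labels(sites, width, height):
    """Label each integer grid cell (x, y) with the index of its nearest site."""
    return [
        [nearest_site((x, y), sites) for x in range(width)]
        for y in range(height)
    ]
```

With two sites the label changes exactly where the perpendicular bisector of the segment between them crosses the grid, in line with the boundary property listed above.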
