Unit 5 - Notes

INT396 8 min read

Unit 5: Association Rule Mining & Anomaly Detection: Association Rule Mining

Part 1: Association Rule Mining

Association Rule Mining (ARM) is a fundamental unsupervised learning technique used to discover interesting relations, hidden patterns, or frequent itemsets within large transactional databases. The classic rule is represented as $A \implies B$ , meaning "if item A is purchased, item B is likely to be purchased as well," where $A$ is the antecedent and $B$ is the consequent.

Market Basket Analysis

Market Basket Analysis (MBA) is the most prominent application of Association Rule Mining. Retailers use it to understand customer purchasing behavior by discovering groups of items that are frequently bought together.

Goal: Optimize store layouts, cross-sell products, and design promotional campaigns.
Example: Placing salsa next to tortilla chips, or the famous (though apocryphal) "beer and diapers" correlation.

Key Evaluation Metrics

To extract meaningful rules and filter out random co-occurrences, ARM relies on several mathematical metrics. Let $X$ and $Y$ be itemsets, and $N$ be the total number of transactions.

1. Support

Support measures how frequently an itemset appears in the dataset. It indicates the base popularity of an item or a combination of items.

Formula: $\text{Support}(X \implies Y) = \frac{\text{Transactions containing both } X \text{ and } Y}{N}$
Interpretation: A high support implies the items are frequently bought together. Algorithms use a "Minimum Support Threshold" to prune infrequent items early.

2. Confidence

Confidence measures the reliability of the inference. It represents the conditional probability that a transaction containing $X$ also contains $Y$ .

Formula: $\text{Confidence}(X \implies Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$
Interpretation: If Confidence is 0.8 (80%), it means 80% of the times a customer bought $X$ , they also bought $Y$ . It is directional ( $X \implies Y$ is not the same as $Y \implies X$ ).

3. Lift

Lift measures the strength of an association over the random co-occurrence of $X$ and $Y$ , assuming they are independent.

Formula: $\text{Lift}(X \implies Y) = \frac{\text{Confidence}(X \implies Y)}{\text{Support}(Y)} = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)}$
Interpretation:
- Lift = 1: $X$ and $Y$ are independent (no association).
- Lift > 1: Positive correlation. $X$ and $Y$ appear together more often than expected by chance. High lift indicates a strong, actionable rule.
- Lift < 1: Negative correlation. $X$ and $Y$ are substitutes; buying one makes buying the other less likely.

4. Conviction

Conviction evaluates the degree of implication of a rule. It compares the probability that $X$ appears without $Y$ if they were dependent with the actual frequency of the appearance of $X$ without $Y$ .

Formula: $\text{Conviction}(X \implies Y) = \frac{1 - \text{Support}(Y)}{1 - \text{Confidence}(X \implies Y)}$
Interpretation: A high conviction value means that the consequent is highly dependent on the antecedent. For instance, a conviction of 1.2 shows that the rule $X \implies Y$ would be incorrect 20% more often if the association between $X$ and $Y$ was purely random.

ARM Algorithms

The Apriori Algorithm

Apriori is the foundational algorithm for ARM, introduced by Agrawal and Srikant in 1994.

Core Principle (The Apriori Property): All non-empty subsets of a frequent itemset must also be frequent. (Anti-monotonicity). Conversely, if an itemset is infrequent, all its supersets will also be infrequent.
How it Works:
1. Determine Support of Individual Items: Scan the database to find the support of single items. Discard those below the minimum support threshold (Generate $L_1$ ).
2. Candidate Generation: Join the frequent 1-itemsets to create 2-itemsets ( $C_2$ ).
3. Pruning: Scan the database to count the support of $C_2$ . Keep only those meeting the threshold (Generate $L_2$ ).
4. Iteration: Repeat generation and pruning for $k$ -itemsets until no more frequent itemsets can be found.
Limitations: Highly computationally expensive for large datasets because it requires multiple scans of the database and generates a massive number of candidate itemsets.

The FP-Growth Algorithm (Frequent Pattern Growth)

FP-Growth was proposed to overcome the bottlenecks of Apriori (specifically, the expensive candidate generation phase).

Core Principle: It compresses the database into a compact tree structure (FP-Tree) and uses a divide-and-conquer strategy.
How it Works:
1. Pass 1 (Item Counting): Scan the database to calculate item frequencies. Sort items in descending order of frequency.
2. Pass 2 (Tree Construction): Read the database again. Map transactions onto a prefix-tree (FP-Tree). Shared items share the same branches, heavily compressing the data.
3. Mining: Extract frequent patterns directly from the FP-Tree by generating "conditional pattern bases" (sub-trees for each item) and recursively mining them.
Advantages over Apriori:
- Requires exactly two passes over the database.
- No candidate generation is required, making it orders of magnitude faster and significantly more memory-efficient.

Part 2: Anomaly Detection

Anomaly detection (or Outlier detection) is the identification of rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.

Anomaly vs. Novelty Detection

While often used interchangeably, they apply to different data contexts:

Anomaly (Outlier) Detection: The training data contains both normal data and outliers. The algorithm tries to fit the regions where the training data is most concentrated, ignoring the deviant observations. (Unsupervised).
Novelty Detection: The training data is clean (contains NO outliers). The algorithm learns the boundaries of the "normal" data. When new, unseen data is introduced, the algorithm flags whether it falls inside or outside this boundary. (Semi-supervised).

Anomaly Detection Algorithms

1. Isolation Forest

Isolation Forest is a tree-based ensemble algorithm built on the premise that anomalies are few and different, making them easier to "isolate" than normal data points.

How it works:
1. The algorithm creates an ensemble of Isolation Trees (iTrees).
2. For each tree, it recursively and randomly selects a feature, and then randomly selects a split value between the minimum and maximum values of that feature.
3. Because anomalies are far from normal data clusters, they require fewer random splits to be isolated into their own leaf node.
4. Normal data points, packed closely together, require many more splits to be isolated.
Scoring: The anomaly score is inversely proportional to the path length from the root node to the leaf node. A shorter average path length across all trees means a higher likelihood of being an anomaly.
Advantages: Linear time complexity with a low memory footprint. Scales exceptionally well to high-dimensional datasets.

2. Local Outlier Factor (LOF)

LOF is a density-based algorithm. It identifies outliers by measuring the local deviation of a given data point with respect to its neighbors.

How it works:
1. k-Distance: Finds the distance to the $k$ -th nearest neighbor.
2. Reachability Distance: Calculates the distance between point A and point B, but bounded by the $k$ -distance of B.
3. Local Reachability Density (LRD): The inverse of the average reachability distance of the $k$ -nearest neighbors. A low LRD means the point is in a sparse region.
4. LOF Score calculation: Compares the LRD of a point to the LRDs of its neighbors.
Interpretation of LOF Score:
- $\text{LOF} \approx 1$ : The point has a similar density to its neighbors (Normal).
- $\text{LOF} < 1$ : The point is in a denser region than its neighbors (Inlier/Normal).
- $\text{LOF} \gg 1$ : The point has a significantly lower density than its neighbors (Anomaly).
Advantages: Excellent at finding local outliers (points that are anomalous relative to their immediate neighborhood, even if they aren't extreme outliers in the global dataset).

Applications in Cybersecurity and Fraud Detection

Unsupervised anomaly detection is highly effective in domains where "bad behavior" is constantly evolving and labeled data is scarce.

Cybersecurity

Intrusion Detection Systems (IDS): Identifying unusual network traffic patterns that could indicate a zero-day exploit, DDoS attack, or unauthorized data exfiltration. Since attackers constantly change tactics, rule-based systems fail, but anomaly detection can spot the abnormal network flow.
User and Entity Behavior Analytics (UEBA): Profiling the standard behavior of users (login times, files accessed, IP addresses). If a user suddenly downloads 50GB of confidential data at 3 AM from a foreign IP, algorithms like Isolation Forest flag this as highly anomalous (potential insider threat or compromised credential).
Malware Detection: Identifying novel malware strains by analyzing anomalous API calls or abnormal system resource consumption compared to standard application behavior.

Fraud Detection

Credit Card Fraud: Systems analyze transaction location, amount, frequency, and merchant type. If a customer who usually buys groceries in New York suddenly makes a $5,000 electronics purchase in Paris, LOF or Isolation Forests will flag the transaction for review.
Insurance Fraud: Detecting anomalous patterns in healthcare or auto insurance claims. For example, a doctor billing for rare procedures at a rate 100x the national average can be flagged as an anomaly.
Anti-Money Laundering (AML): Banks use anomaly detection to find unusual "smurfing" behaviors (breaking large sums into smaller transactions) or rapid money transfers across multiple international accounts that deviate from standard retail or corporate banking behaviors.

Unit 4

Unit 6