Unit 3 - Subjective Questions
CSE274 • Practice Questions with Detailed Answers
Explain the Perceptron algorithm. What is the Perceptron Update Rule?
The Perceptron
The Perceptron is one of the simplest artificial neural network architectures, used for binary classification. It models a neuron by taking a weighted sum of inputs and applying a step function (activation function).
Mechanism:
- Input: A vector of features x = (x_1, x_2, ..., x_n).
- Weights: Each feature x_i is multiplied by a weight w_i.
- Summation: Compute the weighted sum z = w · x + b = Σ_i w_i x_i + b.
- Activation: Apply a step function to z. If z ≥ 0, output 1; otherwise, output 0 (or -1).
Perceptron Update Rule
The weights are updated iteratively to minimize classification error. For a training sample (x, y), where y is the true label and ŷ is the predicted label:

w_i := w_i + η(y − ŷ)x_i

Where:
- η (eta) is the learning rate.
- (y − ŷ) is the error term. If the prediction is correct, the term is 0, and the weights don't change.
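The update rule can be sketched in a few lines of Python (a minimal illustration, not part of the notes; the AND dataset and η = 1 are arbitrary choices for demonstration):

```python
def train_perceptron(samples, labels, eta=1.0, epochs=20):
    """Perceptron learning: w_i := w_i + eta * (y - y_hat) * x_i."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            y_hat = 1 if z >= 0 else 0          # step activation
            error = y - y_hat                   # 0 when prediction is correct
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]
            b += eta * error
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

# Learn the (linearly separable) logical AND function
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.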
Describe Logistic Regression. Why is the Sigmoid function used in this model?
Logistic Regression
Despite its name, Logistic Regression is a classification algorithm used to predict the probability that an instance belongs to a specific class. It fits a logistic 'S'-shaped curve to the data rather than a straight line.
The Sigmoid Function
Logistic regression uses the Sigmoid (or Logistic) function to map predicted values to probabilities between 0 and 1.
Formula:

σ(z) = 1 / (1 + e^(−z)), where z = w · x + b
Reasons for use:
- Bounded Output: It squashes any real-valued number into the range (0, 1), making it suitable for probability estimation.
- Differentiability: The function is differentiable, which is crucial for optimization algorithms like Gradient Descent.
- Decision Boundary: We typically classify an input as positive if σ(z) ≥ 0.5 (equivalently, z ≥ 0).
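The sigmoid itself is a one-liner (a minimal sketch using only the standard library):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
```

Note that sigmoid(0) = 0.5, which is exactly the usual decision threshold: z ≥ 0 corresponds to σ(z) ≥ 0.5.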
Derive the Cost Function (Log Loss) for Logistic Regression.
Derivation of Log Loss
In linear regression, we use Mean Squared Error (MSE), but in logistic regression, the sigmoid function makes MSE non-convex (multiple local minima). Therefore, we use the Log Loss (Cross-Entropy Loss).
For a single training example (x, y):
- If y = 1, we want the hypothesis h(x) to be close to 1. The cost is −log(h(x)).
- If y = 0, we want h(x) to be close to 0. The cost is −log(1 − h(x)).
Combining these into a single equation:

Cost(h(x), y) = −[y log(h(x)) + (1 − y) log(1 − h(x))]

Global Cost Function (J(θ)):
Averaging over m training examples:

J(θ) = −(1/m) Σ_{i=1}^{m} [y^(i) log(h(x^(i))) + (1 − y^(i)) log(1 − h(x^(i)))]
This function is convex, ensuring Gradient Descent converges to the global minimum.
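The averaged cost can be computed directly from the formula (a minimal sketch; the example probabilities are made up for illustration):

```python
import math

def log_loss(y_true, y_prob):
    """Average cross-entropy: -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y_true)
    total = sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for y, h in zip(y_true, y_prob))
    return -total / m

# Confident correct predictions give low loss; confident wrong ones are punished
good = log_loss([1, 0], [0.9, 0.1])   # ~0.105
bad  = log_loss([1, 0], [0.1, 0.9])   # ~2.303
```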
What is the 'Naïve' assumption in the Naïve Bayes Classifier? Explain with the help of Bayes' Theorem.
Bayes' Theorem
Bayes' theorem calculates the posterior probability P(y | X) from the prior probability P(y) and the likelihood P(X | y):

P(y | X) = P(X | y) · P(y) / P(X)

Where X is the feature vector (x_1, x_2, ..., x_n).
The 'Naïve' Assumption
Calculating the exact likelihood P(X | y) = P(x_1, ..., x_n | y) is computationally expensive because it requires modeling the joint distribution of all features.
The Naïve Bayes classifier assumes that all features are conditionally independent given the class label y. This means the presence of one feature does not affect the presence of another.
Mathematical Consequence:

P(X | y) = Π_i P(x_i | y), so P(y | X) ∝ P(y) · Π_i P(x_i | y)
This simplifies the calculation drastically, allowing the model to perform well even with high-dimensional data, despite the assumption rarely being true in real life.
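Under the independence assumption, scoring a class is just a product of per-feature likelihoods (a minimal sketch; the priors and likelihood values below are hypothetical numbers, not from any dataset):

```python
from math import prod  # Python 3.8+

def naive_bayes_score(prior, likelihoods):
    """Unnormalized posterior: P(y) * product of P(x_i | y) over features."""
    return prior * prod(likelihoods)

# Hypothetical two-feature example: pick the class with the larger score
score_spam = naive_bayes_score(0.4, [0.8, 0.6])   # 0.4 * 0.8 * 0.6 = 0.192
score_ham  = naive_bayes_score(0.6, [0.1, 0.3])   # 0.6 * 0.1 * 0.3 = 0.018
```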
Define Confusion Matrix and explain the terms TP, TN, FP, and FN.
Confusion Matrix
A Confusion Matrix is a tabular layout used to evaluate the performance of a classification model. It compares the predicted labels against the actual labels.
Terminology
- True Positive (TP):
- Definition: The model correctly predicted the positive class.
- Example: Identifying a spam email correctly as spam.
- True Negative (TN):
- Definition: The model correctly predicted the negative class.
- Example: Identifying a normal email correctly as not spam.
- False Positive (FP) (Type I Error):
- Definition: The model incorrectly predicted the positive class (it was actually negative).
- Example: Classifying a normal email as spam.
- False Negative (FN) (Type II Error):
- Definition: The model incorrectly predicted the negative class (it was actually positive).
- Example: Classifying a spam email as normal.
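The four counts can be tallied with a single pass over the labels (a minimal sketch, reusing the spam-filter framing from the examples above):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, FN by comparing predictions against actual labels."""
    tp = tn = fp = fn = 0
    for actual, pred in zip(y_true, y_pred):
        if pred == positive:
            if actual == positive:
                tp += 1   # spam correctly flagged as spam
            else:
                fp += 1   # normal mail flagged as spam (Type I error)
        else:
            if actual == positive:
                fn += 1   # spam that slipped through (Type II error)
            else:
                tn += 1   # normal mail correctly left alone
    return tp, tn, fp, fn
```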
Differentiate between Precision and Recall. When is F1-Score used?
Precision vs. Recall
Precision:
- Definition: Out of all the instances predicted as positive, how many were actually positive?
- Formula: Precision = TP / (TP + FP)
- Focus: Useful when the cost of False Positives is high (e.g., email spam detection).
Recall (Sensitivity/TPR):
- Definition: Out of all the actual positive instances, how many did the model predict correctly?
- Formula: Recall = TP / (TP + FN)
- Focus: Useful when the cost of False Negatives is high (e.g., cancer detection).
F1-Score
The F1-Score is the harmonic mean of Precision and Recall: F1 = 2 · (Precision · Recall) / (Precision + Recall). It is used when:
- There is an uneven class distribution (imbalanced dataset).
- We seek a balance between Precision and Recall.
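All three metrics follow directly from the confusion-matrix counts (a minimal sketch; assumes the counts are already available):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The harmonic mean is dominated by the smaller of the two values, so a model cannot hide a poor Recall behind a high Precision (or vice versa).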
Explain the K-Nearest Neighbors (K-NN) algorithm. Why is it called a 'Lazy Learner'?
K-Nearest Neighbors (K-NN)
K-NN is a non-parametric, distance-based supervised learning algorithm used for classification and regression.
Steps:
- Choose the number of neighbors, K.
- For a new data point, calculate the distance (e.g., Euclidean) between this point and all training points.
- Select the K points with the smallest distances.
- Classification: Assign the class that is most frequent (mode) among the neighbors.
Lazy Learner
K-NN is called a Lazy Learner because:
- No Training Phase: It does not learn a discriminative function f(x) from the training data during a training phase.
- Storage: It simply stores the training dataset.
- Delayed Computation: All computations are deferred until the time of prediction (inference). This makes training fast (zero time) but prediction slow compared to Eager Learners like Logistic Regression.
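The steps above can be sketched directly (a minimal illustration; the two-cluster toy dataset is made up for demonstration). Note that "training" is just keeping the data around:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """'Lazy' prediction: all distance computation happens at inference time."""
    neighbours = sorted((math.dist(x, query), label)
                        for x, label in zip(train_X, train_y))
    top_k = [label for _, label in neighbours[:k]]
    return Counter(top_k).most_common(1)[0][0]   # majority vote

# Two small clusters; no training step -- the stored data itself is the "model"
X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
y = ['a', 'a', 'a', 'b', 'b', 'b']
```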
Discuss the effect of the value of 'K' in the K-NN algorithm regarding overfitting and underfitting.
The choice of K significantly impacts the bias-variance trade-off in K-NN.
1. Small K (e.g., K=1 or K=2):
- Behavior: The model becomes very sensitive to local patterns and noise in the data.
- Decision Boundary: The boundary is jagged and complex.
- Outcome: High Variance, Low Bias. This leads to Overfitting.
2. Large K:
- Behavior: The model considers a large number of neighbors, effectively averaging out the noise but potentially missing local patterns.
- Decision Boundary: The boundary becomes smoother and more linear.
- Outcome: Low Variance, High Bias. This leads to Underfitting.
Selection: The optimal K is usually selected via techniques like Cross-Validation.
Explain the concept of Entropy and Information Gain in Decision Trees.
Decision Trees use metrics to determine the best attribute to split the data at each node. The ID3 algorithm uses Entropy and Information Gain.
Entropy
Entropy measures the impurity or randomness in a dataset S. If a dataset contains only one class, entropy is 0. If the classes are perfectly balanced, entropy is 1 (for binary classification).
Formula:

Entropy(S) = −Σ_i p_i log₂(p_i)

Where p_i is the proportion (probability) of class i in dataset S.
Information Gain (IG)
Information Gain measures the reduction in entropy achieved by splitting the dataset S on a specific attribute A. The tree selects the attribute with the highest IG.
Formula:

IG(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)

Where:
- Entropy(S) is the entropy of the parent set.
- S_v is the subset of S where attribute A has value v.
- We subtract the weighted average entropy of the children from the parent's entropy.
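Both formulas translate directly into code (a minimal sketch; the toy labels in the test below are chosen so that one attribute separates the classes perfectly and another not at all):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Parent entropy minus the weighted entropy of each child subset."""
    parent = entropy(labels)
    total = len(labels)
    weighted = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        weighted += (len(subset) / total) * entropy(subset)
    return parent - weighted
```

A perfectly separating attribute recovers all of the parent's entropy (IG = 1 bit for a balanced binary parent); an uninformative attribute yields IG = 0.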
What is the difference between Pre-pruning and Post-pruning in Decision Trees?
Pruning is a technique to reduce the size of decision trees to prevent overfitting.
| Feature | Pre-pruning (Early Stopping) | Post-pruning |
|---|---|---|
| Definition | Halting the tree construction early before it creates a full tree. | Letting the tree grow to its full depth and then removing insignificant nodes. |
| Mechanism | Stops splitting based on constraints (e.g., max depth, min samples per leaf, min impurity decrease). | Evaluates complete branches. If removing a branch improves/maintains validation accuracy, replace it with a leaf. |
| Pros | Faster training time; simple to implement. | Usually produces better trees; less likely to 'under-prune'. |
| Cons | Can be too aggressive (Horizon effect) - might stop before a good split occurs later. | Computationally expensive as it requires growing the full tree first. |
Explain the core intuition behind Support Vector Machines (SVM). What are Support Vectors?
Support Vector Machine (SVM)
SVM is a supervised learning model used for classification and regression. The core intuition is to find a hyperplane in an N-dimensional space that distinctly classifies the data points.
Goal:
Among many possible hyperplanes that separate classes, SVM chooses the one that maximizes the Margin.
Margin:
The distance between the hyperplane and the nearest data point from either class. Maximizing this margin improves the model's generalization ability.
Support Vectors
- Support vectors are the data points that lie closest to the decision boundary (hyperplane).
- They are the most difficult points to classify.
- These points effectively "support" or define the orientation and position of the hyperplane.
- If we remove other data points, the hyperplane does not change. If we move a support vector, the hyperplane shifts.
What is the 'Kernel Trick' in SVM? Why is it useful?
The Problem
Standard SVM seeks a linear decision boundary. However, many real-world datasets are not linearly separable (e.g., concentric circles of data).
The Kernel Trick
The Kernel Trick is a mathematical technique that allows SVM to classify non-linear data without explicitly transforming the data into higher dimensions.
It computes the dot product of data points in a high-dimensional feature space without actually calculating the coordinates in that space. This avoids the high computational cost (Curse of Dimensionality).
Common Kernels:
- Linear Kernel: For linearly separable data.
- Polynomial Kernel: Maps to polynomial dimensions.
- RBF (Radial Basis Function): Maps to infinite-dimensional space.
Utility:
It allows algorithms like SVM to fit complex, non-linear boundaries efficiently.
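The RBF kernel illustrates the trick concretely (a minimal sketch; gamma = 1.0 is an arbitrary default):

```python
import math

def rbf_kernel(x1, x2, gamma=1.0):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2).

    Measures similarity in an implicit infinite-dimensional feature space
    while only ever touching coordinates in the original input space.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)
```

Identical points score 1.0, and the similarity decays smoothly toward 0 as the points move apart; the high-dimensional mapping is never materialized.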
Define ROC Curve and ROC-AUC. How do they help in model evaluation?
ROC Curve (Receiver Operating Characteristic)
The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
- X-axis: False Positive Rate (FPR = FP / (FP + TN))
- Y-axis: True Positive Rate (TPR = TP / (TP + FN))
A curve closer to the top-left corner indicates a better model (High TPR, Low FPR).
ROC-AUC (Area Under the Curve)
- Definition: AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes.
- Range: 0 to 1.
- AUC = 0.5: Random guessing (no discrimination).
- AUC = 1: Perfect classifier.
- AUC = 0: Reciprocating the classes (predicting 0 as 1 and vice versa).
It helps evaluate model performance independent of the specific threshold chosen for classification.
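Each threshold produces one (FPR, TPR) point; sweeping the threshold from high to low traces the curve (a minimal sketch; the scores below are made-up classifier outputs):

```python
def roc_point(y_true, scores, threshold):
    """One (FPR, TPR) point: predict positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Toy scores: sweeping the threshold from 1 down to 0 traces the ROC curve
y_true = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.1]
```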
Compare the Precision-Recall (PR) Curve with the ROC Curve. When should you use PR Curve?
Comparison
Axes:
- ROC: Plots TPR (Recall) vs. FPR.
- PR Curve: Plots Precision vs. Recall (TPR).
Baseline:
- ROC: The baseline (random guess) is a diagonal line from (0,0) to (1,1).
- PR: The baseline depends on the class distribution (the ratio of positives).
When to use PR Curve (AUC-PR)?
The ROC curve can be misleading when the dataset is heavily imbalanced (e.g., Fraud Detection where 99% are valid, 1% fraud).
- ROC behavior in imbalance: The False Positive Rate uses True Negatives in the denominator (FPR = FP / (FP + TN)). If TN is huge (many negatives), FPR stays low even if FP increases, making the model look better than it is.
- PR Curve behavior: Precision (TP / (TP + FP)) does not use True Negatives. It focuses strictly on how well the minority class is detected.
Conclusion: Use PR Curve when the positive class is rare or more important than the negative class.
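The effect is easy to see with made-up counts for an imbalanced test set (hypothetical numbers, chosen only to illustrate the argument above):

```python
# Hypothetical imbalanced test set: 10 actual positives, 990 actual negatives
tp, fn = 8, 2      # the classifier finds 8 of the 10 positives
fp, tn = 40, 950   # but also raises 40 false alarms

fpr = fp / (fp + tn)          # ~0.04: looks tiny because TN dominates
precision = tp / (tp + fp)    # ~0.17: most positive calls are actually wrong
```

An ROC reader sees a 4% false-positive rate and is reassured; a PR reader sees that only about 1 in 6 positive predictions is correct.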
What is the problem of Zero Frequency in Naïve Bayes and how does Laplace Smoothing solve it?
Zero Frequency Problem
In Naïve Bayes, if a specific categorical feature value occurs in the test set but was never seen in the training set for a particular class, the probability estimate P(x_i | y) becomes zero.
Since Naïve Bayes multiplies probabilities (P(y | X) ∝ P(y) · Π_i P(x_i | y)), a single zero turns the entire posterior probability to zero, effectively wiping out all the information provided by the other features.
Laplace Smoothing (Additive Smoothing)
To solve this, we add a small smoothing parameter α (usually 1) to the count of every feature value.
Formula:

P(x_i | y) = (count(x_i, y) + α) / (count(y) + α · K)

Where:
- α: Smoothing parameter.
- K: Number of distinct values feature x_i can take.
This ensures no probability is ever exactly zero, allowing the model to make predictions based on other available features.
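The smoothed estimate is one line of arithmetic (a minimal sketch; the counts in the example are hypothetical):

```python
def smoothed_likelihood(count_xi_y, count_y, n_values, alpha=1):
    """P(x_i | y) = (count(x_i, y) + alpha) / (count(y) + alpha * K)."""
    return (count_xi_y + alpha) / (count_y + alpha * n_values)

# A value never seen with this class still gets a small nonzero probability:
# 20 training samples of class y, feature with K = 3 possible values
p_unseen = smoothed_likelihood(0, 20, 3)   # (0 + 1) / (20 + 3) = 1/23
```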
Compare Parametric and Non-parametric models with examples from Unit 3.
Parametric Models
- Definition: These models make strong assumptions about the functional form of the data distribution. They have a fixed number of parameters, regardless of the training data size.
- Pros: Simpler, faster training and inference, less data required.
- Cons: Limited complexity; cannot learn complex patterns if the assumption is wrong (high bias).
- Examples:
- Logistic Regression: Assumes a linear decision boundary separated by a sigmoid curve.
- Linear Perceptron.
Non-parametric Models
- Definition: These models do not make strong assumptions about the mapping function. The number of parameters effectively grows with the amount of training data.
- Pros: Flexible; can fit any functional form (low bias).
- Cons: Requires more data; slower training/inference; higher risk of overfitting.
- Examples:
- K-Nearest Neighbors (KNN): Stores all data points.
- Decision Trees: Can grow very deep/complex.
- SVM with RBF Kernel: Infinite dimensional mapping.
Distinguish between Generative and Discriminative models using Naïve Bayes and Logistic Regression as examples.
Discriminative Models (e.g., Logistic Regression)
- Goal: Model the decision boundary directly.
- Probability: Learns the conditional probability P(y | x) directly from the data.
- Focus: It focuses on finding the boundary that separates the classes.
- Analogy: To distinguish between a cat and a dog, it looks for specific differences (e.g., whiskers, ear shape).
Generative Models (e.g., Naïve Bayes)
- Goal: Model the distribution of individual classes.
- Probability: Learns the joint probability P(x, y) (usually by learning P(x | y) and P(y)) and then uses Bayes' rule to find P(y | x).
- Focus: It tries to understand how the data is generated for each class.
- Analogy: It builds a model of what a cat looks like and what a dog looks like. When a new animal comes, it matches it against both models to see which one fits best.
Explain Gini Impurity and how it differs from Entropy.
Gini Impurity
Gini Impurity is an alternative metric to Entropy used by the CART (Classification and Regression Trees) algorithm for splitting nodes.
It measures the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the dataset.
Formula:

Gini(S) = 1 − Σ_i p_i²
Difference from Entropy
- Computation: Entropy involves logarithmic calculations (log₂ p_i), which are computationally more expensive than the squaring operations used in Gini Impurity.
- Range:
- Entropy range: 0 to 1 (for binary classification).
- Gini range: 0 to 0.5 (for binary classification).
- Result: In practice, they usually result in very similar trees. However, Gini tends to isolate the most frequent class in its own branch of the tree, while Entropy tends to produce slightly more balanced trees.
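Placing the two metrics side by side makes the comparison concrete (a minimal sketch; any class labels can be used):

```python
import math
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2): no logarithms required."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)), for comparison."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())
```

A pure node scores 0 on both; a balanced binary node hits each metric's maximum (Gini 0.5, Entropy 1.0), matching the ranges above.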
Briefly explain the concept of Supervised Learning and list the steps involved.
Supervised Learning
Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset. The model learns a mapping function from input variables X to output variables Y (Y = f(X)).
Goal: Approximate the mapping function so well that when you have new input data x, you can predict the output variable y for that data.
Steps involved:
- Data Collection: Gather a dataset containing inputs (features) and corresponding correct outputs (labels).
- Data Preprocessing: Clean, normalize, and split data into Training and Testing sets.
- Model Selection: Choose an appropriate algorithm (e.g., SVM, Decision Tree) based on the problem type.
- Training: The algorithm processes the training data to minimize a loss function (learning the weights/parameters).
- Evaluation: Test the model on unseen data using metrics like Accuracy, Precision, Recall.
- Tuning: Adjust hyperparameters to improve performance.
What are the advantages and disadvantages of Decision Trees?
Advantages
- Interpretability: They are easy to understand and visualize (white-box model). The rules can be explained to non-technical stakeholders.
- No Normalization: They require very little data preprocessing (scaling or normalization of features is not needed).
- Non-linear relationships: Can handle both numerical and categorical data and model non-linear boundaries effectively.
Disadvantages
- Overfitting: Trees can create overly complex models that do not generalize well (solved by pruning).
- Instability: Small variations in the data can result in a completely different tree being generated (High Variance).
- Bias: Decision tree learners can create biased trees if some classes dominate (solved by balancing the dataset).