Unit 3 - Subjective Questions
INT428 • Practice Questions with Detailed Answers
Define Supervised Learning and distinguish between its two main sub-categories: Classification and Regression.
Supervised Learning is a type of machine learning where the algorithm is trained on a labeled dataset. This means that for every input data point in the training set, the correct output (target) is known. The goal is to learn a mapping function from input variables (X) to output variables (Y).
Sub-categories:
- Classification:
- Goal: To predict a discrete class label or category.
- Output: Categorical (e.g., 'Spam' or 'Not Spam', 'Cat' or 'Dog').
- Example: Medical diagnosis (Disease Present vs. Absent).
- Regression:
- Goal: To predict a continuous numerical value.
- Output: Quantitative (e.g., price, temperature, height).
- Example: Predicting house prices based on square footage.
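As a quick illustration (not part of the original notes), the sketch below contrasts the two tasks with scikit-learn; the data values and model choices are assumptions made purely for demonstration.

```python
# Minimal sketch: Classification vs. Regression on toy data (all values are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a discrete label (0 = Not Spam, 1 = Spam) from one feature.
X_cls = np.array([[1], [2], [3], [8], [9], [10]])   # e.g., count of suspicious words
y_cls = np.array([0, 0, 0, 1, 1, 1])                # known labels (supervised)
clf = LogisticRegression().fit(X_cls, y_cls)
print("Predicted class:", clf.predict([[7]]))        # -> a category, e.g., [1]

# Regression: predict a continuous value (house price) from square footage.
X_reg = np.array([[500], [1000], [1500], [2000]])
y_reg = np.array([100000, 180000, 260000, 340000])   # known prices (supervised)
reg = LinearRegression().fit(X_reg, y_reg)
print("Predicted price:", reg.predict([[1200]]))      # -> a number on a continuous scale
```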
Explain the concept of Unsupervised Learning. How does it differ from Supervised Learning? Provide two real-world applications.
Unsupervised Learning involves training a machine learning algorithm on data that is neither classified nor labeled. The algorithm acts on the data without guidance, looking for hidden structures, patterns, or groupings within the input.
Difference from Supervised Learning:
- Labels: Supervised learning uses labeled data (Input-Output pairs), whereas unsupervised learning uses unlabeled data (Input only).
- Goal: Supervised learning aims to predict outcomes; Unsupervised learning aims to discover structure or distribution in the data.
Real-World Applications:
- Customer Segmentation: Grouping customers based on purchasing behavior for targeted marketing (Clustering).
- Anomaly Detection: Identifying unusual data points in network traffic to detect security breaches.
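A minimal sketch of the customer-segmentation use case, assuming toy spend/visit numbers and k-means as the clustering method (none of which come from the notes):

```python
# Clustering unlabeled customer data with k-means (toy numbers, assumed for illustration).
import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer: [annual spend, visits per month] -- no labels are provided.
X = np.array([[200, 2], [220, 3], [250, 2],      # low-spend group
              [900, 10], [950, 12], [1000, 11]]) # high-spend group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # the algorithm discovers the grouping itself
print("Cluster assignments:", labels)
print("Cluster centres:", kmeans.cluster_centers_)
```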
Describe Reinforcement Learning (RL). Define the key components: Agent, Environment, State, Action, and Reward.
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment and receiving feedback in the form of rewards or penalties.
Key Components:
- Agent: The learner or decision-maker (e.g., a robot or a software bot).
- Environment: The world through which the agent moves or interacts.
- State (S): The current situation or configuration of the agent within the environment.
- Action (A): The set of all possible moves the agent can make.
- Reward (R): Immediate feedback signal (positive or negative) received after taking an action, guiding the agent toward the optimal policy.
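The toy loop below is an illustrative sketch (not from the notes) that maps the five components onto a one-dimensional grid world; the environment, reward values, and learning constants are all assumptions.

```python
# Toy RL sketch: an agent on positions 0..4 must reach the goal at position 4.
import random

n_states, actions = 5, [-1, +1]        # State: position; Action: step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}  # action-value table
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(200):
    state = 0                                            # the Environment resets the Agent
    while state != 4:
        # Agent picks an Action (epsilon-greedy over its current value estimates)
        action = random.choice(actions) if random.random() < epsilon \
                 else max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)   # Environment transition
        reward = 1.0 if next_state == 4 else -0.1                # Reward signal
        # Q-learning update nudges the Agent toward the optimal policy
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print("Learned values at state 0:", {a: round(Q[(0, a)], 2) for a in actions})
```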
State Bayes' Theorem mathematically and explain its significance in machine learning.
Bayes' Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It is the mathematical foundation for probabilistic reasoning and algorithms like Naive Bayes.
Mathematical Formula:
P(H|E) = [P(E|H) × P(H)] / P(E)
Where:
- P(H|E) is the Posterior probability: Probability of hypothesis H given observed evidence E.
- P(E|H) is the Likelihood: Probability of observing evidence E given that hypothesis H is true.
- P(H) is the Prior probability: Probability of hypothesis H being true before observing evidence.
- P(E) is the Marginal likelihood: Total probability of observing the evidence E.
Significance: It allows ML models to update predictions as new data becomes available and handles uncertainty effectively.
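A small worked example (the prevalence and test accuracies are assumed numbers, chosen only to exercise the formula):

```python
# Worked Bayes' Theorem example: 1% disease prevalence, 95% sensitive test, 10% false-positive rate (assumed).
p_disease = 0.01                      # P(H): prior
p_pos_given_disease = 0.95            # P(E|H): likelihood
p_pos_given_healthy = 0.10            # P(E|not H)

# P(E): marginal likelihood via the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(H|E): posterior -- probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(Disease | Positive test) = {p_disease_given_pos:.3f}")   # roughly 0.088
```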
What are Bayesian Networks? Explain their structure and utility in probabilistic reasoning.
A Bayesian Network (or Belief Network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a Directed Acyclic Graph (DAG).
Structure:
- Nodes: Represent random variables (discrete or continuous).
- Edges: Directed arrows represent causal relationships or conditional dependencies (from Parent to Child).
- Conditional Probability Tables (CPTs): Each node has a table quantifying the effect of the parents on the node.
Utility:
- They are used for Probabilistic Reasoning to predict the likelihood of causes given observed effects (Diagnostic reasoning) or effects given causes (Causal reasoning).
- They handle incomplete data effectively by marginalizing over unknown variables.
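A sketch of diagnostic reasoning on the classic Rain/Sprinkler/WetGrass network; the CPT numbers below are illustrative assumptions, not values from the notes, and the query is answered by brute-force enumeration rather than a library:

```python
# Tiny Bayesian network: Rain -> Sprinkler, and (Sprinkler, Rain) -> WetGrass.
from itertools import product

P_rain = {True: 0.2, False: 0.8}                              # CPT for Rain
P_sprinkler = {True: {True: 0.01, False: 0.4},                # P(Sprinkler | Rain)
               False: {True: 0.99, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.9,              # P(WetGrass=True | Sprinkler, Rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """Full joint probability via the chain rule over the DAG."""
    p_w = P_wet[(sprinkler, rain)]
    return P_rain[rain] * P_sprinkler[sprinkler][rain] * (p_w if wet else 1 - p_w)

# Diagnostic query: P(Rain = True | WetGrass = True), marginalizing over Sprinkler.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(f"P(Rain | WetGrass) = {num / den:.3f}")
```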
Explain the role of Linear Algebra in Machine Learning, specifically focusing on Vectors and Matrices.
Linear Algebra is the backbone of Machine Learning, providing the language to define and manipulate data.
- Vectors:
- Concept: An ordered list of numbers. In ML, a single data point (e.g., a customer's age, income, and debt) is represented as a feature vector: x = [age, income, debt].
- Usage: Representing input features, weights in a linear model, or biases.
- Matrices:
- Concept: A 2D array of numbers. An entire dataset is often represented as a matrix X of size n × m (where n is the number of samples and m is the number of features).
- Usage: Operations like matrix multiplication allow algorithms to process large batches of data simultaneously (vectorization). For example, in Neural Networks, layers are transformed using Z = WX + b, where W is a weight matrix and b is a bias vector.
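A minimal numpy sketch of the vector/matrix view described above; the numbers and the shape of the weight matrix are illustrative assumptions.

```python
# Vectors, matrices, and a vectorized linear transform (toy values).
import numpy as np

# A single data point as a feature vector: [age, income, debt]
x = np.array([35, 52000, 4000])

# A whole dataset as an (n x m) matrix: n = 3 samples, m = 3 features
X = np.array([[35, 52000, 4000],
              [42, 61000, 1500],
              [29, 39000, 8000]])

# A linear layer applied to every sample at once (vectorization):
W = np.array([[0.01, 0.00002, -0.0001],      # weight matrix (2 outputs x 3 features)
              [0.02, 0.00001,  0.0003]])
b = np.array([0.5, -0.2])                    # bias vector
Z = X @ W.T + b                              # shape (3, 2): the transform for all rows at once
print(Z.shape)
```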
What is Feature Engineering? Discuss two common techniques: Normalization and One-Hot Encoding.
Feature Engineering is the process of using domain knowledge to extract features (attributes) from raw data that make machine learning algorithms work better.
1. Normalization (Scaling):
- Purpose: To bring all numerical features to a similar scale (usually between 0 and 1) so that variables with large ranges (e.g., Salary) don't dominate variables with small ranges (e.g., Age).
- Formula (Min-Max): x' = (x − min(x)) / (max(x) − min(x))
2. One-Hot Encoding:
- Purpose: To convert categorical data (text labels) into a numerical format that ML algorithms can process.
- Mechanism: It creates a new binary column for each unique category. For example, a 'Color' feature with values ['Red', 'Blue'] becomes two columns: 'Is_Red' (1 or 0) and 'Is_Blue' (1 or 0).
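A short sketch of both techniques using plain numpy; the salary and colour values are assumptions chosen only to show the mechanics.

```python
# Min-Max normalization and one-hot encoding on toy data.
import numpy as np

# Min-Max normalization: scale a feature into the [0, 1] range
salary = np.array([30000., 50000., 80000., 120000.])
salary_scaled = (salary - salary.min()) / (salary.max() - salary.min())
print(salary_scaled)          # approximately [0.   0.22 0.56 1.  ]

# One-hot encoding: one binary column per unique category
colors = np.array(['Red', 'Blue', 'Red', 'Blue'])
categories = np.unique(colors)                       # ['Blue', 'Red']
one_hot = (colors[:, None] == categories).astype(int)
print(categories)
print(one_hot)                # rows like [0 1] for 'Red', [1 0] for 'Blue'
```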
Differentiate between Precision and Recall in the context of model evaluation. When should each be prioritized?
Precision and Recall are metrics used to evaluate classification models, derived from the Confusion Matrix.
Precision:
- Definition: The ratio of correctly predicted positive observations to the total predicted positives.
- Formula: Precision = TP / (TP + FP)
- When to prioritize: When the cost of a False Positive is high. (e.g., In Email Spam detection, we want to avoid classifying a legitimate email as spam).
Recall (Sensitivity):
- Definition: The ratio of correctly predicted positive observations to all observations in the actual positive class.
- Formula: Recall = TP / (TP + FN)
- When to prioritize: When the cost of a False Negative is high. (e.g., In Cancer diagnosis, it is critical not to miss a patient who actually has the disease).
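Both metrics follow directly from confusion-matrix counts; the counts below are assumed for illustration.

```python
# Computing Precision and Recall from assumed confusion-matrix counts.
TP, FP, FN, TN = 40, 10, 20, 130     # illustrative counts

precision = TP / (TP + FP)           # 40 / 50 = 0.80 -> of predicted positives, how many were right
recall    = TP / (TP + FN)           # 40 / 60 ~ 0.67 -> of actual positives, how many were found
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```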
Explain K-Fold Cross-Validation and why it is preferred over a simple Train-Test split.
K-Fold Cross-Validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
Process:
- Shuffle the dataset randomly.
- Split the dataset into k groups (folds).
- For each unique group:
- Take the group as the hold-out (test) data set.
- Take the remaining k − 1 groups as the training data set.
- Fit a model on the training set and evaluate it on the test set.
- Summarize the skill of the model using the average of the model evaluation scores.
Why Preferred over Simple Train-Test Split:
- Reduces Variance: Averaging over k folds gives a more stable, less biased estimate of model performance because every data point gets to be in the test set exactly once.
- Utilizes Data: It allows the use of more data for training, which is crucial when the dataset is small.
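A minimal scikit-learn sketch of the procedure, assuming synthetic data, a logistic-regression model, and k = 5:

```python
# 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)       # 5 folds: each sample is tested exactly once
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", scores.mean().round(3))
```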
Explain the Bias-Variance Tradeoff in Machine Learning.
The Bias-Variance Tradeoff is the conflict in trying to simultaneously minimize two sources of error, bias and variance, that prevent supervised learning algorithms from generalizing beyond their training set.
- Bias: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (Underfitting).
- Variance: Error due to excessive complexity in the learning algorithm. The model becomes highly sensitive to small fluctuations (noise) in the training data, so it ends up fitting the random noise rather than the underlying pattern (Overfitting).
The Tradeoff:
- Increasing model complexity generally increases variance and decreases bias.
- Decreasing model complexity increases bias and decreases variance.
- Goal: To find the 'sweet spot' (optimal complexity) where the total error (Bias + Variance) is minimized.
Define Overfitting and Underfitting. How can they be detected?
Overfitting:
- Definition: When a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern. It performs excellently on training data but poorly on unseen test data.
- Detection: High Training Accuracy, Low Test/Validation Accuracy.
Underfitting:
- Definition: When a model is too simple to capture the underlying structure of the data. It cannot learn the relationships even in the training data.
- Detection: Low Training Accuracy and Low Test/Validation Accuracy.
Visual Detection: By plotting the learning curves (Error vs. Training Size or Error vs. Epochs), a large gap between training and validation error suggests overfitting, while high error for both suggests underfitting.
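The sketch below demonstrates this detection pattern on synthetic 1-D data; the noisy sine curve, the split, and the polynomial degrees are all assumptions chosen to make the train/test gap visible.

```python
# Detecting under/overfitting by comparing train vs. test error for models of different complexity.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)   # noisy sine curve
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]     # simple train/test split

for degree in (1, 3, 10):                       # too simple, about right, too complex
    coeffs = np.polyfit(x_tr, y_tr, degree)     # fit a polynomial of the given degree
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# High train AND test error       -> underfitting (degree 1)
# Low train but higher test error -> overfitting (degree 10)
```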
Compare Supervised, Unsupervised, and Reinforcement Learning based on Data Type, Feedback, and Goal.
| Feature | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Data Type | Labeled data (Input-Output pairs). | Unlabeled data (Input only). | States and Actions (No predefined dataset initially). |
| Feedback | Direct feedback (Correct/Incorrect) provided by the supervisor. | No external feedback; internal evaluation of structure. | Delayed feedback in the form of Reward or Penalty. |
| Goal | Predict outputs or classify new data. | Find hidden patterns, structures, or clusters. | Learn a sequence of actions (Policy) to maximize total reward. |
| Example | Spam Filter, House Price Prediction. | Customer Segmentation, Market Basket Analysis. | Chess playing engine, Robot navigation. |
Explain the role of Statistics in Machine Learning, specifically focusing on Descriptive Statistics (Mean, Median, Standard Deviation).
Statistics provides the tools to collect, analyze, interpret, and present data, serving as a prerequisite for applied Machine Learning.
Descriptive Statistics:
- Mean (Average):
- Sum of values divided by the count.
- ML Role: Used for data imputation (filling missing values) and understanding the central tendency of features.
- Median:
- The middle value separating the higher half from the lower half of data.
- ML Role: More robust to outliers than the mean. Preferred for imputation in skewed distributions.
- Standard Deviation (σ):
- A measure of the amount of variation or dispersion of a set of values.
- ML Role: Critical for Feature Scaling (Z-score normalization). It helps in understanding the spread of data; a low σ means data is clustered around the mean, a high σ means data is spread out.
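A short numpy sketch of these three statistics on assumed salary data containing one outlier, which also shows how mean and standard deviation feed Z-score normalization:

```python
# Descriptive statistics with numpy (illustrative salary data, in thousands; 250 is an outlier).
import numpy as np

salaries = np.array([30, 32, 35, 38, 40, 250])

print("Mean:  ", np.mean(salaries))      # pulled upward by the outlier
print("Median:", np.median(salaries))    # robust to the outlier
print("Std:   ", np.std(salaries))       # spread of the values around the mean

# Z-score normalization uses the mean and standard deviation together:
z = (salaries - np.mean(salaries)) / np.std(salaries)
print("Z-scores:", z.round(2))
```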
What is a Confusion Matrix? Draw a layout for a binary classification problem and explain True Positives, False Positives, True Negatives, and False Negatives.
A Confusion Matrix is a table used to evaluate the performance of a classification model.
Layout:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |
Definitions:
- True Positive (TP): The model correctly predicted the positive class (e.g., Predicted Spam, Actually Spam).
- True Negative (TN): The model correctly predicted the negative class (e.g., Predicted Not-Spam, Actually Not-Spam).
- False Positive (FP): The model incorrectly predicted positive (Type I Error) (e.g., Predicted Spam, Actually Not-Spam).
- False Negative (FN): The model incorrectly predicted negative (Type II Error) (e.g., Predicted Not-Spam, Actually Spam).
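A quick sketch of building the matrix with scikit-learn on assumed labels; note that `confusion_matrix` orders classes ascending, so its layout differs from the table above (row 0 is the actual negative class).

```python
# Confusion matrix on assumed labels (1 = Spam, 0 = Not Spam).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# scikit-learn layout for binary 0/1 labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```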
How does Probabilistic Reasoning handle uncertainty in AI systems? Explain the concept of Marginalization.
Probabilistic Reasoning uses probability theory to manage uncertainty. Unlike logic-based systems that require absolute truths (True/False), probabilistic systems assign a degree of belief (0 to 1) to statements.
Handling Uncertainty:
- It models the real world, where sensors are noisy, rules are incomplete, and knowledge is partial.
- It allows the system to update beliefs based on new evidence using Bayes' rule.
Marginalization:
- Concept: The process of summing out variables from a joint probability distribution to determine the probability of a subset of variables.
- Formula: P(X) = Σ_Y P(X, Y)
- Use: If we have a joint distribution of Weather and Traffic, but we only care about the probability of Traffic, we "marginalize" (sum over) all possible states of Weather.
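A tiny numeric sketch of exactly that Weather/Traffic example; the joint probabilities are assumed values that sum to 1.

```python
# Marginalization: summing a joint distribution over the variable we do not care about.
import numpy as np

# Assumed joint distribution P(Weather, Traffic); rows = Weather (Sunny, Rainy), columns = Traffic (Light, Heavy).
joint = np.array([[0.4, 0.1],     # Sunny
                  [0.1, 0.4]])    # Rainy

p_traffic = joint.sum(axis=0)     # sum over Weather  -> P(Traffic)
p_weather = joint.sum(axis=1)     # sum over Traffic  -> P(Weather)
print("P(Traffic = Light, Heavy):", p_traffic)   # [0.5 0.5]
print("P(Weather = Sunny, Rainy):", p_weather)   # [0.5 0.5]
```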
Discuss the real-world application of Supervised Learning in Email Spam Detection and Unsupervised Learning in Recommender Systems.
1. Supervised Learning in Email Spam Detection:
- Input: Email content (subject line, body text, sender address).
- Label: 'Spam' or 'Not Spam' (Binary Classification).
- Process: The model is trained on a dataset of thousands of emails already flagged by users. It learns that keywords like "Lottery," "Free," or "Winner" are highly correlated with the 'Spam' label. When a new email arrives, the model predicts the label based on these learned patterns.
2. Unsupervised Learning in Recommender Systems:
- Input: User watch history or purchase logs.
- Process: The algorithm uses Clustering or Association Rule Mining to find patterns without explicit labels. For example, it might identify that users who watch 'Sci-Fi Movie A' also tend to watch 'Sci-Fi Movie B'.
- Outcome: It suggests Movie B to a user who just watched Movie A, effectively grouping similar users or items based on hidden preferences.
Why is Dimensionality Reduction important in Machine Learning? Mention the Curse of Dimensionality.
Dimensionality Reduction is the process of reducing the number of random variables (features) under consideration.
Importance:
- Computational Efficiency: Fewer features mean less data to store and faster training times.
- Visualization: It is easier to visualize data in 2D or 3D than in 100D.
- Noise Reduction: Removing irrelevant features improves model accuracy.
Curse of Dimensionality:
This refers to various phenomena that arise when analyzing data in high-dimensional spaces. As the number of features increases, the amount of data required to generalize accurately increases exponentially. In high dimensions, data becomes sparse, and distance metrics (like Euclidean distance) become less meaningful, making clustering and classification difficult.
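As a concrete illustration (assumed synthetic data and PCA as the reduction technique, neither taken from the notes), the sketch below projects 20 features down to 2:

```python
# Dimensionality reduction with PCA on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=100, n_features=20, n_informative=3, random_state=0)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                # 20 features -> 2 components
print("Original shape:", X.shape)           # (100, 20)
print("Reduced shape: ", X_2d.shape)        # (100, 2)
print("Variance explained:", pca.explained_variance_ratio_.round(2))
```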
Derive the Naive Bayes classification rule from Bayes' Theorem. Why is it called "Naive"?
Derivation:
We want to find the class C_k that maximizes the posterior P(C_k | x) for a feature vector x = (x_1, x_2, ..., x_n).
Using Bayes' Theorem:
P(C_k | x) = [P(x | C_k) × P(C_k)] / P(x)
Since P(x) is constant for all classes, we only need to maximize the numerator:
P(C_k | x) ∝ P(x | C_k) × P(C_k)
The "Naive" Assumption:
The algorithm assumes that all features x_i are mutually independent given the class C_k. This allows us to decompose the joint likelihood:
P(x | C_k) = P(x_1 | C_k) × P(x_2 | C_k) × ... × P(x_n | C_k)
Classification Rule:
ŷ = argmax over C_k of P(C_k) × Π_i P(x_i | C_k)
It is called "Naive" because the assumption of independence between features is rarely true in the real world (e.g., in text, the word "Bank" is dependent on "Account"), yet the classifier often performs surprisingly well.
Explain the concept of F1 Score. Why is it a better metric than Accuracy for imbalanced datasets?
F1 Score:
It is the harmonic mean of Precision and Recall. It provides a single metric that balances both concerns.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Comparison with Accuracy on Imbalanced Data:
- Scenario: Consider a dataset with 95% 'Healthy' patients and 5% 'Sick' patients.
- Accuracy Problem: A model that simply predicts 'Healthy' for everyone achieves 95% accuracy but is useless because it catches 0% of the sick patients.
- F1 Score Advantage: In this scenario, the Recall for the 'Sick' class would be 0, making the F1 score 0. The F1 score penalizes extreme values. If either Precision or Recall is low, the F1 score will be low. Therefore, F1 gives a more realistic view of performance when classes are unevenly distributed.
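The 95%/5% scenario can be checked numerically; the labels below simply encode that scenario with an "always Healthy" model.

```python
# Accuracy vs. F1 for a model that predicts 'Healthy' for everyone on an imbalanced dataset.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 5 + [0] * 95      # 1 = Sick (5 patients), 0 = Healthy (95 patients)
y_pred = [0] * 100               # model predicts 'Healthy' for everyone

print("Accuracy:", accuracy_score(y_true, y_pred))                    # 0.95 -- looks great
print("F1 (Sick class):", f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- reveals the failure
```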
What is Imputation in the context of handling missing data? Describe two methods to perform it.
Imputation is the technique of replacing missing data with substituted values to retain the majority of the dataset's information rather than discarding rows with missing values.
Methods:
1. Mean/Median Imputation (Univariate):
- Replace missing values in a column with the Mean (for normal distribution) or Median (for skewed distribution) of that column.
- Pros: Simple and fast.
- Cons: Reduces variance and ignores correlations between features.
2. K-Nearest Neighbors (KNN) Imputation (Multivariate):
- Find the 'k' samples in the dataset that are most similar (closest in distance) to the sample with the missing data.
- Average the values of these neighbors to fill the gap.
- Pros: More accurate as it accounts for local data structure.
- Cons: Computationally expensive for large datasets.
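A short sketch of both methods with scikit-learn's imputers; the small age/income table and the choice of k = 2 are assumptions for illustration.

```python
# Imputation with scikit-learn (np.nan marks the missing entries).
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[25., 50000.],
              [30., np.nan],      # missing income
              [np.nan, 62000.],   # missing age
              [40., 58000.]])

# 1) Univariate: replace each missing value with the column median
median_imputer = SimpleImputer(strategy='median')
print(median_imputer.fit_transform(X))

# 2) Multivariate: replace it with the average of the k nearest rows
knn_imputer = KNNImputer(n_neighbors=2)
print(knn_imputer.fit_transform(X))
```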