Viva Questions
INT234
The primary goal of predictive analytics is to assess what is likely to happen in the future based on historical data. It provides a probability score for individual entities to inform or influence organizational processes.
Diagnostic Analytics asks "Why did it happen?" using drill-down techniques and data discovery to understand past events. Predictive Analytics asks "What is likely to happen?" using forecasting and statistical modeling to anticipate future outcomes.
The six phases are: 1. Business Understanding, 2. Data Understanding, 3. Data Preparation, 4. Modeling, 5. Evaluation, and 6. Deployment.
In traditional programming, rules are explicitly coded (e.g., "If X > 5 then Y"). In Machine Learning, algorithms identify patterns in data to create their own rules without being explicitly programmed for specific tasks.
Supervised learning uses a labeled dataset containing both input features and the correct output/answer. Unsupervised learning uses an unlabeled dataset where no correct answer is provided.
Classification is used when the output variable is categorical (e.g., Spam/Not Spam, Yes/No). Regression is used when the output variable is continuous or numerical (e.g., price, temperature).
List-wise deletion involves dropping the entire row if data is missing, which risks losing valuable information. Imputation involves filling the missing data with estimated values, such as the mean, median, or mode.
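For illustration only, a minimal pandas sketch contrasting the two approaches (the toy DataFrame and column names are invented for the example):

```python
import pandas as pd
import numpy as np

# Toy frame with missing values; columns are illustrative.
df = pd.DataFrame({"age": [25, 30, np.nan, 40], "salary": [50, 60, 65, np.nan]})

dropped = df.dropna()                             # list-wise deletion: whole rows removed
imputed = df.fillna(df.mean(numeric_only=True))   # mean imputation: gaps filled with column means

print(dropped)
print(imputed)
```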
The formula is X_scaled = (X - X_min) / (X_max - X_min). It scales features to a fixed range [0, 1] and is useful when algorithms use distance measures (like k-NN) and features have vastly different scales.
Machine learning models require numerical input. One-Hot Encoding converts categorical labels into binary columns (0s and 1s) to prevent the model from misinterpreting the order or magnitude of simple integer labels.
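A small pandas sketch of both preprocessing steps, with made-up values, assuming the min-max formula above:

```python
import pandas as pd

df = pd.DataFrame({"income": [20_000, 50_000, 120_000], "city": ["Delhi", "Pune", "Delhi"]})

# Min-Max scaling: (X - X_min) / (X_max - X_min) maps income into [0, 1].
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# One-Hot Encoding: the categorical column becomes binary indicator columns.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)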
Splitting data allows us to train the model on one subset and evaluate it on a separate, unseen subset (the test set). This helps assess how well the model generalizes to new data and checks for overfitting.
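A minimal sketch of such a split using scikit-learn (the Iris dataset and the 80/20 ratio are just example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 20% of the rows as an unseen test set; fit the model on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```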
The objective of OLS is to find the line of best fit by minimizing the Sum of Squared Errors (SSE), which is the sum of the squared vertical differences between the observed values and the predicted values.
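A worked NumPy sketch of the closed-form OLS solution for simple linear regression (the data points are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Closed-form OLS for one predictor: slope = cov(x, y) / var(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)   # the quantity OLS minimizes
print(b0, b1, sse)
```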
An r (correlation coefficient) value of -1 indicates a perfect negative linear relationship, meaning as one variable increases, the other decreases in exact proportion.
Multicollinearity occurs when independent variables are highly correlated with each other. It makes it difficult to determine the individual effect of each variable on the dependent variable and can lead to unstable coefficient estimates.
Logistic Regression is a classification algorithm used to predict discrete outcomes (like Yes/No or 0/1) by estimating the probability of an event occurring.
The Sigmoid function maps any real-valued number from a linear equation to a value between 0 and 1. This output represents a probability, which is then used to classify the result based on a threshold.
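A tiny NumPy sketch of the Sigmoid mapping and the usual 0.5 threshold (the z values stand in for outputs of the linear part):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to (0, 1), interpreted as a probability.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])        # example outputs of w.x + b
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # classify with a 0.5 threshold
print(probs, labels)
```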
Polynomial Regression should be used when the relationship between the independent and dependent variables is non-linear (curved), as a straight line would result in underfitting.
MAE (Mean Absolute Error) takes the average of absolute errors and is robust to outliers. MSE (Mean Squared Error) squares the errors, which heavily penalizes large errors/outliers but makes the unit different from the target variable.
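A short NumPy comparison on invented numbers, where one large error dominates MSE but not MAE:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 50.0])   # last point behaves like an outlier
y_pred = np.array([11.0, 11.0, 12.0, 20.0])

mae = np.mean(np.abs(y_true - y_pred))     # linear penalty on each error
mse = np.mean((y_true - y_pred) ** 2)      # squared penalty: the large error dominates
print(mae, mse)
```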
An R² score of 1 indicates a perfect fit, meaning the model explains 100% of the variance in the dependent variable.
Homoscedasticity assumes that the variance of the residual terms (errors) is constant at every level of the independent variable (X). If the variance changes (e.g., errors get larger as X increases), it is called Heteroscedasticity.
It is called lazy learning because it does not generate a model during a training phase. It simply stores the training data and waits until a query is made to perform calculations and classification.
If 'k' is too small, the model becomes highly sensitive to noise and outliers, leading to high variance and overfitting.
The algorithm assumes that the presence of a particular feature in a class is completely independent of the presence of any other feature, which is rarely true in real-world data.
Entropy is a measure of randomness or disorder in a dataset. A value of 0 indicates a perfectly homogeneous node (all same class), while a value of 1 indicates maximum disorder (equally divided classes). The algorithm splits nodes to minimize entropy.
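A minimal NumPy sketch of Shannon entropy on two invented binary label sets, matching the 0 and 1 extremes described above:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary split.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 1, 1, 1]))   # 0.0 -> perfectly homogeneous node
print(entropy([1, 1, 0, 0]))   # 1.0 -> maximum disorder
```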
Support Vectors are the data points closest to the hyperplane. They are critical because they define the position and orientation of the decision boundary; removing other points doesn't change the model, but moving support vectors does.
The Kernel Trick maps input data into a higher-dimensional space where a linear separator can be found. It is used to enable SVM to classify data that is not linearly separable in its original dimension.
A False Positive represents a "False Alarm," where the model predicts the positive class, but the actual value is negative (e.g., predicting a healthy person has a disease).
Recall should be prioritized when the cost of False Negatives is high, such as in cancer detection, where missing a positive case is more dangerous than a false alarm.
The F1 Score is the harmonic mean of Precision and Recall. It is most useful when the class distribution is uneven (imbalanced datasets) and you need a balance between precision and recall.
An AUC of 0.5 indicates that the model has no discrimination capacity and is essentially performing random guessing.
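A small scikit-learn sketch tying together the confusion matrix, Precision, Recall, F1 and AUC answers above (the labels and scores are invented toy data):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(precision_score(y_true, y_pred))    # penalised by False Positives
print(recall_score(y_true, y_pred))       # penalised by False Negatives
print(f1_score(y_true, y_pred))           # harmonic mean of Precision and Recall
print(roc_auc_score(y_true, y_score))     # 0.5 = random guessing, 1.0 = perfect ranking
```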
The goal is to explore unlabeled data to discover hidden structures, patterns, or groupings (clusters) without any prior training on what the output should look like.
A Centroid is the center point of a cluster. In K-Means, it represents the mean position of all the data points assigned to that specific cluster.
The Elbow Method plots the WCSS against the number of clusters. The optimal k is found at the "elbow" of the curve, where the rate of decrease in variance shifts sharply, indicating diminishing returns for adding more clusters.
It occurs when poor initial placement of centroids leads the algorithm to converge at a local minimum rather than the global minimum. It is solved using K-Means++, which initializes centroids to be far apart.
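A rough scikit-learn sketch of the Elbow Method using K-Means++ initialization (the blob dataset and the range of k values are just example choices; WCSS is exposed as inertia_):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# WCSS for k = 1..8; plotting these against k reveals the "elbow".
wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)
print(wcss)
```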
A Dendrogram is a tree-like diagram used in hierarchical clustering to visualize the arrangement of clusters and the sequence of merges or splits.
Agglomerative is a "bottom-up" approach starting with individual points as clusters and merging them. Divisive is a "top-down" approach starting with one giant cluster and splitting it recursively.
It is a technique used by retailers to identify items that are frequently purchased together (e.g., Bread and Butter) to optimize store layout or cross-sell products.
Support indicates how frequently an itemset appears in the dataset. It is calculated as the number of transactions containing the item divided by the total number of transactions.
Lift measures the strength of association independent of the popularity of the consequent item. Unlike Confidence, Lift accounts for the baseline probability of the item occurring, revealing if the relationship is stronger than random chance.
A Lift value greater than 1 indicates a positive association, meaning that the two items are likely to be bought together more often than would be expected by chance.
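A hand-counted Python sketch of Support, Confidence and Lift on five invented transactions (no association-rule library is used):

```python
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"},
    {"bread"}, {"milk"}, {"bread", "butter"},
]
n = len(transactions)

support_bread        = sum("bread" in t for t in transactions) / n
support_butter       = sum("butter" in t for t in transactions) / n
support_bread_butter = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_bread_butter / support_bread   # P(butter | bread)
lift       = confidence / support_butter            # > 1 => positive association
print(support_bread_butter, confidence, lift)
```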
It refers to the problems that arise when analyzing data in high-dimensional spaces (many features), such as increased computational time, difficulty in visualization, and sparse data leading to overfitting.
Principal Components are new, uncorrelated variables derived from the original features. They are ordered so that the first component captures the most variance in the data, the second captures the second most, and so on.
PCA seeks to maximize variance. If features are not scaled (e.g., one ranges 0-1 and another 0-1000), the feature with the larger scale will dominate the variance calculation, skewing the principal components.
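A minimal scikit-learn sketch of scaling before PCA (the Wine dataset and keeping two components are just example choices):

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)

# Standardise first so no single large-scale feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
print(pca.explained_variance_ratio_)   # share of variance captured by PC1 and PC2
```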
The three layers are the Input Layer, Hidden Layer(s), and the Output Layer.
Activation functions introduce non-linearity to the network. Without them, a neural network, no matter how deep, would act just like a simple linear regression model and couldn't learn complex patterns.
Backpropagation is the learning mechanism where the network calculates the gradient of the loss function with respect to the weights. It propagates the error backward from the output to the input layer to update weights and minimize error.
CNNs preserve the spatial structure of images (2D relationships) and use shared weights (filters) to reduce the number of parameters. MLPs would require flattening the image, losing spatial info and causing a parameter explosion.
A Pooling Layer (e.g., Max Pooling) is used for downsampling. It reduces the spatial dimensions (width and height) of the input to decrease computational load and control overfitting while retaining prominent features.
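A small NumPy sketch of 2x2 Max Pooling with stride 2 (the helper function is illustrative, not a library API):

```python
import numpy as np

def max_pool_2x2(img):
    # Keep the largest value in each non-overlapping 2x2 block.
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 3, 2, 0],
                [4, 6, 1, 2],
                [7, 2, 9, 4],
                [1, 0, 3, 5]])
print(max_pool_2x2(img))   # 4x4 input -> 2x2 output, halving width and height
```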
RNNs are designed for sequential data where the order matters, such as time series data, text (NLP), and speech.
In long sequences, gradients calculated during backpropagation can become extremely small as they are multiplied backward. This causes the network to stop learning from earlier time steps, effectively giving it only "short-term" memory.
Bias is the error introduced by approximating a complex real-world problem with a simplified model. High bias leads to underfitting, where the model misses the underlying trend.
Variance refers to how much the model's estimate would change if used on a different training set. High variance implies the model is sensitive to noise in the training data, leading to overfitting.
Underfitting occurs when a model is too simple (High Bias) to capture the underlying structure of the data, resulting in poor performance on both training and test data.
Overfitting occurs when a model is too complex (High Variance) and learns the noise in the training data rather than the signal. It performs well on training data but poorly on unseen test data.
Irreducible error is the noise inherent in the system itself. It is the part of the error that cannot be reduced by any model, regardless of how good the model is.
There is an inverse relationship: As model complexity increases, Bias decreases (better fit), but Variance increases (sensitive to noise). As complexity decreases, Bias increases, but Variance decreases.
The sweet spot is the level of model complexity where the sum of Bias² and Variance is minimized, resulting in the lowest possible Total Error and best generalization.
The expected prediction error consists of: 1. Bias² (squared bias), 2. Variance, and 3. Irreducible Error.
It has High Variance. A fully grown tree is very complex and can perfectly memorize the training data (including noise), making it prone to overfitting.
It has High Bias. A linear model is too simple to capture the curve, leading to systematic error (underfitting).