Unit 6 - Practice Quiz

INT395 50 Questions

1 What is the primary purpose of using a Pipeline in machine learning workflows, such as those in Scikit-Learn?

A. To deploy the model directly to a cloud server
B. To chain together multiple processing steps (transformers) and a final estimator
C. To visualize the neural network architecture
D. To automatically select the best algorithm for the dataset
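To illustrate option B, a minimal sketch that chains a transformer and a final estimator into one object (the Iris data and the particular estimators here are illustrative, not part of the question):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# One object: a scaling step (transformer) followed by a final estimator.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X, y)
print(pipe.score(X, y))  # accuracy of the whole chain on the training data
```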

2 When using a Pipeline, how is data leakage prevented during Cross-Validation?

A. By using only the final estimator during cross-validation
B. By fitting transformers only on the training folds and applying them to the validation fold
C. By fitting the transformer on the entire dataset before splitting
D. By shuffling the data repeatedly
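A sketch of how option B plays out in practice: because the scaler sits inside the pipeline, cross_val_score refits it on each training fold only, so the validation fold never leaks into fit() (the dataset and estimator are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fit inside every fold, never on the held-out portion.
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```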

3 In a Scikit-Learn Pipeline, which method is called on the intermediate steps during the training phase?

A. fit_transform()
B. score()
C. predict()
D. transform()

4 Which of the following classes allows you to apply different transformers to different columns of an array or pandas DataFrame?

A. GridSearchCV
B. ColumnTransformer
C. FunctionTransformer
D. FeatureUnion
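A minimal sketch of option B on a hypothetical two-column DataFrame, scaling the numeric column and one-hot encoding the categorical one:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (illustrative): one numeric and one categorical column.
df = pd.DataFrame({"age": [25, 32, 47], "city": ["NY", "LA", "NY"]})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),   # scale numeric columns
    ("cat", OneHotEncoder(), ["city"]),   # one-hot encode categoricals
])
Xt = ct.fit_transform(df)
print(Xt.shape)  # (3, 3): 1 scaled column + 2 one-hot columns
```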

5 In K-Fold Cross-Validation, if K = 5, what percentage of the data is used for training in each iteration?

A. 20%
B. 50%
C. 80%
D. 100%

6 Which Cross-Validation strategy is recommended for classification problems where the target classes are imbalanced?

A. Stratified K-Fold
B. Leave-One-Out
C. K-Fold
D. TimeSeriesSplit
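A sketch of option A on deliberately imbalanced toy labels, showing that every fold preserves the class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels (illustrative): 90 of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the 9:1 ratio.
    print(np.bincount(y[test_idx]))  # [18  2] in every fold
```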

7 What is a significant drawback of Leave-One-Out Cross-Validation (LOOCV) on large datasets?

A. It is computationally expensive
B. It has high bias
C. It reduces the variance of the estimator significantly
D. It cannot handle categorical data

8 When debugging an algorithm, a Learning Curve plots the model performance (score) against:

A. The number of training samples
B. The values of a specific hyperparameter
C. The number of features
D. The time taken to train
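A minimal sketch of option A using learning_curve, which scores the model at several training-set sizes (the dataset, estimator, and sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# Score vs. number of training samples -- the x-axis of a learning curve.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=500), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=5)
print(sizes)  # the absolute training-set sizes evaluated
```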

9 If a Learning Curve shows a high training score but a low validation score with a large gap between them, the model is suffering from:

A. Convergence failure
B. High Bias (Underfitting)
C. Data leakage
D. High Variance (Overfitting)

10 If both the training score and validation score converge to a low value (high error) on a Learning Curve, what is the diagnosis?

A. High Bias (Underfitting)
B. Optimal performance
C. Need for regularization
D. High Variance (Overfitting)

11 A Validation Curve is used to evaluate the effect of:

A. Feature scaling
B. Training set size
C. Different random seeds
D. A single hyperparameter
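A sketch of option D using validation_curve, which sweeps one hyperparameter (here SVC's gamma, chosen for illustration) and records train and validation scores:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One row of scores per hyperparameter value, one column per CV fold.
param_range = np.logspace(-3, 1, 5)
train_scores, val_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
print(train_scores.shape)  # (5, 5)
```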

12 In the context of the Bias-Variance tradeoff, increasing the complexity of a model usually leads to:

A. Higher Bias and Lower Variance
B. Higher Bias and Higher Variance
C. Lower Bias and Higher Variance
D. Lower Bias and Lower Variance

13 Which of the following actions is most likely to fix a model suffering from High Variance?

A. Getting more training data
B. Decreasing the regularization parameter
C. Adding more polynomial features
D. Reducing the size of the training set

14 Which of the following actions is most likely to fix a model suffering from High Bias?

A. Increasing the regularization parameter
B. Adding polynomial features or increasing model complexity
C. Removing features
D. Using a simpler algorithm

15 What is Model Serialization?

A. Training a model in a serial sequence rather than parallel
B. Converting a trained model into a format that can be stored or transmitted
C. Converting categorical features into serial integers
D. Assigning a unique serial number to a model version

16 Which Python library is commonly used for serializing Scikit-Learn models, particularly efficient for NumPy arrays?

A. json
B. joblib
C. csv
D. pandas
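A minimal round-trip sketch of option B: serialize a fitted model with joblib.dump and load it back with joblib.load (the model, dataset, and file path are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Serialize to disk, then deserialize back into memory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)
restored = joblib.load(path)
print((restored.predict(X) == model.predict(X)).all())  # True
```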

17 What is the inverse process of serialization called, where a saved model is loaded back into memory?

A. Unzipping
B. Parsing
C. Decoding
D. Deserialization

18 What is a major security risk associated with deserializing data using pickle or joblib?

A. It converts float64 to float32
B. Arbitrary code execution if the file is malicious
C. The file size becomes too large
D. The model accuracy decreases

19 Which format is specifically designed as an open standard for representing machine learning models to allow interoperability between different frameworks (e.g., PyTorch to ONNX Runtime)?

A. Pickle
B. HDF5
C. CSV
D. ONNX (Open Neural Network Exchange)

20 What does Model Deployment generally refer to?

A. The process of cleaning data
B. Hyperparameter tuning using GridSearch
C. Writing the documentation for the model
D. Integrating a machine learning model into an existing production environment to make practical business decisions

21 In a Web Service deployment (e.g., using Flask or FastAPI), how does a client typically request a prediction?

A. By connecting via SSH to the server console
B. By sending an HTTP request (usually POST) with data in JSON format
C. By emailing the dataset to the server
D. By downloading the model file and running it locally
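A hedged sketch of option B, assuming Flask is installed; the /predict route, the payload shape, and the dummy scoring logic are all hypothetical stand-ins for a real model call:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # JSON body from the client
    # A real service would call model.predict(features) here; we echo a
    # dummy score so the example stays self-contained.
    return jsonify({"prediction": sum(features)})

# Exercise the endpoint in-process with Flask's built-in test client.
client = app.test_client()
resp = client.post("/predict", json={"features": [1, 2, 3]})
print(resp.get_json())  # {'prediction': 6}
```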

22 What is Containerization (e.g., using Docker) useful for in model deployment?

A. It packages the model with all its dependencies (libraries, OS settings) to ensure consistency across environments
B. It converts Python code to C++
C. It compresses the model to a smaller file size
D. It increases the model's accuracy automatically

23 What characterizes Serverless deployment (e.g., AWS Lambda, Azure Functions)?

A. The model runs without any hardware physically existing anywhere
B. The developer manages the physical servers and operating system updates
C. It requires a dedicated server running 24/7
D. The cloud provider dynamically manages the allocation of machine resources, and you pay only for the compute time used

24 Which of the following is a disadvantage of Local Deployment (running the model on the user's device, e.g., mobile app)?

A. High latency due to network transfer
B. Dependence on internet connectivity
C. Limited computational resources (battery, CPU/RAM) on the device
D. Data privacy concerns

25 In the context of deployment, what is Concept Drift?

A. Moving the model from one cloud provider to another
B. The statistical properties of the target variable change over time, making the model less accurate
C. The model code changing over time due to git commits
D. The loss of floating-point precision during serialization

26 What is Batch Prediction (Offline Inference)?

A. Generating predictions for a large set of observations at once, often on a schedule
B. Training the model in batches
C. Grouping multiple models together
D. Generating a prediction immediately when a user clicks a button

27 Which HTTP method is most appropriate for a REST API endpoint that accepts input features and returns a model prediction?

A. GET
B. DELETE
C. HEAD
D. POST

28 Why is Pipeline serialization preferred over serializing just the model estimator?

A. It makes the file smaller
B. It is faster to load
C. It ensures that raw data fed into the loaded object undergoes the exact same preprocessing steps as training data
D. Pipelines cannot be serialized

29 In Scikit-Learn, how do you perform a Grid Search over parameters inside a Pipeline?

A. Pass the parameters directly to the estimator
B. Use the syntax step_name__parameter_name in the param_grid
C. You cannot perform Grid Search on a Pipeline
D. Modify the source code of the library
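A sketch of option B: the step_name__parameter_name key in param_grid reaches inside the named pipeline step (the pipeline, parameter values, and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# "svc__C" targets the C parameter of the step named "svc".
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.1, 1, 10]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)  # the winning value of C from the grid above
```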

30 What is Nested Cross-Validation used for?

A. To tune hyperparameters without biasing the model evaluation
B. To use multiple models simultaneously
C. To visualize the data in 3D
D. To reduce the training time

31 Which cross-validation method is appropriate for time-series data?

A. Random K-Fold
B. ShuffleSplit
C. Leave-One-Out
D. TimeSeriesSplit (Rolling basis)
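A minimal sketch of option D: with TimeSeriesSplit the training window always precedes the test window, so the model never trains on the future:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten time-ordered observations (illustrative).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index comes strictly before every test index.
    print(train_idx, test_idx)
```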

32 In a validation curve, if the training score is 0.99 and the validation score is 0.60, what should you do regarding the hyperparameter being tested?

A. Adjust the parameter to increase model complexity
B. Keep the parameter as is
C. Stop collecting data
D. Adjust the parameter to decrease model complexity (increase regularization)

33 What is the primary benefit of Microservices architecture for ML deployment?

A. It allows the ML model to be developed, deployed, and scaled independently of the main application
B. It puts all code into one giant script
C. It eliminates the need for data preprocessing
D. It removes the need for APIs

34 Which file extension is commonly associated with a Python pickled model?

A. .txt
B. .pkl
C. .css
D. .html

35 A Canary Deployment strategy involves:

A. Releasing the model to a small percentage of users first to monitor performance before full rollout
B. Replacing the old model instantly for all users
C. Deploying the model to a coal mine
D. Running the model only on weekends

36 Consider Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())]) in Scikit-Learn. What type of object does the steps argument expect?

A. A pandas DataFrame
B. A list of (name, transform) tuples
C. A JSON string
D. A dictionary of parameters

37 Which metric on a learning curve would indicate that obtaining more data is NOT worth the cost?

A. The validation score is fluctuating wildly
B. The training score is increasing rapidly
C. The gap between training and validation score is widening
D. The validation score has plateaued and converged with the training score

38 What is the purpose of make_pipeline in Scikit-Learn compared to the Pipeline class constructor?

A. It does not support cross-validation
B. It creates a pipeline that runs faster
C. It only supports regression models
D. It automatically names the steps based on the class names of the estimators
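A one-liner sketch of option D: make_pipeline derives each step's name from the lowercased class name of the estimator:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# No explicit (name, estimator) tuples needed -- names are auto-generated.
pipe = make_pipeline(StandardScaler(), SVC())
print(list(pipe.named_steps))  # ['standardscaler', 'svc']
```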

39 When deploying a model via a REST API, what is Latency?

A. The time taken from sending the request to receiving the prediction
B. The cost of the server
C. The number of requests the server can handle per second
D. The accuracy of the model

40 What is PMML (Predictive Model Markup Language)?

A. A python library for plotting
B. A type of neural network
C. An XML-based standard for representing predictive models
D. A cloud service provider

41 In Repeated K-Fold Cross-Validation:

A. The same K-Fold split is repeated exactly
B. K-Fold CV is run n times, each with a different randomization of the splits
C. The model is trained repeatedly on the same fold
D. Data is duplicated n times before splitting

42 What is the formula for Total Error in the context of Bias-Variance decomposition?

A. Total Error = Bias + Variance
B. Total Error = Bias² + Variance + Irreducible Error
C. Total Error = Bias × Variance + Irreducible Error
D. Total Error = Bias² − Variance + Irreducible Error

43 Which Scikit-Learn utility helps ensure that the training and testing sets have the same distribution of classes?

A. Pipeline()
B. train_test_split(..., shuffle=False)
C. train_test_split(..., stratify=y)
D. StandardScaler()
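A sketch of option C on toy imbalanced labels: stratify=y keeps the class ratio identical in both the training and test splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative 4:1 imbalanced labels.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(np.bincount(y_te))  # [20  5] -- same 4:1 ratio as the full data
```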

44 Why might a FunctionTransformer be included in a Pipeline?

A. To plot the data
B. To execute a custom Python function (like log transformation) as a step
C. To transform the pipeline into a function
D. To debug the pipeline
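A minimal sketch of option B, wrapping np.log1p so a log transformation can sit inside a Pipeline like any other step:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Any stateless function with the right signature becomes a pipeline step.
log_step = FunctionTransformer(np.log1p)
Xt = log_step.fit_transform(np.array([[0.0], [np.e - 1]]))
print(Xt.ravel())  # [0. 1.]
```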

45 When serializing a model that relies on external code files (custom classes), what common issue arises?

A. The pickle file may fail to load if the custom class definition is not available in the loading environment
B. The file size doubles
C. The model becomes a regression model
D. The pickle file works everywhere automatically

46 Which of the following is an example of Online Learning deployment?

A. Using a static HTML page
B. Uploading the model to the internet
C. Training a model once a year
D. The model updates its weights incrementally as new data streams in

47 What is the purpose of the random_state parameter in Cross-Validation splitters?

A. To delete random data
B. To improve accuracy
C. To ensure reproducibility of the splits
D. To randomize the hyperparameters

48 In a pipeline, what happens if you call predict()?

A. It returns an error
B. It calls predict() on all steps
C. It calls fit() on all steps
D. It calls transform() on all transformers and predict() on the final estimator

49 Which tool is commonly used to create an isolated environment for Python dependencies to avoid version conflicts during development and deployment?

A. Pipenv / Virtualenv / Conda
B. Chrome
C. Notepad
D. Excel

50 What is the typical use of A/B Testing in model deployment?

A. Debugging syntax errors
B. Comparing two different models (Model A and Model B) on live traffic to see which performs better
C. Checking if the model works on inputs A and B
D. Testing the model in Alpha and Beta stages