Unit 1 - Practice Quiz

CSE275 60 Questions

1 What is the primary goal of optimization in the context of training a machine learning model?

Role of optimization in AI and ML Easy
A. To find the model parameters that best minimize a loss function.
B. To increase the size of the training dataset.
C. To write the model's code in the most efficient programming language.
D. To select the fastest computer hardware.

2 In machine learning, what does a 'loss function' measure?

Loss minimization and search-based optimization Easy
A. The amount of time it takes to train the model.
B. The error or discrepancy between the model's prediction and the true value.
C. The complexity of the model's architecture.
D. The number of features used by the model.

3 Which of the following algorithms is a classic example of gradient-based optimization?

Gradient-based vs gradient-free optimization Easy
A. Grid Search
B. Genetic Algorithm
C. Random Search
D. Gradient Descent
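As a quick illustration of the answer, here is a minimal gradient-descent sketch (a toy one-dimensional example, not part of the quiz):

```python
# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3).

def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)   # step in the direction opposite the gradient
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
# x_min converges toward 3.0, the global minimum
```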

4 A key advantage of a convex optimization problem is that it has:

Convex vs non-convex optimization in ML Easy
A. No minimum, which simplifies the problem.
B. A single global minimum, making the optimal solution easier to find.
C. Multiple global minima, offering more choices.
D. Many local minima, which helps in exploring the solution space.

5 The process of systematically searching for the best learning rate or number of hidden layers for a neural network is called:

Applications of optimization in feature selection, hyperparameter tuning and model selection Easy
A. Feature selection
B. Hyperparameter tuning
C. Data normalization
D. Model compilation

6 In machine learning, the process of 'training' a model is fundamentally a(n):

Role of optimization in AI and ML Easy
A. Data visualization process
B. Data collection process
C. Software deployment process
D. Optimization process

7 What is the 'objective function' in a typical supervised machine learning problem?

Optimization problems in learning systems Easy
A. The final prediction function of the trained model.
B. The function to be minimized or maximized, usually the loss function.
C. A function that counts the number of data points.
D. The function that transforms the input data.

8 Which search-based optimization method involves exhaustively trying all combinations from a predefined set of hyperparameter values?

Loss minimization and search-based optimization Easy
A. Gradient Descent
B. Bayesian Optimization
C. Grid Search
D. Random Search
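For reference, Grid Search's exhaustive strategy can be sketched in a few lines of Python (the grid and scoring function below are invented for illustration):

```python
# Grid Search sketch: score every combination from a predefined grid.
import itertools

grid = {"lr": [0.01, 0.1, 1.0], "layers": [1, 2, 3]}

def score(lr, layers):           # hypothetical validation score
    return -(lr - 0.1) ** 2 - (layers - 2) ** 2

best = max(itertools.product(grid["lr"], grid["layers"]),
           key=lambda combo: score(*combo))
# best == (0.1, 2): the grid point with the highest score
```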

9 What is the primary requirement for an objective function to be optimized using gradient-based methods?

Gradient-based vs gradient-free optimization Easy
A. It must be positive.
B. It must be linear.
C. It must be differentiable.
D. It must have only one variable.

10 The error landscape for training a deep neural network is typically:

Convex vs non-convex optimization in ML Easy
A. Non-convex, with many local minima.
B. A flat plane with no minimum.
C. A simple quadratic bowl.
D. Convex, with a single global minimum.

11 Metaheuristic algorithms like Particle Swarm Optimization are often used when:

Need for metaheuristic optimization Easy
A. A mathematical proof of convergence is required.
B. The dataset is extremely small.
C. The optimization problem is complex and non-convex.
D. The problem is known to be perfectly convex.

12 What is the goal of optimization in 'feature selection'?

Applications of optimization in feature selection, hyperparameter tuning and model selection Easy
A. To find the subset of input features that yields the best model performance.
B. To find the best hyperparameters for the model.
C. To select the best algorithm for the task.
D. To create new features from existing ones.
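To make option A concrete, feature selection can be framed as a search over subsets. A brute-force sketch (feasible only for a handful of features; the `useful` set and scoring rule are made up for illustration):

```python
# Exhaustively score every non-empty subset of a small feature set.
import itertools

features = ["age", "income", "zip", "clicks"]
useful = {"age", "clicks"}          # hypothetical ground truth

def subset_score(subset):           # stand-in for a model's validation score
    return len(useful & set(subset)) - 0.1 * len(subset)

best_subset = max(
    (combo for r in range(1, len(features) + 1)
           for combo in itertools.combinations(features, r)),
    key=subset_score,
)
# best_subset keeps the useful features and drops the rest
```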

13 An optimization method that evaluates the objective function at different points without using its derivative is called:

Gradient-based vs gradient-free optimization Easy
A. Gradient-free
B. Gradient-based
C. Stochastic Gradient Descent
D. Newton's method

14 The ultimate goal of the 'loss minimization' process is to:

Loss minimization and search-based optimization Easy
A. Reduce the time it takes to make a single prediction.
B. Use as little memory as possible during training.
C. Make the model's predictions as close as possible to the actual data.
D. Make the model as complex as possible.

15 Which of these core AI/ML tasks is fundamentally an optimization problem?

Role of optimization in AI and ML Easy
A. Training a neural network.
B. Storing a dataset in a database.
C. Visualizing a confusion matrix.
D. Loading a pre-trained model.

16 If an optimization algorithm is guaranteed to find the absolute best solution regardless of its starting point, the problem is most likely:

Convex vs non-convex optimization in ML Easy
A. Stochastic
B. Convex
C. Unbounded
D. Non-convex

17 Algorithms like Genetic Algorithms and Simulated Annealing are examples of:

Need for metaheuristic optimization Easy
A. Linear programming
B. Gradient-based optimization
C. Metaheuristic optimization
D. Data preprocessing techniques

18 Choosing between a Decision Tree and a Support Vector Machine for a classification task is an example of:

Applications of optimization in feature selection, hyperparameter tuning and model selection Easy
A. Feature engineering
B. Model selection
C. Loss function design
D. Hyperparameter tuning

19 Finding the parameters that minimize a cost function is the definition of:

Optimization problems in learning systems Easy
A. A data clustering problem
B. A feature extraction method
C. An optimization problem
D. A data normalization procedure

20 Which is a potential disadvantage of gradient-free methods compared to gradient-based methods on simple, convex problems?

Gradient-based vs gradient-free optimization Easy
A. They cannot be used for minimization.
B. They require the function to be differentiable.
C. They are often slower to converge.
D. They are mathematically more complex.

21 A deep neural network with multiple hidden layers and ReLU activation functions is being trained using a standard loss function like cross-entropy. What is the most likely characteristic of the loss landscape for this model?

Convex vs non-convex optimization in ML Medium
A. It is strictly convex, guaranteeing a single global minimum.
B. It is convex but not strictly convex, having a flat region of optimal solutions.
C. It is non-convex with numerous local minima and saddle points.
D. It is a quadratic function that can be solved directly using linear algebra.

22 You need to optimize the architecture of a neural network (e.g., number of layers, neurons per layer, type of activation function). Why would a gradient-free optimization method like Bayesian Optimization or a Genetic Algorithm be more suitable than a gradient-based method like SGD?

Gradient-based vs gradient-free optimization Medium
A. Gradient-based methods are computationally too slow for any type of optimization.
B. The search space for the architecture is continuous and smooth, which is ideal for gradient-free methods.
C. The objective function (model performance) is not differentiable with respect to the architectural parameters.
D. Gradient-free methods are guaranteed to find the global optimum, whereas gradient-based methods are not.

23 In wrapper-based feature selection, the goal is to find a subset of features that maximizes a model's performance. How is this typically framed as a search-based optimization problem?

Applications of optimization in feature selection Medium
A. By training a model on every single possible subset of features and picking the best one.
B. By using an intelligent search strategy (like recursive feature elimination) to explore the space of feature subsets without evaluating all of them.
C. By calculating the gradient of the model's performance with respect to the presence of each feature.
D. By selecting features with the highest correlation to the target variable, which is not an optimization problem.

24 Consider a linear regression model with the Mean Squared Error loss L(θ) = (1/n) Σᵢ (ŷᵢ − yᵢ)². What is the primary goal of the optimization algorithm applied to this function?

Loss minimization and search-based optimization Medium
A. To find the optimal number of training samples that results in the lowest error.
B. To find the model parameters θ that minimize the average squared difference between predictions and actual values.
C. To find the input data that minimizes the loss for a given set of parameters θ.
D. To find the parameters θ that maximize the sum of squared errors.
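The answer can be demonstrated by minimizing the MSE for a one-variable linear model with plain gradient descent (a toy sketch; the data below are generated from assumed true parameters w=2, b=1):

```python
# MSE minimization sketch for y ≈ w*x + b on tiny synthetic data.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated from y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
n = len(xs)
for _ in range(2000):
    # gradients of the average squared error w.r.t. w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * dw
    b -= lr * db
# (w, b) approaches (2, 1), the minimizer of the average squared error
```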

25 In the context of supervised machine learning, the process of "learning" or "training" a model is fundamentally equivalent to what?

Role of optimization in AI and ML Medium
A. Solving a well-defined optimization problem to find the best model parameters according to an objective function.
B. A data compression and retrieval process.
C. A randomized search for parameters that perfectly fit the training data.
D. Exploring the entire hypothesis space to find and store all possible valid solutions.

26 A machine learning problem has a highly rugged, non-convex loss landscape with many deceptive local optima. A standard gradient descent algorithm consistently gets stuck. Which type of optimization approach would be a more suitable alternative to explore the search space more effectively?

Need for metaheuristic optimization Medium
A. A metaheuristic algorithm like a Genetic Algorithm or Particle Swarm Optimization.
B. A simpler gradient-based method with adaptive learning rates, like Adagrad.
C. Newton's Method, as it's a more powerful second-order gradient-based method.
D. A closed-form analytical solution like the Normal Equation.

27 Which of the following machine learning models typically results in a convex optimization problem, assuming a standard loss function like Mean Squared Error or Hinge Loss?

Convex vs non-convex optimization in ML Medium
A. A Support Vector Machine (SVM) with a linear kernel.
B. A K-Means clustering algorithm.
C. A decision tree trained with the CART algorithm.
D. A multi-layer perceptron with sigmoid activation functions.

28 When performing hyperparameter tuning using Grid Search, what is the underlying optimization strategy?

Applications of optimization in hyperparameter tuning Medium
A. An evolutionary search that combines and mutates hyperparameter sets to create new generations.
B. A probabilistic search that builds a surrogate model of the objective function.
C. An exhaustive, brute-force search over a manually specified, discrete subset of the hyperparameter space.
D. A gradient-based search over a continuous hyperparameter space.

29 An engineer is training a model where the loss function is non-differentiable and has several discontinuities (e.g., optimizing a function based on the 0-1 loss). Which of the following optimization algorithms is the most appropriate choice?

Gradient-based vs gradient-free optimization Medium
A. Stochastic Gradient Descent (SGD)
B. Nelder-Mead simplex method
C. L-BFGS
D. Adam (Adaptive Moment Estimation)
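Direct-search methods like Nelder-Mead probe the objective with function evaluations only. A simpler stand-in (a pattern search, not Nelder-Mead itself) shows why no derivative is needed:

```python
# Gradient-free pattern-search sketch on a non-differentiable objective.

def loss(x):                 # |x - 2| is not differentiable at its minimum
    return abs(x - 2.0)

x, step = 0.0, 1.0
while step > 1e-8:
    if loss(x + step) < loss(x):
        x += step
    elif loss(x - step) < loss(x):
        x -= step
    else:
        step /= 2            # shrink the probe when neither neighbor improves
# x approaches 2.0 using only function evaluations, no derivatives
```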

30 In designing an optimization problem for a classification model, what is the primary role of the objective function?

Optimization problems in learning systems Medium
A. To specify the hard constraints on the model's weights, such as forcing them to be non-negative.
B. To select the subset of data used for training the model.
C. To define the model's architecture, such as the number of layers or neurons.
D. To quantify the discrepancy between the model's predictions and the true labels, which the algorithm aims to minimize.

31 What is a key difference between the optimization process for training a neural network (loss minimization) and for finding its best hyperparameters (e.g., using Random Search)?

Loss minimization and search-based optimization Medium
A. Loss minimization aims to maximize a reward metric, while Random Search aims to minimize an error metric.
B. Loss minimization applies to discrete parameters, while Random Search applies only to continuous parameters.
C. Random Search is guaranteed to find a global optimum, while loss minimization is not.
D. Neural network training is a gradient-based optimization in a continuous parameter space, while hyperparameter tuning is often a gradient-free search over a discrete or mixed space.

32 You are tasked with optimizing a complex factory simulation where each run is computationally expensive and provides a single performance score. The relationship between input parameters and the score is a "black box." Why are metaheuristics like Simulated Annealing a good fit here?

Need for metaheuristic optimization Medium
A. They require an analytical formula for the objective function to work correctly.
B. They are guaranteed to find the single best solution in a finite number of steps.
C. They converge much faster than gradient-based methods on simple, convex problems.
D. They can effectively explore the solution space without needing gradient information, relying only on the objective function's output values.

33 What is a significant practical advantage of knowing that a machine learning optimization problem is convex?

Convex vs non-convex optimization in ML Medium
A. The optimization algorithm will not require setting a learning rate.
B. Any locally optimal solution found by a standard optimization algorithm is also a globally optimal solution.
C. The model can be trained much faster, often in a single step.
D. The resulting model is guaranteed to have higher predictive accuracy on unseen data.

34 How does adding an L2 regularization term, λ‖w‖₂², to a loss function change the optimization problem in machine learning?

Role of optimization in AI and ML Medium
A. It transforms a non-convex problem into a convex one.
B. It adds a penalty for large parameter values to the objective function, guiding the optimization towards simpler models to prevent overfitting.
C. It removes the need for an optimization algorithm by providing a closed-form solution.
D. It makes the loss function non-differentiable, forcing the use of gradient-free methods.

35 Consider the task of finding the optimal set of weights for a logistic regression model. The loss function is the binary cross-entropy, which is convex and differentiable. Which approach is generally more efficient for this problem?

Gradient-based vs gradient-free optimization Medium
A. Random search, as it is simple to implement and unbiased in its exploration.
B. Brute-force search over all possible floating-point weight combinations.
C. A gradient-free method like a Genetic Algorithm, as it explores the entire search space more broadly.
D. A gradient-based method like Gradient Descent, as it can efficiently follow the slope of the loss function towards the global minimum.

36 A data scientist is comparing three different models: a Logistic Regression, a Support Vector Machine, and a Random Forest. After performing hyperparameter tuning for each, they select the model with the lowest validation error. This entire process can be viewed as what type of optimization?

Applications of optimization in model selection Medium
A. A convex optimization problem where the global minimum is guaranteed to be the best possible model.
B. A search-based optimization problem over a discrete set of choices (the models themselves).
C. A constrained optimization problem where the main constraint is the size of the dataset.
D. A continuous optimization problem solved with a single run of SGD.

37 When formulating a machine learning problem, constraints are sometimes added to the optimization. For example, in some formulations of SVMs, we maximize the margin subject to constraints on the classification of data points. What is the role of such constraints?

Optimization problems in learning systems Medium
A. They define the feasible region, which is the set of all possible parameter values that are considered valid solutions.
B. They are a mathematical trick to guarantee that the optimization problem is convex.
C. They replace the objective function, becoming the new quantity to be minimized.
D. They primarily serve to increase the convergence speed of the optimization algorithm.

38 In the context of minimizing a loss function for a machine learning model, what does the 'search space' that the optimization algorithm explores represent?

Loss minimization and search-based optimization Medium
A. The space of all possible input data samples from the dataset.
B. The high-dimensional space defined by all possible values for the model's parameters (e.g., weights and biases).
C. The set of all possible loss functions that could be defined.
D. The set of all possible machine learning algorithms that could be used for the task.

39 An optimization problem is characterized as being high-dimensional, non-differentiable, and multi-modal (having many local optima). Why would a population-based metaheuristic like Particle Swarm Optimization (PSO) be a strong candidate?

Need for metaheuristic optimization Medium
A. Because it relies on calculating the Hessian matrix, which is efficient in high dimensions.
B. Because it is a deterministic method that guarantees convergence to the global optimum.
C. Because it is mathematically proven to be applicable only to convex, differentiable problems.
D. Because it maintains a diverse set of candidate solutions that can explore different regions of the search space simultaneously, helping to avoid getting trapped in a single local optimum.
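A minimal PSO sketch on a one-dimensional quadratic makes option D concrete; each particle's velocity blends its personal best with the swarm's global best (parameter values are common textbook choices, not prescribed by the quiz):

```python
# Minimal Particle Swarm Optimization sketch on f(x) = x^2.
import random

random.seed(0)
f = lambda x: x * x
n, w, c1, c2 = 10, 0.7, 1.5, 1.5
xs = [random.uniform(-5, 5) for _ in range(n)]
vs = [0.0] * n
pbest = xs[:]                                    # personal bests
gbest = min(pbest, key=f)                        # global best

for _ in range(200):
    for i in range(n):
        r1, r2 = random.random(), random.random()
        vs[i] = (w * vs[i]
                 + c1 * r1 * (pbest[i] - xs[i])  # cognitive pull
                 + c2 * r2 * (gbest - xs[i]))    # social pull
        xs[i] += vs[i]
        if f(xs[i]) < f(pbest[i]):
            pbest[i] = xs[i]
    gbest = min(pbest, key=f)
# gbest approaches 0, the global minimum
```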

40 Compared to Grid Search, what is the primary advantage of using a more sophisticated hyperparameter optimization technique like Bayesian Optimization?

Applications of optimization in hyperparameter tuning Medium
A. It evaluates every single point in the hyperparameter space, ensuring complete coverage.
B. It makes informed decisions about which hyperparameters to evaluate next based on past results, often finding a better solution in fewer iterations.
C. It requires no training data to tune the hyperparameters.
D. It is a gradient-based method that is much faster for differentiable objective functions.

41 The bias-variance trade-off is a central concept in machine learning. How can the choice of an optimization algorithm and its configuration (e.g., number of epochs) be framed as an implicit attempt to manage this trade-off, rather than solely minimizing the training loss?

Role of optimization in AI and ML Hard
A. The bias-variance trade-off is a property of the model architecture and data, and is completely independent of the optimization process.
B. Optimization algorithms with adaptive learning rates, like Adam, are designed to eliminate variance completely by finding the true global minimum of the loss function.
C. By stopping the optimization process early (early stopping), we prevent the model from perfectly fitting the training data, which acts as a regularizer to reduce variance at the cost of slightly higher bias.
D. Using a higher learning rate helps the model converge to a lower bias solution faster, minimizing the training loss more effectively.
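The early-stopping idea in option C reduces to a small bookkeeping loop (the validation losses below are a made-up sequence standing in for real training):

```python
# Early-stopping sketch: halt when validation loss stops improving.

val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.68, 0.70, 0.75]
patience, best, wait, stop_epoch = 2, float("inf"), 0, None

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0          # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:          # no improvement for `patience` epochs
            stop_epoch = epoch
            break
# stops at epoch 5 with best validation loss 0.65
```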

42 In deep learning, it has been observed that Stochastic Gradient Descent (SGD) often finds solutions that generalize better than adaptive methods like Adam, despite Adam converging faster. What is the most plausible optimization-centric explanation for this phenomenon?

Role of optimization in AI and ML Hard
A. SGD's inherent noise due to mini-batch sampling helps it escape sharp local minima and settle in flatter, wider minima, which are associated with better generalization.
B. SGD always finds the global minimum, which by definition generalizes best, while Adam gets stuck in poor local minima.
C. The momentum term in SGD is mathematically proven to be a better regularizer than the adaptive learning rate components in Adam.
D. Adam's adaptive learning rates cause the optimization to "overfit" to the training set's loss landscape, finding a numerically perfect but brittle minimum.

43 The training of a Generative Adversarial Network (GAN) is formulated as a minimax optimization problem: min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]. Which statement accurately describes the stability challenges of this optimization process?

Optimization problems in learning systems Hard
A. The problem is unstable because the discriminator's objective function is non-convex, while the generator's is convex, creating an imbalance.
B. Stability is guaranteed if a gradient-free optimizer is used for the generator and a gradient-based one for the discriminator.
C. The optimization is unstable because the gradients for the generator (G) and discriminator (D) can point in opposing directions, leading to oscillations or mode collapse rather than convergence to a stable Nash equilibrium.
D. The minimax formulation is inherently stable, and any observed issues are due to poor hyperparameter choices, not the problem structure itself.

44 Structural Risk Minimization (SRM) extends Empirical Risk Minimization (ERM) by adding a penalty term for model complexity: minimize R_emp(f) + λ·Ω(f), where Ω(f) measures the complexity of f. How does this change the nature of the optimization problem compared to pure ERM?

Optimization problems in learning systems Hard
A. It transforms a potentially ill-posed problem into a well-posed one by introducing a regularization bias, which helps select a simpler function from a set of functions that fit the data equally well.
B. It simplifies the optimization process by reducing the number of parameters that need to be learned.
C. It guarantees that the global minimum of the SRM objective corresponds to the model with the highest accuracy on the test set.
D. It makes the optimization problem non-convex, regardless of the original loss function's convexity.

45 Consider two optimization problems: 1) A linear regression model trained with Mean Squared Error (MSE) loss. 2) A 10-layer deep neural network with ReLU activations trained with Cross-Entropy loss. Which statement best contrasts their optimization landscapes?

Convex vs non-convex optimization in ML Hard
A. The MSE loss for linear regression is a convex function with a single global minimum, whereas the deep network's loss landscape is highly non-convex with numerous local minima, saddle points, and plateaus.
B. The linear regression problem has no local minima, only saddle points, while the deep network has no saddle points, only local minima.
C. Both landscapes are convex, but the deep network's landscape has a much higher dimensionality.
D. The MSE landscape is smooth and differentiable everywhere, while the cross-entropy loss with ReLU activations results in a non-differentiable landscape.

46 You are tasked with optimizing a model where the primary evaluation metric is the F1-score, but the model's architecture makes the F1-score non-differentiable with respect to its parameters. Which optimization strategy is most appropriate and why?

Loss minimization and search-based optimization Hard
A. Use a surrogate loss function like cross-entropy, optimize it with gradient descent, and hope it indirectly maximizes the F1-score.
B. Directly optimize the F1-score using Stochastic Gradient Descent, as it can handle non-differentiable functions.
C. A search-based method like a genetic algorithm, which treats the F1-score as a black-box fitness function and does not require gradients.
D. Approximate the gradient of the F1-score using finite differences and apply a gradient-based optimizer.

47 In which scenario would a gradient-free optimization method like CMA-ES (Covariance Matrix Adaptation Evolution Strategy) be significantly superior to a state-of-the-art gradient-based optimizer like Adam?

Gradient-based vs gradient-free optimization Hard
A. Training a deep convolutional neural network on a large, well-behaved dataset like ImageNet.
B. Fine-tuning the final layer of a pre-trained transformer model.
C. Solving a large-scale linear regression problem with millions of features.
D. Optimizing the parameters of a reinforcement learning policy where the objective function is estimated via noisy simulations and has many local optima.

48 When optimizing a high-dimensional (dimension d) non-convex function, how does the computational complexity of a single iteration of Newton's method compare to a single iteration of a simple Genetic Algorithm (GA) with population size P?

Gradient-based vs gradient-free optimization Hard
A. Newton's method is significantly more expensive, scaling at O(d³) due to Hessian inversion, while the GA's fitness evaluations dominate its cost, scaling at O(P·C), where C is the cost of one fitness evaluation.
B. Both have a similar complexity of O(d) per iteration.
C. Newton's method scales linearly with dimension, O(d), while the GA scales quadratically, O(d²).
D. The GA is always more expensive due to its large population size P, regardless of the dimension d.

49 The objective function for L1-regularized logistic regression is J(w) = NLL(w) + λ‖w‖₁, where NLL(w) is the convex negative log-likelihood. The entire objective function remains convex. Why might a deep neural network with a convex loss function (e.g., MSE) and convex L1 regularization still result in a non-convex optimization problem?

Convex vs non-convex optimization in ML Hard
A. The sum of two convex functions (loss and regularization) is not always convex.
B. Any model with more than one hidden layer is mathematically defined as non-convex.
C. The composition of a convex function with a non-linear function (the neural network mapping) is not guaranteed to be convex.
D. L1 regularization is only convex for linear models.

50 In the context of deep learning, it's widely accepted that finding the global minimum of the loss function is not necessary and may even be detrimental to generalization. What property of the loss landscape is now considered more critical for achieving good performance?

Convex vs non-convex optimization in ML Hard
A. Finding a local minimum with the smallest possible Euclidean norm of the weight vector.
B. Finding a local minimum that lies in a wide, flat basin, as these solutions are more robust to variations between training and test data.
C. Finding the local minimum that is closest to the initialization point to ensure algorithmic stability.
D. Finding a saddle point with the lowest number of negative eigenvalues in its Hessian matrix.

51 An SVM with a linear kernel and a hinge loss function results in a convex optimization problem. A deep neural network (DNN) with a cross-entropy loss function is non-convex. What is a direct practical consequence of this difference for a researcher?

Convex vs non-convex optimization in ML Hard
A. The SVM solution is globally optimal and reproducible; two researchers with the same data will find the same separating hyperplane. The DNN solution is a local minimum, and different initializations can lead to different final models with varying performance.
B. Training the SVM requires more computational power than training the DNN due to the need to solve a quadratic program.
C. The SVM can only solve linearly separable problems, while the DNN can solve any problem.
D. The SVM's convergence is not guaranteed, while the DNN's convergence to a local minimum is always guaranteed with SGD.

52 Neural Architecture Search (NAS) aims to automatically find the best neural network architecture for a given task. This problem involves discrete choices (e.g., type of layer, number of filters) and a vast search space. Why are metaheuristics like evolutionary algorithms or reinforcement learning often preferred over traditional optimization methods for this task?

Need for metaheuristic optimization Hard
A. Metaheuristics are guaranteed to find the globally optimal architecture in a finite amount of time.
B. The architecture search space is a convex space, which is the ideal application area for metaheuristics.
C. The search space is discrete and non-differentiable, making gradient-based methods inapplicable for directly optimizing the architecture choices.
D. Traditional methods like gradient descent are too slow for evaluating a single architecture's performance.

53 When tuning hyperparameters for a deep learning model, how does the exploration-exploitation trade-off manifest differently in Simulated Annealing (SA) versus Particle Swarm Optimization (PSO)?

Need for metaheuristic optimization Hard
A. Both algorithms use the exact same mechanism (a decreasing random mutation rate) to balance exploration and exploitation.
B. SA manages the trade-off via a "temperature" parameter that starts high (more exploration) and gradually cools down (more exploitation), while PSO manages it through the cognitive (personal best) and social (global best) components influencing a particle's velocity.
C. SA relies purely on exploitation by always moving to a better state, while PSO relies purely on exploration through random particle movements.
D. In SA, exploration is driven by population diversity, whereas in PSO, it is driven by the probability of accepting a worse solution.
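The SA side of the correct answer can be sketched directly: worse moves are accepted with probability exp(−Δ/T), so exploration fades as the temperature cools (the objective and schedule below are toy choices):

```python
# Simulated-annealing acceptance sketch on a multimodal toy objective.
import math, random

random.seed(1)
f = lambda x: (x - 3) ** 2 + math.sin(5 * x)
x, T = 0.0, 5.0
while T > 1e-3:
    cand = x + random.uniform(-1, 1)
    delta = f(cand) - f(x)
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = cand                    # accept, possibly a worse move while T is high
    T *= 0.99                       # geometric cooling schedule
# x tends to settle near one of the low minima around x ≈ 2 to 3.5
```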

54 The "No Free Lunch" (NFL) theorem for optimization states that, averaged over all possible problems, every optimization algorithm performs equally well. What is the most critical implication of the NFL theorem when selecting a metaheuristic for a specific machine learning problem like hyperparameter tuning?

Need for metaheuristic optimization Hard
A. One should always choose the simplest metaheuristic, like a random search, as it will perform as well as any other on average.
B. The NFL theorem proves that metaheuristic optimization is not useful for machine learning, and one should stick to gradient-based methods.
C. The theorem implies that for any given problem, all metaheuristics will converge to the same final solution.
D. There is no universally superior metaheuristic; the best choice depends on how well the algorithm's assumptions and search strategy align with the structure of the specific problem's search landscape.

55 Wrapper methods for feature selection (e.g., using recursive feature elimination) frame the problem as a search. Embedded methods (e.g., LASSO L1 regularization) integrate it into model training. Which statement accurately analyzes the trade-offs?

Applications of optimization in feature selection, hyperparameter tuning and model selection Hard
A. Embedded methods are more computationally expensive because they add a non-convex penalty term to the loss function, making it harder to optimize.
B. Wrapper methods are guaranteed to find the globally optimal feature subset, while embedded methods are only heuristic approximations.
C. Wrapper methods are computationally very expensive as they require training and evaluating a model for each feature subset considered, but they can find better subsets by directly optimizing for model performance.
D. Embedded methods can only be used with linear models, while wrapper methods can be used with any model.

56 Bayesian Optimization is a popular technique for hyperparameter tuning. It involves two key components: a surrogate model and an acquisition function. What is the precise optimization problem solved by the acquisition function at each step?

Applications of optimization in feature selection, hyperparameter tuning and model selection Hard
A. It directly optimizes the machine learning model's validation accuracy by taking its gradient with respect to the hyperparameters.
B. It finds the next set of hyperparameters that offers the best trade-off between exploiting known good regions and exploring uncertain regions of the hyperparameter space.
C. It minimizes the prediction error of the surrogate model (e.g., a Gaussian Process) on all previously evaluated hyperparameter sets.
D. It selects the hyperparameter set that is most distant from all previously evaluated sets to ensure maximum exploration.

57 Model selection criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are defined as AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂), where k is the number of parameters, n is the number of data points, and L̂ is the maximized value of the likelihood function. How do these criteria frame model selection as an optimization problem distinct from simple loss minimization?

Applications of optimization in feature selection, hyperparameter tuning and model selection Hard
A. They convert the non-convex model selection problem into a convex one that can be solved with gradient descent.
B. They optimize for training accuracy directly, with a small penalty for the time it takes to train the model.
C. They are not optimization criteria but statistical tests used to verify the significance of a model.
D. They define a multi-objective optimization problem that seeks to simultaneously maximize the model's fit (likelihood) and minimize its complexity (number of parameters), with BIC penalizing complexity more heavily for large n.
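Using the standard definitions AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂), the comparison can be sketched as follows (the log-likelihood values are invented for illustration):

```python
# AIC/BIC sketch: a bigger model with slightly better fit still loses.
import math

def aic(k, log_lik):
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    return k * math.log(n) - 2 * log_lik

n = 1000
small = {"k": 3, "log_lik": -520.0}
big   = {"k": 30, "log_lik": -515.0}

# The small model wins under both criteria: its slightly worse fit is
# outweighed by the complexity penalty, and BIC penalizes k more for large n.
best_aic = min([small, big], key=lambda m: aic(m["k"], m["log_lik"]))
best_bic = min([small, big], key=lambda m: bic(m["k"], n, m["log_lik"]))
```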

58 Consider hyperparameter tuning in a 10-dimensional space where you suspect some hyperparameters have little effect, while a few have strong, complex interactions. Why would Random Search likely outperform Grid Search, and why might Bayesian Optimization outperform both?

Applications of optimization in feature selection, hyperparameter tuning and model selection Hard
A. Random Search is faster because it evaluates fewer points, but its solutions are of lower quality than Grid Search. Bayesian Optimization is the slowest but most accurate.
B. Grid Search is guaranteed to find the global optimum, whereas Random Search and Bayesian Optimization are heuristics with no guarantees.
C. Random Search is more efficient because it doesn't waste evaluations on unimportant dimensions, while Bayesian Optimization is even better as it uses past results to intelligently model the interactions and focus the search on promising regions.
D. Grid Search explores the interactions between all parameters perfectly, making it superior for this scenario. Random Search ignores interactions completely.

59 When using a genetic algorithm (GA) for wrapper-based feature selection, the "fitness function" is a critical component. What is a primary challenge associated with defining and using this fitness function compared to a standard loss function in model training?

Applications of optimization in feature selection, hyperparameter tuning and model selection Hard
A. The fitness function evaluation is extremely expensive, as it requires training and validating an entire ML model for each feature subset (chromosome) in the GA's population.
B. It is difficult to encode a feature subset into a chromosome representation suitable for the GA's crossover and mutation operators.
C. The fitness function is non-differentiable, but this is not a challenge for GAs. The main issue is that it must be a convex function of the feature set.
D. The fitness function is often noisy, meaning evaluating the same feature subset twice can yield different results due to stochastic aspects of model training.

60 Multi-objective optimization is often required for model selection, for instance, minimizing prediction error while also minimizing model inference time. How does the concept of a Pareto front apply in this scenario?

Applications of optimization in feature selection, hyperparameter tuning and model selection Hard
A. The Pareto front is a single, unique model that provides the absolute best error and the absolute best inference time simultaneously.
B. A Pareto front is an optimization algorithm, like a genetic algorithm, specifically designed to solve multi-objective problems.
C. The Pareto front consists of all models for which you cannot improve one objective (e.g., decrease error) without worsening the other objective (e.g., increasing inference time). The final choice is a trade-off selected from this set of optimal solutions.
D. The Pareto front represents all the models that have an unacceptable prediction error, which should be discarded from the search.
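The correct definition in option C can be checked mechanically: a model is on the Pareto front if no other model is at least as good on both objectives and strictly better overall (the models and numbers below are hypothetical):

```python
# Pareto-front sketch for (error, inference_time) pairs, both minimized.

models = {
    "A": (0.10, 50.0),   # low error, slow
    "B": (0.20, 10.0),   # higher error, fast
    "C": (0.15, 20.0),   # in between, still non-dominated
    "D": (0.25, 30.0),   # dominated by C on both objectives
}

def dominates(p, q):
    # p dominates q if it is no worse on both objectives and not identical
    return p[0] <= q[0] and p[1] <= q[1] and p != q

front = {name for name, obj in models.items()
         if not any(dominates(other, obj) for other in models.values())}
# front == {"A", "B", "C"}: D can be improved on both objectives at once
```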