Classification predicts discrete labels (e.g., spam/not spam). Regression predicts continuous values (e.g., house price). Evaluation metrics, loss functions, and model outputs differ (probabilities/classes vs real-valued predictions).
Decomposes prediction error into bias² (error from wrong model assumptions), variance (sensitivity to the training set), and irreducible error. High-bias models underfit; high-variance models overfit. Tradeoff: lowering bias often raises variance and vice versa; choose complexity/regularization to minimize total error.
Linear/logistic regression, decision trees, random forests, gradient boosting (XGBoost/LightGBM/CatBoost), SVM, k-NN, Naive Bayes, neural networks (MLP, CNN, RNN), ensembles (bagging/stacking).
K-means, hierarchical clustering, DBSCAN, Gaussian Mixture Models (EM), PCA, t-SNE/UMAP (visualization), autoencoders, association rule mining.
A decision boundary is the surface in feature space where the classifier’s predicted class changes. For linear models it's a hyperplane; for nonlinear models it can be complex. Visualizing boundaries helps understand model behavior and misclassification regions.
Accuracy, precision, recall (sensitivity), specificity, F1-score, ROC AUC, PR AUC, confusion matrix, Matthews correlation coefficient (MCC), Cohen’s kappa. Choice depends on class balance and business costs.
MSE (mean squared error), RMSE, MAE (mean absolute error), R² (coefficient of determination), adjusted R², MAPE (mean absolute percentage error). Use R² for explained variance; choose the metric aligned with error costs.
A resampling method to estimate model generalization by splitting data into train/validation folds (e.g., k-fold). Reduces variance of performance estimate, helps hyperparameter tuning, and prevents overfitting to a particular train/validation split.
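As a concrete illustration, a minimal k-fold cross-validation sketch with scikit-learn; the iris dataset and logistic-regression model are illustrative choices, not prescribed above.

```python
# Minimal k-fold cross-validation sketch (dataset/model are illustrative).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, validate on the held-out fold, rotate.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)
print(f"mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```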
F1 = 2 * (precision * recall) / (precision + recall) — harmonic mean of precision and recall. Use when you need a balance between precision and recall (especially with class imbalance). Can use Fβ to weight recall more (β>1) or precision more (β<1).
ROC plots true positive rate (recall) vs false positive rate for all thresholds. AUC (area under ROC) summarizes separability (1.0 perfect, 0.5 random). Use AUC to compare classifiers independent of threshold; for imbalanced data PR-AUC can be more informative.
A table showing counts of True Positives, False Positives, True Negatives, False Negatives for classification. It’s the basis for precision, recall, specificity and many other metrics.
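A small sketch, with made-up labels, showing how precision, recall, specificity, and F1 all fall out of the four confusion-matrix counts:

```python
# Derive the headline metrics from confusion-matrix counts (toy labels).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # sensitivity / true positive rate
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, specificity, f1)
```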
Sampling technique that preserves the class distribution in train/test or cross-validation splits—important for classification with imbalanced classes to ensure each fold contains representative proportions.
Rescaling features to comparable ranges (e.g., 0–1 or mean 0/unit variance). Required because many algorithms (SVM, k-NN, gradient descent based models) are sensitive to feature scales; improves convergence and prevents dominance by large-scale features.
Converts categorical variable with k categories into k binary features each indicating presence of a category (or k-1 if dropping one to avoid redundancy). Prevents ordinal assumptions.
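A quick pandas sketch of both variants (the toy color column is made up); the drop_first form is the k−1 encoding that avoids the dummy variable trap discussed below.

```python
# One-hot encoding with k dummies vs k-1 dummies (toy data).
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))                   # k columns
print(pd.get_dummies(df, columns=["color"], drop_first=True))  # k-1 columns
```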
Converting continuous variable into discrete intervals/bins (equal-width, equal-frequency, or supervised binning). Helps with nonlinearity, robustness to outliers, and interpretability.
Drop rows/columns, impute with mean/median/mode, forward/backward fill (time series), KNN imputation, iterative/multivariate imputation (model-based), add missingness-indicator features, or train a model to predict the missing values. Choice depends on the missingness mechanism (MCAR, MAR, MNAR).
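A sketch of two of these strategies with scikit-learn; the tiny array with NaNs is a made-up example.

```python
# Mean imputation vs KNN imputation (toy data with missing entries).
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # column-mean fill
print(KNNImputer(n_neighbors=2).fit_transform(X))       # average of nearest rows
```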
Techniques to reduce number of features while retaining important information — helps with visualization, computation, noise reduction, and mitigating curse of dimensionality. Methods: PCA, LDA, autoencoders, t-SNE (visualization), UMAP.
Principal Component Analysis finds orthogonal directions (principal components) that maximize variance. Compute covariance matrix, eigenvectors/eigenvalues; project data onto top components to reduce dimensionality while preserving variance. Components are linear combinations of original features.
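A from-scratch sketch of exactly those steps (center, covariance, eigendecomposition, projection); the random correlated data is illustrative.

```python
# PCA via eigendecomposition of the covariance matrix (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]      # inject correlation

Xc = X - X.mean(axis=0)                     # 1. center
cov = np.cov(Xc, rowvar=False)              # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigenpairs (symmetric matrix)
order = np.argsort(eigvals)[::-1]           #    sort by variance explained
W = eigvecs[:, order[:2]]                   # 4. top-2 principal components
Z = Xc @ W                                  # 5. project to 2 dimensions

print("explained variance ratio:", (eigvals[order] / eigvals.sum())[:2])
```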
Detect with correlation matrix and Variance Inflation Factor (VIF). Handle by dropping/combining correlated variables, PCA or other dimensionality reduction, regularization (ridge), or using tree-based models which are less sensitive.
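A VIF sketch with statsmodels on made-up, nearly collinear data; VIF_j = 1/(1 − R_j²), where R_j² comes from regressing feature j on the others, and values above roughly 5–10 are commonly flagged.

```python
# Detect multicollinearity with the Variance Inflation Factor (toy data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```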
Collinearity caused by including all one-hot encoded columns (sum of dummies = 1). Avoid by dropping one category (use k-1 dummies) or using regularization.
For independent and identically distributed (i.i.d.) samples with finite mean and variance, the distribution of the (standardized) sample mean approaches a normal distribution as sample size increases, regardless of the original distribution. Foundation for many inferential statistics.
Probability, assuming the null hypothesis is true, of observing data at least as extreme as the observed. Low p-value suggests evidence against null. It’s not the probability that the null is true.
Type I error (α): false positive, rejecting a true null hypothesis.
Type II error (β): false negative, failing to reject a false null hypothesis.
Power = 1 − β is the probability of correctly rejecting a false null. The balance between the two error rates depends on context (significance level, sample size, effect size).
Conditional probability P(A|B) is probability of A given B. Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B) — used to invert conditional probabilities and underlies Bayesian inference and Naive Bayes classifier.
Linearity, independence of errors, homoscedasticity (constant variance), normality of errors (for inference), no perfect multicollinearity, and correct model specification. Violation affects inference and predictions differently.
Simple: one predictor, y = β0 + β1 x + ε. Multiple: several predictors, y = β0 + β1 x1 + ... + βp xp + ε (matrix form y = Xβ + ε). Coefficients are interpreted holding the other predictors constant, and issues like multicollinearity only arise in multiple regression.
High correlation among predictors leads to unstable coefficient estimates and inflated variances. Handle by dropping/recombining features, PCA, ridge regression, or collecting more data.
Linear regression with L2 regularization: minimize RSS + λ Σ β_j^2. Shrinks coefficients towards zero (but not exactly zero), reduces variance and multicollinearity, controlled by λ.
Linear regression with L1 regularization: minimize RSS + λ Σ |β_j|. Encourages sparsity—can set some coefficients exactly to zero—useful for feature selection.
Combines L1 and L2 penalties: α L1 + (1-α) L2. Useful when correlated features exist—balances variable selection (L1) and stability (L2).
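A sketch comparing the three penalties on the same synthetic data with scikit-learn (alpha values are illustrative, not tuned); note that only the L1-based models zero out coefficients.

```python
# Ridge (L2), Lasso (L1), and Elastic Net on the same data (untuned alphas).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for model in (Ridge(alpha=1.0),
              Lasso(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, "zero coefficients:",
          int((model.coef_ == 0).sum()))
```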
Models log-odds of class probability as linear function: log(p/(1-p)) = β0 + βx. Uses sigmoid p = 1 / (1+e^{-z}) to map to [0,1]. Parameters estimated by maximum likelihood (cross-entropy loss). Outputs probabilities and class by threshold.
Correctly specified link (logit), independence of observations, linearity between continuous predictors and log-odds, no multicollinearity, large sample size for stable MLE. Does not assume normality of predictors.
A generative classifier that applies Bayes’ theorem with strong conditional independence assumption: features are independent given class. Fast, works well with small data and high-dimensional discrete features; assumptions rarely hold but often give good results.
Reducing tree complexity to prevent overfitting. Pre-pruning stops growth early (max depth, min samples). Post-pruning grows full tree then trims branches based on validation or cost-complexity (e.g., minimal cost-complexity pruning).
Ensemble of decision trees using bootstrap samples (bagging) and feature randomness at splits. Predictions averaged (regression) or majority voted (classification). Reduces variance, robust to overfitting, and provides feature importance.
Bootstrap Aggregating: train multiple models on bootstrapped samples and aggregate predictions (average/vote). Reduces variance by combining diverse models; works well with high-variance base learners like trees.
Sequential ensemble technique where each new model focuses on correcting errors of previous models. Weak learners are combined additively to produce a strong learner; reduces bias and can achieve high accuracy but may overfit without regularization.
Adaptive Boosting: sequentially fits weak learners; after each iteration, increases weights of misclassified samples so next learner focuses on them. Final prediction is weighted vote of learners. Sensitive to noisy data/outliers but simple and effective with low-capacity learners.
Builds additive model by fitting each new weak learner to the residuals (errors) of the current ensemble under a chosen loss. Each learner reduces remaining error; learning rate and number of iterations control fitting. Conceptually performs gradient descent in function space (fit direction that reduces loss).
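A toy sketch of this loop for squared loss, where the negative gradient is simply the residual; the sine data, stump depth, and learning rate are illustrative assumptions.

```python
# Gradient boosting for squared loss: fit each stump to current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

lr, n_rounds = 0.1, 100
pred = np.full_like(y, y.mean())          # start from a constant model
for _ in range(n_rounds):
    residual = y - pred                   # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * stump.predict(X)         # additive update, shrunk by lr

print("final train MSE:", np.mean((y - pred) ** 2))
```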
An optimized, regularized implementation of gradient boosting with enhancements: second-order loss approximation, shrinkage (learning rate), column subsampling, handling missing values, tree pruning, parallelization and cache-aware algorithms. Built-in regularization reduces overfitting and speeds training.
Partition data into k clusters by minimizing within-cluster variance; iteratively assign points to nearest centroid and update centroids.
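A minimal NumPy sketch of the assign/update loop (random initialization; production implementations use k-means++ and multiple restarts). The two-blob data and k are made up.

```python
# Lloyd's algorithm for k-means: alternate assignment and centroid update.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid per point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```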
Elbow method (plot inertia vs k), silhouette score (higher better), gap statistic, BIC/AIC for mixture models, domain knowledge and stability analysis. Combine multiple methods and visualize results.
Builds nested clusters as dendrogram: agglomerative (bottom-up, merge closest pairs) or divisive (top-down, split). No need to pre-specify k; useful for small datasets and exploring cluster hierarchy.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN): groups points in high-density regions (at least minPts neighbors within an ε radius) and labels low-density points as noise. Good for arbitrary-shaped clusters and noisy data; struggles with clusters of varying density and with high dimensionality.
Probabilistic model representing data as a mixture of Gaussian components. Each point has soft assignment (probability) to each component. More flexible than k-means (elliptical clusters), parameters learned typically via EM.
Iterative method to estimate parameters in models with latent variables. E-step: compute the posterior of the latent variables (and the expected complete-data log-likelihood) under the current parameters. M-step: update the parameters to maximize that expectation. Alternate until the log-likelihood converges; each iteration never decreases it.
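A compact EM sketch for the two-component 1-D Gaussian mixture of the previous card; the data generation and initial parameter guesses are arbitrary assumptions.

```python
# EM for a two-component 1-D Gaussian mixture (toy data, arbitrary init).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

w = np.array([0.5, 0.5])                    # mixture weights
mu = np.array([-1.0, 1.0])                  # component means
sigma = np.array([1.0, 1.0])                # component stds
for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = w * norm.pdf(x[:, None], mu, sigma)       # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted parameter updates
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", w, "means:", mu, "stds:", sigma)
```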
Dimensionality reduction reduces features while preserving structure. PCA: unsupervised, finds directions of max variance. LDA: supervised, finds linear combinations maximizing class separability. PCA ignores labels; LDA uses them.
t-distributed Stochastic Neighbor Embedding: nonlinear technique that embeds high-dimensional data into 2–3 dimensions while preserving local neighborhood structure; excellent for visualizing clusters. Not intended as general dimensionality reduction for downstream tasks: results are sensitive to the perplexity hyperparameter, and the embedding is nonparametric (no direct mapping for new points).
Learning by trial-and-error: an agent takes actions in an environment, receives rewards, and learns a policy to maximize cumulative reward over time.
Agent must choose between exploiting known high-reward actions and exploring uncertain actions that may yield higher long-term reward. Strategies: ε-greedy, softmax, upper confidence bounds (UCB), Thompson sampling.
Model-free RL algorithm that learns the action-value function Q(s,a): the expected return for taking action a in state s and following the optimal policy thereafter. Update rule, bootstrapping from observed rewards and next-state values via the Bellman optimality equation: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)], repeated until convergence.
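A tabular sketch of this update on a tiny made-up chain environment (five states in a row, reward only at the right end), using the ε-greedy exploration strategy from the previous card:

```python
# Tabular Q-learning on a toy 5-state chain; reward 1 at the rightmost state.
import numpy as np

n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:             # rightmost state is terminal
        # ε-greedy: explore with probability eps, otherwise exploit
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Bellman optimality update, bootstrapping from the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # the greedy policy should pick action 1 (right) in every state
```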
Formal RL framework tuple (S, A, P, R, γ) where S states, A actions, P(s'|s,a) transition probabilities, R(s,a) reward function, and γ discount factor. Markov property: next state depends only on current state and action.
For an estimator f̂(x) of the true function f(x), the expected squared error decomposes as:
E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²
where Bias[f̂(x)] = E[f̂(x)] − f(x), Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²], and σ² is the irreducible noise variance.
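A small simulation sketch of the decomposition under assumed settings (noisy sine target, polynomial fits of two degrees): refit on many fresh training sets and estimate bias² and variance at fixed test points.

```python
# Estimate bias^2 and variance empirically for rigid vs flexible models.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_test = np.sin(2 * np.pi * x_test)          # true function at test points
sigma = 0.3                                   # irreducible noise std

for degree in (1, 9):                         # rigid vs flexible polynomial
    preds = []
    for _ in range(500):                      # many independent training sets
        x = rng.uniform(0, 1, 30)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, 30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)                   # shape (500, 50)
    bias2 = ((preds.mean(axis=0) - f_test) ** 2).mean()
    var = preds.var(axis=0).mean()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```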
Large gap between train and validation performance, learning curves showing diverging error, high variance in cross-validation, poor performance on unseen test or holdout, and complex model with small data.
Combine multiple models (bagging, boosting, stacking) to improve robustness and performance. Ensembles reduce variance (bagging), bias (boosting), or exploit complementary strengths (stacking). Diversity among base learners is key.
SVM finds hyperplane maximizing margin between classes. Kernels implicitly map inputs into higher-dimensional space to make data linearly separable without computing mapping explicitly—only kernel function values K(x_i,x_j) are needed (e.g., RBF, polynomial).
Compute inner products in feature-mapped space using a kernel function so you can train in high-dimensional space efficiently without explicit transformation. Enables complex decision boundaries with convex optimization.
Maximizing margin reduces generalization error (margin implies confidence); support vectors define the boundary. Larger margin → better expected generalization; soft-margin SVM trades margin for misclassification using parameter C.
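A sketch of the soft-margin trade-off with scikit-learn (RBF kernel, made-up two-moons data): small C buys a wider margin with more support vectors, large C fits the training points harder.

```python
# Soft-margin SVM: sweep C and watch support-vector count and accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"test accuracy={clf.score(X_te, y_te):.3f}")
```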
For binary learners extended to multi-class: One-vs-Rest (OvR) trains k classifiers, each separating one class from all others, and predicts the class with the highest score; One-vs-One (OvO) trains k(k−1)/2 pairwise classifiers and predicts by majority vote. Some models (multinomial logistic regression, trees, softmax neural networks) handle multiple classes natively.
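A short scikit-learn sketch of both strategies wrapping a binary linear SVM; the iris dataset is illustrative (with k = 3, both strategies happen to build three classifiers).

```python
# One-vs-Rest vs One-vs-One around a binary base learner.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                        # 3 classes
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))   # k classifiers
print(len(ovo.estimators_))   # k(k-1)/2 classifiers
```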
High-dimensional spaces become sparse; distances lose meaning; number of samples needed grows exponentially; models overfit easily. Affects nearest-neighbor, density estimation, and increases computational cost.
Feature selection (filter/wrapper/embedded), dimensionality reduction (PCA, autoencoders), regularization (L1/L2), tree-based methods, embeddings, feature hashing, and collecting more data.
Euclidean, Manhattan (L1), Minkowski (Lp), Cosine similarity (angle-based), Mahalanobis (covariance-aware), Hamming (categorical/binary), Jaccard (set similarity).
Distance accounting for feature covariance: d_M(x) = sqrt((x-μ)^T Σ^{-1} (x-μ)). Used to detect multivariate outliers and for Gaussian-based models where features are correlated.
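A NumPy sketch on made-up correlated data: a point that goes against the correlation is farther in Mahalanobis distance than an equally distant point that follows it.

```python
# Mahalanobis distance accounts for correlation between features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(np.array([2.0, -2.0]), mu, cov_inv))  # against the trend: large
print(mahalanobis(np.array([2.0, 2.0]), mu, cov_inv))   # along the trend: smaller
```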
Formulate null and alternative hypotheses about data or model (e.g., no difference between algorithms). Compute a test statistic and p-value; reject the null if p < α (the chosen significance level). Used in model comparison, A/B testing, and significance testing of features.
Choose parameters that maximize probability (likelihood) of observed data: θ_MLE = argmax_θ P(data | θ). Widely used for parameter estimation; asymptotically unbiased and efficient under regularity conditions.
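A tiny numeric sketch of a classic result: for i.i.d. Gaussian data, the MLE of the mean is the sample mean and the MLE of the variance is the 1/n (biased) sample variance. The simulated data is illustrative.

```python
# Gaussian MLE: sample mean and 1/n variance recover the true parameters.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_mle = data.mean()
var_mle = ((data - mu_mle) ** 2).mean()   # divides by n, not n-1
print(mu_mle, np.sqrt(var_mle))           # ≈ 5.0 and ≈ 2.0
```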
MAP maximizes posterior P(θ|data) ∝ P(data|θ) P(θ)—combines likelihood with prior. Equivalent to MLE with regularization when prior is expressed as penalty (e.g., Gaussian prior → L2).
Kullback–Leibler divergence measures how one probability distribution Q diverges from a true distribution P: D_KL(P||Q) = Σ P(x) log(P(x)/Q(x)). Non-symmetric and non-negative; zero only when the distributions are equal. Used in variational inference and loss design.
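A short sketch computing the formula for two made-up discrete distributions, showing the asymmetry:

```python
# D_KL(P||Q) for discrete distributions; note it is not symmetric.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl(p, q), kl(q, p))   # two different non-negative values
```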
For convex function φ, φ(E[X]) ≤ E[φ(X)]. For concave functions, inequality flips. Fundamental in bounding expectations and deriving variational bounds.
As sample size n → ∞, sample average converges (in probability or almost surely) to expected value. Justifies using sample means to estimate population means.
A stochastic process where the probability of the next state depends only on the current state (Markov property). Used to model sequences and to derive long-run distributions.
Probabilistic models with hidden (latent) states that follow a Markov chain and observable emissions conditioned on states. Inference tasks: decoding (Viterbi), likelihood (Forward), parameter learning (Baum–Welch/EM).
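A compact Viterbi decoding sketch in log space; the two-state HMM parameters and observation sequence are made-up illustrative values.

```python
# Viterbi decoding: most likely hidden-state path for an observation sequence.
import numpy as np

pi = np.log([0.6, 0.4])                     # initial state log-probs
A = np.log([[0.7, 0.3],                     # transition log-probs
            [0.4, 0.6]])
B = np.log([[0.5, 0.4, 0.1],                # emission log-probs P(obs | state)
            [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                          # observed symbol indices

n_states, T = A.shape[0], len(obs)
delta = np.zeros((T, n_states))             # best log-prob ending in each state
psi = np.zeros((T, n_states), dtype=int)    # backpointers
delta[0] = pi + B[:, obs[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] + A      # scores[i, j]: from state i to j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + B[:, obs[t]]

path = [int(delta[-1].argmax())]            # backtrack from the best end state
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
print(path[::-1])
```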
Steps: define objective (implicit vs explicit feedback), data collection, exploratory data analysis, choose algorithm (collaborative filtering, matrix factorization, content-based, hybrid), deal with cold-start, evaluate offline (RMSE, ranking metrics like NDCG, MAP) and online (A/B), scale/serve, monitor and iterate.
Signs: unrealistically high validation performance, features that include future information or are derived from the target, leakage through preprocessing done before splitting, timestamp mistakes. Detect by auditing features, using temporal splits, and checking the importance of suspicious features.
Resampling: oversample minority (SMOTE, ADASYN), undersample majority, or combine. Algorithmic: class weights, cost-sensitive learning, ensemble techniques (balanced bagging), threshold tuning, anomaly detection approach, and evaluation metrics like PR-AUC or F1.
Track input data distributions, prediction distributions, performance metrics on periodic labeled samples, detect data/concept drift (statistical tests, population stability index), set alerting thresholds, maintain logging, automated retraining pipelines, and business-level KPIs.
Simpler models (linear, shallow trees) are highly interpretable but may have higher bias and lower accuracy on complex data. Complex models (ensembles, deep nets) often achieve higher accuracy but are less transparent. Trade-offs: choose interpretable model when explanations/regulation/diagnostics matter; use complex models for maximum performance and apply interpretability tools (SHAP, LIME, rule extraction, surrogate models) if needed.