Classification predicts discrete labels (e.g., spam/not spam). Regression predicts continuous values (e.g., house price). Evaluation metrics, loss functions, and model outputs differ (probabilities/classes vs real-valued predictions).
Decomposes prediction error into bias² (error from wrong model assumptions), variance (sensitivity to the training set), and irreducible error. High-bias models underfit; high-variance models overfit. Tradeoff: lowering bias often raises variance and vice versa; choose complexity/regularization to minimize total error.
Linear/logistic regression, decision trees, random forests, gradient boosting (XGBoost/LightGBM/CatBoost), SVM, k-NN, Naive Bayes, neural networks (MLP, CNN, RNN), ensembles (bagging/stacking).
K-means, hierarchical clustering, DBSCAN, Gaussian Mixture Models (EM), PCA, t-SNE/UMAP (visualization), autoencoders, association rule mining.
A decision boundary is the surface in feature space where the classifier’s predicted class changes. For linear models it's a hyperplane; for nonlinear models it can be complex. Visualizing boundaries helps understand model behavior and misclassification regions.
Accuracy, precision, recall (sensitivity), specificity, F1-score, ROC AUC, PR AUC, confusion matrix, Matthews correlation coefficient (MCC), Cohen’s kappa. Choice depends on class balance and business costs.
MSE (mean squared error), RMSE, MAE (mean absolute error), R² (coefficient of determination), adjusted R², MAPE (mean absolute percentage error). Use R² for explained variance; choose the metric aligned with error costs.
A resampling method to estimate model generalization by splitting data into train/validation folds (e.g., k-fold). Reduces variance of performance estimate, helps hyperparameter tuning, and prevents overfitting to a particular train/validation split.
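As a concrete illustration, a minimal k-fold cross-validation sketch with scikit-learn; the iris dataset and logistic-regression model are illustrative choices, not prescribed above.

```python
# Minimal k-fold cross-validation sketch (dataset/model are illustrative).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, validate on the held-out fold, rotate.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)
print(f"mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```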
F1 = 2 * (precision * recall) / (precision + recall) — harmonic mean of precision and recall. Use when you need a balance between precision and recall (especially with class imbalance). Can use Fβ to weight recall more (β>1) or precision more (β<1).
ROC plots true positive rate (recall) vs false positive rate for all thresholds. AUC (area under ROC) summarizes separability (1.0 perfect, 0.5 random). Use AUC to compare classifiers independent of threshold; for imbalanced data PR-AUC can be more informative.
A table showing counts of True Positives, False Positives, True Negatives, False Negatives for classification. It’s the basis for precision, recall, specificity and many other metrics.
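A small sketch, with made-up labels, showing how precision, recall, specificity, and F1 all fall out of the four confusion-matrix counts:

```python
# Derive the headline metrics from confusion-matrix counts (toy labels).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # sensitivity / true positive rate
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, specificity, f1)
```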
Sampling technique that preserves the class distribution in train/test or cross-validation splits—important for classification with imbalanced classes to ensure each fold contains representative proportions.
Rescaling features to comparable ranges (e.g., 0–1 or mean 0/unit variance). Required because many algorithms (SVM, k-NN, gradient descent based models) are sensitive to feature scales; improves convergence and prevents dominance by large-scale features.
Converts categorical variable with k categories into k binary features each indicating presence of a category (or k-1 if dropping one to avoid redundancy). Prevents ordinal assumptions.
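A quick pandas sketch of both variants (the toy color column is made up); the drop_first form is the k−1 encoding that avoids the dummy variable trap discussed below.

```python
# One-hot encoding with k dummies vs k-1 dummies (toy data).
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))                   # k columns
print(pd.get_dummies(df, columns=["color"], drop_first=True))  # k-1 columns
```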
Converting continuous variable into discrete intervals/bins (equal-width, equal-frequency, or supervised binning). Helps with nonlinearity, robustness to outliers, and interpretability.
Drop rows/columns, impute with mean/median/mode, forward/backward fill (time series), KNN imputation, iterative/multivariate imputation (model-based), add missingness-indicator features, or train a model to predict the missing values. Choice depends on the missingness mechanism (MCAR, MAR, MNAR).
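A sketch of two of these strategies with scikit-learn; the tiny array with NaNs is a made-up example.

```python
# Mean imputation vs KNN imputation (toy data with missing entries).
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # column-mean fill
print(KNNImputer(n_neighbors=2).fit_transform(X))       # average of nearest rows
```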
Techniques to reduce number of features while retaining important information — helps with visualization, computation, noise reduction, and mitigating curse of dimensionality. Methods: PCA, LDA, autoencoders, t-SNE (visualization), UMAP.
Principal Component Analysis finds orthogonal directions (principal components) that maximize variance. Compute covariance matrix, eigenvectors/eigenvalues; project data onto top components to reduce dimensionality while preserving variance. Components are linear combinations of original features.
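A from-scratch sketch of exactly those steps (center, covariance, eigendecomposition, projection); the random correlated data is illustrative.

```python
# PCA via eigendecomposition of the covariance matrix (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]      # inject correlation

Xc = X - X.mean(axis=0)                     # 1. center
cov = np.cov(Xc, rowvar=False)              # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigenpairs (symmetric matrix)
order = np.argsort(eigvals)[::-1]           #    sort by variance explained
W = eigvecs[:, order[:2]]                   # 4. top-2 principal components
Z = Xc @ W                                  # 5. project to 2 dimensions

print("explained variance ratio:", (eigvals[order] / eigvals.sum())[:2])
```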
Detect with correlation matrix and Variance Inflation Factor (VIF). Handle by dropping/combining correlated variables, PCA or other dimensionality reduction, regularization (ridge), or using tree-based models which are less sensitive.
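A VIF sketch with statsmodels on made-up, nearly collinear data; VIF_j = 1/(1 − R_j²), where R_j² comes from regressing feature j on the others, and values above roughly 5–10 are commonly flagged.

```python
# Detect multicollinearity with the Variance Inflation Factor (toy data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```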
Collinearity caused by including all one-hot encoded columns (sum of dummies = 1). Avoid by dropping one category (use k-1 dummies) or using regularization.
For independent and identically distributed (i.i.d.) samples with finite mean and variance, the distribution of the (standardized) sample mean approaches a normal distribution as sample size increases, regardless of the original distribution. Foundation for many inferential statistics.
Probability, assuming the null hypothesis is true, of observing data at least as extreme as the observed. Low p-value suggests evidence against null. It’s not the probability that the null is true.
Type I error (α): false positive, rejecting a true null hypothesis.
Type II error (β): false negative, failing to reject a false null hypothesis.
Power = 1 − β is the probability of correctly rejecting a false null. The balance between the two error rates depends on context (significance level, sample size, effect size).
Conditional probability P(A|B) is probability of A given B. Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B) — used to invert conditional probabilities and underlies Bayesian inference and Naive Bayes classifier.
Linearity, independence of errors, homoscedasticity (constant variance), normality of errors (for inference), no perfect multicollinearity, and correct model specification. Violation affects inference and predictions differently.
Simple: one predictor, y = β0 + β1 x + ε. Multiple: several predictors, y = β0 + β1 x1 + ... + βp xp + ε (matrix form y = Xβ + ε). Coefficients are interpreted holding the other predictors constant, and issues like multicollinearity only arise in multiple regression.
High correlation among predictors leads to unstable coefficient estimates and inflated variances. Handle by dropping/recombining features, PCA, ridge regression, or collecting more data.
Linear regression with L2 regularization: minimize RSS + λ Σ β_j^2. Shrinks coefficients towards zero (but not exactly zero), reduces variance and multicollinearity, controlled by λ.
Linear regression with L1 regularization: minimize RSS + λ Σ |β_j|. Encourages sparsity—can set some coefficients exactly to zero—useful for feature selection.
Combines L1 and L2 penalties: α L1 + (1-α) L2. Useful when correlated features exist—balances variable selection (L1) and stability (L2).
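A sketch comparing the three penalties on the same synthetic data with scikit-learn (alpha values are illustrative, not tuned); note that only the L1-based models zero out coefficients.

```python
# Ridge (L2), Lasso (L1), and Elastic Net on the same data (untuned alphas).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for model in (Ridge(alpha=1.0),
              Lasso(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, "zero coefficients:",
          int((model.coef_ == 0).sum()))
```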
Models log-odds of class probability as linear function: log(p/(1-p)) = β0 + βx. Uses sigmoid p = 1 / (1+e^{-z}) to map to [0,1]. Parameters estimated by maximum likelihood (cross-entropy loss). Outputs probabilities and class by threshold.
Correctly specified link (logit), independence of observations, linearity between continuous predictors and log-odds, no multicollinearity, large sample size for stable MLE. Does not assume normality of predictors.
A generative classifier that applies Bayes’ theorem with strong conditional independence assumption: features are independent given class. Fast, works well with small data and high-dimensional discrete features; assumptions rarely hold but often give good results.
Reducing tree complexity to prevent overfitting. Pre-pruning stops growth early (max depth, min samples). Post-pruning grows full tree then trims branches based on validation or cost-complexity (e.g., minimal cost-complexity pruning).
Ensemble of decision trees using bootstrap samples (bagging) and feature randomness at splits. Predictions averaged (regression) or majority voted (classification). Reduces variance, robust to overfitting, and provides feature importance.
Bootstrap Aggregating: train multiple models on bootstrapped samples and aggregate predictions (average/vote). Reduces variance by combining diverse models; works well with high-variance base learners like trees.
Sequential ensemble technique where each new model focuses on correcting errors of previous models. Weak learners are combined additively to produce a strong learner; reduces bias and can achieve high accuracy but may overfit without regularization.
Adaptive Boosting: sequentially fits weak learners; after each iteration, increases weights of misclassified samples so next learner focuses on them. Final prediction is weighted vote of learners. Sensitive to noisy data/outliers but simple and effective with low-capacity learners.
Builds additive model by fitting each new weak learner to the residuals (errors) of the current ensemble under a chosen loss. Each learner reduces remaining error; learning rate and number of iterations control fitting. Conceptually performs gradient descent in function space (fit direction that reduces loss).
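A toy sketch of this loop for squared loss, where the negative gradient is simply the residual; the sine data, stump depth, and learning rate are illustrative assumptions.

```python
# Gradient boosting for squared loss: fit each stump to current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

lr, n_rounds = 0.1, 100
pred = np.full_like(y, y.mean())          # start from a constant model
for _ in range(n_rounds):
    residual = y - pred                   # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * stump.predict(X)         # additive update, shrunk by lr

print("final train MSE:", np.mean((y - pred) ** 2))
```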
An optimized, regularized implementation of gradient boosting with enhancements: second-order loss approximation, shrinkage (learning rate), column subsampling, handling missing values, tree pruning, parallelization and cache-aware algorithms. Built-in regularization reduces overfitting and speeds training.
Partition data into k clusters by minimizing within-cluster variance; iteratively assign points to nearest centroid and update centroids.
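A minimal NumPy sketch of the assign/update loop (random initialization; production implementations use k-means++ and multiple restarts). The two-blob data and k are made up.

```python
# Lloyd's algorithm for k-means: alternate assignment and centroid update.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid per point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```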
Elbow method (plot inertia vs k), silhouette score (higher better), gap statistic, BIC/AIC for mixture models, domain knowledge and stability analysis. Combine multiple methods and visualize results.
Builds nested clusters as dendrogram: agglomerative (bottom-up, merge closest pairs) or divisive (top-down, split). No need to pre-specify k; useful for small datasets and exploring cluster hierarchy.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN): groups points in high-density regions (at least minPts neighbors within an ε radius) and labels low-density points as noise. Good for arbitrary-shaped clusters and noisy data; struggles with clusters of varying density and with high dimensionality.
Probabilistic model representing data as a mixture of Gaussian components. Each point has soft assignment (probability) to each component. More flexible than k-means (elliptical clusters), parameters learned typically via EM.
Iterative method to estimate parameters in models with latent variables. E-step: compute the posterior of the latent variables (and the expected complete-data log-likelihood) under the current parameters. M-step: update the parameters to maximize that expectation. Alternate until the log-likelihood converges; each iteration never decreases it.
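A compact EM sketch for the two-component 1-D Gaussian mixture of the previous card; the data generation and initial parameter guesses are arbitrary assumptions.

```python
# EM for a two-component 1-D Gaussian mixture (toy data, arbitrary init).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

w = np.array([0.5, 0.5])                    # mixture weights
mu = np.array([-1.0, 1.0])                  # component means
sigma = np.array([1.0, 1.0])                # component stds
for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = w * norm.pdf(x[:, None], mu, sigma)       # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted parameter updates
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", w, "means:", mu, "stds:", sigma)
```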
Dimensionality reduction reduces features while preserving structure. PCA: unsupervised, finds directions of max variance. LDA: supervised, finds linear combinations maximizing class separability. PCA ignores labels; LDA uses them.
t-distributed Stochastic Neighbor Embedding: nonlinear technique that embeds high-dimensional data into 2–3 dimensions while preserving local neighborhood structure; excellent for visualizing clusters. Not intended as general dimensionality reduction for downstream tasks: results are sensitive to the perplexity hyperparameter, and the embedding is nonparametric (no direct mapping for new points).
Learning by trial-and-error: an agent takes actions in an environment, receives rewards, and learns a policy to maximize cumulative reward over time.
Agent must choose between exploiting known high-reward actions and exploring uncertain actions that may yield higher long-term reward. Strategies: ε-greedy, softmax, upper confidence bounds (UCB), Thompson sampling.
Model-free RL algorithm that learns the action-value function Q(s,a): the expected return for taking action a in state s and following the optimal policy thereafter. Update rule, bootstrapping from observed rewards and next-state values via the Bellman optimality equation: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)], repeated until convergence.
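A tabular sketch of this update on a tiny made-up chain environment (five states in a row, reward only at the right end), using the ε-greedy exploration strategy from the previous card:

```python
# Tabular Q-learning on a toy 5-state chain; reward 1 at the rightmost state.
import numpy as np

n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:             # rightmost state is terminal
        # ε-greedy: explore with probability eps, otherwise exploit
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Bellman optimality update, bootstrapping from the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # the greedy policy should pick action 1 (right) in every state
```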
Formal RL framework tuple (S, A, P, R, γ) where S states, A actions, P(s'|s,a) transition probabilities, R(s,a) reward function, and γ discount factor. Markov property: next state depends only on current state and action.
For an estimator f̂(x) of the true function f(x), the expected squared error decomposes as:
E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²
where Bias[f̂(x)] = E[f̂(x)] − f(x), Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²], and σ² is the irreducible noise variance.
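A small simulation sketch of the decomposition under assumed settings (noisy sine target, polynomial fits of two degrees): refit on many fresh training sets and estimate bias² and variance at fixed test points.

```python
# Estimate bias^2 and variance empirically for rigid vs flexible models.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_test = np.sin(2 * np.pi * x_test)          # true function at test points
sigma = 0.3                                   # irreducible noise std

for degree in (1, 9):                         # rigid vs flexible polynomial
    preds = []
    for _ in range(500):                      # many independent training sets
        x = rng.uniform(0, 1, 30)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, 30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)                   # shape (500, 50)
    bias2 = ((preds.mean(axis=0) - f_test) ** 2).mean()
    var = preds.var(axis=0).mean()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```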
Large gap between train and validation performance, learning curves showing diverging error, high variance in cross-validation, poor performance on unseen test or holdout, and complex model with small data.
Combine multiple models (bagging, boosting, stacking) to improve robustness and performance. Ensembles reduce variance (bagging), bias (boosting), or exploit complementary strengths (stacking). Diversity among base learners is key.
SVM finds hyperplane maximizing margin between classes. Kernels implicitly map inputs into higher-dimensional space to make data linearly separable without computing mapping explicitly—only kernel function values K(x_i,x_j) are needed (e.g., RBF, polynomial).
Compute inner products in feature-mapped space using a kernel function so you can train in high-dimensional space efficiently without explicit transformation. Enables complex decision boundaries with convex optimization.
Maximizing margin reduces generalization error (margin implies confidence); support vectors define the boundary. Larger margin → better expected generalization; soft-margin SVM trades margin for misclassification using parameter C.
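A sketch of the soft-margin trade-off with scikit-learn (RBF kernel, made-up two-moons data): small C buys a wider margin with more support vectors, large C fits the training points harder.

```python
# Soft-margin SVM: sweep C and watch support-vector count and accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"test accuracy={clf.score(X_te, y_te):.3f}")
```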
For binary learners extended to multi-class: One-vs-Rest (OvR) trains k classifiers, each separating one class from all others, and predicts the class with the highest score; One-vs-One (OvO) trains k(k−1)/2 pairwise classifiers and predicts by majority vote. Some models (multinomial logistic regression, trees, softmax neural networks) handle multiple classes natively.
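A short scikit-learn sketch of both strategies wrapping a binary linear SVM; the iris dataset is illustrative (with k = 3, both strategies happen to build three classifiers).

```python
# One-vs-Rest vs One-vs-One around a binary base learner.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                        # 3 classes
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))   # k classifiers
print(len(ovo.estimators_))   # k(k-1)/2 classifiers
```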
High-dimensional spaces become sparse; distances lose meaning; number of samples needed grows exponentially; models overfit easily. Affects nearest-neighbor, density estimation, and increases computational cost.
Feature selection (filter/wrapper/embedded), dimensionality reduction (PCA, autoencoders), regularization (L1/L2), tree-based methods, embeddings, feature hashing, and collecting more data.
Euclidean, Manhattan (L1), Minkowski (Lp), Cosine similarity (angle-based), Mahalanobis (covariance-aware), Hamming (categorical/binary), Jaccard (set similarity).
Distance accounting for feature covariance: d_M(x) = sqrt((x-μ)^T Σ^{-1} (x-μ)). Used to detect multivariate outliers and for Gaussian-based models where features are correlated.
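A NumPy sketch on made-up correlated data: a point that goes against the correlation is farther in Mahalanobis distance than an equally distant point that follows it.

```python
# Mahalanobis distance accounts for correlation between features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(np.array([2.0, -2.0]), mu, cov_inv))  # against the trend: large
print(mahalanobis(np.array([2.0, 2.0]), mu, cov_inv))   # along the trend: smaller
```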
Formulate null and alternative hypotheses about data or model (e.g., no difference between algorithms). Compute a test statistic and p-value; reject the null if p < α (the chosen significance level). Used in model comparison, A/B testing, and significance testing of features.
Choose parameters that maximize probability (likelihood) of observed data: θ_MLE = argmax_θ P(data | θ). Widely used for parameter estimation; asymptotically unbiased and efficient under regularity conditions.
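A tiny numeric sketch of a classic result: for i.i.d. Gaussian data, the MLE of the mean is the sample mean and the MLE of the variance is the 1/n (biased) sample variance. The simulated data is illustrative.

```python
# Gaussian MLE: sample mean and 1/n variance recover the true parameters.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_mle = data.mean()
var_mle = ((data - mu_mle) ** 2).mean()   # divides by n, not n-1
print(mu_mle, np.sqrt(var_mle))           # ≈ 5.0 and ≈ 2.0
```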
MAP maximizes posterior P(θ|data) ∝ P(data|θ) P(θ)—combines likelihood with prior. Equivalent to MLE with regularization when prior is expressed as penalty (e.g., Gaussian prior → L2).
Kullback–Leibler divergence measures how one probability distribution Q diverges from a true distribution P: D_KL(P||Q) = Σ P(x) log(P(x)/Q(x)). Non-symmetric and non-negative; zero only when the distributions are equal. Used in variational inference and loss design.
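A short sketch computing the formula for two made-up discrete distributions, showing the asymmetry:

```python
# D_KL(P||Q) for discrete distributions; note it is not symmetric.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl(p, q), kl(q, p))   # two different non-negative values
```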
For convex function φ, φ(E[X]) ≤ E[φ(X)]. For concave functions, inequality flips. Fundamental in bounding expectations and deriving variational bounds.
As sample size n → ∞, sample average converges (in probability or almost surely) to expected value. Justifies using sample means to estimate population means.
A stochastic process where the probability of the next state depends only on the current state (Markov property). Used to model sequences and to derive long-run distributions.
Probabilistic models with hidden (latent) states that follow a Markov chain and observable emissions conditioned on states. Inference tasks: decoding (Viterbi), likelihood (Forward), parameter learning (Baum–Welch/EM).
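A compact Viterbi decoding sketch in log space; the two-state HMM parameters and observation sequence are made-up illustrative values.

```python
# Viterbi decoding: most likely hidden-state path for an observation sequence.
import numpy as np

pi = np.log([0.6, 0.4])                     # initial state log-probs
A = np.log([[0.7, 0.3],                     # transition log-probs
            [0.4, 0.6]])
B = np.log([[0.5, 0.4, 0.1],                # emission log-probs P(obs | state)
            [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                          # observed symbol indices

n_states, T = A.shape[0], len(obs)
delta = np.zeros((T, n_states))             # best log-prob ending in each state
psi = np.zeros((T, n_states), dtype=int)    # backpointers
delta[0] = pi + B[:, obs[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] + A      # scores[i, j]: from state i to j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + B[:, obs[t]]

path = [int(delta[-1].argmax())]            # backtrack from the best end state
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
print(path[::-1])
```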
Steps: define objective (implicit vs explicit feedback), data collection, exploratory data analysis, choose algorithm (collaborative filtering, matrix factorization, content-based, hybrid), deal with cold-start, evaluate offline (RMSE, ranking metrics like NDCG, MAP) and online (A/B), scale/serve, monitor and iterate.
Signs: unrealistically high validation performance, features that include future information or are derived from the target, leakage through preprocessing done before splitting, timestamp mistakes. Detect by auditing features, using temporal splits, and checking the importance of suspicious features.
Resampling: oversample minority (SMOTE, ADASYN), undersample majority, or combine. Algorithmic: class weights, cost-sensitive learning, ensemble techniques (balanced bagging), threshold tuning, anomaly detection approach, and evaluation metrics like PR-AUC or F1.
Track input data distributions, prediction distributions, performance metrics on periodic labeled samples, detect data/concept drift (statistical tests, population stability index), set alerting thresholds, maintain logging, automated retraining pipelines, and business-level KPIs.
Simpler models (linear, shallow trees) are highly interpretable but may have higher bias and lower accuracy on complex data. Complex models (ensembles, deep nets) often achieve higher accuracy but are less transparent. Trade-offs: choose interpretable model when explanations/regulation/diagnostics matter; use complex models for maximum performance and apply interpretability tools (SHAP, LIME, rule extraction, surrogate models) if needed.