Top Deep Learning Interview Questions & Answers
Table of Contents
- What is a neural network, and how does it work?
- Explain the difference between deep learning and machine learning.
- What is an activation function? Name a few commonly used activation functions.
- Why is non-linearity important in neural networks?
- Define epoch, batch, and iteration in the context of training a neural network.
- What is the purpose of a loss function? Give examples.
- Explain the concept of backpropagation in deep learning.
- What are the main differences between supervised and unsupervised learning?
- What is stochastic gradient descent (SGD)?
- Name some commonly used optimizers in deep learning.
- What is regularization, and why is it important?
- How does dropout help prevent overfitting?
- What is batch normalization, and what are its benefits?
- Define underfitting and overfitting. How can you detect them?
- How does early stopping work?
- What are convolutional neural networks (CNNs) mainly used for?
- Briefly describe a recurrent neural network (RNN).
- What is the role of an embedding layer in NLP?
- Explain what tokenization means in NLP.
- What is stemming, and how does it differ from lemmatization?
- List common techniques for text pre-processing.
- What is word2vec? Why is it important?
- Name a few real-world applications of NLP.
- What are precision, recall, and F1-score?
- What is accuracy, and when is it a misleading metric?
- What does the confusion matrix show?
- Give examples of common deep learning frameworks.
- How are CNNs different from RNNs?
- What are the steps to prepare text data for input to a neural network?
- Name a few popular NLP libraries.
- Explain how backpropagation updates weights in a neural network.
- What is the vanishing gradient problem? In which contexts does it occur?
- Describe the ReLU activation function and its advantages.
- Compare Adam and RMSProp optimizers.
- What is weight initialization? Why does it matter?
- Discuss the tradeoff between bias and variance in deep learning.
- How does data augmentation improve model generalization?
- What is class imbalance, and how can you address it?
- What are hyperparameters? List some examples in deep learning models.
- How do you perform hyperparameter tuning?
- What is the difference between validation and test sets?
- Describe K-fold cross-validation and its advantages.
- Explain L1 and L2 regularization.
- What is transfer learning, and when is it useful?
- Describe the basic structure of a CNN layer.
- What is a pooling layer? Why is it used?
- How do RNNs process sequences differently from feedforward networks?
- Explain the concept and advantages of LSTMs over vanilla RNNs.
- What is a GRU, and how does it differ from an LSTM?
- What are attention mechanisms in neural networks?
- Describe the general architecture of a transformer model.
- How does self-attention work?
- Explain the positional encoding in transformers.
- What are autoencoders, and what are their uses?
- How do generative adversarial networks (GANs) work?
- Compare the roles of the generator and discriminator in a GAN.
- Describe BLEU score and its use in NLP evaluation.
- How would you evaluate a sentiment analysis system?
- What are embeddings? How are word2vec and GloVe different?
- Explain sequence-to-sequence (seq2seq) models.
- What is beam search, and how does it differ from greedy search?
- What is language modeling? How is it different from classification?
- Explain token types as used in BERT.
- What is fine-tuning in the context of BERT or GPT?
- How would you handle out-of-vocabulary words?
- What is an attention head in transformer models?
- Describe masked language modeling.
- How would you deploy a trained deep learning model in production?
- What challenges arise when serving NLP models at scale?
- How do you debug vanishing or exploding gradients?
- Explain the mathematical derivation of backpropagation for a simple neural network.
- How do you compute gradients for recurrent neural networks?
- What are residual connections, and why do they help in deep networks?
- Describe the impact of layer normalization in transformers.
- How does BERT handle context differently than traditional word embeddings?
- Explain how transformers achieve parallelization over RNNs.
- What are encoder-decoder architectures? How do they apply to machine translation?
- Detail the differences between BERT, GPT, and T5 architectures.
- How do you prevent mode collapse in GANs?
- Explain the use of reinforcement learning in natural language generation (e.g., RLHF).
- How are large language models (LLMs) fine-tuned using human feedback?
- Describe how curriculum learning can improve model training.
- What is multi-head attention, and how does it improve model representation?
- Discuss the importance of softmax in attention mechanisms.
- How can you interpret or visualize neural network decisions (explainability)?
- How does the Transformer’s memory cost scale with input length and why?
- What is perplexity in language modeling?
- How do you ensure reproducibility in large-scale deep learning experiments?
- What techniques are effective for handling noisy or adversarial data?
- How would you optimize inference speed for deployed NLP models?
- Implement a custom loss function in TensorFlow/PyTorch (describe the steps).
- How do you implement early stopping during training in code?
- Given an example, walk through debugging a non-converging neural network.
- Discuss distributed training strategies in deep learning.
- What are the tradeoffs between quantization, pruning, and distillation?
- How would you use transfer learning for a domain with little labeled data?
- Explain zero-shot/few-shot learning in LLMs.
- How would you deploy a deep learning model for real-time prediction?
- Discuss several ways of mitigating bias in NLP models.
- How do you keep updated with recent advances in deep learning and NLP?
1. What is a neural network, and how does it work?
A neural network is a computational model inspired by the human brain. It consists of layers of interconnected nodes (neurons) that transform inputs into outputs.
- Input layer: Receives the data
- Hidden layers: Perform computations using weights, biases, and activation functions
- Output layer: Produces predictions
Working: Data passes through layers → weighted sum + bias → activation → output. During training, weights are adjusted to minimize error using backpropagation.
2. Explain the difference between deep learning and machine learning.
| Feature | Machine Learning | Deep Learning |
| --- | --- | --- |
| Features | Handcrafted | Automatically learned |
| Models | Linear regression, Decision Trees | Neural Networks (deep) |
| Data | Works with small data | Requires large data |
| Hardware | CPU sufficient | Needs GPU |
| Example | Random Forest | CNN, RNN |
Deep learning is a subset of ML that uses multiple layers to automatically extract features.
3. What is an activation function? Name a few commonly used activation functions.
An activation function introduces non-linearity into the model, helping it learn complex patterns.
Common functions:
- Sigmoid: σ(x) = 1 / (1 + e⁻ˣ)
- Tanh: tanh(x)
- ReLU: max(0, x)
- Leaky ReLU: max(0.01x, x)
- Softmax: Converts logits to probabilities (used in classification).
4. Why is non-linearity important in neural networks?
Without non-linearity, the network would behave like a linear model, regardless of layers. Non-linear activation functions allow the network to learn complex, non-linear relationships in data.
5. Define epoch, batch, and iteration in the context of training a neural network.
- Epoch: One full pass of the entire dataset through the model.
- Batch: A subset of the dataset processed at once.
- Iteration: One update step; for N samples and batch size B, there are N/B iterations per epoch.
6. What is the purpose of a loss function? Give examples.
A loss function measures how well the model predicts compared to true values.
- Regression: Mean Squared Error (MSE)
- Classification: Cross-Entropy Loss
It guides optimization—lower loss means better predictions.
7. Explain the concept of backpropagation in deep learning.
Backpropagation computes gradients of the loss with respect to each weight using the chain rule, then updates weights using an optimizer (e.g., SGD).
Steps:
- Forward pass → compute output
- Compute loss
- Backward pass → calculate gradients
- Update weights
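These four steps map directly onto one training step in PyTorch; a minimal sketch (the model, data, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# Toy regression model and data (shapes are arbitrary)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 10)    # one batch of inputs
y = torch.randn(64, 1)     # targets

pred = model(x)            # 1. forward pass -> compute output
loss = criterion(pred, y)  # 2. compute loss
optimizer.zero_grad()
loss.backward()            # 3. backward pass -> calculate gradients
optimizer.step()           # 4. update weights
```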
8. What are the main differences between supervised and unsupervised learning?
| Aspect | Supervised | Unsupervised |
| --- | --- | --- |
| Data | Labeled | Unlabeled |
| Goal | Predict outputs | Discover patterns |
| Examples | Classification, Regression | Clustering, Dimensionality reduction |
9. What is stochastic gradient descent (SGD)?
An optimization algorithm that updates weights using one (or few) samples at a time instead of the entire dataset.
Update rule: w ← w − η · ∇L(w; xᵢ, yᵢ), where η is the learning rate and (xᵢ, yᵢ) is a single sample (or small mini-batch).
The frequent, noisy updates make each step cheap and can help the model generalize better.
10. Name some commonly used optimizers in deep learning.
- SGD
- Momentum
- RMSProp
- Adam
- Adagrad
- AdamW
11. What is regularization, and why is it important?
Regularization prevents overfitting by penalizing large weights.
Types:
- L1 (Lasso): Adds |w| penalty
- L2 (Ridge): Adds w² penalty
- Dropout
It helps improve generalization.
12. How does dropout help prevent overfitting?
Dropout randomly deactivates a fraction of neurons during training, forcing the model to learn redundant representations and preventing dependence on specific neurons.
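For example, in PyTorch a dropout layer (the rate of 0.5 is chosen arbitrarily here) is only active in training mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # deactivate 50% of activations during training
x = torch.ones(4, 8)

drop.train()               # training mode: random units zeroed, survivors scaled by 1/(1-p)
print(drop(x))

drop.eval()                # evaluation mode: dropout is disabled (identity)
print(drop(x))
```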
13. What is batch normalization, and what are its benefits?
Batch normalization normalizes activations in a layer across a mini-batch.
Benefits:
- Stabilizes training
- Allows higher learning rates
- Reduces internal covariate shift
- Acts as a regularizer
14. Define underfitting and overfitting. How can you detect them?
- Underfitting: Model too simple → poor performance on train & test data
- Overfitting: Model too complex → good train, poor test performance
Detection: Compare train vs. validation loss/accuracy.
15. How does early stopping work?
Training stops when the validation loss stops improving for a set number of epochs (the patience). This prevents overfitting, and the model weights from its best validation performance are kept.
16. What are convolutional neural networks (CNNs) mainly used for?
CNNs are primarily used for image and spatial data, e.g., image classification, object detection, facial recognition.
17. Briefly describe a recurrent neural network (RNN).
RNNs process sequential data by maintaining a hidden state that captures past information.
Used in time series, text, and speech data.
18. What is the role of an embedding layer in NLP?
Converts words into dense vectors that capture semantic relationships.
Example: Word2Vec or learned embeddings in Keras Embedding layer.
19. Explain what tokenization means in NLP.
Tokenization splits text into smaller units—words, subwords, or characters—for processing by models.
Example: "I love NLP" → ["I", "love", "NLP"]
20. What is stemming, and how does it differ from lemmatization?
- Stemming: Removes suffixes → crude cut (e.g., “running” → “run”)
- Lemmatization: Converts to base word using dictionary (e.g., “better” → “good”)
Lemmatization is more accurate linguistically.
21. List common techniques for text pre-processing.
- Lowercasing
- Removing punctuation/stopwords
- Tokenization
- Stemming/Lemmatization
- Handling numbers
- Padding/truncation
- Encoding (e.g., word2vec, TF-IDF)
22. What is word2vec? Why is it important?
A neural embedding model that represents words as dense vectors based on context (Skip-gram, CBOW).
Captures semantic similarity (e.g., vector(king) - vector(man) + vector(woman) ≈ vector(queen)).
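A minimal training sketch with Gensim (assuming Gensim 4.x; the toy corpus and hyperparameters are purely illustrative):

```python
from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"]]         # toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-gram

vec = model.wv["king"]                         # dense vector for "king"
print(model.wv.most_similar("king", topn=2))   # nearest neighbours by cosine similarity
```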
23. Name a few real-world applications of NLP.
- Sentiment analysis
- Chatbots
- Machine translation
- Speech recognition
- Text summarization
- Question answering
24. What are precision, recall, and F1-score?
- Precision: TP / (TP + FP) → correctness
- Recall: TP / (TP + FN) → completeness
- F1-score: 2 · (Precision · Recall) / (Precision + Recall) → harmonic mean of precision and recall
These metrics are especially useful when data is imbalanced.
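All three are available directly in scikit-learn; a small sketch with toy labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]   # toy predictions

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```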
25. What is accuracy, and when is it a misleading metric?
Accuracy = (TP + TN) / Total
Misleading in imbalanced datasets (e.g., 95% accuracy if 95% are one class).
26. What does the confusion matrix show?
A table showing true vs. predicted classes:
|  | Predicted P | Predicted N |
| --- | --- | --- |
| Actual P | TP | FN |
| Actual N | FP | TN |
Helps derive precision, recall, F1, accuracy.
27. Give examples of common deep learning frameworks.
- TensorFlow
- Keras
- PyTorch
- MXNet
- JAX
28. How are CNNs different from RNNs?
| Feature | CNN | RNN |
| --- | --- | --- |
| Data | Spatial (images) | Sequential (text, time) |
| Operation | Convolution | Recurrence |
| Parallelization | Easy | Difficult |
| Memory | No memory | Maintains hidden state |
29. What are the steps to prepare text data for input to a neural network?
- Text cleaning (remove noise)
- Tokenization
- Convert to integer sequences
- Padding/truncating
- Embedding (Word2Vec, GloVe, or learnable)
- Feed into model
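A minimal sketch of this pipeline using Keras preprocessing utilities (the vocabulary size and sequence length are arbitrary choices):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["I love NLP", "Deep learning is fun"]

tokenizer = Tokenizer(num_words=10000)       # keep the 10k most frequent tokens
tokenizer.fit_on_texts(texts)                # build the vocabulary
seqs = tokenizer.texts_to_sequences(texts)   # words -> integer ids
padded = pad_sequences(seqs, maxlen=8)       # pad/truncate to a fixed length

# `padded` can now be fed to an Embedding layer inside the model.
print(padded)
```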
30. Name a few popular NLP libraries.
- NLTK
- spaCy
- Transformers (Hugging Face)
- Gensim
- TextBlob
31. Explain how backpropagation updates weights in a neural network.
Backpropagation computes the gradient of the loss with respect to each weight using the chain rule.
Steps:
- Forward pass: compute predictions and loss.
- Backward pass: compute gradients (∂Loss/∂Weight).
- Weight update: w ← w − η · (∂Loss/∂w), where η is the learning rate.
32. What is the vanishing gradient problem? In which contexts does it occur?
When gradients become very small during backpropagation, earlier layers update very slowly → learning stalls.
Occurs mostly in:
- Deep networks with many layers
- Sigmoid/tanh activations
Mitigation: ReLU, batch normalization, residual connections, LSTM/GRU.
33. Describe the ReLU activation function and its advantages.
ReLU(x) = max(0, x): it passes positive inputs through unchanged and outputs zero otherwise.
Advantages:
- Non-linear yet simple
- Avoids vanishing gradient for positive inputs
- Speeds up convergence
Drawback: Dying ReLU (neuron outputs zero forever if weights push input < 0).
34. Compare Adam and RMSProp optimizers.
| Feature | Adam | RMSProp |
| --- | --- | --- |
| Momentum | Yes (β₁) | No |
| Adaptive LR | Yes | Yes |
| Use | Most general-purpose | Good for RNNs |
Adam = RMSProp + Momentum + Bias correction.
35. What is weight initialization? Why does it matter?
It's how the initial weights are set before training.
Poor initialization → slow convergence or vanishing/exploding gradients.
Good schemes:
- Xavier/Glorot: for sigmoid/tanh
- He initialization: for ReLU
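In PyTorch, for instance, both schemes can be applied explicitly to a layer's weights:

```python
import torch.nn as nn

layer_tanh = nn.Linear(128, 64)
nn.init.xavier_uniform_(layer_tanh.weight)                        # Xavier/Glorot: suited to sigmoid/tanh

layer_relu = nn.Linear(128, 64)
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")   # He initialization: suited to ReLU
```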
36. Discuss the tradeoff between bias and variance in deep learning.
- High bias: Simple model → underfits
- High variance: Complex model → overfits
Goal: balance both → minimal generalization error.
37. How does data augmentation improve model generalization?
It increases dataset diversity artificially by applying transformations (rotation, cropping, noise).
→ Prevents overfitting and improves robustness.
38. What is class imbalance, and how can you address it?
When one class dominates → model biased.
Solutions:
- Resampling (oversample minority / undersample majority)
- Class weighting
- Synthetic data (SMOTE)
- Metrics: F1, ROC-AUC instead of accuracy.
39. What are hyperparameters? List some examples in deep learning models.
Settings chosen before training.
Examples: learning rate, batch size, epochs, number of layers, dropout rate, optimizer type.
40. How do you perform hyperparameter tuning?
Techniques:
- Grid search
- Random search
- Bayesian optimization
- Hyperband / Optuna
- Manual tuning + validation set
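As one example of automated search, a minimal Optuna sketch; the objective below is a stand-in for a real train-and-validate routine:

```python
import optuna

def objective(trial):
    # Sample candidate hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # In practice: return the validation loss of a model trained with (lr, dropout).
    # Here a toy expression stands in for that training run.
    return (lr - 1e-3) ** 2 + 0.1 * dropout

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```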
41. What is the difference between validation and test sets?
- Validation set: used during training for hyperparameter tuning.
- Test set: used after training to measure final performance.
42. Describe K-fold cross-validation and its advantages.
Data is split into K parts → each used once as validation, rest as training.
Advantages:
- More reliable performance estimate
- Uses all data efficiently
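With scikit-learn, the splitting itself is a few lines; the training call is a hypothetical placeholder:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # toy features
y = np.arange(10)                  # toy targets

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # train_and_evaluate(X_train, y_train, X_val, y_val)   # hypothetical helper
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")
```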
43. Explain L1 and L2 regularization.
- L1 (Lasso): adds λ · Σ|w| to the loss → encourages sparse weights
- L2 (Ridge): adds λ · Σw² to the loss → shrinks weights smoothly
Purpose: prevent overfitting by penalizing model complexity.
44. What is transfer learning, and when is it useful?
Reuse pretrained model on new task with limited data.
Example: use pretrained ResNet on new image dataset, or BERT for text classification.
Useful when data is small or related.
45. Describe the basic structure of a CNN layer.
Components:
- Convolution → filters extract spatial features
- Activation → e.g., ReLU
- Pooling → reduces dimensions
- Normalization (optional)
46. What is a pooling layer? Why is it used?
Reduces spatial size while retaining features.
Types:
- Max pooling
- Average pooling
Benefits: reduces parameters, translation invariance.
47. How do RNNs process sequences differently from feedforward networks?
RNNs maintain hidden states that carry information from previous time steps, enabling temporal dependencies.
Feedforward networks treat all inputs independently.
48. Explain the concept and advantages of LSTMs over vanilla RNNs.
LSTMs have gates (input, forget, output) that control information flow, solving vanishing gradient.
They remember long-term dependencies better.
49. What is a GRU, and how does it differ from an LSTM?
Gated Recurrent Unit = simpler LSTM:
- Two gates: update and reset
- No separate cell state
→ Faster training, similar performance.
50. What are attention mechanisms in neural networks?
Attention lets the model focus on relevant parts of the input when producing each output.
Introduced in seq2seq → revolutionized NLP.
51. Describe the general architecture of a transformer model.
- Encoder-decoder structure
- Each layer: multi-head self-attention + feedforward network + layer norm
- Parallelizable (no recurrence)
52. How does self-attention work?
Computes attention scores using:
- Queries (Q), Keys (K), and Values (V)
→ captures relationships between all tokens.
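A minimal sketch of single-head scaled dot-product self-attention in PyTorch (no masking; the projection matrices are random for illustration):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / math.sqrt(K.shape[-1])   # every token scored against every token
    weights = F.softmax(scores, dim=-1)         # rows sum to 1
    return weights @ V                          # weighted combination of values

x = torch.randn(5, 16)                          # 5 tokens, model dim 16
Wq, Wk, Wv = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)      # torch.Size([5, 8])
```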
53. Explain the positional encoding in transformers.
Adds position info since transformer has no sequence order.
Uses sinusoidal functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
54. What are autoencoders, and what are their uses?
Neural networks trained to reconstruct input.
Structure: Encoder → bottleneck → Decoder
Uses:
- Dimensionality reduction
- Denoising
- Feature learning
- Anomaly detection
55. How do generative adversarial networks (GANs) work?
Two networks:
- Generator (G): produces fake samples
- Discriminator (D): distinguishes real vs fake
They train adversarially:
G tries to fool D, while D tries to detect fakes.
56. Compare the roles of the generator and discriminator in a GAN.
| Component | Role |
| --- | --- |
| Generator | Creates synthetic data |
| Discriminator | Classifies real vs. fake |
They compete → equilibrium when G’s fakes look real.
57. Describe BLEU score and its use in NLP evaluation.
BLEU (Bilingual Evaluation Understudy) measures similarity between generated and reference text using n-gram overlap.
Used in machine translation, summarization.
58. How would you evaluate a sentiment analysis system?
- Metrics: Accuracy, Precision, Recall, F1-score
- Confusion matrix
- Cross-validation
- Manual inspection of misclassifications
59. What are embeddings? How are word2vec and GloVe different?
Embeddings = dense vector representations of words.
- Word2Vec: learns via context prediction (neural)
- GloVe: uses co-occurrence statistics (matrix factorization)
60. Explain sequence-to-sequence (seq2seq) models.
Architecture with encoder (encodes input sequence) and decoder (generates output).
Used in translation, summarization.
Often enhanced with attention.
61. What is beam search, and how does it differ from greedy search?
- Greedy: picks best token at each step.
- Beam search: keeps top-k candidates at each step → explores more possibilities → better sequences.
62. What is language modeling? How is it different from classification?
Predicts next word given context (sequence probability).
Classification predicts label for entire input.
Language model: P(w₁, w₂, …, wₙ).
63. Explain token types as used in BERT.
BERT uses:
- Token embeddings (word)
- Segment embeddings (sentence A/B)
- Position embeddings
Summed to form final input representation.
64. What is fine-tuning in the context of BERT or GPT?
Start with pretrained model → train on task-specific data (e.g., classification, QA).
Usually update all weights with small learning rate.
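A minimal sketch with the Hugging Face Transformers library; the single example and label stand in for a task-specific dataset:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One labeled example; in practice this would be a task-specific dataset.
batch = tokenizer(["This movie was great"], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)                     # returns loss + logits
outputs.loss.backward()                                     # gradients reach all pretrained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small learning rate, as noted above
optimizer.step()
```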
65. How would you handle out-of-vocabulary words?
- Use subword tokenization (BPE, WordPiece)
- Use UNK token
- Train character-level models
66. What is an attention head in transformer models?
An attention head is one set of Q/K/V projections with its own attention computation; each head learns different relationships.
Multi-head attention = several such heads in parallel, capturing diverse contexts.
67. Describe masked language modeling.
Task where some tokens are masked, and the model predicts them.
Used in BERT pretraining to learn bidirectional context.
68. How would you deploy a trained deep learning model in production?
Steps:
- Serialize (e.g., .pt, .h5, ONNX)
- Serve via API (Flask/FastAPI/TorchServe)
- Containerize (Docker)
- Deploy on cloud (AWS, GCP)
- Monitor performance
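For illustration, a minimal FastAPI endpoint serving a TorchScript model; the file name and feature format are hypothetical:

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt")   # hypothetical TorchScript export of the trained model
model.eval()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    x = torch.tensor([req.features])
    with torch.no_grad():
        y = model(x)
    return {"prediction": y.tolist()}
```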
69. What challenges arise when serving NLP models at scale?
- Latency and memory usage
- Batch inference
- Tokenization overhead
- Updating vocab/models
- Monitoring drift
70. How do you debug vanishing or exploding gradients?
- Use ReLU or Leaky ReLU
- Gradient clipping
- Batch normalization
- Proper weight initialization
- Residual connections
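For example, gradient clipping in PyTorch is a single call placed between the backward pass and the optimizer step (the toy model and loss exist only to produce gradients):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 10)).pow(2).mean()   # toy loss, just to produce gradients
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0 (guards against exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```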
71. Explain the mathematical derivation of backpropagation for a simple neural network.
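A standard sketch for a one-hidden-layer network with activation σ and squared-error loss L = ½(ŷ − y)², where z₁ = W₁x + b₁, a₁ = σ(z₁), z₂ = W₂a₁ + b₂, ŷ = σ(z₂):

```latex
\begin{aligned}
\delta_2 &= \frac{\partial L}{\partial z_2} = (\hat{y} - y)\,\sigma'(z_2) \\
\frac{\partial L}{\partial W_2} &= \delta_2\, a_1^{\top}, \qquad
\frac{\partial L}{\partial b_2} = \delta_2 \\
\delta_1 &= \frac{\partial L}{\partial z_1} = \bigl(W_2^{\top}\delta_2\bigr)\odot \sigma'(z_1) \\
\frac{\partial L}{\partial W_1} &= \delta_1\, x^{\top}, \qquad
\frac{\partial L}{\partial b_1} = \delta_1 \\
W_\ell &\leftarrow W_\ell - \eta\,\frac{\partial L}{\partial W_\ell}
\end{aligned}
```

Each δ reuses the δ of the layer above; this is exactly the backward pass: errors propagate from output to input via the chain rule while the gradient for each layer's weights is collected along the way.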
72. How do you compute gradients for recurrent neural networks?
Use backpropagation through time (BPTT): unroll the RNN across timesteps, compute forward pass across unrolled graph, then apply chain rule backward through time accumulating gradients for shared parameters. Gradients are sums over timesteps; long dependencies may cause vanishing/exploding gradients — apply gradient clipping, gating (LSTM/GRU), or truncated BPTT.
73. What are residual connections, and why do they help in deep networks?
Residual (skip) connections add the input x to a layer's output: y = F(x) + x. They allow gradients to flow directly to earlier layers, easing optimization and enabling very deep networks by preventing degradation (helps avoid vanishing gradients and makes identity mappings easy to learn).
74. Describe the impact of layer normalization in transformers.
LayerNorm normalizes activations per sample across feature dimension (not across batch). It stabilizes training, smooths optimization, reduces internal covariate shift for each token, and works well with variable batch sizes — critical in transformers where per-position normalization improves convergence and training stability.
75. How does BERT handle context differently than traditional word embeddings?
Traditional embeddings (Word2Vec/GloVe) are static: one vector per word. BERT produces contextual embeddings — the embedding for a token depends on the whole sentence via bidirectional self-attention (masked LM pretraining), so the same word in different contexts has different vectors.
76. Explain how transformers achieve parallelization over RNNs.
Transformers use self-attention, where each token attends to all tokens in a layer; the operations are matrix multiplications over whole sequences, with no timestep-dependent recurrence. This allows processing all tokens in parallel on GPUs/TPUs. (Cost: attention is O(n²) in sequence length.)
77. What are encoder-decoder architectures? How do they apply to machine translation?
Encoder transforms input sequence into representations; decoder generates target sequence conditioned on encoder outputs (and previously generated tokens). In MT: encoder reads source sentence, decoder produces translated sentence token-by-token, often using attention to focus on relevant source positions.
78. Detail the differences between BERT, GPT, and T5 architectures.
- BERT: Bidirectional encoder-only, masked language modeling + next-sentence prediction (pretraining), best for encoding tasks (classification, NER).
- GPT: Decoder-only, autoregressive LM (left-to-right), great for generative tasks and next-token prediction.
- T5: Encoder–decoder, unified text-to-text LM with span-masking pretraining; flexible for both generation and understanding tasks.
79. How do you prevent mode collapse in GANs?
Techniques: feature matching, minibatch discrimination, unrolled GANs, adding noise to inputs/labels, Wasserstein GAN with gradient penalty (WGAN-GP), spectral normalization, two-time-scale updates for G and D. Also tune capacities and learning rates to keep discriminator from overpowering generator.
80. Explain the use of reinforcement learning in natural language generation (e.g., RLHF).
RL is used to optimize non-differentiable objectives (e.g., human preference, BLEU, safety rewards). In RLHF: a reward model is trained from human preference labels; the LM is fine-tuned with policy-gradient or PPO to maximize the learned human-centered reward while sometimes constraining divergence from the base model.
81. How are large language models (LLMs) fine-tuned using human feedback?
Typical RLHF pipeline: (1) gather comparison judgments from humans, (2) train a reward model to predict human preferences, (3) fine-tune base LM via RL (PPO) to maximize reward, often with KL penalty to prevent drift. Iteratively refine data and reward model.
82. Describe how curriculum learning can improve model training.
Train on easier examples first and progressively increase difficulty. Benefits: smoother loss landscape, faster convergence, better generalization. Useful when tasks have natural difficulty ordering (e.g., short → long sequences).
83. What is multi-head attention, and how does it improve model representation?
Multi-head attention runs several attention mechanisms in parallel with different linear projections (heads). Each head can capture different types of relationships/positional patterns; concatenating them yields richer token representations than a single attention head.
84. Discuss the importance of softmax in attention mechanisms.
Softmax converts raw attention scores into a probability distribution (weights sum to 1) so the output is a convex combination of values. It emphasizes high-score positions and ensures stable, interpretable attention weights.
85. How can you interpret or visualize neural network decisions (explainability)?
Methods: saliency maps/gradient-based attribution, Integrated Gradients, LIME/SHAP, attention visualization (with caution), activation maximization, feature importance, concept activation vectors (TCAV). Combine quantitative metrics with human inspection.
86. How does the Transformer’s memory cost scale with input length and why?
Self-attention requires computing an n×n attention matrix (for sequence length n), so memory and compute scale as O(n²). This quadratic scaling becomes a bottleneck for long sequences.
87. What is perplexity in language modeling?
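Perplexity measures how well a language model predicts held-out text: it is the exponentiated average negative log-likelihood per token, PPL = exp(−(1/N) · Σᵢ log P(wᵢ | w₁ … wᵢ₋₁)). Lower is better; intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words at each step.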
88. How do you ensure reproducibility in large-scale deep learning experiments?
Fix random seeds, set deterministic flags (where available), log environment (package versions, CUDA), containerize (Docker), save checkpoints and config files, use fixed data splits, document hyperparameters, and store seeds used for data shuffling.
89. What techniques are effective for handling noisy or adversarial data?
Robust training: data cleaning, noise-robust loss functions, label smoothing, adversarial training, outlier detection, robust regularization, data augmentation, and certified defenses for adversarial examples where needed.
90. How would you optimize inference speed for deployed NLP models?
Use quantization (INT8), pruning, model distillation to smaller student models, batching and asynchronous inference, ONNX/TensorRT conversion, caching tokenized prefixes, operator fusion, fewer layers or smaller hidden sizes, and CPU/GPU inference servers tuned for low latency.
91. Implement a custom loss function in TensorFlow/PyTorch (describe the steps).
- PyTorch: subclass nn.Module (or write a plain function) and implement forward(self, outputs, targets) returning a scalar tensor; apply a reduction (mean/sum) and keep the computation differentiable. Use it in the training loop as loss = criterion(outputs, targets) followed by loss.backward().
- TensorFlow/Keras: subclass tf.keras.losses.Loss and implement call(y_true, y_pred), or provide a plain callable. Compile the model with loss=custom_loss and train as usual.
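A minimal PyTorch sketch of these steps (the weighted MSE is purely illustrative):

```python
import torch
import torch.nn as nn

class WeightedMSELoss(nn.Module):
    """Illustrative custom loss: mean squared error scaled by a fixed weight."""
    def __init__(self, weight=2.0):
        super().__init__()
        self.weight = weight

    def forward(self, outputs, targets):
        return (self.weight * (outputs - targets) ** 2).mean()   # differentiable, reduced to a scalar

criterion = WeightedMSELoss()
outputs = torch.randn(4, 1, requires_grad=True)
targets = torch.randn(4, 1)
loss = criterion(outputs, targets)
loss.backward()   # gradients flow exactly as with built-in losses
```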
92. How do you implement early stopping during training in code?
Use a callback that monitors validation metric, saves best checkpoint, and stops training if no improvement for patience epochs. Examples: tf.keras.callbacks.EarlyStopping, PyTorch Lightning EarlyStopping, or custom loop checking val_loss and breaking training.
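In Keras, for example, this is a one-line callback (the patience value is illustrative; the training call is shown as a hypothetical placeholder):

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # metric to watch
    patience=5,                    # stop after 5 epochs without improvement
    restore_best_weights=True,     # roll back to the best weights seen
)

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])   # hypothetical training call
```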
93. Given an example, walk through debugging a non-converging neural network.
Checklist: (1) verify data pipeline and labels, (2) check loss scale and numeric issues (NaNs), (3) reduce learning rate or try a scheduler, (4) try gradient clipping, (5) check weight initialization, (6) normalize inputs, (7) test a tiny overfit experiment on a few examples, (8) simplify architecture, (9) inspect activations/grad norms for vanishing/exploding, (10) try different optimizer/learning rate warmup.
94. Discuss distributed training strategies in deep learning.
- Data parallelism: each worker has a full model and different data shards; gradients are aggregated (synchronous or asynchronous).
- Model parallelism: split model across devices (useful for very large models).
- Pipeline parallelism: split layers across stages, stream micro-batches.
Tools: Horovod, PyTorch DDP, DeepSpeed (ZeRO), Megatron-LM.
Tradeoffs: communication overhead vs memory savings.
95. What are the tradeoffs between quantization, pruning, and distillation?
- Quantization: reduces numeric precision → lower latency & memory; may slightly reduce accuracy.
- Pruning: removes weights → sparsity and smaller models but needs hardware support for sparse ops or sparse-to-dense conversion.
- Distillation: trains a smaller student to mimic teacher → often best accuracy/size tradeoff.
Combine methods carefully (distill then quantize/prune).
96. How would you use transfer learning for a domain with little labeled data?
Freeze majority of pre-trained encoder, fine-tune last few layers, use strong data augmentation, use domain-adaptive pretraining (continue LM pretraining on unlabeled in-domain text), use few-shot/meta-learning or prompt-based methods for LLMs.
97. Explain zero-shot/few-shot learning in LLMs.
- Zero-shot: model performs a task without explicit task-specific fine-tuning by using prompts that describe the task.
- Few-shot: provide a few labeled examples within the prompt (in-context learning) so the model infers the mapping. Works well for large, pretrained autoregressive LMs.
98. How would you deploy a deep learning model for real-time prediction?
Build a low-latency serving stack: serialize model to optimized runtime (ONNX/TensorRT), host behind a fast API (FastAPI/TorchServe), enable batching with latency constraints, autoscale, use inference-optimized instances, add caching for repeated requests, monitor latency/error rates and fallback strategies.
99. Discuss several ways of mitigating bias in NLP models.
Audit datasets for representational harms, use balanced sampling, apply counterfactual data augmentation, debias embeddings (projection/removal), adversarial de-biasing, include fairness constraints during training, perform post-hoc calibration, involve domain experts and human evaluation.
100. How do you keep updated with recent advances in deep learning and NLP?
Read arXiv, follow major conferences (NeurIPS, ICML, ACL, ICLR), follow researchers on X/Twitter, subscribe to newsletters (The Batch, TLDR), follow repositories (Hugging Face), read blog posts, replicate key papers, and participate in community forums and reading groups.