CNNs ,RNNs and LSTM

Paper Link Key Highlights Year
Learning Representation by Backpropagation PDF • Introduced the backpropagation learning procedure for training multilayer neural networks, enabling them to learn internal representations and features essential for solving complex tasks unattainable by simple perceptrons.
• Introduced momentum in gradient descent for acceleration and weight-decay regularization to improve generalization.
• Demonstrated that multi-layer networks could successfully approximate complex nonlinear mappings.
1985
Backpropagation Applied to Handwriting Zip Code Recognition PDF • Developed the convolutional neural network (CNN) architecture for image recognition, applying backpropagation to recognize handwritten ZIP codes from raw pixel input.
• Showed CNNs could automatically learn relevant feature hierarchies (edges, shapes, etc.) directly from data without manual feature engineering.
• Highlighted challenges in generalizing to out-of-distribution data and handling varied handwriting styles or noise.
1989
Learning Long Term Dependencies with Gradient Descent is Difficult PDF • Analyzed why standard gradient descent methods (such as backpropagation-through-time) struggle to learn long-term dependencies in recurrent neural networks.
• Identified the vanishing and exploding gradient problem, where error signals decay exponentially or grow uncontrollably.
• Theorized with proofs and experiments that problems requiring retention of information over long intervals are essentially inaccessible to standard RNN training.
1994
Long Short-Term Memory (LSTM) PDF • Identified the vanishing/exploding gradient problem in RNNs.
• Proposed the LSTM architecture.
• Demonstrated experimentally that LSTM can bridge minimal time lags of hundreds or thousands of steps where previous methods fail.
• Shown to have favorable complexity: O(1) per step/weight, similar to BPTT, and local in space and time.
1997

MAML , Seq2Seq ,Transformers , BERT & GPT

Paper Core Goal Key Idea / Architecture Strengths Limitations Legacy / Impact
Seq2Seq (2014)
Sutskever, Vinyals, Le
End-to-end sequence mapping (e.g., machine translation). Encoder–decoder LSTMs, with a ā€œreverse sourceā€ trick to ease training. First strong neural MT system, robust to long sentences, learned semantic sentence embeddings. Bottleneck of fixed vector; vocab limits (UNK); data hungry; hard optimization. Sparked neural MT revolution; laid groundwork for attention + Transformers.
Transformer (2017)
Vaswani et al., ā€œAttention Is All You Needā€
Faster, more scalable sequence modeling. Pure self-attention (no recurrence or conv), multi-head attention, positional encodings. Parallelizable; better long-range modeling; SOTA in MT. Quadratic self-attention cost; still compute heavy for long sequences. Paradigm shift; foundation of modern NLP, vision, multimodal models.
GPT (2018)
Radford et al., ā€œImproving Language Understanding by Generative Pre-Trainingā€
Universal transfer for NLP tasks. Two-stage: pre-train on LM objective (unidirectional) → fine-tune via task serialization. Strong transfer; simple traversal-style input; SOTA on many NLP tasks. Unidirectional context; task mismatch risks; smaller scale vs successors. Pioneered pre-train → fine-tune pipeline; ancestor of GPT-2/3/4.
BERT (2018)
Devlin et al.
Deep bidirectional pre-training for language understanding. Transformer encoder with MLM + NSP; fine-tune for downstream tasks. Huge SOTA gains on GLUE, SQuAD, etc.; bidirectional context; versatile with minimal task tweaks. [MASK] mismatch; fine-tuning instability; heavy compute cost. ā€œImageNet momentā€ for NLP; inspired RoBERTa, ALBERT, DistilBERT, and more.
MAML (2017)
Finn et al.
Meta-learning for fast adaptation to new tasks. Inner loop (task adaptation) + outer loop (meta-optimization) to learn good initial weights. Few-shot learning success; general framework (works for vision, RL, etc.). Expensive (2nd-order gradients); sensitive to task distribution + hyperparams. Landmark in meta-learning; inspired FOMAML, Reptile, ANIL, etc.
MultiModel (2017)
Kaiser et al.
Unified multitask, multimodal learning. Shared Transformer-style core + modality-specific subnets for text, vision, speech. First real multimodal generalist; showed surprising cross-domain transfer. Lagged behind top specialist models; very resource heavy; complex to train. Early blueprint for generalist AI; influenced multimodal transformers (PaLM-E, Flamingo, GPT-4V).