CODERATWORK7

CNNs, RNNs, and LSTMs

Paper	Link	Key Highlights	Year
Learning Representations by Back-propagating Errors	PDF	• Popularized the back-propagation learning procedure for training multilayer neural networks, letting hidden units learn internal representations of task-relevant features that single-layer perceptrons cannot capture. • Used a momentum term to accelerate gradient descent and weight decay to improve generalization. • Demonstrated that multilayer networks can learn complex nonlinear mappings.	1986
Backpropagation Applied to Handwritten Zip Code Recognition	PDF	• Applied a convolutional network (local receptive fields with shared weights), trained end to end with backpropagation, to recognize handwritten ZIP codes from raw pixel input. • Showed that such networks can learn relevant feature hierarchies (edges, shapes, etc.) directly from data without manual feature engineering. • Highlighted challenges in generalizing to out-of-distribution data and in handling varied handwriting styles and noise.	1989
Learning Long-Term Dependencies with Gradient Descent Is Difficult	PDF	• Analyzed why standard gradient-based methods (such as backpropagation through time) struggle to learn long-term dependencies in recurrent neural networks. • Identified the vanishing and exploding gradient problem, where error signals decay exponentially or grow uncontrollably as they are propagated back through time. • Argued, with theoretical analysis and experiments, that tasks requiring information to be retained over long intervals are very hard to learn with standard RNN training.	1994
Long Short-Term Memory (LSTM)	PDF	• Analyzed the vanishing/exploding error problem that arises when backpropagating through time in RNNs. • Proposed the LSTM architecture, which keeps error flow constant through memory cells ("constant error carousels") whose access is controlled by multiplicative gates. • Demonstrated experimentally that LSTM can bridge minimal time lags in excess of 1,000 steps where previous methods fail. • Has favorable complexity: O(1) per time step and weight, comparable to BPTT, and the algorithm is local in space and time.	1997
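
The vanishing and exploding gradient behaviour described in the 1994 and 1997 rows can be illustrated with a toy one-unit recurrent network. The sketch below is not taken from either paper: the function name `gradient_magnitude`, the recurrence `h_t = tanh(w * h_{t-1})`, and the weights 0.9 and 1.1 are all assumptions chosen for illustration. It multiplies the per-step derivatives (the chain rule through time) to show how the gradient with respect to the initial state shrinks or grows with the number of steps.

```python
import math


def gradient_magnitude(w, steps, h0=0.5, linear=False):
    """Return |d h_T / d h_0| for a one-unit recurrent network.

    Recurrence (an illustrative toy, not taken from the papers):
        h_t = tanh(w * h_{t-1})   if linear is False
        h_t = w * h_{t-1}         if linear is True
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        if linear:
            h = w * h
            grad *= w  # d(w * h_prev) / d h_prev = w
        else:
            h = math.tanh(w * h)
            grad *= w * (1.0 - h * h)  # chain rule through tanh
    return abs(grad)


if __name__ == "__main__":
    for steps in (1, 10, 50, 100):
        vanishing = gradient_magnitude(0.9, steps)
        exploding = gradient_magnitude(1.1, steps, linear=True)
        print(f"T={steps:3d}  tanh, w=0.9: {vanishing:.3e}   "
              f"linear, w=1.1: {exploding:.3e}")
```

With a recurrent weight below 1 every factor in the product is smaller than 1, so the gradient decays at least geometrically; with a linear unit and a weight above 1 it grows geometrically. LSTM's memory cells are designed to hold this per-step factor at 1 along the cell state.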

MAML, Seq2Seq, Transformers, BERT & GPT

Paper	Core Goal	Key Idea / Architecture	Strengths	Limitations	Legacy / Impact
Seq2Seq (2014) Sutskever, Vinyals, Le	End-to-end sequence mapping (e.g., machine translation).	Encoder–decoder LSTMs, with the trick of reversing the source sentence to ease training.	First strong neural MT system, robust to long sentences, learned semantic sentence embeddings.	Fixed-length vector bottleneck; fixed vocabulary (out-of-vocabulary words become UNK); data hungry; hard optimization.	Sparked neural MT revolution; laid groundwork for attention + Transformers.
Transformer (2017) Vaswani et al., “Attention Is All You Need”	Faster, more scalable sequence modeling.	Pure self-attention (no recurrence or conv), multi-head attention, positional encodings.	Parallelizable; better long-range modeling; SOTA in MT.	Quadratic self-attention cost; still compute heavy for long sequences.	Paradigm shift; foundation of modern NLP, vision, multimodal models.
GPT (2018) Radford et al., “Improving Language Understanding by Generative Pre-Training”	Universal transfer for NLP tasks.	Two-stage: pre-train a Transformer decoder on an LM objective (unidirectional) → fine-tune via task serialization.	Strong transfer; simple traversal-style input transformations; SOTA on many NLP tasks.	Unidirectional context; task mismatch risks; smaller scale vs successors.	Popularized the Transformer pre-train → fine-tune pipeline; ancestor of GPT-2/3/4.
BERT (2018) Devlin et al.	Deep bidirectional pre-training for language understanding.	Transformer encoder with MLM + NSP; fine-tune for downstream tasks.	Large SOTA gains on GLUE, SQuAD, etc.; bidirectional context; versatile with minimal task tweaks.	[MASK] token appears in pre-training but not in fine-tuning (mismatch); fine-tuning instability; heavy compute cost.	“ImageNet moment” for NLP; inspired RoBERTa, ALBERT, DistilBERT, and more.
MAML (2017) Finn et al.	Meta-learning for fast adaptation to new tasks.	Inner loop (task adaptation) + outer loop (meta-optimization) to learn good initial weights.	Few-shot learning success; general framework (works for vision, RL, etc.).	Expensive (2nd-order gradients); sensitive to task distribution + hyperparams.	Landmark in meta-learning; inspired FOMAML, Reptile, ANIL, etc.
MultiModel (2017) Kaiser et al., “One Model To Learn Them All”	Unified multitask, multimodal learning.	Shared core built from convolutional, attention, and sparsely gated mixture-of-experts blocks + modality-specific subnets for text, vision, speech.	One of the first single models trained jointly across text, vision, and speech tasks; showed surprising cross-domain transfer.	Lagged behind top specialist models; very resource heavy; complex to train.	Early blueprint for generalist AI; precursor to multimodal transformers (PaLM-E, Flamingo, GPT-4V).
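
The self-attention idea in the Transformer row can be made concrete with a small example. The following is a minimal sketch of single-head scaled dot-product attention, softmax(QKᵀ / √d_k)·V, written in plain Python so it runs without dependencies. The function names and the toy inputs are illustrative assumptions, and the sketch omits the learned projections, multiple heads, masking, and positional encodings that the full architecture uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors.

    queries: n_q vectors of length d_k
    keys:    n_k vectors of length d_k
    values:  n_k vectors of length d_v
    Returns (outputs, weights): n_q output vectors of length d_v and the
    n_q x n_k matrix of attention weights.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: dot product scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    # Toy self-attention: the same three vectors act as Q, K, and V.
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    outputs, weights = scaled_dot_product_attention(x, x, x)
    for row in weights:
        print([round(w, 3) for w in row])
```

Every query is scored against every key, so a sequence of length n produces an n × n weight matrix. That is the quadratic cost listed under Limitations, and because each query's row is computed independently, it is also what makes the computation easy to parallelize.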