As with the script cheat sheet, this isn’t a post in the classical sense 😊 Rather, it’s a collection of knowledge that I will keep updating, so I can find it easily.
Main taxonomies of LLMs/AIs
(products listed as of June 2025)
- By Architecture
  - Transformer-based: GPT series, BERT, T5
  - Recurrent (LSTM/GRU)
  - Convolutional
  - Hybrid
- By Training Approach
  - Autoregressive: GPT series, LLaMA, PaLM
  - Masked Language Modeling: BERT, RoBERTa, DeBERTa
  - Encoder-Decoder: T5, BART, UL2
  - Reinforcement Learning from Human Feedback (RLHF): ChatGPT, Claude, Gemini (formerly Bard)
- By Scale/Size
  - Small: <1B parameters (DistilBERT, MobileBERT)
  - Medium: 1-10B parameters (GPT-2, T5-3B)
  - Large: 10-100B parameters (LLaMA-65B, PaLM-62B)
  - Very Large: 100B+ parameters (GPT-3, GPT-4, PaLM-540B, Claude)
- By Modality
  - Text-only: GPT-3, BERT, T5
  - Multimodal: GPT-4V, DALL-E, Flamingo, Claude 3
  - Vision: CLIP, ALIGN
  - Audio: Whisper, MusicLM
  - Code: Codex, CodeT5, GitHub Copilot
- By Capability/Purpose
  - Foundation Models: GPT, BERT, T5 (general-purpose)
  - Specialized: BioBERT (biomedical), FinBERT (finance)
  - Conversational: ChatGPT, Claude, Gemini (formerly Bard)
  - Code Generation: Codex, CodeT5, StarCoder
  - Reasoning: PaLM-2, GPT-4, Claude
- By Training Data
  - Web-trained: Most large models (Common Crawl, web scrapes)
  - Curated: Models trained on filtered, high-quality datasets
  - Domain-specific: Models trained on specialized corpora
  - Synthetic: Models incorporating AI-generated training data
Architecture Categories
1. Transformer-based: Uses attention mechanisms to process sequences in parallel. Self-attention allows the model to weigh relationships between all tokens simultaneously.
Self-Attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V
Multi-Head: MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
Processes all positions simultaneously with O(n²) complexity.
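To make this concrete, here’s a minimal NumPy sketch of scaled dot-product self-attention (a toy illustration; the function names are my own, not a library API):

```python
# A minimal sketch of scaled dot-product self-attention with NumPy.
# Illustrative only: names (self_attention, softmax) are my own, not a library API.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) pairwise token-to-token scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(self_attention(Q, K, V).shape)    # -> (4, 8)
```

Note how the (n, n) score matrix is what drives the quadratic memory cost mentioned above.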
2. Recurrent (LSTM/GRU): Processes sequences step-by-step using memory states. Information flows through hidden states that capture context from previous tokens.
Hidden State Update: h_t = f(W_h h_{t−1} + W_x x_t + b)
LSTM Gates:
Forget: f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
Input: i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
Output: o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
Sequential processing with O(n) complexity.
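A minimal sketch of a single LSTM step, following the gate equations above (toy code with random weights; the stacked-gate weight layout is my own choice):

```python
# A minimal sketch of one LSTM time step, following the gate equations above.
# Illustrative only: the single stacked weight matrix W is my own layout choice.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps the concatenated [h_prev, x_t] to four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)                          # forget, input, output, candidate
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # update long-term cell memory
    h_t = sigmoid(o) * np.tanh(c_t)                      # expose part of it as the hidden state
    return h_t, c_t

# Toy example: hidden size 16, input size 8, a 5-token sequence processed in order
hidden, inp = 16, 8
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x_t in rng.standard_normal((5, inp)):  # O(n): one step per token, strictly in order
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)                              # -> (16,)
```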
3. Convolutional: Applies sliding filters across text sequences to detect local patterns and features, similar to image processing.
1D Convolution: (f ∗ g)[n] = Σ_m f[m] · g[n−m]
Feature Maps: y_{i,j} = ReLU(Σ_k w_k · x_{i+k, j} + b)
Local pattern detection with sliding windows.
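A tiny sketch of the same idea, sliding one filter across token embeddings (illustrative only; the sizes and names are my own):

```python
# A minimal sketch of a 1D convolution over token embeddings, matching the
# feature-map equation above. Illustrative only: sizes and names are my own.
import numpy as np

def conv1d_relu(x, w, b=0.0):
    """Slide a filter w of width k over x and apply ReLU at each position."""
    n, k = len(x), len(w)
    out = np.empty(n - k + 1)
    for i in range(n - k + 1):
        out[i] = np.sum(w * x[i:i + k]) + b  # local pattern score at position i
    return np.maximum(out, 0.0)              # ReLU

# Toy example: 10 tokens with 8-dim embeddings, one filter spanning 3 tokens
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))             # token embeddings
w = rng.standard_normal((3, 8))              # one sliding filter
print(conv1d_relu(x, w).shape)               # -> (8,) = one score per window position
```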
4. Hybrid: Combines multiple architectures (e.g., transformer + CNN) to leverage different strengths.
Combines architectures: Output = Transformer(CNN(input))
Or parallel processing: Output = α · Trans(x) + β · RNN(x)
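And the parallel combination in a few lines (purely illustrative; trans and rnn are placeholders for any two encoders with matching output shapes):

```python
# A minimal sketch of the parallel hybrid combination above. Illustrative only:
# trans and rnn are placeholders for any two encoders with matching output shapes.
import numpy as np

def hybrid(x, trans, rnn, alpha=0.7, beta=0.3):
    """Output = alpha * Trans(x) + beta * RNN(x)."""
    return alpha * trans(x) + beta * rnn(x)

# Toy stand-ins: any functions mapping (n, d) -> (n, d) would work here
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = hybrid(x, trans=lambda v: 2.0 * v, rnn=lambda v: v + 1.0)
print(out.shape)                             # -> (4, 8)
```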
Key difference: Transformers use parallel attention (O(n²) memory), RNNs use sequential states (O(n) memory), and CNNs use local convolutions.
Main pros & cons:
- Transformer-based
Pros: Fast parallel processing, excellent at understanding context
Cons: Memory-hungry, struggles with very long texts
- Recurrent (LSTM/GRU)
Pros: Memory-efficient, good at sequential patterns
Cons: Slow training, forgets distant information
- Convolutional
Pros: Fast, good at detecting local patterns
Cons: Limited long-range understanding, less flexible
- Hybrid
Pros: Combines strengths of multiple approaches
Cons: More complex, harder to optimize
In layman’s terms:
- Transformers are like reading the whole page at once
- Recurrent is like reading word-by-word while taking notes
- Convolutional is like scanning for specific phrases, and
- Hybrid is like using multiple reading strategies together.
Training Approach Categories
- Autoregressive: Predicts the next token given previous tokens. Trained left-to-right on text sequences.
- Masked Language Modeling: Randomly masks tokens in text and learns to predict them using bidirectional context from both sides (the sketch after this list contrasts these two kinds of training targets).
- Encoder-Decoder: Encodes input into representations, then decodes to generate output. Useful for translation and summarization tasks.
- Reinforcement Learning from Human Feedback (RLHF): Fine-tunes models using human feedback as rewards. Trains the model to generate responses humans prefer through reinforcement learning.
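To see how the first two objectives differ in practice, here’s a toy sketch of how training targets are built from the same sentence for autoregressive vs masked language modeling (illustrative only; real pipelines use subword tokenizers, special tokens, and batched tensors):

```python
# A toy sketch of how training targets differ for autoregressive vs masked
# language modeling, built from the same sentence. Illustrative only.
import random

tokens = "the cat sat on the mat".split()

# Autoregressive (GPT-style): at every position, the target is simply the next token.
ar_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (['the'], 'cat'), (['the', 'cat'], 'sat'), ...

# Masked LM (BERT-style): hide some tokens and predict them using context from both sides.
# (BERT masks ~15%; a higher rate is used here so this tiny example masks something.)
random.seed(0)
masked, targets = tokens[:], {}
for pos, tok in enumerate(tokens):
    if random.random() < 0.3:
        masked[pos] = "[MASK]"
        targets[pos] = tok

print(ar_pairs[:2])
print(masked, targets)
```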
Main pros & cons:
- Autoregressive (GPT-style)
Pros: Great at creative text generation, coherent long-form writing
Cons: Can’t “look ahead” in text, slower for some tasks
- Masked Language Modeling (BERT-style)
Pros: Understands context from both directions, excellent for comprehension
Cons: Poor at generating new text from scratch
- Encoder-Decoder (T5-style)
Pros: Flexible for many tasks, good at text-to-text transformations
Cons: More complex architecture, requires more computational resources
- Reinforcement Learning from Human Feedback (RLHF, ChatGPT/Claude-style)
Pros: Produces helpful, harmless responses aligned with human preferences
Cons: Expensive to train, can be overly cautious or verbose
In layman’s terms:
- Autoregressive is like writing a story word-by-word
- Masked is like fill-in-the-blank exercises
- Encoder-Decoder is like translation between languages, and
- RLHF is like having a human tutor guide the AI’s responses.