
My LLM/AI cheat sheet

As with the script cheat sheet, this is not a post in the classical sense 😊 Rather, it is a collection of knowledge that I will keep updating, so that I can find it easily.

Main taxonomies of LLMs/AIs

(products listed as of June 2025)

  1. By Architecture
    • Transformer-based: GPT, BERT, T5, PaLM, Claude
    • Recurrent: LSTM, GRU-based models
    • Convolutional: CNN-based language models
    • Hybrid: Models combining multiple architectures
  2. By Training Approach
    • Autoregressive: GPT series, LLaMA, PaLM
    • Masked Language Modeling: BERT, RoBERTa, DeBERTa
    • Encoder-Decoder: T5, BART, UL2
    • Reinforcement Learning from Human Feedback (RLHF): ChatGPT, Claude, Gemini (formerly Bard)
  3. By Scale/Size
    • Small: <1B parameters (DistilBERT, MobileBERT)
    • Medium: 1-10B parameters (GPT-2, T5-Base)
    • Large: 10-100B parameters (GPT-3, PaLM-62B)
    • Very Large: 100B+ parameters (GPT-4, PaLM-540B, Claude)
  4. By Modality
    • Text-only: GPT-3, BERT, T5
    • Multimodal: GPT-4V, DALL-E, Flamingo, Claude 3
    • Vision: CLIP, ALIGN
    • Audio: Whisper, MusicLM
    • Code: Codex, CodeT5, GitHub Copilot
  5. By Capability/Purpose
    • Foundation Models: GPT, BERT, T5 (general-purpose)
    • Specialized: BioBERT (biomedical), FinBERT (finance)
    • Conversational: ChatGPT, Claude, Gemini (formerly Bard)
    • Code Generation: Codex, CodeT5, StarCoder
    • Reasoning: PaLM-2, GPT-4, Claude
  6. By Training Data
    • Web-trained: Most large models (Common Crawl, web scrapes)
    • Curated: Models trained on filtered, high-quality datasets
    • Domain-specific: Models trained on specialized corpora
    • Synthetic: Models incorporating AI-generated training data

Architecture Categories

1. Transformer-based: Uses attention mechanisms to process sequences in parallel. Self-attention allows the model to weigh relationships between all tokens simultaneously.

Self-Attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V

Multi-Head: MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

Processes all positions simultaneously with O(n²) complexity.
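
The self-attention formula above can be sketched in a few lines of NumPy (toy sizes, random data — not any particular model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every token scores every other token
    weights = softmax(scores, axis=-1)
    return weights @ V                # weighted sum of value vectors

# Toy example: 4 tokens, embedding dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = self_attention(X, X, X)   # self-attention: Q = K = V come from the same input
print(out.shape)  # (4, 8) — one context-aware vector per token
```

Note the (n, n) score matrix: that is where the O(n²) memory cost comes from.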

2. Recurrent (LSTM/GRU): Processes sequences step-by-step using memory states. Information flows through hidden states that capture context from previous tokens.

Hidden State Update: h_t = f(W_h h_{t−1} + W_x x_t + b)

LSTM Gates:

Forget: f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

Input: i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

Output: o_t = σ(W_o · [h_{t−1}, x_t] + b_o)

Sequential processing with O(n) complexity.
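
A minimal NumPy sketch of one LSTM time step using the gate equations above (toy sizes, random weights; the candidate cell state g_t and cell update are included for completeness):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the 4 stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])        # forget gate: what to drop from memory
    i = sigmoid(z[H:2*H])      # input gate: what new info to admit
    o = sigmoid(z[2*H:3*H])    # output gate: what to expose
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c_t = f * c_prev + i * g   # update the memory cell
    h_t = o * np.tanh(c_t)     # gated view of the cell becomes the hidden state
    return h_t, c_t

# Process a toy sequence of 5 tokens: hidden size 3, input size 2
rng = np.random.default_rng(1)
H, D = 3, 2
W = rng.normal(scale=0.1, size=(4*H, H+D))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, b)   # strictly sequential: hence O(n) time, no parallelism
print(h.shape)  # (3,)
```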

3. Convolutional: Applies sliding filters across text sequences to detect local patterns and features, similar to image processing.

1D Convolution: (f ∗ g)[n] = Σ_m f[m] · g[n−m]

Feature Maps: y_{i,j} = ReLU(Σ_k w_k · x_{i+k, j} + b)

Local pattern detection with sliding windows.
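
The sliding-window idea can be sketched as a single 1D convolutional filter over token embeddings (toy sizes, random data):

```python
import numpy as np

def conv1d_text(X, W, b):
    """Slide one filter of width k over token embeddings X (n, d) -> feature map (n-k+1,)."""
    k = W.shape[0]
    n = X.shape[0]
    return np.array([
        max(0.0, float(np.sum(W * X[i:i+k]) + b))   # ReLU over one k-token window
        for i in range(n - k + 1)
    ])

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))   # 10 tokens, 4-dim embeddings
W = rng.normal(size=(3, 4))    # one filter spanning 3 consecutive tokens
feats = conv1d_text(X, W, b=0.0)
print(feats.shape)  # (8,) — one activation per window position
```

Each activation only "sees" 3 adjacent tokens, which is exactly the local-pattern strength and the long-range weakness listed below.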

4. Hybrid: Combines multiple architectures (e.g., transformer + CNN) to leverage different strengths.

Combines architectures: Output = Transformer(CNN(input))

Or parallel processing: Output = α · Transformer(x) + β · RNN(x)

Key difference: Transformers use parallel attention (O(n²) memory), RNNs use sequential states (O(n) memory), CNNs use local convolutions.

Main pros & cons:

  • Transformer-based

Pros: Fast parallel processing, excellent at understanding context

Cons: Memory-hungry, struggles with very long texts

  • Recurrent (LSTM/GRU)

Pros: Memory-efficient, good at sequential patterns

Cons: Slow training, forgets distant information

  • Convolutional

Pros: Fast, good at detecting local patterns

Cons: Limited long-range understanding, less flexible

  • Hybrid

Pros: Combines strengths of multiple approaches

Cons: More complex, harder to optimize


Training Approach Categories

– Autoregressive: Predicts the next token given previous tokens. Trained left-to-right on text sequences.
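
A minimal sketch of the autoregressive objective: shift the sequence by one token and score each position's prediction of the next token (random logits stand in for a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

tokens = np.array([0, 3, 1, 4, 2])           # one toy training sequence, vocab size 5
inputs, targets = tokens[:-1], tokens[1:]    # predict token t+1 from tokens up to t

rng = np.random.default_rng(3)
logits = rng.normal(size=(len(inputs), 5))   # stand-in for a model's (position, vocab) scores

# Cross-entropy of the true next token at every position
probs = softmax(logits)
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(loss)
```

The key point is the shift: position t never sees tokens beyond t, which is why these models can't "look ahead".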

– Masked Language Modeling: Randomly masks tokens in text and learns to predict them using bidirectional context from both sides.
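
The masked-LM setup can be sketched as data preparation: corrupt ~15% of tokens (the BERT-style rate) and keep the originals as labels, marking unmasked positions with −100 so they are ignored in the loss (a common convention, e.g. in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(4)
MASK = 99                              # hypothetical [MASK] token id
tokens = np.array([5, 12, 7, 3, 8, 1])

mask = rng.random(tokens.shape) < 0.15  # randomly pick ~15% of positions
mask[0] = True                          # force at least one masked position for the demo

corrupted = np.where(mask, MASK, tokens)    # model input: originals replaced by [MASK]
labels = np.where(mask, tokens, -100)       # loss targets: only masked positions count

print(corrupted)
print(labels)
```

Because the model sees the full corrupted sequence, it can use context from both sides of each mask — unlike the left-to-right autoregressive setup.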

– Encoder-Decoder: Encodes input into representations, then decodes to generate output. Useful for translation and summarization tasks.

– Reinforcement Learning from Human Feedback (RLHF): Fine-tunes models using human feedback as rewards. Trains the model to generate responses humans prefer through reinforcement learning.

Main pros & cons:

  • Autoregressive (GPT-style)

Pros: Great at creative text generation, coherent long-form writing

Cons: Can’t “look ahead” in text, slower for some tasks

  • Masked Language Modeling (BERT-style)

Pros: Understands context from both directions, excellent for comprehension

Cons: Poor at generating new text from scratch

  • Encoder-Decoder (T5-style)

Pros: Flexible for many tasks, good at text-to-text transformations

Cons: More complex architecture, requires more computational resources

  • Reinforcement Learning from Human Feedback (RLHF, ChatGPT/Claude-style)

Pros: Produces helpful, harmless responses aligned with human preferences

Cons: Expensive to train, can be overly cautious or verbose
