
My LLM/AI cheat sheet

As with the script cheat sheet, this is not a post in the classical sense 😊 Rather, it is a collection of knowledge that I will keep updating, so that I can find it easily.

Main taxonomies of LLMs/AIs

(products listed as of June 2025)

  1. By Architecture
    • Transformer-based: GPT, BERT, T5, PaLM, Claude
    • Recurrent: LSTM, GRU-based models
    • Convolutional: CNN-based language models
    • Hybrid: Models combining multiple architectures
  2. By Training Approach
    • Autoregressive: GPT series, LLaMA, PaLM
    • Masked Language Modeling: BERT, RoBERTa, DeBERTa
    • Encoder-Decoder: T5, BART, UL2
    • Reinforcement Learning from Human Feedback (RLHF): ChatGPT, Claude, Gemini (formerly Bard)
  3. By Scale/Size
    • Small: <1B parameters (DistilBERT, MobileBERT)
    • Medium: 1-10B parameters (GPT-2, T5-Base)
    • Large: 10-100B parameters (GPT-3, PaLM-62B)
    • Very Large: 100B+ parameters (GPT-4, PaLM-540B, Claude)
  4. By Modality
    • Text-only: GPT-3, BERT, T5
    • Multimodal: GPT-4V, DALL-E, Flamingo, Claude 3
    • Vision: CLIP, ALIGN
    • Audio: Whisper, MusicLM
    • Code: Codex, CodeT5, GitHub Copilot
  5. By Capability/Purpose
    • Foundation Models: GPT, BERT, T5 (general-purpose)
    • Specialized: BioBERT (biomedical), FinBERT (finance)
    • Conversational: ChatGPT, Claude, Gemini (formerly Bard)
    • Code Generation: Codex, CodeT5, StarCoder
    • Reasoning: PaLM-2, GPT-4, Claude
  6. By Training Data
    • Web-trained: Most large models (Common Crawl, web scrapes)
    • Curated: Models trained on filtered, high-quality datasets
    • Domain-specific: Models trained on specialized corpora
    • Synthetic: Models incorporating AI-generated training data

Architecture Categories

1. Transformer-based: Uses attention mechanisms to process sequences in parallel. Self-attention allows the model to weigh relationships between all tokens simultaneously.

Self-Attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V

Multi-Head: MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

Processes all positions simultaneously with O(n²) complexity.
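
The self-attention formula above can be sketched in a few lines of NumPy (toy sizes, random data — not any particular model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every token scores every other token
    weights = softmax(scores, axis=-1)
    return weights @ V                # weighted sum of value vectors

# Toy example: 4 tokens, embedding dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = self_attention(X, X, X)   # self-attention: Q = K = V come from the same input
print(out.shape)  # (4, 8) — one context-aware vector per token
```

Note the (n, n) score matrix: that is where the O(n²) memory cost comes from.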

2. Recurrent (LSTM/GRU): Processes sequences step-by-step using memory states. Information flows through hidden states that capture context from previous tokens.

Hidden State Update: h_t = f(W_h h_{t−1} + W_x x_t + b)

LSTM Gates:

Forget: f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

Input: i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

Output: o_t = σ(W_o · [h_{t−1}, x_t] + b_o)

Sequential processing with O(n) complexity.
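
A minimal NumPy sketch of one LSTM time step using the gate equations above (toy sizes, random weights; the candidate cell state g_t and cell update are included for completeness):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the 4 stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])        # forget gate: what to drop from memory
    i = sigmoid(z[H:2*H])      # input gate: what new info to admit
    o = sigmoid(z[2*H:3*H])    # output gate: what to expose
    g = np.tanh(z[3*H:4*H])    # candidate cell state
    c_t = f * c_prev + i * g   # update the memory cell
    h_t = o * np.tanh(c_t)     # gated view of the cell becomes the hidden state
    return h_t, c_t

# Process a toy sequence of 5 tokens: hidden size 3, input size 2
rng = np.random.default_rng(1)
H, D = 3, 2
W = rng.normal(scale=0.1, size=(4*H, H+D))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, b)   # strictly sequential: hence O(n) time, no parallelism
print(h.shape)  # (3,)
```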

3. Convolutional: Applies sliding filters across text sequences to detect local patterns and features, similar to image processing.

1D Convolution: (f ∗ g)[n] = Σ_m f[m] · g[n−m]

Feature Maps: y_{i,j} = ReLU(Σ_k w_k · x_{i+k, j} + b)

Local pattern detection with sliding windows.
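
The sliding-window idea can be sketched as a single 1D convolutional filter over token embeddings (toy sizes, random data):

```python
import numpy as np

def conv1d_text(X, W, b):
    """Slide one filter of width k over token embeddings X (n, d) -> feature map (n-k+1,)."""
    k = W.shape[0]
    n = X.shape[0]
    return np.array([
        max(0.0, float(np.sum(W * X[i:i+k]) + b))   # ReLU over one k-token window
        for i in range(n - k + 1)
    ])

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))   # 10 tokens, 4-dim embeddings
W = rng.normal(size=(3, 4))    # one filter spanning 3 consecutive tokens
feats = conv1d_text(X, W, b=0.0)
print(feats.shape)  # (8,) — one activation per window position
```

Each activation only "sees" 3 adjacent tokens, which is exactly the local-pattern strength and the long-range weakness listed below.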

4. Hybrid: Combines multiple architectures (e.g., transformer + CNN) to leverage different strengths.

Combines architectures: Output = Transformer(CNN(input))

Or parallel processing: Output = α · Transformer(x) + β · RNN(x)

Key difference: Transformers use parallel attention (O(n²) memory), RNNs use sequential states (O(n) memory), CNNs use local convolutions.

Main pros & cons:

  • Transformer-based

Pros: Fast parallel processing, excellent at understanding context

Cons: Memory-hungry, struggles with very long texts

  • Recurrent (LSTM/GRU)

Pros: Memory-efficient, good at sequential patterns

Cons: Slow training, forgets distant information

  • Convolutional

Pros: Fast, good at detecting local patterns

Cons: Limited long-range understanding, less flexible

  • Hybrid

Pros: Combines strengths of multiple approaches

Cons: More complex, harder to optimize


Training Approach Categories

– Autoregressive: Predicts the next token given previous tokens. Trained left-to-right on text sequences.
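
A minimal sketch of the autoregressive objective: shift the sequence by one token and score each position's prediction of the next token (random logits stand in for a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

tokens = np.array([0, 3, 1, 4, 2])           # one toy training sequence, vocab size 5
inputs, targets = tokens[:-1], tokens[1:]    # predict token t+1 from tokens up to t

rng = np.random.default_rng(3)
logits = rng.normal(size=(len(inputs), 5))   # stand-in for a model's (position, vocab) scores

# Cross-entropy of the true next token at every position
probs = softmax(logits)
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(loss)
```

The key point is the shift: position t never sees tokens beyond t, which is why these models can't "look ahead".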

– Masked Language Modeling: Randomly masks tokens in text and learns to predict them using bidirectional context from both sides.
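
The masked-LM setup can be sketched as data preparation: corrupt ~15% of tokens (the BERT-style rate) and keep the originals as labels, marking unmasked positions with −100 so they are ignored in the loss (a common convention, e.g. in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(4)
MASK = 99                              # hypothetical [MASK] token id
tokens = np.array([5, 12, 7, 3, 8, 1])

mask = rng.random(tokens.shape) < 0.15  # randomly pick ~15% of positions
mask[0] = True                          # force at least one masked position for the demo

corrupted = np.where(mask, MASK, tokens)    # model input: originals replaced by [MASK]
labels = np.where(mask, tokens, -100)       # loss targets: only masked positions count

print(corrupted)
print(labels)
```

Because the model sees the full corrupted sequence, it can use context from both sides of each mask — unlike the left-to-right autoregressive setup.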

– Encoder-Decoder: Encodes input into representations, then decodes to generate output. Useful for translation and summarization tasks.

– Reinforcement Learning from Human Feedback (RLHF): Fine-tunes models using human feedback as rewards. Trains the model to generate responses humans prefer through reinforcement learning.

Main pros & cons:

  • Autoregressive (GPT-style)

Pros: Great at creative text generation, coherent long-form writing

Cons: Can’t “look ahead” in text, slower for some tasks

  • Masked Language Modeling (BERT-style)

Pros: Understands context from both directions, excellent for comprehension

Cons: Poor at generating new text from scratch

  • Encoder-Decoder (T5-style)

Pros: Flexible for many tasks, good at text-to-text transformations

Cons: More complex architecture, requires more computational resources

  • Reinforcement Learning from Human Feedback (RLHF, ChatGPT/Claude-style)

Pros: Produces helpful, harmless responses aligned with human preferences

Cons: Expensive to train, can be overly cautious or verbose
