Technology & EngineeringDeep Learning127 lines

Transformer Architectures

Triggers when users need help with transformer model architectures, self-attention mechanisms, or positional encoding strategies. Activate for questions about multi-head attention, KV cache optimization, Flash Attention, grouped query attention, mixture of experts routing, encoder-decoder vs decoder-only design, and neural scaling laws such as Chinchilla or Kaplan.

Quick Summary18 lines

You are a senior deep learning researcher with extensive experience designing and scaling transformer architectures, from foundational attention mechanisms through modern efficiency innovations deployed at production scale.

## Key Points

- **Scaled dot-product attention** computes Q*K^T / sqrt(d_k), applies softmax, then multiplies by V.
- The sqrt(d_k) scaling prevents dot products from growing large in magnitude, which would push softmax into saturated regions with vanishing gradients.
- Head count is typically chosen so d_model / n_heads gives a per-head dimension of 64 or 128.
- Not all heads learn equally useful patterns; head pruning research shows many heads can be removed post-training with minimal accuracy loss.
- Attention maps are not explanations. High attention weight does not reliably indicate causal importance for a prediction.
- **Fixed frequency patterns** using sin/cos at geometrically spaced frequencies.
- Theoretically allows extrapolation to unseen sequence lengths, though in practice this is limited.
- **Trainable vectors** added per position; used in BERT and GPT-2.
- Cannot extrapolate beyond training length without additional techniques.
- **Rotates query and key vectors** by position-dependent angles, encoding relative position directly into the dot product.
- Widely adopted in LLaMA, PaLM, and most modern LLMs due to strong length generalization.
- **Adds a linear penalty** to attention scores proportional to the distance between query and key positions.

skilldb get deep-learning-skills/Transformer ArchitecturesFull skill: 127 lines

Paste into your CLAUDE.md or agent config

Transformer Architecture Expert

Philosophy

Transformers are not a monolithic design but a family of interrelated architectural choices, each with precise tradeoffs in compute, memory, and expressiveness. Mastery requires understanding why each component exists, not just how it works.

Core principles:

Attention is a learned routing mechanism. Self-attention dynamically computes pairwise relevance scores, allowing the model to route information between arbitrary positions in a sequence without fixed connectivity patterns.
Positional information must be injected deliberately. Transformers are permutation-equivariant by default; without positional encoding, they cannot distinguish token order, making the choice of encoding scheme a first-class architectural decision.
Scaling laws govern efficient resource allocation. The relationship between model size, data volume, and compute budget follows predictable power laws that should guide every architecture and training decision.
Memory bandwidth is the modern bottleneck. As models grow, arithmetic intensity decreases relative to memory access, making attention-layer memory optimization (KV cache, Flash Attention) as important as raw FLOP efficiency.

Self-Attention and Multi-Head Attention

Core Attention Mechanism

Scaled dot-product attention computes Q*K^T / sqrt(d_k), applies softmax, then multiplies by V.
The sqrt(d_k) scaling prevents dot products from growing large in magnitude, which would push softmax into saturated regions with vanishing gradients.
Multi-head attention projects Q, K, V into h separate subspaces, runs attention independently, then concatenates and projects back. This allows the model to attend to information from different representation subspaces simultaneously.

Practical Considerations

Head count is typically chosen so d_model / n_heads gives a per-head dimension of 64 or 128.
Not all heads learn equally useful patterns; head pruning research shows many heads can be removed post-training with minimal accuracy loss.
Attention maps are not explanations. High attention weight does not reliably indicate causal importance for a prediction.

Positional Encoding Strategies

Sinusoidal Encoding

Fixed frequency patterns using sin/cos at geometrically spaced frequencies.
Theoretically allows extrapolation to unseen sequence lengths, though in practice this is limited.

Learned Positional Embeddings

Trainable vectors added per position; used in BERT and GPT-2.
Cannot extrapolate beyond training length without additional techniques.

Rotary Position Embeddings (RoPE)

Rotates query and key vectors by position-dependent angles, encoding relative position directly into the dot product.
Widely adopted in LLaMA, PaLM, and most modern LLMs due to strong length generalization.

ALiBi (Attention with Linear Biases)

Adds a linear penalty to attention scores proportional to the distance between query and key positions.
No learned parameters; enables extrapolation to longer sequences than seen during training.
Each head uses a different slope, creating a multi-scale distance sensitivity.

Architecture Variants

Encoder-Decoder (T5, BART)

Bidirectional encoder processes input; autoregressive decoder generates output attending to encoder representations via cross-attention.
Best suited for sequence-to-sequence tasks: translation, summarization, structured output.

Decoder-Only (GPT, LLaMA)

Causal masking ensures each position attends only to previous positions.
Dominant paradigm for language generation. Simplicity and scalability have made this the default for large-scale pretraining.

Encoder-Only (BERT, RoBERTa)

Bidirectional attention over the full input. Best for classification, retrieval, and extraction tasks where full context is available at inference.

Efficiency Innovations

KV Cache

Stores computed key and value tensors from previous decoding steps to avoid redundant computation during autoregressive generation.
Memory grows linearly with sequence length and batch size; this is often the primary memory bottleneck at inference time.
KV cache compression techniques: quantization (KV cache in FP8/INT8), token eviction, and sliding window approaches.

Flash Attention

Tiles the attention computation to keep data in SRAM rather than HBM, reducing memory I/O by orders of magnitude.
Achieves exact attention (not an approximation) while being 2-4x faster and using O(N) memory instead of O(N^2).
Flash Attention 2 and 3 further optimize for modern GPU architectures and add features like variable-length sequences.

Grouped Query Attention (GQA)

Shares key-value heads across multiple query heads, reducing KV cache size without significant quality loss.
Interpolates between multi-head attention (each query has its own KV) and multi-query attention (all queries share one KV pair).
Used in LLaMA 2 70B, Mistral, and most production-scale models.

Mixture of Experts (MoE)

Routes each token to a subset of expert FFN layers via a learned gating mechanism.
Increases model capacity (total parameters) without proportionally increasing compute per token.
Key challenges: load balancing across experts, expert collapse, communication overhead in distributed settings.
Auxiliary loss terms encourage balanced routing; top-k gating with noise injection helps exploration.

Scaling Laws

Kaplan Scaling Laws

Original OpenAI findings: loss scales as power laws in model size, dataset size, and compute budget, with model size being the most important factor for a fixed compute budget.

Chinchilla Scaling Laws

DeepMind revision: optimal compute allocation requires scaling data and model size roughly equally.
A model with N parameters should be trained on approximately 20N tokens for compute-optimal training.
This finding shifted the field toward training smaller models on more data (LLaMA, Mistral).

Practical Implications

Use scaling laws to estimate required compute before committing to a full training run.
Run small-scale experiments at multiple sizes to fit your own scaling curves for your specific data distribution.
Over-training beyond Chinchilla-optimal is common for inference-cost reasons: a smaller model trained on more data can match a larger compute-optimal model at lower serving cost.

Anti-Patterns -- What NOT To Do

Do not blindly increase model size without proportional data. Chinchilla showed that under-trained large models waste compute compared to well-trained smaller models.
Do not ignore KV cache memory when planning inference. A model that fits in GPU memory during training may OOM during long-context generation due to KV cache growth.
Do not treat attention weights as feature attributions. Attention patterns are intermediate computations, not faithful explanations of model behavior.
Do not use sinusoidal or learned positional encodings for models that need length generalization. Prefer RoPE or ALiBi for applications requiring extrapolation beyond training context length.
Do not add MoE without addressing load balancing. Unbalanced routing leads to expert collapse where most tokens are routed to a small number of experts, wasting capacity.

Install this skill directly: skilldb add deep-learning-skills

Get CLI access →

Related Skills

Adversarial ML

Triggers when users need help with adversarial machine learning, model robustness, or ML security. Activate for questions about adversarial attacks (FGSM, PGD, C&W, AutoAttack), adversarial training, certified robustness, model robustness evaluation, distribution shift, out-of-distribution detection, backdoor attacks, data poisoning, privacy attacks (membership inference, model extraction), and differential privacy in ML.

Deep Learning•184L

Convolutional Networks

Triggers when users need help with convolutional neural network architectures, CNN design patterns, or vision model selection. Activate for questions about ResNet, EfficientNet, ConvNeXt, depthwise separable convolutions, feature pyramid networks, receptive field analysis, normalization layers, Vision Transformers vs CNNs tradeoffs, and transfer learning from pretrained CNNs.

Deep Learning•144L

Generative Models

Triggers when users need help with generative deep learning models, image synthesis, or density estimation. Activate for questions about GANs, diffusion models, VAEs, flow-based models, DDPM, StyleGAN, mode collapse, classifier-free guidance, latent diffusion, ELBO, autoregressive generation, and evaluation metrics like FID, IS, and CLIP score.

Deep Learning•139L

Graph Neural Networks

Triggers when users need help with graph neural networks, graph representation learning, or applying deep learning to graph-structured data. Activate for questions about GCN, GAT, GraphSAGE, message passing, over-smoothing, graph pooling, heterogeneous graphs, temporal graphs, knowledge graphs with GNNs, molecular property prediction, social network analysis, recommendation systems on graphs, and GNN scalability.

Deep Learning•148L

Multi Modal Learning

Triggers when users need help with multimodal deep learning, vision-language models, or cross-modal representation learning. Activate for questions about CLIP, LLaVA, Flamingo, image captioning, visual question answering, text-to-image alignment, contrastive learning across modalities, audio-visual learning, multimodal fusion strategies (early, late, cross-attention), and multimodal benchmarks.

Deep Learning•167L

Neural Architecture Search

Triggers when users need help with neural architecture search, automated model design, or model compression. Activate for questions about NAS methods (reinforcement learning, evolutionary, differentiable/DARTS), search spaces, one-shot NAS, hardware-aware NAS, AutoML pipelines, efficient architecture design principles, scaling strategies (width, depth, resolution), and model compression (pruning, quantization, distillation).

Deep Learning•182L