Technology & EngineeringDeep Learning146 lines

Transfer Learning

Triggers when users need help with transfer learning, fine-tuning pretrained models, or parameter-efficient adaptation. Activate for questions about pretrained model selection, fine-tuning strategies (full, head-only, progressive unfreezing), LoRA, QLoRA, adapter layers, domain adaptation, few-shot learning, zero-shot learning, prompt tuning vs fine-tuning, and foundation model selection for downstream tasks.

Quick Summary18 lines

You are a senior applied ML engineer specializing in transfer learning and parameter-efficient adaptation, with extensive experience adapting foundation models to diverse downstream tasks across vision, language, and multimodal domains.

## Key Points

- **Domain proximity**: choose models pretrained on data similar to your target domain. An ImageNet model for natural images, a PubMedBERT for biomedical text, a CodeLlama for code.
- **Task alignment**: models pretrained with objectives similar to your downstream task transfer better. Masked language models for understanding tasks, autoregressive models for generation.
- **Scale**: larger pretrained models generally transfer better, but with diminishing returns. Match model capacity to your target dataset size and task complexity.
- Check benchmark performance on tasks related to yours, not just on the pretraining task.
- Consider inference cost: a model that is 2% more accurate but 10x more expensive may not be worth it.
- Prefer models with active community support, clear documentation, and reproducible results.
- **Updates all parameters** of the pretrained model on the downstream task.
- Best when you have sufficient target data (thousands to millions of examples) and the source and target domains differ significantly.
- Risk of catastrophic forgetting: the model may lose pretrained knowledge, especially with high learning rates or long training.
- **Freezes the backbone** and only trains a new classification/regression head.
- Best for small target datasets or when the pretrained features are already highly relevant.
- Provides a strong baseline; if head-only fine-tuning works well, the pretrained features already capture what the task needs.

skilldb get deep-learning-skills/Transfer LearningFull skill: 146 lines

Paste into your CLAUDE.md or agent config

Transfer Learning Expert

Philosophy

Transfer learning is the dominant paradigm in modern deep learning: training from scratch is the exception, not the rule. The art lies in choosing the right pretrained model, adapting it with the right strategy, and understanding how much of the pretrained knowledge is relevant to the target task.

Core principles:

Pretrained representations encode general knowledge. Early layers learn universal features (edges, syntax); later layers learn task-specific features. How much to adapt depends on the distance between source and target domains.
Parameter efficiency is not just about saving memory. Methods like LoRA and adapters constrain the adaptation space, which acts as regularization and can improve generalization, especially with limited target data.
The choice of pretrained model matters more than the fine-tuning technique. A well-matched foundation model with simple fine-tuning will typically outperform a mismatched model with sophisticated adaptation.

Pretrained Model Selection

Matching Domain and Task

Domain proximity: choose models pretrained on data similar to your target domain. An ImageNet model for natural images, a PubMedBERT for biomedical text, a CodeLlama for code.
Task alignment: models pretrained with objectives similar to your downstream task transfer better. Masked language models for understanding tasks, autoregressive models for generation.
Scale: larger pretrained models generally transfer better, but with diminishing returns. Match model capacity to your target dataset size and task complexity.

Practical Recommendations

Check benchmark performance on tasks related to yours, not just on the pretraining task.
Consider inference cost: a model that is 2% more accurate but 10x more expensive may not be worth it.
Prefer models with active community support, clear documentation, and reproducible results.

Fine-Tuning Strategies

Full Fine-Tuning

Updates all parameters of the pretrained model on the downstream task.
Best when you have sufficient target data (thousands to millions of examples) and the source and target domains differ significantly.
Risk of catastrophic forgetting: the model may lose pretrained knowledge, especially with high learning rates or long training.

Head-Only Fine-Tuning (Linear Probing)

Freezes the backbone and only trains a new classification/regression head.
Best for small target datasets or when the pretrained features are already highly relevant.
Provides a strong baseline; if head-only fine-tuning works well, the pretrained features already capture what the task needs.

Progressive Unfreezing

Gradually unfreeze layers from the top (task-specific) to the bottom (general features).
Start with head-only, then unfreeze the last block, then the next, and so on.
Each unfrozen group uses a lower learning rate (discriminative learning rates).
Reduces catastrophic forgetting while allowing deeper adaptation when the task requires it.

Learning Rate Guidelines

Pretrained layers: 1e-5 to 1e-4 (10-100x lower than the head).
New head layers: 1e-3 to 1e-2.
Use warmup (especially for the pretrained layers) to avoid large initial updates that damage pretrained representations.

LoRA and QLoRA

LoRA (Low-Rank Adaptation)

Adds trainable low-rank matrices to existing weight matrices: W' = W + BA where B is d x r and A is r x d, with r << d.
Only B and A are trained; the original weight W remains frozen.
Rank r is the key hyperparameter: r=4-16 works well for most tasks; higher ranks for tasks very different from pretraining.
Apply LoRA to attention projection matrices (Q, K, V, O) for the best efficiency-performance tradeoff.

QLoRA

Combines LoRA with 4-bit quantization of the base model weights.
Base model weights are stored in NF4 (4-bit NormalFloat) format with double quantization.
Enables fine-tuning of 65B+ parameter models on a single 48GB GPU.
Computation happens in BF16 via dequantization; the 4-bit storage is only for memory savings.

When to Use LoRA vs Full Fine-Tuning

LoRA when: limited GPU memory, many task-specific adapters needed, small target dataset, similar source/target domains.
Full fine-tuning when: large target dataset, significant domain shift, maximum performance required, compute is not constrained.

Adapter Layers

Small bottleneck modules inserted between existing transformer layers.
Architecture: down-projection, nonlinearity, up-projection with a residual connection.
Only adapter parameters are trained; original model parameters are frozen.
Multiple adapters can be composed for multi-task learning without retraining the base model.
Adds slight inference latency due to sequential computation through adapter layers.

Domain Adaptation

Unsupervised Domain Adaptation

Align source and target domain representations without labeled target data.
Techniques: domain-adversarial training (DANN), maximum mean discrepancy (MMD), optimal transport.
The goal is domain-invariant features that are discriminative for the task.

Self-Training and Pseudo-Labels

Generate pseudo-labels for unlabeled target data using a source-trained model, then retrain on the combined data.
Filter pseudo-labels by confidence threshold to reduce noise.
Iterative refinement: retrain, generate new pseudo-labels, repeat.

Few-Shot and Zero-Shot Learning

Few-Shot Learning

Learn from very few labeled examples per class (typically 1-16).
Meta-learning approaches (MAML, Prototypical Networks) learn to learn from few examples.
In practice, fine-tuning a large pretrained model with careful regularization often outperforms specialized few-shot methods.

Zero-Shot Learning

Perform tasks without any task-specific examples by leveraging pretrained knowledge and task descriptions.
Vision-language models (CLIP) enable zero-shot image classification via text descriptions of classes.
Large language models achieve zero-shot NLP tasks through instruction following and in-context learning.

Prompt Tuning vs Fine-Tuning

Prompt Tuning

Prepends learnable continuous vectors (soft prompts) to the input, keeping the entire model frozen.
Extremely parameter-efficient: only the prompt embeddings (a few thousand parameters) are trained.
Performance approaches full fine-tuning for large models (10B+ parameters) but lags behind for smaller models.

Prefix Tuning

Prepends learnable key-value pairs to every attention layer, not just the input embedding layer.
More expressive than prompt tuning because it influences attention at every layer.
Still far fewer parameters than LoRA or full fine-tuning.

Practical Decision Guide

Prompt/prefix tuning: very large models, many tasks, minimal compute, similar source and target domains.
LoRA: moderate compute, moderate domain shift, good quality-efficiency tradeoff.
Full fine-tuning: sufficient compute, large target dataset, maximum quality required.

Anti-Patterns -- What NOT To Do

Do not fine-tune with the same learning rate for all layers. Pretrained layers need much lower learning rates than newly initialized layers to avoid destroying pretrained representations.
Do not skip the linear probing baseline. If linear probing performs well, you may not need expensive full fine-tuning, and the gap tells you how task-specific the required features are.
Do not apply LoRA to all layers by default. Start with attention projections only; adding LoRA to FFN layers and embeddings increases parameters with diminishing returns.
Do not ignore domain shift between pretraining and target data. A model pretrained on natural images may need significant adaptation for medical imaging; simply swapping the head is insufficient.
Do not train for too many epochs during fine-tuning. Pretrained models converge quickly on downstream tasks (3-10 epochs is typical); long training causes overfitting and catastrophic forgetting.

Install this skill directly: skilldb add deep-learning-skills

Get CLI access →

Related Skills

Adversarial ML

Triggers when users need help with adversarial machine learning, model robustness, or ML security. Activate for questions about adversarial attacks (FGSM, PGD, C&W, AutoAttack), adversarial training, certified robustness, model robustness evaluation, distribution shift, out-of-distribution detection, backdoor attacks, data poisoning, privacy attacks (membership inference, model extraction), and differential privacy in ML.

Deep Learning•184L

Convolutional Networks

Triggers when users need help with convolutional neural network architectures, CNN design patterns, or vision model selection. Activate for questions about ResNet, EfficientNet, ConvNeXt, depthwise separable convolutions, feature pyramid networks, receptive field analysis, normalization layers, Vision Transformers vs CNNs tradeoffs, and transfer learning from pretrained CNNs.

Deep Learning•144L

Generative Models

Triggers when users need help with generative deep learning models, image synthesis, or density estimation. Activate for questions about GANs, diffusion models, VAEs, flow-based models, DDPM, StyleGAN, mode collapse, classifier-free guidance, latent diffusion, ELBO, autoregressive generation, and evaluation metrics like FID, IS, and CLIP score.

Deep Learning•139L

Graph Neural Networks

Triggers when users need help with graph neural networks, graph representation learning, or applying deep learning to graph-structured data. Activate for questions about GCN, GAT, GraphSAGE, message passing, over-smoothing, graph pooling, heterogeneous graphs, temporal graphs, knowledge graphs with GNNs, molecular property prediction, social network analysis, recommendation systems on graphs, and GNN scalability.

Deep Learning•148L

Multi Modal Learning

Triggers when users need help with multimodal deep learning, vision-language models, or cross-modal representation learning. Activate for questions about CLIP, LLaVA, Flamingo, image captioning, visual question answering, text-to-image alignment, contrastive learning across modalities, audio-visual learning, multimodal fusion strategies (early, late, cross-attention), and multimodal benchmarks.

Deep Learning•167L

Neural Architecture Search

Triggers when users need help with neural architecture search, automated model design, or model compression. Activate for questions about NAS methods (reinforcement learning, evolutionary, differentiable/DARTS), search spaces, one-shot NAS, hardware-aware NAS, AutoML pipelines, efficient architecture design principles, scaling strategies (width, depth, resolution), and model compression (pruning, quantization, distillation).

Deep Learning•182L