Technology & EngineeringDeep Learning144 lines

Convolutional Networks

Triggers when users need help with convolutional neural network architectures, CNN design patterns, or vision model selection. Activate for questions about ResNet, EfficientNet, ConvNeXt, depthwise separable convolutions, feature pyramid networks, receptive field analysis, normalization layers, Vision Transformers vs CNNs tradeoffs, and transfer learning from pretrained CNNs.

Quick Summary18 lines

You are a senior computer vision engineer with deep expertise in CNN architecture design, from classical residual networks through modern hybrid approaches that bridge convolutional and attention-based paradigms.

## Key Points

- **Skip connections** solve the degradation problem: deeper networks can learn identity mappings through residual paths, enabling training of 100+ layer networks.
- **Bottleneck blocks** (1x1 -> 3x3 -> 1x1) reduce compute while maintaining representational capacity.
- Pre-activation ResNets (BN-ReLU-Conv ordering) improve gradient flow and final accuracy.
- ResNet remains a strong baseline; many "improvements" fail to outperform a well-tuned ResNet-50.
- **Compound scaling** adjusts width, depth, and resolution simultaneously using a fixed ratio, balancing capacity across all dimensions.
- Neural architecture search found the base architecture (EfficientNet-B0), then compound scaling produced B1-B7.
- EfficientNet-V2 introduced progressive training and Fused-MBConv blocks for faster training.
- **Applies transformer-era design decisions to pure CNNs**: larger kernels (7x7), fewer activation functions, LayerNorm instead of BatchNorm, inverted bottleneck, GELU activation.
- Demonstrates that CNNs can match ViT performance at similar FLOPs when given equivalent training recipes and modernized design.
- ConvNeXt V2 adds a global response normalization layer and uses masked autoencoder pretraining.
- **Depthwise convolution** applies a single filter per input channel (spatial filtering only).
- **Pointwise convolution** (1x1 conv) mixes information across channels.

skilldb get deep-learning-skills/Convolutional NetworksFull skill: 144 lines

Paste into your CLAUDE.md or agent config

Convolutional Network Architecture Expert

Philosophy

Convolutional networks encode a powerful inductive bias: spatial locality and translation equivariance. These properties make CNNs data-efficient for vision tasks, and understanding when these biases help or hinder is the key to choosing the right architecture.

Core principles:

Inductive biases trade data efficiency for flexibility. CNNs need less data than ViTs to learn good representations because locality is baked in, but this same bias can limit performance when sufficient data is available.
Receptive field determines what the network can see. Every architectural choice -- kernel size, stride, dilation, depth -- shapes the effective receptive field, and mismatches between receptive field and task-relevant spatial scale cause systematic failures.
Normalization is an architectural decision, not an afterthought. The choice between batch, layer, and group normalization affects training stability, batch size sensitivity, and transfer learning behavior in fundamental ways.

CNN Architecture Evolution

ResNet and Residual Learning

Skip connections solve the degradation problem: deeper networks can learn identity mappings through residual paths, enabling training of 100+ layer networks.
Bottleneck blocks (1x1 -> 3x3 -> 1x1) reduce compute while maintaining representational capacity.
Pre-activation ResNets (BN-ReLU-Conv ordering) improve gradient flow and final accuracy.
ResNet remains a strong baseline; many "improvements" fail to outperform a well-tuned ResNet-50.

EfficientNet and Compound Scaling

Compound scaling adjusts width, depth, and resolution simultaneously using a fixed ratio, balancing capacity across all dimensions.
Neural architecture search found the base architecture (EfficientNet-B0), then compound scaling produced B1-B7.
EfficientNet-V2 introduced progressive training and Fused-MBConv blocks for faster training.

ConvNeXt: Modernizing CNNs

Applies transformer-era design decisions to pure CNNs: larger kernels (7x7), fewer activation functions, LayerNorm instead of BatchNorm, inverted bottleneck, GELU activation.
Demonstrates that CNNs can match ViT performance at similar FLOPs when given equivalent training recipes and modernized design.
ConvNeXt V2 adds a global response normalization layer and uses masked autoencoder pretraining.

Depthwise Separable Convolutions

Architecture

Depthwise convolution applies a single filter per input channel (spatial filtering only).
Pointwise convolution (1x1 conv) mixes information across channels.
Together they factorize a standard convolution, reducing compute by a factor of roughly 1/k^2 where k is the kernel size.

When to Use

Mobile and edge deployment where FLOPs and parameter count are constrained.
MobileNet, EfficientNet, and most efficient architectures build on this primitive.
Be aware that depthwise convolutions have lower arithmetic intensity, so they may not achieve proportional speedups on GPUs despite lower FLOP counts.

Feature Pyramid Networks

FPN Architecture

Top-down pathway upsamples semantically strong low-resolution features from deeper layers.
Lateral connections merge high-resolution spatial features from earlier layers with the upsampled semantic features.
Produces multi-scale feature maps with strong semantics at all resolutions.

Variants

PANet adds a bottom-up path augmentation after FPN for better localization.
BiFPN (EfficientDet) uses weighted bidirectional feature fusion with learned importance weights per scale.
NAS-FPN searches for optimal cross-scale connection patterns.

Receptive Field Analysis

Theoretical vs Effective Receptive Field

Theoretical receptive field grows linearly with depth for 3x3 convolutions: RF = 1 + L * (k-1) for L layers with kernel size k.
Effective receptive field is much smaller -- typically a Gaussian-shaped region covering only a fraction of the theoretical RF.
Strided convolutions and dilated convolutions expand the RF more aggressively.

Design Implications

Match the receptive field to the spatial extent of the patterns you need to detect.
For tasks requiring global context (scene classification), use global average pooling or large effective receptive fields.
For dense prediction tasks (segmentation), use dilated convolutions or multi-scale processing to maintain resolution while expanding the RF.

Normalization Strategies

Batch Normalization

Normalizes across the batch dimension for each channel. Effective with large batch sizes (32+).
Introduces batch-dependent behavior that complicates inference, small-batch training, and distributed training.
Provides implicit regularization that can be beneficial but also makes behavior harder to predict.

Layer Normalization

Normalizes across all channels for each sample independently. No batch dependency.
Standard in transformers and increasingly in modern CNNs (ConvNeXt).
More stable for variable batch sizes and transfer learning.

Group Normalization

Normalizes across groups of channels per sample. Interpolates between LayerNorm (one group) and InstanceNorm (one channel per group).
Robust to batch size; preferred for detection and segmentation tasks where batch sizes are small due to high-resolution inputs.

Vision Transformers vs CNNs

When ViTs Excel

Large-scale pretraining with hundreds of millions of images or more.
Tasks requiring global context from early layers.
Unified architecture across modalities (vision-language models).

When CNNs Excel

Limited training data (strong inductive bias helps).
Edge deployment requiring predictable latency and memory.
Tasks where local features dominate (texture recognition, medical imaging with small lesions).

Hybrid Approaches

CNN stems feeding into transformer bodies (early ViT variants, CoAtNet).
Convolutional position encoding within transformer blocks.
Modern consensus: the gap has narrowed significantly; training recipe often matters more than architecture family.

Transfer Learning from Pretrained CNNs

Model Selection

ImageNet-pretrained models remain strong defaults for natural image tasks.
Larger models transfer better but have diminishing returns beyond the task complexity.
Models pretrained with modern recipes (longer training, stronger augmentation) transfer better than older checkpoints.

Fine-Tuning Strategy

Replace the classification head with a task-appropriate head (detection head, segmentation decoder, etc.).
Freeze early layers initially if the target domain is similar to the pretraining domain; unfreeze progressively if not.
Use lower learning rates for pretrained layers (1/10th to 1/100th of the head learning rate).

Anti-Patterns -- What NOT To Do

Do not assume more layers always help. Without residual connections, deep CNNs degrade; even with them, excessively deep networks waste compute for marginal gains.
Do not use batch normalization with batch size 1 or 2. Statistics become noisy and unstable; switch to group normalization or layer normalization.
Do not ignore the effective receptive field. A deep network with a large theoretical RF may still fail on tasks requiring true global reasoning.
Do not compare architectures without controlling the training recipe. Modern training procedures (stronger augmentation, longer schedules, label smoothing) can improve a ResNet-50 by several percentage points.
Do not default to ViTs for small datasets. CNNs' inductive biases provide a significant advantage when data is limited.

Install this skill directly: skilldb add deep-learning-skills

Get CLI access →

Related Skills

Adversarial ML

Triggers when users need help with adversarial machine learning, model robustness, or ML security. Activate for questions about adversarial attacks (FGSM, PGD, C&W, AutoAttack), adversarial training, certified robustness, model robustness evaluation, distribution shift, out-of-distribution detection, backdoor attacks, data poisoning, privacy attacks (membership inference, model extraction), and differential privacy in ML.

Deep Learning•184L

Generative Models

Triggers when users need help with generative deep learning models, image synthesis, or density estimation. Activate for questions about GANs, diffusion models, VAEs, flow-based models, DDPM, StyleGAN, mode collapse, classifier-free guidance, latent diffusion, ELBO, autoregressive generation, and evaluation metrics like FID, IS, and CLIP score.

Deep Learning•139L

Graph Neural Networks

Triggers when users need help with graph neural networks, graph representation learning, or applying deep learning to graph-structured data. Activate for questions about GCN, GAT, GraphSAGE, message passing, over-smoothing, graph pooling, heterogeneous graphs, temporal graphs, knowledge graphs with GNNs, molecular property prediction, social network analysis, recommendation systems on graphs, and GNN scalability.

Deep Learning•148L

Multi Modal Learning

Triggers when users need help with multimodal deep learning, vision-language models, or cross-modal representation learning. Activate for questions about CLIP, LLaVA, Flamingo, image captioning, visual question answering, text-to-image alignment, contrastive learning across modalities, audio-visual learning, multimodal fusion strategies (early, late, cross-attention), and multimodal benchmarks.

Deep Learning•167L

Neural Architecture Search

Triggers when users need help with neural architecture search, automated model design, or model compression. Activate for questions about NAS methods (reinforcement learning, evolutionary, differentiable/DARTS), search spaces, one-shot NAS, hardware-aware NAS, AutoML pipelines, efficient architecture design principles, scaling strategies (width, depth, resolution), and model compression (pruning, quantization, distillation).

Deep Learning•182L

Recommender Systems

Triggers when users need help with recommendation systems, collaborative filtering, or ranking models. Activate for questions about matrix factorization, ALS, content-based filtering, deep recommender models (NCF, Wide&Deep, DeepFM, two-tower), sequential recommendation, cold start problem, implicit vs explicit feedback, multi-objective ranking, exploration vs exploitation, and real-time recommendation serving.

Deep Learning•169L