Neural Network Architecture Design
Guides the design of neural network architectures for various tasks. Covers layer
Overview
Designing a neural network architecture involves selecting the right combination of layers, connections, and hyperparameters to learn the target function from data. Architecture choices determine the model's capacity, inductive biases, training dynamics, and inference cost. Poor architecture decisions cannot be compensated for by longer training or more data.
Use this skill when building a new deep learning model from scratch, when adapting a pretrained architecture to a new domain, or when debugging training failures that may stem from architectural issues.
Core Framework
Architecture Selection by Data Type
| Data Type | Primary Architecture | Alternatives |
|---|---|---|
| Tabular | MLP, TabNet | FT-Transformer |
| Images | CNN (ResNet, EfficientNet) | Vision Transformer (ViT) |
| Text/Sequences | Transformer | LSTM (legacy), Mamba (SSM) |
| Audio | 1D CNN + Transformer | Wav2Vec, Whisper |
| Graphs | GNN (GCN, GAT) | GraphSAGE |
| Point clouds | PointNet, DGCNN | Sparse 3D CNN |
| Multimodal | Cross-attention fusion | Early/late fusion |
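As a sketch, the table above can be encoded as a small lookup. The strings here are descriptive labels mirroring the table, not library identifiers, and `pick_baseline` is a hypothetical helper, not an API from any framework:

```python
# Modality -> (primary architecture family, alternatives), mirroring the table above.
BASELINES = {
    "tabular": ("MLP", ["TabNet", "FT-Transformer"]),
    "images": ("CNN (ResNet/EfficientNet)", ["ViT"]),
    "text": ("Transformer", ["LSTM", "Mamba"]),
    "audio": ("1D CNN + Transformer", ["Wav2Vec", "Whisper"]),
    "graphs": ("GNN (GCN/GAT)", ["GraphSAGE"]),
    "point_clouds": ("PointNet/DGCNN", ["Sparse 3D CNN"]),
    "multimodal": ("Cross-attention fusion", ["Early/late fusion"]),
}

def pick_baseline(modality: str) -> str:
    """Return the primary architecture family for a modality (KeyError if unknown)."""
    primary, _alternatives = BASELINES[modality.lower()]
    return primary
```

Encoding the decision as data rather than branching logic makes it easy to extend when a new modality or baseline appears.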
Key Building Blocks
- Residual connections: Enable training of deep networks (50+ layers) by providing gradient shortcuts.
- Normalization: BatchNorm (CNNs), LayerNorm (Transformers), GroupNorm (small batches).
- Attention mechanisms: Self-attention for global context, cross-attention for multi-input fusion.
- Pooling: Global average pooling can replace large flatten-plus-fully-connected heads, sharply reducing parameter count.
- Dropout / DropPath: Regularization by randomly zeroing activations during training.
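The gradient-shortcut effect of residual connections can be sketched numerically without any framework: a residual block computes y = x + f(x), so dy/dx = 1 + f'(x), and the identity path keeps the gradient near 1 even when f's own gradient is tiny. A minimal sketch, where `f` is a stand-in for any sub-block:

```python
def residual_block(x: float, f) -> float:
    """y = x + f(x): the identity shortcut bypasses the transformation f."""
    return x + f(x)

def f(x):
    # A "weak" layer: its own gradient (0.001) would vanish in a deep stack.
    return 0.001 * x

# Finite-difference estimate of the block's input gradient, dy/dx = 1 + f'(x).
eps = 1e-6
x = 2.0
grad = (residual_block(x + eps, f) - residual_block(x - eps, f)) / (2 * eps)
# grad is approximately 1.001: the shortcut contributes the dominant 1,
# which is why stacks of such blocks remain trainable at depth.
```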
Process
- Identify the data modality and map to the appropriate architecture family.
- Start with a proven baseline architecture (e.g., ResNet-50 for images, BERT-base for text).
- Determine input/output shapes and any special requirements (variable length, multi-task heads).
- Set initial depth and width based on dataset size: smaller data needs shallower, narrower networks.
- Choose activation functions: ReLU for CNNs, GELU for Transformers, SiLU/Swish for modern networks.
- Add regularization proportional to overfitting risk: dropout (0.1-0.5), weight decay (1e-4 to 1e-2).
- Design the training recipe: optimizer (AdamW), learning rate schedule (cosine with warmup), batch size.
- Train a small-scale version first to validate the architecture before scaling up.
- Profile memory and compute; optimize bottlenecks (e.g., reduce attention complexity for long sequences).
- Document the architecture with a diagram and parameter count breakdown.
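The learning rate schedule mentioned in the training-recipe step (cosine with warmup) is simple enough to write out directly. A minimal sketch, with illustrative defaults (`base_lr=3e-4`, 500 warmup steps are assumptions, not prescriptions):

```python
import math

def lr_at(step: int, max_steps: int, base_lr: float = 3e-4, warmup: int = 500) -> float:
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup:
        # Linear ramp: avoids unstable early updates at full learning rate.
        return base_lr * (step + 1) / warmup
    # Cosine decay over the remaining steps.
    progress = (step - warmup) / max(1, max_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In practice frameworks ship equivalents (e.g. PyTorch's scheduler classes), but having the closed form makes it easy to plot and sanity-check the schedule before training.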
Key Principles
- Start with established architectures and modify incrementally; architectures invented from scratch rarely outperform proven baselines.
- Depth increases representational power but makes training harder; use residual connections beyond 10 layers.
- Width (hidden dimension) should scale with data complexity; doubling width quadruples compute.
- The learning rate is the most important hyperparameter; tune it before anything else.
- Batch normalization and layer normalization serve different purposes; do not interchange blindly.
- Pretrained weights almost always outperform random initialization when available.
- Parameter count alone does not determine model quality; architecture inductive biases matter more.
- Monitor gradient norms during training to detect vanishing or exploding gradients early.
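The claim that doubling width quadruples compute follows from dense-layer arithmetic: a fully connected layer holds d_in x d_out weights plus d_out biases, so doubling both dimensions grows the weight matrix 4x. A quick check:

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights plus bias for one fully connected layer."""
    return d_in * d_out + d_out

base = dense_params(512, 512)       # 512*512 weights + 512 biases
doubled = dense_params(1024, 1024)  # 4x the weights, 2x the biases
# The ratio approaches 4 as width grows, since the weight matrix dominates.
ratio = doubled / base
```

The same quadratic scaling applies to the matmul FLOPs, which is why widening a network is far more expensive than it first appears.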
Common Pitfalls
- Making the network too large for the dataset, leading to severe overfitting.
- Using batch normalization with very small batch sizes (use GroupNorm or LayerNorm instead).
- Stacking layers without residual connections and wondering why training loss plateaus.
- Placing dropout immediately before batch normalization, which shifts the activation statistics BatchNorm estimated during training (variance shift) and degrades inference accuracy.
- Ignoring inference cost until deployment; a model that cannot meet latency requirements is useless.
- Choosing an architecture based on paper benchmarks without considering your specific data distribution.
Output Format
When proposing a neural network architecture:
- Task Description: Input modality, output type, and performance requirements.
- Architecture Diagram: Layer-by-layer description with shapes and connections.
- Parameter Count: Total and per-component breakdown.
- Training Recipe: Optimizer, LR schedule, batch size, epochs, regularization.
- Baseline Comparison: Expected performance relative to simpler approaches.
- Compute Estimate: GPU hours for training, inference latency per sample.
- Scaling Plan: How to increase capacity if more data becomes available.
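The per-component parameter breakdown called for above can be produced mechanically. A minimal sketch for a plain MLP (`mlp_param_breakdown` is a hypothetical helper; the 784 -> 256 -> 128 -> 10 widths are just an illustrative example):

```python
def mlp_param_breakdown(dims):
    """Per-layer and total parameter counts for an MLP given layer widths."""
    breakdown = {}
    for i, (d_in, d_out) in enumerate(zip(dims, dims[1:])):
        # Each dense layer: d_in*d_out weights plus d_out biases.
        breakdown[f"layer_{i} ({d_in}->{d_out})"] = d_in * d_out + d_out
    breakdown["total"] = sum(breakdown.values())
    return breakdown

# Example: a 3-layer MLP, e.g. for 28x28 inputs and 10 classes.
counts = mlp_param_breakdown([784, 256, 128, 10])
```

For real architectures, frameworks can report this directly (e.g. summing parameter tensor sizes per named module), but the breakdown format is the same: one line per component plus a total.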
Related Skills
Computer Vision Pipeline Design
Designing computer vision pipelines for image and video analysis tasks. Covers
Data Preprocessing
Systematic approach to data cleaning, transformation, and feature preparation for
ML Deployment and MLOps
ML model deployment and MLOps practices for production systems. Covers serving
ML Model Evaluation
Comprehensive model evaluation and metrics selection for machine learning. Covers
ML Model Selection
Guides you through choosing the right machine learning model for a given problem.
NLP Pipeline Design
Designing end-to-end natural language processing pipelines from text ingestion to