Neural Network Architecture Design

Overview

Designing a neural network architecture involves selecting the right combination of layers, connections, and hyperparameters to learn the target function from data. Architecture choices determine the model's capacity, inductive biases, training dynamics, and inference cost. Poor architecture decisions cannot be compensated for by longer training or more data.

Use this skill when building a new deep learning model from scratch, when adapting a pretrained architecture to a new domain, or when debugging training failures that may stem from architectural issues.

Core Framework

Architecture Selection by Data Type

Data Type        Primary Architecture          Alternatives
Tabular          MLP, TabNet                   FT-Transformer
Images           CNN (ResNet, EfficientNet)    Vision Transformer (ViT)
Text/Sequences   Transformer                   LSTM (legacy), Mamba (SSM)
Audio            1D CNN + Transformer          Wav2Vec, Whisper
Graphs           GNN (GCN, GAT)                GraphSAGE
Point clouds     PointNet, DGCNN               Sparse 3D CNN
Multimodal       Cross-attention fusion        Early/late fusion

Key Building Blocks

  • Residual connections: Enable training of deep networks (50+ layers) by providing gradient shortcuts.
  • Normalization: BatchNorm (CNNs), LayerNorm (Transformers), GroupNorm (small batches).
  • Attention mechanisms: Self-attention for global context, cross-attention for multi-input fusion.
  • Pooling: Global average pooling replaces fully connected layers to reduce parameters.
  • Dropout / DropPath: Regularization by randomly zeroing individual activations (Dropout) or entire residual branches (DropPath) during training.
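To make the residual-connection idea concrete, here is a minimal pure-Python sketch (no framework assumed; `residual_block` and the toy weight layout are illustrative, not a real API):

```python
def linear(x, W, b):
    # y[j] = sum_i x[i] * W[i][j] + b[j]
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def relu(v):
    return [max(0.0, a) for a in v]

def residual_block(x, W1, b1, W2, b2):
    # out = x + F(x): the identity shortcut gives gradients a direct
    # path around F, which is what makes 50+ layer stacks trainable
    h = relu(linear(x, W1, b1))
    f = linear(h, W2, b2)
    return [xi + fi for xi, fi in zip(x, f)]
```

Note that when F is initialized near zero, the block starts out as the identity, so adding depth does not hurt optimization early in training.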

Process

  1. Identify the data modality and map to the appropriate architecture family.
  2. Start with a proven baseline architecture (e.g., ResNet-50 for images, BERT-base for text).
  3. Determine input/output shapes and any special requirements (variable length, multi-task heads).
  4. Set initial depth and width based on dataset size: smaller data needs shallower, narrower networks.
  5. Choose activation functions: ReLU for CNNs, GELU for Transformers, SiLU/Swish for modern networks.
  6. Add regularization proportional to overfitting risk: dropout (0.1-0.5), weight decay (1e-4 to 1e-2).
  7. Design the training recipe: optimizer (AdamW), learning rate schedule (cosine with warmup), batch size.
  8. Train a small-scale version first to validate the architecture before scaling up.
  9. Profile memory and compute; optimize bottlenecks (e.g., reduce attention complexity for long sequences).
  10. Document the architecture with a diagram and parameter count breakdown.
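Step 7's learning rate schedule can be written down directly. The sketch below assumes linear warmup followed by cosine decay; the function name and defaults are illustrative, not a specific library's API:

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # progress runs 0 -> 1 over the decay phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Plotting `lr_at` over `range(total_steps)` is a quick sanity check before launching a run: the curve should rise linearly, peak at `base_lr`, and decay smoothly to `min_lr`.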

Key Principles

  • Start with established architectures and modify incrementally; inventing from scratch rarely outperforms.
  • Depth increases representational power but makes training harder; use residual connections beyond 10 layers.
  • Width (hidden dimension) should scale with data complexity; doubling width quadruples compute.
  • The learning rate is the most important hyperparameter; tune it before anything else.
  • Batch normalization and layer normalization serve different purposes; do not interchange blindly.
  • Pretrained weights almost always outperform random initialization when available.
  • Parameter count alone does not determine model quality; architecture inductive biases matter more.
  • Monitor gradient norms during training to detect vanishing or exploding gradients early.
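The width/compute claim above is easy to verify with a back-of-envelope FLOP count. This is a rough model that counts only the dense matmuls of an MLP (function name is illustrative):

```python
def mlp_flops(width, depth, batch=1):
    # each hidden layer is a width x width matmul:
    # ~2 * batch * width^2 multiply-adds per layer
    return 2 * batch * depth * width * width

# doubling the hidden dimension quadruples compute
assert mlp_flops(1024, depth=8) == 4 * mlp_flops(512, depth=8)
```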

Common Pitfalls

  • Making the network too large for the dataset, leading to severe overfitting.
  • Using batch normalization with very small batch sizes (use GroupNorm or LayerNorm instead).
  • Stacking layers without residual connections and wondering why training loss plateaus.
  • Placing dropout immediately before batch normalization, which shifts the activation statistics BN sees between training and inference.
  • Ignoring inference cost until deployment; a model that cannot meet latency requirements is useless.
  • Choosing an architecture based on paper benchmarks without considering your specific data distribution.

Output Format

When proposing a neural network architecture:

  1. Task Description: Input modality, output type, and performance requirements.
  2. Architecture Diagram: Layer-by-layer description with shapes and connections.
  3. Parameter Count: Total and per-component breakdown.
  4. Training Recipe: Optimizer, LR schedule, batch size, epochs, regularization.
  5. Baseline Comparison: Expected performance relative to simpler approaches.
  6. Compute Estimate: GPU hours for training, inference latency per sample.
  7. Scaling Plan: How to increase capacity if more data becomes available.
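For item 3, the per-component parameter count is simple arithmetic. A minimal sketch with hypothetical helper names:

```python
def linear_params(d_in, d_out, bias=True):
    # weight matrix plus optional bias vector
    return d_in * d_out + (d_out if bias else 0)

def conv2d_params(c_in, c_out, k, bias=True):
    # one k x k filter per (input channel, output channel) pair
    return c_in * c_out * k * k + (c_out if bias else 0)

# e.g. a 7x7 stem conv from 3 -> 64 channels (bias folded into BN):
# conv2d_params(3, 64, 7, bias=False) -> 9408 weights
```

Summing these per layer and reporting the totals per component (backbone, head, embeddings) makes it obvious where the capacity lives.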