
Neural Network Architecture

Guides the design of neural network architectures for various tasks.


Neural Network Architecture Design

Core Philosophy

Overview

Designing a neural network architecture involves selecting the right combination of layers, connections, and hyperparameters to learn the target function from data. Architecture choices determine the model's capacity, inductive biases, training dynamics, and inference cost. Poor architecture decisions cannot be compensated by longer training or more data.

Use this skill when building a new deep learning model from scratch, when adapting a pretrained architecture to a new domain, or when debugging training failures that may stem from architectural issues.

Core Framework

Architecture Selection by Data Type

Data Type        Primary Architecture          Alternatives
---------        --------------------          ------------
Tabular          MLP, TabNet                   FT-Transformer
Images           CNN (ResNet, EfficientNet)    Vision Transformer (ViT)
Text/Sequences   Transformer                   LSTM (legacy), Mamba (SSM)
Audio            1D CNN + Transformer          Wav2Vec, Whisper
Graphs           GNN (GCN, GAT)                GraphSAGE
Point clouds     PointNet, DGCNN               Sparse 3D CNN
Multimodal       Cross-attention fusion        Early/late fusion
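
The selection table can be sketched as a small lookup helper. Note that `suggest_architecture` and the `BASELINES` dict are hypothetical names invented for this illustration, not part of any library:

```python
# Map a data modality to a proven baseline plus alternatives,
# mirroring the architecture-selection table above.
BASELINES = {
    "tabular": ("MLP", ["TabNet", "FT-Transformer"]),
    "images": ("ResNet", ["EfficientNet", "Vision Transformer (ViT)"]),
    "text": ("Transformer", ["LSTM (legacy)", "Mamba (SSM)"]),
    "audio": ("1D CNN + Transformer", ["Wav2Vec", "Whisper"]),
    "graphs": ("GCN", ["GAT", "GraphSAGE"]),
}

def suggest_architecture(modality: str):
    """Return (primary baseline, alternatives) for a modality."""
    return BASELINES[modality.lower()]

primary, alternatives = suggest_architecture("images")
print(primary)  # ResNet
```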

Key Building Blocks

  • Residual connections: Enable training of deep networks (50+ layers) by providing gradient shortcuts.
  • Normalization: BatchNorm (CNNs), LayerNorm (Transformers), GroupNorm (small batches).
  • Attention mechanisms: Self-attention for global context, cross-attention for multi-input fusion.
  • Pooling: Global average pooling replaces fully connected layers to reduce parameters.
  • Dropout / DropPath: Regularization by randomly zeroing activations during training.
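
As a minimal sketch of how these blocks compose, here is a pre-norm residual block in NumPy: normalization, then a two-layer MLP with ReLU, added back onto the identity path. Affine normalization parameters and dropout are omitted (dropout is the identity at inference), and the weight shapes are arbitrary illustrative choices:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (LayerNorm, no affine).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, w1, w2):
    # Pre-norm residual block: x + W2 @ ReLU(W1 @ LN(x)).
    # The identity path is the "gradient shortcut" that makes depth trainable.
    h = layer_norm(x)
    h = np.maximum(h @ w1, 0.0)  # ReLU
    return x + h @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16)) * 0.1
w2 = rng.standard_normal((16, 8)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (4, 8)
```

With zero weights the block reduces to the identity, which is exactly why stacking many such blocks starts training near a well-behaved function.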

Process

  1. Identify the data modality and map to the appropriate architecture family.
  2. Start with a proven baseline architecture (e.g., ResNet-50 for images, BERT-base for text).
  3. Determine input/output shapes and any special requirements (variable length, multi-task heads).
  4. Set initial depth and width based on dataset size: smaller data needs shallower, narrower networks.
  5. Choose activation functions: ReLU for CNNs, GELU for Transformers, SiLU/Swish for modern networks.
  6. Add regularization proportional to overfitting risk: dropout (0.1-0.5), weight decay (1e-4 to 1e-2).
  7. Design the training recipe: optimizer (AdamW), learning rate schedule (cosine with warmup), batch size.
  8. Train a small-scale version first to validate the architecture before scaling up.
  9. Profile memory and compute; optimize bottlenecks (e.g., reduce attention complexity for long sequences).
  10. Document the architecture with a diagram and parameter count breakdown.
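
Steps 3, 4, and 10 can be sanity-checked with a quick parameter count before any training run. `mlp_param_count` below is a hypothetical helper for a plain fully connected stack (weights plus biases per layer):

```python
def mlp_param_count(layer_sizes):
    # Each Linear layer contributes d_in * d_out weights + d_out biases.
    return sum(d_in * d_out + d_out
               for d_in, d_out in zip(layer_sizes, layer_sizes[1:]))

# A 784 -> 256 -> 128 -> 10 classifier (MNIST-sized, for illustration).
print(mlp_param_count([784, 256, 128, 10]))  # 235146
```

Comparing this number against the dataset size is a fast first check for step 4: a model with far more parameters than training examples is a strong overfitting candidate.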

Key Principles

  • Start with established architectures and modify incrementally; inventing from scratch rarely outperforms.
  • Depth increases representational power but makes training harder; use residual connections beyond 10 layers.
  • Width (hidden dimension) should scale with data complexity; doubling width quadruples compute.
  • The learning rate is the most important hyperparameter; tune it before anything else.
  • Batch normalization and layer normalization serve different purposes; do not interchange blindly.
  • Pretrained weights almost always outperform random initialization when available.
  • Parameter count alone does not determine model quality; architecture inductive biases matter more.
  • Monitor gradient norms during training to detect vanishing or exploding gradients early.
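
The last principle can be illustrated with a toy NumPy example: fitting a linear model with squared loss while tracking the gradient L2 norm. With a modest learning rate the norm shrinks over steps; with an oversized one it explodes. The learning rates here are arbitrary choices for this toy problem, not recommendations:

```python
import numpy as np

def grad_norms(lr, steps=20):
    # Track the gradient norm while fitting y = X @ w with mean squared error.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((64, 16))
    w_true = rng.standard_normal(16)
    y = X @ w_true
    w = np.zeros(16)
    norms = []
    for _ in range(steps):
        g = 2.0 / len(X) * X.T @ (X @ w - y)  # gradient of the MSE loss
        norms.append(float(np.linalg.norm(g)))
        w -= lr * g
    return norms

stable = grad_norms(lr=0.05)
exploding = grad_norms(lr=1.5)
print(stable[-1] < stable[0], exploding[-1] > exploding[0])  # True True
```

In a real training loop the same idea applies: log the global gradient norm each step, and treat a sustained upward trend (or a collapse toward zero) as an architectural or learning-rate problem to investigate early.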

Common Pitfalls

  • Making the network too large for the dataset, leading to severe overfitting.
  • Using batch normalization with very small batch sizes (use GroupNorm or LayerNorm instead).
  • Stacking layers without residual connections and wondering why training loss plateaus.
  • Placing dropout immediately before batch normalization, which shifts activation variance between training and inference.
  • Ignoring inference cost until deployment; a model that cannot meet latency requirements is useless.
  • Choosing an architecture based on paper benchmarks without considering your specific data distribution.
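
The small-batch BatchNorm pitfall is easy to demonstrate: with batch size 1, the batch mean equals the sample itself, so a bare-bones BatchNorm (no affine parameters) maps every input to zero and erases the signal entirely:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch axis (BatchNorm, no affine).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, -2.0, 3.0]])  # batch size 1
print(batch_norm(x))  # all zeros: batch statistics erase the sample
```

LayerNorm or GroupNorm avoid this because they normalize within each sample rather than across the batch.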

Output Format

When proposing a neural network architecture:

  1. Task Description: Input modality, output type, and performance requirements.
  2. Architecture Diagram: Layer-by-layer description with shapes and connections.
  3. Parameter Count: Total and per-component breakdown.
  4. Training Recipe: Optimizer, LR schedule, batch size, epochs, regularization.
  5. Baseline Comparison: Expected performance relative to simpler approaches.
  6. Compute Estimate: GPU hours for training, inference latency per sample.
  7. Scaling Plan: How to increase capacity if more data becomes available.
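
For the compute estimate, a common back-of-envelope heuristic for Transformer training is roughly 6 FLOPs per parameter per token. The function below is a rough sketch, not a definitive formula; the throughput and utilization figures (an A100-class GPU at ~312 TFLOP/s peak bf16, 40% model FLOPs utilization) are assumptions to replace with your own hardware numbers:

```python
def training_gpu_hours(params, tokens, flops_per_gpu=3.12e14, mfu=0.4):
    # ~6 FLOPs per parameter per token (forward + backward, Transformer heuristic),
    # divided by sustained throughput = peak FLOP/s * model FLOPs utilization.
    total_flops = 6 * params * tokens
    return total_flops / (flops_per_gpu * mfu) / 3600

# Example: a 125M-parameter model trained on 300B tokens.
hours = training_gpu_hours(125e6, 300e9)
print(round(hours))  # ~500 single-GPU hours under these assumptions
```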

Anti-Patterns

Over-engineering for hypothetical requirements. Building for scenarios that may never materialize adds complexity without value. Solve the problem in front of you first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide wastes time and introduces risk.

Premature abstraction. Creating elaborate frameworks before having enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at system boundaries. Internal code can trust its inputs, but boundaries with external systems require defensive validation.

Skipping documentation. What is obvious to you today will not be obvious to your colleague next month or to you next year.
