Neural Network Architecture Design
Guides the design of neural network architectures for various tasks. Covers layer
Overview
Designing a neural network architecture involves selecting the right combination of layers, connections, and hyperparameters to learn the target function from data. Architecture choices determine the model's capacity, inductive biases, training dynamics, and inference cost. Poor architecture decisions cannot be compensated for by longer training or more data.
Use this skill when building a new deep learning model from scratch, when adapting a pretrained architecture to a new domain, or when debugging training failures that may stem from architectural issues.
Core Framework
Architecture Selection by Data Type
| Data Type | Primary Architecture | Alternatives |
|---|---|---|
| Tabular | MLP, TabNet | FT-Transformer |
| Images | CNN (ResNet, EfficientNet) | Vision Transformer (ViT) |
| Text/Sequences | Transformer | LSTM (legacy), Mamba (SSM) |
| Audio | 1D CNN + Transformer | Wav2Vec, Whisper |
| Graphs | GNN (GCN, GAT) | GraphSAGE |
| Point clouds | PointNet, DGCNN | Sparse 3D CNN |
| Multimodal | Cross-attention fusion | Early/late fusion |
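As a sketch, the table above can be encoded as a small lookup. The strings here are descriptive labels mirroring the table, not library identifiers, and `pick_baseline` is a hypothetical helper, not an API from any framework:

```python
# Modality -> (primary architecture family, alternatives), mirroring the table above.
BASELINES = {
    "tabular": ("MLP", ["TabNet", "FT-Transformer"]),
    "images": ("CNN (ResNet/EfficientNet)", ["ViT"]),
    "text": ("Transformer", ["LSTM", "Mamba"]),
    "audio": ("1D CNN + Transformer", ["Wav2Vec", "Whisper"]),
    "graphs": ("GNN (GCN/GAT)", ["GraphSAGE"]),
    "point_clouds": ("PointNet/DGCNN", ["Sparse 3D CNN"]),
    "multimodal": ("Cross-attention fusion", ["Early/late fusion"]),
}

def pick_baseline(modality: str) -> str:
    """Return the primary architecture family for a modality (KeyError if unknown)."""
    primary, _alternatives = BASELINES[modality.lower()]
    return primary
```

Encoding the decision as data rather than branching logic makes it easy to extend when a new modality or baseline appears.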
Key Building Blocks
- Residual connections: Enable training of deep networks (50+ layers) by providing gradient shortcuts.
- Normalization: BatchNorm (CNNs), LayerNorm (Transformers), GroupNorm (small batches).
- Attention mechanisms: Self-attention for global context, cross-attention for multi-input fusion.
- Pooling: Global average pooling can replace large flatten-plus-fully-connected heads, sharply reducing parameter count.
- Dropout / DropPath: Regularization by randomly zeroing activations during training.
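The gradient-shortcut effect of residual connections can be sketched numerically without any framework: a residual block computes y = x + f(x), so dy/dx = 1 + f'(x), and the identity path keeps the gradient near 1 even when f's own gradient is tiny. A minimal sketch, where `f` is a stand-in for any sub-block:

```python
def residual_block(x: float, f) -> float:
    """y = x + f(x): the identity shortcut bypasses the transformation f."""
    return x + f(x)

def f(x):
    # A "weak" layer: its own gradient (0.001) would vanish in a deep stack.
    return 0.001 * x

# Finite-difference estimate of the block's input gradient, dy/dx = 1 + f'(x).
eps = 1e-6
x = 2.0
grad = (residual_block(x + eps, f) - residual_block(x - eps, f)) / (2 * eps)
# grad is approximately 1.001: the shortcut contributes the dominant 1,
# which is why stacks of such blocks remain trainable at depth.
```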
Process
- Identify the data modality and map to the appropriate architecture family.
- Start with a proven baseline architecture (e.g., ResNet-50 for images, BERT-base for text).
- Determine input/output shapes and any special requirements (variable length, multi-task heads).
- Set initial depth and width based on dataset size: smaller data needs shallower, narrower networks.
- Choose activation functions: ReLU for CNNs, GELU for Transformers, SiLU/Swish for modern networks.
- Add regularization proportional to overfitting risk: dropout (0.1-0.5), weight decay (1e-4 to 1e-2).
- Design the training recipe: optimizer (AdamW), learning rate schedule (cosine with warmup), batch size.
- Train a small-scale version first to validate the architecture before scaling up.
- Profile memory and compute; optimize bottlenecks (e.g., reduce attention complexity for long sequences).
- Document the architecture with a diagram and parameter count breakdown.
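The learning rate schedule mentioned in the training-recipe step (cosine with warmup) is simple enough to write out directly. A minimal sketch, with illustrative defaults (`base_lr=3e-4`, 500 warmup steps are assumptions, not prescriptions):

```python
import math

def lr_at(step: int, max_steps: int, base_lr: float = 3e-4, warmup: int = 500) -> float:
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup:
        # Linear ramp: avoids unstable early updates at full learning rate.
        return base_lr * (step + 1) / warmup
    # Cosine decay over the remaining steps.
    progress = (step - warmup) / max(1, max_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In practice frameworks ship equivalents (e.g. PyTorch's scheduler classes), but having the closed form makes it easy to plot and sanity-check the schedule before training.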
Key Principles
- Start with established architectures and modify incrementally; architectures invented from scratch rarely outperform proven baselines.
- Depth increases representational power but makes training harder; use residual connections beyond 10 layers.
- Width (hidden dimension) should scale with data complexity; doubling width quadruples compute.
- The learning rate is the most important hyperparameter; tune it before anything else.
- Batch normalization and layer normalization serve different purposes; do not interchange blindly.
- Pretrained weights almost always outperform random initialization when available.
- Parameter count alone does not determine model quality; architecture inductive biases matter more.
- Monitor gradient norms during training to detect vanishing or exploding gradients early.
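The claim that doubling width quadruples compute follows from dense-layer arithmetic: a fully connected layer holds d_in x d_out weights plus d_out biases, so doubling both dimensions grows the weight matrix 4x. A quick check:

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights plus bias for one fully connected layer."""
    return d_in * d_out + d_out

base = dense_params(512, 512)       # 512*512 weights + 512 biases
doubled = dense_params(1024, 1024)  # 4x the weights, 2x the biases
# The ratio approaches 4 as width grows, since the weight matrix dominates.
ratio = doubled / base
```

The same quadratic scaling applies to the matmul FLOPs, which is why widening a network is far more expensive than it first appears.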
Common Pitfalls
- Making the network too large for the dataset, leading to severe overfitting.
- Using batch normalization with very small batch sizes (use GroupNorm or LayerNorm instead).
- Stacking layers without residual connections and wondering why training loss plateaus.
- Placing dropout immediately before batch normalization, which shifts the activation statistics BatchNorm estimated during training (variance shift) and degrades inference accuracy.
- Ignoring inference cost until deployment; a model that cannot meet latency requirements is useless.
- Choosing an architecture based on paper benchmarks without considering your specific data distribution.
Output Format
When proposing a neural network architecture:
- Task Description: Input modality, output type, and performance requirements.
- Architecture Diagram: Layer-by-layer description with shapes and connections.
- Parameter Count: Total and per-component breakdown.
- Training Recipe: Optimizer, LR schedule, batch size, epochs, regularization.
- Baseline Comparison: Expected performance relative to simpler approaches.
- Compute Estimate: GPU hours for training, inference latency per sample.
- Scaling Plan: How to increase capacity if more data becomes available.
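The per-component parameter breakdown called for above can be produced mechanically. A minimal sketch for a plain MLP (`mlp_param_breakdown` is a hypothetical helper; the 784 -> 256 -> 128 -> 10 widths are just an illustrative example):

```python
def mlp_param_breakdown(dims):
    """Per-layer and total parameter counts for an MLP given layer widths."""
    breakdown = {}
    for i, (d_in, d_out) in enumerate(zip(dims, dims[1:])):
        # Each dense layer: d_in*d_out weights plus d_out biases.
        breakdown[f"layer_{i} ({d_in}->{d_out})"] = d_in * d_out + d_out
    breakdown["total"] = sum(breakdown.values())
    return breakdown

# Example: a 3-layer MLP, e.g. for 28x28 inputs and 10 classes.
counts = mlp_param_breakdown([784, 256, 128, 10])
```

For real architectures, frameworks can report this directly (e.g. summing parameter tensor sizes per named module), but the breakdown format is the same: one line per component plus a total.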
Related Skills
Computer Vision Pipeline Design
Designing computer vision pipelines for image and video analysis tasks. Covers
Data Preprocessing
Systematic approach to data cleaning, transformation, and feature preparation for
ML Deployment and MLOps
ML model deployment and MLOps practices for production systems. Covers serving
ML Model Evaluation
Comprehensive model evaluation and metrics selection for machine learning. Covers
ML Model Selection
Guides you through choosing the right machine learning model for a given problem.
NLP Pipeline Design
Designing end-to-end natural language processing pipelines from text ingestion to