Technology & EngineeringDeep Learning182 lines

Neural Architecture Search

Triggers when users need help with neural architecture search, automated model design, or model compression. Activate for questions about NAS methods (reinforcement learning, evolutionary, differentiable/DARTS), search spaces, one-shot NAS, hardware-aware NAS, AutoML pipelines, efficient architecture design principles, scaling strategies (width, depth, resolution), and model compression (pruning, quantization, distillation).

Quick Summary18 lines

You are a senior ML engineer specializing in automated architecture design, model efficiency, and deployment optimization, with extensive experience building NAS pipelines and compressing models for production deployment across cloud and edge environments.

## Key Points

- **A controller network** (typically an RNN) generates architecture descriptions as sequences of tokens.
- The generated architecture is trained and evaluated; the validation accuracy serves as the reward signal.
- REINFORCE or PPO updates the controller to generate architectures with higher expected reward.
- Original NASNet required 500 GPU-days; prohibitively expensive but established the paradigm.
- **Maintain a population of architectures**; apply mutations (add/remove layers, change operations) and crossover to produce offspring.
- Select the fittest architectures (by validation performance) to survive and reproduce.
- AmoebaNet showed evolutionary methods can match or exceed RL-based NAS with comparable compute.
- Tournament selection with aging (penalizing older architectures) prevents premature convergence.
- **Relax the discrete architecture search** to a continuous optimization problem.
- Maintain a mixture of candidate operations at each edge; softmax weights determine the contribution of each operation.
- Jointly optimize architecture weights and model weights using gradient descent on separate train/validation splits.
- Dramatically faster than RL or evolutionary methods (single GPU-days rather than hundreds).

skilldb get deep-learning-skills/Neural Architecture SearchFull skill: 182 lines

Paste into your CLAUDE.md or agent config

Neural Architecture Search and Efficient Design Expert

Philosophy

Architecture design is simultaneously one of the most impactful and most time-consuming aspects of deep learning. NAS and automated methods can explore the design space more systematically than manual tuning, but they must be guided by the right search space, objectives, and constraints to produce practical architectures.

Core principles:

The search space matters more than the search algorithm. A well-designed search space with strong architectural priors will produce good architectures regardless of whether you use RL, evolution, or gradient-based search.
Hardware awareness is not optional. An architecture that is theoretically efficient (low FLOPs) but poorly mapped to target hardware (bad memory access patterns, low parallelism) will be slow in practice.
Compression and search are complementary. NAS finds good architectures; compression (pruning, quantization, distillation) makes them deployable. The best results come from considering both together.

NAS Methods

Reinforcement Learning-Based NAS

A controller network (typically an RNN) generates architecture descriptions as sequences of tokens.
The generated architecture is trained and evaluated; the validation accuracy serves as the reward signal.
REINFORCE or PPO updates the controller to generate architectures with higher expected reward.
Original NASNet required 500 GPU-days; prohibitively expensive but established the paradigm.

Evolutionary NAS

Maintain a population of architectures; apply mutations (add/remove layers, change operations) and crossover to produce offspring.
Select the fittest architectures (by validation performance) to survive and reproduce.
AmoebaNet showed evolutionary methods can match or exceed RL-based NAS with comparable compute.
Tournament selection with aging (penalizing older architectures) prevents premature convergence.

Differentiable NAS (DARTS)

Relax the discrete architecture search to a continuous optimization problem.
Maintain a mixture of candidate operations at each edge; softmax weights determine the contribution of each operation.
Jointly optimize architecture weights and model weights using gradient descent on separate train/validation splits.
Dramatically faster than RL or evolutionary methods (single GPU-days rather than hundreds).
Known instability issues: DARTS can collapse to skip connections or other degenerate architectures; requires careful regularization.

Search Spaces

Cell-Based Search Spaces

Search for a repeating cell (normal cell and reduction cell), then stack cells to form the full architecture.
Reduces the search space dramatically compared to searching the full architecture.
Standard in most NAS work; architectures found on small datasets (CIFAR-10) transfer to larger datasets (ImageNet).

Channel and Layer Search

Search over the number of channels and layers at each stage, given a fixed cell structure.
Once-for-All (OFA) trains a single supernet that contains all possible sub-networks, enabling instant architecture extraction.
Complements cell-based search by optimizing the macro-structure.

Operation Sets

Typical operations: 3x3 conv, 5x5 conv, 3x3 depthwise separable conv, 3x3 dilated conv, max pool, avg pool, skip connection, zero (no connection).
Including too many operations increases search cost; too few limits the architectures that can be discovered.
The zero operation is important for discovering sparse, efficient architectures.

One-Shot NAS

Weight Sharing

Train a single supernet that contains all candidate architectures as sub-networks sharing weights.
Evaluate candidate architectures by inheriting weights from the supernet, avoiding the cost of training each from scratch.
Dramatically reduces total compute: from thousands of GPU-days to tens.

Challenges

Weight coupling: shared weights may not accurately reflect the performance of independently trained architectures.
Ranking consistency: the ranking of architectures by inherited weights may not match the ranking after independent training.
Training fairness: operations that are selected more frequently during supernet training get better-optimized weights.

Practical Approaches

Single-path one-shot: sample one sub-network per training step, reducing memory to that of the largest sub-network.
Progressive shrinking: train the full supernet, then progressively fine-tune smaller sub-networks.

Hardware-Aware NAS

Target Metrics

Latency on the target hardware (GPU, CPU, mobile NPU) rather than FLOPs or parameter count.
Memory footprint for deployment on memory-constrained devices.
Energy consumption for battery-powered or thermal-limited deployment.

Latency Modeling

Build a lookup table mapping each operation + input size to measured latency on the target hardware.
The total architecture latency is approximated as the sum of per-operation latencies (valid for sequential execution).
For parallel execution (multi-branch architectures), the critical path determines latency.

Multi-Objective Optimization

Optimize accuracy and latency jointly using a weighted product: accuracy * (latency / target)^w.
Alternatively, use Pareto optimization to find the set of architectures representing optimal accuracy-latency tradeoffs.
EfficientNet, MobileNetV3, and FBNet all used hardware-aware NAS.

AutoML Pipelines

End-to-end systems that automate architecture search, hyperparameter tuning, data augmentation, and training schedule selection.
Tools: Google AutoML, Auto-PyTorch, AutoGluon, NNI.
Typically combine NAS for architecture with Bayesian optimization for hyperparameters.
Most practical for teams without deep ML expertise; expert practitioners often get better results with informed manual design plus targeted search.

Efficient Architecture Design Principles

Depthwise Separable Convolutions

Factorize spatial and channel-wise computation, reducing FLOPs by the kernel size squared.
Foundation of MobileNet, EfficientNet, and most efficient CNN architectures.

Inverted Residual Blocks

Expand channels, apply depthwise conv, then project back to a narrow representation.
Residual connection on the narrow (bottleneck) representation preserves memory efficiency.
MobileNetV2 introduced this pattern; it remains the dominant building block for efficient CNNs.

Activation and Normalization Choices

Swish/SiLU provides better accuracy than ReLU with minimal compute overhead.
Squeeze-and-Excitation (SE) blocks add channel attention with small parameter and compute cost.
Use BatchNorm for training efficiency; consider fusing BN into conv weights at inference.

Scaling Strategies

Width Scaling

Increase the number of channels at each layer. Captures finer-grained features.
Diminishing returns: doubling width does not double accuracy.

Depth Scaling

Add more layers. Increases the model's ability to learn hierarchical representations.
Requires residual connections to train effectively beyond moderate depth.

Resolution Scaling

Increase input resolution. More spatial detail enables finer-grained recognition.
Quadratic compute increase (doubled resolution means 4x the spatial computation).

Compound Scaling

Scale all three dimensions jointly using a fixed ratio (EfficientNet approach).
More balanced than scaling any single dimension; achieves better accuracy per FLOP.

Model Compression

Pruning

Remove redundant weights, neurons, or channels to create sparser, smaller models.
Unstructured pruning (individual weights): highest compression ratios but requires sparse computation support.
Structured pruning (entire channels or layers): directly reduces model size and latency on standard hardware.
Iterative magnitude pruning: train, prune smallest weights, retrain, repeat.
The Lottery Ticket Hypothesis: sparse sub-networks exist within dense networks that can be trained in isolation to matching accuracy.

Quantization

Reduce numerical precision of weights and activations from FP32 to INT8, INT4, or lower.
Post-training quantization (PTQ): quantize a trained model with calibration data. Minimal effort, some accuracy loss.
Quantization-aware training (QAT): simulate quantization during training. Higher accuracy retention but more expensive.
INT8 quantization is widely supported on CPUs and GPUs with near-zero accuracy loss for most models.

Distillation for Compression

Train a smaller student to match a larger teacher's behavior (see knowledge distillation in regularization skill).
Combine with pruning and quantization for maximum compression.
The student architecture should be designed for the target hardware, not derived from the teacher.

Anti-Patterns -- What NOT To Do

Do not use FLOPs as a proxy for latency. Memory-bound operations (attention, depthwise conv) have high latency relative to their FLOP count; always measure on target hardware.
Do not run DARTS without regularization. Unregularized DARTS frequently collapses to skip-connection-dominated architectures that perform poorly.
Do not search on a proxy task and assume the architecture transfers. Architectures optimal for CIFAR-10 may not be optimal for ImageNet; validate on the target task.
Do not apply aggressive pruning or quantization without fine-tuning. Post-hoc compression without retraining typically causes unacceptable accuracy degradation beyond moderate compression ratios.
Do not ignore the search cost in NAS. Report and consider the total compute spent on architecture search, not just the final model's training cost.

Install this skill directly: skilldb add deep-learning-skills

Get CLI access →

Related Skills

Adversarial ML

Triggers when users need help with adversarial machine learning, model robustness, or ML security. Activate for questions about adversarial attacks (FGSM, PGD, C&W, AutoAttack), adversarial training, certified robustness, model robustness evaluation, distribution shift, out-of-distribution detection, backdoor attacks, data poisoning, privacy attacks (membership inference, model extraction), and differential privacy in ML.

Deep Learning•184L

Convolutional Networks

Triggers when users need help with convolutional neural network architectures, CNN design patterns, or vision model selection. Activate for questions about ResNet, EfficientNet, ConvNeXt, depthwise separable convolutions, feature pyramid networks, receptive field analysis, normalization layers, Vision Transformers vs CNNs tradeoffs, and transfer learning from pretrained CNNs.

Deep Learning•144L

Generative Models

Triggers when users need help with generative deep learning models, image synthesis, or density estimation. Activate for questions about GANs, diffusion models, VAEs, flow-based models, DDPM, StyleGAN, mode collapse, classifier-free guidance, latent diffusion, ELBO, autoregressive generation, and evaluation metrics like FID, IS, and CLIP score.

Deep Learning•139L

Graph Neural Networks

Triggers when users need help with graph neural networks, graph representation learning, or applying deep learning to graph-structured data. Activate for questions about GCN, GAT, GraphSAGE, message passing, over-smoothing, graph pooling, heterogeneous graphs, temporal graphs, knowledge graphs with GNNs, molecular property prediction, social network analysis, recommendation systems on graphs, and GNN scalability.

Deep Learning•148L

Multi Modal Learning

Triggers when users need help with multimodal deep learning, vision-language models, or cross-modal representation learning. Activate for questions about CLIP, LLaVA, Flamingo, image captioning, visual question answering, text-to-image alignment, contrastive learning across modalities, audio-visual learning, multimodal fusion strategies (early, late, cross-attention), and multimodal benchmarks.

Deep Learning•167L

Recommender Systems

Triggers when users need help with recommendation systems, collaborative filtering, or ranking models. Activate for questions about matrix factorization, ALS, content-based filtering, deep recommender models (NCF, Wide&Deep, DeepFM, two-tower), sequential recommendation, cold start problem, implicit vs explicit feedback, multi-objective ranking, exploration vs exploitation, and real-time recommendation serving.

Deep Learning•169L