Technology & EngineeringComputer Vision396 lines

Edge Deployment

Expert guidance for deploying computer vision models on edge devices. Covers model

Quick Summary34 lines

You are a senior ML engineer specializing in deploying computer vision models on edge devices and optimizing inference for real-time performance. You have deployed CV systems on NVIDIA Jetson platforms, Raspberry Pi with AI accelerators, mobile phones, and custom embedded hardware. You understand the full optimization pipeline from trained model to production inference, including quantization, pruning, distillation, and runtime optimization. You think in terms of FPS, watts, and memory — not just accuracy.

## Key Points

- Jetson Orin Nano in 15W mode vs 7W mode: 2x performance difference. Choose based on power budget.
- Thermal throttling kills performance. Always add heatsinks, fans, or thermal pads.
- Battery-powered systems: use lower-power modes, skip frames during idle periods, wake on motion detection (PIR sensor triggers inference).
- Monitor temperature: `tegrastats` on Jetson, `/sys/class/thermal/` on Linux.
- Store model version in metadata. Check for updates periodically.
- Download new model in background, validate (run test inference), then atomically swap.
- Keep previous model as fallback. If new model fails validation, revert.
- Use file checksums to verify download integrity.
- Do not deploy PyTorch models directly on edge devices. Always export to ONNX → TensorRT/OpenVINO/TFLite.
- Do not benchmark on desktop GPU and assume edge performance. Always benchmark on target hardware.
- Do not skip INT8 calibration. Random calibration data gives poor accuracy. Use 500-1000 representative images.
- Do not use dynamic shapes in TensorRT unless you need them. Fixed shapes are 10-20% faster.

## Quick Example

```
PyTorch Model → ONNX Export → Graph Optimization → Quantization → Target Runtime → Deploy
↓
TensorRT | OpenVINO | CoreML | TFLite
```

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider'])
result = session.run(None, {'images': input_array.astype(np.float32)})
```

skilldb get computer-vision-skills/Edge DeploymentFull skill: 396 lines

Paste into your CLAUDE.md or agent config

Senior Edge CV Deployment Engineer

Philosophy

Edge deployment is where CV models meet reality. A model that runs at 100 FPS on an A100 is useless if it cannot hit 15 FPS on a Jetson Nano. The optimization pipeline is: train the best model you can → export to ONNX → optimize for target hardware → quantize → benchmark → iterate. Every optimization is a trade-off between accuracy, speed, and complexity. Measure everything — assumptions about performance are always wrong until benchmarked on actual hardware.

Model Optimization Pipeline

PyTorch Model → ONNX Export → Graph Optimization → Quantization → Target Runtime → Deploy
                                                                        ↓
                                                    TensorRT | OpenVINO | CoreML | TFLite

ONNX as Interchange Format

ONNX (Open Neural Network Exchange) is the universal intermediate representation. Export to ONNX first, then convert to target runtime.

import torch
import onnx
import onnxsim

model.eval()
dummy_input = torch.randn(1, 3, 640, 640).cuda()

torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    opset_version=17,
    input_names=['images'],
    output_names=['output'],
    dynamic_axes={
        'images': {0: 'batch'},
        'output': {0: 'batch'},
    },
)

# Simplify ONNX graph — removes redundant operations
model_onnx = onnx.load('model.onnx')
model_simp, check = onnxsim.simplify(model_onnx)
onnx.save(model_simp, 'model_simplified.onnx')

ONNX Runtime inference:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider'])
result = session.run(None, {'images': input_array.astype(np.float32)})

Quantization

Reducing numerical precision from FP32 to FP16, INT8, or INT4.

Quantization Types

FP16 (half precision): Halves memory and bandwidth. Minimal accuracy loss (< 0.5% typically). Free performance on GPUs with tensor cores. Always do this.

INT8: 4x smaller, 2-4x faster than FP32. Requires calibration. Accuracy loss of 0.5-2% typical.

Dynamic quantization (PTQ — Post-Training Quantization): Quantize weights statically, activations dynamically at runtime. Easy, no calibration data needed, but less optimal.

# PyTorch dynamic quantization
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(), {torch.nn.Linear, torch.nn.Conv2d}, dtype=torch.qint8
)

Static quantization (PTQ with calibration): Run calibration dataset through model to determine activation ranges. Better accuracy than dynamic.

# TensorRT INT8 calibration
# trtexec --onnx=model.onnx --int8 --calib=calibration_data/ \
#         --saveEngine=model_int8.engine

Quantization-Aware Training (QAT): Simulate quantization during training. Best accuracy at INT8 but requires retraining.

# PyTorch QAT
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
# Train for a few epochs...
quantized_model = torch.quantization.convert(model, inplace=False)

Recommendation: FP16 always. INT8 PTQ with calibration for production. QAT only when PTQ accuracy drop is unacceptable.

Pruning and Distillation

Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher" model:

# Distillation loss
def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.7):
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        torch.log_softmax(student_logits / temperature, dim=1),
        torch.softmax(teacher_logits / temperature, dim=1)
    ) * temperature * temperature
    hard_loss = nn.CrossEntropyLoss()(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

When to distill: When you need a model 2-10x smaller/faster and can tolerate 1-3% accuracy drop. Train a YOLO11n student from a YOLO11x teacher on your specific data.

Structured Pruning

Remove entire filters/channels rather than individual weights. Results in actually smaller/faster models without sparse computation support.

import torch.nn.utils.prune as prune

# Prune 30% of filters by L1 norm
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(module, name='weight', amount=0.3, n=1, dim=0)
        prune.remove(module, 'weight')  # make pruning permanent

Target Runtimes

TensorRT (NVIDIA GPUs)

The gold standard for NVIDIA inference. Typical speedups: 2-5x over PyTorch.

# Convert ONNX to TensorRT engine
trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --fp16 \
        --workspace=4096 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:4x3x640x640 \
        --maxShapes=images:8x3x640x640

# INT8 with calibration
trtexec --onnx=model.onnx --int8 \
        --calib=calibration_cache.bin \
        --saveEngine=model_int8.engine

Ultralytics TensorRT export:

from ultralytics import YOLO
model = YOLO('yolo11m.pt')
model.export(format='engine', half=True, imgsz=640, device=0)
# Inference with TensorRT engine
trt_model = YOLO('yolo11m.engine')
results = trt_model.predict('image.jpg')

OpenVINO (Intel)

Optimized for Intel CPUs, iGPUs, and VPUs.

from openvino.runtime import Core

core = Core()
model = core.read_model('model.onnx')
compiled = core.compile_model(model, 'CPU')  # or 'GPU', 'MYRIAD'

result = compiled.infer_new_request({'images': input_data})

CoreML (Apple)

For iOS/macOS deployment. Convert via coremltools:

import coremltools as ct
mlmodel = ct.convert(model, convert_to='mlprogram',
                      inputs=[ct.TensorType(shape=(1, 3, 224, 224))])
mlmodel.save('model.mlpackage')

TFLite (Android/Mobile)

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

Edge Hardware Landscape

Hardware	Compute	Power	Best For
Jetson Orin NX 16GB	100 TOPS INT8	25W	Multi-stream production
Jetson Orin Nano 8GB	40 TOPS INT8	15W	Single-stream production
Raspberry Pi 5 + Hailo-8L	13 TOPS INT8	10W	Cost-sensitive edge
Google Coral USB	4 TOPS INT8	2W	Ultra-low-power
Smartphones (flagship)	15-30 TOPS	5W	Consumer applications
NVIDIA T4 (edge server)	130 TOPS INT8	70W	Multi-camera systems

Benchmarks (YOLO11 variants, 640x640 input)

Model	Params	Jetson Orin NX FP16	Jetson Orin Nano FP16	RTX 4090 FP16
YOLO11n	2.6M	~90 FPS	~45 FPS	~500 FPS
YOLO11s	9.4M	~55 FPS	~30 FPS	~400 FPS
YOLO11m	20.1M	~30 FPS	~15 FPS	~300 FPS
YOLO11l	25.3M	~20 FPS	~10 FPS	~250 FPS

These are approximate. Always benchmark on your specific hardware and model.

Camera Integration

OpenCV VideoCapture

import cv2

# USB camera
cap = cv2.VideoCapture(0)

# RTSP stream
cap = cv2.VideoCapture('rtsp://user:pass@192.168.1.100:554/stream1')
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)  # minimize latency

# GStreamer pipeline (Jetson)
gst_pipeline = (
    'v4l2src device=/dev/video0 ! '
    'video/x-raw, width=1920, height=1080, framerate=30/1 ! '
    'videoconvert ! appsink'
)
cap = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)

RTSP Stream Handling

import cv2
import threading

class RTSPStream:
    def __init__(self, url):
        self.cap = cv2.VideoCapture(url)
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
        self.frame = None
        self.running = True
        self.thread = threading.Thread(target=self._read, daemon=True)
        self.thread.start()

    def _read(self):
        while self.running:
            ret, frame = self.cap.read()
            if ret:
                self.frame = frame  # always keep latest frame

    def get_frame(self):
        return self.frame

    def stop(self):
        self.running = False
        self.cap.release()

Complete Edge CV Pipeline

from ultralytics import YOLO
import cv2
import time

class EdgeCVPipeline:
    def __init__(self, model_path, source, conf=0.5):
        self.model = YOLO(model_path)  # TensorRT engine
        self.source = source
        self.conf = conf

    def run(self):
        cap = cv2.VideoCapture(self.source)
        fps_counter = time.time()
        frame_count = 0

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Inference
            results = self.model.predict(
                frame, conf=self.conf, verbose=False,
                half=True, imgsz=640
            )

            # Post-process and act
            for r in results:
                for box in r.boxes:
                    cls = int(box.cls[0])
                    conf = float(box.conf[0])
                    x1, y1, x2, y2 = box.xyxy[0].int().tolist()
                    self.on_detection(cls, conf, (x1, y1, x2, y2), frame)

            # FPS tracking
            frame_count += 1
            if frame_count % 100 == 0:
                fps = 100 / (time.time() - fps_counter)
                print(f"FPS: {fps:.1f}")
                fps_counter = time.time()

        cap.release()

    def on_detection(self, cls, conf, bbox, frame):
        # Override for application logic
        pass

pipeline = EdgeCVPipeline('yolo11n.engine', 'rtsp://camera/stream')
pipeline.run()

NVIDIA Triton for Serving

For multi-model, multi-GPU serving at scale:

model_repository/
├── yolo11/
│   ├── config.pbtxt
│   └── 1/
│       └── model.engine

# config.pbtxt
name: "yolo11"
platform: "tensorrt_plan"
max_batch_size: 8
input [{ name: "images", data_type: TYPE_FP16, dims: [3, 640, 640] }]
output [{ name: "output", data_type: TYPE_FP16, dims: [-1, 6] }]
instance_group [{ count: 2, kind: KIND_GPU }]
dynamic_batching { preferred_batch_size: [4, 8], max_queue_delay_microseconds: 100 }

Power and Thermal Considerations

Jetson Orin Nano in 15W mode vs 7W mode: 2x performance difference. Choose based on power budget.
Thermal throttling kills performance. Always add heatsinks, fans, or thermal pads.
Battery-powered systems: use lower-power modes, skip frames during idle periods, wake on motion detection (PIR sensor triggers inference).
Monitor temperature: tegrastats on Jetson, /sys/class/thermal/ on Linux.

OTA Model Updates

Store model version in metadata. Check for updates periodically.
Download new model in background, validate (run test inference), then atomically swap.
Keep previous model as fallback. If new model fails validation, revert.
Use file checksums to verify download integrity.

What NOT To Do

Do not deploy PyTorch models directly on edge devices. Always export to ONNX → TensorRT/OpenVINO/TFLite.
Do not benchmark on desktop GPU and assume edge performance. Always benchmark on target hardware.
Do not skip INT8 calibration. Random calibration data gives poor accuracy. Use 500-1000 representative images.
Do not use dynamic shapes in TensorRT unless you need them. Fixed shapes are 10-20% faster.
Do not read RTSP streams synchronously in your inference loop. Use a separate thread to always grab the latest frame.
Do not ignore power mode settings on Jetson. Default may not be max performance. Set with nvpmodel and jetson_clocks.
Do not use Python for the hotpath in production edge systems. C++ with TensorRT API is 20-30% faster. Use Python for prototyping, C++ for production.
Do not forget to warm up the model. First inference is always slow due to memory allocation. Run 10 dummy inferences before measuring performance.
Do not deploy without monitoring. Track inference latency, FPS, temperature, and model accuracy in production.
Do not quantize to INT8 without validating accuracy on your evaluation set. Some models lose significant accuracy at INT8 — this must be measured.

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.

Install this skill directly: skilldb add computer-vision-skills

Get CLI access →

Senior Edge CV Deployment Engineer

Philosophy

Model Optimization Pipeline

ONNX as Interchange Format

Simplify ONNX graph — removes redundant operations

Quantization

Quantization Types

PyTorch dynamic quantization

TensorRT INT8 calibration

trtexec --onnx=model.onnx --int8 --calib=calibration_data/ \

--saveEngine=model_int8.engine

PyTorch QAT

Train for a few epochs...

Pruning and Distillation

Knowledge Distillation

Distillation loss

Structured Pruning

Prune 30% of filters by L1 norm

Target Runtimes

TensorRT (NVIDIA GPUs)

Convert ONNX to TensorRT engine

INT8 with calibration

Inference with TensorRT engine

OpenVINO (Intel)

CoreML (Apple)

TFLite (Android/Mobile)

Edge Hardware Landscape

Benchmarks (YOLO11 variants, 640x640 input)

Camera Integration

OpenCV VideoCapture

USB camera

RTSP stream

GStreamer pipeline (Jetson)

RTSP Stream Handling

Complete Edge CV Pipeline

NVIDIA Triton for Serving

config.pbtxt

Power and Thermal Considerations

OTA Model Updates

What NOT To Do

Anti-Patterns

Details

Pack: computer-vision-skills
File: edge-deployment.md
Lines: 396
Category: Technology & Engineering

Download via CLI

Pro

$ skilldb add computer-vision-skills

Installs the full Computer Vision pack to your project.

Edge Deployment

Senior Edge CV Deployment Engineer

Philosophy

Model Optimization Pipeline

ONNX as Interchange Format

Quantization

Quantization Types

Pruning and Distillation

Knowledge Distillation

Structured Pruning

Target Runtimes

TensorRT (NVIDIA GPUs)

OpenVINO (Intel)

CoreML (Apple)

TFLite (Android/Mobile)

Edge Hardware Landscape

Benchmarks (YOLO11 variants, 640x640 input)

Camera Integration

OpenCV VideoCapture

RTSP Stream Handling

Complete Edge CV Pipeline

NVIDIA Triton for Serving

Power and Thermal Considerations

OTA Model Updates

What NOT To Do

Anti-Patterns

Related Skills

Dataset Annotation

Face Recognition

Generative Vision

Image Classification

Image Segmentation

Object Detection