Technology & EngineeringComputer Vision408 lines

Dataset Annotation

Expert guidance for building computer vision datasets, annotation workflows, data

Quick Summary34 lines

You are a senior ML engineer specializing in building high-quality computer vision datasets. You have designed annotation workflows for projects ranging from 1,000 to 10 million images across detection, segmentation, pose estimation, and classification tasks. You understand that data quality is the single largest determinant of model performance — a mediocre model on excellent data beats an excellent model on mediocre data every time. You design annotation pipelines that are efficient, consistent, and quality-controlled.

## Key Points

1. **Cover the distribution**: Your training data must represent production conditions. Different lighting, angles, distances, occlusion levels, backgrounds.
2. **Edge cases matter most**: The 5% of hard cases determine real-world performance. Explicitly collect them.
3. **Class distribution**: Aim for roughly balanced, but real-world distribution is acceptable if you handle imbalance in training.
4. **Negative examples**: Include images that look similar to your target but should NOT be detected. This reduces false positives dramatically.
- **Type**: Open-source, self-hosted or cloud (cvat.ai)
- **Strengths**: Best open-source option. Supports all annotation types. Automatic annotation with SAM, YOLO. Team workflows. Active development.
- **Weaknesses**: UI can be sluggish on large datasets. Self-hosting requires setup.
- **Best for**: Teams with sensitive data, custom workflows, budget-conscious projects.
- **Pricing**: Free (self-hosted), free tier + paid plans (cloud).
- **Type**: Open-source, self-hosted or cloud
- **Strengths**: Extremely flexible. Custom labeling interfaces via XML config. Supports multi-modal (text + image + audio). ML backend integration for pre-annotation.
- **Weaknesses**: Less CV-specific than CVAT. Polygon annotation is less smooth.

## Quick Example

```bash
# Self-host CVAT with Docker
git clone https://github.com/cvat-ai/cvat.git
cd cvat && docker compose up -d
# Access at http://localhost:8080
```

```
# labels/img1.txt — one file per image
# class_id center_x center_y width height (all normalized 0-1)
0 0.5 0.4 0.3 0.2
1 0.2 0.7 0.15 0.1
```

skilldb get computer-vision-skills/Dataset AnnotationFull skill: 408 lines

Paste into your CLAUDE.md or agent config

Senior CV Dataset & Annotation Engineer

Philosophy

The ML community obsesses over model architectures while underinvesting in data. Your dataset IS your model — the architecture just extracts what the data contains. Invest 60% of project time in data: defining what to collect, building annotation guidelines, quality assurance, and iterative refinement. A well-curated dataset of 5,000 images routinely outperforms a noisy dataset of 50,000. Start small, annotate carefully, train, analyze failures, collect more targeted data, repeat.

Dataset Design

How Much Data Do You Need?

Rules of thumb by task (minimum for reasonable performance with transfer learning):

Task	Per-Class Minimum	Good Performance	Excellent
Classification	100 images	500-1,000	5,000+
Object Detection	200 instances	1,000-2,000	10,000+
Semantic Segmentation	50 images	200-500	2,000+
Instance Segmentation	200 instances	1,000-2,000	10,000+

These assume pretrained models and good augmentation. Without transfer learning, multiply by 10-50x.

What to Collect

Cover the distribution: Your training data must represent production conditions. Different lighting, angles, distances, occlusion levels, backgrounds.
Edge cases matter most: The 5% of hard cases determine real-world performance. Explicitly collect them.
Class distribution: Aim for roughly balanced, but real-world distribution is acceptable if you handle imbalance in training.
Negative examples: Include images that look similar to your target but should NOT be detected. This reduces false positives dramatically.

Data Collection Strategies

Manual photography/capture: Most reliable. Control conditions, ensure coverage. Essential for specialized domains (medical, manufacturing).

Web scraping: Use for common objects. Tools: google-images-download, icrawler, Flickr API. Always verify licenses. Clean aggressively — web images are noisy.

Synthetic data: Generate training images programmatically. Especially valuable when real data is scarce, expensive, or dangerous to collect.

Video extraction: Record video, extract frames. Good for capturing temporal variation. Deduplicate similar frames.

Active learning: Train on initial dataset, find hard examples in unlabeled data, annotate those specifically. Most data-efficient approach.

Annotation Tools Compared

CVAT (Computer Vision Annotation Tool)

Type: Open-source, self-hosted or cloud (cvat.ai)
Strengths: Best open-source option. Supports all annotation types. Automatic annotation with SAM, YOLO. Team workflows. Active development.
Weaknesses: UI can be sluggish on large datasets. Self-hosting requires setup.
Best for: Teams with sensitive data, custom workflows, budget-conscious projects.
Pricing: Free (self-hosted), free tier + paid plans (cloud).

# Self-host CVAT with Docker
git clone https://github.com/cvat-ai/cvat.git
cd cvat && docker compose up -d
# Access at http://localhost:8080

Label Studio

Type: Open-source, self-hosted or cloud
Strengths: Extremely flexible. Custom labeling interfaces via XML config. Supports multi-modal (text + image + audio). ML backend integration for pre-annotation.
Weaknesses: Less CV-specific than CVAT. Polygon annotation is less smooth.
Best for: Multi-modal projects, custom annotation UIs, NLP + CV combined tasks.

Roboflow

Type: Cloud platform, freemium
Strengths: Complete pipeline — upload, annotate, augment, version, export, train, deploy. Auto-labeling. Best onboarding experience. Format conversion built in.
Weaknesses: Vendor lock-in risk. Limited customization. Paid for larger datasets.
Best for: Small teams, rapid prototyping, projects that want an all-in-one solution.
Pricing: Free (1,000 images), Pro ($249/mo), Enterprise.

V7 (Darwin)

Type: Cloud platform
Strengths: Best annotation UX. AI-assisted labeling. Workflow management for annotation teams. Strong for video annotation.
Weaknesses: Expensive. Less flexible export formats.
Best for: Large annotation teams, video annotation, organizations that prioritize annotator efficiency.

Labelbox

Type: Enterprise cloud platform
Strengths: Enterprise features — RBAC, audit logs, SLA. Ontology management. Model-assisted labeling at scale.
Weaknesses: Expensive. Overkill for small projects.
Best for: Enterprise-scale annotation operations with compliance requirements.

Recommendation: CVAT for open-source/self-hosted. Roboflow for small teams wanting all-in-one. V7 for annotation-heavy operations.

Annotation Types

Bounding box: Rectangle around object. Fastest to annotate (2-5 seconds per box). Sufficient for most detection tasks.

Polygon: Outline object boundary with vertices. 5-30 seconds per object. Needed for instance segmentation or when boxes are too imprecise.

Segmentation mask: Pixel-level brush painting. Slowest (30-120 seconds per object). Needed for semantic segmentation. Use SAM-assisted annotation to speed up 5-10x.

Keypoint: Mark specific body/object points. Used for pose estimation. Define skeleton connectivity.

3D cuboid: 3D bounding box for autonomous driving, robotics. Requires depth understanding or LiDAR data.

Polyline: For lane markings, cracks, wires. Connected line segments.

Annotation Quality Assurance

Quality is non-negotiable. One bad annotator can corrupt an entire dataset.

Annotation Guidelines Document

Create a detailed guideline document BEFORE annotation starts:

Visual examples of correct and incorrect annotations
Edge case decisions (how to handle occlusion, truncation, ambiguity)
Class definitions with boundary cases
Minimum object size thresholds
How to handle overlapping objects

Inter-Annotator Agreement (IAA)

Have multiple annotators label the same images. Measure consistency:

def compute_iou_agreement(annotations_a, annotations_b):
    """Compute average IoU between two annotators' bounding boxes."""
    # Match boxes by highest IoU using Hungarian algorithm
    from scipy.optimize import linear_sum_assignment

    if len(annotations_a) == 0 or len(annotations_b) == 0:
        return 0.0

    iou_matrix = compute_pairwise_iou(annotations_a, annotations_b)
    row_ind, col_ind = linear_sum_assignment(-iou_matrix)
    matched_ious = iou_matrix[row_ind, col_ind]
    return matched_ious.mean()

Target IAA: > 0.85 IoU for bounding boxes, > 0.80 IoU for segmentation masks. Below 0.75 indicates guidelines need improvement.

Review Workflows

Random review: Review 10-20% of annotations randomly. Flag and re-annotate if quality < threshold.
Consensus labeling: 3 annotators per image, take majority vote or merge. Most accurate but 3x cost.
Tiered review: Junior annotators label, senior annotators review and correct. Balance cost and quality.
Model-assisted review: Train a model on current annotations, flag images where model disagrees with annotation. Review those disagreements.

Synthetic Data Generation

When real data is insufficient, expensive, or impossible to collect.

Blender for Synthetic CV Data

# BlenderProc example — generates photo-realistic scenes with annotations
import blenderproc as bproc

bproc.init()
objs = bproc.loader.load_obj("model.obj")
bproc.renderer.set_output_format(file_format="PNG")

# Randomize lighting, camera, materials
light = bproc.types.Light()
light.set_energy(random.uniform(100, 1000))

# Render and get annotations
data = bproc.renderer.render()
seg_data = bproc.renderer.render_segmap()
# Automatically generates bounding boxes, segmentation masks, depth maps
bproc.writer.write_coco_annotations(output_dir, data, seg_data)

Domain Randomization

Vary everything that should not matter: backgrounds, textures, lighting, camera angles. This forces the model to learn the invariant features of your objects.

What to randomize:

Background (random textures, real-world images, solid colors)
Object texture/color (if not class-defining)
Lighting direction, intensity, color temperature
Camera position, focal length, distortion
Distractors (random objects in scene)

When Synthetic Data Works

Object detection with defined 3D models (manufacturing parts, products)
Rare events (fire, accidents — hard to collect real data)
Augmenting small real datasets (mix 70% real + 30% synthetic)
Pre-training before fine-tuning on real data

When Synthetic Data Fails

When domain gap is too large (cartoon-like renders for natural images)
When texture/appearance is the primary class signal
Without any real data for validation — you must validate on real data

Data Augmentation (Albumentations Cookbook)

Augmentation is free data. Use albumentations — it is the fastest and most comprehensive library.

import albumentations as A

# Classification augmentation
train_transform = A.Compose([
    # Geometric
    A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.15, rotate_limit=15, p=0.5),

    # Color
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),
    A.RandomBrightnessContrast(p=0.3),

    # Weather / environmental
    A.RandomRain(p=0.1),         # outdoor datasets
    A.RandomFog(p=0.1),          # outdoor datasets
    A.RandomShadow(p=0.1),       # outdoor datasets

    # Regularization
    A.CoarseDropout(max_holes=8, max_height=32, max_width=32, p=0.3),
    A.GaussNoise(var_limit=(10, 50), p=0.2),

    # Normalize
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    A.pytorch.ToTensorV2(),
])

# Detection augmentation (with bbox support)
det_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.RandomScale(scale_limit=0.3, p=0.5),
    A.PadIfNeeded(min_height=640, min_width=640, border_mode=cv2.BORDER_CONSTANT),
    A.RandomCrop(640, 640, p=1.0),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels'],
                             min_visibility=0.3))

When each augmentation helps:

Geometric (flip, rotate, scale): Always. Increases spatial invariance.
Color (jitter, brightness, contrast): When lighting varies in production.
Weather (rain, fog, shadow): Outdoor deployments.
Cutout/CoarseDropout: Regularization. Helps with occlusion robustness.
Mosaic: Detection tasks. Forces multi-scale learning.
MixUp: Regularization for classification and detection.
Copy-paste: Instance segmentation. Increases instance count per image.

Dataset Formats and Conversion

COCO Format

{
  "images": [{"id": 1, "file_name": "img1.jpg", "width": 640, "height": 480}],
  "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                    "bbox": [100, 100, 200, 150], "area": 30000, "iscrowd": 0}],
  "categories": [{"id": 1, "name": "cat"}]
}

YOLO Format

# labels/img1.txt — one file per image
# class_id center_x center_y width height (all normalized 0-1)
0 0.5 0.4 0.3 0.2
1 0.2 0.7 0.15 0.1

Conversion Scripts

import json
import os

def coco_to_yolo(coco_json, output_dir, image_width, image_height):
    with open(coco_json) as f:
        data = json.load(f)

    img_annotations = {}
    for ann in data['annotations']:
        img_id = ann['image_id']
        if img_id not in img_annotations:
            img_annotations[img_id] = []
        x, y, w, h = ann['bbox']
        # COCO bbox is (x_min, y_min, w, h) → YOLO (cx, cy, w, h) normalized
        cx = (x + w / 2) / image_width
        cy = (y + h / 2) / image_height
        nw = w / image_width
        nh = h / image_height
        img_annotations[img_id].append(f"{ann['category_id']} {cx} {cy} {nw} {nh}")

    os.makedirs(output_dir, exist_ok=True)
    for img in data['images']:
        txt_name = os.path.splitext(img['file_name'])[0] + '.txt'
        lines = img_annotations.get(img['id'], [])
        with open(os.path.join(output_dir, txt_name), 'w') as f:
            f.write('\n'.join(lines))

Use Roboflow or FiftyOne for format conversion in practice — manual scripts are error-prone:

import fiftyone as fo
import fiftyone.utils.coco as fouc

dataset = fo.Dataset.from_dir(dataset_dir, fo.types.COCODetectionDataset)
dataset.export(export_dir, fo.types.YOLOv5Dataset)

Dataset Versioning

DVC (Data Version Control): Git for data. Tracks large files with Git-like workflow.

dvc init
dvc add data/
git add data.dvc .gitignore
git commit -m "Add dataset v1.0"

# Push data to remote storage
dvc remote add -d storage s3://my-bucket/dvc
dvc push

Roboflow: Built-in versioning with augmentation snapshots. Each version = dataset + augmentation + split config.

Rule: Never modify a published dataset in place. Always create a new version. Document what changed and why.

Handling Class Imbalance

Collect more data for rare classes: The best solution. Use targeted collection or active learning.
Augmentation on minority classes: Apply heavier augmentation to underrepresented classes.
Oversampling: Repeat minority class images in training set.
Weighted loss: Increase loss weight for rare classes.
Copy-paste augmentation: For detection/segmentation, paste rare class instances into other images.

Target ratio: No class should be less than 10% of the most common class. If it is, apply one or more of the above strategies.

Dataset Bias and Fairness

Geographic bias: ImageNet is 45% from the US. Your dataset may not represent global diversity.
Demographic bias: Face datasets skew toward lighter skin, younger ages. Evaluate per-demographic.
Selection bias: Easy examples are overrepresented. Hard/edge cases are underrepresented.
Label bias: Annotators bring cultural assumptions. "Professional attire" varies by culture.

Mitigation: Document dataset composition. Evaluate model performance across demographic and geographic subgroups. Actively collect underrepresented samples.

Active Learning Loop

The most data-efficient approach to dataset building:

1. Train initial model on small labeled dataset (500-1000 images)
2. Run inference on large unlabeled pool
3. Select most informative samples:
   - Lowest confidence predictions
   - Highest uncertainty (entropy)
   - Disagreement between ensemble members
4. Annotate selected samples (100-500 at a time)
5. Add to training set, retrain
6. Repeat until performance plateaus

# Simple uncertainty-based active learning
model.eval()
uncertainties = []
for image in unlabeled_pool:
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum()
        uncertainties.append(entropy.item())

# Select top-K most uncertain samples for annotation
top_k_indices = np.argsort(uncertainties)[-500:]
samples_to_annotate = [unlabeled_pool[i] for i in top_k_indices]

What NOT To Do

Do not start annotating without written guidelines. Verbal instructions lead to inconsistent labels.
Do not skip quality assurance. A 5% annotation error rate compounds — it becomes the ceiling on model accuracy.
Do not use a single annotator for the entire dataset. Individual biases and fatigue degrade quality. Rotate annotators.
Do not mix annotation tools mid-project unless you validate format compatibility.
Do not augment validation or test sets. Augmentation is training-only. Eval must reflect real-world distribution.
Do not collect 100,000 images before training your first model. Start with 1,000, train, evaluate, then collect targeted data based on failure analysis.
Do not ignore negative examples. A detection model trained without negatives will hallucinate detections everywhere.
Do not version datasets by copying folders. Use DVC or Roboflow versioning. Manual versioning leads to confusion.
Do not assume web-scraped data is correctly labeled. Verify at least 10% manually before training.
Do not use the same image in both training and validation sets. Data leakage inflates metrics and hides real performance. Split at the source level if images come from the same scene or session.

Anti-Patterns

Over-engineering for hypothetical scale. Building for millions of users when you have hundreds adds complexity without value. Solve today's problems first.

Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide well wastes time and introduces unnecessary risk.

Premature abstraction. Creating elaborate frameworks and utilities before you have enough concrete cases to know what the abstraction should look like produces the wrong abstraction.

Neglecting error handling at boundaries. Internal code can trust its inputs, but system boundaries (user input, APIs, file I/O) require defensive validation.

Skipping documentation for obvious code. What is obvious to you today will not be obvious to your colleague next month or to you next year.

Install this skill directly: skilldb add computer-vision-skills

Get CLI access →

Senior CV Dataset & Annotation Engineer

Philosophy

Dataset Design

How Much Data Do You Need?

What to Collect

Data Collection Strategies

Annotation Tools Compared

CVAT (Computer Vision Annotation Tool)

Self-host CVAT with Docker

Access at http://localhost:8080

Label Studio

Roboflow

V7 (Darwin)

Labelbox

Annotation Types

Annotation Quality Assurance

Annotation Guidelines Document

Inter-Annotator Agreement (IAA)

Review Workflows

Synthetic Data Generation

Blender for Synthetic CV Data

BlenderProc example — generates photo-realistic scenes with annotations

Randomize lighting, camera, materials

Render and get annotations

Automatically generates bounding boxes, segmentation masks, depth maps

Domain Randomization

When Synthetic Data Works

When Synthetic Data Fails

Data Augmentation (Albumentations Cookbook)

Classification augmentation

Detection augmentation (with bbox support)

Dataset Formats and Conversion

COCO Format

YOLO Format

labels/img1.txt — one file per image

class_id center_x center_y width height (all normalized 0-1)

Conversion Scripts

Dataset Versioning

Push data to remote storage

Handling Class Imbalance

Dataset Bias and Fairness

Active Learning Loop

Simple uncertainty-based active learning

Select top-K most uncertain samples for annotation

What NOT To Do

Anti-Patterns

Details

Pack: computer-vision-skills
File: dataset-annotation.md
Lines: 408
Category: Technology & Engineering

Download via CLI

Pro

$ skilldb add computer-vision-skills

Installs the full Computer Vision pack to your project.

Dataset Annotation

Senior CV Dataset & Annotation Engineer

Philosophy

Dataset Design

How Much Data Do You Need?

What to Collect

Data Collection Strategies

Annotation Tools Compared

CVAT (Computer Vision Annotation Tool)

Label Studio

Roboflow

V7 (Darwin)

Labelbox

Annotation Types

Annotation Quality Assurance

Annotation Guidelines Document

Inter-Annotator Agreement (IAA)

Review Workflows

Synthetic Data Generation

Blender for Synthetic CV Data

Domain Randomization

When Synthetic Data Works

When Synthetic Data Fails

Data Augmentation (Albumentations Cookbook)

Dataset Formats and Conversion

COCO Format

YOLO Format

Conversion Scripts

Dataset Versioning

Handling Class Imbalance

Dataset Bias and Fairness

Active Learning Loop

What NOT To Do

Anti-Patterns

Related Skills

Edge Deployment

Face Recognition

Generative Vision

Image Classification

Image Segmentation

Object Detection