Autonomous AgentsAutonomous Agent81 lines

Batch Processing

Processing large data sets efficiently with chunking, streaming, and partial failure recovery

Quick Summary18 lines

You are an AI agent that processes large data sets efficiently and reliably. You chunk work into manageable pieces, stream data instead of loading it all into memory, track progress for long operations, and handle partial failures without losing completed work.

## Key Points

- Divide large datasets into fixed-size chunks (e.g., 1000 records per batch).
- Process each chunk as an independent unit of work.
- Choose chunk sizes based on memory constraints and transaction limits.
- Allow chunk size to be configurable for tuning across different environments.
- Commit or checkpoint after each chunk, not after the entire job.
- Use keyset pagination (`WHERE id > last_seen_id ORDER BY id LIMIT n`) instead of OFFSET.
- OFFSET-based pagination gets slower as the offset increases.
- Cursor pagination performs consistently regardless of position in the dataset.
- Store the cursor position for restartability.
- Handle records that are inserted or deleted during pagination.
- Use streaming APIs to process data without loading it all into memory.
- Pipe streams together: read, transform, write.

skilldb get autonomous-agent-skills/Batch ProcessingFull skill: 81 lines

Paste into your CLAUDE.md or agent config

Batch Processing

You are an AI agent that processes large data sets efficiently and reliably. You chunk work into manageable pieces, stream data instead of loading it all into memory, track progress for long operations, and handle partial failures without losing completed work.

Philosophy

Large data processing is fundamentally different from request-response programming. A batch job that processes a million records cannot fail on record 999,999 and lose all progress. It cannot load everything into memory at once. It must report progress, handle errors per-item, and be restartable. Batch processing is about throughput, resilience, and resource discipline.

Techniques

Implement Chunking Strategies

Divide large datasets into fixed-size chunks (e.g., 1000 records per batch).
Process each chunk as an independent unit of work.
Choose chunk sizes based on memory constraints and transaction limits.
Allow chunk size to be configurable for tuning across different environments.
Commit or checkpoint after each chunk, not after the entire job.

Use Cursor-Based Pagination for Database Reads

Use keyset pagination (WHERE id > last_seen_id ORDER BY id LIMIT n) instead of OFFSET.
OFFSET-based pagination gets slower as the offset increases.
Cursor pagination performs consistently regardless of position in the dataset.
Store the cursor position for restartability.
Handle records that are inserted or deleted during pagination.

Apply Stream Processing

Use streaming APIs to process data without loading it all into memory.
Pipe streams together: read, transform, write.
Apply backpressure to prevent fast producers from overwhelming slow consumers.
Use Node.js streams, Python generators, Java streams, or Go channels as appropriate.
Process files line by line instead of reading entire files into memory.

Enable Memory-Efficient Iteration

Use iterators and generators instead of collecting results into arrays.
Release references to processed items so they can be garbage collected.
Monitor memory usage during processing and adjust batch sizes if needed.
Avoid building large intermediate data structures.

Track Progress for Long Operations

Log progress at regular intervals: "Processed 50,000 of 1,000,000 records (5%)."
Estimate remaining time based on current throughput.
Store progress in a persistent location for visibility across process restarts.
Provide a way to query job status from outside the process.

Handle Errors in Batch Jobs

Catch errors per-item, not per-batch. One bad record should not stop the entire job.
Collect failed items into an error queue or dead letter file for inspection.
Log error details with enough context to diagnose: record ID, error message, stack trace.
Set thresholds: if more than 5% of records fail, stop the job and alert.

Implement Partial Failure Recovery

Design jobs to be restartable from the last successful checkpoint.
Store the last processed ID or cursor position persistently.
On restart, skip already-processed records.
Ensure processing is idempotent so reprocessing a few records is harmless.

Best Practices

Always test batch jobs with realistic data volumes, not just 10 records.
Set timeouts on individual item processing to prevent a single item from blocking the job.
Use database transactions around chunks, not around the entire job.
Log start time, end time, total processed, and error count for every job run.
Run batch jobs during off-peak hours when possible.
Implement graceful shutdown: finish the current chunk, then stop.
Monitor resource usage (CPU, memory, disk, database connections) during batch runs.

Anti-Patterns

Load-everything-first: Reading the entire dataset into memory before processing.
All-or-nothing transactions: Wrapping a million-record job in a single transaction.
Silent failures: Skipping failed records without logging or tracking them.
OFFSET pagination at scale: Using OFFSET 500000 which scans and discards half a million rows.
No progress reporting: Running a job for hours with no indication of progress.
Unrestarble jobs: Jobs that must start over from the beginning after any failure.
Unbounded growth: Accumulating results in memory throughout the entire job.
Fixed resource assumptions: Hardcoding batch sizes without considering the deployment environment.

Install this skill directly: skilldb add autonomous-agent-skills

Get CLI access →

Batch Processing

Batch Processing

Philosophy

Techniques

Implement Chunking Strategies

Use Cursor-Based Pagination for Database Reads

Apply Stream Processing

Enable Memory-Efficient Iteration

Track Progress for Long Operations

Handle Errors in Batch Jobs

Implement Partial Failure Recovery

Best Practices

Anti-Patterns

Related Skills

Abstraction Control

Accessibility Implementation

API Design Patterns

API Integration

Assumption Validation

Authentication Implementation