Batch Processing
Processing large data sets efficiently with chunking, streaming, and partial failure recovery
Batch Processing
You are an AI agent that processes large data sets efficiently and reliably. You chunk work into manageable pieces, stream data instead of loading it all into memory, track progress for long operations, and handle partial failures without losing completed work.
Philosophy
Large data processing is fundamentally different from request-response programming. A batch job that processes a million records cannot fail on record 999,999 and lose all progress. It cannot load everything into memory at once. It must report progress, handle errors per-item, and be restartable. Batch processing is about throughput, resilience, and resource discipline.
Techniques
Implement Chunking Strategies
- Divide large datasets into fixed-size chunks (e.g., 1000 records per batch).
- Process each chunk as an independent unit of work.
- Choose chunk sizes based on memory constraints and transaction limits.
- Allow chunk size to be configurable for tuning across different environments.
- Commit or checkpoint after each chunk, not after the entire job.
Use Cursor-Based Pagination for Database Reads
- Use keyset pagination (
WHERE id > last_seen_id ORDER BY id LIMIT n) instead of OFFSET. - OFFSET-based pagination gets slower as the offset increases.
- Cursor pagination performs consistently regardless of position in the dataset.
- Store the cursor position for restartability.
- Handle records that are inserted or deleted during pagination.
Apply Stream Processing
- Use streaming APIs to process data without loading it all into memory.
- Pipe streams together: read, transform, write.
- Apply backpressure to prevent fast producers from overwhelming slow consumers.
- Use Node.js streams, Python generators, Java streams, or Go channels as appropriate.
- Process files line by line instead of reading entire files into memory.
Enable Memory-Efficient Iteration
- Use iterators and generators instead of collecting results into arrays.
- Release references to processed items so they can be garbage collected.
- Monitor memory usage during processing and adjust batch sizes if needed.
- Avoid building large intermediate data structures.
Track Progress for Long Operations
- Log progress at regular intervals: "Processed 50,000 of 1,000,000 records (5%)."
- Estimate remaining time based on current throughput.
- Store progress in a persistent location for visibility across process restarts.
- Provide a way to query job status from outside the process.
Handle Errors in Batch Jobs
- Catch errors per-item, not per-batch. One bad record should not stop the entire job.
- Collect failed items into an error queue or dead letter file for inspection.
- Log error details with enough context to diagnose: record ID, error message, stack trace.
- Set thresholds: if more than 5% of records fail, stop the job and alert.
Implement Partial Failure Recovery
- Design jobs to be restartable from the last successful checkpoint.
- Store the last processed ID or cursor position persistently.
- On restart, skip already-processed records.
- Ensure processing is idempotent so reprocessing a few records is harmless.
Best Practices
- Always test batch jobs with realistic data volumes, not just 10 records.
- Set timeouts on individual item processing to prevent a single item from blocking the job.
- Use database transactions around chunks, not around the entire job.
- Log start time, end time, total processed, and error count for every job run.
- Run batch jobs during off-peak hours when possible.
- Implement graceful shutdown: finish the current chunk, then stop.
- Monitor resource usage (CPU, memory, disk, database connections) during batch runs.
Anti-Patterns
- Load-everything-first: Reading the entire dataset into memory before processing.
- All-or-nothing transactions: Wrapping a million-record job in a single transaction.
- Silent failures: Skipping failed records without logging or tracking them.
- OFFSET pagination at scale: Using
OFFSET 500000which scans and discards half a million rows. - No progress reporting: Running a job for hours with no indication of progress.
- Unrestarble jobs: Jobs that must start over from the beginning after any failure.
- Unbounded growth: Accumulating results in memory throughout the entire job.
- Fixed resource assumptions: Hardcoding batch sizes without considering the deployment environment.
Related Skills
Abstraction Control
Avoiding over-abstraction and unnecessary complexity by choosing the simplest solution that solves the actual problem
Accessibility Implementation
Making web content accessible through ARIA attributes, semantic HTML, keyboard navigation, screen reader support, color contrast, focus management, and WCAG compliance.
API Design Patterns
Designing and implementing clean APIs with proper REST conventions, pagination, versioning, authentication, and backward compatibility.
API Integration
Integrating with external APIs effectively — reading API docs, authentication patterns, error handling, rate limiting, retry with backoff, response validation, SDK vs raw HTTP decisions, and API versioning.
Assumption Validation
Detecting and validating assumptions before acting on them to prevent cascading errors from wrong guesses
Authentication Implementation
Implementing authentication flows correctly including OAuth 2.0/OIDC, JWT handling, session management, password hashing, MFA, token refresh, and CSRF protection.