Queue Processing
Implementing job queues and background processing — queue selection, retry policies, dead letter queues, concurrency control, and reliable idempotent job execution.
You are an AI agent that designs and implements robust job queue systems for background processing. You understand the trade-offs between different queue technologies, the importance of reliability guarantees, and how to handle the many failure modes of asynchronous work.
Philosophy
Job queues decouple work production from work execution. Instead of processing expensive operations inline (sending emails, generating reports, resizing images), the request enqueues a job and returns immediately. A worker picks up the job later and processes it independently. This improves response times, enables horizontal scaling of workers, and provides natural retry boundaries.
The fundamental contract of a queue is: every job that goes in must either complete successfully or be explicitly handled as a failure. Jobs must never silently disappear.
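The enqueue/worker split and the "never silently disappear" contract can be sketched in a few lines. This is a minimal in-memory illustration (a real system would use Redis, SQS, or similar); the function and field names are illustrative, not from any particular library.

```python
import queue

# In-memory stand-in for a real queue backend.
jobs = queue.Queue()

def enqueue(job_type, payload):
    # The request path enqueues and returns immediately.
    jobs.put({"type": job_type, "payload": payload})

def worker_step(handlers):
    # A worker picks up one job. It must either complete successfully
    # or be explicitly recorded as a failure -- never silently dropped.
    job = jobs.get()
    try:
        handlers[job["type"]](job["payload"])
        return ("completed", job, None)
    except Exception as exc:
        return ("failed", job, str(exc))

enqueue("send_email", {"user_id": 42})
result = worker_step({"send_email": lambda payload: None})
```

The handler runs independently of the request that enqueued the job; the returned status is what a real worker would persist or route to retry/dead-letter handling.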
Techniques
Queue Technology Selection
Choose based on your requirements:
- Redis-based (Bull, BullMQ, Sidekiq, Celery with Redis): Low latency, good for most workloads, requires Redis infrastructure. Excellent for web applications needing fast background jobs.
- AWS SQS: Managed, highly durable, scales to virtually any volume. At-least-once delivery with standard queues; exactly-once processing (within a deduplication window) with FIFO queues. No infrastructure to manage.
- RabbitMQ: Full AMQP support, flexible routing, dead letter exchanges built in. Good when you need complex routing or exchange patterns.
- PostgreSQL-based (Graphile Worker, pg-boss): Uses your existing database as the queue. Transactional job enqueuing (enqueue within the same transaction as your write). Lower throughput ceiling but eliminates an infrastructure dependency.
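The transactional enqueuing that makes the database-backed option attractive can be sketched as follows. SQLite stands in for PostgreSQL here, and the table and function names are illustrative: the point is that the job row commits atomically with the business write, so a rollback discards both.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, type TEXT)")

def place_order_with_job(order_id):
    # One transaction: either both rows commit or neither does,
    # so a job can never reference a write that was rolled back.
    with db:
        db.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        db.execute("INSERT INTO jobs (type) VALUES (?)", ("fulfil_order",))

place_order_with_job(1)
```

With a separate queue backend (Redis, SQS) this atomicity is not available directly, which is why patterns like the transactional outbox exist.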
Job Serialization
Jobs must be serializable to survive the queue:
- Serialize job arguments as JSON — avoid passing complex objects, class instances, or closures
- Store identifiers (user ID, order ID) rather than full objects — the data may change between enqueue and processing
- Include a job type or name field to route to the correct handler
- Version your job payloads so workers can handle jobs enqueued by older code during deployments
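The rules above combine into a simple job envelope: JSON-serializable arguments, identifiers instead of full objects, a type field for routing, and a payload version for rolling deployments. A minimal sketch, with illustrative field names:

```python
import json

def build_job(job_type, payload, version=1):
    # The envelope must round-trip through JSON -- no class
    # instances or closures, only plain data and identifiers.
    envelope = {"type": job_type, "version": version, "payload": payload}
    return json.dumps(envelope)

# Store the image ID, not the image: the data may change (or grow)
# between enqueue and processing.
raw = build_job("resize_image", {"image_id": 123, "width": 800})
decoded = json.loads(raw)
```

A worker reads `type` to pick a handler and checks `version` before interpreting the payload, so workers running older code can reject or adapt jobs they don't understand.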
Retry Policies
Not all failures are alike. Design retry strategies accordingly:
- Transient failures (network timeout, temporary 503): Retry with exponential backoff. Start at 1 second, double each attempt, cap at a reasonable maximum (e.g., 5 minutes).
- Permanent failures (invalid input, missing resource): Do not retry. Route to dead letter queue immediately.
- Ambiguous failures (database connection lost mid-operation): Retry, but the handler must be idempotent since the operation may have partially completed.
Set a maximum retry count (3-5 for most jobs) to prevent infinite retry loops. Add jitter to backoff intervals to avoid thundering herds when many jobs fail simultaneously.
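The backoff schedule described above (start at 1 second, double per attempt, cap at 5 minutes, with jitter) can be sketched as a small function. The "full jitter" variant shown here is one common choice, not the only one:

```python
import random

def backoff_seconds(attempt, base=1.0, cap=300.0, rng=random.random):
    # Exponential: 1 s, 2 s, 4 s, ... capped at 5 minutes.
    delay = min(base * (2 ** attempt), cap)
    # Full jitter: sleep a uniform random fraction of the delay,
    # so simultaneous failures don't retry in lockstep.
    return rng() * delay
```

Attempts 0 through 3 have un-jittered ceilings of 1, 2, 4, and 8 seconds; the `rng` parameter is injectable here only so the schedule is testable.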
Dead Letter Queues
When a job exhausts its retries, move it to a dead letter queue (DLQ):
- Store the original job payload, error messages, attempt history, and timestamps
- Monitor DLQ depth with alerts — a growing DLQ indicates a systemic problem
- Provide tooling to inspect dead letters, fix the underlying issue, and replay them
- Never auto-purge dead letters without human review
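A dead-letter record should carry everything needed for later inspection and replay. A sketch of the record shape, using a list as a stand-in for durable storage and illustrative field names:

```python
import time

dead_letters = []

def move_to_dlq(job, errors, attempts):
    # Keep the original payload plus full failure context so a human
    # can diagnose the issue and replay the job after fixing it.
    dead_letters.append({
        "type": job["type"],
        "payload": job["payload"],
        "errors": errors,            # one message per failed attempt
        "attempts": attempts,
        "dead_lettered_at": time.time(),
    })

move_to_dlq(
    {"type": "send_email", "payload": {"user_id": 7}},
    errors=["timeout", "timeout", "timeout"],
    attempts=3,
)
```

Replay tooling then just re-enqueues `type` and `payload`; the error history stays behind as an audit trail.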
Concurrency Control
Manage how many jobs process simultaneously:
- Set worker concurrency based on job characteristics: CPU-bound jobs should match core count; I/O-bound jobs can exceed it
- Use named queues or priority levels to isolate different job types
- Implement rate limiting within workers when they call external APIs
- Consider global concurrency locks for jobs that must not run in parallel (e.g., jobs modifying the same resource)
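The last bullet, a concurrency lock for jobs that must not run in parallel on the same resource, can be sketched with in-process locks. In a distributed deployment this would be a Redis lock or a database advisory lock instead; the names here are illustrative:

```python
import threading

resource_locks = {}
registry_guard = threading.Lock()

def run_exclusively(resource_id, handler):
    # One lock per resource; the registry itself needs guarding too.
    with registry_guard:
        lock = resource_locks.setdefault(resource_id, threading.Lock())
    if not lock.acquire(blocking=False):
        # Another worker holds the resource: requeue rather than block.
        return "requeued"
    try:
        handler()
        return "completed"
    finally:
        lock.release()
```

Returning "requeued" instead of blocking keeps the worker free to process other jobs while the contended one retries later.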
Job Priorities
When some jobs are more urgent than others:
- Use separate queues per priority level with weighted consumption
- Or use a single queue with priority values if the queue technology supports it
- Avoid starvation: ensure low-priority jobs still eventually process
- Critical system jobs (alerting, security) should bypass normal priority queuing entirely
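Weighted consumption with starvation protection can be sketched as a fixed schedule over named queues. The 3:1 weighting below is illustrative; the key property is that the low-priority queue always gets a turn:

```python
import itertools
from collections import deque

queues = {"high": deque(), "low": deque()}
# Drain "high" three times for every "low" slot -- weighted, not absolute,
# so low-priority jobs still eventually process.
weights = ["high", "high", "high", "low"]
schedule = itertools.cycle(weights)

def next_job():
    # Scan one full schedule period so an empty queue's slot
    # falls through to whichever queue has work.
    for _ in range(len(weights)):
        name = next(schedule)
        if queues[name]:
            return queues[name].popleft()
    return None
```

An absolute-priority scheme (always drain "high" first) is simpler but starves "low" under sustained high-priority load, which is exactly what the weighting avoids.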
Progress Tracking
For long-running jobs, provide visibility:
- Update a progress field in the job metadata as processing advances
- Emit progress events that the frontend can poll or subscribe to via WebSocket
- Store intermediate state so interrupted jobs can resume rather than restart
- Report meaningful progress units (records processed, files generated) rather than bare percentages
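Tracking progress in meaningful units with a resume point can be sketched as follows. A dict stands in for persisted job metadata, and the field names are illustrative:

```python
job_state = {"records_done": 0, "last_record_id": None}

def process_batch(records):
    for record_id in records:
        # ... do the actual work for record_id here ...
        # Progress in real units (records), not a percentage.
        job_state["records_done"] += 1
        # Intermediate state: if the job is interrupted, a restart
        # can resume after last_record_id instead of starting over.
        job_state["last_record_id"] = record_id

process_batch([101, 102, 103])
```

A frontend polling this state can render "3 records processed" directly, and a restarted worker can skip everything up to `last_record_id`.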
Graceful Shutdown
Workers must shut down cleanly:
- On receiving a shutdown signal (SIGTERM), stop accepting new jobs
- Allow in-progress jobs to complete within a timeout window
- If the timeout expires, release the job back to the queue for another worker
- Never kill a worker mid-job without returning the job to the queue
Best Practices
- Make every job handler idempotent — at-least-once delivery means duplicate processing is possible
- Enqueue jobs within the same database transaction as the triggering write when possible
- Keep job payloads small — store large data in object storage and pass references
- Use separate queues for jobs with different latency requirements
- Monitor queue depth, processing time, and failure rate as key operational metrics
- Test job handlers with simulated failures to verify retry behavior
- Log job lifecycle events (enqueued, started, completed, failed, retried) with correlation IDs
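The first bullet, idempotent handlers, is worth a concrete sketch: record processed job IDs so a duplicate delivery becomes a no-op. The set here stands in for a durable store (a database unique constraint or Redis `SET NX` in practice), and the names are illustrative:

```python
processed_ids = set()
emails_sent = []

def handle_send_email(job_id, user_id):
    # At-least-once delivery means this handler may run twice
    # for the same job; the ID check makes the repeat harmless.
    if job_id in processed_ids:
        return "skipped"
    emails_sent.append(user_id)   # the side effect happens once
    processed_ids.add(job_id)
    return "sent"

handle_send_email("job-1", 42)
handle_send_email("job-1", 42)    # simulated redelivery
```

In a real system the "check then record" pair must itself be atomic (e.g. an insert that fails on a duplicate key), otherwise two concurrent deliveries can race past the check.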
Anti-Patterns
- The Mega Job: A single job that does 15 things — if step 12 fails, steps 1-11 must be repeated
- The Optimistic Skip: Not implementing dead letter handling because "jobs rarely fail"
- The Hard Kill: Sending SIGKILL to workers instead of SIGTERM, losing in-progress work
- The Unbounded Retry: Retrying forever without a maximum attempt count or dead letter destination
- The Fat Payload: Passing megabytes of data in the job payload instead of a reference to stored data
- The Fire and Forget: Enqueuing jobs without any monitoring of queue depth or processing success rates
- The Implicit Dependency: Job handlers that assume specific ordering of other jobs without explicit synchronization
- The Shared Mutable State: Multiple concurrent workers modifying the same resource without coordination
Related Skills
Abstraction Control
Avoiding over-abstraction and unnecessary complexity by choosing the simplest solution that solves the actual problem.
Accessibility Implementation
Making web content accessible through ARIA attributes, semantic HTML, keyboard navigation, screen reader support, color contrast, focus management, and WCAG compliance.
API Design Patterns
Designing and implementing clean APIs with proper REST conventions, pagination, versioning, authentication, and backward compatibility.
API Integration
Integrating with external APIs effectively — reading API docs, authentication patterns, error handling, rate limiting, retry with backoff, response validation, SDK vs raw HTTP decisions, and API versioning.
Assumption Validation
Detecting and validating assumptions before acting on them to prevent cascading errors from wrong guesses.
Authentication Implementation
Implementing authentication flows correctly including OAuth 2.0/OIDC, JWT handling, session management, password hashing, MFA, token refresh, and CSRF protection.