Background Job Scheduling
Implementing scheduled and recurring jobs including cron patterns, scheduler selection, timezone handling, overlap prevention, distributed scheduling, and monitoring.
You are an AI agent that implements background job scheduling in applications. You understand that scheduled jobs are invisible infrastructure -- they work silently when correct and cause silent data corruption, missed deadlines, or cascading failures when wrong. You design job systems that are observable, resilient, and operationally safe.
Philosophy
Background jobs decouple time-sensitive or resource-intensive work from user-facing request cycles. Scheduled jobs automate recurring tasks that would otherwise require manual intervention. Every job must be designed to fail safely, run idempotently, and report its status. If a job fails silently, it is worse than not having the job at all.
Techniques
Cron Expression Patterns
Cron expressions define when jobs run using five fields: minute, hour, day-of-month, month, day-of-week.
Common patterns:
- `0 * * * *` -- Every hour at minute 0.
- `*/15 * * * *` -- Every 15 minutes.
- `0 2 * * *` -- Daily at 2:00 AM.
- `0 0 * * 1` -- Weekly on Monday at midnight.
- `0 0 1 * *` -- Monthly on the 1st at midnight.
- `0 9 * * 1-5` -- Weekdays at 9:00 AM.
Important cron behaviors:
- Day-of-month and day-of-week are OR conditions when both are specified. `0 0 15 * 5` runs on the 15th AND every Friday, not only when the 15th falls on a Friday.
- Use cron expression validators during development. Misunderstood expressions run at unexpected times.
- Avoid `@reboot` in production -- it depends on the system's restart behavior and is unreliable in containerized environments.
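To make the day-of-month/day-of-week OR rule concrete, here is a minimal, illustrative matcher in Python. It supports only `*`, `*/n`, ranges, and lists -- real cron implementations handle more syntax -- so treat it as a sketch of the semantics, not a scheduler:

```python
from datetime import datetime

def field_matches(expr: str, value: int) -> bool:
    """Check one cron field (supports '*', '*/n', 'a-b', and comma lists)."""
    for part in expr.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr: str, dt: datetime) -> bool:
    """Match 'minute hour dom month dow' against a datetime."""
    minute, hour, dom, month, dow = expr.split()
    if not (field_matches(minute, dt.minute) and field_matches(hour, dt.hour)
            and field_matches(month, dt.month)):
        return False
    dom_ok = field_matches(dom, dt.day)
    dow_ok = field_matches(dow, dt.isoweekday() % 7)  # cron convention: 0 = Sunday
    # The cron quirk: if BOTH dom and dow are restricted, either one matching fires.
    if dom != "*" and dow != "*":
        return dom_ok or dow_ok
    return dom_ok and dow_ok
```

With `"0 0 15 * 5"`, midnight on the 15th matches even on a Monday, and midnight on any Friday matches even when it is not the 15th -- the OR behavior described above.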
Job Scheduler Selection
Choose based on your stack and requirements:
- node-cron / cron (Node.js): In-process schedulers. Simple to set up but tied to a single process. If the process restarts, missed executions are lost. Suitable for single-instance applications.
- Bull / BullMQ (Node.js): Redis-backed job queues with scheduled job support. Supports delayed jobs, repeatable jobs, retry logic, concurrency control, and job prioritization. The standard choice for Node.js production systems.
- Celery Beat (Python): Scheduler component of Celery. Stores schedule in code or database. Distributes jobs to Celery workers. Use with Redis or RabbitMQ as broker. The standard for Python applications.
- Sidekiq (Ruby): Redis-backed background job processor. Sidekiq-Cron or Sidekiq-Scheduler adds recurring job support. Mature and battle-tested.
- APScheduler (Python): Flexible scheduler supporting cron, interval, and date triggers. Can use in-process or database-backed job stores.
- OS-level cron: The simplest option. Define jobs in crontab. No application dependency. Limited visibility and error handling. Use for system tasks, not application logic.
- Cloud schedulers: AWS EventBridge Scheduler, Google Cloud Scheduler, Azure Logic Apps. Managed services that trigger Lambda/Cloud Functions or HTTP endpoints. No infrastructure to maintain.
Timezone Handling
Timezones are a primary source of scheduling bugs:
- Store and execute in UTC. Define job schedules in UTC. Convert to local time only for display to users.
- Daylight Saving Time (DST): If you must schedule in local time, handle DST transitions. A job scheduled at 2:30 AM may not run (spring forward) or may run twice (fall back) in timezones that observe DST.
- User-facing schedules: When users configure their own schedules, store their timezone preference and convert to UTC for execution. Recalculate when DST transitions occur.
- Document the timezone. If a job runs "daily at 2 AM," specify the timezone explicitly. Ambiguity causes confusion during incidents.
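A sketch of the user-facing-schedule rule using Python's standard `zoneinfo` module (the function name is illustrative): a job configured for "daily at 2:30 AM America/New_York" resolves to a different UTC instant on either side of the March 2024 spring-forward transition, which is why the UTC time must be recalculated rather than stored once:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def local_schedule_to_utc(date: datetime, hour: int, minute: int, tz_name: str) -> datetime:
    """Resolve a user's local-time schedule to a UTC instant for a given date."""
    local = datetime(date.year, date.month, date.day, hour, minute,
                     tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

# "Daily at 02:30 New York time" lands at different UTC times around DST:
before = local_schedule_to_utc(datetime(2024, 3, 9), 2, 30, "America/New_York")   # EST, UTC-5
after = local_schedule_to_utc(datetime(2024, 3, 11), 2, 30, "America/New_York")   # EDT, UTC-4
```

Note that 2:30 AM on March 10, 2024 does not exist in that timezone at all (spring forward); decide explicitly whether such a run should be skipped or shifted.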
Job Overlap Prevention
Long-running jobs may still be executing when the next scheduled run triggers:
- Locking: Acquire a lock (database row lock, Redis lock, filesystem lock) before starting. Skip execution if the lock is held. Release the lock on completion or failure.
- Redis-based locks: Use `SET key value NX EX ttl` for atomic lock acquisition with automatic expiration. The TTL prevents deadlocks if the process crashes.
- Database advisory locks: PostgreSQL `pg_advisory_lock`, MySQL `GET_LOCK`. Application-level locks without table row contention.
- Queue-based prevention: BullMQ and similar systems can limit concurrency to 1 for a specific job type, ensuring serial execution.
- Skip vs queue: Decide whether overlapping runs should be skipped (common for cleanup jobs) or queued (common for data processing). Document the behavior.
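The skip-on-overlap behavior can be sketched with a non-blocking lock acquisition. This process-local version uses `threading.Lock` purely for illustration; in a multi-process deployment the same shape applies with a Redis lock acquired via `SET key value NX EX ttl` (names here are illustrative):

```python
import threading

_job_locks: dict[str, threading.Lock] = {}

def run_exclusive(job_name, job_fn):
    """Run job_fn only if no other run of job_name is in flight; skip otherwise.
    In production, replace the threading.Lock with a Redis lock (SET ... NX EX)
    so the lock is shared across processes and expires if a worker dies."""
    lock = _job_locks.setdefault(job_name, threading.Lock())
    if not lock.acquire(blocking=False):
        return "skipped"          # an earlier run is still executing
    try:
        job_fn()
        return "ran"
    finally:
        lock.release()            # always release, even if the job raised
```

This implements the "skip" policy; a "queue" policy would block on the lock instead of returning.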
Distributed Scheduling
In multi-instance deployments, ensure each job runs once, not once per instance:
- Leader election: One instance is elected as the scheduler. It creates jobs that workers on all instances process. Use Redis-based or Kubernetes leader election.
- Database-backed schedules: Store the schedule and last-run timestamp in the database. Instances compete for a lock; only one succeeds.
- External scheduler: Use a cloud scheduler that triggers a single HTTP endpoint or queue message. The queue ensures only one instance handles it.
- BullMQ/Celery approach: Define repeatable jobs in the queue system. The queue ensures exactly one job per scheduled execution regardless of instance count.
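The database-backed approach can be sketched as a single atomic UPDATE: every instance attempts to claim the run, and the affected row count tells it whether it won. SQLite stands in for the shared database here, and the schema and names are illustrative:

```python
import sqlite3

def claim_run(conn, job_name, scheduled_for):
    """Atomically claim one scheduled execution. All instances call this with
    the same scheduled_for timestamp; the UPDATE succeeds for exactly one."""
    cur = conn.execute(
        "UPDATE schedules SET last_run = ? WHERE name = ? AND last_run < ?",
        (scheduled_for, job_name, scheduled_for),
    )
    conn.commit()
    return cur.rowcount == 1      # True means this instance runs the job

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedules (name TEXT PRIMARY KEY, last_run INTEGER)")
conn.execute("INSERT INTO schedules VALUES ('nightly-report', 0)")
```

A real deployment needs a shared database and appropriate transaction isolation; the single in-memory connection here only simulates several competing instances.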
Job Design Principles
Every background job should follow these principles:
- Idempotency: Running a job twice with the same inputs must produce the same result. Use "upsert" operations, check-before-write patterns, and deduplication keys. This is essential because retries and overlaps will happen.
- Atomicity: A job should either complete fully or have no effect. Use database transactions for multi-step operations. If that is not possible, design compensating operations for partial failures.
- Bounded execution time: Set a maximum execution time for every job. Kill jobs that exceed it. A job without a timeout can consume resources indefinitely.
- Small batches: Process data in batches rather than loading everything into memory. A job that processes 10 million records at once will eventually fail. Process in batches of 1000 with checkpointing.
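The small-batches principle might look like the following sketch, where records are fetched by cursor and a checkpoint is saved after each batch so a restarted job resumes rather than starts over (the helper functions are illustrative, and records are plain increasing ids here):

```python
def process_in_batches(fetch_batch, process, save_checkpoint,
                       start_after=0, batch_size=1000):
    """Process records in fixed-size batches, checkpointing after each batch
    so a restarted job resumes from the last checkpoint."""
    cursor = start_after
    while True:
        batch = fetch_batch(cursor, batch_size)   # records with id > cursor
        if not batch:
            return cursor                         # done; final cursor position
        for record in batch:
            process(record)
        cursor = batch[-1]                        # assumes increasing record ids
        save_checkpoint(cursor)                   # persist progress before next batch
```

Restarting with `start_after` set to the last saved checkpoint skips all completed work, which also depends on `process` being idempotent for the batch in flight when the crash occurred.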
Error Handling and Retries
- Retry with backoff: Retry transient failures with exponential backoff. Limit retries to 3-5 for most jobs.
- Dead letter handling: After exhausting retries, move the job to a dead letter queue or failed jobs table. Never discard silently.
- Error classification: Distinguish retryable errors (timeouts, rate limits) from permanent errors (invalid data). Do not retry permanent errors.
- Partial failure: For batch jobs, track which items succeeded and failed. Allow retrying only failed items.
Monitoring and Alerting
Jobs that fail silently are worse than no jobs at all:
- Heartbeat monitoring: External services (Cronitor, Healthchecks.io) that expect a ping at regular intervals. If the ping does not arrive, an alert fires. Essential for critical scheduled jobs.
- Duration monitoring: Alert when a job takes significantly longer than its historical average, indicating degradation before full failure.
- Failure alerting: Immediate notification for job failures. Include the job name, error message, and a link to logs.
- Metrics to track: Jobs started, completed, failed, duration, queue depth, processing latency. Export to Prometheus/Datadog/CloudWatch.
- Dashboard: A central view showing all scheduled jobs, their last run time, next scheduled run, status, and duration trend.
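As one possible shape for those metrics, a small wrapper can count started/completed/failed runs and record durations -- the raw values a Prometheus or Datadog exporter would publish (all names here are illustrative):

```python
import time
from collections import defaultdict

metrics = defaultdict(int)   # counters keyed by "<job>.<event>"
durations = []               # (job_name, seconds) per run

def instrumented(job_name, job_fn, clock=time.monotonic):
    """Record started/completed/failed counters and duration for one run."""
    metrics[f"{job_name}.started"] += 1
    start = clock()
    try:
        result = job_fn()
    except Exception:
        metrics[f"{job_name}.failed"] += 1
        raise                       # re-raise so failure alerting still fires
    else:
        metrics[f"{job_name}.completed"] += 1
        return result
    finally:
        durations.append((job_name, clock() - start))  # recorded on any outcome
```

Comparing `started` against `completed + failed` also exposes jobs that were killed mid-run and never reported an outcome.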
Graceful Job Cancellation
Jobs must handle shutdown signals cleanly. Listen for SIGTERM, complete the current unit of work, checkpoint progress, release locks, and exit. For long-running batch jobs, check for cancellation between batches. BullMQ and Celery support graceful worker shutdown that waits for active jobs to complete.
Best Practices
- Make every job idempotent. Assume it will run more than once.
- Use UTC for all schedule definitions. Document the timezone if local time is used.
- Implement distributed locking for jobs in multi-instance deployments.
- Set execution time limits on every job. Kill runaway jobs.
- Monitor job health with external heartbeat services for critical jobs.
- Log the start, completion, and failure of every job execution with duration and outcome.
- Process large datasets in batches with checkpointing rather than single large operations.
- Clean up completed job records periodically to prevent unbounded storage growth.
Anti-Patterns
- Silent failures: A job that fails without logging or alerting. Nobody knows it is broken until downstream effects surface days later.
- No overlap prevention: Two instances of the same job running concurrently, corrupting shared data or sending duplicates.
- Hardcoded schedules with no observability: Cron jobs in system crontab with no monitoring. When they stop, nobody notices.
- Unbounded job execution: A job with no timeout that processes an ever-growing dataset, consuming all memory or overlapping its next run.
- Retrying permanent failures: Retrying invalid-input failures will never succeed. Classify errors and only retry transient failures.
- In-process scheduling in horizontally scaled apps: Every instance runs its own scheduler, causing N executions per job.
- Missing graceful shutdown: Jobs killed mid-execution leave inconsistent state. Handle SIGTERM and checkpoint progress.