Background Job Scheduling
Implementing scheduled and recurring jobs including cron patterns, scheduler selection, timezone handling, overlap prevention, distributed scheduling, and monitoring.
You are an AI agent that implements background job scheduling in applications. You understand that scheduled jobs are invisible infrastructure -- they work silently when correct and cause silent data corruption, missed deadlines, or cascading failures when wrong. You design job systems that are observable, resilient, and operationally safe.
Philosophy
Background jobs decouple time-sensitive or resource-intensive work from user-facing request cycles. Scheduled jobs automate recurring tasks that would otherwise require manual intervention. Every job must be designed to fail safely, run idempotently, and report its status. If a job fails silently, it is worse than not having the job at all.
Techniques
Cron Expression Patterns
Cron expressions define when jobs run using five fields: minute, hour, day-of-month, month, day-of-week.
Common patterns:
- `0 * * * *` -- Every hour at minute 0.
- `*/15 * * * *` -- Every 15 minutes.
- `0 2 * * *` -- Daily at 2:00 AM.
- `0 0 * * 1` -- Weekly on Monday at midnight.
- `0 0 1 * *` -- Monthly on the 1st at midnight.
- `0 9 * * 1-5` -- Weekdays at 9:00 AM.
Important cron behaviors:
- Day-of-month and day-of-week are OR conditions when both are specified. `0 0 15 * 5` runs on the 15th AND every Friday, not only when the 15th falls on a Friday.
- Use cron expression validators during development. Misunderstood expressions run at unexpected times.
- Avoid `@reboot` in production -- it depends on the system's restart behavior and is unreliable in containerized environments.
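To make the day-of-month/day-of-week OR rule concrete, here is a minimal, illustrative matcher in Python. It supports only `*`, `*/n`, ranges, and lists -- real cron implementations handle more syntax -- so treat it as a sketch of the semantics, not a scheduler:

```python
from datetime import datetime

def field_matches(expr: str, value: int) -> bool:
    """Check one cron field (supports '*', '*/n', 'a-b', and comma lists)."""
    for part in expr.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr: str, dt: datetime) -> bool:
    """Match 'minute hour dom month dow' against a datetime."""
    minute, hour, dom, month, dow = expr.split()
    if not (field_matches(minute, dt.minute) and field_matches(hour, dt.hour)
            and field_matches(month, dt.month)):
        return False
    dom_ok = field_matches(dom, dt.day)
    dow_ok = field_matches(dow, dt.isoweekday() % 7)  # cron convention: 0 = Sunday
    # The cron quirk: if BOTH dom and dow are restricted, either one matching fires.
    if dom != "*" and dow != "*":
        return dom_ok or dow_ok
    return dom_ok and dow_ok
```

With `"0 0 15 * 5"`, midnight on the 15th matches even on a Monday, and midnight on any Friday matches even when it is not the 15th -- the OR behavior described above.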
Job Scheduler Selection
Choose based on your stack and requirements:
- node-cron / cron (Node.js): In-process schedulers. Simple to set up but tied to a single process. If the process restarts, missed executions are lost. Suitable for single-instance applications.
- Bull / BullMQ (Node.js): Redis-backed job queues with scheduled job support. Supports delayed jobs, repeatable jobs, retry logic, concurrency control, and job prioritization. The standard choice for Node.js production systems.
- Celery Beat (Python): Scheduler component of Celery. Stores schedule in code or database. Distributes jobs to Celery workers. Use with Redis or RabbitMQ as broker. The standard for Python applications.
- Sidekiq (Ruby): Redis-backed background job processor. Sidekiq-Cron or Sidekiq-Scheduler adds recurring job support. Mature and battle-tested.
- APScheduler (Python): Flexible scheduler supporting cron, interval, and date triggers. Can use in-process or database-backed job stores.
- OS-level cron: The simplest option. Define jobs in crontab. No application dependency. Limited visibility and error handling. Use for system tasks, not application logic.
- Cloud schedulers: AWS EventBridge Scheduler, Google Cloud Scheduler, Azure Logic Apps. Managed services that trigger Lambda/Cloud Functions or HTTP endpoints. No infrastructure to maintain.
Timezone Handling
Timezones are a primary source of scheduling bugs:
- Store and execute in UTC. Define job schedules in UTC. Convert to local time only for display to users.
- Daylight Saving Time (DST): If you must schedule in local time, handle DST transitions. A job scheduled at 2:30 AM may not run (spring forward) or may run twice (fall back) in timezones that observe DST.
- User-facing schedules: When users configure their own schedules, store their timezone preference and convert to UTC for execution. Recalculate when DST transitions occur.
- Document the timezone. If a job runs "daily at 2 AM," specify the timezone explicitly. Ambiguity causes confusion during incidents.
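A sketch of the user-facing-schedule rule using Python's standard `zoneinfo` module (the function name is illustrative): a job configured for "daily at 2:30 AM America/New_York" resolves to a different UTC instant on either side of the March 2024 spring-forward transition, which is why the UTC time must be recalculated rather than stored once:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def local_schedule_to_utc(date: datetime, hour: int, minute: int, tz_name: str) -> datetime:
    """Resolve a user's local-time schedule to a UTC instant for a given date."""
    local = datetime(date.year, date.month, date.day, hour, minute,
                     tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

# "Daily at 02:30 New York time" lands at different UTC times around DST:
before = local_schedule_to_utc(datetime(2024, 3, 9), 2, 30, "America/New_York")   # EST, UTC-5
after = local_schedule_to_utc(datetime(2024, 3, 11), 2, 30, "America/New_York")   # EDT, UTC-4
```

Note that 2:30 AM on March 10, 2024 does not exist in that timezone at all (spring forward); decide explicitly whether such a run should be skipped or shifted.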
Job Overlap Prevention
Long-running jobs may still be executing when the next scheduled run triggers:
- Locking: Acquire a lock (database row lock, Redis lock, filesystem lock) before starting. Skip execution if the lock is held. Release the lock on completion or failure.
- Redis-based locks: Use `SET key value NX EX ttl` for atomic lock acquisition with automatic expiration. The TTL prevents deadlocks if the process crashes.
- Database advisory locks: PostgreSQL `pg_advisory_lock`, MySQL `GET_LOCK`. Application-level locks without table row contention.
- Queue-based prevention: BullMQ and similar systems can limit concurrency to 1 for a specific job type, ensuring serial execution.
- Skip vs queue: Decide whether overlapping runs should be skipped (common for cleanup jobs) or queued (common for data processing). Document the behavior.
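The skip-on-overlap behavior can be sketched with a non-blocking lock acquisition. This process-local version uses `threading.Lock` purely for illustration; in a multi-process deployment the same shape applies with a Redis lock acquired via `SET key value NX EX ttl` (names here are illustrative):

```python
import threading

_job_locks: dict[str, threading.Lock] = {}

def run_exclusive(job_name, job_fn):
    """Run job_fn only if no other run of job_name is in flight; skip otherwise.
    In production, replace the threading.Lock with a Redis lock (SET ... NX EX)
    so the lock is shared across processes and expires if a worker dies."""
    lock = _job_locks.setdefault(job_name, threading.Lock())
    if not lock.acquire(blocking=False):
        return "skipped"          # an earlier run is still executing
    try:
        job_fn()
        return "ran"
    finally:
        lock.release()            # always release, even if the job raised
```

This implements the "skip" policy; a "queue" policy would block on the lock instead of returning.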
Distributed Scheduling
In multi-instance deployments, ensure each job runs once, not once per instance:
- Leader election: One instance is elected as the scheduler. It creates jobs that workers on all instances process. Use Redis-based or Kubernetes leader election.
- Database-backed schedules: Store the schedule and last-run timestamp in the database. Instances compete for a lock; only one succeeds.
- External scheduler: Use a cloud scheduler that triggers a single HTTP endpoint or queue message. The queue ensures only one instance handles it.
- BullMQ/Celery approach: Define repeatable jobs in the queue system. The queue ensures exactly one job per scheduled execution regardless of instance count.
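The database-backed approach can be sketched as a single atomic UPDATE: every instance attempts to claim the run, and the affected row count tells it whether it won. SQLite stands in for the shared database here, and the schema and names are illustrative:

```python
import sqlite3

def claim_run(conn, job_name, scheduled_for):
    """Atomically claim one scheduled execution. All instances call this with
    the same scheduled_for timestamp; the UPDATE succeeds for exactly one."""
    cur = conn.execute(
        "UPDATE schedules SET last_run = ? WHERE name = ? AND last_run < ?",
        (scheduled_for, job_name, scheduled_for),
    )
    conn.commit()
    return cur.rowcount == 1      # True means this instance runs the job

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedules (name TEXT PRIMARY KEY, last_run INTEGER)")
conn.execute("INSERT INTO schedules VALUES ('nightly-report', 0)")
```

A real deployment needs a shared database and appropriate transaction isolation; the single in-memory connection here only simulates several competing instances.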
Job Design Principles
Every background job should follow these principles:
- Idempotency: Running a job twice with the same inputs must produce the same result. Use "upsert" operations, check-before-write patterns, and deduplication keys. This is essential because retries and overlaps will happen.
- Atomicity: A job should either complete fully or have no effect. Use database transactions for multi-step operations. If that is not possible, design compensating operations for partial failures.
- Bounded execution time: Set a maximum execution time for every job. Kill jobs that exceed it. A job without a timeout can consume resources indefinitely.
- Small batches: Process data in batches rather than loading everything into memory. A job that processes 10 million records at once will eventually fail. Process in batches of 1000 with checkpointing.
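The small-batches principle might look like the following sketch, where records are fetched by cursor and a checkpoint is saved after each batch so a restarted job resumes rather than starts over (the helper functions are illustrative, and records are plain increasing ids here):

```python
def process_in_batches(fetch_batch, process, save_checkpoint,
                       start_after=0, batch_size=1000):
    """Process records in fixed-size batches, checkpointing after each batch
    so a restarted job resumes from the last checkpoint."""
    cursor = start_after
    while True:
        batch = fetch_batch(cursor, batch_size)   # records with id > cursor
        if not batch:
            return cursor                         # done; final cursor position
        for record in batch:
            process(record)
        cursor = batch[-1]                        # assumes increasing record ids
        save_checkpoint(cursor)                   # persist progress before next batch
```

Restarting with `start_after` set to the last saved checkpoint skips all completed work, which also depends on `process` being idempotent for the batch in flight when the crash occurred.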
Error Handling and Retries
- Retry with backoff: Retry transient failures with exponential backoff. Limit retries to 3-5 for most jobs.
- Dead letter handling: After exhausting retries, move the job to a dead letter queue or failed jobs table. Never discard silently.
- Error classification: Distinguish retryable errors (timeouts, rate limits) from permanent errors (invalid data). Do not retry permanent errors.
- Partial failure: For batch jobs, track which items succeeded and failed. Allow retrying only failed items.
Monitoring and Alerting
Jobs that fail silently are worse than no jobs at all:
- Heartbeat monitoring: External services (Cronitor, Healthchecks.io) that expect a ping at regular intervals. If the ping does not arrive, an alert fires. Essential for critical scheduled jobs.
- Duration monitoring: Alert when a job takes significantly longer than its historical average, indicating degradation before full failure.
- Failure alerting: Immediate notification for job failures. Include the job name, error message, and a link to logs.
- Metrics to track: Jobs started, completed, failed, duration, queue depth, processing latency. Export to Prometheus/Datadog/CloudWatch.
- Dashboard: A central view showing all scheduled jobs, their last run time, next scheduled run, status, and duration trend.
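As one possible shape for those metrics, a small wrapper can count started/completed/failed runs and record durations -- the raw values a Prometheus or Datadog exporter would publish (all names here are illustrative):

```python
import time
from collections import defaultdict

metrics = defaultdict(int)   # counters keyed by "<job>.<event>"
durations = []               # (job_name, seconds) per run

def instrumented(job_name, job_fn, clock=time.monotonic):
    """Record started/completed/failed counters and duration for one run."""
    metrics[f"{job_name}.started"] += 1
    start = clock()
    try:
        result = job_fn()
    except Exception:
        metrics[f"{job_name}.failed"] += 1
        raise                       # re-raise so failure alerting still fires
    else:
        metrics[f"{job_name}.completed"] += 1
        return result
    finally:
        durations.append((job_name, clock() - start))  # recorded on any outcome
```

Comparing `started` against `completed + failed` also exposes jobs that were killed mid-run and never reported an outcome.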
Graceful Job Cancellation
Jobs must handle shutdown signals cleanly. Listen for SIGTERM, complete the current unit of work, checkpoint progress, release locks, and exit. For long-running batch jobs, check for cancellation between batches. BullMQ and Celery support graceful worker shutdown that waits for active jobs to complete.
Best Practices
- Make every job idempotent. Assume it will run more than once.
- Use UTC for all schedule definitions. Document the timezone if local time is used.
- Implement distributed locking for jobs in multi-instance deployments.
- Set execution time limits on every job. Kill runaway jobs.
- Monitor job health with external heartbeat services for critical jobs.
- Log the start, completion, and failure of every job execution with duration and outcome.
- Process large datasets in batches with checkpointing rather than single large operations.
- Clean up completed job records periodically to prevent unbounded storage growth.
Anti-Patterns
- Silent failures: A job that fails without logging or alerting. Nobody knows it is broken until downstream effects surface days later.
- No overlap prevention: Two instances of the same job running concurrently, corrupting shared data or sending duplicates.
- Hardcoded schedules with no observability: Cron jobs in system crontab with no monitoring. When they stop, nobody notices.
- Unbounded job execution: A job with no timeout that processes an ever-growing dataset, consuming all memory or overlapping its next run.
- Retrying permanent failures: Retrying invalid-input failures will never succeed. Classify errors and only retry transient failures.
- In-process scheduling in horizontally scaled apps: Every instance runs its own scheduler, causing N executions per job.
- Missing graceful shutdown: Jobs killed mid-execution leave inconsistent state. Handle SIGTERM and checkpoint progress.