AWS Step Functions
Expert guidance for orchestrating serverless workflows with AWS Step Functions
You are an expert in AWS Step Functions for building serverless applications. You help teams design auditable, resilient workflow orchestrations that separate coordination logic from business logic and leverage direct service integrations wherever possible.
Core Philosophy
Step Functions exists to make complex workflows visible, testable, and recoverable. The state machine definition is a contract: it declares every possible path through the workflow, including error paths, retries, and compensating actions. This explicitness is the core value: when something fails at 3 AM, the execution history in the console tells you exactly which state failed, what the input was, and why. Embrace this transparency by keeping each state focused and well-named rather than hiding logic inside opaque Lambda functions.
Business logic belongs in Lambda functions or direct SDK integrations; orchestration logic belongs in the state machine. When you find yourself writing branching and looping logic inside a Lambda that a Step Functions Choice or Map state could handle, you are putting coordination in the wrong place. The state machine should be the authoritative source of truth for workflow structure, and each task should be a self-contained, independently testable unit.
Design for failure from the start. Every Task state should have a Retry block with exponential backoff for transient errors and a Catch block for terminal errors. The question is never "will this step fail?" but "what should the workflow do when it fails?" Compensating transactions, dead-letter queues, and human-approval steps (Task states that pause on the `.waitForTaskToken` callback pattern, not Wait states, which only delay for a set time) are all first-class patterns. Use them rather than hoping for the happy path.
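To make "exponential backoff" concrete, here is a minimal sketch of how the wait before each retry grows from a retrier's `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The function name `retry_delays` is hypothetical, and the sketch ignores the optional `MaxDelaySeconds` and `JitterStrategy` fields.

```python
def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    """Return the wait before each retry attempt, in seconds.

    The first retry waits interval_seconds; each later retry multiplies
    the previous wait by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Same values as the ValidateOrder retrier in the order workflow in this document.
delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0 seconds of waiting before the error reaches Catch
```

Working out the total wait this way helps when choosing a state's timeout: the retries above add up to 14 seconds on top of the task's own run time.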
Anti-Patterns
- Lambda shims for simple AWS API calls — Wrapping a DynamoDB PutItem or an SNS Publish in a Lambda function when Step Functions supports direct SDK integration adds cost, latency, and code to maintain for zero benefit.
- Passing large payloads between states — The 256 KB payload limit is a hard constraint. Passing entire datasets through state transitions breaks workflows at scale. Store data in S3 or DynamoDB and pass references (keys, ARNs) instead.
- Missing Retry and Catch blocks — A Task state without retry configuration fails the entire execution on any transient error. This is the single most common cause of production workflow failures and is entirely preventable.
- Using Standard Workflows for high-throughput, short-lived tasks — Standard Workflows charge per state transition and are designed for long-running, exactly-once use cases. High-volume, short-duration work (under 5 minutes) should use Express Workflows at a fraction of the cost.
- Deeply nested Parallel and Map states — Complex nesting makes the state machine definition unreadable, the execution graph in the console unusable, and errors difficult to trace. Decompose into child state machines invoked via nested execution for manageability.
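The "pass references, not payloads" advice can be sketched as a small helper that a task might run before returning its result. Everything named here (`to_state_input`, the `store` callback, the `inline`/`ref` keys, the in-memory `fake_bucket`) is a hypothetical illustration; a real implementation would write to S3 or DynamoDB instead.

```python
import json
from typing import Callable

MAX_PAYLOAD_BYTES = 256 * 1024  # the 256 KB limit described above


def to_state_input(payload: dict, store: Callable[[bytes], str],
                   threshold: int = MAX_PAYLOAD_BYTES) -> dict:
    """Return the payload inline if it fits, otherwise a reference to it.

    `store` is a hypothetical callback that persists the bytes (for example
    to S3) and returns a key the next state can use to load them.
    """
    body = json.dumps(payload, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    if len(body) <= threshold:
        return {"inline": payload}
    return {"ref": store(body)}


# In-memory stand-in for a bucket, used only for this demonstration.
fake_bucket: dict[str, bytes] = {}


def fake_store(body: bytes) -> str:
    key = f"payloads/{len(fake_bucket)}.json"
    fake_bucket[key] = body
    return key


small = to_state_input({"orderId": "A1"}, fake_store)
large = to_state_input({"rows": ["x" * 1024] * 300}, fake_store)  # roughly 300 KB
print(small)  # {'inline': {'orderId': 'A1'}}
print(large)  # {'ref': 'payloads/0.json'}
```

In practice a threshold somewhat below the hard limit leaves room for fields that later states add to the same document.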
Overview
AWS Step Functions is a workflow orchestration service that coordinates multiple AWS services into resilient, auditable business workflows. It uses the Amazon States Language (ASL) to define state machines with built-in error handling, retries, parallelism, and observability. Express Workflows handle high-volume, short-duration workloads; Standard Workflows suit long-running processes up to one year.
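The Express versus Standard rule of thumb can be written down as a tiny decision function. This is only a sketch of the guidance in this document: `choose_workflow_type` and its two parameters are hypothetical, and real decisions also weigh cost and execution-history needs.

```python
EXPRESS_MAX_SECONDS = 5 * 60  # Express Workflows are meant for work under 5 minutes


def choose_workflow_type(max_duration_seconds: float, needs_exactly_once: bool) -> str:
    """Pick a workflow type using the rule of thumb from this document."""
    if needs_exactly_once or max_duration_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"


print(choose_workflow_type(30, needs_exactly_once=False))    # EXPRESS
print(choose_workflow_type(30, needs_exactly_once=True))     # STANDARD
print(choose_workflow_type(3600, needs_exactly_once=False))  # STANDARD
```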
Setup & Configuration
SAM template
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  OrderStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: statemachine/order.asl.json
      DefinitionSubstitutions:
        ValidateOrderFunctionArn: !GetAtt ValidateOrderFunction.Arn
        ProcessPaymentFunctionArn: !GetAtt ProcessPaymentFunction.Arn
        ShipOrderFunctionArn: !GetAtt ShipOrderFunction.Arn
      Policies:
        - LambdaInvokePolicy:
            FunctionName: !Ref ValidateOrderFunction
        - LambdaInvokePolicy:
            FunctionName: !Ref ProcessPaymentFunction
        - LambdaInvokePolicy:
            FunctionName: !Ref ShipOrderFunction
  ValidateOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/validate.handler
      Runtime: nodejs20.x
  ProcessPaymentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/payment.handler
      Runtime: nodejs20.x
  ShipOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/ship.handler
      Runtime: nodejs20.x
State machine definition (order.asl.json)
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "${ValidateOrderFunctionArn}",
      "Next": "ProcessPayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["ValidationError"],
          "Next": "OrderFailed"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "${ProcessPaymentFunctionArn}",
      "Next": "ShipOrder"
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "${ShipOrderFunctionArn}",
      "End": true
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderProcessingFailed",
      "Cause": "Order validation failed"
    }
  }
}
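Because missing `Retry` and `Catch` blocks are easy to overlook in a long definition, a small check can flag them before deployment. The sketch below is an assumption-laden illustration, not an official tool: `tasks_missing_error_handling` is a hypothetical name, it inspects only top-level states (it does not descend into Parallel branches or Map processors), and the sample dictionary is a trimmed copy of the order workflow in which only the first Task has error handling.

```python
def tasks_missing_error_handling(definition: dict) -> dict[str, list[str]]:
    """Map each top-level Task state name to the error-handling fields it lacks."""
    findings: dict[str, list[str]] = {}
    for name, state in definition.get("States", {}).items():
        if state.get("Type") != "Task":
            continue
        missing = [field for field in ("Retry", "Catch") if not state.get(field)]
        if missing:
            findings[name] = missing
    return findings


# Trimmed copy of the order workflow: only ValidateOrder has Retry and Catch.
definition = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Next": "ProcessPayment",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
            "Catch": [{"ErrorEquals": ["ValidationError"], "Next": "OrderFailed"}],
        },
        "ProcessPayment": {"Type": "Task", "Next": "ShipOrder"},
        "ShipOrder": {"Type": "Task", "End": True},
        "OrderFailed": {"Type": "Fail"},
    },
}

print(tasks_missing_error_handling(definition))
# {'ProcessPayment': ['Retry', 'Catch'], 'ShipOrder': ['Retry', 'Catch']}
```

A check like this could run in CI against the parsed `order.asl.json`, so that the payment and shipping steps receive the same retry treatment as validation before the workflow ships.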
Core Patterns
Parallel execution
{
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "SendEmail",
      "States": {
        "SendEmail": { "Type": "Task", "Resource": "${SendEmailArn}", "End": true }
      }
    },
    {
      "StartAt": "UpdateInventory",
      "States": {
        "UpdateInventory": { "Type": "Task", "Resource": "${UpdateInventoryArn}", "End": true }
      }
    }
  ],
  "Next": "Complete"
}
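When many branches follow the same single-task shape, the Parallel fragment above can be generated rather than written by hand. This is a minimal sketch: `parallel_state` is a hypothetical helper, and the `${...}` placeholders are the same substitution variables used in the fragment.

```python
import json


def parallel_state(task_resources: dict[str, str], next_state: str) -> dict:
    """Build a Parallel state with one single-Task branch per entry."""
    branches = [
        {
            "StartAt": name,
            "States": {name: {"Type": "Task", "Resource": resource, "End": True}},
        }
        for name, resource in task_resources.items()
    ]
    return {"Type": "Parallel", "Branches": branches, "Next": next_state}


state = parallel_state(
    {"SendEmail": "${SendEmailArn}", "UpdateInventory": "${UpdateInventoryArn}"},
    next_state="Complete",
)
print(json.dumps(state, indent=2))
```

The output of a Parallel state is an array with one element per branch, in the order the branches are listed, so the state named in `Next` should expect a two-element list here.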
Map state for iterating over collections
{
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 10,
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "INLINE" },
    "StartAt": "ProcessSingleOrder",
    "States": {
      "ProcessSingleOrder": {
        "Type": "Task",
        "Resource": "${ProcessOrderArn}",
        "End": true
      }
    }
  },
  "Next": "AllOrdersProcessed"
}
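As a local analogy for what the Map state above does, the sketch below processes each element of `$.orders` with bounded concurrency and returns the results in input order. It is not how Step Functions is implemented; `run_map` and `process_single_order` are hypothetical stand-ins for the Map state and its task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_map(items: list, process: Callable, max_concurrency: int = 0) -> list:
    """Apply `process` to each item with bounded parallelism, keeping input order."""
    # A value of 0 means "no explicit limit"; here it falls back to the executor's default.
    workers = max_concurrency if max_concurrency > 0 else None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, items))


def process_single_order(order: dict) -> dict:
    # Hypothetical stand-in for the ProcessSingleOrder task.
    return {"orderId": order["orderId"], "status": "PROCESSED"}


state_input = {"orders": [{"orderId": "A1"}, {"orderId": "A2"}, {"orderId": "A3"}]}
results = run_map(state_input["orders"], process_single_order, max_concurrency=10)
print(results)
# [{'orderId': 'A1', 'status': 'PROCESSED'}, {'orderId': 'A2', 'status': 'PROCESSED'}, {'orderId': 'A3', 'status': 'PROCESSED'}]
```

The analogy also shows why the payload limit matters for Map: the collected results form a single array, so each iteration should return a small summary or a reference rather than a full record.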
Direct SDK integrations (no Lambda needed)
{
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "Orders",
    "Item": {
      "orderId": { "S.$": "$.orderId" },
      "status": { "S": "COMPLETED" },
      "timestamp": { "S.$": "$$.State.EnteredTime" }
    }
  },
  "Next": "Done"
}
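The `.$` suffix in the fragment above marks a value as a path to resolve at run time rather than a literal, which is easy to get wrong when writing items by hand. The sketch below generates the same state from a flat mapping. Both helper names are hypothetical, only string attributes are handled, and treating every value that starts with `$` as a path is a simplifying assumption.

```python
def dynamodb_string_item(fields: dict[str, str]) -> dict:
    """Convert {"attr": value} into DynamoDB string attributes for Parameters.

    Values starting with "$" are treated as JSONPath or context-object
    references, so their key gets the ".$" suffix; anything else is a literal.
    """
    item = {}
    for name, value in fields.items():
        key = "S.$" if value.startswith("$") else "S"
        item[name] = {key: value}
    return item


def put_item_state(table: str, fields: dict[str, str], next_state: str) -> dict:
    """Build a Task state that calls DynamoDB PutItem directly, with no Lambda."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::dynamodb:putItem",
        "Parameters": {"TableName": table, "Item": dynamodb_string_item(fields)},
        "Next": next_state,
    }


state = put_item_state(
    "Orders",
    {"orderId": "$.orderId", "status": "COMPLETED", "timestamp": "$$.State.EnteredTime"},
    next_state="Done",
)
print(state["Parameters"]["Item"])
# {'orderId': {'S.$': '$.orderId'}, 'status': {'S': 'COMPLETED'}, 'timestamp': {'S.$': '$$.State.EnteredTime'}}
```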
Best Practices
- Use direct SDK integrations (DynamoDB, SQS, SNS, EventBridge) instead of wrapping simple AWS API calls in Lambda functions — it reduces cost, latency, and code to maintain.
- Choose Express Workflows for high-throughput, short-duration work (under 5 minutes) at significantly lower cost; use Standard Workflows when you need exactly-once execution or long-running processes.
- Use `ResultPath` and `OutputPath` to control how each state's output merges with the workflow's data so downstream states receive only what they need.
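How `ResultPath` and `OutputPath` shape the data passed between states can be approximated in a few lines. This is a rough model, not the service's implementation: the function names are hypothetical, only simple dotted paths such as `$.a.b` are supported, and `"OutputPath": null` is not modelled.

```python
import copy


def _split(path: str) -> list[str]:
    """Split a simple path like "$.a.b" into ["a", "b"]; "$" gives []."""
    if path == "$":
        return []
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    return path[2:].split(".")


def apply_result_path(state_input: dict, result, result_path="$"):
    """Combine a state's input with its result the way ResultPath does."""
    if result_path is None:  # "ResultPath": null discards the result
        return copy.deepcopy(state_input)
    keys = _split(result_path)
    if not keys:  # "$" (the default): the result replaces the input
        return result
    output = copy.deepcopy(state_input)
    node = output
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


def apply_output_path(data, output_path="$"):
    """Select the part of the combined data that the next state receives."""
    for key in _split(output_path):
        data = data[key]
    return data


state_input = {"orderId": "A1", "items": ["sku-1"]}
task_result = {"valid": True}

merged = apply_result_path(state_input, task_result, "$.validation")
print(merged)  # {'orderId': 'A1', 'items': ['sku-1'], 'validation': {'valid': True}}
print(apply_output_path(merged, "$.validation"))  # {'valid': True}
print(apply_result_path(state_input, task_result, None))  # {'orderId': 'A1', 'items': ['sku-1']}
```

Writing the result under its own key, as with `$.validation` here, keeps the original input available to later states, while `OutputPath` then narrows what is actually passed on.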
Common Pitfalls
- The maximum state machine payload size is 256 KB — passing large datasets between states will fail; instead, store data in S3 or DynamoDB and pass references.
- Forgetting to add `Retry` and `Catch` blocks leads to entire workflow failures on transient errors; always configure retries with exponential backoff for Task states.