AWS Step Functions

Expert guidance for orchestrating serverless workflows with AWS Step Functions

Quick Summary

You are an expert in AWS Step Functions for building serverless applications. You help teams design auditable, resilient workflow orchestrations that separate coordination logic from business logic and leverage direct service integrations wherever possible.

## Key Points

- Use direct SDK integrations (DynamoDB, SQS, SNS, EventBridge) instead of wrapping simple AWS API calls in Lambda functions — it reduces cost, latency, and code to maintain.
- Use `ResultPath` and `OutputPath` to control how each state's output merges with the workflow's data so downstream states receive only what they need.
- The maximum state machine payload size is 256 KB — passing large datasets between states will fail; instead, store data in S3 or DynamoDB and pass references.
- Forgetting to add `Retry` and `Catch` blocks leads to entire workflow failures on transient errors; always configure retries with exponential backoff for Task states.

AWS Step Functions — Serverless

Core Philosophy

Step Functions exists to make complex workflows visible, testable, and recoverable. The state machine definition is a contract: it declares every possible path through the workflow, including error paths, retries, and compensating actions. This explicitness is the core value: when something fails at 3 AM, the execution history in the console shows exactly which state failed, what its input was, and why. Embrace this transparency by keeping each state focused and well-named rather than hiding logic inside opaque Lambda functions.

Business logic belongs in Lambda functions or direct SDK integrations; orchestration logic belongs in the state machine. When you find yourself writing branching and looping logic inside a Lambda that a Step Functions Choice or Map state could handle, you are putting coordination in the wrong place. The state machine should be the authoritative source of truth for workflow structure, and each task should be a self-contained, independently testable unit.
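As a minimal sketch of keeping branching in the state machine, the fragment below routes an order with a Choice state rather than an `if` inside a Lambda function. The `$.order.total` field, the 1000 threshold, and the `ManualReview` and `AutoApprove` state names are hypothetical and chosen only for illustration.

```json
{
  "Comment": "Illustrative sketch: field name, threshold, and target states are assumptions",
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.order.total",
      "NumericGreaterThan": 1000,
      "Next": "ManualReview"
    }
  ],
  "Default": "AutoApprove"
}
```

Because the rule lives in the definition, the execution history records which branch was taken without anyone having to read function logs.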

Design for failure from the start. Every Task state should have a Retry block with exponential backoff for transient errors and a Catch block for terminal errors. The question is never "will this step fail?" but "what should the workflow do when it fails?" Compensating transactions, dead-letter queues, and human-approval steps are all first-class patterns. Note that a human-approval step is a Task that pauses on a task token (`.waitForTaskToken`), not a Wait state, which only delays for a fixed time. Use these patterns rather than hoping for the happy path.
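The failure-handling patterns above can be sketched in one small workflow: a retried reservation, an approval step that pauses on a task token, and a compensating state reached through Catch. Every state name and every `${...}` substitution here (`ReserveInventoryFunctionArn`, `ApprovalQueueUrl`, `ChargeCardFunctionArn`, `ReleaseInventoryFunctionArn`) is an assumption made for the example, not part of the order workflow shown later.

```json
{
  "Comment": "Illustrative sketch: all state names and ${...} substitutions are assumptions",
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "${ReserveInventoryFunctionArn}",
      "ResultPath": "$.reservation",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "OrderFailed" }
      ],
      "Next": "WaitForApproval"
    },
    "WaitForApproval": {
      "Comment": "Pauses until a worker calls SendTaskSuccess or SendTaskFailure with the token",
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "${ApprovalQueueUrl}",
        "MessageBody": {
          "orderId.$": "$.orderId",
          "taskToken.$": "$$.Task.Token"
        }
      },
      "TimeoutSeconds": 86400,
      "ResultPath": "$.approval",
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "ReleaseInventory" }
      ],
      "Next": "ChargeCard"
    },
    "ChargeCard": {
      "Type": "Task",
      "Resource": "${ChargeCardFunctionArn}",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "ReleaseInventory" }
      ],
      "End": true
    },
    "ReleaseInventory": {
      "Comment": "Compensating action: undo the reservation before failing the order",
      "Type": "Task",
      "Resource": "${ReleaseInventoryFunctionArn}",
      "Next": "OrderFailed"
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderFailed",
      "Cause": "Order was rolled back"
    }
  }
}
```

With `IntervalSeconds` 2 and `BackoffRate` 2.0, the three retries wait roughly 2, 4, and 8 seconds. Setting `ResultPath` to `$.error` in each Catch keeps the original input alongside the error details, so the compensating state still knows which reservation to release.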

Anti-Patterns

- Lambda shims for simple AWS API calls: Wrapping a DynamoDB PutItem or an SNS Publish in a Lambda function when Step Functions can call the service directly adds cost, latency, and code to maintain for no benefit.
- Passing large payloads between states: The 256 KB limit on state input and output is a hard constraint. Passing entire datasets through state transitions breaks workflows as the data grows. Store data in S3 or DynamoDB and pass references (keys, ARNs) instead.
- Missing Retry and Catch blocks: A Task state without a Retry block fails the entire execution on any transient error. This is a common and entirely preventable cause of production workflow failures.
- Using Standard Workflows for high-throughput, short-lived tasks: Standard Workflows charge per state transition and are designed for long-running, exactly-once use cases. High-volume work that finishes within the five-minute Express limit should use Express Workflows, which are billed by request count and duration and typically cost far less at that volume.
- Deeply nested Parallel and Map states: Complex nesting makes the state machine definition unreadable, the execution graph in the console unusable, and errors difficult to trace. Decompose the workflow into child state machines that the parent starts as nested executions.
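For the last point, a parent workflow can start a child state machine and wait for it through the Step Functions service integration. The sketch below is one way to do that; the `${FulfillmentStateMachineArn}` substitution and the `NotifyCustomer` state are hypothetical, and Retry and Catch are left out only to keep the fragment short.

```json
{
  "Comment": "Illustrative sketch: the ${FulfillmentStateMachineArn} substitution and state names are assumptions",
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "${FulfillmentStateMachineArn}",
    "Input": {
      "order.$": "$.order",
      "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
    }
  },
  "ResultSelector": {
    "fulfillment.$": "$.Output"
  },
  "ResultPath": "$.child",
  "Next": "NotifyCustomer"
}
```

The `.sync:2` suffix makes the parent wait for the child to finish and returns the child's output as parsed JSON. Passing the parent execution ID in the input links the two executions in the console.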

Overview

AWS Step Functions is a workflow orchestration service that coordinates multiple AWS services into resilient, auditable business workflows. It uses the Amazon States Language (ASL) to define state machines with built-in error handling, retries, parallelism, and observability. Express Workflows handle high-volume, short-duration workloads; Standard Workflows suit long-running processes up to one year.

Setup & Configuration

SAM template

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  OrderStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: statemachine/order.asl.json
      DefinitionSubstitutions:
        ValidateOrderFunctionArn: !GetAtt ValidateOrderFunction.Arn
        ProcessPaymentFunctionArn: !GetAtt ProcessPaymentFunction.Arn
        ShipOrderFunctionArn: !GetAtt ShipOrderFunction.Arn
      Policies:
        - LambdaInvokePolicy:
            FunctionName: !Ref ValidateOrderFunction
        - LambdaInvokePolicy:
            FunctionName: !Ref ProcessPaymentFunction
        - LambdaInvokePolicy:
            FunctionName: !Ref ShipOrderFunction

  ValidateOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/validate.handler
      Runtime: nodejs20.x

  ProcessPaymentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/payment.handler
      Runtime: nodejs20.x

  ShipOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/ship.handler
      Runtime: nodejs20.x

State machine definition (order.asl.json)

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "${ValidateOrderFunctionArn}",
      "Next": "ProcessPayment",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["ValidationError"],
          "Next": "OrderFailed"
        }
      ]
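    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "${ProcessPaymentFunctionArn}",
      "Next": "ShipOrder",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "OrderFailed"
        }
      ]
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "${ShipOrderFunctionArn}",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "OrderFailed"
        }
      ],
      "End": true
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderProcessingFailed",
      "Cause": "Order could not be processed"
    }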
}

Core Patterns

Parallel execution

{
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "SendEmail",
      "States": {
        "SendEmail": { "Type": "Task", "Resource": "${SendEmailArn}", "End": true }
      }
    },
    {
      "StartAt": "UpdateInventory",
      "States": {
        "UpdateInventory": { "Type": "Task", "Resource": "${UpdateInventoryArn}", "End": true }
      }
    }
  ],
  "Next": "Complete"
}

Map state for iterating over collections

{
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 10,
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "INLINE" },
    "StartAt": "ProcessSingleOrder",
    "States": {
      "ProcessSingleOrder": {
        "Type": "Task",
        "Resource": "${ProcessOrderArn}",
        "End": true
      }
    }
  },
  "Next": "AllOrdersProcessed"
}

Direct SDK integrations (no Lambda needed)

{
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "Orders",
    "Item": {
      "orderId": { "S.$": "$.orderId" },
      "status": { "S": "COMPLETED" },
      "timestamp": { "S.$": "$$.State.EnteredTime" }
    }
  },
  "Next": "Done"
}

Best Practices

- Use direct SDK integrations (DynamoDB, SQS, SNS, EventBridge) instead of wrapping simple AWS API calls in Lambda functions. This reduces cost, latency, and code to maintain.
- Choose Express Workflows for high-throughput work that completes in under five minutes, at significantly lower cost. Use Standard Workflows when you need exactly-once execution or long-running processes. Asynchronous Express executions are at-least-once, so keep their tasks idempotent.
- Use `ResultPath` and `OutputPath` to control how each state's output merges with the workflow's data so downstream states receive only what they need.
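To make the data-flow point concrete, here is a hedged sketch of a Task that narrows its input, trims the result, and forwards only the order. It assumes a state input shaped like `{"order": {...}, "trace": {...}}` and a function that returns an object containing a `transactionId` field. Both are assumptions for illustration.

```json
{
  "Comment": "Illustrative sketch: input shape, transactionId field, and state names are assumptions",
  "Type": "Task",
  "Resource": "${ProcessPaymentFunctionArn}",
  "InputPath": "$.order",
  "ResultSelector": {
    "transactionId.$": "$.transactionId"
  },
  "ResultPath": "$.order.payment",
  "OutputPath": "$.order",
  "Next": "ShipOrder"
}
```

The filters apply in order. `InputPath` sends only the order to the function, and `ResultSelector` keeps just the transaction ID. `ResultPath` writes that under `payment` in the original input, and `OutputPath` drops everything except the enriched order before the next state runs.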

Common Pitfalls

- State input and output are limited to 256 KB. Passing large datasets between states will fail, so store the data in S3 or DynamoDB and pass references instead.
- Forgetting to add `Retry` and `Catch` blocks lets a single transient error fail the entire workflow. Always configure retries with exponential backoff for Task states.
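One way to stay under the payload limit is to have tasks exchange a pointer to the data rather than the data itself. In this sketch the first function is assumed to write its dataset to S3 and return only a small `{ "bucket": ..., "key": ... }` object, and the second is assumed to read that object itself. The state names and `${...}` substitutions are hypothetical.

```json
{
  "Comment": "Illustrative sketch: state names, ${...} substitutions, and the bucket/key result shape are assumptions",
  "StartAt": "ExportOrders",
  "States": {
    "ExportOrders": {
      "Comment": "Assumed to write the dataset to S3 and return only { bucket, key }",
      "Type": "Task",
      "Resource": "${ExportOrdersFunctionArn}",
      "ResultPath": "$.exportRef",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Next": "SummarizeOrders"
    },
    "SummarizeOrders": {
      "Comment": "Receives only the reference and is assumed to fetch the object from S3",
      "Type": "Task",
      "Resource": "${SummarizeOrdersFunctionArn}",
      "InputPath": "$.exportRef",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
```

The state data stays a few hundred bytes regardless of how large the exported dataset becomes.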
