
Ecs Fargate

AWS ECS and Fargate for running containerized applications without managing servers


You are an expert in Amazon ECS (Elastic Container Service) and AWS Fargate for deploying and orchestrating containerized workloads.

## Key Points

- **Deploying with the `latest` image tag** -- Rolling back is unreliable when every revision points to `latest`, because the tag can resolve to a different image by the time an old revision is redeployed. Tag images with a build identifier so deployments are reproducible.
- **Running Fargate tasks in public subnets with public IPs** -- Place tasks in private subnets behind an ALB. Assign public IPs only when there is no NAT Gateway and the task needs internet access.
- **Use Fargate** unless you need GPU instances, specific instance types, or EC2 pricing advantages at scale.
- **Separate execution role and task role**: Execution role pulls images and writes logs. Task role grants permissions your application code needs (S3, DynamoDB, etc.).
- **Store secrets in Secrets Manager or SSM Parameter Store**, referenced from the `secrets` section of the task definition. Never put secret values in plaintext `environment` entries or bake them into images.
- **Enable deployment circuit breaker** with rollback to automatically revert failed deployments.
- **Use `awsvpc` network mode** (required for Fargate) and place tasks in private subnets behind an ALB.
- **Set CPU and memory realistically**. Over-provisioning wastes money. Use CloudWatch Container Insights to right-size.
- **Use Fargate Spot** for fault-tolerant workloads (batch jobs, workers) to save up to 70%.
- **Tag images with git SHA or build ID**, not just `latest`, so deployments are traceable and reproducible.
- **Health check grace period too short**: New tasks get killed before they finish starting. Set `healthCheckGracePeriodSeconds` on the service.
- **Task role vs execution role confusion**: If your app can't access S3, check the task role. If tasks fail to start (can't pull image), check the execution role.

## Quick Example

```bash
aws ecs create-cluster --cluster-name my-app --capacity-providers FARGATE FARGATE_SPOT
```

```bash
aws ecs register-task-definition --cli-input-json file://task-definition.json
```


# AWS ECS & Fargate: Cloud Services


## Core Philosophy

Containers should be immutable, stateless, and replaceable. ECS Fargate embodies this principle: you define what to run (the task definition), how many to run (desired count), and how to route traffic (the load balancer), then let the platform handle placement, scaling, and recovery. If a task fails, ECS replaces it. If traffic spikes, auto-scaling adds more. Your job is to build a container that starts fast, serves requests, and shuts down cleanly on SIGTERM.
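Clean shutdown on SIGTERM can be sketched as a small entrypoint wrapper. This is a minimal sketch, assuming the image ships a POSIX shell and the application is a single long-running process; the function name `run_with_graceful_shutdown` is hypothetical. When ECS stops a task it sends SIGTERM and follows up with SIGKILL once the container's `stopTimeout` (30 seconds by default) has elapsed, so the application has to finish its in-flight work inside that window.

```shell
#!/bin/sh
# Hypothetical entrypoint helper: runs the given command as a child process
# and forwards SIGTERM to it, then returns the child's exit status.
run_with_graceful_shutdown() {
  got_term=0
  "$@" &
  child=$!

  # On SIGTERM, record that it happened and pass the signal on to the child.
  trap 'got_term=1; kill -TERM "$child" 2>/dev/null' TERM

  wait "$child"
  status=$?

  # A trapped signal interrupts `wait` before the child has exited, so wait
  # a second time to collect the child's real exit status.
  if [ "$got_term" -eq 1 ]; then
    wait "$child"
    status=$?
  fi

  trap - TERM
  return "$status"
}
```

In an entrypoint script the last line would be something like `run_with_graceful_shutdown "$@"`. If the wrapper has no cleanup of its own, `exec "$@"` is simpler: the application replaces the shell and receives SIGTERM directly.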

Separate concerns between execution roles and task roles. The execution role is for ECS infrastructure operations -- pulling images from ECR and writing logs to CloudWatch. The task role is for your application code -- accessing S3, DynamoDB, or other AWS services. Merging these into one over-permissioned role violates least privilege and makes it harder to audit what your application actually needs versus what the platform needs.
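The split between the two roles can be sketched with the IAM CLI. This is a minimal sketch rather than a hardened setup: the function name `create_ecs_roles` is hypothetical, the role names are chosen to match the ARNs used in the task definition in this document, and the task role is deliberately left without permissions so that only what the application needs gets attached later.

```shell
# Hypothetical helper: creates one role for ECS itself and one for the app.
create_ecs_roles() {
  # Both roles are assumed by the ECS tasks service principal.
  trust_policy='{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ecs-tasks.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

  # Execution role: image pulls from ECR and log delivery to CloudWatch Logs.
  aws iam create-role \
    --role-name ecsTaskExecutionRole \
    --assume-role-policy-document "$trust_policy" || return 1
  aws iam attach-role-policy \
    --role-name ecsTaskExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy || return 1

  # Task role: starts empty; attach only the policies the application code needs.
  aws iam create-role \
    --role-name ecsTaskRole \
    --assume-role-policy-document "$trust_policy" || return 1
}
```

If the task definition references Secrets Manager secrets or SSM parameters, the execution role additionally needs read access to those specific resources, because the managed policy attached here does not grant it.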

Deployments should be safe by default. Enable the deployment circuit breaker with automatic rollback so that a bad image does not take down your entire service. Tag images with the git SHA or build ID, not just `latest`, so every deployment is traceable to a specific commit. Use ECS Exec to debug live containers; SSH is not an option on Fargate because there is no host you can log in to.

## Anti-Patterns

- **Using a single over-permissioned IAM role for both execution and task** -- This grants your application code permissions it does not need (like pulling ECR images) and makes security audits meaningless.
- **Deploying with the `latest` image tag** -- Rolling back is unreliable when every revision points to `latest`, because the tag can resolve to a different image by the time an old revision is redeployed. Tag images with a build identifier so deployments are reproducible.
- **Setting the health check grace period too short** -- New tasks get killed before they finish starting, causing a loop of failed deployments. Set `healthCheckGracePeriodSeconds` based on actual startup time.
- **Embedding secrets as plaintext environment variables in the task definition** -- Use Secrets Manager or SSM Parameter Store references. Plaintext secrets are visible in the console and API responses.
- **Running Fargate tasks in public subnets with public IPs** -- Place tasks in private subnets behind an ALB. Assign public IPs only when there is no NAT Gateway and the task needs internet access.

## Overview

ECS is AWS's container orchestration service. It runs Docker containers as tasks defined by task definitions, organized into services within clusters. Fargate is the serverless compute engine for ECS that eliminates the need to manage EC2 instances. ECS integrates with ALB for load balancing, ECR for container images, CloudWatch for logging, and IAM for fine-grained permissions.

## Setup & Configuration

### Create a Cluster

```bash
aws ecs create-cluster --cluster-name my-app --capacity-providers FARGATE FARGATE_SPOT
```

### Push Image to ECR

```bash
# Create repository
aws ecr create-repository --repository-name my-app

# Authenticate Docker to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build, tag with the current git commit (run inside the git checkout), and push
TAG=$(git rev-parse --short HEAD)
docker build -t my-app:"$TAG" .
docker tag my-app:"$TAG" 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:"$TAG"
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:"$TAG"
```

### Register a Task Definition

```json
{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:a1b2c3d",
      "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app"
        }
      },
      "environment": [
        {"name": "NODE_ENV", "value": "production"}
      ],
      "secrets": [
        {"name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password"}
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```

The image tag `a1b2c3d` is a placeholder for the git SHA or build ID of the image you pushed. The container health check assumes `curl` is installed in the image.

```bash
aws ecs register-task-definition --cli-input-json file://task-definition.json
```
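Fargate accepts only certain pairings of the task-level `cpu` and `memory` values, and registration fails on an unsupported pair. The check below is a sketch of the Linux size table as documented at the time of writing (CPU units and MiB); the function name `is_valid_fargate_size` is hypothetical, and the table should be confirmed against current AWS documentation before relying on it.

```shell
# Hypothetical helper: exits 0 if the cpu/memory pair is in the Fargate size
# table encoded below, non-zero otherwise. Values are CPU units and MiB.
is_valid_fargate_size() {
  cpu=$1
  mem=$2
  case "$cpu" in
    256)
      # The smallest size allows three fixed memory values only.
      case "$mem" in
        512|1024|2048) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    512)   min=1024;  max=4096;   step=1024 ;;
    1024)  min=2048;  max=8192;   step=1024 ;;
    2048)  min=4096;  max=16384;  step=1024 ;;
    4096)  min=8192;  max=30720;  step=1024 ;;
    8192)  min=16384; max=61440;  step=4096 ;;
    16384) min=32768; max=122880; step=8192 ;;
    *) return 1 ;;
  esac
  # Memory must lie within the range and on one of its increments.
  [ "$mem" -ge "$min" ] && [ "$mem" -le "$max" ] && [ $(( (mem - min) % step )) -eq 0 ]
}

# The pair used in the task definition in this document.
is_valid_fargate_size 512 1024 && echo "512/1024 is a valid Fargate size"
```

Running a check like this in CI before `register-task-definition` surfaces a bad pairing earlier than the API error would.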

## Core Patterns

### Create a Service with ALB

```bash
# The 60-second grace period is a placeholder; set it from the app's measured startup time
aws ecs create-service \
  --cluster my-app \
  --service-name my-app-service \
  --task-definition my-app:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --health-check-grace-period-seconds 60 \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-aaa", "subnet-bbb"],
      "securityGroups": ["sg-12345"],
      "assignPublicIp": "DISABLED"
    }
  }' \
  --load-balancers '[{
    "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/abc123",
    "containerName": "app",
    "containerPort": 8080
  }]'
```

### Auto Scaling

```bash
# Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/my-app/my-app-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

# Target tracking on CPU
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/my-app/my-app-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
```

### Rolling Deployment

```bash
# Update service to use new task definition revision
aws ecs update-service \
  --cluster my-app \
  --service my-app-service \
  --task-definition my-app:2 \
  --deployment-configuration '{
    "maximumPercent": 200,
    "minimumHealthyPercent": 100,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  }'

# Wait for deployment to stabilize
aws ecs wait services-stable --cluster my-app --services my-app-service
```
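To see what `maximumPercent` and `minimumHealthyPercent` mean for a concrete desired count, the arithmetic can be sketched as follows. ECS rounds the minimum-healthy bound up and the maximum bound down to whole tasks; the function name `deployment_bounds` is hypothetical.

```shell
# Hypothetical helper: prints how many tasks must stay running and how many
# may exist in total while a rolling deployment is in progress.
deployment_bounds() {
  desired=$1
  min_pct=$2
  max_pct=$3
  # minimumHealthyPercent of the desired count, rounded up to a whole task.
  min_tasks=$(( (desired * min_pct + 99) / 100 ))
  # maximumPercent of the desired count, rounded down to a whole task.
  max_tasks=$(( desired * max_pct / 100 ))
  echo "min_running=$min_tasks max_total=$max_tasks"
}

# The values used in the update-service call above, with a desired count of 2.
deployment_bounds 2 100 200   # prints: min_running=2 max_total=4
```

With these settings ECS starts two new tasks alongside the two old ones and stops the old tasks only after the new ones are healthy, which costs temporary extra capacity in exchange for no loss of serving capacity during the rollout.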

### Run a One-Off Task

```bash
aws ecs run-task \
  --cluster my-app \
  --task-definition my-app-migration:1 \
  --launch-type FARGATE \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-aaa"],
      "securityGroups": ["sg-12345"],
      "assignPublicIp": "DISABLED"
    }
  }' \
  --overrides '{
    "containerOverrides": [{
      "name": "app",
      "command": ["python", "manage.py", "migrate"]
    }]
  }'
```
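`run-task` returns as soon as the task is accepted, not when it finishes, so a pipeline that runs a migration usually needs to wait and then inspect the container's exit code. The following is a sketch under the assumption that the task has a single container of interest (the first one); the function name `wait_for_task_exit_code` is hypothetical.

```shell
# Hypothetical helper: blocks until the task stops, then returns the exit code
# of the task's first container so a CI job can fail when the command fails.
wait_for_task_exit_code() {
  cluster=$1
  task_arn=$2

  # Polls until the task reaches STOPPED; fails if the waiter gives up first.
  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task_arn" || return 1

  exit_code=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text)

  # A non-numeric value (for example "None") means no exit code was recorded,
  # which usually indicates the container never started.
  case "$exit_code" in
    ''|*[!0-9]*) return 1 ;;
  esac
  return "$exit_code"
}
```

The task ARN can be captured from the `run-task` call by adding `--query 'tasks[0].taskArn' --output text`, and then passed in as `wait_for_task_exit_code my-app "$TASK_ARN"`.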

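### ECS Exec (Interactive Shell)

```bash
# Enable ECS Exec on the service; only tasks started after this change support it,
# so force a new deployment to replace the running tasks
aws ecs update-service \
  --cluster my-app \
  --service my-app-service \
  --enable-execute-command \
  --force-new-deployment

# Start an interactive shell (requires the Session Manager plugin for the AWS CLI
# and ssmmessages permissions on the task role)
aws ecs execute-command \
  --cluster my-app \
  --task arn:aws:ecs:us-east-1:123456789012:task/my-app/abc123 \
  --container app \
  --interactive \
  --command "/bin/sh"
```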

## Best Practices

- **Use Fargate** unless you need GPU instances, specific instance types, or EC2 pricing advantages at scale.
- **Separate execution role and task role**: Execution role pulls images and writes logs. Task role grants permissions your application code needs (S3, DynamoDB, etc.).
- **Store secrets in Secrets Manager or SSM Parameter Store**, referenced from the `secrets` section of the task definition. Never put secret values in plaintext `environment` entries or bake them into images.
- **Enable deployment circuit breaker** with rollback to automatically revert failed deployments.
- **Use `awsvpc` network mode** (required for Fargate) and place tasks in private subnets behind an ALB.
- **Set CPU and memory realistically**. Over-provisioning wastes money. Use CloudWatch Container Insights to right-size.
- **Use Fargate Spot** for fault-tolerant workloads (batch jobs, workers) to save up to 70%.
- **Tag images with git SHA or build ID**, not just `latest`, so deployments are traceable and reproducible.
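The Fargate Spot recommendation can be sketched as a capacity provider strategy on the service. This is an illustration only: the function name `create_spot_worker_service` and the weights are hypothetical, the subnet and security group IDs are the placeholders used elsewhere in this document, and the cluster must already have the `FARGATE` and `FARGATE_SPOT` capacity providers associated. A capacity provider strategy replaces `--launch-type`; the two cannot be combined.

```shell
# Hypothetical helper: creates a worker service that keeps one task on regular
# Fargate and spreads the rest 1:3 between Fargate and Fargate Spot.
create_spot_worker_service() {
  cluster=$1
  service=$2
  task_def=$3
  count=$4

  aws ecs create-service \
    --cluster "$cluster" \
    --service-name "$service" \
    --task-definition "$task_def" \
    --desired-count "$count" \
    --capacity-provider-strategy \
      capacityProvider=FARGATE,weight=1,base=1 \
      capacityProvider=FARGATE_SPOT,weight=3 \
    --network-configuration '{
      "awsvpcConfiguration": {
        "subnets": ["subnet-aaa", "subnet-bbb"],
        "securityGroups": ["sg-12345"],
        "assignPublicIp": "DISABLED"
      }
    }'
}
```

A call might look like `create_spot_worker_service my-app my-workers my-worker:1 4`, where the service and task definition names are made up for the example. Spot tasks receive SIGTERM with a two-minute warning before they are reclaimed, so this only suits workloads that can stop and resume cleanly.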

## Common Pitfalls

- **Health check grace period too short**: New tasks get killed before they finish starting. Set `healthCheckGracePeriodSeconds` on the service.
- **Task role vs execution role confusion**: If your app can't access S3, check the task role. If tasks fail to start (can't pull image), check the execution role.
- **Insufficient task memory**: Containers are killed with exit code 137 (SIGKILL, typically from the out-of-memory killer). Increase `memory` in the task definition.
- **Security group misconfiguration**: The task's security group must allow inbound traffic on the container port from the ALB security group. For an internet-facing ALB, the ALB security group must allow inbound traffic from the internet on the listener port.
- **ECR image pull failures**: Ensure the VPC has a NAT Gateway or VPC endpoints for ECR (`com.amazonaws.<region>.ecr.dkr`, `com.amazonaws.<region>.ecr.api`, and the S3 gateway endpoint). Without a NAT Gateway, the `awslogs` driver also needs the CloudWatch Logs endpoint (`com.amazonaws.<region>.logs`).
- **Log group not created**: If using `awslogs`, the CloudWatch log group must exist before the task starts, or the task will fail. Create it in advance or use `"awslogs-create-group": "true"`, which requires `logs:CreateLogGroup` on the execution role.
- **Ignoring `stoppedReason`**: When tasks fail to start, run `aws ecs describe-tasks` and check `stoppedReason` and `containers[].reason` for the root cause.
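The `stoppedReason` check can be sketched as a helper that lists recently stopped tasks and prints only the fields that explain the failure. The function name `show_stopped_reasons` is hypothetical. ECS keeps stopped tasks visible only for a limited time, so run it soon after the failure.

```shell
# Hypothetical helper: prints stoppedReason and per-container exit details
# for the stopped tasks that ECS still lists in the given cluster.
show_stopped_reasons() {
  cluster=$1

  task_arns=$(aws ecs list-tasks --cluster "$cluster" --desired-status STOPPED \
    --query 'taskArns' --output text)

  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    echo "no stopped tasks found in cluster $cluster"
    return 0
  fi

  # $task_arns is intentionally unquoted so each ARN becomes its own argument.
  # describe-tasks accepts at most 100 tasks per call.
  aws ecs describe-tasks --cluster "$cluster" --tasks $task_arns \
    --query 'tasks[].{task:taskArn,stopped:stoppedReason,containers:containers[].{name:name,exit:exitCode,reason:reason}}' \
    --output json
}
```

Calling `show_stopped_reasons my-app` typically surfaces image pull errors, missing log groups, and out-of-memory kills without digging through the console.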
