Debugging ECS Containers with ECS Exec

ECS Exec uses AWS Systems Manager (SSM) to open an interactive shell session inside a running Fargate container. This is the primary way to debug issues on infrastructure containers like Nessie or OEM code-location tasks.

Prerequisites

Install the Session Manager plugin for the AWS CLI:

# macOS
brew install --cask session-manager-plugin

# Verify
session-manager-plugin --version

You also need AWS CLI v2 configured with credentials that have permission to call ecs:ExecuteCommand on the target cluster.

Connecting to Nessie

1. Find the running task

aws ecs list-tasks \
--cluster dagster-hybrid-agent-AgentCluster \
--service-name nessie \
--region us-east-1 \
--query 'taskArns[0]' \
--output text

This returns a task ARN like arn:aws:ecs:us-east-1:999655274916:task/dagster-hybrid-agent-AgentCluster/abc123.

2. Start an interactive session

aws ecs execute-command \
--cluster dagster-hybrid-agent-AgentCluster \
--task <task-id> \
--container nessie \
--interactive \
--command "/bin/sh" \
--region us-east-1

Replace <task-id> with the full task ARN or just the ID portion (abc123).
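
Steps 1 and 2 can be combined into a small helper. The following is a minimal sketch, assuming a POSIX-compatible shell. The function names task_id_from_arn and ecs_shell are illustrative and are not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

CLUSTER="dagster-hybrid-agent-AgentCluster"
REGION="us-east-1"

# Print the ID portion of a task ARN (everything after the last "/").
# A bare task ID is printed unchanged.
task_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Look up the first task of a service and open a shell in it.
# $1 = service name, $2 = container name
ecs_shell() {
  task_arn=$(aws ecs list-tasks \
    --cluster "$CLUSTER" \
    --service-name "$1" \
    --region "$REGION" \
    --query 'taskArns[0]' \
    --output text) || return 1

  # With --output text, a null query result is printed as "None".
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no running task found for service $1" >&2
    return 1
  fi

  aws ecs execute-command \
    --cluster "$CLUSTER" \
    --task "$(task_id_from_arn "$task_arn")" \
    --container "$2" \
    --interactive \
    --command "/bin/sh" \
    --region "$REGION"
}

# Usage: ecs_shell nessie nessie
```

The same helper works for other services in the cluster by passing a different service and container name.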

3. Common debugging commands

Once inside the container:

# Check Nessie health
curl -s http://localhost:9000/q/health/ready | python3 -m json.tool

# Check Nessie API version
curl -s http://localhost:19120/api/v2/config

# List Iceberg namespaces
curl -s http://localhost:19120/iceberg/v1/namespaces

# Check environment variables (output is NOT redacted and may contain secrets)
env | grep -i nessie | sort

# Check memory usage of PID 1 (the JVM, assuming Nessie runs as PID 1)
grep -i vm /proc/1/status

# Check disk usage
df -h

# Look for local log files, newest first (if not using awslogs exclusively)
ls -lt /tmp/

Connecting to OEM Code-Location Containers

OEM code-location containers are launched by the Dagster agent as separate ECS tasks. They run in the same cluster but as distinct services.

Find the service name

aws ecs list-services \
--cluster dagster-hybrid-agent-AgentCluster \
--region us-east-1 \
--query 'serviceArns[*]' \
--output table

Connect

# List tasks for the code location
aws ecs list-tasks \
--cluster dagster-hybrid-agent-AgentCluster \
--service-name <service-name> \
--region us-east-1

# Connect
aws ecs execute-command \
--cluster dagster-hybrid-agent-AgentCluster \
--task <task-id> \
--container <container-name> \
--interactive \
--command "/bin/sh" \
--region us-east-1

Note: ECS Exec must be enabled on the ECS service for the target container. The Nessie service has this enabled via Terraform (enable_execute_command = true). Dagster agent-managed code-location services may not have it enabled by default; check the CloudFormation template or Dagster agent configuration.
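
To see whether a given service has ECS Exec enabled before trying to connect, the service description can be queried. This is a minimal sketch; the function name service_exec_enabled is illustrative, and it relies on the enableExecuteCommand field returned by aws ecs describe-services.

```shell
# Hypothetical helper; the name is illustrative only.
# $1 = service name
# Returns 0 if ECS Exec is enabled on the service, 1 if not,
# and 2 if the AWS CLI call itself failed.
service_exec_enabled() {
  value=$(aws ecs describe-services \
    --cluster dagster-hybrid-agent-AgentCluster \
    --services "$1" \
    --region us-east-1 \
    --query 'services[0].enableExecuteCommand' \
    --output text) || return 2

  # Compare case-insensitively rather than assuming an exact rendering.
  case "$value" in
    [Tt]rue) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage:
#   if service_exec_enabled <service-name>; then echo "enabled"; else echo "not enabled"; fi
```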

Troubleshooting

"The execute command failed"

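The SSM agent inside the container needs outbound HTTPS (port 443) to reach the Systems Manager (ssmmessages) endpoints. Nessie tasks already allow this. If a different container fails, check its security group egress rules and, for subnets without internet access, confirm that a VPC endpoint for ssmmessages is available.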

"TargetNotConnectedException"

The SSM agent hasn't started yet or the task is still initializing. Wait 30-60 seconds after the task enters RUNNING state and retry.
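
Instead of retrying by hand, the agent status reported by aws ecs describe-tasks can be polled. The sketch below assumes the agent appears under managedAgents with the name ExecuteCommandAgent; the function name wait_for_exec_agent and its defaults (12 attempts, 5 seconds apart) are illustrative choices.

```shell
# Hypothetical helper; the name and defaults are illustrative only.
# Poll until the ECS Exec agent in a container reports RUNNING.
# $1 = task ID or ARN, $2 = container name,
# $3 = max attempts (default 12), $4 = seconds between attempts (default 5)
wait_for_exec_agent() {
  attempts=${3:-12}
  delay=${4:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(aws ecs describe-tasks \
      --cluster dagster-hybrid-agent-AgentCluster \
      --tasks "$1" \
      --region us-east-1 \
      --query "tasks[0].containers[?name=='$2'].managedAgents[] | [?name=='ExecuteCommandAgent'].lastStatus | [0]" \
      --output text)
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
    echo "agent status: ${status:-unknown}; retrying in ${delay}s" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: wait_for_exec_agent <task-id> nessie && echo "ready for execute-command"
```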

"An error occurred (InvalidParameterException)"

Verify that enable_execute_command is true on the ECS service. For Nessie this is set in deployments/aws/terraform/solutions/dagster-agent/nessie_ecs.tf. Only tasks launched after the setting is enabled can be targeted, so changing it requires a new deployment of the service (Terraform apply triggers a service update, which rolls out new tasks with the SSM agent available inside the container).
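
Because the setting is recorded per task at launch, it can help to list any tasks in a service that were started without it. This is a sketch under the assumption that the enableExecuteCommand field of aws ecs describe-tasks reflects the launch-time setting; the function name tasks_without_exec is illustrative.

```shell
# Hypothetical helper; the name is illustrative only.
# $1 = service name
# Prints the ARNs of tasks in the service that were launched without ECS Exec.
tasks_without_exec() {
  arns=$(aws ecs list-tasks \
    --cluster dagster-hybrid-agent-AgentCluster \
    --service-name "$1" \
    --region us-east-1 \
    --query 'taskArns[]' \
    --output text) || return 2

  # Nothing to check if the service has no tasks.
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    return 0
  fi

  # Word splitting of $arns is intentional: --tasks takes a list of ARNs.
  # shellcheck disable=SC2086
  aws ecs describe-tasks \
    --cluster dagster-hybrid-agent-AgentCluster \
    --tasks $arns \
    --region us-east-1 \
    --query 'tasks[?enableExecuteCommand==`false`].taskArn' \
    --output text
}

# Usage: tasks_without_exec nessie
```

Any ARN printed here belongs to a task that needs to be replaced before execute-command can reach it.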

IAM permissions

The task role (not the execution role) must allow the ssmmessages actions listed below. For Nessie, these are granted by the nessie-ssm-exec policy in deployments/aws/terraform/solutions/dagster-agent/nessie_iam.tf.

Required actions:

ssmmessages:CreateControlChannel
ssmmessages:CreateDataChannel
ssmmessages:OpenControlChannel
ssmmessages:OpenDataChannel

The caller (your IAM user/role) needs:

ecs:ExecuteCommand
ecs:DescribeTasks
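
One way to check the caller-side permissions without opening a session is the IAM policy simulator. The sketch below is an approximation under several assumptions: the caller is allowed to run iam:SimulatePrincipalPolicy, the ARN passed in is an IAM user or role ARN (not an assumed-role session ARN), and evaluating the actions against all resources is close enough to the real cluster-scoped check. The function name caller_can_exec is illustrative.

```shell
# Hypothetical helper; the name is illustrative only.
# $1 = ARN of the IAM user or role to evaluate
# Prints one "action<TAB>decision" line per action and returns 0 only if
# every action is reported as "allowed".
caller_can_exec() {
  results=$(aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names ecs:ExecuteCommand ecs:DescribeTasks \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text) || return 2

  printf '%s\n' "$results"

  # Any decision other than "allowed" (or an empty result) counts as a failure.
  printf '%s\n' "$results" | awk '$2 != "allowed" { bad = 1 } END { exit bad + 0 }'
}

# Usage: caller_can_exec arn:aws:iam::<account-id>:role/<role-name>
```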