
Troubleshooting staRburst

This guide helps you diagnose and fix common issues with staRburst.

Accessing Logs

CloudWatch Logs Structure

staRburst automatically sends worker logs to CloudWatch Logs:

  • Log Group: /aws/ecs/starburst-worker
  • Log Stream Pattern: starburst/<task-id>
  • Retention: 7 days (configurable)

Viewing Logs in R

# For ephemeral mode
library(starburst)
plan(starburst, workers = 10)

# Check logs for a specific task
# (get task ID from error messages or futures)

# For detached sessions
session <- starburst_session_attach("session-id")
status <- session$status()

# View failed task logs using AWS CLI or console
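
If you prefer to stay in R, you can pull a worker's log stream directly with the paws SDK. The sketch below is a minimal example, assuming paws.management is installed; the task ID is a placeholder you substitute from the error message or session status.

# Sketch: fetch recent ERROR lines for one worker task (assumes paws.management)
library(paws.management)
logs <- paws.management::cloudwatchlogs(config = list(region = "us-east-1"))
events <- logs$filter_log_events(
  logGroupName = "/aws/ecs/starburst-worker",
  logStreamNamePrefix = "starburst/<task-id>",  # substitute the real task ID
  filterPattern = "ERROR",
  limit = 100
)
for (e in events$events) cat(e$message, "\n")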

Viewing Logs in AWS Console

  1. Navigate to CloudWatch → Log Groups
  2. Find /aws/ecs/starburst-worker
  3. Search for task ID in stream names
  4. Use CloudWatch Insights for advanced queries:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
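
The same query can also be run from R through the CloudWatch Logs Insights API. This is a rough sketch, assuming paws.management is installed; Insights queries run asynchronously, so poll until the query status is "Complete".

# Sketch: run the Insights query above from R (assumes paws.management)
logs <- paws.management::cloudwatchlogs(config = list(region = "us-east-1"))
q <- logs$start_query(
  logGroupName = "/aws/ecs/starburst-worker",
  startTime = as.integer(Sys.time()) - 3600,  # last hour, epoch seconds
  endTime = as.integer(Sys.time()),
  queryString = "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100"
)
Sys.sleep(5)  # poll until res$status == "Complete" for long queries
res <- logs$get_query_results(queryId = q$queryId)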

Common Issues

Issue 1: Tasks Stuck in “Pending”

Symptoms:

  • session$status() shows tasks never start
  • Workers = 0 in status
  • Tasks remain in pending state for >5 minutes

Diagnosis:

# Check Fargate quota
config <- get_starburst_config()
sts <- paws.security.identity::sts()
account <- sts$get_caller_identity()

# Check service quotas manually in AWS Console:
# Service Quotas → AWS Fargate → Fargate vCPUs
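
If you would rather check the quota from R, the sketch below lists the Fargate quotas through the Service Quotas API. It assumes paws.management is installed and that your credentials are allowed to call Service Quotas.

# Sketch: list Fargate quotas from R (assumes paws.management)
quotas <- paws.management::servicequotas(config = list(region = "us-east-1"))
fargate <- quotas$list_service_quotas(ServiceCode = "fargate")
for (q in fargate$Quotas) {
  cat(sprintf("%s: %s\n", q$QuotaName, q$Value))
}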

Common Causes:

  1. Insufficient vCPU quota - Most common issue
    • Default Fargate quota: 6 vCPUs in us-east-1
    • Each worker uses configured CPU (default: 4 vCPUs)
    • With 10 workers × 4 vCPUs = 40 vCPUs needed
  2. Invalid task definition - Wrong CPU/memory combination
    • Fargate has strict CPU/memory pairings
    • Example: 4 vCPUs supports 8-30 GB memory
  3. Network/subnet issues - VPC configuration problems
    • Subnets must have available IP addresses
    • Security groups must allow outbound traffic
  4. IAM permission errors - Missing ECS task execution role permissions
    • Must have ECR, S3, CloudWatch Logs access

Solutions:

# Solution 1: Request quota increase
# Go to AWS Console → Service Quotas → AWS Fargate
# Request vCPUs quota increase to 100+

# Solution 2: Reduce workers
plan(starburst, workers = 1)  # Use only 1 worker (4 vCPUs)

# Solution 3: Reduce CPU per worker
plan(starburst, workers = 10, cpu = 0.25, memory = "512MB")

# Solution 4: Check IAM permissions
# Ensure ECS task execution role has:
# - AmazonECSTaskExecutionRolePolicy
# - S3 read/write access to starburst bucket
# - CloudWatch Logs write access

Issue 2: Workers Crash Immediately

Symptoms:

  • Tasks start but stop within 30 seconds
  • Status shows workers = 0 after initial launch
  • CloudWatch logs show error before exit

Diagnosis:

# View CloudWatch logs for the failed task
# Look for error messages in the logs

# Common error patterns:
# - "Error: Cannot connect to S3" → S3 permissions
# - "Error loading package" → Package installation failed
# - "Cannot allocate memory" → Memory limit too low
# - "exec format error" → Architecture mismatch

Common Causes:

  1. S3 permission errors - Task role can’t access bucket
  2. Package installation failures - Missing system dependencies
  3. Out of memory - Memory limit too low for workload
  4. Architecture mismatch - ARM64 vs X86_64 image/instance mismatch

Solutions:

# Solution 1: Verify S3 permissions
# Check task role has S3 access:
# IAM → Roles → starburstECSTaskRole → Permissions
# Should have S3 GetObject/PutObject on bucket

# Solution 2: Increase memory
plan(starburst, workers = 5, cpu = 4, memory = "16GB")

# Solution 3: Check Docker build logs
# Re-run starburst_setup() to rebuild the image
# Watch for package installation errors

# Solution 4: For EC2 mode, verify architecture matches
plan(starburst,
     launch_type = "EC2",
     instance_type = "c7g.xlarge")  # Graviton (ARM64)
# Ensure Docker image built for matching architecture
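
Two quick checks that help with cause 4 (architecture mismatch). These are sketches; the image name is a placeholder, and docker must be available on your PATH.

# Local build machine architecture, e.g. "x86_64" vs "arm64"/"aarch64"
Sys.info()[["machine"]]

# Architecture the worker image was built for (image name is a placeholder)
system("docker image inspect --format '{{.Architecture}}' <your-image>:latest")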

Issue 3: “Access Denied” Errors

Symptoms:

  • Error messages containing “AccessDenied” or “Forbidden”
  • Can’t create tasks, access S3, or push Docker images

Diagnosis:

# Check which operation is failing:
# 1. Docker push → ECR permissions
# 2. S3 operations → S3 permissions
# 3. Task launch → ECS permissions

# Verify credentials
library(paws.security.identity)
sts <- paws.security.identity::sts()
identity <- sts$get_caller_identity()
print(identity)  # Should show your AWS account
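
To narrow down which permission is missing, you can probe each service directly from R. A rough sketch, assuming paws.storage and paws.compute are installed; the bucket name is a placeholder.

# Sketch: probe S3 and ECR access separately (bucket name is a placeholder)
s3 <- paws.storage::s3(config = list(region = "us-east-1"))
try(s3$head_bucket(Bucket = "my-starburst-bucket"))   # errors if bucket access is denied
try(s3$put_object(Bucket = "my-starburst-bucket",
                  Key = "permission-test.txt",
                  Body = charToRaw("test")))           # errors if writes are denied

ecr <- paws.compute::ecr(config = list(region = "us-east-1"))
try(ecr$get_authorization_token())                     # errors if ECR auth is denied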

Common Causes:

  1. No AWS credentials configured
  2. IAM user lacks required permissions
  3. S3 bucket policy blocks access
  4. ECR repository doesn’t exist or blocks access

Solutions:

# Solution 1: Configure AWS credentials
# Option A: Environment variables
Sys.setenv(
  AWS_ACCESS_KEY_ID = "YOUR_KEY",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET",
  AWS_DEFAULT_REGION = "us-east-1"
)

# Option B: AWS CLI profile
Sys.setenv(AWS_PROFILE = "your-profile")

# Option C: IAM role (when running on EC2/ECS)
# No configuration needed - automatic

# Solution 2: Add required IAM permissions
# Your IAM user/role needs:
# - ECS: RunTask, DescribeTasks, StopTask
# - ECR: GetAuthorizationToken, BatchCheckLayerAvailability,
#        GetDownloadUrlForLayer, PutImage, InitiateLayerUpload, etc.
# - S3: GetObject, PutObject, ListBucket on your bucket
# - IAM: PassRole (to pass ECS task role)

# Solution 3: Run starburst_setup() to create all resources
library(starburst)
starburst_setup(bucket = "my-starburst-bucket")

Issue 4: High Costs / Runaway Workers

Symptoms:

  • AWS bill higher than expected
  • Many tasks running when you expected them to stop
  • Old sessions still have active workers

Diagnosis:

# List all active sessions
library(starburst)
sessions <- starburst_list_sessions()
print(sessions)

# Check for old sessions with running tasks
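
For a direct count of what is still running, you can ask ECS for the running tasks in the staRburst cluster. A sketch, assuming paws.compute is installed and the default cluster name starburst-cluster used elsewhere in this guide.

# Sketch: count tasks still running in the cluster (assumes paws.compute)
ecs <- paws.compute::ecs(config = list(region = "us-east-1"))
running <- ecs$list_tasks(cluster = "starburst-cluster", desiredStatus = "RUNNING")
cat(length(running$taskArns), "task(s) still running\n")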

Common Causes:

  1. Forgot to cleanup session - Workers keep running
  2. Requested too many workers - Cost adds up quickly
  3. Long-running tasks - Tasks running for hours/days

Solutions:

# Solution 1: Cleanup all sessions
sessions <- starburst_list_sessions()
for (session_id in sessions$session_id) {
  session <- starburst_session_attach(session_id)
  session$cleanup(stop_workers = TRUE, force = TRUE)
}

# Solution 2: Set budget alerts in AWS
# AWS Billing Console → Budgets → Create budget
# Set alert at $100, $500 thresholds

# Solution 3: Use worker validation to prevent mistakes
# staRburst now enforces max 500 workers
# Previously you could accidentally request 10,000+

# Solution 4: Set absolute timeout on sessions
session <- starburst_session(
  workers = 10,
  absolute_timeout = 3600  # Auto-terminate after 1 hour
)

Issue 5: Session Cleanup Not Working

Symptoms:

  • Called session$cleanup() but workers still running
  • S3 files not deleted
  • Tasks still appearing in ECS console

Diagnosis:

# Check if cleanup was called with correct parameters
session$cleanup(stop_workers = TRUE, force = TRUE)

# Verify tasks actually stopped (may take 30-60 seconds)
Sys.sleep(60)

# Check ECS tasks manually
library(paws.compute)
ecs <- paws.compute::ecs(config = list(region = "us-east-1"))
tasks <- ecs$list_tasks(cluster = "starburst-cluster")
print(tasks$taskArns)  # Should be empty or not include your tasks

Common Causes:

  1. Cleanup called without stop_workers - Workers not stopped
  2. Cleanup called without force - S3 files preserved
  3. Tasks in different cluster - Cleanup looking in wrong place
  4. ECS eventual consistency - Tasks take time to stop

Solutions:

# Solution 1: Always use both flags for full cleanup
session$cleanup(stop_workers = TRUE, force = TRUE)

# Solution 2: Wait for ECS to process stop requests
session$cleanup(stop_workers = TRUE)
Sys.sleep(60)  # Wait 1 minute
# Then verify in AWS console

# Solution 3: Manual cleanup if needed
library(paws.compute)
library(paws.storage)

ecs <- paws.compute::ecs(config = list(region = "us-east-1"))
s3 <- paws.storage::s3(config = list(region = "us-east-1"))

# Stop all tasks in cluster
tasks <- ecs$list_tasks(cluster = "starburst-cluster", desiredStatus = "RUNNING")
for (task_arn in tasks$taskArns) {
  ecs$stop_task(cluster = "starburst-cluster", task = task_arn)
}

# Delete all session S3 files
result <- s3$list_objects_v2(Bucket = "your-bucket", Prefix = "sessions/")
for (obj in result$Contents) {
  s3$delete_object(Bucket = "your-bucket", Key = obj$Key)
}

Issue 6: Results Not Appearing

Symptoms:

  • session$collect() returns empty list
  • Tasks show as “completed” but no results
  • S3 doesn’t contain result files

Diagnosis:

# Check session status
status <- session$status()
print(status)

# Verify tasks were actually submitted
# Check S3 for task files
library(paws.storage)
s3 <- paws.storage::s3(config = list(region = "us-east-1"))
result <- s3$list_objects_v2(
  Bucket = "your-bucket",
  Prefix = sprintf("sessions/%s/results/", session$session_id)
)
print(result$Contents)  # Should show .qs files

Common Causes:

  1. Tasks failed before producing results - Check for errors
  2. Workers can’t write to S3 - Permission issue
  3. Looking at wrong session ID - Attached to wrong session
  4. Results already collected - Results can only be collected once

Solutions:

# Solution 1: Check task status for errors
status <- session$status()
if (status$failed_tasks > 0) {
  # Check CloudWatch logs for failed task IDs
  # Look for error messages
}

# Solution 2: Verify S3 write permissions
# Task role must have S3 PutObject permission

# Solution 3: Verify session ID
print(session$session_id)
# Make sure this matches the session you created

# Solution 4: Results can only be collected once
# If you already called collect(), results are removed from S3
# You should store results after collection:
results <- session$collect(wait = TRUE)
saveRDS(results, "my_results.rds")  # Save locally

Issue 7: Detached Session Reattach Fails

Symptoms:

  • starburst_session_attach() throws error
  • “Session not found” message
  • Can’t reconnect after closing R

Diagnosis:

# List all sessions to find your session ID
sessions <- starburst_list_sessions()
print(sessions)

# Try to attach with exact session ID
session_id <- "session-abc123..."
session <- starburst_session_attach(session_id)

Common Causes:

  1. Wrong session ID - Typo or wrong ID
  2. Session expired - Exceeded absolute_timeout
  3. S3 manifest deleted - Someone deleted session files
  4. Wrong region - Session created in different region

Solutions:

# Solution 1: List and copy exact session ID
sessions <- starburst_list_sessions()
session_id <- sessions$session_id[1]  # Use exact ID
session <- starburst_session_attach(session_id)

# Solution 2: Save session ID immediately after creation
session <- starburst_session(workers = 10)
session_id <- session$session_id
write(session_id, "my_session_id.txt")  # Save to file
# Later:
session_id <- readLines("my_session_id.txt")
session <- starburst_session_attach(session_id)

# Solution 3: Check correct region
session <- starburst_session_attach(session_id, region = "us-west-2")

Issue 8: Package Installation Failures

Symptoms:

  • Docker build fails during renv::restore()
  • Error messages about missing system dependencies
  • Specific packages fail to install

Diagnosis:

Look at the Docker build output when staRburst builds the worker image. Common error patterns:

Error: installation of package 'X' had non-zero exit status
Error: compilation failed for package 'X'
Error: unable to load shared library

Common Causes:

  1. Missing system dependencies - Package needs system libraries
  2. Package not in CRAN - Private or development package
  3. Version conflicts - renv.lock specifies unavailable version

Solutions:

# Solution 1: Add system dependencies to Dockerfile.base
# Edit starburst package Dockerfile.base template:
# Add RUN apt-get install -y libcurl4-openssl-dev

# Solution 2: Use renv snapshot to capture dependencies
renv::snapshot()  # Updates renv.lock

# Solution 3: Install from GitHub for dev packages
renv::install("user/package")
renv::snapshot()

# Solution 4: Check package availability
install.packages("package")  # Test locally first
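
One way to find out which system libraries a CRAN package needs before rebuilding the image is to query the system requirements database via pak. A sketch, assuming the pak package is available; "curl" is just an example package.

# Sketch: look up system requirements for a package (assumes pak is installed)
# install.packages("pak")
pak::pkg_sysreqs("curl")  # reports the system libraries the package needs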

Advanced Diagnostics

Checking ECS Task Status

library(paws.compute)
ecs <- paws.compute::ecs(config = list(region = "us-east-1"))

# List all tasks
tasks <- ecs$list_tasks(
  cluster = "starburst-cluster",
  desiredStatus = "RUNNING"
)

# Describe specific task
task_detail <- ecs$describe_tasks(
  cluster = "starburst-cluster",
  tasks = tasks$taskArns[1]
)

# Check exit code and reason
print(task_detail$tasks[[1]]$containers[[1]]$exitCode)
print(task_detail$tasks[[1]]$stoppedReason)

Monitoring S3 Storage

library(paws.storage)
s3 <- paws.storage::s3(config = list(region = "us-east-1"))

# List all session files
result <- s3$list_objects_v2(
  Bucket = "your-starburst-bucket",
  Prefix = "sessions/"
)

# Calculate total storage
total_bytes <- sum(vapply(result$Contents, function(x) x$Size, numeric(1)))  # 0 if no files
total_mb <- total_bytes / 1024^2
cat(sprintf("Total storage: %.2f MB\n", total_mb))

Estimating Costs

# Fargate pricing (us-east-1, 2026):
# - vCPU: $0.04048 per hour
# - Memory: $0.004445 per GB-hour

vcpu_price <- 0.04048
memory_price <- 0.004445

workers <- 10
cpu <- 4
memory_gb <- 8
runtime_hours <- 1

cost_per_worker <- (cpu * vcpu_price) + (memory_gb * memory_price)
total_cost <- workers * cost_per_worker * runtime_hours

cat(sprintf("Estimated cost: $%.2f for %d hours\n", total_cost, runtime_hours))

Getting Help

If you encounter issues not covered here:

  1. Check CloudWatch Logs - Most issues have error messages in logs
  2. Review AWS Console - Check ECS, S3, ECR for resource status
  3. File GitHub Issue - Include error messages and logs
  4. AWS Support - For quota increases or AWS-specific issues

Information to Include in Bug Reports:

  • staRburst version: packageVersion("starburst")
  • R version: R.version.string
  • AWS region
  • Launch type (Fargate vs EC2)
  • Error messages from R and CloudWatch logs
  • Session ID (for detached sessions)
  • Output of session$status() (if applicable)

See Also