Seamless AWS cloud bursting for parallel R workloads
staRburst lets you run parallel R code on AWS Fargate with zero infrastructure management. Scale from your laptop to 100+ cloud workers with a simple function call.
Features
- Simple Setup: One-time configuration (~2 minutes), then seamless operation
- Simple API: A direct starburst_map() function - no new concepts to learn
- Detached Sessions: Submit long-running jobs and detach - retrieve results anytime
- Multiple Backends: Fargate (serverless) and EC2 (cost-optimized) support
- Automatic Environment Sync: Your packages and dependencies automatically available on workers
- Smart Quota Management: Automatically handles AWS quota limits with wave execution
- Cost Transparent: See estimated and actual costs for every run
- Auto Cleanup: Workers shut down automatically when done
Installation
CRAN submission in progress for v0.3.6 (expected within 2-4 weeks).
Once available:
install.packages("starburst")
Development version from GitHub:
remotes::install_github("scttfrdmn/starburst")
Quick Start
library(starburst)
# One-time setup (2 minutes)
starburst_setup()
# Run parallel computation on AWS
results <- starburst_map(
1:1000,
function(x) expensive_computation(x),
workers = 50
)
#> 🚀 Starting starburst cluster with 50 workers
#> 💰 Estimated cost: ~$2.80/hour
#> 📊 Processing 1000 items with 50 workers
#> 📦 Created 50 chunks (avg 20 items per chunk)
#> 🚀 Submitting tasks...
#> ✓ Submitted 50 tasks
#> ⏳ Progress: 50/50 tasks (3.2 minutes elapsed)
#>
#> ✓ Completed in 3.2 minutes
#> 💰 Estimated cost: $0.15
Example: Monte Carlo Simulation
library(starburst)
# Define simulation
simulate_portfolio <- function(seed) {
set.seed(seed)
returns <- rnorm(252, mean = 0.0003, sd = 0.02)
prices <- cumprod(1 + returns)
list(
final_value = prices[252],
sharpe_ratio = mean(returns) / sd(returns) * sqrt(252)
)
}
# Run 10,000 simulations on 100 AWS workers
results <- starburst_map(
1:10000,
simulate_portfolio,
workers = 100
)
#> 🚀 Starting starburst cluster with 100 workers
#> 💰 Estimated cost: ~$5.60/hour
#> 📊 Processing 10000 items with 100 workers
#> ⏳ Progress: 100/100 tasks (3.1 minutes elapsed)
#>
#> ✓ Completed in 3.1 minutes
#> 💰 Estimated cost: $0.29
# Extract results
final_values <- sapply(results, function(x) x$final_value)
sharpe_ratios <- sapply(results, function(x) x$sharpe_ratio)
# Summary
mean(final_values) # Average portfolio outcome
quantile(final_values, c(0.05, 0.95)) # Risk range
# Comparison:
# Local (single core): ~4 hours
# Cloud (100 workers): 3 minutes, $0.29
Advanced Usage
Reuse Cluster for Multiple Operations
# Create cluster once
cluster <- starburst_cluster(workers = 50, cpu = 4, memory = "8GB")
# Run multiple analyses
results1 <- cluster$map(dataset1, analysis_function)
results2 <- cluster$map(dataset2, processing_function)
results3 <- cluster$map(dataset3, modeling_function)
# All use the same Docker image and configuration
Custom Worker Configuration
# For memory-intensive workloads
results <- starburst_map(
large_datasets,
memory_intensive_function,
workers = 20,
cpu = 8,
memory = "16GB"
)
# For CPU-intensive workloads
results <- starburst_map(
cpu_tasks,
cpu_intensive_function,
workers = 50,
cpu = 4,
memory = "8GB"
)
Detached Sessions
Run long jobs and disconnect - results persist in S3:
# Start detached session
session <- starburst_session(workers = 50, detached = TRUE)
# Submit work and get session ID
session$submit(quote({
results <- starburst_map(huge_dataset, expensive_function)
saveRDS(results, "results.rds")
}))
session_id <- session$session_id
# Disconnect - job continues running
# Later (hours/days), reconnect:
session <- starburst_session_attach(session_id)
status <- session$status() # Check progress
results <- session$collect() # Get results
# Cleanup when done
session$cleanup(force = TRUE)
How It Works
- Environment Snapshot: Captures your R packages using renv
- Container Build: Creates Docker image with your environment, cached in ECR
- Task Distribution: Splits data into chunks across workers
- Task Submission: Launches Fargate tasks (or sequential batches if quota-limited)
- Data Transfer: Serializes task data to S3 using fast qs format
- Execution: Workers pull data, execute function on chunk items, push results
- Result Collection: Downloads and combines results in correct order
- Cleanup: Automatically shuts down workers
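The task-distribution and result-collection steps above can be sketched in plain R. This is an illustrative simplification, not staRburst's internal code: `make_chunks` is a hypothetical helper, and `lapply` stands in for the remote Fargate execution.

```r
# Illustrative sketch (not staRburst internals): split a vector of
# inputs into one chunk per worker, then recombine results in order.
make_chunks <- function(x, workers) {
  split(x, cut(seq_along(x), breaks = workers, labels = FALSE))
}

chunks <- make_chunks(1:1000, 50)   # 50 chunks of ~20 items each

# Each "worker" applies the function to the items in its chunk...
chunk_results <- lapply(chunks, function(chunk) lapply(chunk, sqrt))

# ...and the per-chunk results are flattened back into original order.
results <- unlist(chunk_results, recursive = FALSE, use.names = FALSE)
```

Because chunks are created and collected in input order, the combined result list lines up element-for-element with the input vector.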
Cost Management
# Set cost limits
starburst_config(
max_cost_per_job = 10, # Hard limit
cost_alert_threshold = 5 # Warning at $5
)
# Costs shown transparently
results <- starburst_map(data, fn, workers = 100)
#> 💰 Estimated cost: ~$3.50/hour
#> ✓ Completed in 23 minutes
#> 💰 Estimated cost: $1.34
Quota Management
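The hourly estimates above come from Fargate's per-vCPU and per-GB pricing. A back-of-the-envelope version is easy to write yourself; the rates below are assumed us-east-1 on-demand prices (they vary by region and over time), and `fargate_hourly_cost` is a hypothetical helper, not part of the package:

```r
# Rough Fargate cost model (illustrative only; rates are assumed
# us-east-1 on-demand prices and differ by region).
fargate_hourly_cost <- function(workers, cpu = 1, memory_gb = 2,
                                vcpu_rate = 0.04048, gb_rate = 0.004445) {
  workers * (cpu * vcpu_rate + memory_gb * gb_rate)
}

fargate_hourly_cost(workers = 50)                 # ~$2.47/hour
fargate_hourly_cost(workers = 20, cpu = 8, memory_gb = 16)
```

Actual job cost is this hourly rate times the fraction of an hour the workers actually run, which is why short runs cost cents despite the headline hourly figure.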
staRburst automatically handles AWS Fargate quota limitations:
results <- starburst_map(data, fn, workers = 100, cpu = 4)
#> ⚠ Requested 100 workers (400 vCPUs) but quota allows 25 workers (100 vCPUs)
#> ⚠ Using 25 workers instead
#> 💰 Estimated cost: ~$1.40/hour
Your work still completes, just with fewer workers. You can request quota increases through AWS Service Quotas.
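The arithmetic behind wave execution can be sketched as follows; `plan_waves` is an illustrative helper (not the package's internal planner) that caps workers at the vCPU quota and runs tasks in sequential waves of that size:

```r
# Illustrative wave planner: fit workers under the vCPU quota,
# then run tasks in sequential waves of that many workers.
plan_waves <- function(n_tasks, workers_requested, cpu_per_worker, vcpu_quota) {
  max_workers <- min(workers_requested, vcpu_quota %/% cpu_per_worker)
  list(
    workers = max_workers,
    waves   = ceiling(n_tasks / max_workers)
  )
}

# The example above: 100 tasks, 100 workers requested at 4 vCPUs each,
# but only 100 vCPUs of quota -> 25 workers, 4 sequential waves.
plan_waves(n_tasks = 100, workers_requested = 100,
           cpu_per_worker = 4, vcpu_quota = 100)
```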
API Reference
Main Functions
- starburst_map(.x, .f, workers, ...) - Parallel map over data
- starburst_cluster(workers, cpu, memory) - Create a reusable cluster
- starburst_setup() - Initial AWS configuration
- starburst_config(...) - Update configuration
- starburst_status() - Check cluster status
Configuration Options
starburst_config(
region = "us-east-1",
max_cost_per_job = 10,
cost_alert_threshold = 5
)
Documentation
Full documentation available at starburst.ing
Comparison
| Feature | staRburst | RStudio Server on EC2 | Coiled (Python) |
|---|---|---|---|
| Setup time | 2 minutes | 30+ minutes | 5 minutes |
| Infrastructure management | Zero | Manual | Zero |
| Learning curve | Minimal | Medium | Medium |
| Auto scaling | Yes | No | Yes |
| Cost optimization | Automatic | Manual | Automatic |
| R-native | Yes | Yes | No (Python) |
Requirements
- R >= 4.0
- AWS account with:
  - AWS CLI configured or AWS_PROFILE set
  - IAM permissions for ECS, ECR, S3, VPC
  - Two IAM roles (created during setup):
    - starburstECSExecutionRole - for ECS/ECR access
    - starburstECSTaskRole - for S3 access
See IMPLEMENTATION_STATUS.md for detailed setup instructions.
Roadmap
v0.3.6 (Current - CRAN Submission)
- ✅ Direct API (starburst_map, starburst_cluster)
- ✅ AWS Fargate integration
- ✅ EC2 backend support with spot instances
- ✅ Detached session mode for long-running jobs
- ✅ Automatic environment management
- ✅ Cost tracking and quota handling
- ✅ Full future backend integration
- ✅ Support for future.apply, furrr, targets
- ✅ Comprehensive AWS integration testing
- ✅ CRAN-ready (0 errors, 0 notes)
Contributing
Contributions welcome! See CONTRIBUTING.md.
