Welcome to AI Sandbox Benchmark – an open-source, standardized benchmarking framework designed to evaluate and compare various code execution sandbox environments like Daytona, e2b, CodeSandbox, Modal, Morph, and others.
⚠️ Disclaimer: This project is a work in progress and proof of concept. We are actively working on optimizing performance, improving test coverage, and enhancing the overall user experience. Feedback and contributions are highly welcome!
Whether you're a developer looking to choose the best sandbox for your projects or a contributor aiming to enhance the benchmarking suite, this project is for you!
```bash
# Clone the repository
git clone https://github.com/daytonaio/ai-sandbox-benchmark.git
cd ai-sandbox-benchmark

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure providers
# See providers/README.md for detailed setup instructions
```
The easiest way to run benchmarks is using the interactive Terminal UI:

```bash
python benchmark.py
```
- Parallel Provider Execution: Tests run simultaneously across all selected providers
- Interactive TUI: User-friendly terminal interface for selecting tests and providers
- WCAG-Compliant Interface: High-contrast, accessible terminal UI
- Automated CodeSandbox Detection: Warns if CodeSandbox service is not running
- Flexible Test Configuration: Run any combination of tests and providers
- Comprehensive Metrics: Detailed timing for workspace creation, execution, and cleanup
- Statistical Analysis: Mean, standard deviation, and relative performance comparisons
- Warmup Runs: Configurable warmup runs to ensure stable measurements
Below is sample output from a basic performance test running the `ls` command across providers, with one warmup run followed by three timed measurement runs:

```
================================================================================
Test Configuration Summary
================================================================================
Warmup Runs: 1
Measurement Runs: 3
Tests Used (1): 1:test_list_directory
Providers Used: daytona, e2b, codesandbox, modal
================================================================================
Performance Comparison for Test 1: test_list_directory
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Metric | Daytona | E2b | Codesandbox | Modal |
+====================+===================+====================+=====================+=====================+
| Workspace Creation | 168.66ms (±6.83) | 646.40ms (±259.36) | 1533.00ms (±472.06) | 528.03ms (±42.06) |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Code Execution | 366.90ms (±76.72) | 267.24ms (±4.63) | 190.67ms (±4.03) | 477.98ms (±190.28) |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Cleanup | 145.81ms (±1.49) | 738.66ms (±194.46) | 5314.00ms (±62.03) | 3448.28ms (±215.71) |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Total Time | 681.37ms | 1652.31ms | 7037.67ms | 4454.29ms |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| vs Daytona % | 0% | +142.5% | +932.9% | +553.7% |
+--------------------+-------------------+--------------------+---------------------+---------------------+
```

AI Sandbox Benchmark collects detailed performance metrics across providers and offers robust historical tracking:
- Workspace Creation Time: Time taken to initialize the sandbox environment
- Code Execution Time: Time to execute the test code
- Cleanup Time: Time required to tear down resources
- Total Time: Overall end-to-end performance
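In the sample output above, the "vs Daytona %" row is the percentage increase in total time relative to Daytona: (provider total - Daytona total) / Daytona total. For E2b, for example, (1652.31 - 681.37) / 681.37 ≈ 1.425, i.e. +142.5%.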
The benchmark suite now includes performance history tracking that:
- Stores Results: Automatically saves benchmark results to a history file (a minimal reading sketch follows this list)
- Tracks Trends: Analyzes performance changes over time
- Detects Changes: Identifies improvements or regressions between runs
- Compares Providers: Shows relative performance across providers
- Statistical Metrics: Standard deviation, coefficient of variation, min/max values
- Provider Comparisons: Identifies fastest and most consistent providers
- Reliability Tracking: Tracks error rates and failures over time
- Performance Trends: Visualizes performance changes with percentage improvements
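As a rough illustration of how that stored history can be consumed, the sketch below reads `benchmark_history.json` and compares the two most recent runs. The actual schema of the history file is defined by the benchmark suite; the flat per-provider structure assumed here is hypothetical.

```python
import json
from pathlib import Path

# Hypothetical structure: a list of runs, each mapping a provider name to its
# total time in milliseconds, e.g. [{"daytona": 681.4, "e2b": 1652.3}, ...].
# The real benchmark_history.json schema is defined by the benchmark suite.
runs = json.loads(Path("benchmark_history.json").read_text())

if len(runs) >= 2:
    previous, latest = runs[-2], runs[-1]
    for provider, total_ms in latest.items():
        if provider in previous:
            change = (total_ms - previous[provider]) / previous[provider] * 100
            print(f"{provider}: {total_ms:.2f} ms ({change:+.1f}% vs previous run)")
```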
Planned additions:

- Comprehensive network performance metrics
- Graphical visualization of performance trends
- Automated regression detection and alerting
To run the benchmark suite you will need:

- Python 3.12+
- Node.js (for CodeSandbox service)
- Git
- Clone the Repository

  ```bash
  git clone https://github.com/nkkko/ai-sandbox-benchmark.git
  cd ai-sandbox-benchmark
  ```

- Set Up a Virtual Environment (Optional but Recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

- Install Python Dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set Up Provider-Specific Requirements

  Some providers require additional setup. See the Provider README for detailed setup instructions.

- Configure Environment Variables

  Create a `.env` file in the root directory with the necessary API keys (an illustrative example follows this list). Refer to the Provider README for detailed instructions on setting up each provider.

- Configure Sandbox Settings (Optional)

  The `config.yml` file allows you to customize various aspects of the benchmark, including which environment variables are passed to sandbox environments:

  ```yaml
  # Environment variables to pass to sandboxes
  env_vars:
    pass_to_sandbox:
      - OPENAI_API_KEY
      # Add other variables as needed

  # Test configuration
  tests:
    warmup_runs: 1
    measurement_runs: 10

  # Provider-specific settings
  providers:
    daytona:
      default_region: eu
    morph:
      # Morph specific settings
  ```
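For reference, a `.env` file might look like the sketch below. Apart from `OPENAI_API_KEY` (which appears in `config.yml` above), the variable names are placeholders; consult the Provider README for the exact keys each provider expects.

```bash
# Illustrative .env only - see providers/README.md for the exact variable names
OPENAI_API_KEY=sk-...
DAYTONA_API_KEY=your-daytona-key      # placeholder name
E2B_API_KEY=your-e2b-key              # placeholder name
CSB_API_KEY=your-codesandbox-key      # placeholder name
MORPH_API_KEY=your-morph-key          # placeholder name
```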
The benchmark includes the following tests:
- Calculate Primes - Calculates the first 10 prime numbers, their sum, and their average (a minimal sketch of this workload follows the list)
- Improved Calculate Primes - Optimized version of the prime calculation test
- Resource Intensive Calculation - Runs CPU, memory, and disk-intensive tasks to stress test the environment
- Package Installation - Measures installation and import time for simple and complex Python packages
- File I/O Performance - Benchmarks file operations with different file sizes and formats
- Startup Time - Measures Python interpreter and library startup times
- LLM Generated Primes - Generates code using an LLM to calculate prime numbers
- Database Operations - Tests SQLite database performance for various operations
- Container Stability - Measures stability under combined CPU, memory, and disk load
- List Directory - Basic system command execution test using the `ls` command
- System Info - Gathers detailed system information about the environment
- FFT Performance - Benchmarks Fast Fourier Transform computation speed
- FFT Multiprocessing Performance - Tests FFT computation with parallel processing
- Optimized Example - Demonstrates optimized code execution patterns
- Sandbox Utils - Tests utility functions specific to sandbox environments
- Template - Template for creating new tests
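To give a concrete sense of what these workloads look like, here is a minimal, self-contained sketch of the kind of computation the Calculate Primes test performs; the real test code (and the Template for writing new tests) lives in the repository and may be structured differently.

```python
# Sketch of the Calculate Primes workload: first 10 primes, their sum and average.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

primes = []
candidate = 2
while len(primes) < 10:
    if is_prime(candidate):
        primes.append(candidate)
    candidate += 1

print(f"primes={primes} sum={sum(primes)} avg={sum(primes) / len(primes):.1f}")
```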
You can run benchmarks using either the command-line interface or the interactive Terminal UI.
The benchmark TUI provides an interactive way to select tests and providers:

```bash
python benchmark.py
```

Execute the comparator script directly for command-line benchmarking:

```bash
python comparator.py
```

To use the CLI mode with the TUI script:

```bash
python benchmark.py --cli
```
Available options:

- `--tests` or `-t`: Comma-separated list of test IDs to run (or `all`). Default: `all`
- `--providers` or `-p`: Comma-separated list of providers to test. Default: `daytona,e2b,codesandbox,modal,local,morph`
- `--runs` or `-r`: Number of measurement runs per test/provider. Default: `10`
- `--warmup-runs` or `-w`: Number of warmup runs. Default: `1`
- `--target-region`: Target region (e.g., `eu`, `us`, `asia`). Default: `eu`
- `--show-history`: Show historical performance comparison. Default: disabled (use the flag to enable)
- `--history-limit`: Number of previous runs to include in history. Default: `5`
- `--history-file`: Path to the benchmark history file. Default: `benchmark_history.json`
Example commands:

- Run All Tests on All Providers: `python comparator.py`
- Run Specific Tests on Selected Providers: `python comparator.py --tests 1,3 --providers daytona,codesandbox`
- Run Tests on Local Machine Only: `python comparator.py --providers local`
- Run Tests on Morph Sandbox: `python comparator.py --providers morph`
- Increase Measurement and Warmup Runs: `python comparator.py --runs 20 --warmup-runs 2`
- View Historical Performance Trends: `python comparator.py --tests 1 --show-history`
- Compare Recent Performance with History: `python comparator.py --tests 1,2 --providers daytona,e2b --show-history --history-limit 10`
- Use Custom History File: `python comparator.py --tests 1 --history-file custom_history.json --show-history`
The benchmark suite now runs tests on all selected providers in parallel, significantly reducing overall benchmark time: each test is executed on every provider simultaneously rather than waiting for one provider to finish before moving on to the next.
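Conceptually, the parallel fan-out resembles the asyncio sketch below. It only illustrates the pattern, not the suite's actual orchestration code; `run_on_provider` is a hypothetical stand-in for "create workspace, execute test, clean up" on one provider.

```python
import asyncio
import time

# Illustration of the parallel fan-out described above.
async def run_on_provider(provider: str, test_id: int) -> tuple[str, float]:
    start = time.perf_counter()
    await asyncio.sleep(0.1)  # placeholder for the provider round trip
    return provider, (time.perf_counter() - start) * 1000

async def run_test_everywhere(test_id: int, providers: list[str]) -> None:
    # Every provider runs concurrently instead of one after another.
    results = await asyncio.gather(
        *(run_on_provider(p, test_id) for p in providers)
    )
    for provider, elapsed_ms in results:
        print(f"test {test_id} on {provider}: {elapsed_ms:.1f} ms")

asyncio.run(run_test_everywhere(1, ["daytona", "e2b", "codesandbox", "modal"]))
```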
We invite developers, testers, and enthusiasts to contribute by adding new tests or integrating additional sandbox providers. Your contributions help make AI Sandbox Benchmark a comprehensive and reliable tool for the community.
Check out our Contributing Guidelines to get started!
This project is licensed under the Apache 2.0 License.