Welcome to AI Sandbox Benchmark – an open-source, standardized benchmarking framework designed to evaluate and compare various code execution sandbox environments like Daytona, e2b, CodeSandbox, Modal, Morph, and others.
⚠️ Disclaimer: This project is a work in progress and proof of concept. We are actively working on optimizing performance, improving test coverage, and enhancing the overall user experience. Feedback and contributions are highly welcome!
Whether you're a developer looking to choose the best sandbox for your projects or a contributor aiming to enhance the benchmarking suite, this project is for you!
```bash
# Clone the repository
git clone https://github.com/daytonaio/ai-sandbox-benchmark.git
cd ai-sandbox-benchmark

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure providers
# See providers/README.md for detailed setup instructions
```
The easiest way to run benchmarks is using the interactive Terminal UI:

```bash
python benchmark.py
```
- Parallel Provider Execution: Tests run simultaneously across all selected providers
- Interactive TUI: User-friendly terminal interface for selecting tests and providers
- WCAG-Compliant Interface: High-contrast, accessible terminal UI
- Automated CodeSandbox Detection: Warns if CodeSandbox service is not running
- Flexible Test Configuration: Run any combination of tests and providers
- Comprehensive Metrics: Detailed timing for workspace creation, execution, and cleanup
- Statistical Analysis: Mean, standard deviation, and relative performance comparisons
- Warmup Runs: Configurable warmup runs to ensure stable measurements
Below is sample output from a basic performance test running the `ls` command across providers, with one warmup run followed by three timed measurement runs:

```
================================================================================
Test Configuration Summary
================================================================================
Warmup Runs: 1
Measurement Runs: 3
Tests Used (1): 1:test_list_directory
Providers Used: daytona, e2b, codesandbox, modal
================================================================================
Performance Comparison for Test 1: test_list_directory
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Metric | Daytona | E2b | Codesandbox | Modal |
+====================+===================+====================+=====================+=====================+
| Workspace Creation | 168.66ms (±6.83) | 646.40ms (±259.36) | 1533.00ms (±472.06) | 528.03ms (±42.06) |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Code Execution | 366.90ms (±76.72) | 267.24ms (±4.63) | 190.67ms (±4.03) | 477.98ms (±190.28) |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Cleanup | 145.81ms (±1.49) | 738.66ms (±194.46) | 5314.00ms (±62.03) | 3448.28ms (±215.71) |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| Total Time | 681.37ms | 1652.31ms | 7037.67ms | 4454.29ms |
+--------------------+-------------------+--------------------+---------------------+---------------------+
| vs Daytona % | 0% | +142.5% | +932.9% | +553.7% |
+--------------------+-------------------+--------------------+---------------------+---------------------+
```

AI Sandbox Benchmark collects detailed performance metrics across providers and offers robust historical tracking:
- Workspace Creation Time: Time taken to initialize the sandbox environment
- Code Execution Time: Time to execute the test code
- Cleanup Time: Time required to tear down resources
- Total Time: Overall end-to-end performance
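In the sample output above, the "vs Daytona %" row is the percentage increase in total time relative to Daytona: (provider total - Daytona total) / Daytona total. For E2b, for example, (1652.31 - 681.37) / 681.37 ≈ 1.425, i.e. +142.5%.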
The benchmark suite now includes performance history tracking that:
- Stores Results: Automatically saves benchmark results to a history file (a minimal reading sketch follows this list)
- Tracks Trends: Analyzes performance changes over time
- Detects Changes: Identifies improvements or regressions between runs
- Compares Providers: Shows relative performance across providers
- Statistical Metrics: Standard deviation, coefficient of variation, min/max values
- Provider Comparisons: Identifies fastest and most consistent providers
- Reliability Tracking: Tracks error rates and failures over time
- Performance Trends: Visualizes performance changes with percentage improvements
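As a rough illustration of how that stored history can be consumed, the sketch below reads `benchmark_history.json` and compares the two most recent runs. The actual schema of the history file is defined by the benchmark suite; the flat per-provider structure assumed here is hypothetical.

```python
import json
from pathlib import Path

# Hypothetical structure: a list of runs, each mapping a provider name to its
# total time in milliseconds, e.g. [{"daytona": 681.4, "e2b": 1652.3}, ...].
# The real benchmark_history.json schema is defined by the benchmark suite.
runs = json.loads(Path("benchmark_history.json").read_text())

if len(runs) >= 2:
    previous, latest = runs[-2], runs[-1]
    for provider, total_ms in latest.items():
        if provider in previous:
            change = (total_ms - previous[provider]) / previous[provider] * 100
            print(f"{provider}: {total_ms:.2f} ms ({change:+.1f}% vs previous run)")
```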
Planned additions:

- Comprehensive network performance metrics
- Graphical visualization of performance trends
- Automated regression detection and alerting
To run the benchmark suite you will need:

- Python 3.12+
- Node.js (for CodeSandbox service)
- Git
- Clone the Repository

  ```bash
  git clone https://github.com/nkkko/ai-sandbox-benchmark.git
  cd ai-sandbox-benchmark
  ```

- Set Up a Virtual Environment (Optional but Recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

- Install Python Dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set Up Provider-Specific Requirements

  Some providers require additional setup. See the Provider README for detailed setup instructions.

- Configure Environment Variables

  Create a `.env` file in the root directory with the necessary API keys (an illustrative example follows this list). Refer to the Provider README for detailed instructions on setting up each provider.

- Configure Sandbox Settings (Optional)

  The `config.yml` file allows you to customize various aspects of the benchmark, including which environment variables are passed to sandbox environments:

  ```yaml
  # Environment variables to pass to sandboxes
  env_vars:
    pass_to_sandbox:
      - OPENAI_API_KEY
      # Add other variables as needed

  # Test configuration
  tests:
    warmup_runs: 1
    measurement_runs: 10

  # Provider-specific settings
  providers:
    daytona:
      default_region: eu
    morph:
      # Morph specific settings
  ```
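For reference, a `.env` file might look like the sketch below. Apart from `OPENAI_API_KEY` (which appears in `config.yml` above), the variable names are placeholders; consult the Provider README for the exact keys each provider expects.

```bash
# Illustrative .env only - see providers/README.md for the exact variable names
OPENAI_API_KEY=sk-...
DAYTONA_API_KEY=your-daytona-key      # placeholder name
E2B_API_KEY=your-e2b-key              # placeholder name
CSB_API_KEY=your-codesandbox-key      # placeholder name
MORPH_API_KEY=your-morph-key          # placeholder name
```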
The benchmark includes the following tests:
- Calculate Primes - Calculates the first 10 prime numbers, their sum, and their average (a minimal sketch of this workload follows the list)
- Improved Calculate Primes - Optimized version of the prime calculation test
- Resource Intensive Calculation - Runs CPU, memory, and disk-intensive tasks to stress test the environment
- Package Installation - Measures installation and import time for simple and complex Python packages
- File I/O Performance - Benchmarks file operations with different file sizes and formats
- Startup Time - Measures Python interpreter and library startup times
- LLM Generated Primes - Generates code using an LLM to calculate prime numbers
- Database Operations - Tests SQLite database performance for various operations
- Container Stability - Measures stability under combined CPU, memory, and disk load
- List Directory - Basic system command execution test using the `ls` command
- System Info - Gathers detailed system information about the environment
- FFT Performance - Benchmarks Fast Fourier Transform computation speed
- FFT Multiprocessing Performance - Tests FFT computation with parallel processing
- Optimized Example - Demonstrates optimized code execution patterns
- Sandbox Utils - Tests utility functions specific to sandbox environments
- Template - Template for creating new tests
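To give a concrete sense of what these workloads look like, here is a minimal, self-contained sketch of the kind of computation the Calculate Primes test performs; the real test code (and the Template for writing new tests) lives in the repository and may be structured differently.

```python
# Sketch of the Calculate Primes workload: first 10 primes, their sum and average.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

primes = []
candidate = 2
while len(primes) < 10:
    if is_prime(candidate):
        primes.append(candidate)
    candidate += 1

print(f"primes={primes} sum={sum(primes)} avg={sum(primes) / len(primes):.1f}")
```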
You can run benchmarks using either the command-line interface or the interactive Terminal UI.
The benchmark TUI provides an interactive way to select tests and providers:

```bash
python benchmark.py
```

Execute the comparator script directly for command-line benchmarking:

```bash
python comparator.py
```

To use the CLI mode with the TUI script:

```bash
python benchmark.py --cli
```
Available options:

- `--tests` or `-t`: Comma-separated list of test IDs to run (or `all`). Default: `all`
- `--providers` or `-p`: Comma-separated list of providers to test. Default: `daytona,e2b,codesandbox,modal,local,morph`
- `--runs` or `-r`: Number of measurement runs per test/provider. Default: `10`
- `--warmup-runs` or `-w`: Number of warmup runs. Default: `1`
- `--target-region`: Target region (e.g., `eu`, `us`, `asia`). Default: `eu`
- `--show-history`: Show historical performance comparison. Default: disabled (use the flag to enable)
- `--history-limit`: Number of previous runs to include in history. Default: `5`
- `--history-file`: Path to the benchmark history file. Default: `benchmark_history.json`
Example commands:

- Run All Tests on All Providers: `python comparator.py`
- Run Specific Tests on Selected Providers: `python comparator.py --tests 1,3 --providers daytona,codesandbox`
- Run Tests on Local Machine Only: `python comparator.py --providers local`
- Run Tests on Morph Sandbox: `python comparator.py --providers morph`
- Increase Measurement and Warmup Runs: `python comparator.py --runs 20 --warmup-runs 2`
- View Historical Performance Trends: `python comparator.py --tests 1 --show-history`
- Compare Recent Performance with History: `python comparator.py --tests 1,2 --providers daytona,e2b --show-history --history-limit 10`
- Use Custom History File: `python comparator.py --tests 1 --history-file custom_history.json --show-history`
The benchmark suite now runs tests on all selected providers in parallel, significantly reducing overall benchmark time: each test is executed on every provider simultaneously rather than waiting for one provider to finish before moving on to the next.
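Conceptually, the parallel fan-out resembles the asyncio sketch below. It only illustrates the pattern, not the suite's actual orchestration code; `run_on_provider` is a hypothetical stand-in for "create workspace, execute test, clean up" on one provider.

```python
import asyncio
import time

# Illustration of the parallel fan-out described above.
async def run_on_provider(provider: str, test_id: int) -> tuple[str, float]:
    start = time.perf_counter()
    await asyncio.sleep(0.1)  # placeholder for the provider round trip
    return provider, (time.perf_counter() - start) * 1000

async def run_test_everywhere(test_id: int, providers: list[str]) -> None:
    # Every provider runs concurrently instead of one after another.
    results = await asyncio.gather(
        *(run_on_provider(p, test_id) for p in providers)
    )
    for provider, elapsed_ms in results:
        print(f"test {test_id} on {provider}: {elapsed_ms:.1f} ms")

asyncio.run(run_test_everywhere(1, ["daytona", "e2b", "codesandbox", "modal"]))
```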
We invite developers, testers, and enthusiasts to contribute by adding new tests or integrating additional sandbox providers. Your contributions help make AI Sandbox Benchmark a comprehensive and reliable tool for the community.
Check out our Contributing Guidelines to get started!
This project is licensed under the Apache 2.0 License.