This is a test harness for evaluating LLMs on their ability to consistently follow instructions and successfully edit Haskell code.
It is a modified version of the Aider benchmark harness adapted to include a Haskell environment.
The benchmark is based on Exercism's Haskell exercises (GitHub). It evaluates how effectively a coding assistant and its underlying LLM can translate a natural-language coding request into executable code, saved to files, that passes unit tests. This is an end-to-end evaluation not just of the LLM's coding ability, but also of its capacity to edit existing code and to format those edits so that aider can apply them to the local source files.
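Each exercise pairs stub source files with a unit-test suite; the model must edit the stubs until the tests pass. The sketch below is a minimal, hypothetical example in the style of Exercism's classic leap-year exercise, collapsed into a single file for illustration (real exercises keep the source module and the test suite in separate files, and the names here are not taken from this repo):

```haskell
-- Illustrative sketch only: an exercise-style function plus the kind of
-- hspec tests the harness runs against the model's edits.
import Test.Hspec

-- The exercise stub the model is asked to complete (shown here already solved).
isLeapYear :: Integer -> Bool
isLeapYear year =
  (year `mod` 4 == 0 && year `mod` 100 /= 0) || year `mod` 400 == 0

-- The unit tests a submission must pass for the exercise to count as solved.
main :: IO ()
main = hspec $
  describe "isLeapYear" $ do
    it "year divisible by 4 but not by 100 is a leap year" $
      isLeapYear 1996 `shouldBe` True
    it "century not divisible by 400 is not a leap year" $
      isLeapYear 1900 `shouldBe` False
    it "century divisible by 400 is a leap year" $
      isLeapYear 2000 `shouldBe` True
```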
Last updated: 2025-05-28
Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
---|---|---|---|---|---|---|---|---|---|---|
o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |
You can generally follow the instructions in the Aider benchmark harness, with the following exceptions:
- clone this repo
- the exercises are already included in the `tmp.benchmarks` directory, so there is no need to clone them separately (although you are welcome to contribute new ones)
On my macOS machine, running the benchmark in Docker would consistently fail with a heap corruption error (issue). A Nix environment is provided instead, although you should still run the benchmark in an isolated environment such as a VM, since it executes code produced by an LLM.
Once you have cloned the repo:
```sh
nix develop

# set your API keys (alternatively, set them in .envrc if using direnv; the nix env has direnv set up)
export OPENAI_API_KEY=sk-proj-...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...

# run the benchmark (try a single exercise first)
./benchmark/benchmark.py o3-mini-run --model o3-mini --edit-format whole --threads 10 --num-tests 1 --exercises-dir polyglot-benchmark --new

# full run
./benchmark/benchmark.py o3-mini-full-run --model o3-mini --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --new

# for sonnet thinking
./benchmark/benchmark.py claude-3-7-thinking-full-run-final --model anthropic/claude-3-7-sonnet-20250219 --edit-format whole --threads 5 --exercises-dir polyglot-benchmark --new --read-model-settings .aider.model.settings.yml
```
Be mindful of the API rate limits of the model you are using. For high-volume APIs (e.g. OpenAI), I've had success with 20 threads; for Anthropic, I've had success with 5 threads, and so on.
Reference for model providers and models: https://aider.chat/docs/llms.html
After running benchmarks for one or more models, you can generate comparison reports with:
```sh
# Generate reports for all benchmarks (automatically uses all folders in tmp.benchmarks except polyglot-benchmark)
./benchmark/summarize_benchmark.py

# Generate reports for specific benchmark directories
./benchmark/summarize_benchmark.py path/to/dir1 path/to/dir2

# Specify custom output paths
./benchmark/summarize_benchmark.py --table-output custom_table.csv --plot-output custom_plot.png
```
The report generator will:
- Extract key metrics from all benchmark results
- Format model names for better readability
- Sort models by pass rate
- Generate a formatted table in both CSV and Markdown formats
- Create a visual comparison chart showing pass rates and costs
- Save results in a timestamped directory under benchmark-result/
To keep your clone in sync with upstream:

```sh
git fetch upstream
git merge upstream/main
```