This is a test harness for evaluating LLMs on their ability to consistently follow instructions and successfully edit Haskell code.
It is a modified version of the Aider benchmark harness adapted to include a Haskell environment.
The benchmark is based on Exercism's Haskell exercises (GitHub). It evaluates how effectively a coding assistant and its underlying LLM can translate a natural-language coding request into executable code, saved to files, that passes unit tests. This is an end-to-end evaluation not just of the LLM's coding ability, but also of its capacity to edit existing code and to format those edits so that aider can apply them to the local source files.
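Each exercise pairs stub source files with a unit-test suite; the model must edit the stubs until the tests pass. The sketch below is a minimal, hypothetical example in the style of Exercism's classic leap-year exercise, collapsed into a single file for illustration (real exercises keep the source module and the test suite in separate files, and the names here are not taken from this repo):

```haskell
-- Illustrative sketch only: an exercise-style function plus the kind of
-- hspec tests the harness runs against the model's edits.
import Test.Hspec

-- The exercise stub the model is asked to complete (shown here already solved).
isLeapYear :: Integer -> Bool
isLeapYear year =
  (year `mod` 4 == 0 && year `mod` 100 /= 0) || year `mod` 400 == 0

-- The unit tests a submission must pass for the exercise to count as solved.
main :: IO ()
main = hspec $
  describe "isLeapYear" $ do
    it "year divisible by 4 but not by 100 is a leap year" $
      isLeapYear 1996 `shouldBe` True
    it "century not divisible by 400 is not a leap year" $
      isLeapYear 1900 `shouldBe` False
    it "century divisible by 400 is a leap year" $
      isLeapYear 2000 `shouldBe` True
```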
Last updated: 2025-05-28
Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
---|---|---|---|---|---|---|---|---|---|---|
o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |
You can generally follow the instructions in the Aider benchmark harness, with the following exceptions:
- clone this repo
- the exercises are already included in the `tmp.benchmarks` directory, so there is no need to clone them separately (although you are welcome to contribute new ones)
On my macOS machine, running the benchmark in Docker would consistently fail with a heap corruption error (issue). A Nix environment is provided instead, although you should still run the benchmark in an isolated environment such as a VM, since it executes code produced by an LLM.
Once you have cloned the repo:
```sh
nix develop

# set your API keys (alternatively, set them in .envrc if using direnv; the nix env has direnv set up)
export OPENAI_API_KEY=sk-proj-...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...

# run the benchmark (try a single exercise first)
./benchmark/benchmark.py o3-mini-run --model o3-mini --edit-format whole --threads 10 --num-tests 1 --exercises-dir polyglot-benchmark --new

# full run
./benchmark/benchmark.py o3-mini-full-run --model o3-mini --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --new

# for sonnet thinking
./benchmark/benchmark.py claude-3-7-thinking-full-run-final --model anthropic/claude-3-7-sonnet-20250219 --edit-format whole --threads 5 --exercises-dir polyglot-benchmark --new --read-model-settings .aider.model.settings.yml
```
Be mindful of the API rate limits of the model you are using. For high-volume APIs (e.g. OpenAI), I've had success with 20 threads; for Anthropic, I've had success with 5 threads, and so on.
Reference for model providers and models: https://aider.chat/docs/llms.html
After running benchmarks for one or more models, you can generate comparison reports with:
```sh
# Generate reports for all benchmarks (automatically uses all folders in tmp.benchmarks except polyglot-benchmark)
./benchmark/summarize_benchmark.py

# Generate reports for specific benchmark directories
./benchmark/summarize_benchmark.py path/to/dir1 path/to/dir2

# Specify custom output paths
./benchmark/summarize_benchmark.py --table-output custom_table.csv --plot-output custom_plot.png
```
The report generator will:
- Extract key metrics from all benchmark results
- Format model names for better readability
- Sort models by pass rate
- Generate a formatted table in both CSV and Markdown formats
- Create a visual comparison chart showing pass rates and costs
- Save results in a timestamped directory under benchmark-result/
To keep your clone in sync with upstream:

```sh
git fetch upstream
git merge upstream/main
```