Haskell LLM Benchmark

This is a test harness for evaluating LLMs on their ability to consistently follow instructions and successfully edit Haskell code.

It is a modified version of the Aider benchmark harness adapted to include a Haskell environment.

The benchmark is based on Exercism's Haskell exercises (GitHub). It evaluates how effectively a coding assistant and an LLM can translate a natural-language coding request into executable code saved to files that pass unit tests. It provides an end-to-end evaluation not just of the LLM's coding ability, but also of its capacity to edit existing code and to format those edits so that aider can apply them to the local source files.
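
Concretely, each exercise pairs a stub source file with a unit-test suite, and the model must edit the stub until the tests pass. The sketch below shows the shape of such an exercise, in the style of Exercism's "leap" exercise; the file names, module name, and test cases are illustrative rather than the exact contents of tmp.benchmarks.

-- src/LeapYear.hs: the stub handed to the model
module LeapYear (isLeapYear) where

isLeapYear :: Integer -> Bool
isLeapYear year = error "You need to implement this function."

-- test/Tests.hs: the unit tests the edited code must pass
import Test.Hspec (describe, hspec, it, shouldBe)

import LeapYear (isLeapYear)

main :: IO ()
main = hspec $
  describe "isLeapYear" $ do
    it "a year divisible by 4 but not by 100 is a leap year" $
      isLeapYear 1996 `shouldBe` True
    it "a year divisible by 100 but not by 400 is not a leap year" $
      isLeapYear 1900 `shouldBe` False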

Last updated: 2025-05-28


| Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
|---|---|---|---|---|---|---|---|---|---|---|
| o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
| o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
| o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
| claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
| deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
| gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
| o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
| claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
| gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
| o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
| o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
| gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
| gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |

Instructions

You can generally follow the instructions in the Aider benchmark harness, with the following exceptions:

  • Clone this repo.
  • The exercises are included in the tmp.benchmarks directory, so there is no need to clone them separately (although you are welcome to contribute new ones).

On my macOS machine, running the benchmark in Docker consistently failed with a heap corruption error (issue). A Nix environment is provided instead, but you should still run the benchmark in an isolated environment such as a VM: it executes code produced by an LLM, so isolation is important.

Once you have cloned the repo:

nix develop

# set your API keys (alternatively, set them in .envrc if you use direnv; the nix environment has direnv set up)
export OPENAI_API_KEY=sk-proj-...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...

# run the benchmark (try a single exercise first)
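# (--edit-format whole has the model return each edited file in full rather than as a diff)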
./benchmark/benchmark.py o3-mini-run --model o3-mini --edit-format whole --threads 10 --num-tests 1 --exercises-dir polyglot-benchmark --new

./benchmark/benchmark.py o3-mini-full-run --model o3-mini --edit-format whole --threads 10 --exercises-dir polyglot-benchmark --new

# for sonnet thinking
./benchmark/benchmark.py claude-3-7-thinking-full-run-final --model anthropic/claude-3-7-sonnet-20250219 --edit-format whole --threads 5 --exercises-dir polyglot-benchmark --new --read-model-settings .aider.model.settings.yml

You need to be mindful of the rate limits of the API you are using. For high-volume APIs (e.g. OpenAI), I've had success with 20 threads; for Anthropic, with 5 threads.

Reference for model providers and models: https://aider.chat/docs/llms.html

Generating Reports

After running benchmarks for one or more models, you can generate comparison reports with:

# Generate reports for all benchmarks (automatically uses all folders in tmp.benchmarks except polyglot-benchmark)
./benchmark/summarize_benchmark.py

# Generate reports for specific benchmark directories
./benchmark/summarize_benchmark.py path/to/dir1 path/to/dir2

# Specify custom output paths
./benchmark/summarize_benchmark.py --table-output custom_table.csv --plot-output custom_plot.png

The report generator will:

  • Extract key metrics from all benchmark results
  • Format model names for better readability
  • Sort models by pass rate
  • Generate a formatted table in both CSV and Markdown formats
  • Create a visual comparison chart showing pass rates and costs
  • Save results in a timestamped directory under benchmark-result/

Updating to the latest aider version
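
These commands assume an upstream remote pointing at the Aider repository; if you have not configured one yet, add it first with git remote add upstream <aider repo URL>.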

git fetch upstream
git merge upstream/main
