We report the percentage of issues reproduced by the generated tests (\(\mathcal{S}\)) and their mean line coverage of the resolving patch (\(\Delta\mathcal{C}\)).


| Model | \(\mathcal{S}\) | \(\Delta\mathcal{C}\) | Date | Traj. |
|---|---|---|---|---|
| SWE-Agent+ | 18.5% | 27.6% | | 🔗 |
| SWE-Agent (Mistral Large 2) | 16.3% | 23.0% | | 🔗 |
| SWE-Agent (Claude 3.5 Sonnet) | 12.3% | 30.3% | | 🔗 |
| SWE-Agent (GPT-4o mini) | 9.8% | 20.9% | | 🔗 |
| SWE-Agent (Claude 3 Haiku) | 2.5% | 3.0% | | 🔗 |
| SWE-Agent (Mixtral 8x22B) | 0.7% | 0.9% | | 🔗 |
| SWE-Agent (GPT-4) | 15.9% | 26.5% | | 🔗 |
| Aider (GPT-4) | 12.7% | 27.8% | | 🔗 |
| AutoCodeRover (GPT-4) | 9.1% | 17.9% | | 🔗 |
| LIBRO | 14.1% | 23.8% | | 🔗 |
| GPT-4 + BM25 (ZSP) | 9.4% | 21.5% | | 🔗 |
| GPT-4 + BM25 (ZSB) | 3.6% | 7.6% | | 🔗 |

The results reported here were obtained on SWT-Bench Lite. We independently execute all submitted predictions for verification.

What is the task in SWT-Bench?

Generate software tests which reproduce user-reported issues and increase code coverage.

Each SWT-Bench task is based on a pull-request from a GitHub repository which resolves a reported issue and contains the code patch fixing the issue and unit tests testing the fix. Given the original state of the code base and the user issue, the task is to generate tests that reproduce this issue. These generated tests are expected to fail on the original code base and pass after the issue is resolved.
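
To make this concrete, the sketch below shows what a generated reproducing test might look like. The package `mypkg`, its module `mypkg.dates`, the function `parse_date`, and the issue itself are all hypothetical and only serve to illustrate the Fail-to-Pass pattern.

```python
# Hypothetical reproducing test for a fictitious issue report:
# "parse_date crashes on ISO timestamps with a timezone offset".
# On the original (buggy) code base this test fails; after the resolving
# patch is applied it passes, making it a Fail-to-Pass (F2P) test.

from mypkg.dates import parse_date  # hypothetical module under test


def test_parse_date_with_timezone_offset():
    # Before the fix this call raises an exception; afterwards it returns
    # a timezone-aware datetime, so the assertion below succeeds.
    result = parse_date("2024-05-01T12:00:00+02:00")
    assert result.utcoffset().total_seconds() == 2 * 3600
```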

What metrics do you measure?

We measure whether the generated tests reproduce the user issue and increase the code coverage of the modified code.

For a generated test to reproduce the described issue, it should fail on the original code base but pass after the code patch fixing the issue is applied. We call this a Fail-to-Pass (F2P) test. An instance is considered successfully resolved if at least one F2P test is generated and no generated test fails on the fixed state of the codebase (F2F or P2F). We report the resulting success rate \(\mathcal{S}\). We further measure the increase in line coverage of the lines changed when resolving the issue, which we call the Coverage Increase \(\Delta\mathcal{C}\).
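
As a minimal sketch of how these two metrics could be aggregated, the snippet below assumes one record per instance with the number of F2P/F2F/P2F verdicts among the generated tests and the patch-line coverage with and without them; this input format is an illustrative assumption, not the official evaluation harness.

```python
# Illustrative aggregation of the two reported metrics. The per-instance
# record format is an assumption made for this sketch.

def success_rate(instances):
    """Fraction of instances with at least one F2P test and no F2F or P2F test."""
    resolved = sum(
        1
        for inst in instances
        if inst["f2p"] > 0 and inst["f2f"] == 0 and inst["p2f"] == 0
    )
    return resolved / len(instances)


def coverage_increase(instances):
    """Mean increase in line coverage of the lines changed by the resolving patch."""
    deltas = [
        inst["patch_coverage_with_tests"] - inst["patch_coverage_without_tests"]
        for inst in instances
    ]
    return sum(deltas) / len(deltas)


# Example usage with two made-up instances:
results = [
    {"f2p": 1, "f2f": 0, "p2f": 0,
     "patch_coverage_without_tests": 0.10, "patch_coverage_with_tests": 0.65},
    {"f2p": 0, "f2f": 1, "p2f": 0,
     "patch_coverage_without_tests": 0.00, "patch_coverage_with_tests": 0.20},
]
print(f"S = {success_rate(results):.1%}, dC = {coverage_increase(results):.1%}")
```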

What is the value of test generation?

Reproducing tests can aid test-driven development, avoid regressions, and are a powerful tool for cross-validating proposed bug fixes.

To resolve a reported issue reliably, it is essential to first reproduce it and then confirm that a proposed bug fix actually addresses it. This is typically done by creating a reproducing test. However, creating such tests is a tedious and time-consuming process that we want to automate using code agents. Once a reproducing test is generated, it can be used to drive the process of fixing the issue and to validate a proposed bug fix.

What is the relationship between SWT-Bench and SWE-bench?

SWT-Bench and SWE-bench measure complementary skills. SWT-Bench measures the capability of reproducing a reported issue, while SWE-bench measures the capability of fixing it.

SWT-Bench and SWE-bench are based on the same GitHub repositories and issues. While SWT-Bench measures the ability of models or agents to reproduce a given issue, SWE-bench measures their ability to resolve it. We observe that while these tasks are complementary, their hardness is not correlated at the instance level.

Check out our paper for more details!

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Niels Mündler¹, Mark Niklas Müller¹,², Jingxuan He¹, Martin Vechev¹
¹ETH Zurich  ²LogicStar AI


If you have any remaining questions, please feel free to contact us at contact@swtbench.com

Citing this work

If you use this benchmark, please cite:

@inproceedings{mundler2024swtbench,
  title={{SWT}-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents},
  author={Niels M{\"u}ndler and Mark Niklas Mueller and Jingxuan He and Martin Vechev},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=9Y8zUO11EQ}
}