We report the percentage of issues reproduced by the generated tests (\(\mathcal{S}\)) and their mean line coverage of the resolving patch (\(\Delta\mathcal{C}\)).


| Model | \(\mathcal{S}\) | \(\Delta\mathcal{C}\) | Date | Traj. |
|---|---|---|---|---|
| SWE-Agent+ | 18.5% | 27.6% | | 🔗 |
| SWE-Agent (Mistral Large 2) | 16.3% | 23.0% | | 🔗 |
| SWE-Agent (Claude 3.5 Sonnet) | 12.3% | 30.3% | | 🔗 |
| SWE-Agent (GPT-4o mini) | 9.8% | 20.9% | | 🔗 |
| SWE-Agent (Claude 3 Haiku) | 2.5% | 3.0% | | 🔗 |
| SWE-Agent (Mixtral 8x22B) | 0.7% | 0.9% | | 🔗 |
| SWE-Agent (GPT-4) | 15.9% | 26.5% | | 🔗 |
| Aider (GPT-4) | 12.7% | 27.8% | | 🔗 |
| AutoCodeRover (GPT-4) | 9.1% | 17.9% | | 🔗 |
| LIBRO | 14.1% | 23.8% | | 🔗 |
| GPT-4 + BM25 (ZSP) | 9.4% | 21.5% | | 🔗 |
| GPT-4 + BM25 (ZSB) | 3.6% | 7.6% | | 🔗 |

The results reported here were obtained on SWT-Bench Lite. We independently execute all submitted predictions for verification.

What is the task in SWT-Bench?

Generate software tests which reproduce user-reported issues and increase code coverage.

Each SWT-Bench task is based on a pull-request from a GitHub repository which resolves a reported issue and contains the code patch fixing the issue and unit tests testing the fix. Given the original state of the code base and the user issue, the task is to generate tests that reproduce this issue. These generated tests are expected to fail on the original code base and pass after the issue is resolved.
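
To make this concrete, the sketch below shows what a generated reproducing test might look like. The package `mypkg`, its module `mypkg.dates`, the function `parse_date`, and the issue itself are all hypothetical and only serve to illustrate the Fail-to-Pass pattern.

```python
# Hypothetical reproducing test for a fictitious issue report:
# "parse_date crashes on ISO timestamps with a timezone offset".
# On the original (buggy) code base this test fails; after the resolving
# patch is applied it passes, making it a Fail-to-Pass (F2P) test.

from mypkg.dates import parse_date  # hypothetical module under test


def test_parse_date_with_timezone_offset():
    # Before the fix this call raises an exception; afterwards it returns
    # a timezone-aware datetime, so the assertion below succeeds.
    result = parse_date("2024-05-01T12:00:00+02:00")
    assert result.utcoffset().total_seconds() == 2 * 3600
```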

What metrics do you measure?

We measure whether the generated tests reproduce the user issue and increase the code coverage of the modified code.

For a generated test to reproduce the described issue, it should fail on the original code base but pass after the code patch fixing the issue is applied. We call this a Fail-to-Pass (F2P) test. An instance is considered successfully resolved if at least one F2P test is generated and no generated test fails on the fixed state of the codebase (F2F or P2F). We report the resulting success rate \(\mathcal{S}\). We further measure the increase in line coverage of the lines changed when resolving the issue, which we call the Coverage Increase \(\Delta\mathcal{C}\).
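
As a minimal sketch of how these two metrics could be aggregated, the snippet below assumes one record per instance with the number of F2P/F2F/P2F verdicts among the generated tests and the patch-line coverage with and without them; this input format is an illustrative assumption, not the official evaluation harness.

```python
# Illustrative aggregation of the two reported metrics. The per-instance
# record format is an assumption made for this sketch.

def success_rate(instances):
    """Fraction of instances with at least one F2P test and no F2F or P2F test."""
    resolved = sum(
        1
        for inst in instances
        if inst["f2p"] > 0 and inst["f2f"] == 0 and inst["p2f"] == 0
    )
    return resolved / len(instances)


def coverage_increase(instances):
    """Mean increase in line coverage of the lines changed by the resolving patch."""
    deltas = [
        inst["patch_coverage_with_tests"] - inst["patch_coverage_without_tests"]
        for inst in instances
    ]
    return sum(deltas) / len(deltas)


# Example usage with two made-up instances:
results = [
    {"f2p": 1, "f2f": 0, "p2f": 0,
     "patch_coverage_without_tests": 0.10, "patch_coverage_with_tests": 0.65},
    {"f2p": 0, "f2f": 1, "p2f": 0,
     "patch_coverage_without_tests": 0.00, "patch_coverage_with_tests": 0.20},
]
print(f"S = {success_rate(results):.1%}, dC = {coverage_increase(results):.1%}")
```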

What is the value of test generation?

Reproducing tests can aid test-driven development, avoid regressions, and are a powerful tool for cross-validating proposed bug fixes.

To resolve a reported issue reliably, it is essential to first reproduce it and then confirm that a proposed bug fix actually addresses it. This is typically done by creating a reproducing test. However, creating such tests is a tedious and time-consuming process that we want to automate using code agents. Once a reproducing test is generated, it can be used to drive the process of fixing the issue and to validate a proposed bug fix.

What is the relationship between SWT-Bench and SWE-bench?

SWT-Bench and SWE-bench measure complementary skills. SWT-Bench measures the capability of reproducing a reported issue, while SWE-bench measures the capability of fixing it.

SWT-Bench and SWE-bench are based on the same GitHub repositories and issues. While SWT-Bench measures the ability of models or agents to reproduce a given issue, SWE-bench measures their ability to resolve it. We observe that while these tasks are complementary, their hardness is not correlated at the instance level.

Check out our paper for more details!

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Niels Mündler¹, Mark Niklas Müller¹,², Jingxuan He¹, Martin Vechev¹
¹ETH Zurich  ²LogicStar AI


If you have any remaining questions, please feel free to contact us at contact@swtbench.com

Citing this work

If you use this benchmark, please cite:

@inproceedings{mundler2024swtbench,
  title={{SWT}-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents},
  author={Niels M{\"u}ndler and Mark Niklas Mueller and Jingxuan He and Martin Vechev},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=9Y8zUO11EQ}
}