
Model | \(\mathcal{S}\) | \(\Delta\mathcal{C}\) | Traj.
--- | --- | --- | ---
AEGIS | 47.8% | 26.0% | 🔗
🆕 OpenHands (CI setup)# | 28.3% | 52.4% | 🔗
🆕 OpenHands (vanilla) | 22.8% | 43.6% | 🔗
SWE-Agent+ | 18.5% | 27.6% | 🔗
SWE-Agent (Mistral Large 2) | 16.3% | 23.0% | 🔗
SWE-Agent (Claude 3.5 Sonnet) | 12.3% | 30.3% | 🔗
SWE-Agent (GPT-4o mini) | 9.8% | 20.9% | 🔗
SWE-Agent (Claude 3 Haiku) | 2.5% | 3.0% | 🔗
SWE-Agent (Mixtral 8x22B) | 0.7% | 0.9% | 🔗
SWE-Agent (GPT-4) | 15.9% | 26.5% | 🔗
Aider (GPT-4) | 12.7% | 27.8% | 🔗
AutoCodeRover (GPT-4) | 9.1% | 17.9% | 🔗
LIBRO* | 14.1% | 23.8% | 🔗
GPT-4 + BM25 (ZSP) | 9.4% | 21.5% | 🔗
GPT-4 + BM25 (ZSB) | 3.6% | 7.6% | 🔗

We report the percentage of issues reproduced by the generated tests (\(\mathcal{S}\)) and their mean line coverage of the resolving patch (\(\Delta\mathcal{C}\)).

The results reported here are evaluation results on SWT-Bench Lite and SWT-Bench Verified. We have independently executed all submitted predictions for verification.
* Generates stand-alone reproduction scripts and does not attempt integration into the test framework.
# This approach leverages execution feedback from a correctly set-up CI environment.

What is the task in SWT-Bench?

Generate software tests which reproduce user-reported issues and increase code coverage.

Each SWT-Bench task is based on a real-world pull request from a GitHub repository that resolves a user-reported issue and contains both the code patch fixing the issue and unit tests verifying the fix. Given the original state of the code base and the user issue, the task is to generate tests that reproduce the issue. These generated tests are expected to fail on the original code base and pass after the issue is resolved.
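To make this concrete, below is a minimal sketch of what such a reproducing test might look like. The repository, module, and issue number are purely hypothetical; the point is that the test encodes the behaviour described in the issue, so it fails on the original code base and passes once the fixing patch is applied.

    # Hypothetical reproducing test for a fictitious issue report:
    # "parse_version('1.10') compares as smaller than parse_version('1.9')"
    from mypkg.versions import parse_version  # hypothetical module under test


    def test_issue_1234_version_ordering():
        # Fails on the original code base (the bug compares version components
        # as strings) and passes after the fixing patch is applied.
        assert parse_version("1.10") > parse_version("1.9")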

What metrics do you measure?

We measure whether the generated tests reproduce the user issue and increase the code coverage of the modified code.

For a generated test to reproduce the described issue, it should fail on the original code base but pass after the code patch fixing the issue is applied. We call this a Fail-to-Pass (F2P) test. An instance is successfully resolved if at least one F2P test is generated and no generated test fails on the fixed state of the code base (F2F or P2F). We report the success rate \(\mathcal{S}\). We further measure the increase in line coverage of the lines changed when resolving the issue. We call this the Coverage Increase \(\Delta\mathcal{C}\).
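As an informal illustration of these definitions (a simplified sketch, not the official SWT-Bench evaluation harness), the per-instance resolution check and the coverage increase over the patched lines could be computed roughly as follows:

    def is_resolved(statuses: dict[str, tuple[bool, bool]]) -> bool:
        """statuses maps each generated test to (passes before patch, passes after patch)."""
        has_f2p = any(not before and after for before, after in statuses.values())
        # Any test failing on the fixed code base (F2F or P2F) disqualifies the instance.
        any_fail_after = any(not after for _, after in statuses.values())
        return has_f2p and not any_fail_after


    def coverage_increase(changed_lines: set[int],
                          covered_before: set[int],
                          covered_after: set[int]) -> float:
        """Fraction of the patched lines newly covered by the generated tests."""
        if not changed_lines:
            return 0.0
        newly_covered = (covered_after - covered_before) & changed_lines
        return len(newly_covered) / len(changed_lines)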

What is the value of test generation?

Reproducing tests can aid test-driven development, prevent regressions, and are a powerful tool to cross-validate proposed bug fixes.

To resolve a reported issue reliably, it is essential to first reproduce it and then confirm that a proposed bug fix actually addresses it. This is typically done by creating a reproducing test. However, creating such tests is a tedious and time-consuming process that we want to automate using code agents. Once a reproducing test is generated, it can be used to drive the process of fixing the issue and to validate a proposed bug fix.
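The sketch below illustrates this fail-to-pass validation loop, assuming a git checkout, pytest, and a candidate fix in fix.patch (all file and test names are illustrative):

    import subprocess

    TEST = "tests/test_issue_1234.py::test_issue_1234_version_ordering"


    def test_passes(test_id: str) -> bool:
        # pytest exits with code 0 only when the selected test passes.
        return subprocess.run(["python", "-m", "pytest", test_id]).returncode == 0

    # 1. The reproducing test must fail on the original code base.
    assert not test_passes(TEST), "test does not reproduce the issue"

    # 2. Apply the candidate bug fix.
    subprocess.run(["git", "apply", "fix.patch"], check=True)

    # 3. A fail-to-pass transition cross-validates the candidate fix.
    assert test_passes(TEST), "candidate fix does not resolve the reproduced issue"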

Are there other relevant benchmarks about code generation and testing?

SWE-bench measures the capability of fixing reported user issues. BaxBench measures the capability of writing correct and secure web backends.

SWE-bench: SWT-Bench and SWE-bench are based on the same GitHub repositories and issues. While SWT-Bench measures the ability of models or agents to reproduce a given issue, SWE-bench measures their ability to resolve it. We observe that while these tasks are complementary, their hardness is not correlated on an instance level.

BaxBench: BaxBench provides the model with a description of a web application backend and measures the correctness of the generated code. Additionally, BaxBench tests for common security vulnerabilities in the generated code.

Check out our paper for more details!

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents

Niels Mündler¹, Mark Niklas Müller¹,², Jingxuan He¹, Martin Vechev¹
¹ETH Zurich ²LogicStar AI


If you have any remaining questions, please feel free to contact us at contact@swtbench.com

Citing this work

If you use this benchmark, please cite:

 @inproceedings{mundler2024swtbench,
  title={{SWT}-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents},
  author={Niels M{\"u}ndler and Mark Niklas Mueller and Jingxuan He and Martin Vechev},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=9Y8zUO11EQ}
}