News
The latest release of OpenHands reclaims 1st and 3rd place on SWT-Bench Verified, equipped with the newly released GPT-5 and GPT-5-mini, respectively.
e-Otter++ claims the first position on the leaderboard with 50.7% on Lite and 60.7% on Verified. It improves upon the earlier Otter by more deeply integrating execution feedback and heterogeneous prompts in the generation loop.
AssertFlip demonstrates a method to generate test cases by flipping the semantics of generated passing tests, achieving success rates of 35.1% on SWT-Bench Lite and 43.4% on SWT-Bench Verified.
The Amazon Q Developer Agent is evaluated on SWT-Bench and achieves SOTA unit test generation results on both SWT-Bench Lite (37.3%) and Verified (48.7%).
We release SWT-Bench.
We evaluated OpenHands on SWT-Bench, achieving a 22.8% success rate on Lite with the vanilla setup. We found that setting up the CI environment for the agent significantly improves the results, to 28.3% on Lite and 27.7% on Verified.
We evaluated AEGIS on SWT-Bench Lite, achieving a 47.8% success rate and a 26.0% coverage increase. AEGIS is the first submitted agent specifically tailored to the task of software testing and achieves state-of-the-art results.