GitHub: https://github.com/THUDM/SWE-Dev
Hugging Face: https://huggingface.co/THUDM/SWE-Dev-32B
Paper📚 is coming soon ~
LLMs🤖 have advanced from generating simple code snippets to tackling competitive programming, ML problems, and real-world software engineering (SWE) tasks such as resolving GitHub issues. This progress is highlighted by benchmarks where models must generate test-passing solutions within real-world codebases. Unlike other tasks, SWE requires interaction with complex workflows, including handling runtimes, managing dependencies, debugging errors, and verifying solutions through reproducible test suites. To address these challenges, we introduce SWE-Dev⚙, an open-source SWE agent with a scalable test case construction pipeline. The pipeline synthesizes test cases in two steps: it first uses LLMs to generate Gherkin descriptions, a structured format for describing test scenarios, and then generates the corresponding test code. The resulting test cases are validated in Docker environments. Results show our test cases align closely with the problem statements.
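As a concrete illustration, the sketch below shows this two-step synthesis in Python with a generic `call_llm` helper and hypothetical prompt templates; it is an assumption about the overall flow, not the exact prompts or interface used in SWE-Dev.

```python
# Minimal sketch of the two-step test-case synthesis (hypothetical prompts).
# `call_llm(prompt) -> str` is a placeholder for whatever LLM client is used.

GHERKIN_PROMPT = """Given the issue and the gold patch below, write Gherkin scenarios
(Given/When/Then) describing the behavior the patch should introduce.

Issue:
{issue}

Patch:
{patch}
"""

TEST_PROMPT = """Given the Gherkin scenarios and repository context below, write pytest
tests that fail on the original code and pass once the patch is applied.

Scenarios:
{scenarios}

Repository context:
{context}
"""


def synthesize_tests(issue: str, patch: str, context: str, call_llm) -> str:
    """Step 1: structured Gherkin scenarios. Step 2: concrete test code."""
    scenarios = call_llm(GHERKIN_PROMPT.format(issue=issue, patch=patch))
    return call_llm(TEST_PROMPT.format(scenarios=scenarios, context=context))
```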
Our contributions include:
Figure 1: Model performance with training and inference scaling. "BL" refers to the baseline model, "+FT" denotes fine-tuning with the SWE-Dev dataset, and "+IS" represents models with inference scaling. Notably, SWE-Dev-32B achieved a performance of 34.0%, comparable to GPT-4o, even without the benefits of inference scaling.
To address the scarcity of SWE training data, we developed a pipeline for repository crawling, instance extraction, and test case generation. We validate that test cases generated by LLMs can accurately replicate the functionality of the original test cases provided with the corresponding PRs.
Figure 2: Pipeline for test case generation, divided into description generation and code generation phases. The pipeline begins with extracting repository information, followed by generating Gherkin scenarios and then detailed test cases. An optional revision step leverages traceback errors to refine the generated test cases. The final output includes fail-to-pass test cases.
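The optional revision step in Figure 2 can be pictured as a loop that runs the candidate tests inside the Docker environment and feeds any traceback back to the model. The sketch below makes illustrative assumptions: the image name `swe-env:latest`, the test file path `tests/test_generated.py`, and the `call_llm` helper are placeholders rather than the project's actual harness.

```python
import pathlib
import subprocess

REVISE_PROMPT = """The generated tests failed with the traceback below. Revise them so
they exercise the intended behavior.

Traceback:
{traceback}

Current tests:
{tests}
"""


def write_test_file(repo_dir: str, tests: str) -> None:
    # Hypothetical location for the generated test file inside the checked-out repo.
    path = pathlib.Path(repo_dir) / "tests" / "test_generated.py"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(tests)


def run_tests_in_docker(repo_dir: str, image: str = "swe-env:latest"):
    # Mount the repository into the container and run pytest on the generated tests.
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{repo_dir}:/workspace", "-w", "/workspace",
         image, "pytest", "-x", "tests/test_generated.py"],
        capture_output=True, text=True,
    )


def validate_with_revision(repo_dir: str, tests: str, call_llm, max_rounds: int = 3):
    """Keep revising on traceback errors; return passing tests or None."""
    for _ in range(max_rounds):
        write_test_file(repo_dir, tests)
        result = run_tests_in_docker(repo_dir)
        if result.returncode == 0:
            return tests  # tests run cleanly on the patched code
        tests = call_llm(REVISE_PROMPT.format(traceback=result.stdout + result.stderr,
                                              tests=tests))
    return None
```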
We began by crawling metadata for 240k PyPI packages containing GitHub URLs, filtering for repositories with $\text{Stars} \geq 5$ and $\text{PRs} \geq 3$, resulting in a subset of 59k repositories. Due to network constraints and intricate dependency management, we successfully downloaded 10,416 repositories. Following the methodology outlined in prior work, with minor modifications, we extracted 88k instances in total.
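Expressed as code, the repository filter is a simple predicate over the crawled metadata; the field names below (`github_url`, `stars`, `pr_count`) are assumed names for the crawled records, not a documented schema.

```python
from typing import Iterable, Iterator

MIN_STARS = 5   # Stars >= 5
MIN_PRS = 3     # PRs >= 3


def filter_repos(records: Iterable[dict]) -> Iterator[dict]:
    """Keep PyPI packages whose linked GitHub repository meets both thresholds."""
    for rec in records:
        if not rec.get("github_url"):
            continue  # only packages whose metadata links to a GitHub repository
        if rec.get("stars", 0) >= MIN_STARS and rec.get("pr_count", 0) >= MIN_PRS:
            yield rec

# Applied to the ~240k crawled packages, a filter of this kind yields the ~59k repositories.
```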
To refine this dataset, we applied rule-based filtering to adjust patch lengths to fit model context windows and eliminated trivial or irrelevant issues while maintaining diversity. Ultimately, we retained 38k instances from 4,413 repositories as the training set. As shown in Figure 3, over 4,000 repositories contain fewer than five instances, significantly enhancing the dataset’s diversity.
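One way to realize the rule-based filter is a token-budget check on the gold patch plus a triviality heuristic on the number of changed lines; the thresholds and the character-based token estimate below are illustrative choices, not the exact rules used to build the dataset.

```python
MAX_PATCH_TOKENS = 8_000   # illustrative budget so the gold patch fits a model context window
MIN_CHANGED_LINES = 3      # illustrative floor for dropping trivial one-line edits


def approx_tokens(text: str) -> int:
    # Rough proxy of ~4 characters per token; a real pipeline would use the model's tokenizer.
    return len(text) // 4


def keep_instance(instance: dict) -> bool:
    """Rule-based filter: the patch must fit the context budget and not be trivial."""
    patch = instance["patch"]
    changed = sum(
        1 for line in patch.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    if approx_tokens(patch) > MAX_PATCH_TOKENS:
        return False  # too long for the training context window
    if changed < MIN_CHANGED_LINES:
        return False  # trivial edit, unlikely to reflect a meaningful issue
    return True
```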
Figure 3: Distribution of instances per repository in the training dataset. The majority of repositories contribute fewer than five instances, highlighting the dataset's diversity across a wide range of repositories.