GitHub: https://github.com/THUDM/SWE-Dev
Hugging Face: https://huggingface.co/THUDM/SWE-Dev-32B
Paper📚 is coming soon ~
LLMs🤖 have advanced from generating simple code snippets to tackling competitive programming, ML problems, and real-world software engineering (SWE) tasks such as resolving GitHub issues. This progress is highlighted by benchmarks where models must generate test-passing solutions within real-world codebases. Unlike other tasks, SWE requires interaction with complex workflows, including handling runtimes, managing dependencies, debugging errors, and verifying solutions through reproducible test suites. To address these challenges, we introduce SWE-Dev⚙, an open-source SWE agent with a scalable test case construction pipeline. The pipeline synthesizes test cases in two steps: it first uses LLMs to generate Gherkin descriptions, a structured format for describing test scenarios, and then generates the corresponding test code. The resulting test cases are validated in Docker environments. Results show our test cases align closely with the problem statements.
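As a concrete illustration, the sketch below shows this two-step synthesis in Python with a generic `call_llm` helper and hypothetical prompt templates; it is an assumption about the overall flow, not the exact prompts or interface used in SWE-Dev.

```python
# Minimal sketch of the two-step test-case synthesis (hypothetical prompts).
# `call_llm(prompt) -> str` is a placeholder for whatever LLM client is used.

GHERKIN_PROMPT = """Given the issue and the gold patch below, write Gherkin scenarios
(Given/When/Then) describing the behavior the patch should introduce.

Issue:
{issue}

Patch:
{patch}
"""

TEST_PROMPT = """Given the Gherkin scenarios and repository context below, write pytest
tests that fail on the original code and pass once the patch is applied.

Scenarios:
{scenarios}

Repository context:
{context}
"""


def synthesize_tests(issue: str, patch: str, context: str, call_llm) -> str:
    """Step 1: structured Gherkin scenarios. Step 2: concrete test code."""
    scenarios = call_llm(GHERKIN_PROMPT.format(issue=issue, patch=patch))
    return call_llm(TEST_PROMPT.format(scenarios=scenarios, context=context))
```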
Our contributions include:
Figure 1: Model performance with training and inference scaling. "BL" refers to the baseline model, "+FT" denotes fine-tuning with the SWE-Dev dataset, and "+IS" represents models with inference scaling. Notably, SWE-Dev-32B achieved a performance of 34.0%, comparable to GPT-4o, even without the benefits of inference scaling.
To address the scarcity of SWE training data, we developed a pipeline for repository crawling, instance extraction, and test case generation. We validate that test cases generated by LLMs can accurately replicate the functionality of the original test cases provided with the corresponding PRs.
Figure 2: Pipeline for test case generation, divided into description generation and code generation phases. The pipeline begins with extracting repository information, followed by generating Gherkin scenarios and then detailed test cases. An optional revision step leverages traceback errors to refine the generated test cases. The final output includes fail-to-pass test cases.
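The optional revision step in Figure 2 can be pictured as a loop that runs the candidate tests inside the Docker environment and feeds any traceback back to the model. The sketch below makes illustrative assumptions: the image name `swe-env:latest`, the test file path `tests/test_generated.py`, and the `call_llm` helper are placeholders rather than the project's actual harness.

```python
import pathlib
import subprocess

REVISE_PROMPT = """The generated tests failed with the traceback below. Revise them so
they exercise the intended behavior.

Traceback:
{traceback}

Current tests:
{tests}
"""


def write_test_file(repo_dir: str, tests: str) -> None:
    # Hypothetical location for the generated test file inside the checked-out repo.
    path = pathlib.Path(repo_dir) / "tests" / "test_generated.py"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(tests)


def run_tests_in_docker(repo_dir: str, image: str = "swe-env:latest"):
    # Mount the repository into the container and run pytest on the generated tests.
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{repo_dir}:/workspace", "-w", "/workspace",
         image, "pytest", "-x", "tests/test_generated.py"],
        capture_output=True, text=True,
    )


def validate_with_revision(repo_dir: str, tests: str, call_llm, max_rounds: int = 3):
    """Keep revising on traceback errors; return passing tests or None."""
    for _ in range(max_rounds):
        write_test_file(repo_dir, tests)
        result = run_tests_in_docker(repo_dir)
        if result.returncode == 0:
            return tests  # tests run cleanly on the patched code
        tests = call_llm(REVISE_PROMPT.format(traceback=result.stdout + result.stderr,
                                              tests=tests))
    return None
```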
We began by crawling metadata for 240k PyPI packages containing GitHub URLs, filtering for repositories with $\text{Stars} \geq 5$ and $\text{PRs} \geq 3$, resulting in a subset of 59k repositories. Due to network constraints and intricate dependency management, we successfully downloaded 10,416 repositories. Following the methodology outlined in prior work, with minor modifications, we extracted 88k instances in total.
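Expressed as code, the repository filter is a simple predicate over the crawled metadata; the field names below (`github_url`, `stars`, `pr_count`) are assumed names for the crawled records, not a documented schema.

```python
from typing import Iterable, Iterator

MIN_STARS = 5   # Stars >= 5
MIN_PRS = 3     # PRs >= 3


def filter_repos(records: Iterable[dict]) -> Iterator[dict]:
    """Keep PyPI packages whose linked GitHub repository meets both thresholds."""
    for rec in records:
        if not rec.get("github_url"):
            continue  # only packages whose metadata links to a GitHub repository
        if rec.get("stars", 0) >= MIN_STARS and rec.get("pr_count", 0) >= MIN_PRS:
            yield rec

# Applied to the ~240k crawled packages, a filter of this kind yields the ~59k repositories.
```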
To refine this dataset, we applied rule-based filtering to adjust patch lengths to fit model context windows and eliminated trivial or irrelevant issues while maintaining diversity. Ultimately, we retained 38k instances from 4,413 repositories as the training set. As shown in Figure 3, over 4,000 repositories contain fewer than five instances, significantly enhancing the dataset’s diversity.
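One way to realize the rule-based filter is a token-budget check on the gold patch plus a triviality heuristic on the number of changed lines; the thresholds and the character-based token estimate below are illustrative choices, not the exact rules used to build the dataset.

```python
MAX_PATCH_TOKENS = 8_000   # illustrative budget so the gold patch fits a model context window
MIN_CHANGED_LINES = 3      # illustrative floor for dropping trivial one-line edits


def approx_tokens(text: str) -> int:
    # Rough proxy of ~4 characters per token; a real pipeline would use the model's tokenizer.
    return len(text) // 4


def keep_instance(instance: dict) -> bool:
    """Rule-based filter: the patch must fit the context budget and not be trivial."""
    patch = instance["patch"]
    changed = sum(
        1 for line in patch.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    if approx_tokens(patch) > MAX_PATCH_TOKENS:
        return False  # too long for the training context window
    if changed < MIN_CHANGED_LINES:
        return False  # trivial edit, unlikely to reflect a meaningful issue
    return True
```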
Figure 3: Distribution of instances per repository in the training dataset. The majority of repositories contribute fewer than five instances, highlighting the dataset's diversity across a wide range of repositories.