Benchmark Example

This example walks through integrating SWE-Bench into our benchmark framework, BenchFlow.

SWE-Bench

Define SweBenchClient

from typing import Any, Dict

from benchflow import BenchClient  # BenchClient is provided by the benchflow package


class SweBenchClient(BenchClient):
    def __init__(self, agent_url: str):
        super().__init__(agent_url)

    def prepare_environment(self, state_update: Dict[str, Any]) -> Dict[str, Any]:
        # Wrap the SWE-Bench task instance so the agent receives it as env_info.
        return {"env_info": state_update}

    def parse_action(self, raw_action: str) -> Dict[str, Any]:
        # Treat the agent's raw reply as the patch in SWE-Bench's prediction format.
        return {
            'model_patch': raw_action
        }
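
BenchClient here comes from the benchflow package: its get_action(...) call (used in the modified entrypoint below) sends the prepared environment to the agent served at agent_url and runs the raw reply through parse_action, so these two overrides are all the SWE-Bench-specific logic the client needs.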

Modify the entrypoint of SWE-Bench so it collects predictions from the agent

def main(
        dataset_name: str,
        split: str,
        instance_ids: list,
        predictions_path: str,
        agent_url: str,
        max_workers: int,
        force_rebuild: bool,
        cache_level: str,
        clean: bool,
        open_file_limit: int,
        run_id: str,
        timeout: int,
        namespace: str | None,
        instance_image_tag: str = 'latest',
        rewrite_reports: bool = False,
        report_dir: str = '.',
        model_name_or_path: str = "self_model",
        modal: bool = False
    ):
    """
    Run evaluation harness for the given dataset and predictions.
    """
    # original code
    
    ###
    # Integrate BenchFlow here: build the predictions by querying the agent
    # instead of loading them from predictions_path.
    agent = SweBenchClient(agent_url)
    predictions = {}
    for task in full_dataset:
        # Forward each SWE-Bench task instance to the agent as a state update.
        state_update = task
        prediction = agent.get_action(state_update)
        prediction['instance_id'] = task[KEY_INSTANCE_ID]
        prediction['model_name_or_path'] = model_name_or_path
        predictions[task[KEY_INSTANCE_ID]] = prediction
    ###
    
    # original code
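
With this change, the predictions dict handed to the rest of the harness is filled by querying the agent at agent_url rather than by reading a predictions file, and the original evaluation and reporting code that follows runs unchanged.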
    

Add a script as the entrypoint for the Docker image
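
For instance, the entrypoint can be a small Python script that reads its parameters from environment variables and calls the modified main(). This is only a sketch: the environment variable names (AGENT_URL, DATASET_NAME, SPLIT, ...) and the defaults chosen here are assumptions, not something SWE-Bench or BenchFlow prescribes.

# entrypoint.py: a sketch; the env variable names and defaults are assumptions.
import os

# Assumes the modified main() lives in SWE-Bench's run_evaluation module.
from swebench.harness.run_evaluation import main

if __name__ == "__main__":
    instance_ids = os.environ.get("INSTANCE_IDS", "")
    main(
        dataset_name=os.environ.get("DATASET_NAME", "princeton-nlp/SWE-bench_Lite"),
        split=os.environ.get("SPLIT", "test"),
        instance_ids=instance_ids.split(",") if instance_ids else [],
        predictions_path="",  # unused: predictions now come from the agent
        agent_url=os.environ["AGENT_URL"],  # where BenchFlow serves the agent
        max_workers=int(os.environ.get("MAX_WORKERS", "4")),
        force_rebuild=False,
        cache_level="env",
        clean=False,
        open_file_limit=4096,
        run_id=os.environ.get("RUN_ID", "benchflow"),
        timeout=int(os.environ.get("TIMEOUT", "1800")),
        namespace=None,
        model_name_or_path=os.environ.get("MODEL_NAME", "self_model"),
    )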

Add a Dockerfile to package the benchmark

Upload the SWE-Bench image to Docker Hub as kirk2000/benchflow:swebench-v1

Integrate SWE-Bench into BenchFlow

Import the necessary packages

Define the benchmark config, including the required and optional params needed to run the benchmark (a sketch of both steps follows).
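
A minimal sketch of what this might look like. The import path and every BenchFlow name below (BaseBench, BaseBenchConfig, required_env, optional_env, get_config, get_image_name) are assumptions made for illustration; check the BenchFlow source for the actual interface.

# A sketch only: the BenchFlow class and field names below are assumed,
# not confirmed API; they illustrate the shape of the integration.
from typing import Any, Dict

from benchflow import BaseBench, BaseBenchConfig  # assumed import path


class SwebenchConfig(BaseBenchConfig):
    # Params the caller must supply to run SWE-Bench.
    required_env = ["AGENT_URL", "DATASET_NAME", "SPLIT"]
    # Params that fall back to defaults inside the image.
    optional_env = ["INSTANCE_IDS", "MAX_WORKERS", "RUN_ID", "TIMEOUT"]


class SwebenchBenchmark(BaseBench):
    def get_config(self, params: Dict[str, Any], task_id: str) -> SwebenchConfig:
        return SwebenchConfig(params)

    def get_image_name(self) -> str:
        # The image built and pushed in the steps above.
        return "kirk2000/benchflow:swebench-v1"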
