Benchmark Example
Integrate SWE-Bench into our benchmark framework
SWE-Bench
Define SweBenchClient
```python
from typing import Any, Dict

from benchflow import BenchClient  # assumed import path for the BenchFlow SDK


class SweBenchClient(BenchClient):
    def __init__(self, agent_url: str):
        super().__init__(agent_url)

    def prepare_environment(self, state_update: Dict[str, Any]) -> Dict[str, Any]:
        # Forward the raw SWE-Bench task to the agent as its environment info
        return {"env_info": state_update}

    def parse_action(self, raw_action: str) -> Dict[str, Any]:
        # Treat the agent's raw reply as the model patch to evaluate
        return {"model_patch": raw_action}
```
Modify the entrypoint of SWE-Bench so it can get results from agents
```python
def main(
    dataset_name: str,
    split: str,
    instance_ids: list,
    predictions_path: str,
    agent_url: str,
    max_workers: int,
    force_rebuild: bool,
    cache_level: str,
    clean: bool,
    open_file_limit: int,
    run_id: str,
    timeout: int,
    namespace: str | None,
    instance_image_tag: str = 'latest',
    rewrite_reports: bool = False,
    report_dir: str = '.',
    modal_name_or_path: str = "self_model",
    modal: bool = False
):
    """
    Run evaluation harness for the given dataset and predictions.
    """
    # original code

    ###
    # Integrate Benchflow Here
    agent = SweBenchClient(agent_url)
    # Ask the connected agent for a patch for each task in the dataset
    for task in full_dataset:
        state_update = task
        prediction = agent.get_action(state_update)
        prediction['instance_id'] = task[KEY_INSTANCE_ID]
        prediction['model_name_or_path'] = modal_name_or_path
        predictions[task[KEY_INSTANCE_ID]] = prediction
    ###

    # original code
```
Add a script as the entrypoint for the Docker image
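A minimal sketch of what that entrypoint script could look like, assuming BenchFlow passes run parameters into the container as environment variables and that the harness CLI was extended with an `--agent_url` flag to match the modified `main()` above; the variable names and defaults are illustrative, not the exact contents of the published image.

```bash
#!/bin/bash
# entrypoint.sh -- illustrative sketch; env var names and --agent_url are assumptions
set -e

python -m swebench.harness.run_evaluation \
    --dataset_name "${DATASET_NAME:-princeton-nlp/SWE-bench_Lite}" \
    --split "${SPLIT:-test}" \
    --run_id "${RUN_ID:-benchflow}" \
    --max_workers "${MAX_WORKERS:-1}" \
    --agent_url "${AGENT_URL}"
```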
Add a Dockerfile to package the benchmark
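A sketch of such a Dockerfile under the same assumptions; the base image, install steps, and script name are illustrative rather than the exact file behind the published image.

```dockerfile
# Illustrative Dockerfile -- base image and install steps are assumptions
FROM python:3.11-slim

WORKDIR /app

# Copy the modified SWE-Bench source, including SweBenchClient and the patched entrypoint
COPY . /app

# Install SWE-Bench itself plus the BenchFlow SDK (assumed pip package name: benchflow)
RUN pip install --no-cache-dir -e . && pip install --no-cache-dir benchflow

RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
```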
Upload SWE-Bench to dockerhub at kirk2000/benchflow:swebench-v1
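Building and pushing use the standard Docker CLI; only the image tag comes from this guide.

```bash
# Requires `docker login` with push access to the kirk2000 namespace
docker build -t kirk2000/benchflow:swebench-v1 .
docker push kirk2000/benchflow:swebench-v1
```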
Integrate Swebench Into Benchflow
Import the necessary packages
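A sketch of the imports this side of the integration typically needs; the BenchFlow class names and module path (`BaseBench`, `BaseBenchConfig`) are assumptions, not confirmed API.

```python
# Illustrative imports -- the BenchFlow class names and module path are assumptions
from benchflow import BaseBench, BaseBenchConfig
```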
Define a benchmark config that includes the required and optional parameters for running the benchmark.
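A minimal config sketch, assuming a `BaseBenchConfig`-style base class that declares required and optional environment variables; the attribute and variable names here are illustrative.

```python
class SwebenchConfig(BaseBenchConfig):
    """Declares the parameters a SWE-Bench run expects (illustrative sketch)."""

    # Parameters the caller must supply for every run
    required_env = ["INSTANCE_IDS"]

    # Parameters that fall back to defaults when unset
    optional_env = ["MAX_WORKERS", "RUN_ID", "TIMEOUT"]
```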