DeepYard

SWE-Explore

Benchmark for evaluating coding agents' repository exploration and code understanding abilities

Open Source, Free

About

SWE-Explore is a research benchmark that measures fine-grained capabilities of coding agents in repository exploration. Unlike traditional benchmarks, which report only a binary pass/fail outcome, it evaluates specific skills: repository understanding, context retrieval, code localization, and bug diagnosis. It is designed to help researchers and developers assess how well AI agents navigate and comprehend codebases.

Tags

evaluation, coding-agent, open-source, framework