DeepYard

SlopCodeBench

Benchmark for measuring coding agent performance degradation across iterative tasks

Open Source · Free

About

A language-agnostic evaluation benchmark designed to measure how well AI coding agents maintain code quality over extended, iterative development sessions. It features 20 real-world programming problems with 93 checkpoints that track code quality evolution and detect performance degradation patterns. Useful for researchers and developers building autonomous coding agents who need to assess long-horizon task performance beyond single-shot code generation.

Details

Type
Integrations
Language

Tags

evaluation, coding-agent, open-source, autonomous, framework