DeepYard

LH-Bench

Evaluation framework for measuring agent performance on long-horizon enterprise workflows

Open Source · Free

About

LH-Bench is a research framework for evaluating autonomous agents on complex, multi-step enterprise workflows. Unlike traditional benchmarks that reduce performance to a binary pass/fail metric, it assesses intermediate artifacts, multi-tool coordination, and alignment with organizational goals. It is particularly useful for testing agents that handle subjective business tasks requiring multiple interactions and quality judgments over extended timeframes.
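As a rough sketch of what rubric-based, non-binary scoring can look like in practice, the snippet below grades every intermediate artifact, checks tool coverage, and combines the dimensions with weights. All names, signatures, and weights are illustrative assumptions, not LH-Bench's actual API:

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """One step of an agent trajectory: the tool called and what it produced."""
    tool: str
    artifact: str  # e.g. a draft email, a SQL query, a report section


def score_trajectory(steps: list[StepResult],
                     expected_tools: set[str],
                     grade_artifact,
                     goal_alignment: float) -> float:
    """Weighted score in [0, 1] instead of a single pass/fail bit.

    grade_artifact: callable mapping an artifact to a quality score in [0, 1],
    e.g. a human rater or an LLM judge. goal_alignment is supplied the same way.
    """
    if not steps:
        return 0.0
    # Grade every intermediate artifact, not just the final output.
    quality = sum(grade_artifact(s.artifact) for s in steps) / len(steps)
    # Fraction of the required tools the agent actually coordinated.
    used = {s.tool for s in steps}
    coordination = len(used & expected_tools) / max(len(expected_tools), 1)
    # Hypothetical weights for the three dimensions named in the description.
    return 0.5 * quality + 0.3 * coordination + 0.2 * goal_alignment
```

A real harness would likely also account for the ordering and redundancy of tool calls; this sketch only illustrates the move from pass/fail to partial credit over a trajectory.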

Details

Type: Framework
Integrations:
Language:

Tags

evaluation · autonomous · multi-agent · tool-use · open-source · framework