Evaluation
Tools across multiple categories tagged with “evaluation”
dev tools (19)
RAG Failure Diagnostics Clinic
Diagnose and fix common RAG pipeline failure modes
Arize Phoenix
Open-source LLM observability with tracing, evaluation, and datasets — 8K+ stars
Langfuse
Open-source LLM engineering platform — traces, evals, prompt management — 23K+ stars
promptfoo
Test and evaluate LLM prompts and agents — 11K+ stars
Braintrust
AI evaluation and experiment tracking platform for production LLM apps
AgentDS
Benchmark framework measuring AI agent performance vs human experts on data science tasks
ClawTrap
MITM red-teaming framework for testing autonomous web agent security in real-world scenarios
CUA-Suite
Large-scale human-annotated video dataset for training computer-use agents
EVMbench
Benchmark for evaluating AI agents on smart contract security tasks
FinToolBench
Specialized benchmark for evaluating LLM agents on real-world financial tool use and compliance
LangSmith
Observability, tracing, and evaluation platform for LLM applications
LH-Bench
Evaluation framework for measuring long-horizon agent workflows on enterprise tasks
MCP-in-SoS
Security risk assessment framework for evaluating open-source MCP server implementations
PostTrainBench
Benchmark for evaluating autonomous post-training of LLMs under compute constraints
SlopCodeBench
Benchmark for measuring coding agent performance degradation across iterative tasks
SpecOps
Automated testing framework for GUI-based AI agents in real-world environments
TDAD
Pre-change impact analysis for AI coding agents using graph-based dependency mapping
TML-Bench
Benchmark for evaluating data science agents on Kaggle-style tabular ML tasks
ZEBRAARENA
Diagnostic simulation environment for testing LLM reasoning and tool use capabilities
agents (5)
ImageEdit-R1
RL-powered multi-agent system for complex, instruction-based image editing
KARL
RL-trained enterprise search agents with multi-regime evaluation benchmark
RepoLaunch
Automated repository testing agent for cross-language SWE dataset construction
ReViSQL
Text-to-SQL agent achieving human-level accuracy through step-by-step reasoning
SimLab
Adaptive simulation platform for building, evaluating, and refining long-horizon AI agents
frameworks (3)
Vellum
Visual AI workflow builder with SDK and enterprise governance
Ares
Adaptive reasoning framework that optimizes LLM inference costs by dynamically scaling effort
EvoTool
Evolutionary framework that self-optimizes tool-use policies for long-horizon LLM agent tasks