Evaluation
36 tools across 4 categories tagged with “evaluation”
dev tools (25)
RAG Failure Diagnostics Clinic
Diagnose and fix common RAG pipeline failure modes
Langfuse
Open-source LLM engineering platform — traces, evals, prompt management — 23K+ stars
promptfoo
Test and evaluate LLM prompts and agents — 11K+ stars
Arize Phoenix
Open-source LLM observability with tracing, evaluation, and datasets — 8K+ stars
Braintrust
AI evaluation and experiment tracking platform for production LLM apps
LangSmith
Observability, tracing, and evaluation platform for LLM applications
AgentDS
Benchmark framework measuring AI agent performance vs human experts on data science tasks
CausalFlow
Causal debugging framework that converts failed LLM agent traces into minimal counterfactual fixes
ClawTrap
MITM red-teaming framework for testing autonomous web agent security in real-world scenarios
CUA-Gym
Reinforcement learning benchmark suite for training GUI automation agents with verifiable rewards
CUA-Suite
Large-scale human-annotated video dataset for training computer-use agents
DeltaBox
Millisecond-level checkpoint/rollback system for stateful AI agent exploration
Echo
Learning system that refines AI agents through user feedback on interaction logs
EVMbench
Benchmark for evaluating AI agents on smart contract security tasks
FinToolBench
Specialized benchmark for evaluating LLM agents on real-world financial tool use and compliance
JetBrains AI Toolkit
In-IDE LLM tracing and evaluation for JetBrains developers
LH-Bench
Evaluation framework for measuring long-horizon agent workflows on enterprise tasks
MCP-in-SoS
Security risk assessment framework for evaluating open-source MCP server implementations
PostTrainBench
Benchmark for evaluating autonomous post-training of LLMs under compute constraints
SlopCodeBench
Benchmark for measuring coding agent performance degradation across iterative tasks
SpecOps
Automated testing framework for GUI-based AI agents in real-world environments
TDAD
Pre-change impact analysis for AI coding agents using graph-based dependency mapping
TML-Bench
Benchmark for evaluating data science agents on Kaggle-style tabular ML tasks
Veritas
LLM-powered binary analysis for detecting memory corruption vulnerabilities in stripped code
ZEBRAARENA
Diagnostic simulation environment for testing LLM reasoning and tool use capabilities
frameworks (6)
Ares
Adaptive reasoning framework that optimizes LLM inference costs by dynamically scaling effort
Automat
Autonomous feature engineering for materials science via LLM-driven genetic programming
CAST
Case-driven framework that learns from execution trajectories to optimize LLM tool use
EvoTool
Evolutionary framework that self-optimizes tool-use policies for long-horizon LLM agent tasks
TITAN Agent
Full-featured TypeScript agent framework with multi-agent orchestration and Mission Control UI
Vellum
Visual AI workflow builder with SDK and enterprise governance
agents (4)
ImageEdit-R1
RL-powered multi-agent system for complex, instruction-based image editing
KARL
RL-trained enterprise search agents with multi-regime evaluation benchmark
RepoLaunch
Automated repository testing agent for cross-language SWE dataset construction
ReViSQL
Text-to-SQL agent achieving human-level accuracy through step-by-step reasoning