DeepYard

Evaluation

36 tools across 4 categories tagged with “evaluation”

🔧 dev tools (25)

RAG Failure Diagnostics Clinic

Diagnose and fix common RAG pipeline failure modes

OSS · Free
Shubham Saboo
111.8K · 3d ago · 78

Langfuse

Open-source LLM engineering platform — traces, evals, prompt management — 23K+ stars

OSS · freemium
Langfuse
28.0K · today · 157

promptfoo

Test and evaluate LLM prompts and agents — 11K+ stars

OSS · Free
promptfoo
21.6K · today · 286

Arize Phoenix

Open-source LLM observability with tracing, evaluation, and datasets — 8K+ stars

OSS · freemium
Arize AI
9.8K · 250.0K/w · today · 173

Braintrust

AI evaluation and experiment tracking platform for production LLM apps

Commercial · freemium
Braintrust
1.2K · today · 87

LangSmith

Observability, tracing, and evaluation platform for LLM applications

Commercial · freemium
LangChain
901 · today · 84

AgentDS

Benchmark framework measuring AI agent performance vs human experts on data science tasks

unknown · Free
Research Team

CausalFlow

Causal debugging framework that converts failed LLM agent traces into minimal counterfactual fixes

OSS · Free
Research

ClawTrap

MITM red-teaming framework for testing autonomous web agent security in real-world scenarios

OSS · Free
Independent Researchers

CUA-Gym

Reinforcement learning benchmark suite for training GUI automation agents with verifiable rewards

OSS · Free
Research

CUA-Suite

Large-scale human-annotated video dataset for training computer-use agents

OSS · Free
Research Team

DeltaBox

Millisecond-level checkpoint/rollback system for stateful AI agent exploration

OSS · Free
Research

Echo

Learning system that refines AI agents through user feedback on interaction logs

OSS · Free
Research

EVMbench

Benchmark for evaluating AI agents on smart contract security tasks

OSS · Free
Research Team (Wang et al.)

FinToolBench

Specialized benchmark for evaluating LLM agents on real-world financial tool use and compliance

OSS · Free
Research Collaboration

JetBrains AI Toolkit

In-IDE LLM tracing and evaluation for JetBrains developers

Commercial · paid
JetBrains

LH-Bench

Evaluation framework for measuring long-horizon agent workflows on enterprise tasks

OSS · Free
Research Project

MCP-in-SoS

Security risk assessment framework for evaluating open-source MCP server implementations

OSS · Free
Research

PostTrainBench

Benchmark for evaluating autonomous post-training of LLMs under compute constraints

OSS · Free
Research Collaboration

SlopCodeBench

Benchmark for measuring coding agent performance degradation across iterative tasks

OSS · Free
Research Collaboration

SpecOps

Automated testing framework for GUI-based AI agents in real-world environments

OSS · Free
Research

TDAD

Pre-change impact analysis for AI coding agents using graph-based dependency mapping

OSS · Free
Research Project

TML-Bench

Benchmark for evaluating data science agents on Kaggle-style tabular ML tasks

OSS · Free
Independent

Veritas

LLM-powered binary analysis for detecting memory corruption vulnerabilities in stripped code

OSS · Free
Columbia University / King's College London

ZEBRAARENA

Diagnostic simulation environment for testing LLM reasoning and tool use capabilities

OSS · Free
Research Project
🧱 frameworks (6)

🤖 agents (4)

💬 prompts (1)