DeepYard

Evaluation

28 tools across 4 categories tagged with “evaluation”

🔧 dev tools (19)
RAG Failure Diagnostics Clinic

Diagnose and fix common RAG pipeline failure modes

OSS · Free · Shubham Saboo
Arize Phoenix

Open-source LLM observability with tracing, evaluation, and datasets (8K+ stars)

OSS · Freemium · Trending · Arize AI
Langfuse

Open-source LLM engineering platform with traces, evals, and prompt management (23K+ stars)

OSS · Freemium · Langfuse
promptfoo

Test and evaluate LLM prompts and agents (11K+ stars)

OSS · Free · promptfoo
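promptfoo runs evaluations from a declarative YAML config. A minimal sketch follows; the prompt text, provider ID, and assertion values are illustrative, not taken from the catalog:

```yaml
# promptfooconfig.yaml: hypothetical minimal evaluation
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "LLM evals compare model outputs against expected behavior."
    assert:
      # pass only if the output mentions this substring
      - type: contains
        value: "evals"
```

Running `npx promptfoo eval` in the same directory scores each prompt/provider pair against the assertions and prints a results matrix.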
Braintrust

AI evaluation and experiment tracking platform for production LLM apps

Commercial · Freemium · Braintrust
AgentDS

Benchmark framework measuring AI agent performance vs human experts on data science tasks

Unknown · Free · Research Team
ClawTrap

MITM red-teaming framework for testing autonomous web agent security in real-world scenarios

OSS · Free · Independent Researchers
CUA-Suite

Large-scale human-annotated video dataset for training computer-use agents

OSS · Free · Research Team
EVMbench

Benchmark for evaluating AI agents on smart contract security tasks

OSS · Free · Research Team (Wang et al.)
FinToolBench

Specialized benchmark for evaluating LLM agents on real-world financial tool use and compliance

OSS · Free · Research Collaboration
LangSmith

Observability, tracing, and evaluation platform for LLM applications

Commercial · Freemium · LangChain
LH-Bench

Evaluation framework for measuring long-horizon agent workflows on enterprise tasks

OSS · Free · Research Project
MCP-in-SoS

Security risk assessment framework for evaluating open-source MCP server implementations

OSS · Free · Research
PostTrainBench

Benchmark for evaluating autonomous post-training of LLMs under compute constraints

OSS · Free · Research Collaboration
SlopCodeBench

Benchmark for measuring coding agent performance degradation across iterative tasks

OSS · Free · Research Collaboration
SpecOps

Automated testing framework for GUI-based AI agents in real-world environments

OSS · Free · Research
TDAD

Pre-change impact analysis for AI coding agents using graph-based dependency mapping

OSS · Free · Research Project
TML-Bench

Benchmark for evaluating data science agents on Kaggle-style tabular ML tasks

OSS · Free · Independent
ZEBRAARENA

Diagnostic simulation environment for testing LLM reasoning and tool use capabilities

OSS · Free · Research Project
🤖 agents (5)

🧱 frameworks (3)

💬 prompts (1)