Evaluation

36 tools across 4 categories tagged with “evaluation”

🔧 dev tools (25)

RAG Failure Diagnostics Clinic

Diagnose and fix common RAG pipeline failure modes

OSS · Free

Shubham Saboo

111.8K · 3d ago · 78

Langfuse

Open-source LLM engineering platform — traces, evals, prompt management — 23K+ stars

OSS · freemium

Langfuse

28.0K · today · 157

promptfoo

Test and evaluate LLM prompts and agents — 11K+ stars

OSS · Free

promptfoo

21.6K · today · 286

Arize Phoenix

Open-source LLM observability with tracing, evaluation, and datasets — 8K+ stars

OSS · freemium

Arize AI

9.8K · 250.0K/w · today · 173

Braintrust

AI evaluation and experiment tracking platform for production LLM apps

Commercial · freemium

Braintrust

1.2K · today · 87

LangSmith

Observability, tracing, and evaluation platform for LLM applications

Commercial · freemium

LangChain

901 · today · 84

AgentDS

Benchmark framework measuring AI agent performance vs human experts on data science tasks

unknown · Free

Research Team

CausalFlow

Causal debugging framework that converts failed LLM agent traces into minimal counterfactual fixes

OSS · Free

Research

ClawTrap

MITM red-teaming framework for testing autonomous web agent security in real-world scenarios

OSS · Free

Independent Researchers

CUA-Gym

Reinforcement learning benchmark suite for training GUI automation agents with verifiable rewards

OSS · Free

Research

CUA-Suite

Large-scale human-annotated video dataset for training computer-use agents

OSS · Free

Research Team

DeltaBox

Millisecond-level checkpoint/rollback system for stateful AI agent exploration

OSS · Free

Research

Echo

Learning system that refines AI agents through user feedback on interaction logs

OSS · Free

Research

EVMbench

Benchmark for evaluating AI agents on smart contract security tasks

OSS · Free

Research Team (Wang et al.)

FinToolBench

Specialized benchmark for evaluating LLM agents on real-world financial tool use and compliance

OSS · Free

Research Collaboration

JetBrains AI Toolkit

In-IDE LLM tracing and evaluation for JetBrains developers

Commercial · paid

JetBrains

LH-Bench

Evaluation framework for measuring long-horizon agent workflows on enterprise tasks

OSS · Free

Research Project

MCP-in-SoS

Security risk assessment framework for evaluating open-source MCP server implementations

OSS · Free

Research

PostTrainBench

Benchmark for evaluating autonomous post-training of LLMs under compute constraints

OSS · Free

Research Collaboration

SlopCodeBench

Benchmark for measuring coding agent performance degradation across iterative tasks

OSS · Free

Research Collaboration

SpecOps

Automated testing framework for GUI-based AI agents in real-world environments

OSS · Free

Research

TDAD

Pre-change impact analysis for AI coding agents using graph-based dependency mapping

OSS · Free

Research Project

TML-Bench

Benchmark for evaluating data science agents on Kaggle-style tabular ML tasks

OSS · Free

Independent

Veritas

LLM-powered binary analysis for detecting memory corruption vulnerabilities in stripped code

OSS · Free

Columbia University / King's College London

ZEBRAARENA

Diagnostic simulation environment for testing LLM reasoning and tool use capabilities

OSS · Free

Research Project

🧱 frameworks (6)

Ares

Adaptive reasoning framework that optimizes LLM inference costs by dynamically scaling effort

OSS · Free

Research Collaboration

Automat

Autonomous feature engineering for materials science via LLM-driven genetic programming

OSS · Free

Trinity College Dublin

CAST

Case-driven framework that learns from execution trajectories to optimize LLM tool use

OSS · Free

Research Collaboration

EvoTool

Evolutionary framework that self-optimizes tool-use policies for long-horizon LLM agent tasks

OSS · Free

Research Team (Yang et al.)

TITAN Agent

Full-featured TypeScript agent framework with multi-agent orchestration and Mission Control UI

OSS · Free

Djtony707

Vellum

Visual AI workflow builder with SDK and enterprise governance

Commercial · freemium

Vellum AI

🤖 agents (4)

ImageEdit-R1

RL-powered multi-agent system for complex, instruction-based image editing

OSS · Free

Research Collaboration

KARL

RL-trained enterprise search agents with multi-regime evaluation benchmark

OSS · Free

Research Team (Chang et al.)

RepoLaunch

Automated repository testing agent for cross-language SWE dataset construction

OSS · Free

Research Team (Li et al.)

ReViSQL

Text-to-SQL agent achieving human-level accuracy through step-by-step reasoning

unknown · Free

Research Team

💬 prompts (1)

RAG Retrieval Grading Prompt

Prompt for evaluating whether retrieved documents answer the user's question

OSS · Free

Shubham Saboo

111.8K · 3d ago · 78
