DeepYard

TML-Bench

Benchmark for evaluating data science agents on Kaggle-style tabular ML tasks

Open Source · Free

About

TML-Bench is an evaluation benchmark specifically designed to test autonomous coding agents on tabular machine learning tasks. It simulates Kaggle-style competitions with varying time budgets, assessing agents' ability to handle end-to-end data science workflows including data preprocessing, feature engineering, model selection, and hyperparameter tuning. The benchmark measures both correctness and reliability under realistic resource constraints, providing standardized metrics for comparing agent performance on practical ML tasks.
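The workflow an agent must carry out on each task can be sketched as a small end-to-end pipeline. The sketch below is illustrative only (it is not TML-Bench's actual harness): it uses a synthetic scikit-learn dataset as a stand-in for a Kaggle-style table, and a hypothetical `TIME_BUDGET_S` constant to mimic the benchmark's varying time budgets.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

TIME_BUDGET_S = 60  # illustrative per-task wall-clock budget

# Synthetic stand-in for a Kaggle-style tabular task.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

start = time.time()
best_model, best_score = None, -np.inf

# Model selection + light hyperparameter tuning, stopping when the budget runs out.
candidates = [
    (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    (GradientBoostingClassifier(random_state=0), {"clf__n_estimators": [50, 100]}),
]
for model, grid in candidates:
    if time.time() - start > TIME_BUDGET_S:
        break  # respect the time budget
    # Preprocessing (scaling) and the model live in one pipeline so that
    # cross-validation fits the scaler only on training folds.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", model)])
    search = GridSearchCV(pipe, grid, cv=3)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_model, best_score = search.best_estimator_, search.best_score_

# Final score on held-out data, the kind of metric the benchmark would record.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"held-out accuracy: {test_acc:.3f}")
```

A real agent run would replace the synthetic data with the task's train/test split and report the benchmark's standardized metric rather than raw accuracy.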

Details

Type
Integrations
Language

Tags

evaluation · coding-agent · autonomous · python · open-source