🏅 DABStep Leaderboard

The Data Agent Benchmark for Multi-step Reasoning (DABStep) is looking to measure and push the state-of-the-art in Data Analysis by LLMs. The benchmark is composed of ~450 data analysis questions (Dataset Link) centered around 1 or more documents that agents will have to understand and cross reference in order to answer correctly.

We have set up a notebook to quickly get an agent baseline using the free Huggingface Inference API: Colab Notebook

Check out the official technical reports here: Adyen Report Hugging Face Report

Join the discussion on the discord server!

Reproduce the baseline results with the agent code we open sourced here


Gemini 2.5 Pro Reasoning Prompt Baseline	94.44	57.67	Amity Solutions Thailand \| user aongwachi	Link	gemini-2.5-pro-preview-05-06	19-09-2025


CambioML energent.ai DS Agent	94.44	57.67	CambioML \| user lingjiekong	Link	GPT-5	19-09-2025
DS-STAR	87.5	45.24	Google Cloud AI Research	Link	Gemini-2.5-Pro	08-07-2025
Amity DA Agent v0.1	80.56	41.01	Amity Solutions Thailand \| user aongwachi	Link	gemini-2.5-pro-preview-05-06	31-05-2025
AgenticData	94.44	40.48	Tsinghua University	Link	Qwen 3	03-08-2025
Mphasis-I2I-Agents	80.56	28.04	Mphasis Limited \| user PranavSatheesan		claude-3-5-sonnet-20241022	10-04-2025
Claude 4 Sonnet ReACT Baseline	81.94	19.84	Hugging Face	Link	claude-sonnet-4-20250514	28-05-2025
Open Data Scientist	84.72	16.4	TogetherAI \| user federicotogether	Link	DeepSeek-V3	24-06-2025
o4-mini Reasoning Prompt Baseline	76.39	14.55	Hugging Face	Link	OpenAI o4-mini	22-04-2025
Claude 3.7 Sonnet ReACT Baseline	75	13.76	Hugging Face	Link	claude-3-7-sonnet-20250219	07-04-2025
o3-mini Reasoning Prompt Baseline	72.22	13.76	Adyen	Link	OpenAI o3-mini	02-02-2025
Gemini 2.5 Pro Reasoning Prompt Baseline	66.67	12.7	Hugging Face	Link	Gemini 2.5 Pro	05-04-2025
GPT 4.1 ReACT Baseline	80.56	12.43	Hugging Face	Link	OpenAI GPT 4.1	22-04-2025
o1 Reasoning Prompt Baseline	69.44	11.11	Adyen	Link	OpenAI o1	31-01-2025
Gemini Data Science Agent	61.11	9.79	Google		Gemini 2.0 Flash	10-02-2025
Claude 3.5 Sonnet ReACT Baseline	77.78	9.26	Adyen	Link	claude-3-5-sonnet-latest	23-01-2025
GPT 4.1-mini ReACT baseline	77.78	8.99	Hugging Face	Link	OpenAI GPT 4.1-mini	22-04-2025
Llama 4 Maverick ReACT Baseline	75	8.73	Hugging Face	Link	Llama 4 Maverick	09-04-2025
GPT 4o ReACT Baseline	66.67	6.08	Adyen	Link	GPT 4o	23-01-2025
Deepseek V3 ReACT Baseline	66.67	5.56	Adyen	Link	Deepseek v3	23-01-2025
Claude 3.5 Haiku ReACT Baseline	77.78	5.03	Adyen	Link	claude-3-5-haiku-20241022	24-01-2025
Llama 3.3 70B ReACT Baseline	68.06	3.7	Adyen	Link	Llama 3.3 70B Instruct	23-01-2025
GPT 4o-mini ReACT Baseline	69.44	3.44	Adyen	Link	GPT 40-mini	24-01-2025
Qwen Coder ReAct Baseline	54.17	3.44	Adyen	Link	Qwen 2.5 Coder 32B Instruct	28-01-2025
Llama 4 Scout ReACT Baseline	52.78	1.85	Hugging Face	Link	Llama 4 Scout	09-04-2025
Llama 3.2 1B ReACT Baseline	1.39	0	Adyen	Link	Llama 3.2 1B Instruct	23-01-2025

Benchmark Validation Standards

All submissions are initially added to the Unvalidated Leaderboard. The Adyen/Hugging Face team will attempt, with the participation of the respective submission team, to validate any entries that rank within the top 10.

Validation confirms that a submission's results were achieved using a novel approach involving data analysis agents. To support validation, participants must provide clear evidence of their methodology. This can be done in one of the following ways:

Preferred: Share a research paper or blog post along with the source code to enable full reproducibility.
Submit a complete dataset that includes reasoning traces demonstrating how the results were produced.
Provide access to an API that the Adyen/Hugging Face team can use to independently validate and reproduce results.

Our goal with DABStep is to foster rapid progress and collaboration in the open research community. We strongly encourage participants to share their work and open-source their code whenever possible.

Once validated, submissions will be featured and showcased on the Validated Leaderboard, including annotations indicating the validation method used (e.g., traces, code, API).

Submissions

Scores are expressed as the percentage of correct answers.

Each question calls for an answer that is either a string (one or a few words), a number, or a comma separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi exact match between a model’s answer and the ground truth (up to some normalization that is tied to the “type” of the ground truth).

We expect submissions to be json-line files with the following format. Mandatory fields are: task_id and agent_answer. However, reasoning_trace is optional:

{"task_id": "task_id_1", "agent_answer": "Answer 1 from your agent", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "agent_answer": "Answer 2 from your agent", "reasoning_trace": "The different steps by which your model reached answer 2"}

Our scoring function can be found here.

Agent name

Model family

System prompt example

Repo URL with agent code

Organisation

Contact email (will be stored privately, & used if there is an issue with your submission)

File