🏅 DABStep Leaderboard
The Data Agent Benchmark for Multi-step Reasoning (DABStep) aims to measure and advance the state of the art in data analysis by LLMs. The benchmark consists of ~450 data analysis questions (Dataset Link), each centered on one or more documents that agents must understand and cross-reference to answer correctly.
We have set up a notebook to quickly get an agent baseline using the free Hugging Face Inference API: Colab Notebook
Check out the official technical reports here: Adyen Report, Hugging Face Report
Join the discussion on the Discord server!
Reproduce the baseline results with the agent code we open-sourced here
Agent | Easy Level Accuracy (%) | Hard Level Accuracy (%) | Organization | User | Model Family | Date |
Gemini 2.5 Pro Reasoning Prompt Baseline | 80.56 | 28.04 | Mphasis Limited | user PranavSatheesan | Qwen 2.5 Coder 32B Instruct | 10-04-2025 |
qq da test ci agent qwen3 sft prompt | 80.56 | 41.01 | qq da test ci agent qwen3 sft prompt | user ricklovelisa | gemini-2.5-pro-preview-05-06 | 31-05-2025 |
Amity DA Agent v0.1 | 80.56 | 41.01 | Amity Solutions Thailand | user aongwachi | gemini-2.5-pro-preview-05-06 | 31-05-2025 | |
DICE | 75 | 27.25 | Microsoft | user mk3328 | o3-mini | 17-04-2025 | |
GPT4.1-TestAgent-0530 | 73.61 | 25.66 | org1 | user metalCan | GPT4.1 | 30-05-2025 | |
gpt4.1-ALL | 4.17 | 25.66 | org1 | user metalCan | GPT | 26-05-2025 | |
Test-user-1 | 81.94 | 24.87 | test-user-org | user PranavSatheesan | test-user-1 | 26-03-2025 | |
test-user-submission-1 | 81.94 | 24.87 | test-user-org | user PranavSatheesan | claude-sonnet | 26-03-2025 | |
foo3 | 75 | 24.6 | personal | user sijiawang0221 | llm | 20-05-2025 | |
qq da test ci agent cl | 79.17 | 20.63 | qq da test ci agent cl | user ricklovelisa | cl | 26-05-2025 | |
foo_think | 81.94 | 18.52 | personal | user szhang37 | claude | 18-04-2025 | |
DA-Agent-Anonymous-0419 | 69.44 | 18.25 | Anonymous | user Johnnam513 | Anonymous | 22-04-2025 | |
DA-Agent-Anonymous-0419-2 | 70.83 | 16.93 | Anonymous | user Johnnam513 | Anonymous | 22-04-2025 | |
newagent | 79.17 | 14.55 | federicotogether | user federicotogether | deepseek | 19-05-2025 | |
MiniReactDS-Full | 83.33 | 14.02 | federicotogether | user federicotogether | deepseek | 14-05-2025 | |
test_foo | 68.06 | 14.02 | personal | user szhang37 | claude | 04-04-2025 | |
test_foo2 | 68.06 | 14.02 | personal | user lanwuwei | claude | 09-04-2025 | |
DA-Agent-Anonymous-0424 | 76.39 | 13.23 | Anonymous | user Johnnam513 | Anonymous | 24-04-2025 | |
foo | 70.83 | 12.96 | personal | user sijiawang0221 | claude | 18-03-2025 | |
gpt4.1-200 | 4.17 | 12.96 | org1 | user metalCan | gpt | 26-05-2025 | |
test_bar | 76.39 | 12.7 | personal | user szhang37 | claude | 06-04-2025 | |
foo2 | 70.83 | 12.7 | personal | user sijiawang0221 | claude | 18-03-2025 | |
bar | 66.67 | 12.43 | personal | user sijiawang0221 | claude | 18-03-2025 | |
qq da test ci agent ds new prompt | 70.83 | 11.64 | qq da test ci agent ds new prompt | user ricklovelisa | ds | 28-05-2025 | |
qq da test ci agent db | 58.33 | 10.58 | qq da test ci agent | user ricklovelisa | db | 26-05-2025 | |
qq da test ci agent db new prompt | 30.56 | 9.79 | qq da test ci agent db new prompt | user ricklovelisa | db | 28-05-2025 | |
gpt-4.1 | 70.83 | 8.99 | ONE LAB | user yiliu051016 | openai | 01-05-2025 | |
qq da test ci agent ds | 72.22 | 7.94 | qq da test ci agent ds | user ricklovelisa | ds | 26-05-2025 | |
xyx_expr | 72.22 | 7.67 | personal | user xyxlyy | doubao | 07-05-2025 | |
ds-rookie-v0 | 72.22 | 7.14 | ds-org | user shangzhu | deepseek | 29-04-2025 | |
xyx_0425 | 73.61 | 6.88 | personal | user xyxlyy | doubao | 25-04-2025 | |
My Agent V1 S2 | 69.44 | 6.35 | Individual | user arkouda | Mini Family | 22-03-2025 | |
Agent V1 P2 | 69.44 | 6.35 | Org V1 P2 | user ak1352 | Family V1 P2 | 22-03-2025 | |
gpt-4.1-mini | 75 | 6.08 | ONE LAB | user yiliu051016 | openai | 01-05-2025 | |
im-a-good-agent-4 | 73.61 | 6.08 | anon | user trungtvu | anon | 30-05-2025 | |
im-a-good-agent-2 | 58.33 | 6.08 | anon | user trungtvu | good-agent | 30-05-2025 | |
im-a-good-agent-3 | 58.33 | 6.08 | anon | user trungtvu | anon | 30-05-2025 | |
im-a-good-agent | 58.33 | 6.08 | bespokelabs | user trungtvu | im-a-good-agent | 30-05-2025 | |
qq da test ci agent sft new prompt | 20.83 | 6.08 | qq da test ci agent sft new prompt | user ricklovelisa | sft | 28-05-2025 | |
gpt4.1-100 | 4.17 | 6.08 | org1 | user metalCan | GPT | 26-05-2025 | |
My Agent V2 S2 | 70.83 | 5.82 | Individual | user arkouda | Mini Family | 24-03-2025 | |
DA-Agent-Anonymous-0417 | 73.61 | 5.56 | Anonymous-0417 | user Johnnam513 | Anonymous | 18-04-2025 | |
DA-Agent-Anonymous-0418 | 73.61 | 5.56 | Anonymous-0418 | user Johnnam513 | Anonymous | 18-04-2025 | |
gg-agent-qwq-32b-0423 | 59.72 | 5.56 | gg | user geo11 | gg-agent | 28-05-2025 | |
im-a-good-da-agent-4 | 66.67 | 5.29 | anon | user trungtvu | im-a-good-da-agent | 30-05-2025 | |
gg-agent-0501-step120-s12-s1 | 66.67 | 5.29 | gg | user geo11 | gg-agent | 18-05-2025 | |
ych_agent_v3 | 65.28 | 5.29 | personal | user EmersonYCH | v3 | 07-04-2025 | |
ych_codeagent | 65.28 | 5.29 | personal | user EmersonYCH | model1 | 03-04-2025 | |
QQ Test DB | 50 | 5.03 | Self | user ricklovelisa | DB | 10-04-2025 | |
ych_agent_1 | 63.89 | 4.76 | personal | user EmersonYCH | model-pro | 03-04-2025 | |
ych_agent_Pro15 | 63.89 | 4.76 | personal | user EmersonYCH | Pro | 07-04-2025 | |
gg-agent-qwen3-32b-0522 | 66.67 | 4.5 | gg | user geo11 | gg-agent | 23-05-2025 | |
My Agent V2 S1 | 62.5 | 4.5 | Individual | user arkouda | Mini Family | 22-03-2025 | |
Agent V2 S1 | 62.5 | 4.5 | Org V2 S1 | user ak1352 | Family V2 S1 | 22-03-2025 | |
Test-qwq | 59.72 | 4.5 | test | qwq | 17-03-2025 | |
gg-agent-qwen3-32b-0524-new | 31.94 | 4.5 | gg | user geo11 | gg-agent | 26-05-2025 | |
gg-agent-s12 | 68.06 | 4.23 | gg | user geo11 | gg-agent | 04-05-2025 | |
qq da test ci agent sft sft prompt | 59.72 | 3.97 | qq da test ci agent sft sft prompt | user ricklovelisa | sft | 30-05-2025 | |
qq da test ci agent sft | 41.67 | 3.97 | qq da test ci agent | user ricklovelisa | sft | 27-05-2025 | |
gg-agent-qwen235b-s12 | 66.67 | 3.7 | gg | user geo11 | gg-agent | 23-05-2025 | |
gg-agent-qwen3-32b-0522-reason2 | 58.33 | 3.7 | gg | user geo11 | gg-agent | 28-05-2025 | |
gg-agent-0423-concise-s12 | 54.17 | 3.7 | gg | user geo11 | gg-agent | 12-05-2025 | |
Agent V1 P1 | 68.06 | 3.44 | Org V1 P1 | user ak1352 | Family V1 P1 | 21-03-2025 | |
gg-agent-qwen3-32b-0526-id13 | 51.39 | 3.44 | gg | user geo11 | gg-agent | 29-05-2025 | |
qq da test ci agent qwen3 sft prompt | 51.39 | 3.44 | qq da test ci agent qwen3 sft prompt | user ricklovelisa | qwen3 | 30-05-2025 | |
gg-agent | 47.22 | 3.44 | gg | user geo11 | gg-agent | 29-04-2025 | |
gg-agent-0501-step120-s12 | 34.72 | 3.44 | gg | user geo11 | gg-agent | 15-05-2025 | |
qq da test ci workflow ds | 29.17 | 3.44 | qq da test ci workflow | user ricklovelisa | ds | 26-05-2025 | |
Agent Variant 1 S | 65.28 | 3.17 | Org Variant 1 S | user ak1352 | Family Variant 1 S | 21-03-2025 | |
gg-agent-0501-s12 | 61.11 | 3.17 | gg | user geo11 | gg-agent | 08-05-2025 | |
gg-agent-qwen3-235b-base | 58.33 | 3.17 | gg | user geo11 | gg-agent | 25-05-2025 | |
gg-agent-0509-rl-step40-s12 | 56.94 | 3.17 | gg | user geo11 | gg-agent | 13-05-2025 | |
gg-agent-qwq-32b-0526-id10 | 50 | 3.17 | gg | user geo11 | gg-agent | 28-05-2025 | |
4o-mini | 65.28 | 2.91 | ONE_LAB | user yiliu051016 | openai | 25-04-2025 | |
gg-agent-qwen3-32b-0526-id12 | 61.11 | 2.91 | gg | user geo11 | gg-agent | 29-05-2025 | |
gg-agent-qwen2.5-coder-32b-instruct | 58.33 | 2.91 | gg | user geo11 | gg-agent | 25-05-2025 | |
gg-agent-qwen3-32b-0522-train-prompt | 51.39 | 2.91 | gg | user geo11 | gg-agent | 25-05-2025 | |
gg-agent-qwen2.5_32b_1 | 40.28 | 2.91 | gg | user geo11 | gg-agent | 23-05-2025 | |
gg-agent-qwen3-32b-0524-new2-copy | 31.94 | 2.91 | gg | user geo11 | gg-agent | 27-05-2025 | |
gg-agent-0423-concise-new-s12 | 52.78 | 2.65 | gg | user geo11 | gg-agent | 12-05-2025 | |
Agent No 3 Test | 50 | 2.65 | Agent Test Org 3 | user ak1352 | Agent Test Model Family 3 | 19-03-2025 | |
Agent 50 2 2 | 50 | 2.65 | Org 50 2 2 | user ak1352 | Family 50 2 2 | 24-03-2025 | |
Agent 50 2 4 | 50 | 2.65 | Org 50 2 4 | user ak1352 | Family 50 2 4 | 24-03-2025 | |
gg-agent-qwen3-30b-0522 | 50 | 2.65 | gg | user geo11 | gg-agent | 23-05-2025 | |
gg-agent-qwen3-32b-0524-new2-reason2 | 47.22 | 2.65 | gg | user geo11 | gg-agent | 28-05-2025 | |
gg-agent-qwen3-32b-0524 | 41.67 | 2.65 | gg | user geo11 | gg-agent | 26-05-2025 | |
gpt4.1-50 | 4.17 | 2.65 | org1 | user metalCan | GPT | 25-05-2025 | |
gpt4.1-double-50 | 4.17 | 2.65 | org1 | user metalCan | gpt | 29-05-2025 | |
gg-agent-0512-qwen32b-step100-s12 | 45.83 | 2.38 | gg | user geo11 | gg-agent | 14-05-2025 | |
gg-agent-qwen3-32b-base | 20.83 | 2.38 | gg | user geo11 | gg-agent | 25-05-2025 | |
Agent No 1 Test | 15.28 | 2.38 | Agent Test Org | user ak1352 | Agent Test Model Family | 19-03-2025 | |
Agent 50 2 3 | 15.28 | 2.38 | Org 50 2 3 | user ak1352 | Family 50 2 3 | 24-03-2025 | |
Agent 50 2 | 15.28 | 2.38 | Org 50 2 | user ak1352 | Family 50 2 | 24-03-2025 | |
gg-agent-qwen3-32b-0524-new2-reason1 | 56.94 | 2.12 | gg | user geo11 | gg-agent | 28-05-2025 | |
gg-agent-qwen3-235b-0523-rerun | 43.06 | 2.12 | gg | user geo11 | gg-agent | 25-05-2025 | |
QQ TEST 3 | 26.39 | 2.12 | QQ TEST 3 | user ricklovelisa | QQ TEST 3 | 11-04-2025 | |
gg-agent-qwen3-30b-0522-train-prompt | 23.61 | 2.12 | gg | user geo11 | gg-agent | 25-05-2025 |
Benchmark Validation Standards
All submissions are initially added to the Unvalidated Leaderboard. The Adyen/Hugging Face team will attempt, with the participation of the respective submission team, to validate any entries that rank within the top 10.
Validation confirms that a submission's results were achieved using a novel approach involving data analysis agents. To support validation, participants must provide clear evidence of their methodology. This can be done in one of the following ways:
- Preferred: Share a research paper or blog post along with the source code to enable full reproducibility.
- Submit a complete dataset that includes reasoning traces demonstrating how the results were produced.
- Provide access to an API that the Adyen/Hugging Face team can use to independently validate and reproduce results.
Our goal with DABStep is to foster rapid progress and collaboration in the open research community. We strongly encourage participants to share their work and open-source their code whenever possible.
Once validated, submissions will be featured on the Validated Leaderboard, with annotations indicating the validation method used (e.g., traces, code, API).
Submissions
Scores are expressed as the percentage of correct answers.
Each question calls for an answer that is either a string (one or a few words), a number, or a comma-separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi-exact match between a model’s answer and the ground truth (up to some normalization that depends on the “type” of the ground truth).
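To make the idea of type-dependent quasi-exact matching concrete, here is a minimal Python sketch. It is not the official scoring function: the function names, the relative tolerance for numbers, the case and whitespace normalization for strings, and the order-sensitive comparison of lists are all assumptions made for illustration.

```python
import math
import re
from typing import Iterable, Optional, Tuple


def normalize_text(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", value.strip().lower())


def to_number(value: str) -> Optional[float]:
    """Return the value as a float, or None if it does not parse as a number."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def quasi_exact_match(agent_answer: str, ground_truth: str, rel_tol: float = 1e-4) -> bool:
    """Compare an answer to the ground truth, choosing the rule from the ground-truth type."""
    truth_num = to_number(ground_truth)
    if truth_num is not None:
        # Numeric ground truth: compare as floats within a small relative tolerance.
        answer_num = to_number(agent_answer)
        return answer_num is not None and math.isclose(answer_num, truth_num, rel_tol=rel_tol)
    if "," in ground_truth:
        # List ground truth: compare element by element, in order (an assumption).
        truth_items = [item.strip() for item in ground_truth.split(",")]
        answer_items = [item.strip() for item in agent_answer.split(",")]
        return len(truth_items) == len(answer_items) and all(
            quasi_exact_match(a, t, rel_tol) for a, t in zip(answer_items, truth_items)
        )
    # String ground truth: compare after text normalization.
    return normalize_text(agent_answer) == normalize_text(ground_truth)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of (agent_answer, ground_truth) pairs that match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(quasi_exact_match(answer, truth) for answer, truth in pairs)
    return 100.0 * correct / len(pairs)
```

Dispatching on the ground truth rather than on the agent's answer keeps the comparison rule fixed per question, so an answer formatted in an unexpected way fails the match instead of silently switching to a more lenient rule.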
We expect submissions to be JSON Lines files with the following format.
Mandatory fields are task_id and agent_answer; reasoning_trace is optional:
{"task_id": "task_id_1", "agent_answer": "Answer 1 from your agent", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "agent_answer": "Answer 2 from your agent", "reasoning_trace": "The different steps by which your model reached answer 2"}
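A file in this format can be produced and sanity-checked with a few lines of Python. The sketch below is illustrative only: the helper names write_submission and read_submission are hypothetical, and the only rule it enforces is the presence of the two mandatory fields described above.

```python
import json
from pathlib import Path

# Field names taken from the submission format; reasoning_trace stays optional.
REQUIRED_FIELDS = ("task_id", "agent_answer")


def _check_record(record: dict, position: int) -> None:
    """Raise ValueError if a mandatory field is absent from the record."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"record {position} is missing fields: {missing}")


def write_submission(records, path) -> None:
    """Validate each record, then write one JSON object per line."""
    lines = []
    for position, record in enumerate(records, start=1):
        _check_record(record, position)
        lines.append(json.dumps(record, ensure_ascii=False))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_submission(path) -> list:
    """Read a JSON Lines file back, skipping blank lines and validating fields."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for position, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            _check_record(record, position)
            records.append(record)
    return records
```

For example, write_submission([{"task_id": "task_id_1", "agent_answer": "Answer 1 from your agent"}], "submission.jsonl") writes a one-line file, and read_submission("submission.jsonl") returns the same list of records.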
Our scoring function can be found here.