🏅 DABStep Leaderboard

The Data Agent Benchmark for Multi-step Reasoning (DABStep) aims to measure and push the state of the art in data analysis by LLMs. The benchmark comprises ~450 data analysis questions (Dataset Link) centered on one or more documents that agents must understand and cross-reference in order to answer correctly.

We have set up a notebook to quickly build an agent baseline using the free Hugging Face Inference API: Colab Notebook
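
As a rough sketch of the kind of call such a baseline makes (illustrative only; the model name, prompt, and placeholder question below are assumptions, not the notebook's actual code):

# Minimal baseline sketch (assumption: not the official notebook's code).
# Sends one DABStep-style question plus its context documents to a chat model
# served by the Hugging Face Inference API and prints the raw answer.
from huggingface_hub import InferenceClient

# Illustrative model choice; any chat-capable model on the Inference API works.
client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")

question = "What is the average transaction fee?"  # placeholder question
context = "<contents of the relevant documents>"   # placeholder documents

response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a data analysis agent. Answer concisely."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)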

Check out the official technical reports here: Adyen Report and Hugging Face Report

Join the discussion on the Discord server!


Benchmark Validation Standards

All submissions are initially added to the Unvalidated Leaderboard. The Adyen/Hugging Face team will attempt to validate any entries that rank within the top 10, with the participation of the respective submission team.

Validation confirms that a submission's results were achieved using a novel approach involving data analysis agents. To support validation, participants must provide clear evidence of their methodology. This can be done in one of the following ways:

  • Preferred: Share a research paper or blog post along with the source code to enable full reproducibility.
  • Submit a complete dataset that includes reasoning traces demonstrating how the results were produced.
  • Provide access to an API that the Adyen/Hugging Face team can use to independently validate and reproduce results.

Our goal with DABStep is to foster rapid progress and collaboration in the open research community. We strongly encourage participants to share their work and open-source their code whenever possible.

Once validated, submissions will be featured and showcased on the Validated Leaderboard, including annotations indicating the validation method used (e.g., traces, code, API).

Submissions

Scores are expressed as the percentage of correct answers.

Each question calls for an answer that is either a string (one or a few words), a number, or a comma-separated list of strings or floats, unless specified otherwise. There is only one correct answer per question. Evaluation is therefore done via quasi-exact match between a model’s answer and the ground truth, up to some normalization tied to the “type” of the ground truth.
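
As a rough illustration of what type-aware quasi-exact matching could look like (a simplified sketch only; the official scoring function linked at the end of this section is the source of truth):

# Sketch of a type-aware quasi-exact match (assumption: simplified, not the official scorer).
def quasi_exact_match(agent_answer: str, ground_truth: str) -> bool:
    def normalize(value: str) -> str:
        return value.strip().strip(".").lower()

    # Numeric ground truth: compare as floats with a small tolerance.
    try:
        return abs(float(agent_answer) - float(ground_truth)) < 1e-6
    except ValueError:
        pass

    # List ground truth: compare comma-separated items (order-insensitive here, an assumption).
    if "," in ground_truth:
        agent_items = {normalize(x) for x in agent_answer.split(",")}
        truth_items = {normalize(x) for x in ground_truth.split(",")}
        return agent_items == truth_items

    # String ground truth: compare normalized text.
    return normalize(agent_answer) == normalize(ground_truth)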

We expect submissions to be JSON Lines (.jsonl) files with the following format. The task_id and agent_answer fields are mandatory; reasoning_trace is optional:

{"task_id": "task_id_1", "agent_answer": "Answer 1 from your agent", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "agent_answer": "Answer 2 from your agent", "reasoning_trace": "The different steps by which your model reached answer 2"}

Our scoring function can be found here.