SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
SciTaRC is an expert-authored benchmark of questions about tabular data in scientific papers that require both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions.
Clone the repository and install the minimal dependencies:
git clone https://github.com/JHU-CLSP/SciTaRC.git
cd SciTaRC
pip install -r requirements.txt
The benchmark data is provided locally as scitarc_dataset.json and is also accessible via Hugging Face. The dataset consists of 371 expert-annotated questions. Every instance includes an expert-annotated pseudo-code plan to facilitate granular diagnosis of model failures.
{
  "paper": "2401.06769",
  "relevant_tables": [
    [
      "\\begin{table*}[h!]\n",
      "...",
      "\\end{table*}\n"
    ]
  ],
  "question": "Which model has the biggest difference in translation quality when translating into English versus from English, and what is the value of that difference?",
  "answer": "NLLB-200-1.3B. 64.71",
  "plan": "SELECT all models\nLOOP for each model\n SELECT all language pair containing en(English)\n LOOP for each language pair containing en (English)\n COMPUTE diff = abs(score translating into English − score translating from English)\n..."
}
paper (string): The arXiv ID of the source scientific paper.
question (string): The complex, multi-step question.
answer (string): The ground-truth answer.
plan (string): The expert-authored pseudo-code blueprint (e.g., SELECT, LOOP, COMPUTE, IF).
relevant_tables (list): The exact LaTeX source code for the specific table(s) required.
tables (list): The LaTeX source code for all tables and figures extracted from the paper.
fulltext (string): The complete LaTeX source text of the original scientific paper.
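To work with the local copy programmatically, here is a minimal loading sketch, assuming (as the example instance above suggests) that the file holds a JSON list of instance objects:

import json

# Load the local benchmark file; each element is one annotated instance.
with open("scitarc_dataset.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

print(f"{len(dataset)} questions")      # expected: 371
first = dataset[0]
print(first["question"])
print(first["plan"].splitlines()[0])    # first pseudo-code step of the plan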
Our unified inference script uses vLLM and supports probing the execution bottleneck by separating reasoning plans from their execution.
If no --output-file is provided, outputs are automatically named and saved to generations/[model_tag]_[plan_mode]_[exec_mode].json.
Key Arguments:
--plan-mode: none (Direct QA), self (Autonomous Planning), gold (Oracle Planning).
--exec-mode: language (Chain-of-Thought), code (Program-of-Thought).
--use-hf: Add this flag to stream the dataset directly from Hugging Face instead of the local JSON.
Standard Direct QA (No Plan):
python generate.py \
--model-id meta-llama/Llama-3.1-8B-Instruct \
--plan-mode none \
--exec-mode language
Oracle Code Execution (Gold Plan + Program of Thoughts):
python generate.py \
--model-id Qwen/Qwen2.5-Coder-7B-Instruct \
--plan-mode gold \
--exec-mode code
Outputs are saved to the generations/ directory.
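To sweep every plan/exec combination for one model, a small driver can wrap the script. This is a sketch that uses only the flags documented above; adjust the model ID as needed:

import itertools
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# Run generate.py once per (plan, exec) setting; outputs land in generations/.
for plan, exec_mode in itertools.product(["none", "self", "gold"],
                                         ["language", "code"]):
    subprocess.run(
        ["python", "generate.py",
         "--model-id", MODEL,
         "--plan-mode", plan,
         "--exec-mode", exec_mode],
        check=True,
    )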
Because answers are free-form and complex, we use an LLM-as-a-Judge protocol (>95% agreement with human annotators) to robustly evaluate logical reasoning and mathematical accuracy.
If no --output-json is provided, results are automatically named and saved to evaluations/[generation_filename]_eval.json.
python evaluate.py \
--generation-json generations/YOUR_GENERATION_FILE.json \
--evaluator-model meta-llama/Llama-3.3-70B-Instruct \
--prompt-file eval_prompt.txt
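Conceptually, the judge makes one generation call per instance and parses a verdict. The sketch below illustrates that shape with vLLM; the prompt template and verdict parsing here are illustrative stand-ins, not the actual protocol encoded in eval_prompt.txt:

from vllm import LLM, SamplingParams

judge = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=8)

def judge_one(question: str, gold: str, pred: str) -> bool:
    # Illustrative template; the real prompt lives in eval_prompt.txt.
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Model answer: {pred}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    verdict = judge.generate([prompt], params)[0].outputs[0].text.upper()
    return "INCORRECT" not in verdict and "CORRECT" in verdict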
We also provide an Exact Match (EM) script for strict baseline comparisons. You can run this on single files or entire directories. Use the --inplace flag to append the EM scores directly to your existing evaluation JSONs.
python exact_match.py --files evaluations/[YOUR_FILE]_eval.json --inplace
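The strictness of EM hinges on answer normalization. The script defines its own rules; the comparison is roughly of this shape (a sketch; the case and whitespace handling below is an assumption, not the script's actual logic):

import re

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so e.g. " 64.71 " matches "64.71".
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)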
Calculate input and reasoning complexity metrics over the dataset:
python get_metrics.py scitarc_dataset.json complexity_metrics.csv
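The exact metrics are defined in get_metrics.py. As a rough illustration of the kind of signal involved, simple per-instance proxies can be derived from the dataset fields (the two proxies below are our own examples, not the script's actual metrics):

import json

with open("scitarc_dataset.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for ex in data[:5]:
    # Reasoning-complexity proxy: number of pseudo-code steps in the plan.
    plan_depth = len(ex["plan"].splitlines())
    # Input-size proxy: total characters of LaTeX in the relevant tables.
    table_size = sum(len(line) for tbl in ex["relevant_tables"] for line in tbl)
    print(ex["paper"], plan_depth, table_size)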
If you use this dataset, please cite our paper:
@misc{wang2026scitarc,
  title={SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation},
  author={Hexuan Wang and Yaxuan Ren and Srikar Bommireddypalli and Shuxian Chen and Adarsh Prabhudesai and Rongkun Zhou and Elina Baral and Philipp Koehn},
  year={2026},
  eprint={2603.08910},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.08910},
}