This is the repository for the paper *FACE: A Fine-Grained Reference-Free Evaluator for Conversational Information Access* by Hideaki Joko and Faegheh Hasibi, SIGIR 2026.
Specifically, the repository contains:
- The **CRSArena-Eval** dataset with human-annotated conversations and meta-evaluation scripts.
- The **CRSArena-Eval** interface for interactive meta-evaluation of your evaluator vs. baselines.
- The **FACE** implementation with particle generation and scoring tools.
- CRSArena-Eval is a meta-evaluation dataset of human-annotated conversations between users and 9 Conversational Recommender Systems (CRSs), designed for evaluating CRS evaluators.
- FACE is a Fine-grained, Aspect-based Conversation Evaluation method that provides evaluation scores for diverse turn- and dialogue-level qualities of recommendation conversations.
The directory dataset/ contains the CRSArena-Eval dataset.
This dataset is designed for meta-evaluation of CRS evaluators and is built on the CRSArena-Dial dataset.
crs_arena_eval.json: The main dataset file containing 467 conversations with 4,473 utterances, annotated with both turn-level and dialogue-level quality scores by human evaluators.
Turn-level aspects:
- Relevance (0-3): Does the assistant's response make sense and meet the user's interests?
- Interestingness (0-2): Does the response make the user want to continue the conversation?
Dialogue-level aspects:
- Understanding (0-2): Does the assistant understand the user's request and try to fulfill it?
- Task Completion (0-2): Does the assistant make recommendations that the user finally accepts?
- Interest Arousal (0-2): Does the assistant try to spark the user's interest in something new?
- Efficiency (0-1): Does the assistant suggest items matching the user's interests within the first three interactions?
- Overall Impression (0-4): What is the overall impression of the assistant's performance?
Table: General statistics of the CRSArena-Eval dataset.
| Statistic | Value |
|---|---|
| # Conversations | 467 |
| # Utterances | 4,473 |
| Avg. utterances per conversation | 9.58 |
| Avg. words per user utterance | 7.53 |
| Avg. words per system utterance | 15.18 |
| # Final labels (after aggregation) | 6,805 |
👉 For detailed dataset schema and structure, see dataset/README.md.
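To give a feel for how the annotated conversations can be consumed, here is a minimal sketch of iterating over one record. The field names (`utterances`, `turn_labels`, `dialogue_labels`) and values are illustrative assumptions, not the actual schema — see dataset/README.md for the real structure.

```python
# Sketch: walking one annotated conversation under an ASSUMED schema.
# Field names below are hypothetical; consult dataset/README.md for
# the actual keys used in crs_arena_eval.json.
conversation = {
    "utterances": [
        {"speaker": "USER", "text": "Any good sci-fi movies?"},
        {"speaker": "ASST", "text": "You might enjoy Dune (2021)."},
    ],
    # One turn-level annotation per system turn (Relevance 0-3, Interestingness 0-2).
    "turn_labels": [{"relevance": 3, "interestingness": 2}],
    # Dialogue-level annotations (e.g. Overall Impression 0-4).
    "dialogue_labels": {"overall_impression": 3},
}

for utt in conversation["utterances"]:
    print(f'{utt["speaker"]}: {utt["text"]}')
```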
The dataset/run/ directory contains scripts and data for reproducing the evaluation results reported in the paper.
- `eval.py`: Evaluation script that computes Pearson and Spearman correlations between predictions and CRSArena-Eval human annotations.
- `face_run.json`: FACE predictions for the CRSArena-Eval dataset in the standard run file format.
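The correlations that `eval.py` reports can be sketched in pure Python as follows. This is a minimal illustration of the two metrics, not the script's actual implementation, and the prediction/label values are made up:

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance normalized by the standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman correlation: Pearson computed on ranks (ties get average ranks).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r
    return pearson(ranks(xs), ranks(ys))

predictions = [2.5, 1.0, 3.0, 0.5]  # hypothetical evaluator scores
human = [3, 1, 2, 0]                # hypothetical human labels

print(round(pearson(predictions, human), 3))   # prints 0.868
print(round(spearman(predictions, human), 3))  # prints 0.8
```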
The face/ directory contains the implementation of the FACE evaluation method.
- `particle_generation/`: Converts dialogue turns into atomic conversation particles -- self-contained information units consisting of dialogue acts, text mentions, and user feedback.
- `face_scoring/`: Scores particle-based dialogues using 16 optimized prompts per aspect, aggregating results to turn/dialogue-level scores.
- `reproduce_result_table/`: Scripts for reconstructing the main result table from the paper.
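The aggregation step can be pictured with a small sketch. Both the per-prompt scores and the use of a plain mean are assumptions made here for illustration; the actual aggregation used by `face_scoring/` is described in face/README.md and the paper:

```python
from statistics import mean

# Hypothetical scores returned by the 16 prompts for one aspect of one turn
# (e.g. Relevance on a 0-3 scale). Averaging them is an ASSUMPTION for this
# sketch; the real aggregation in face_scoring/ may differ.
prompt_scores = [3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3]
assert len(prompt_scores) == 16

turn_score = mean(prompt_scores)
print(turn_score)  # the aggregated turn-level score
```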
- Install dependencies (requires uv):

  ```shell
  cd face && uv sync
  ```

- Generate particles from a conversation:

  ```shell
  uv run particle_generation/particle_generator.py examples/example_conv.json \
      --turn-index 1 --speaker ASST --samples 10
  ```

- Score a conversation with FACE:

  ```shell
  uv run face_scoring/face.py --conversation examples/example_particles.json \
      --aspect dialogue_overall
  ```
👉 For detailed usage, LLM setup, and available aspects, see face/README.md.
We provide an easy-to-use meta-evaluation interface to evaluate your evaluator against the CRSArena-Eval dataset.
The public interface is hosted at https://informagi.github.io/face/interface/.
See interface/README.md for detailed instructions on how to run the interface locally.
We also provide a Python script to evaluate your evaluator on the CRSArena-Eval dataset.
👉 For detailed run file format and evaluation instructions, see dataset/run/README.md.
```bibtex
@inproceedings{Joko:2026:FACE,
  title     = {FACE: A Fine-Grained Reference-Free Evaluator for Conversational Information Access},
  author    = {Joko, Hideaki and Hasibi, Faegheh},
  booktitle = {Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year      = {2026}
}
```

If you have any questions, please contact Hideaki Joko (hideaki.joko@ru.nl).
