TREC AutoJudge

  • Nov 25, 2025: Anonymized Runs for Pilot Released

  • Dec 9, 2025: Pilot submission deadline announced: Jan 30, 2026

TREC AutoJudge 2026

Task: Given all system runs for a track, predict an evaluation score for each system and produce a leaderboard (higher = better).
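For concreteness, here is a minimal sketch of what the required output amounts to: a mapping from run identifiers to predicted scores, sorted into a leaderboard. The run names and scores are made up, and the official submission format will be specified by the track.

```python
# Hypothetical predicted per-system scores (run IDs and values invented
# for illustration; the real submission format is defined by the track).
predicted_scores = {
    "run_A": 0.62,
    "run_B": 0.71,
    "run_C": 0.55,
}

# Higher score = better, so sort descending to produce the leaderboard.
leaderboard = sorted(predicted_scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (run_id, score) in enumerate(leaderboard, start=1):
    print(f"{rank}\t{run_id}\t{score:.3f}")
```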

Manual or semi-automatic runs are welcome!

You may also submit any other artifacts, such as point‑wise grades, preference labels, nugget sets, or rationales. The more you share, the more we will share back with you!

The goal is to understand the advantages, disadvantages, vulnerabilities, and guardrails of different LLM‑as‑a‑Judge approaches compared with manual relevance labeling.

Ask questions via email to dietz@cs.unh.edu or on the #trec-auto-eval channel on the NIST Slack server.

TREC AutoJudge Proposal

AutoJudge Pilot at TREC 25

We are launching a pilot now (yes, 2025!) because it is rare to have four RAG‑oriented tracks running simultaneously at TREC. This also lets us study the problem before commercial LLMs are trained on the test data, which could otherwise bias meta‑evaluation.

You can submit your AutoJudge system by uploading files or by submitting your code via TIRA (URL to be announced soon).

Your submission will be evaluated with Kendall's tau, using the official leaderboard of the host track as the ground truth (once available). Other meta-evaluation metrics will be reported as well.
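As a sketch of this meta-evaluation, Kendall's tau measures how well the ordering induced by your predicted scores agrees with the ordering of the official leaderboard. The run names and scores below are invented for illustration; scipy's standard kendalltau implementation is used.

```python
from scipy.stats import kendalltau

# Hypothetical data: official leaderboard scores and an AutoJudge
# system's predicted scores for the same runs.
official = {"run_A": 0.48, "run_B": 0.55, "run_C": 0.31, "run_D": 0.60}
predicted = {"run_A": 0.52, "run_B": 0.49, "run_C": 0.30, "run_D": 0.66}

runs = sorted(official)  # align the two score lists on run identifiers
tau, p_value = kendalltau([official[r] for r in runs],
                          [predicted[r] for r in runs])

# tau = 1.0 means the predicted leaderboard ranks systems exactly like
# the official one; tau = -1.0 means the ranking is fully reversed.
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```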

The pilot will run until the end of January 2026. Intermediate results will be shared in time for SIGIR paper submissions.

Results that will be shared with you

When official leaderboards and manual assessments are shared with participants of the host tracks, we will share results with you. After the pilot concludes we will share all data with everybody who participated so you can perform any meta-evaluation you like.

Ideally this will allow everybody to drill into questions such as the consistency of different human judges versus different LLMs, how judgments relate to nuggets, and even whether manual judgments can be retrofitted to new nugget banks.

This meta-evaluation will be informal, but we encourage you to incorporate it into your TREC notebook for the proceedings.

Also, please share any prior findings and opinions with the AutoJudge organizers, so we can make the official AutoJudge track more useful.

Track Coordinators

Main:

Advisory:

TIRA integration:

Host Track Liaisons: