
Nov 25, 2025: Anonymized Runs for Pilot Released
Dec 9, 2025: Announced that pilot submissions are due Jan 30, 2026
Task: Given all system runs for a track, predict an evaluation score for each system and produce a leaderboard (higher = better).
Manual or semi-automatic runs are welcome!
You may also submit any other artifacts, such as point‑wise grades, preference labels, nugget sets, or rationales. The more you share, the more we will share back with you!
The goal is to understand the advantages, disadvantages, vulnerabilities, and guardrails of different LLM‑as‑a‑Judge approaches compared with manual relevance labeling.
Ask questions via email to dietz@cs.unh.edu or on the #trec-auto-eval channel on the NIST slack server.
We are launching a pilot now (yes, 2025!) because it is rare to have four RAG‑oriented tracks running simultaneously at TREC. This also lets us study the problem before commercial LLMs are trained on the test data, which could otherwise bias meta‑evaluation.
Any registered TREC 2025 team can participate in the pilot.
Pilot Data. Please obtain the anonymized participant runs from the dataset page – the password is the same as for TREC “active members”.
Given: anonymized participant runs from the four host tracks of TREC 2025. Please obtain the necessary corpora and task instructions from the web pages of the host tracks.
Predict: an evaluation score for each system, formatted as a leaderboard (similar to the output of “trec_eval -q”).
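As a concrete illustration, here is a minimal sketch of writing such a predicted leaderboard, assuming a simple tab-separated layout with one run per line; the run names, scores, and exact column layout are our own assumptions, so please check the official submission format before relying on it.

    # Hedged sketch: write hypothetical predicted per-system scores as a
    # leaderboard file, one run per line, highest predicted score first.
    # The column layout here is an assumption, not the official format.

    predicted_scores = {            # hypothetical Auto-Judge output: run -> score
        "anon-run-01": 0.412,
        "anon-run-02": 0.387,
        "anon-run-03": 0.455,
    }

    with open("autojudge_leaderboard.txt", "w") as out:
        for run, score in sorted(predicted_scores.items(),
                                 key=lambda kv: kv[1], reverse=True):
            out.write(f"pred_score\tall\t{run}\t{score:.4f}\n")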
You can submit your Auto-Judge system by uploading files or by submitting your code via TIRA (URL announced soon).
Your submission will be evaluated with Kendall’s tau, using the official leaderboard of the host track as ground truth (once available). Other meta-evaluation metrics will be reported as well.
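For orientation, a minimal sketch of this meta-evaluation, assuming the scipy library and hypothetical run names and scores, might look like the following.

    from scipy.stats import kendalltau

    # Hypothetical predicted and official leaderboard scores per anonymized run.
    predicted = {"anon-run-01": 0.41, "anon-run-02": 0.39, "anon-run-03": 0.46}
    official  = {"anon-run-01": 0.35, "anon-run-02": 0.28, "anon-run-03": 0.44}

    # Align runs that appear in both leaderboards and correlate their scores.
    runs = sorted(predicted.keys() & official.keys())
    tau, p_value = kendalltau([predicted[r] for r in runs],
                              [official[r] for r in runs])
    print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")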
The pilot will run until the end of January 2026. Intermediate results will be shared in time for SIGIR paper submissions.
Results that will be shared with you
When official leaderboards and manual assessments are shared with participants of the host tracks, we will share results with you. After the pilot concludes we will share all data with everybody who participated so you can perform any meta-evaluation you like.
Ideally, this will allow everybody to drill into questions such as the consistency between different human judges versus different LLMs, how judgments relate to nuggets, and even whether manual judgments can be retrofitted to new nugget banks.
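As a hedged illustration of one such drill-down (not an official analysis), agreement between a human assessor and an LLM judge on the same graded relevance labels could be measured roughly as below; the labels and the choice of weighted kappa are assumptions.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical graded relevance labels for the same (query, document) pairs,
    # one set from a human assessor and one from an LLM judge.
    human_labels = [2, 0, 1, 3, 0, 1, 2]
    llm_labels   = [2, 0, 2, 3, 1, 1, 2]

    # Quadratically weighted kappa penalizes large disagreements more heavily.
    kappa = cohen_kappa_score(human_labels, llm_labels, weights="quadratic")
    print(f"Human vs. LLM agreement (weighted kappa): {kappa:.3f}")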
This meta-evaluation will be informal, but I encourage you to incorporate it into your TREC notebook for the proceedings.
Also, please share any prior findings and opinions with the AutoJudge organizers, so we can make the official AutoJudge track more useful.
Main:
Advisory:
TIRA integration:
Host Track Liaisons: