TREC AutoJudge

  • Nov 25, 2025: Anonymized Runs for Pilot Released

  • Dec 9, 2025: Pilot submission deadline announced: Jan 30, 2026

TREC AutoJudge 2026

Task: Given all system runs for a track, predict an evaluation score for each system and produce a leaderboard (higher = better).
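For concreteness, here is a minimal sketch of what the required output amounts to: a mapping from run identifiers to predicted scores, sorted into a leaderboard. The run names and scores are made up, and the official submission format will be specified by the track.

```python
# Hypothetical predicted per-system scores (run IDs and values invented
# for illustration; the real submission format is defined by the track).
predicted_scores = {
    "run_A": 0.62,
    "run_B": 0.71,
    "run_C": 0.55,
}

# Higher score = better, so sort descending to produce the leaderboard.
leaderboard = sorted(predicted_scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (run_id, score) in enumerate(leaderboard, start=1):
    print(f"{rank}\t{run_id}\t{score:.3f}")
```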

Manual or semi-automatic runs are welcome!

You may also submit any other artifacts, such as point‑wise grades, preference labels, nugget sets, or rationales. The more you share, the more we will share back with you!

The goal is to understand the advantages, disadvantages, vulnerabilities, and guardrails of different LLM‑as‑a‑Judge approaches compared with manual relevance labeling.

Ask questions via email to dietz@cs.unh.edu or on the #trec-auto-eval channel on the NIST Slack server.

TREC AutoJudge Proposal

AutoJudge Pilot at TREC 25

We are launching a pilot now (yes, 2025!) because it is rare to have four RAG‑oriented tracks running simultaneously at TREC. This also lets us study the problem before commercial LLMs are trained on the test data, which could otherwise bias meta‑evaluation.

You can submit your AutoJudge system by uploading files or by submitting your code via TIRA (URL to be announced soon).

Your submission will be evaluated with Kendall's tau, using the official leaderboard of the host track as the ground truth (once available). Other meta-evaluation metrics will be reported as well.
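As a sketch of this meta-evaluation, Kendall's tau measures how well the ordering induced by your predicted scores agrees with the ordering of the official leaderboard. The run names and scores below are invented for illustration; scipy's standard kendalltau implementation is used.

```python
from scipy.stats import kendalltau

# Hypothetical data: official leaderboard scores and an AutoJudge
# system's predicted scores for the same runs.
official = {"run_A": 0.48, "run_B": 0.55, "run_C": 0.31, "run_D": 0.60}
predicted = {"run_A": 0.52, "run_B": 0.49, "run_C": 0.30, "run_D": 0.66}

runs = sorted(official)  # align the two score lists on run identifiers
tau, p_value = kendalltau([official[r] for r in runs],
                          [predicted[r] for r in runs])

# tau = 1.0 means the predicted leaderboard ranks systems exactly like
# the official one; tau = -1.0 means the ranking is fully reversed.
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```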

The pilot will run until the end of January 2026. Intermediate results will be shared in time for SIGIR paper submissions.

Results that will be shared with you

When official leaderboards and manual assessments are shared with participants of the host tracks, we will share results with you. After the pilot concludes we will share all data with everybody who participated so you can perform any meta-evaluation you like.

Ideally this will allow everybody to drill into questions such as the consistency of different human judges versus different LLMs, how judgments relate to nuggets, and even whether manual judgments can be retrofitted to new nugget banks.

This meta-evaluation will be informal, but we encourage you to incorporate it into your TREC notebook for the proceedings.

Also, please share any prior findings and opinions with the AutoJudge organizers, so we can make the official AutoJudge track more useful.

Track Coordinators

Main:

Advisory:

TIRA integration:

Host Track Liaisons: