TREC AutoJudge

Organization

Nov 25, 2024: Anonymized Runs for Pilot Released

AutoJudge Pilot at TREC 25

Obtain the Data Set

Submission

Pilot participants will hand in

I am explicitly inviting “manual runs”, i.e., manual judgments. These could be based on manual nuggets with auto-scanning. They could also give someone the chance to develop new support tools for human judgments.

I am planning to keep submissions open until the end of January. However, I would highly appreciate it if you could share intermediate results at the TREC 25 workshop (Dec 11).

Meta Evaluation

When official leaderboards and manual assessments are shared with participants of the host tracks, we will share them with you as well, so participants can perform a meta-evaluation.
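One common way to run such a meta-evaluation is to compare the system ranking produced by an automatic judge against the official leaderboard using a rank correlation. The sketch below computes Kendall's tau-a in plain Python. It is only an illustration: the run names and scores are made up, the `{system: score}` dictionary layout is an assumption, and the official data may arrive in a different format.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two leaderboards given as {system: score} dicts.

    Only systems present in both leaderboards are compared. Tied pairs count
    as neither concordant nor discordant.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two systems shared by both leaderboards")

    concordant = 0
    discordant = 0
    for s, t in combinations(systems, 2):
        # Same sign means both leaderboards order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, for illustration only.
official = {"run_a": 0.71, "run_b": 0.64, "run_c": 0.52, "run_d": 0.40}
auto_judge = {"run_a": 0.80, "run_b": 0.58, "run_c": 0.61, "run_d": 0.35}

print(round(kendall_tau(official, auto_judge), 3))
```

A value near 1 means the automatic judge orders systems much like the official leaderboard, while a value near 0 means the two orderings are largely unrelated.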

Ideally, this would allow everybody to drill into questions such as how consistent different human judges are compared with different LLMs, how judgments relate to nuggets, and even whether manual judgments can be retrofitted to new nugget banks.
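As a starting point for the consistency question, agreement between two judges (for example one human assessor and one LLM) can be measured with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch in plain Python. The label lists are invented, and it assumes both judges labeled the same items in the same order.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty set of items")

    n = len(labels_a)
    # Fraction of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each judge's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both judges used a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary relevance labels, for illustration only.
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
llm_labels = [1, 0, 0, 0, 1, 0, 1, 1]

print(cohen_kappa(human_labels, llm_labels))
```

The same function can be applied to pairs of human judges and to pairs of LLM judges, so that human-human, LLM-LLM, and human-LLM agreement can be compared on the same scale.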

This meta-evaluation will be informal, but I encourage you to incorporate it into your TREC notebook for the proceedings.

Also, please share any prior findings and opinions with the AutoJudge organizers, so we can make the official AutoJudge track more useful.

Proposal for TREC AutoJudge 2026

TREC Auto Judge Proposal

Track Coordinators

Main:

Advisory:

TIRA integration:

Host Track Liaisons: