Web search engines are frequently used to help people make decisions about health-related issues. Unfortunately, the web is filled with misinformation regarding the efficacy of treatments for health issues. Search users may not be able to discern correct from incorrect information, nor credible from non-credible sources. Users who find misinformation and deem it useful to their decision-making task can make incorrect decisions that waste money and put their health at risk.
The TREC Health Misinformation track fosters research on retrieval methods that promote reliable and correct information over misinformation for health-related decision making tasks.
Track announcements are made via Google Groups.
Task Description: Participants devise search technologies that promote credible and correct information over incorrect information, with the assumption that correct information can better lead people to make correct decisions.
Each topic concerns a health issue and a treatment for that issue. The topics represent a user who is looking for information that is useful for making a decision about whether or not the treatment is helpful or unhelpful for treating the health issue. Better search result rankings will place very useful documents that are credible and correct at the top of the ranking, and will not return incorrect information. In this search task, incorrect information is considered harmful and should not be returned.
For each topic, we have chosen a ‘stance’ on whether the treatment is helpful or unhelpful for the health issue. We do not claim to be providing medical advice, and medical decisions should never be made based on the stance we have chosen. Consult a medical doctor for professional advice. If a treatment is considered ‘helpful’, then correct documents will be those construed to be supportive of the treatment, and incorrect documents will be those that would dissuade the searcher from the treatment. Likewise, for an ‘unhelpful’ treatment, systems should return documents that dissuade the searcher from using the treatment and should avoid returning documents that are supportive of using the treatment. For each topic, we have included an evidence link for a webpage that the topic author used as the basis for the topic’s stance.
The primary ad-hoc task is to use a topic's query or description to return a ranking of documents, without use of any of the other topic fields. Manual runs may also make use of the other fields of the topic, e.g. the stance and evidence fields, but these runs will need to declare the use of this data to allow us to distinguish them from the primary task.
This year we will be using the noclean version of the C4 dataset used by Google to train their T5 model. The collection comprises text extracts from the April 2019 snapshot of Common Crawl and contains ~1B English documents.
You can download the corpus on a Debian/Ubuntu machine using the following commands (see HuggingFace for further information).
sudo apt-get install git-lfs
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include="en.noclean/c4-train*"
The collection is made up of the 7168 gzipped JSONL files located in the en.noclean directory. We are using only the c4-train.*.json.gz files and not the c4-validation.*.json.gz files. Each file contains ~150k documents, one document per line. A document is a JSON object with the fields text, url, and timestamp. As packaged in en.noclean, documents do not contain a document identifier. For this TREC track we will be adding our own document identifiers to the collection.
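For example, assuming the repository has been cloned as above, the first document of a shard can be read with a few lines of Python:

import gzip
import json

# Read the first document from one shard (path assumes the clone layout above).
with gzip.open('c4/en.noclean/c4-train.00000-of-07168.json.gz') as f:
    doc = json.loads(next(f))
    print(doc['url'], doc['timestamp'])
    print(doc['text'][:200])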
The docno spec is as follows: for documents inside files c4/en.noclean/c4-train.?????-of-07168.json.gz, the docno will be en.noclean.c4-train.?????-of-07168.<N>, where <N> is the line number of the document starting at 0. This goes for all 7168 training files in the path c4/en.noclean/. So, for example, in the file en.noclean/c4-train.01234-of-07168.json.gz, the first document’s identifier will be en.noclean.c4-train.01234-of-07168.0, the second document’s identifier will be en.noclean.c4-train.01234-of-07168.1, and the last document’s identifier will be en.noclean.c4-train.01234-of-07168.148409.
One way to insert document identifiers is by using the provided python script. Another would be to name the documents as you index them.
"""
Script to add docnos to files in c4/no.clean
To process all files:
python renamer.py --path <path-to-c4-repo>
To process a subset, e.g. the first 20 files:
python renamer.py --path <path-to-c4-repo> --pattern 000[01]?
"""
import argparse
import glob
import gzip
parser = argparse.ArgumentParser(description='Add docnos to C4 collection.')
parser.add_argument('--path', type=str, help='Root of C4 git repo.', required=True)
parser.add_argument('--pattern', type=str, default="?????", help='File name patterns to process.')
args = parser.parse_args()
pattern = args.pattern
path = args.path
def new_docno(file_number, line_number):
return f'en.noclean.c4-train.{file_number}-of-07168.{line_number}'
files = sorted(list(glob.iglob(f'{path}/en.noclean/c4-train.{pattern}-of-07168.json.gz')))
for filepath in files:
with gzip.open(filepath) as f:
file_number = filepath[-22:-22 + 5]
file_name = filepath[-31:]
print(f"adding docnos to file number {file_number} ...")
with gzip.open(f'{path}/en.noclean.withdocnos/{file_name}', 'wb') as o:
for line_number, line in enumerate(f.readlines()):
line = line.decode('utf-8')
new_line = f"{{\"docno\":\"{new_docno(file_number, line_number)}\",{line[1:]}"
o.write(new_line.encode('utf-8'))
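Conversely, it can be handy to map a docno back to its source file and line, e.g. when inspecting retrieved documents. A minimal sketch (the helper name fetch_document is ours):

import gzip
import json

def fetch_document(c4_path, docno):
    # e.g. docno = 'en.noclean.c4-train.01234-of-07168.42'
    prefix, line_number = docno.rsplit('.', 1)
    # 'en.noclean.c4-train.01234-of-07168' -> shard file name
    file_name = prefix.replace('en.noclean.', '', 1) + '.json.gz'
    with gzip.open(f'{c4_path}/en.noclean/{file_name}') as f:
        for i, line in enumerate(f):
            if i == int(line_number):
                return json.loads(line)
    raise ValueError(f'line {line_number} not found in {file_name}')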
Topics are released and available in the Active Participants section of the TREC website: https://trec.nist.gov/act_part/tracks/misinfo/misinfo-2021-topics.xml
Topics are authored by the track organizers. The NIST assessors will be provided the topic’s query, description, and narrative and be asked to make judgments as per the assessing guidelines.
The topics will be provided as XML files using the following format:
<topics>
  <topic>
    <number>1234</number>
    <query>dexamethasone croup</query>
    <description>Is dexamethasone a good treatment for croup?</description>
    <narrative>Croup is an infection of the upper airway and causes swelling,
      which obstructs breathing and leads to a barking cough. As one kind of
      corticosteroids, dexamethasone can weaken the immune response and
      therefore mitigate symptoms such as swelling. A very useful document
      would discuss the effectiveness of dexamethasone for croup, i.e. a very
      useful document specifically addresses or answers the search topic's
      question. A useful document would provide information that would help
      a user make a decision about treating croup with dexamethasone, and
      may discuss either separately or jointly: croup, recommended treatments
      for croup, the pros and cons of dexamethasone, etc.</narrative>
    <disclaimer>We do not claim to be providing medical advice, and medical
      decisions should never be made based on the stance we have chosen.
      Consult a medical doctor for professional advice.</disclaimer>
    <stance>helpful</stance>
    <evidence>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5804741/</evidence>
  </topic>
  <topic>
    ...
  </topic>
</topics>
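For convenience, the topics file can be read with Python's standard library; a minimal sketch, assuming the format above:

import xml.etree.ElementTree as ET

root = ET.parse('misinfo-2021-topics.xml').getroot()
for topic in root.findall('topic'):
    number = topic.find('number').text
    query = topic.find('query').text
    description = topic.find('description').text
    print(number, query, '--', description)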
The final qrels will contain assessments with respect to the following criteria: usefulness, supportiveness, and credibility.
Note: not-useful documents will not be assessed with respect to credibility and supportiveness.
Submitted runs will be evaluated with respect to three criteria: usefulness, correctness, and credibility. We will be using the compatibility measure in a similar fashion to the 2020 version of this track (see the overview paper for more details).
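For intuition, compatibility can be read as rank-biased overlap (RBO) between a run and an ideal ranking, normalized by the ideal's RBO with itself. The sketch below is our illustration of that reading, with an assumed persistence of p = 0.95; it is not the official evaluation code.

def rbo(run, ideal, p=0.95, depth=1000):
    # Rank-biased overlap, truncated at `depth`: a top-weighted measure of
    # the agreement between two ranked lists of docnos.
    run_seen, ideal_seen = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        if d <= len(run):
            run_seen.add(run[d - 1])
        if d <= len(ideal):
            ideal_seen.add(ideal[d - 1])
        total += (p ** (d - 1)) * len(run_seen & ideal_seen) / d
    return (1 - p) * total

def compatibility(run, ideal, p=0.95):
    # Normalize by the best achievable RBO: the ideal ranking against itself.
    best = rbo(ideal, ideal, p)
    return rbo(run, ideal, p) / best if best > 0 else 0.0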
For more details, you can read the 2021 assessing guidelines as used by NIST.
Participating groups will be allowed to submit as many runs as they like, but they need authorization from the Track organizers before submitting more than 10 runs. Not all runs are likely to be used for pooling and groups will need to specify a preference ordering for pooling purposes.
Runs may be either automatic or manual.
Automatic runs: Only the topic's query or description field may be used for automatic runs. Use of the other fields, e.g. stance, evidence, and narrative, will make a run a manual run. An automatic run is made without any tuning or customization based on the topics. Best practice for an automatic run is to avoid using the topics, or even looking at them, until all decisions have been made and code has been written to produce the automatic run.
Manual runs: A manual run is anything that is not an automatic run. Manual runs commonly have some human input based on the topics, e.g., hand-crafted queries or relevance feedback. All topic fields may be used for manual runs. We encourage manual runs in addition to automatic runs.
Submission format for runs will follow the standard TREC run format. For each topic, please return 1,000 ranked documents. The standard TREC run format is as follows:
qid Q0 docno rank score tag
where:
qid: the topic number;
Q0: unused and should always be Q0;
docno: the document identifier returned by your system for topic qid (see above for more details on how docnos are constructed);
rank: the rank at which the document is retrieved;
score: the score (integer or floating point) that generated the ranking. Scores must be in descending (non-increasing) order. Scores matter for handling ties: trec_eval sorts documents by the score values, not by your rank values;
tag: a tag that uniquely identifies your group AND the method you used to produce the run. Each run should have a different tag.
The fields should be separated with a space.
An example run is shown below:
1 Q0 en.noclean.c4-train.04124-of-07168.69102 1 14.8928003311 myGroupNameMyMethodName
1 Q0 en.noclean.c4-train.03346-of-07168.52165 2 14.7590999603 myGroupNameMyMethodName
1 Q0 en.noclean.c4-train.03904-of-07168.54203 3 14.5707998276 myGroupNameMyMethodName
...
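To produce a file in this format, something like the following sketch suffices (the write_run helper and its inputs are hypothetical; your system supplies the ranked docnos and scores):

def write_run(path, results, tag):
    # results: dict mapping topic number -> list of (docno, score) pairs,
    # already sorted by descending score.
    with open(path, 'w') as out:
        for qid, ranked in results.items():
            for rank, (docno, score) in enumerate(ranked[:1000], start=1):
                out.write(f'{qid} Q0 {docno} {rank} {score} {tag}\n')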
This year, we will only have the ad-hoc retrieval task. We will consider adding back in the other tasks in future years.