Abstract

We provide a resource for automatically harvesting relevance benchmarks from Wikipedia, which we refer to as "Wikimarks" to differentiate them from manually created benchmarks. Unlike simulated benchmarks, they are based on manual annotations by Wikipedia authors. Studies on the TREC Complex Answer Retrieval track demonstrated that leaderboards under Wikimarks and manually annotated benchmarks are very similar. Because of their availability, Wikimarks can fill an important need in Information Retrieval research. We provide a meta-resource to harvest Wikimarks for several information retrieval tasks across different languages: paragraph retrieval, entity ranking, query-specific clustering, outline prediction, relevant entity linking, and many more. In addition, we provide example Wikimarks for English, Simple English, and Japanese derived from the 01/01/2022 Wikipedia dump. Resource available: https://trema-unh.github.io/wikimarks/

Department(s)

Computer Science

Publication Status

Public Access

Comments

Directorate for Computer and Information Science and Engineering, Grant 1846017

Keywords and Phrases

query-specific clustering; relevant entity linking; test collections

International Standard Book Number (ISBN)

978-145038732-3

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2024 Association for Computing Machinery. All rights reserved.

Publication Date

06 Jul 2022
