Abstract
We provide a resource for automatically harvesting relevance benchmarks from Wikipedia, which we refer to as "Wikimarks" to differentiate them from manually created benchmarks. Unlike simulated benchmarks, Wikimarks are based on manual annotations by Wikipedia authors. Studies on the TREC Complex Answer Retrieval track demonstrated that leaderboards under Wikimarks and manually annotated benchmarks are very similar. Because of their availability, Wikimarks can fill an important need in Information Retrieval research. We provide a meta-resource to harvest Wikimarks for several information retrieval tasks across different languages: paragraph retrieval, entity ranking, query-specific clustering, outline prediction, and relevant entity linking, among others. In addition, we provide example Wikimarks for English, Simple English, and Japanese derived from the 01/01/2022 Wikipedia dump. Resource available: https://trema-unh.github.io/wikimarks/
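To make the harvesting idea concrete, below is a minimal sketch of how Wikipedia's own structure can double as relevance annotations: an article title serves as a query, each section heading as a subtopic query, and the paragraphs under a section as the passages relevant to that subtopic. The names here (Article, Section, harvest_qrels) and the toy data are hypothetical illustrations, not the resource's actual API; the real tooling operates on full Wikipedia dumps rather than hand-built objects.

    # Hypothetical sketch of the Wikimark harvesting idea; the resource's
    # actual tooling and data formats may differ.
    from dataclasses import dataclass, field

    @dataclass
    class Section:
        heading: str
        paragraphs: list = field(default_factory=list)

    @dataclass
    class Article:
        title: str
        sections: list = field(default_factory=list)

    def harvest_qrels(article):
        """Yield (query, relevant paragraph) pairs from article structure."""
        for section in article.sections:
            # Title plus heading forms a subtopic query, as in TREC CAR.
            query = f"{article.title} / {section.heading}"
            for para in section.paragraphs:
                yield query, para

    # Toy example: one article with two sections.
    article = Article(
        title="Green sea turtle",
        sections=[
            Section("Diet", ["Adults feed mostly on seagrasses and algae."]),
            Section("Habitat", ["They inhabit tropical and subtropical coastal waters."]),
        ],
    )

    for query, para in harvest_qrels(article):
        print(query, "->", para)

The paragraphs a Wikipedia author placed under a heading thus act as that subtopic's relevance judgments, which is what lets benchmarks be harvested automatically at scale.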
Recommended Citation
L. Dietz et al., "Wikimarks: Harvesting Relevance Benchmarks from Wikipedia," SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3003–3012, Association for Computing Machinery, Jul. 2022.
The definitive version is available at https://doi.org/10.1145/3477495.3531731
Department(s)
Computer Science
Publication Status
Public Access
Keywords and Phrases
query-specific clustering; relevant entity linking; test collections
International Standard Book Number (ISBN)
978-1-4503-8732-3
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2022 Association for Computing Machinery, All rights reserved.
Publication Date
06 Jul 2022
Comments
National Science Foundation, Directorate for Computer and Information Science and Engineering, Grant 1846017