Abstract
This study evaluates the ability of large language models (LLMs) to map biomedical ontology terms to their corresponding ontology IDs across the Human Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies. Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate for their prevalence in the biomedical literature, we examined the relationship between ontology ID prevalence and mapping accuracy. Results indicate that ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers: the more prevalent an ontology ID is in the biomedical literature, the higher the mapping accuracy. Predictive models based on receiver operating characteristic (ROC) curves confirmed this relationship. In contrast, this pattern did not apply to mapping protein names to Human Genome Organisation (HUGO) gene symbols. GPT-4 achieved high baseline accuracy (95%) in mapping protein names to HUGO gene symbols, and this accuracy was unaffected by prevalence. We propose that the high prevalence of HUGO gene symbols in the literature has caused these symbols to become lexicalized, enabling GPT-4 to map protein names to HUGO gene symbols with high accuracy. These findings highlight the limitations of LLMs in mapping ontology terms to low-prevalence ontology IDs and underscore the importance of incorporating ontology ID prevalence into the training and evaluation of LLMs for biomedical applications.
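As an illustration of how prevalence can be scored as a predictor of mapping accuracy, the sketch below computes the area under the ROC curve (AUC), treating each ontology ID's literature count as the score and a correct mapping as the positive class. It is a minimal sketch and not the authors' code: the function name `roc_auc`, the variable names, and the toy counts are all hypothetical, and the pairwise rank formulation is one standard way to compute AUC, chosen here because it needs no external libraries.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the AUC as the probability that a correctly mapped term (label 1)
    has a higher score than an incorrectly mapped term (label 0).
    Tied scores count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect mapping")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): literature counts for six
# ontology IDs and whether the model mapped the matching term correctly.
pmc_counts = [1200, 450, 30, 5, 0, 2]
mapped_correctly = [1, 1, 1, 0, 0, 1]

auc = roc_auc(pmc_counts, mapped_correctly)
print(f"AUC = {auc:.3f}")
```

On this toy data the AUC is 0.875. A value near 0.5 would mean prevalence carries no information about mapping success, which is the pattern the abstract reports for HUGO gene symbols; a value near 1.0 would mean prevalence separates correct from incorrect mappings almost perfectly.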
Recommended Citation
T. S. Do et al., "Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy," Proceedings 2025 IEEE Conference on Artificial Intelligence CAI 2025, pp. 555-560, Institute of Electrical and Electronics Engineers, Jan 2025.
The definitive version is available at https://doi.org/10.1109/CAI64502.2025.00101
Department(s)
Electrical and Computer Engineering
Keywords and Phrases
Gene Ontology; Human Phenotype Ontology; large language models; lexicalization; machine codes; Ontology mapping; UniProtKB; Zipf's Law
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.
Publication Date
01 Jan 2025
