Abstract

Normalization of medical concepts to an ontology is a key aspect of natural language processing of biomedical text. It enables the mapping of medical expressions to standardized ontology terms and their identifiers, thereby enhancing the interoperability and computability of medical concepts. Although large language models (LLMs) can identify and standardize medical terms, they may struggle to map ontology terms accurately to their corresponding ontology identifiers. These challenges arise from the stochastic nature of LLMs, their limited exposure to uncommon ontology identifiers during training, and their lack of an integrated lookup mechanism. We generated test sets of synthetic terms to assess normalization performance under both zero-shot and retrieval-augmented generation (RAG) prompting across two ontologies (the Human Phenotype Ontology and the Gene Ontology) and three LLMs (GPT-4o, LLaMA 3.3 70B, and Phi-4). To ensure a calibrated and fair evaluation, the test sets were balanced along two axes: (1) term prevalence in the biomedical literature, as estimated from PubMed Central frequency counts, and (2) semantic proximity to ontology terms, as measured by the cosine similarity of BioBERT embeddings. Our results show that RAG consistently outperforms zero-shot prompting, particularly on low-prevalence terms that are infrequently encountered in the biomedical literature, highlighting the value of RAG in compensating for gaps in model exposure to uncommon medical concepts. We conclude that a synthetic test set can be a valuable tool for evaluating biomedical term normalization across LLMs.
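
The abstract names two concrete embedding steps: cosine similarity over BioBERT vectors, used to balance the test set by semantic proximity, and retrieval of candidate ontology terms, which is the backbone of a RAG prompt. The sketch below is not the authors' code; it is a minimal illustration under stated assumptions: the dmis-lab/biobert-base-cased-v1.1 checkpoint, mean pooling of the last hidden states, and a toy three-entry HPO index standing in for the full ontology.

```python
# Minimal sketch (not the paper's implementation) of the two embedding steps
# described in the abstract: BioBERT cosine similarity, and top-k retrieval of
# ontology candidates such as a RAG prompt would include. The checkpoint name,
# pooling strategy, and toy HPO index are assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool last hidden states into one L2-normalized vector per text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (n, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # zero out padding tokens
    vecs = (hidden * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(vecs, dim=-1)

# Toy ontology index: a few HPO labels with their identifiers (illustrative).
hpo = {
    "HP:0001640": "Cardiomegaly",
    "HP:0001627": "Abnormal heart morphology",
    "HP:0000822": "Hypertension",
}
labels = list(hpo.values())
index = embed(labels)

def retrieve(term: str, k: int = 2) -> list[tuple[str, str, float]]:
    """Rank ontology labels by cosine similarity to the query term."""
    sims = index @ embed([term]).squeeze(0)  # dot product = cosine (normalized)
    top = torch.topk(sims, k)
    ids = list(hpo)
    return [(ids[i], labels[i], sims[i].item()) for i in top.indices]

# Candidates like these would be inserted into the RAG prompt; the top
# similarity score is also the "semantic proximity" axis used for balancing.
print(retrieve("enlarged heart"))
```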

Department(s)

Electrical and Computer Engineering

Keywords and Phrases

BioBERT; cosine similarity; Gene Ontology; Human Phenotype Ontology; large language models; normalization; ontology identifiers; ontology mapping

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.

Publication Date

01 Jan 2025
