Masters Theses

Abstract

This work presents a structured benchmarking study of multimodal large language models (MLLMs) applied to electrocardiogram (ECG) interpretation tasks. We evaluate three representative architectures (MedGemma, HuatuoGPT-Vision, and LLaVA-Med) across progressive experimental stages involving text-only structured prompt normalization, text–image fusion with ECG plots, and full multimodal fusion incorporating time-series signals. A standardized five-section cardiology prompt was designed to enforce a consistent output structure and SCP-code alignment, enabling reproducible metric computation across models. Quantitative evaluation using BERTScore, token-level F1, and diagnostic accuracy demonstrates that HuatuoGPT-Vision achieves the highest semantic and diagnostic alignment, while MedGemma exhibits superior formatting stability and reproducibility. In contrast, LLaVA-Med shows limited ability to handle extended clinical prompts, yielding a high invalid-response rate. Preliminary multimodal results suggest that augmenting textual and visual prompts with ECG time-series data does not enhance diagnostic precision or semantic coherence, indicating that current models remain biased toward image-forward training practices. Overall, the findings highlight the critical role of structured reasoning and modality fusion in improving the interpretability and reliability of medical MLLMs, and provide a reproducible framework for future ECG-centric language–vision model evaluation.
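As an illustration of the metric computation the abstract describes, the sketch below implements a token-level F1 score between a model's generated report and a reference interpretation. The thesis does not specify its exact tokenization or matching scheme, so the lowercased whitespace tokenization and multiset overlap here are assumptions, following common report-generation evaluation practice.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model response and a reference report.

    Tokens are lowercased, whitespace-split words (an assumed scheme);
    overlapping tokens are counted with multiplicity via multiset
    intersection, as in common QA/report-generation evaluations.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each shared token counted min(pred, ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Identical texts score 1.0; fully disjoint texts score 0.0.
print(token_f1("normal sinus rhythm", "normal sinus rhythm"))  # → 1.0
```

Semantic metrics such as BERTScore replace this exact-match overlap with contextual-embedding similarity, which is why the two metrics are reported together in the study.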

Advisor(s)

Yang, Huiyuan

Committee Member(s)

Maity, Suman
Yu, Xiaowei

Department(s)

Computer Science

Degree Name

M.S. in Computer Science

Publisher

Missouri University of Science and Technology

Publication Date

Fall 2025

Pagination

ix, 57 pages

Note about bibliography

Includes bibliographical references (pages 54-56)

Rights

© 2026 Prisha Anil, All Rights Reserved

Document Type

Thesis - Open Access

File Type

text

Language

English

Thesis Number

T 12555
