Result: Evaluating Locally Run Large Language Models (Gemma 2, Mistral Nemo, and Llama 3) for Outpatient Otorhinolaryngology Care: Retrospective Study.

Title:

Evaluating Locally Run Large Language Models (Gemma 2, Mistral Nemo, and Llama 3) for Outpatient Otorhinolaryngology Care: Retrospective Study.

Authors:

Buhr CR; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362; School of Medicine, University of St Andrews, St Andrews, United Kingdom.
Seifen C; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362.
Bahr-Hamm K; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362.
Huppertz T; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362.
Pordzik J; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362.
Smith H; School of Computer Science, University of St Andrews, St Andrews, United Kingdom.
Kelsey T; School of Computer Science, University of St Andrews, St Andrews, United Kingdom.
Blaikie A; School of Medicine, University of St Andrews, St Andrews, United Kingdom.
Matthias C; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362.
Kuhn S; Institute for Digital Medicine, Philipps University Marburg, University Hospital Giessen and Marburg, Marburg, Germany.
Eckrich J; Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Langenbeckstraße 1, Mainz, 55131, Germany, +49 6131 17 7362.

Source:

JMIR formative research [JMIR Form Res] 2025 Nov 25; Vol. 9, pp. e76896. Date of Electronic Publication: 2025 Nov 25.

Publication Type:

Journal Article

Language:

English

Journal Info:

Publisher: JMIR Publications
Country of Publication: Canada
NLM ID: 101726394
Publication Model: Electronic
Cited Medium: Internet
ISSN: 2561-326X (Electronic)
Linking ISSN: 2561326X
NLM ISO Abbreviation: JMIR Form Res
Subsets: MEDLINE

Imprint Name(s):

Original Publication: Toronto, ON, Canada : JMIR Publications, [2017]-

MeSH Terms:

Otolaryngology*/methods , Ambulatory Care*/methods , Programming Languages*, Retrospective Studies ; Humans ; Female ; Male ; Outpatients/statistics & numerical data ; Adult ; Middle Aged ; Large Language Models

Contributed Indexing:

Keywords: artificial intelligence; chatbot; digital health; global health; large language models; low- and middle-income countries; otorhinolaryngology; telehealth; telemedicine

Entry Date(s):

Date Created: 20251125 Date Completed: 20251125 Latest Revision: 20251129

Update Code:

20251129

PubMed Central ID:

PMC12646549

DOI:

10.2196/76896

PMID:

41289564

Database:

MEDLINE

Further Information

Background: Large language models (LLMs) have great potential to improve the work of clinicians and make it more efficient. Previous studies have mainly focused on web-based services, such as ChatGPT, often with simulated cases. When personalized patient data are processed, web-based services raise major data protection concerns. Ensuring compliance with data protection and medical device regulations therefore remains a critical challenge for adopting LLMs in clinical settings.
Objective: This retrospective single-center study aimed to evaluate locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3) in providing diagnoses and treatment recommendations for real-world outpatient cases in otorhinolaryngology (ORL).
Methods: Outpatient cases (n=30) from regular consultation hours and the emergency service at a university hospital ORL outpatient department were randomly selected. Documentation by ORL doctors, including anamnesis and examination results, was passed to the locally run LLMs (Gemma 2, Mistral Nemo, and Llama 3), which were asked to provide diagnostic and treatment strategies. Recommendations of the LLMs and the treating ORL doctors were rated by 3 experienced ORL consultants on a 6-point Likert scale for medical adequacy, conciseness, coherence, and comprehensibility. Moreover, consultants were asked whether the answers posed a risk to the patient's safety. A modified Turing test was performed to distinguish responses generated by LLMs from those of doctors. Finally, the potential influence of the information generated by the LLMs on the raters' own diagnosis and treatment opinions was evaluated.
Results: Across all categories, ORL doctors achieved superior ratings (P<.0005) compared with the locally run LLMs (Llama 3, Mistral Nemo, and Gemma 2). ORL doctors' responses were considered hazardous for patients in only 1% of the ratings, whereas recommendations by Llama 3, Gemma 2, and Mistral Nemo were considered hazardous in 54%, 47%, and 32% of cases, respectively. According to the raters, the LLMs' information rarely influenced their judgment: Mistral Nemo, Gemma 2, and Llama 3 did so in 1%, 3%, and 4% of the ratings, respectively.
Conclusions: Although locally run LLMs still underperform compared with their web-based counterparts, they achieved respectable results on outpatient treatment in this study. Nevertheless, the retrospective and single-center nature of the study, along with the clinicians' documentation style, may have introduced bias in favor of human recommendations. In the future, locally run LLMs will help address data protection concerns; however, further refinement and prospective validation are still needed to meet strict medical device requirements. As locally run LLMs continue to evolve, they are likely to become comparably powerful to web-based LLMs and to become established as useful tools to support doctors in clinical practice.
(© Christoph Raphael Buhr, Christopher Seifen, Katharina Bahr-Hamm, Tilman Huppertz, Johannes Pordzik, Harry Smith, Tom Kelsey, Andrew Blaikie, Christoph Matthias, Sebastian Kuhn, Jonas Eckrich. Originally published in JMIR Formative Research (https://formative.jmir.org).)
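
The rating procedure described in the Methods (several raters scoring each recommendation on a 6-point Likert scale and flagging whether it poses a risk to patient safety) could be aggregated roughly as in the sketch below. This is only an illustration, not the authors' analysis: the record layout, the `summarize` helper, the source labels, and all values are hypothetical toy data, and treating higher scores as better is an assumption, since the abstract does not state the direction of the scale.

```python
from collections import defaultdict

# Hypothetical toy ratings, NOT data from the study.
# Record layout (assumed): (case_id, source, rater_id, adequacy score 1-6, flagged hazardous)
ratings = [
    (1, "doctor", "A", 6, False),
    (1, "doctor", "B", 5, False),
    (1, "doctor", "C", 6, False),
    (1, "local_llm", "A", 3, True),
    (1, "local_llm", "B", 4, False),
    (1, "local_llm", "C", 2, True),
    (2, "doctor", "A", 5, False),
    (2, "doctor", "B", 6, False),
    (2, "doctor", "C", 5, False),
    (2, "local_llm", "A", 4, False),
    (2, "local_llm", "B", 3, True),
    (2, "local_llm", "C", 4, False),
]


def summarize(records):
    """Return {source: (mean Likert score, share of ratings flagged hazardous)}."""
    scores = defaultdict(list)
    hazards = defaultdict(list)
    for _case_id, source, _rater_id, score, hazardous in records:
        if not 1 <= score <= 6:
            raise ValueError(f"score {score} is outside the 6-point scale")
        scores[source].append(score)
        hazards[source].append(hazardous)
    return {
        source: (
            sum(scores[source]) / len(scores[source]),
            # True counts as 1, so this is the fraction of flagged ratings.
            sum(hazards[source]) / len(hazards[source]),
        )
        for source in scores
    }


summary = summarize(ratings)
for source, (mean_score, hazard_share) in summary.items():
    print(f"{source}: mean adequacy {mean_score:.2f}, "
          f"flagged hazardous in {hazard_share:.0%} of ratings")
```

Note that the share is computed per rating rather than per case, so with 3 raters each case contributes three entries to the denominator.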