Large Language Models for Supporting Clear Writing and Detecting Spin in Randomized Controlled Trials in Oncology: Comparative Analysis of GPT Models and Prompts.
Further Information
Background: Randomized controlled trials (RCTs) are the gold standard for evaluating interventions in oncology, but reporting can be subject to "spin": presenting results in ways that mislead readers about true efficacy.
Objective: This study aimed to investigate whether large language models (LLMs) could provide a standardized approach to detect spin, particularly in the conclusions, where it most commonly occurs.
Methods: We randomly sampled 250 two-arm, single-primary end point oncology RCTs from 7 major medical journals published between 2005 and 2023. Two authors independently annotated trials as positive or negative based on whether they met their primary end point. Three commercial LLMs (GPT-3.5 Turbo, GPT-4o, and GPT-o1) were tasked with classifying trials as positive or negative when provided with (1) conclusions only; (2) methods and conclusions; (3) methods, results, and conclusions; or (4) title and full abstract. LLM performance was evaluated against human annotations. Afterward, trials incorrectly classified as positive when the model was provided only with the conclusions but correctly classified as negative when provided with the whole abstract were analyzed for patterns that may indicate the presence of spin. Model performance was assessed using accuracy, precision, recall, and F1-score calculated from confusion matrices.
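As an illustration of the evaluation step, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be derived from a confusion matrix of human annotations versus model classifications. The function names (`confusion`, `metrics`) and the four-trial toy labels are assumptions made for this example; they are not the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # human positive, model positive
    fp: int  # human negative, model positive
    fn: int  # human positive, model negative
    tn: int  # human negative, model negative


def confusion(truth, pred, positive="positive"):
    """Count the four confusion-matrix cells for paired label lists."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def metrics(c):
    """Return accuracy, precision, recall, and F1 (0.0 when undefined)."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (c.tp + c.tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels, for illustration only.
human = ["positive", "positive", "negative", "negative"]
model = ["positive", "negative", "positive", "negative"]
result = metrics(confusion(human, model))
```

In this sketch, "positive" (the trial met its primary end point) is treated as the positive class; whether the study used that convention is an assumption.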
Results: Of the 250 trials, 146 (58.4%) were positive, and 104 (41.6%) were negative. The GPT-o1 model demonstrated the highest performance across all conditions, with F1-scores of 0.932 (conclusions only; 95% CI 0.90-0.96), 0.96 (methods and conclusions; 95% CI 0.93-0.98), 0.98 (methods, results, and conclusions; 95% CI 0.96-0.99), and 0.97 (title and abstract; 95% CI 0.95-0.99). Analysis of trials incorrectly classified as positive when the model was provided only with the conclusions revealed shared patterns, including absence of primary end point results, emphasis on subgroup improvements, or unclear distinction between primary and secondary end points. These patterns were almost never found in trials correctly classified as negative.
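The selection of trials for the spin analysis can be sketched as a simple filter: keep trials that human annotators labeled negative, that the model called positive from the conclusions alone, and that the model called negative from the title and full abstract. The record fields (`id`, `human`, `pred_conclusions`, `pred_abstract`) and the example records below are hypothetical and chosen only for illustration.

```python
def spin_candidates(trials):
    """Return IDs of trials whose conclusions alone misled the classifier.

    A trial qualifies when the human label is negative, the
    conclusions-only prediction is positive, and the full-abstract
    prediction is negative.
    """
    return [
        t["id"]
        for t in trials
        if t["human"] == "negative"
        and t["pred_conclusions"] == "positive"
        and t["pred_abstract"] == "negative"
    ]


# Hypothetical records, not study data.
trials = [
    {"id": "A", "human": "negative",
     "pred_conclusions": "positive", "pred_abstract": "negative"},
    {"id": "B", "human": "negative",
     "pred_conclusions": "negative", "pred_abstract": "negative"},
    {"id": "C", "human": "positive",
     "pred_conclusions": "positive", "pred_abstract": "positive"},
    {"id": "D", "human": "negative",
     "pred_conclusions": "positive", "pred_abstract": "positive"},
]
candidates = spin_candidates(trials)
```

Trial D is excluded in this sketch because the model is wrong under both inputs, so the error cannot be attributed to the conclusions section specifically.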
Conclusions: LLMs can effectively detect potential spin in oncology RCT reporting by identifying discrepancies between how trials are presented in the conclusions vs the full abstracts. This approach could serve as a supplementary tool for improving transparency in scientific reporting, although further development is needed to address more complex trial designs beyond those examined in this feasibility study.
(© Carole Koechli, Fabio Dennstädt, Christina Schröder, Daniel M Aebersold, Robert Förster, Daniel R Zwahlen, Paul Windisch. Originally published in JMIR Cancer (https://cancer.jmir.org).)