Result: Pre-Meta: priors-augmented retrieval for LLM-based metadata generation.

Title:
Pre-Meta: priors-augmented retrieval for LLM-based metadata generation.
Authors:
Tinn P; SINTEF AS, Oslo 0373, Norway.
Sørbø S; SINTEF AS, Oslo 0373, Norway.
Jiang S; SINTEF AS, Oslo 0373, Norway.
Voutetakis K; Institute of Chemical Biology, National Hellenic Research Foundation, Athens 11635, Greece.
Giounis SM; Institute of Chemical Biology, National Hellenic Research Foundation, Athens 11635, Greece.
Pilalis E; Institute of Chemical Biology, National Hellenic Research Foundation, Athens 11635, Greece.; e-NIOS Applications PC, Kallithea 17671, Greece.
Papadodima O; Institute of Chemical Biology, National Hellenic Research Foundation, Athens 11635, Greece.
Roman D; SINTEF AS, Oslo 0373, Norway.; Bucharest University of Economic Studies, Bucharest 010374, Romania.
Source:
Bioinformatics (Oxford, England) [Bioinformatics] 2025 Oct 02; Vol. 41 (10).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Oxford University Press; Country of Publication: England; NLM ID: 9808944; Publication Model: Print; Cited Medium: Internet; ISSN: 1367-4811 (Electronic); Linking ISSN: 13674803; NLM ISO Abbreviation: Bioinformatics; Subsets: MEDLINE
Imprint Name(s):
Original Publication: Oxford : Oxford University Press, c1998-
References:
Artif Intell Med. 2017 Jul;80:11-28. (PMID: 28818520)
Nat Rev Genet. 2020 Oct;21(10):615-629. (PMID: 32694666)
Nucleic Acids Res. 2007 Jan;35(Database issue):D760-5. (PMID: 17099226)
Yearb Med Inform. 2018 Aug;27(1):129-139. (PMID: 30157516)
Mayo Clin Proc Digit Health. 2024 Apr 11;2(2):186-191. (PMID: 40207170)
Nat Genet. 2001 Dec;29(4):365-71. (PMID: 11726920)
Bioinformatics. 2022 Sep 30;38(19):4656-4657. (PMID: 35980167)
Bioinformatics. 2024 Feb 1;40(2):. (PMID: 38341654)
Ann Biomed Eng. 2023 Dec;51(12):2647-2651. (PMID: 37328703)
BMC Bioinformatics. 2015 Apr 30;16:138. (PMID: 25925131)
Grant Information:
HE 101093216 UPCAST; HE 101070284 enRichMyData; HE 101189771 DataPACT; PNRR 760049 CauseFinder
Entry Date(s):
Date Created: 20250919; Date Completed: 20251013; Latest Revision: 20251015
Update Code:
20251015
PubMed Central ID:
PMC12516316
DOI:
10.1093/bioinformatics/btaf519
PMID:
40973196
Database:
MEDLINE

Further Information

Motivation: High-throughput sequencing technologies have dramatically accelerated genomic data generation, but the manual processes required for dataset annotation and metadata creation impede the efficient discovery and publication of these resources across disparate public repositories. Large language models (LLMs) have the potential to streamline dataset profiling and discovery. However, their limited ability to generalize across specialized knowledge domains, particularly in fields such as biomedical genomics, prevents them from fully realizing this potential. This article presents Pre-Meta, an LLM-agnostic and domain-independent data annotation pipeline with an enriched retrieval procedure that leverages related priors, such as pre-generated metadata tags and ontologies, as auxiliary information to improve the accuracy of automated metadata generation.
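As a rough illustration of what retrieval enriched with priors could look like, the Python sketch below ranks text chunks for a metadata field and boosts chunks that mention known candidate values. It is a toy under stated assumptions, not the Pre-Meta implementation: the function names (`retrieve`, `build_prompt`), the token-overlap scoring, the `prior_weight` parameter, and the sample chunks are all hypothetical and do not come from the article or its repository.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query_tokens, chunk_tokens):
    """Count occurrences in the chunk of each distinct query token."""
    counts = Counter(chunk_tokens)
    return sum(counts[token] for token in set(query_tokens))


def retrieve(chunks, field, priors, top_k=2, prior_weight=2.0):
    """Rank chunks by overlap with the field name plus weighted overlap with prior terms.

    `priors` is a list of candidate values (for example ontology labels);
    the weighting scheme is an illustrative assumption.
    """
    field_tokens = tokenize(field)
    prior_tokens = [token for prior in priors for token in tokenize(prior)]
    scored = []
    for chunk in chunks:
        tokens = tokenize(chunk)
        score = overlap_score(field_tokens, tokens)
        score += prior_weight * overlap_score(prior_tokens, tokens)
        scored.append((score, chunk))
    # sort() is stable, so chunks with equal scores keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(field, priors, context):
    """Assemble an LLM prompt that lists the priors as candidate values."""
    lines = [f"Fill in the metadata field '{field}'."]
    if priors:
        lines.append("Candidate values: " + ", ".join(priors))
    lines.append("Context:")
    lines.extend(f"- {chunk}" for chunk in context)
    return "\n".join(lines)


# Hypothetical paper chunks and priors, for illustration only
chunks = [
    "We thank the funding agencies for their support.",
    "Total RNA was extracted from Homo sapiens liver tissue samples.",
    "Sequencing was performed on an Illumina platform.",
]
priors = ["Homo sapiens", "Mus musculus"]

baseline = retrieve(chunks, "organism", [], top_k=1)
with_priors = retrieve(chunks, "organism", priors, top_k=1)
prompt = build_prompt("organism", priors, with_priors)
print(prompt)
```

In this toy setting the field name "organism" never appears in the text, so plain retrieval has nothing to match and simply returns the first chunk, whereas the prior terms pull the relevant chunk to the top. A realistic pipeline would presumably use embedding-based retrieval rather than token overlap.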
Results: Validated on five selected metadata fields sampled across 1500 papers, the Pre-Meta-assisted annotation experiment, run without fine-tuning or prompt optimization, demonstrates a systemic improvement in the annotation task: accuracy gains of 23%, 72%, and 75% over conventional RAG setups with GPT-4o mini, Llama 8B, and Mistral 7B, respectively.
Availability and Implementation: The code, data access, and scripts are available at: https://github.com/SINTEF-SE/LLMDap.
(© The Author(s) 2025. Published by Oxford University Press.)