Pre-Meta: priors-augmented retrieval for LLM-based metadata generation.
Motivation: While high-throughput sequencing technologies have dramatically accelerated genomic data generation, the manual processes required for dataset annotation and metadata creation impede the efficient discovery and publication of these resources across disparate public repositories. Large language models (LLMs) have the potential to streamline dataset profiling and discovery. However, their limited ability to generalize across specialized knowledge domains, particularly in fields such as biomedical genomics, prevents them from fully realizing this potential. This article presents Pre-Meta, an LLM-agnostic and domain-independent data annotation pipeline with an enriched retrieval procedure that leverages related priors (such as pre-generated metadata tags and ontologies) as auxiliary information to improve the accuracy of automated metadata generation.
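The idea of using priors as auxiliary retrieval signals can be illustrated with a minimal sketch. This is not the Pre-Meta implementation: the token-overlap scoring, the function names (`tokenize`, `score`, `retrieve`, `build_prompt`), and the `prior_weight` parameter are all assumptions chosen for illustration. The sketch only shows how prior terms, such as a pre-generated tag, might raise the rank of a passage that a plain query-based retriever would miss.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(chunk, query, priors, prior_weight=1.0):
    """Token-overlap score of a chunk against the query, plus a weighted
    bonus for overlap with prior terms (hypothetical scoring scheme)."""
    counts = Counter(tokenize(chunk))
    query_score = sum(counts[t] for t in set(tokenize(query)))
    prior_score = sum(counts[t] for p in priors for t in set(tokenize(p)))
    return query_score + prior_weight * prior_score


def retrieve(chunks, query, priors=(), k=2, prior_weight=1.0):
    """Return the k highest-scoring chunks; ties keep the original order."""
    ranked = sorted(
        chunks,
        key=lambda c: score(c, query, priors, prior_weight),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(field, context_chunks, priors):
    """Assemble a prompt for any LLM from the retrieved context and priors."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    hints = ", ".join(priors) if priors else "none"
    return (
        f"Fill in the metadata field '{field}'.\n"
        f"Candidate values from priors: {hints}\n"
        f"Context:\n{context}\n"
    )
```

With an empty `priors` argument, `retrieve` behaves like a plain query-only retriever. Passing prior terms shifts the ranking toward passages that mention them, which is the effect the abstract attributes to the enriched retrieval procedure.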
Results: Validated on five selected metadata fields sampled across 1500 papers, the Pre-Meta-assisted annotation experiment, run without fine-tuning or prompt optimization, demonstrates a systemic improvement in the annotation task: accuracy gains of 23%, 72%, and 75% over conventional RAG setups using GPT-4o mini, Llama 8B, and Mistral 7B, respectively.
Availability and Implementation: The code, data access, and scripts are available at: https://github.com/SINTEF-SE/LLMDap.
(© The Author(s) 2025. Published by Oxford University Press.)