Title:
Practical implications of using non-relational databases to store large genomic data files and novel phenotypes.
Authors:
Moreira Souza A; Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil.
Weigert RAS; Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil.
Machado de Sousa EP; Institute of Mathematics and Computer Sciences, University of Sao Paulo, Sao Carlos, Sao Paulo, Brazil.
Tassoni Andrietta L; Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo, Pirassununga, Sao Paulo, Brazil.
Ventura RV; Department of Animal Nutrition and Production, School of Veterinary Medicine and Animal Science, University of Sao Paulo, Pirassununga, Sao Paulo, Brazil.
Source:
Journal of animal breeding and genetics = Zeitschrift fur Tierzuchtung und Zuchtungsbiologie [J Anim Breed Genet] 2022 Jan; Vol. 139 (1), pp. 100-112. Date of Electronic Publication: 2021 Aug 29.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Blackwell
Country of Publication: Germany
NLM ID: 100955807
Publication Model: Print-Electronic
Cited Medium: Internet
ISSN: 1439-0388 (Electronic)
Linking ISSN: 09312668
NLM ISO Abbreviation: J Anim Breed Genet
Subsets: MEDLINE
Imprint Name(s):
Publication: 1994- : Berlin : Blackwell
Original Publication: Hamburg : P. Parey, c1985-
Grant Information:
16/19514-2 Fundação de Amparo à Pesquisa do Estado de São Paulo; 20/04461-6 Fundação de Amparo à Pesquisa do Estado de São Paulo
Contributed Indexing:
Keywords: FASTQ; MongoDB; SNP; novel phenotypes
Entry Date(s):
Date Created: 20210830 Date Completed: 20211216 Latest Revision: 20211216
Update Code:
20250114
DOI:
10.1111/jbg.12644
PMID:
34459042
Database:
MEDLINE

Further Information

The objective of our study was to provide practical directions on the storage of genomic information and novel phenotypes (treated here as unstructured data) using a non-relational database. The MongoDB technology was assessed for this purpose, enabling frequent data transactions involving numerous individuals under genetic evaluation. Our study investigated different types of genomic (Illumina Final Report, PLINK, 0125, FASTQ, and VCF formats) and phenotypic (including media files) information, using both real and simulated datasets. Advantages of our centralized database concept include sublinear growth in query running time as the number of samples/markers increased exponentially, in addition to the comprehensive management of distinct data formats while searching for specific genomic regions. A comparison of our generic, non-relational solution with an existing relational approach (developed for tabular data types and using 2 bits to store genotypes) showed that the relational schema required less time to import 50M SNPs (PLINK format). Our experimental results also reinforce that data conversion is a costly step required to load genomic data into both relational and non-relational database systems and must therefore be handled carefully in large applications.
(© 2021 Wiley-VCH GmbH.)
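
The abstract mentions two mechanisms that a short sketch may make more concrete: storing each genotype in 2 bits, and searching for markers within a specific genomic region. The Python below is an illustration only and is not taken from the study. It assumes that the "0125" coding means 0, 1, or 2 copies of the counted allele with 5 marking a missing call; the bit codes and the field names `chrom` and `pos` are hypothetical. Only the `$gte` and `$lte` comparison operators are standard MongoDB query syntax.

```python
from typing import Dict, List

# Assumed coding: 0, 1, 2 = copies of the counted allele, 5 = missing call.
# The 2-bit codes are an illustrative choice, not the layout of any real tool.
CODE_OF: Dict[int, int] = {0: 0b00, 1: 0b01, 2: 0b10, 5: 0b11}
GENOTYPE_OF: Dict[int, int] = {code: g for g, code in CODE_OF.items()}


def pack_genotypes(genotypes: List[int]) -> bytes:
    """Pack genotype calls into bytes, four calls per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        # Call i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= CODE_OF[g] << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` genotype calls from packed bytes."""
    return [
        GENOTYPE_OF[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(count)
    ]


def region_filter(chrom: str, start: int, end: int) -> dict:
    """Build a MongoDB-style filter for markers in [start, end] on one chromosome.

    The field names "chrom" and "pos" are hypothetical.
    """
    return {"chrom": chrom, "pos": {"$gte": start, "$lte": end}}


calls = [0, 1, 2, 5, 2, 1]
packed = pack_genotypes(calls)
restored = unpack_genotypes(packed, len(calls))
query = region_filter("1", 100, 200)
```

Under these assumptions six calls fit in two bytes, which shows why a packed tabular layout can be compact for SNP data. A dictionary such as `query` could be passed to a MongoDB driver's find operation, provided the stored documents actually carry matching fields.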