Critical assessment of on-premise approaches to scalable genome analysis

Amira Al-Aamri, Syafiq Kamarul Azman, Gihan Daw Elbait, Habiba S. Alsafar, Andreas Henschel

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, which is transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power required for genotype–phenotype predictions in complex diseases.

Methods: To leverage this wealth of information efficiently, we assessed several genomic data science tools. We focus on on-premise installations to cover situations where data confidentiality, compliance regulations, and similar constraints rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability.

Results: Tools that leverage sophisticated data structures are noted as the most suitable for large-scale projects, with varying degrees of scalability, in comparison to flat-file manipulation tools (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database.

Conclusion: The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.

© 2023, BioMed Central Ltd., part of Springer Nature.
Original language: American English
Journal: BMC Bioinformatics
Volume: 24
Issue number: 1
State: Published - 2023

Keywords

  • Big data
  • Data visualization
  • Digital storage
  • DNA sequences
  • Gene encoding
  • Metadata
  • Precipitation (meteorology)
  • Query processing
  • Scalability
  • Critical assessment
  • Genome analysis
  • Genomic data
  • Genomic data science
  • Genomics
  • Genomics database
  • Horizontal scaling
  • NoSQL
  • SQL
  • VCF
  • Genome
  • chromosomal mapping
  • data science
  • DNA sequencing
  • factual database
  • genomics
  • Chromosome Mapping
  • Data Science
  • Databases, Factual
  • Sequence Analysis, DNA
