GiGA: Giraph-based Genome Assembler For Gigabase Scale Genomes
Abstract
Managing prodigious volumes of NGS input data cost-effectively forces a growing number of sequence-analysis applications to run on scaled-out clusters of low-cost commodity hardware. Large-scale de novo genome assembly is no exception. However, traditional MPI-based assembly software does not scale well with such data volumes unless sophisticated and costly compute resources are provided, and these are unavailable to most researchers. Addressing this need requires a significant change in the underlying model of computation. In this work, we develop GiGA, a parallel Giraph-based Genome Assembler that uses the de Bruijn graph approach for assembly. GiGA builds on recent big data analytics software, Hadoop and Giraph, which carefully consider data locality and thus scale automatically to terabytes of data on low-cost commodity clusters. Our benchmark evaluation on the GAGE data sets shows that GiGA achieves significantly higher scalability, substantially fewer misassemblies, and competitive NG50 compared to other assembly software. GiGA runs almost 1.5x faster than Contrail, a Hadoop-based genome assembler developed for commodity clusters. To demonstrate that GiGA can assemble large-scale vertebrate genomes over hundreds of cores, we assemble a human genome data set (SRA000271) of 452 gigabytes and almost 2 billion reads on 512 cores in almost 8.5 hours.
Recommended Citation
P. K. Kondikoppa et al., "GiGA: Giraph-based Genome Assembler For Gigabase Scale Genomes," Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016, pp. 55-62, Jan 2016.
Department(s)
Computer Science
International Standard Book Number (ISBN)
978-194343603-3
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2024 The Authors. All rights reserved.
Publication Date
01 Jan 2016
Comments
National Science Foundation, Grant 1341008