GiGA: Giraph-based Genome Assembler For Gigabase Scale Genomes

Abstract

Managing prodigious volumes of NGS input data cost-effectively is forcing a growing number of sequence analysis applications to run on scaled-out clusters of low-cost commodity hardware. Large-scale de novo genome assembly is no exception. However, traditional MPI-based assembly software cannot scale well with such volumes of data unless sophisticated and costly compute resources are provided, and these are unavailable to most researchers. The underlying model of computation must change significantly to address this critical need. In this work, we develop GiGA, a parallel Giraph-based Genome Assembler that uses the de Bruijn graph approach for assembly. GiGA builds on recent big data analytics software, Hadoop and Giraph, which carefully consider data locality and thus scale automatically to terabytes of data on low-cost commodity clusters. Our benchmark evaluation on the GAGE data sets shows that GiGA achieves significantly higher scalability, substantially fewer misassemblies, and a competitive NG50 compared to other assembly software. GiGA runs almost 1.5x faster than Contrail, a Hadoop-based genome assembler developed for commodity clusters. To demonstrate that GiGA can assemble large-scale vertebrate genomes over hundreds of cores, we assemble a human genome data set (SRA000271) of 452 gigabytes and almost 2 billion reads on 512 cores in almost 8.5 hours.

Department(s)

Computer Science

Comments

National Science Foundation, Grant 1341008

International Standard Book Number (ISBN)

978-194343603-3

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2024 The Authors, All rights reserved.

Publication Date

01 Jan 2016
