Computer Science Faculty Research & Creative Works

GiGA: Giraph-based Genome Assembler For Gigabase Scale Genomes

Praveen Kumar Kondikoppa
Arghya Kusum Das
Sayan Goswami
Richard Platania
Seung Jong Park, Missouri University of Science and Technology

Abstract

Managing prodigious volumes of next-generation sequencing (NGS) input data in a cost-effective way forces a growing number of sequence analytic applications to run on scaled-out clusters of low-cost commodity hardware. Large-scale de novo genome assembly is no exception. However, traditional MPI-based assembly software cannot scale well with such huge volumes of data unless sophisticated and costly compute resources are provided, and these are unavailable to most researchers. The underlying model of computation must change significantly to address this critical need. In this work, we develop GiGA, a parallel Giraph-based Genome Assembler that uses the de Bruijn graph approach for assembly. GiGA builds on recent big data analytic software, Hadoop and Giraph, which carefully consider data locality and thus scale automatically to terabytes of data on low-cost commodity clusters. Our benchmark evaluation over GAGE data sets shows that GiGA achieves significantly higher scalability, substantially lower misassembly, and competitive NG50 compared to other assembly software. GiGA runs almost 1.5x faster than Contrail, a Hadoop-based genome assembler developed for commodity clusters. To demonstrate the capability of GiGA to assemble large-scale vertebrate genomes over hundreds of cores, we assemble a human genome data set (SRA000271) of 452 gigabytes and almost 2 billion reads with 512 cores in almost 8.5 hours.
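
For readers unfamiliar with the de Bruijn graph approach named in the abstract, the following is a minimal single-machine sketch of the general idea. It is not GiGA's implementation: the function names `build_de_bruijn` and `extend_contig`, the toy reads, and the choice of k = 4 are all assumptions made for illustration. The sketch ignores reverse complements, sequencing errors, and the distributed vertex-centric execution that Giraph provides.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read.

    Hypothetical helper for illustration; each k-mer contributes one edge
    from its prefix (k-1)-mer to its suffix (k-1)-mer.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow unambiguous edges from `start` and spell out the contig.

    The walk stops at a branch (out-degree != 1), at a node with more than
    one incoming edge, or when it would revisit a node.
    """
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]  # each step adds exactly one new base
        seen.add(nxt)
        node = nxt
    return contig


# Toy overlapping reads (an assumption, not data from the paper).
reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

In a vertex-centric framework such as Giraph, each (k-1)-mer would be a vertex and the path compression above would be carried out through message passing between neighbouring vertices rather than a sequential walk.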

Recommended Citation

P. K. Kondikoppa et al., "GiGA: Giraph-based Genome Assembler For Gigabase Scale Genomes," Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB 2016, pp. 55 - 62, Jan 2016.

Department(s)

Computer Science

Comments

National Science Foundation, Grant 1341008

International Standard Book Number (ISBN)

978-194343603-3

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Publication Date

01 Jan 2016

This document is currently not available here.
