Computer Science Faculty Research & Creative Works

Large-scale Parallel Genome Assembler Over Cloud Computing Environment

Arghya Kusum Das
Praveen Kumar Koppa
Sayan Goswami
Richard Platania
Seung Jong Park, Missouri University of Science and Technology

Abstract

The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications have started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at a lower cost. However, the locality-based programming model (e.g., MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance both require further research. In this paper, we present a de Bruijn graph-oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g., ABySS and Contrail) over a traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure rather than a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that on 512 cores of a traditional HPC cluster.

Recommended Citation

A. K. Das et al., "Large-scale Parallel Genome Assembler Over Cloud Computing Environment," Journal of Bioinformatics and Computational Biology, vol. 15, no. 3, article no. 1740003, World Scientific Publishing, Jun 2017.

The definitive version is available at https://doi.org/10.1142/S0219720017400030

Department(s)

Computer Science

Keywords and Phrases

Big data genome assembly; cloud computing; Giraph; Hadoop; solid state drive (SSD); traditional HPC cluster

International Standard Serial Number (ISSN)

1757-6334; 0219-7200

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Publication Date

01 Jun 2017

PubMed ID

28610458
