Abstract

Genome sequencing technology has made tremendous progress in both throughput and cost per base pair, resulting in an explosion in the size of data. Consequently, typical sequence assembly tools demand substantial processing power and memory, and cannot assemble big datasets unless run on hundreds of nodes. In this paper, we present a distributed assembler that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing memory-to-disk swapping and reducing network communication in the cluster, we can assemble large sequences such as human genomes (452 GB) on just two nodes in 14.5 hours, and in 23 minutes when scaled up to 128 nodes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours.

Recommended Citation

S. Goswami et al., "Lazer: Distributed Memory-efficient Assembly Of Large-scale Genomes," Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016, pp. 1171 - 1181, article no. 7840721, Institute of Electrical and Electronics Engineers, Jan 2016.

The definitive version is available at https://doi.org/10.1109/BigData.2016.7840721

Department(s)

Computer Science

Comments

National Science Foundation, Grant 1338051

Keywords and Phrases

big data; genome assembly

International Standard Book Number (ISBN)

978-146739004-0

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Publication Date

01 Jan 2016

Included in

Computer Sciences Commons

Computer Science Faculty Research & Creative Works

Lazer: Distributed Memory-efficient Assembly Of Large-scale Genomes
