Abstract
Genome sequencing technology has made tremendous progress in both throughput and cost per base pair, resulting in an explosion in data size. Consequently, typical sequence assembly tools demand a great deal of processing power and memory and cannot assemble large datasets unless they run on hundreds of nodes. In this paper, we present a distributed assembler that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By improving memory-to-disk swapping and reducing network communication in the cluster, we can assemble large sequences such as human genomes (452 GB) on just two nodes in 14.5 hours, and scale up to 128 nodes, completing in 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours.
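The run times quoted in the abstract imply a speedup and a parallel efficiency for each dataset. The short sketch below is not taken from the paper; it only works through that arithmetic, assuming the small-cluster and large-cluster runs are directly comparable. The function name `scaling` is a hypothetical helper introduced for illustration.

```python
def scaling(nodes_small, hours_small, nodes_large, hours_large):
    """Return (speedup, parallel efficiency) between two cluster sizes.

    Speedup is the ratio of run times; efficiency divides that speedup
    by the factor by which the node count grew.
    """
    speedup = hours_small / hours_large
    node_ratio = nodes_large / nodes_small
    return speedup, speedup / node_ratio


# Figures quoted in the abstract; 23 minutes is converted to hours.
human = scaling(2, 14.5, 128, 23 / 60)
wheat = scaling(8, 18.5, 128, 1.25)

for name, (speedup, efficiency) in (("human", human), ("wheat", wheat)):
    print(f"{name}: {speedup:.1f}x speedup, {efficiency:.1%} parallel efficiency")
```

Under these assumptions, the human dataset speeds up by roughly 38x on 64 times as many nodes (about 59% efficiency), while the wheat dataset speeds up by 14.8x on 16 times as many nodes (about 92.5% efficiency).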
Recommended Citation
S. Goswami et al., "Lazer: Distributed Memory-efficient Assembly Of Large-scale Genomes," Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016, pp. 1171 - 1181, article no. 7840721, Institute of Electrical and Electronics Engineers, Jan 2016.
The definitive version is available at https://doi.org/10.1109/BigData.2016.7840721
Department(s)
Computer Science
Keywords and Phrases
big data; genome assembly
International Standard Book Number (ISBN)
978-146739004-0
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2024 Institute of Electrical and Electronics Engineers, All rights reserved.
Publication Date
01 Jan 2016
Comments
National Science Foundation, Grant 1338051