Abstract

Third-generation DNA sequencing technologies, such as single-molecule real-time (SMRT) sequencing and nanopore sequencing, have the potential to fill the gaps in existing genome databases: the raw sequences these machines produce are much longer than those of previous generations and therefore yield more contiguous assemblies. However, these long reads have a high error rate, which makes the assembly process computationally challenging. Moreover, since existing long-read assemblers are designed to run on a single machine, they either take days to complete or run out of memory on even moderate-sized datasets. In this paper, we present a distributed long-read assembler that can assemble large-scale noisy sequence datasets on thousands of cores, resulting in orders-of-magnitude faster assembly times. By effectively using the map-reduce computation model with a distributed hash map, both built on a high-performance active-messaging middleware, we can assemble a PacBio human genome dataset with 139 billion base pairs (about 130 GB) in about 33 minutes (using 2,560 cores), compared to more than 38 hours (using 28 cores) with the current state-of-the-art assembler.
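To make the map-reduce and distributed hash map idea in the abstract concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes that candidate read overlaps are found by counting shared k-mers, which is a common technique but is not confirmed by this page. The partitions stand in for per-node hash tables, and every name here (`owner`, `map_phase`, `reduce_phase`, `candidate_overlaps`) and every parameter value is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import zlib


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def owner(kmer, num_partitions):
    # Deterministic hash, so every worker routes a given k-mer to the same
    # partition. In a real cluster this would pick the destination node.
    return zlib.crc32(kmer.encode()) % num_partitions


def map_phase(reads, k, num_partitions):
    """Map step: send each (k-mer, read id) pair to the owning partition."""
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for read_id, seq in reads.items():
        for km in kmers(seq, k):
            partitions[owner(km, num_partitions)][km].add(read_id)
    return partitions


def reduce_phase(partitions):
    """Reduce step: count the distinct k-mers shared by each pair of reads."""
    shared = defaultdict(int)
    for table in partitions:  # each table could be reduced on its own node
        for read_ids in table.values():
            for a, b in combinations(sorted(read_ids), 2):
                shared[(a, b)] += 1
    return dict(shared)


def candidate_overlaps(reads, k=4, num_partitions=3, min_shared=2):
    """Return read pairs sharing at least min_shared distinct k-mers."""
    counts = reduce_phase(map_phase(reads, k, num_partitions))
    return {pair: n for pair, n in counts.items() if n >= min_shared}


if __name__ == "__main__":
    # Toy reads (assumed data): r1 and r2 overlap, r3 is unrelated.
    toy_reads = {"r1": "ACGTACGGA", "r2": "TACGGATTC", "r3": "GGGGCCCCA"}
    print(candidate_overlaps(toy_reads))
```

Because each k-mer is owned by exactly one partition, the reduce step never counts a shared k-mer twice, and each partition can be processed independently. That independence is the property that lets this style of computation spread across many cores.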

Recommended Citation

S. Goswami et al., "Distributed De Novo Assembler For Large-scale Long-read Datasets," Proceedings - 2020 IEEE International Conference on Big Data, Big Data 2020, pp. 1166-1175, article no. 9377979, Institute of Electrical and Electronics Engineers, Dec 2020.

The definitive version is available at https://doi.org/10.1109/BigData50022.2020.9377979

Department(s)

Computer Science

Comments

National Science Foundation, Grant IBSS-L-1620451

Keywords and Phrases

big data; genome assembly; high-performance computing; long reads; map-reduce; third-generation sequences

International Standard Book Number (ISBN)

978-172816251-5

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Publication Date

10 Dec 2020

Included in

Computer Sciences Commons

Computer Science Faculty Research & Creative Works

Distributed De Novo Assembler For Large-scale Long-read Datasets
