Computer Science Faculty Research & Creative Works

A Hybrid And Scalable Error Correction Algorithm For Indel And Substitution Errors Of Long Reads

Abstract

Background: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. Methods: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substituted error base. Results: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy. Conclusion: ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.
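The widest path (maximum min-coverage path) idea named in the abstract can be illustrated with a small single-machine sketch. This is not the ParLECH implementation, which is distributed; the function names `build_de_bruijn`, `widest_path`, and `spell`, the toy reads, and the choice of a Dijkstra-style search are all assumptions made for illustration. Edge weights are k-mer counts from the short reads, and the search maximizes the smallest weight along the path.

```python
import heapq
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge weighted by its count."""
    graph = defaultdict(dict)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            graph[u][v] = graph[u].get(v, 0) + 1
    return graph


def widest_path(graph, source, target):
    """Return (bottleneck coverage, node path) maximizing the minimum edge
    coverage between source and target; (0, []) if target is unreachable."""
    best = {source: float("inf")}   # best known bottleneck per node
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if u == target:
            break
        if width < best.get(u, 0):
            continue  # stale heap entry
        for v, coverage in graph.get(u, {}).items():
            w = min(width, coverage)
            if w > best.get(v, 0):
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (-w, v))
    if target not in best:
        return 0, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[target], path


def spell(path):
    """Turn a node path back into the sequence it represents."""
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy short reads (hypothetical): one well-supported variant, one rare one.
short_reads = ["ACGTA"] * 3 + ["ACCTA"]
graph = build_de_bruijn(short_reads, k=3)
width, path = widest_path(graph, "AC", "TA")
print(width, spell(path))
```

In this toy case the two anchor nodes `AC` and `TA` stand in for the trusted flanks of an indel error region in a long read; the sequence spelled by the returned path would be the candidate replacement for that region.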

Recommended Citation

A. K. Das et al., "A Hybrid And Scalable Error Correction Algorithm For Indel And Substitution Errors Of Long Reads," BMC Genomics, vol. 20, article no. 948, BioMed Central, Dec 2019.

The definitive version is available at https://doi.org/10.1186/s12864-019-6286-9

Department(s)

Computer Science

Publication Status

Open Access

Comments

National Science Foundation, Grant 1338051

Keywords and Phrases

Hadoop; Hybrid error correction; Illumina; NoSQL; PacBio

International Standard Serial Number (ISSN)

1471-2164

Document Type

Article - Journal

Document Version

Final Version

File Type

text

Language(s)

English

Rights

Creative Commons Licensing

This work is licensed under a Creative Commons Attribution 4.0 License.

Publication Date

20 Dec 2019

PubMed ID

31856721

Included in

Computer Sciences Commons
