ParSECH: Parallel Sequencing Error Correction With Hadoop For Large-scale Genome Sequences
Abstract
A scalable and accurate error correction tool is essential for next-generation sequencing (NGS) projects, as high-throughput sequencing machines now produce terabytes of data with significantly higher error rates than conventional Sanger sequencing. In this paper, we develop ParSECH, a scalable and fully distributed error correction tool based on k-mer spectrum analysis that requires no reference genome. To achieve high scalability over terabytes of data and hundreds of cores, ParSECH utilizes two open-source big data frameworks: Hadoop and Hazelcast. To achieve high accuracy, unlike existing error correction tools that use a single k-mer coverage cutoff to detect errors, ParSECH determines the skewness in the k-mer coverage of each individual read and then corrects errors in each read separately for low- and high-coverage regions of the genome. We demonstrate the scalability of ParSECH by correcting errors in both simulated and real whole human genome data with coverage ranging from 2x to 40x. ParSECH corrects the largest dataset (a 452 GB human genome), which existing error correction tools could not handle, in about 39 hours. On a small E. coli dataset, ParSECH achieves 94% accuracy, higher than Quake's 90%.
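To illustrate the k-mer spectrum idea the abstract builds on, the sketch below is a minimal single-machine toy, not ParSECH's distributed implementation: it counts k-mers across reads and flags read positions covered only by k-mers whose count falls below a coverage cutoff (the classic single-cutoff heuristic that ParSECH refines with per-read skewness). All function names, the choice of k, and the cutoff are illustrative assumptions.

```python
from collections import Counter

def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_spectrum(reads, k):
    """Count k-mer occurrences over all reads (the k-mer spectrum)."""
    spectrum = Counter()
    for read in reads:
        spectrum.update(kmers(read, k))
    return spectrum

def flag_errors(read, spectrum, k, cutoff):
    """Return base positions covered only by 'weak' k-mers
    (count below cutoff) -- likely sequencing errors under the
    simple single-cutoff k-mer spectrum model."""
    weak = [spectrum[km] < cutoff for km in kmers(read, k)]
    suspects = []
    for p in range(len(read)):
        # k-mers at indices max(0, p-k+1) .. p cover base p.
        covering = weak[max(0, p - k + 1):p + 1]
        if covering and all(covering):
            suspects.append(p)
    return suspects
```

With five copies of a true read plus one read carrying a single substitution, the k-mers spanning the substituted base are rare, so only positions near the error are flagged; a distributed tool like ParSECH performs the counting step as a MapReduce-style aggregation instead of an in-memory `Counter`.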
Recommended Citation
A. K. Das et al., "ParSECH: Parallel Sequencing Error Correction With Hadoop For Large-scale Genome Sequences," Proceedings of the 9th International Conference on Bioinformatics and Computational Biology, BICOB 2017, pp. 121–128, Jan 2017.
Department(s)
Computer Science
International Standard Book Number (ISBN)
978-194343607-1
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2024 The Authors, All rights reserved.
Publication Date
01 Jan 2017