ParSECH: Parallel Sequencing Error Correction With Hadoop For Large-scale Genome Sequences

Abstract

A scalable and accurate error correction tool is essential for all next-generation sequencing (NGS) projects, as high-throughput sequencing machines now produce terabytes of data with significantly higher error rates than conventional Sanger sequencing. In this paper, we develop ParSECH, a scalable and fully distributed error correction software based on k-mer spectrum analysis, without the need for a reference genome. To achieve high scalability over terabytes of data and hundreds of cores, ParSECH utilizes two open-source big data frameworks: Hadoop and Hazelcast. To achieve high accuracy, unlike existing error correction tools that use a single k-mer coverage cutoff to detect errors, ParSECH determines the skewness in the k-mer coverage of each individual read and then corrects the errors in each read separately for low- and high-coverage regions of the genome. We demonstrate the scalability of ParSECH by correcting the errors of both simulated and real whole human genome data with coverage ranging from 2x to 40x. ParSECH can correct the largest dataset (a 452 GB human genome), which could not be handled by the existing error correction tools, in about 39 hours. For a small E. coli genome dataset, ParSECH achieves 94% accuracy, higher than Quake's 90% accuracy.
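
The abstract's central idea is that sequencing errors show up as infrequent k-mers in the k-mer spectrum, so bases supported only by low-coverage k-mers are suspect. The sketch below illustrates that idea on a single machine; it is not ParSECH's distributed Hadoop/Hazelcast implementation and does not include its per-read, skewness-based cutoffs. The function names, the fixed cutoff, and the toy reads are assumptions made only for illustration.

from collections import Counter

def kmer_spectrum(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def flag_suspect_positions(read, counts, k, cutoff):
    """Return base positions covered only by k-mers at or below the cutoff."""
    n = len(read)
    weak = [True] * n              # assume suspect until a trusted k-mer covers it
    for i in range(n - k + 1):
        if counts[read[i:i + k]] > cutoff:
            for j in range(i, i + k):
                weak[j] = False    # position supported by a frequent (trusted) k-mer
    return [p for p in range(n) if weak[p]]

# Toy usage: one read carries a substitution error in its last base.
reads = ["ACGTACGTAC"] * 5 + ["ACGTACGTAT"]
counts = kmer_spectrum(reads, k=5)
print(flag_suspect_positions("ACGTACGTAT", counts, k=5, cutoff=1))   # -> [9]

In this toy example only the mismatched final base of the erroneous read is flagged, because every other position is covered by at least one k-mer that appears frequently in the spectrum; a full corrector would then replace the flagged base with the alternative that restores trusted k-mers.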

Department(s)

Computer Science

International Standard Book Number (ISBN)

978-194343607-1

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2024 The Authors. All rights reserved.

Publication Date

01 Jan 2017
