ParSECH: Parallel Sequencing Error Correction With Hadoop For Large-scale Genome Sequences
Abstract
A scalable and accurate error correction tool is essential for next-generation sequencing (NGS) projects, as high-throughput sequencing machines now produce terabytes of data with significantly higher error rates than conventional Sanger sequencing. In this paper, we develop ParSECH, a scalable and fully distributed error correction tool based on k-mer spectrum analysis that requires no reference genome. To achieve high scalability over terabytes of data and hundreds of cores, ParSECH utilizes two open-source big data frameworks: Hadoop and Hazelcast. To achieve high accuracy, unlike existing error correction tools that use a single k-mer coverage cutoff to detect errors, ParSECH determines the skewness in the k-mer coverage of each individual read and then corrects errors in each read separately for low- and high-coverage regions of the genome. We demonstrate the scalability of ParSECH by correcting errors in both simulated and real whole human genome data with coverage ranging from 2x to 40x. ParSECH corrects the largest dataset (a 452 GB human genome), which existing error correction tools could not handle, in about 39 hours. On a small E. coli dataset, ParSECH achieves 94% accuracy, higher than Quake's 90%.
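To illustrate the k-mer spectrum idea the abstract builds on, the sketch below is a minimal single-machine toy, not ParSECH's distributed implementation: it counts k-mers across reads and flags read positions covered only by k-mers whose count falls below a coverage cutoff (the classic single-cutoff heuristic that ParSECH refines with per-read skewness). All function names, the choice of k, and the cutoff are illustrative assumptions.

```python
from collections import Counter

def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_spectrum(reads, k):
    """Count k-mer occurrences over all reads (the k-mer spectrum)."""
    spectrum = Counter()
    for read in reads:
        spectrum.update(kmers(read, k))
    return spectrum

def flag_errors(read, spectrum, k, cutoff):
    """Return base positions covered only by 'weak' k-mers
    (count below cutoff) -- likely sequencing errors under the
    simple single-cutoff k-mer spectrum model."""
    weak = [spectrum[km] < cutoff for km in kmers(read, k)]
    suspects = []
    for p in range(len(read)):
        # k-mers at indices max(0, p-k+1) .. p cover base p.
        covering = weak[max(0, p - k + 1):p + 1]
        if covering and all(covering):
            suspects.append(p)
    return suspects
```

With five copies of a true read plus one read carrying a single substitution, the k-mers spanning the substituted base are rare, so only positions near the error are flagged; a distributed tool like ParSECH performs the counting step as a MapReduce-style aggregation instead of an in-memory `Counter`.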
Recommended Citation
A. K. Das et al., "ParSECH: Parallel Sequencing Error Correction With Hadoop For Large-scale Genome Sequences," Proceedings of the 9th International Conference on Bioinformatics and Computational Biology, BICOB 2017, pp. 121–128, Jan 2017.
Department(s)
Computer Science
International Standard Book Number (ISBN)
978-194343607-1
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2024 The Authors, All rights reserved.
Publication Date
01 Jan 2017