Computer Science Faculty Research & Creative Works

Parallel Hash-Based EST Clustering Algorithm for Gene Sequencing

Rameshreddy Mudhireddy
Fikret Erçal, Missouri University of Science and Technology
Ronald L. Frank, Missouri University of Science and Technology

Abstract

Expressed sequence tag (EST) clustering is a simple yet effective method to discover all the genes present in a variety of species. Although using ESTs is a cost-effective approach to gene discovery, the amount of data, and hence the computational resources required, make it a very challenging problem. The time and storage requirements of EST clustering are prohibitively expensive: existing tools have quadratic time complexity because they perform all-against-all sequence comparisons. With the rapid growth of EST data, we need better and faster clustering tools. In this paper, we present HECT (Hash based EST Clustering Tool), a novel time- and memory-efficient algorithm for EST clustering. We report that HECT can cluster a dataset of 10,000 human ESTs (the same dataset used in benchmarking d2_cluster) in 207 minutes on a 1 GHz Pentium III processor, which is 36 times faster than the original d2_cluster algorithm. A parallel version of HECT (PECT) was also developed and used to cluster 269,035 soybean EST sequences on the IA-32 Linux cluster at the National Center for Supercomputing Applications at UIUC. The parallel algorithm exhibited excellent speedup over its sequential counterpart, and its memory requirements are almost negligible, making it suitable for data of virtually any size. The performance of the proposed clustering algorithms is compared against other known clustering techniques, and the results are reported in the paper.
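
To illustrate why a hash table can help avoid the all-against-all comparisons mentioned in the abstract, the following is a minimal Python sketch of the general idea. It is not HECT itself, whose details are given only in the paper. The function names, the parameters `k` and `min_shared`, and the shared-word criterion for joining two sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings (words) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k=8, min_shared=3):
    """Group sequences that share at least min_shared distinct k-mers.

    Hypothetical illustration only: the similarity rule and both
    parameters are assumptions, not the criteria used by HECT.
    """
    # Hash table: k-mer -> indices of the sequences that contain it.
    index = defaultdict(list)
    for i, s in enumerate(seqs):
        for word in kmers(s, k):
            index[word].append(i)

    # Only pairs that land in the same bucket are ever examined,
    # so unrelated sequences are never compared with each other.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(ids, 2):
            shared[(a, b)] += 1

    # Union-find: merge every pair that passes the threshold.
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in range(len(seqs)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    reads = ["ACGTACGGTCA", "GTACGGTCATT", "TTTTGGGGCCCC"]
    print(cluster_by_shared_kmers(reads, k=4, min_shared=3))
```

In this sketch the first two reads overlap and share several 4-mers, so they are grouped together, while the third read shares none and stays alone. The work depends on how many sequences fall into each bucket rather than on the total number of sequence pairs, although a very common word can still produce a large bucket.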

Recommended Citation

R. Mudhireddy et al., "Parallel Hash-Based EST Clustering Algorithm for Gene Sequencing," DNA and Cell Biology, vol. 23, no. 10, pp. 615 - 623, Mary Ann Liebert, Inc., Oct 2004.

The definitive version is available at https://doi.org/10.1089/dna.2004.23.615

Department(s)

Computer Science

Second Department

Biological Sciences

Keywords and Phrases

EST Clustering; Hash; Human EST Dataset; Genetic programming (Computer science); Human gene mapping

International Standard Serial Number (ISSN)

1044-5498; 1557-7430

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 Oct 2004

PubMed ID

15585119
