Background: Clustering the ESTs from a large dataset representing a single species is a convenient starting point for a number of investigations into gene discovery, genome evolution, expression patterns, and alternatively spliced transcripts. Several methods have been developed to accomplish this, the most widely available being UniGene, a public domain collection of gene-oriented clusters for over 45 different species created and maintained by NCBI. The goal is for each cluster to represent a unique gene, but it is currently not known how closely the overall results match that goal. UniGene's build procedure begins with initial mRNA clusters before joining ESTs. UniGene's results for soybean indicate a significant amount of redundancy among some sequences reported to be unique mRNAs. To establish a valid non-redundant known gene set for Glycine max, we applied our algorithm, PECT, to the clustering of only mRNA sequences. The mRNA dataset was run through the algorithm using two different matching stringencies. The resulting cluster compositions were compared to each other and to UniGene. Clusters exhibiting differences among the three methods were analyzed using 1) nucleotide and amino acid alignments and 2) the submitting authors' conclusions to determine whether members of a single cluster represented the same gene or not.

Results: Of the 12 clusters that were examined closely, most contained sequences that did not belong in the same cluster. However, neither of the two PECT stringencies nor UniGene had a significantly better record of accuracy in placing paralogs into separate clusters.

Conclusion: Our results reveal that, although each method produces some errors, using multiple stringencies for matching or a sequential hierarchical method of increasing stringencies can provide more reliable results and therefore allow greater confidence in the vast majority of clusters that contain only ESTs and no mRNA sequences.
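
The comparison of cluster compositions across the three methods (two matching stringencies plus UniGene) can be illustrated with a minimal sketch. This is not the PECT algorithm or the authors' actual procedure. It assumes each method's output is available as a list of sets of sequence IDs, and the IDs, method names, and function names used here are hypothetical. It flags every sequence whose set of co-clustered members differs between methods, which is one way to select clusters for closer inspection.

```python
from typing import Dict, FrozenSet, List, Set

# Assumed representation: one clustering = list of clusters, each a set of IDs.
Clustering = List[Set[str]]


def comembers(clustering: Clustering) -> Dict[str, FrozenSet[str]]:
    """Map each sequence ID to the frozenset of IDs sharing its cluster."""
    lookup: Dict[str, FrozenSet[str]] = {}
    for cluster in clustering:
        members = frozenset(cluster)
        for seq_id in members:
            lookup[seq_id] = members
    return lookup


def discordant_sequences(clusterings: Dict[str, Clustering]) -> Set[str]:
    """Return IDs whose cluster membership differs between any two methods."""
    lookups = [comembers(c) for c in clusterings.values()]
    all_ids: Set[str] = set()
    for lookup in lookups:
        all_ids.update(lookup)
    discordant: Set[str] = set()
    for seq_id in all_ids:
        # A sequence absent from a method is treated as a singleton there.
        memberships = {lk.get(seq_id, frozenset([seq_id])) for lk in lookups}
        if len(memberships) > 1:
            discordant.add(seq_id)
    return discordant


# Hypothetical mRNA IDs clustered by three methods.
example = {
    "low_stringency": [{"m1", "m2", "m3"}, {"m4"}],
    "high_stringency": [{"m1", "m2"}, {"m3"}, {"m4"}],
    "unigene": [{"m1", "m2", "m3"}, {"m4"}],
}
flagged = discordant_sequences(example)
```

In this toy input, the higher stringency splits `m3` away from `m1` and `m2`, so all three are flagged for review, while `m4` is a singleton under every method and is not.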

Meeting Name

2nd Annual MidSouth Computational Biology and Bioinformatics Society Conference. Bioinformatics: A Systems Approach (2004: Oct. 7-9, Little Rock, AR)


Department(s)

Biological Sciences

Second Department

Computer Science

Keywords and Phrases

Expression patterns; Gene discovery; Genome evolution; Hierarchical method; Non-redundant; Public domains; Reliable results; Single species

Document Type

Article - Conference proceedings

Document Version

Final Version

© 2005 BioMed Central, All rights reserved.

Publication Date

01 Jul 2005

PubMed ID