Abstract

In this paper, we describe a system incorporating an improved technique that detects the similarity of two XML documents based on content and structure similarity using keys. The technique consists of three major components: a subtree generator and validator, a key generator, and similarity components that compare the content and structure of the XML documents. First, an XML document is stored in a relational database and extracted into small subtrees using leaf-node parents. Each leaf-node parent is treated as the root of a subtree, which is then recursively traversed bottom-up for matching. Second, one or more possible keys are identified in order to match XML subtrees from the two documents efficiently. Key matching dramatically reduces the number of comparisons. In addition, the number of subtrees to be processed is reduced in the subtree validation phase using instance statistics and a taxonomic analyzer. The subtrees are matched by the key(s) first, and the remaining subtrees are matched by finding degrees of similarity in content and structure. To obtain improved similarity comparison results, XML element names are transformed according to their semantic similarity. The results show that the clustering points are selected appropriately, and the overall execution time is reduced dramatically. Copyright 2009 ACM.
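
The subtree extraction and key-first matching steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it works on in-memory trees with Python's standard `xml.etree.ElementTree` rather than a relational database, and it assumes the key element name is supplied by the caller instead of being discovered by a key generator. The function names `leaf_node_parents`, `subtree_key`, and `match_by_key`, and the sample documents, are hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_node_parents(root):
    """Return every element that has at least one childless (leaf) child."""
    return [el for el in root.iter()
            if any(len(child) == 0 for child in el)]


def subtree_key(subtree, key_tag):
    """Assumed key: the text of the first direct child named key_tag."""
    child = subtree.find(key_tag)
    if child is None or not child.text:
        return None
    return child.text.strip()


def match_by_key(doc_a, doc_b, key_tag):
    """Pair subtrees whose key values are equal; return the rest unmatched."""
    # Index the second document's subtrees by key value so each lookup
    # is a dictionary access rather than a pairwise comparison.
    index = {}
    for sub in leaf_node_parents(doc_b):
        key = subtree_key(sub, key_tag)
        if key is not None:
            index.setdefault(key, sub)

    matched, unmatched = [], []
    for sub in leaf_node_parents(doc_a):
        key = subtree_key(sub, key_tag)
        if key is not None and key in index:
            matched.append((sub, index[key]))
        else:
            # These would go on to content and structure similarity scoring.
            unmatched.append(sub)
    return matched, unmatched


# Hypothetical sample documents with differing element names.
A = ET.fromstring(
    "<lib><book><isbn>1</isbn><title>XML</title></book>"
    "<book><isbn>2</isbn><title>SQL</title></book></lib>"
)
B = ET.fromstring(
    "<catalog><item><isbn>2</isbn><name>SQL</name></item></catalog>"
)

matched, unmatched = match_by_key(A, B, "isbn")
```

In this sketch only the subtrees left in `unmatched` would need the more expensive content and structure comparison, which is the source of the reduction in comparisons that the abstract attributes to key matching.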

Department(s)

Computer Science

Keywords and Phrases

Clustering; Keys; Similarity measures; Taxonomy analyzer; XML

International Standard Book Number (ISBN)

978-160558512-3

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2024 Association for Computing Machinery, All rights reserved.

Publication Date

01 Dec 2009
