Doctoral Dissertations


"XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is a central issue in database management and information retrieval. In fact, although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detection of structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting XML similarity in content and structure using a relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time is reduced dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys), is based on previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates.
Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity and thus to join XML document versions using a change detection mechanism. In this approach, subtree keys still play an important role in order to avoid unnecessary subtree comparisons within multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms"--Abstract, page iv-v.
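
The clustering and key-matching steps described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the sample documents, the choice of `isbn` as the key, and the Jaccard overlap used as a stand-in content measure are all assumptions made for illustration. The sketch treats every element that has at least one leaf child as a clustering point, pairs subtrees whose key values agree, and leaves the rest for a similarity comparison.

```python
import xml.etree.ElementTree as ET


def leaf_parent_subtrees(root):
    """Return every element that has at least one leaf child (a child with no children)."""
    return [el for el in root.iter() if any(len(child) == 0 for child in el)]


def leaf_values(subtree):
    """Map each leaf child's tag to its stripped text content."""
    return {c.tag: (c.text or "").strip() for c in subtree if len(c) == 0}


def match_by_key(base_subtrees, target_subtrees, key_tag):
    """Pair subtrees whose key leaf carries the same non-empty value.

    Returns (matched pairs, unmatched base subtrees).
    """
    index = {}
    for t in target_subtrees:
        key = leaf_values(t).get(key_tag)
        if key:
            index.setdefault(key, []).append(t)
    matched, unmatched = [], []
    for b in base_subtrees:
        key = leaf_values(b).get(key_tag)
        if key and key in index:
            matched.append((b, index[key][0]))
        else:
            unmatched.append(b)
    return matched, unmatched


def content_overlap(a, b):
    """Jaccard overlap of leaf text values (an assumed stand-in for a content measure)."""
    va, vb = set(leaf_values(a).values()), set(leaf_values(b).values())
    return len(va & vb) / len(va | vb) if va | vb else 0.0


# Hypothetical sample documents with different tag names and structures.
doc_a = ET.fromstring(
    "<bib>"
    "<article><title>XML Joins</title><isbn>111</isbn></article>"
    "<article><title>Tree Matching</title><isbn>222</isbn></article>"
    "</bib>"
)
doc_b = ET.fromstring(
    "<library>"
    "<paper><isbn>111</isbn><name>XML Joins</name></paper>"
    "<paper><isbn>333</isbn><name>Other</name></paper>"
    "</library>"
)

subtrees_a = leaf_parent_subtrees(doc_a)
subtrees_b = leaf_parent_subtrees(doc_b)
matched, unmatched = match_by_key(subtrees_a, subtrees_b, "isbn")
```

In this sketch the key lookup avoids comparing every subtree pair, and only the subtrees left in `unmatched` would go on to a content and structure comparison such as `content_overlap`.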


Advisor(s)

Madria, Sanjay Kumar

Committee Member(s)

Leopold, Jennifer
Erçal, Fikret
Yu, Vincent (Wen-Bin)
Sabharwal, Chaman


Department(s)

Computer Science

Degree Name

Ph. D. in Computer Science


Publisher

Missouri University of Science and Technology

Publication Date

Fall 2010

Journal article titles appearing in thesis/dissertation

  • XML data integration based on content and structure similarity using keys
  • System for detecting XML similarity in content and structure using a relational database
  • XML-SIM: structure and content semantic similarity detection using keys
  • XML-SIM-change: structure and content semantic similarity detection among XML document versions


Pagination

xii, 123 pages

Note about bibliography

Includes bibliographical references.


Rights

© 2010 Waraporn Viyanon, All rights reserved.

Document Type

Dissertation - Open Access

File Type




Subject Headings

Cluster analysis -- Mathematical models
Matching theory
Semantic integration (Computer systems)
XML (Document markup language)

Thesis Number

T 9706

Print OCLC #


Electronic OCLC #