Doctoral Dissertations
Abstract
"XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is central issues in database management and information retrieval. In fact, although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detection of structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting xml similarity in content and structure using relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time is reduced dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys) is based on previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates. 
Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity and thus to join XML document versions using a change detection mechanism. In this approach, subtree keys still play an important role in order to avoid unnecessary subtree comparisons within multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms"--Abstract, page iv-v.
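The abstract's first step (cluster each document into subtrees at leaf-node parents, match subtrees by key, then compare the unmatched ones for similarity) can be illustrated with a minimal sketch. This is not the dissertation's implementation. The function names, the sample documents, the caller-supplied key tags, and the 0.5 threshold are all hypothetical, and plain Jaccard overlap of leaf values stands in for the content and structure similarity measures the work actually defines.

```python
import xml.etree.ElementTree as ET


def leaf_parent_subtrees(root):
    """Return every element that has at least one leaf child (a child with no children)."""
    return [el for el in root.iter() if any(len(child) == 0 for child in el)]


def key_of(subtree, key_tag):
    """Text of the first child named key_tag, or None if it is absent or empty."""
    node = subtree.find(key_tag)
    return node.text.strip() if node is not None and node.text else None


def leaf_values(subtree):
    """Lower-cased leaf texts of a subtree; tag names are deliberately ignored."""
    return {c.text.strip().lower() for c in subtree
            if len(c) == 0 and c.text and c.text.strip()}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for the real similarity measure."""
    return len(a & b) / len(a | b) if a | b else 0.0


def match_subtrees(subs_a, subs_b, key_a, key_b, threshold=0.5):
    """Return (index_a, index_b, how) triples: key matches first, then similarity."""
    matches = []
    free_a, free_b = set(range(len(subs_a))), set(range(len(subs_b)))

    # Pass 1: pair subtrees whose key values are equal.
    keys_b = {}
    for j, sb in enumerate(subs_b):
        k = key_of(sb, key_b)
        if k is not None:
            keys_b.setdefault(k, j)
    for i, sa in enumerate(subs_a):
        k = key_of(sa, key_a)
        j = keys_b.get(k)
        if k is not None and j in free_b:
            matches.append((i, j, "key"))
            free_a.discard(i)
            free_b.discard(j)

    # Pass 2: compare only the subtrees left unmatched by keys.
    for i in sorted(free_a):
        values_a = leaf_values(subs_a[i])
        scored = [(jaccard(values_a, leaf_values(subs_b[j])), j) for j in sorted(free_b)]
        if scored:
            score, j = max(scored, key=lambda pair: pair[0])
            if score >= threshold:
                matches.append((i, j, "similarity"))
                free_b.discard(j)
    return matches


# Two hypothetical bibliographic sources with different tag names and structures.
doc_a = """<dblp>
  <article><title>XML Joins</title><year>2008</year><doi>10.1/a</doi></article>
  <article><title>Tree Matching</title><year>2009</year></article>
</dblp>"""
doc_b = """<library>
  <paper><name>Tree Matching</name><published>2009</published></paper>
  <paper><name>XML Joins Revisited</name><id>10.1/a</id></paper>
</library>"""

subs_a = leaf_parent_subtrees(ET.fromstring(doc_a))
subs_b = leaf_parent_subtrees(ET.fromstring(doc_b))
result = match_subtrees(subs_a, subs_b, key_a="doi", key_b="id")
print(result)
```

With these sample inputs, the first article pairs with the second paper through the shared key value, and the remaining article pairs with the remaining paper because their leaf values coincide even though their tag names differ. Restricting the second pass to subtrees that keys left unmatched reflects the abstract's point that keys avoid unnecessary subtree comparisons.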
Advisor(s)
Madria, Sanjay Kumar
Committee Member(s)
Leopold, Jennifer
Erçal, Fikret
Yu, Vincent (Wen-Bin)
Sabharwal, Chaman
Department(s)
Computer Science
Degree Name
Ph. D. in Computer Science
Publisher
Missouri University of Science and Technology
Publication Date
Fall 2010
Journal article titles appearing in thesis/dissertation
- XML data integration based on content and structure similarity using keys
- System for detecting XML similarity in content and structure using a relational database
- XML-SIM: structure and content semantic similarity detection using keys
- XML-SIM-change: structure and content semantic similarity detection among XML document versions
Pagination
xii, 123 pages
Note about bibliography
Includes bibliographical references.
Rights
© 2010 Waraporn Viyanon, All rights reserved.
Document Type
Dissertation - Open Access
File Type
text
Language
English
Subject Headings
Cluster analysis -- Mathematical models
Matching theory
Semantic integration (Computer systems)
XML (Document markup language)
Thesis Number
T 9706
Print OCLC #
750016252
Electronic OCLC #
750018227
Recommended Citation
Viyanon, Waraporn, "Structure and content semantic similarity detection of eXtensible markup language documents using keys" (2010). Doctoral Dissertations. 1950.
https://scholarsmine.mst.edu/doctoral_dissertations/1950