Masters Theses
Keywords and Phrases
Clustering; Feature Selection; Natural Language; Text mining; Vector Space; WordNet
Abstract
Text mining using the vector space representation has proven to be a valuable tool for classification, prediction, information retrieval, and extraction. The nature of text data presents several challenges for these tasks, including high dimensionality and the existence of polysemous and synonymous words. A variety of techniques have been devised to overcome these problems, including feature selection and word sense disambiguation. Privacy-preserving data mining is also an area of emerging interest. Existing techniques for privacy-preserving data mining require the use of secure computation protocols, which often incur a greatly increased computational cost. In this paper, a generalization-based method is presented for creating a semantic-preserving vector space model (SPVSM) that reduces dimension and addresses the problems posed by polysemous and synonymous words. The SPVSM also allows private text data to be safely represented without degrading cluster accuracy or performance. Further, the result can be used in combination with theoretic-based techniques such as latent semantic indexing. The performance of text clustering using the semantic-preserving generalization method is evaluated and compared to existing feature selection techniques, and is shown to have significant merit from a clustering perspective.
Jiang, Wei
Committee Member(s)
Leopold, Jennifer
Wunsch, Donald C.
Computer Science
Degree Name
M.S. in Computer Science
Missouri University of Science and Technology
Publication Date
Fall 2012
viii, 40 pages
Note about bibliography
Includes bibliographical references (pages 76-77).
© 2012 Michael Howard, All rights reserved.
Document Type
Thesis - Open Access
File Type
Subject Headings
Text processing (Computer science)
Data protection
Data mining -- Statistical methods
Thesis Number
T 10093
Electronic OCLC #
Recommended Citation
Howard, Michael, "Semantic preserving text representation and its applications in text clustering" (2012). Masters Theses. 6946.