Keywords and Phrases
Clustering; Feature Selection; Natural Language; Text mining; Vector Space; WordNet
Text mining using the vector space representation has proven to be a valuable tool for classification, prediction, information retrieval and extraction. The nature of text data presents several issues for these tasks, including high dimensionality and the existence of polysemous and synonymous words. A variety of techniques have been devised to overcome these shortcomings, including feature selection and word sense disambiguation. Privacy-preserving data mining is also an area of emerging interest. Existing techniques for privacy-preserving data mining require the use of secure computation protocols, which often incur a greatly increased computational cost. In this paper, a generalization-based method is presented for creating a semantic-preserving vector space model (SPVSM) that reduces dimension and addresses the problems posed by polysemous and synonymous words. The SPVSM also allows private text data to be safely represented without degrading cluster accuracy or performance. Further, the result produced is usable in combination with theoretic-based techniques such as latent semantic indexing. The performance of text clustering using the semantic-preserving generalization method is evaluated and compared to existing feature selection techniques, and is shown to have significant merit from a clustering perspective.
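As a rough illustration of the generalization idea in the abstract, the sketch below builds term-count vectors for two short documents before and after mapping each word to a broader concept. It is not the thesis's implementation: the `GENERALIZE` map is a hand-written, hypothetical stand-in for a lexical resource such as WordNet, and the helper names (`tokenize`, `generalize`, `term_vector`) are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical generalization map (an assumption for illustration):
# each specific word points to a broader concept, so synonyms collapse
# onto a single dimension.
GENERALIZE = {
    "car": "vehicle",
    "automobile": "vehicle",
    "physician": "doctor",
    "surgeon": "doctor",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def generalize(tokens):
    """Replace each token with its broader concept when one is known."""
    return [GENERALIZE.get(tok, tok) for tok in tokens]


def term_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


docs = [
    "the physician bought a car",
    "the surgeon bought an automobile",
]

raw_tokens = [tokenize(d) for d in docs]
gen_tokens = [generalize(t) for t in raw_tokens]

# One dimension per distinct term, before and after generalization.
raw_vocab = sorted({tok for toks in raw_tokens for tok in toks})
gen_vocab = sorted({tok for toks in gen_tokens for tok in toks})

raw_vectors = [term_vector(t, raw_vocab) for t in raw_tokens]
gen_vectors = [term_vector(t, gen_vocab) for t in gen_tokens]

print(len(raw_vocab), len(gen_vocab))
print(gen_vocab)
print(gen_vectors)
```

In this toy case the generalized vocabulary is smaller than the raw one, and the two documents agree on the `doctor` and `vehicle` dimensions even though they share neither "physician" nor "car" in their original wording. The generalized vectors also no longer record which specific word each author used, which hints at why such a representation may be of interest when the text is private.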
Wunsch, Donald C.
M.S. in Computer Science
Missouri University of Science and Technology
viii, 40 pages
© 2012 Michael Howard, All rights reserved.
Thesis - Open Access
Library of Congress Subject Headings
Text processing (Computer science)
Data mining -- Statistical methods
Electronic OCLC #
Howard, Michael, "Semantic preserving text representation and its applications in text clustering" (2012). Masters Theses. 6946.