Masters Theses

Keywords and Phrases

Clustering; Feature Selection; Natural Language; Text mining; Vector Space; Wordnet

Abstract

Text mining using the vector space representation has proven to be a valuable tool for classification, prediction, information retrieval, and extraction. The nature of text data presents several challenges for these tasks, including high dimensionality and the presence of polysemous and synonymous words. A variety of techniques have been devised to overcome these problems, including feature selection and word sense disambiguation. Privacy-preserving data mining is also an area of emerging interest. Existing techniques for privacy-preserving data mining require secure computation protocols, which often incur a greatly increased computational cost. In this paper, a generalization-based method is presented for creating a semantic-preserving vector space model (SPVSM), which reduces dimension and addresses the problems caused by polysemous and synonymous words. The SPVSM also allows private text data to be safely represented without degrading cluster accuracy or performance. Further, the result is usable in combination with theory-based techniques such as latent semantic indexing. The performance of text clustering using the semantic-preserving generalization method is evaluated and compared to existing feature selection techniques, and is shown to have significant merit from a clustering perspective.
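As a rough illustration of the generalization idea described in the abstract, the sketch below replaces each word with a more general concept before building term-count vectors. It is not the thesis's implementation: the `HYPERNYM` table, the `generalize` and `build_vectors` helpers, and the sample documents are all hypothetical, and a real system would presumably derive the generalizations from a resource such as WordNet rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical toy taxonomy: each word maps to a more general concept.
# These entries are illustrative assumptions, not taken from the thesis.
HYPERNYM = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "dog": "animal",
    "puppy": "animal",
    "cat": "animal",
}


def generalize(tokens):
    """Replace each token with its general concept, if one is known."""
    return [HYPERNYM.get(t, t) for t in tokens]


def build_vectors(docs, transform=lambda tokens: tokens):
    """Return (vocabulary, term-count vectors) for whitespace-tokenized docs."""
    bags = [Counter(transform(d.lower().split())) for d in docs]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[term] for term in vocab] for bag in bags]


docs = [
    "car truck dog",
    "automobile puppy cat",
]

# Plain vector space: one dimension per distinct surface word.
raw_vocab, raw_vecs = build_vectors(docs)

# Generalized vector space: synonyms and related words share a dimension.
gen_vocab, gen_vecs = build_vectors(docs, generalize)

print(len(raw_vocab), raw_vecs)
print(len(gen_vocab), gen_vecs)
```

In this toy case the two documents share no dimensions in the plain vector space, while in the generalized space they overlap on both remaining dimensions and the specific source words no longer appear in the representation. The example only shows the mechanics of generalization; it makes no claim about the clustering accuracy reported in the thesis.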

Advisor(s)

Jiang, Wei

Committee Member(s)

Leopold, Jennifer
Wunsch, Donald C.

Department(s)

Computer Science

Degree Name

M.S. in Computer Science

Publisher

Missouri University of Science and Technology

Publication Date

Fall 2012

Pagination

viii, 40 pages

Note about bibliography

Includes bibliographical references (pages 76-77).

Rights

© 2012 Michael Howar, All rights reserved.

Document Type

Thesis - Open Access

File Type

text

Language

English

Library of Congress Subject Headings

Text processing (Computer science)
Data protection
Data mining -- Statistical methods

Thesis Number

T 10093

Electronic OCLC #

828737701
