A New Domain-Independent Field Matching Algorithm for Large Databases

Abstract

In large databases, string-valued attributes are very important because they identify and describe entities. For various reasons, the name of an entity may be presented in several different ways, as in "New Mexico Tech" and "NMT" for New Mexico Institute of Mining and Technology in Socorro, New Mexico, USA. The task of field matching is to determine whether two syntactically different values are alternative representations of the same semantic entity. The field matching problem is recognized as important, even though little research has been done on field matching algorithms. In this paper, a new domain-independent, token-based field matching algorithm is proposed and tested. The new algorithm achieves high string matching accuracy and efficiency by introducing the concept of string matching points and defining appropriate string matching patterns. A new general string matching framework enables practical algorithms to be developed easily according to the characteristics of the problem and the data.
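The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of what token-based field matching can look like in general, not the method proposed in the paper. The function names, the two token patterns (prefix abbreviation and acronym), and the scoring formula are all assumptions made for illustration.

```python
import re


def tokenize(field):
    """Split a field value into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", field.lower())


def tokens_match(a, b):
    """Hypothetical token pattern: equal tokens, or one is a prefix of the other."""
    return a == b or a.startswith(b) or b.startswith(a)


def is_acronym(short, long_tokens):
    """True if `short` is built from the first letters of `long_tokens`, in order."""
    return len(long_tokens) > 1 and short == "".join(t[0] for t in long_tokens)


def field_match_score(x, y):
    """Return a similarity in [0, 1] for two field values (illustrative only)."""
    tx, ty = tokenize(x), tokenize(y)
    if not tx or not ty:
        return 0.0

    # Assumed pattern 1: a single token that is an acronym of the other field.
    if len(tx) == 1 and is_acronym(tx[0], ty):
        return 1.0
    if len(ty) == 1 and is_acronym(ty[0], tx):
        return 1.0

    # Assumed pattern 2: greedy one-to-one matching of tokens.
    unused = list(ty)
    matched = 0
    for a in tx:
        for b in unused:
            if tokens_match(a, b):
                unused.remove(b)  # each token may be matched only once
                matched += 1
                break

    # Dice-style score: share of tokens in both fields that found a partner.
    return 2 * matched / (len(tx) + len(ty))


if __name__ == "__main__":
    print(field_match_score("New Mexico Tech", "NMT"))
    print(field_match_score(
        "New Mexico Tech",
        "New Mexico Institute of Mining and Technology",
    ))
```

Under these assumptions, "NMT" is recognized as an acronym of "New Mexico Tech", and "Tech" pairs with "Technology" through the prefix pattern. A practical matcher would need a tuned decision threshold and a richer set of patterns than this sketch provides.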

Meeting Name

International Conference on Data Mining (2005: Jun. 20-23, Las Vegas, NV)

Department(s)

Geosciences and Geological and Petroleum Engineering

Keywords and Phrases

Data Cleaning; Domain Independent; Field Matching Algorithm; String Matching Patterns; Information Management; Information Theory

International Standard Book Number (ISBN)

978-1932415797

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Publication Date

01 Jun 2005
