A New Domain-Independent Field Matching Algorithm for Large Databases
In large databases, string-valued attributes are very important because of their entity-identifying and descriptive roles. For various reasons, the name of an entity may be presented in several different ways, such as "New Mexico Tech" and "NMT" for the New Mexico Institute of Mining and Technology in Socorro, New Mexico, USA. The task of field matching is to determine whether two syntactically different values are alternative representations of the same semantic entity. The field matching problem is recognized as important, even though little research has been done on field matching algorithms. In this paper, a new domain-independent, token-based field matching algorithm is proposed and tested. The new algorithm achieves high string matching accuracy and efficiency by introducing the concept of string matching points and defining proper string matching patterns. A new general string matching framework enables practical algorithms to be developed easily according to the characteristics of the problems and data.
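To make the field matching task concrete, the following is a minimal sketch of a generic token-based matcher. It is not the algorithm proposed in the paper, whose matching points and matching patterns are not specified in this abstract. The function names (`tokenize`, `token_matches`, `is_acronym`, `field_match_score`), the prefix rule, the acronym rule, and the scoring formula are all assumptions chosen for illustration.

```python
import re


def tokenize(value):
    """Lowercase a field value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def token_matches(a, b):
    """Assumed rule: tokens match if equal or if one is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)


def is_acronym(short_tokens, long_tokens):
    """Assumed rule: a lone token that spells the initials of the other value."""
    if len(short_tokens) != 1 or len(long_tokens) < 2:
        return False
    return short_tokens[0] == "".join(t[0] for t in long_tokens)


def field_match_score(x, y):
    """Return an illustrative similarity in [0, 1] between two field values."""
    tx, ty = tokenize(x), tokenize(y)
    if not tx or not ty:
        return 0.0
    if is_acronym(tx, ty) or is_acronym(ty, tx):
        return 1.0
    unmatched = list(ty)  # each token of y may be matched at most once
    hits = 0
    for a in tx:
        for b in unmatched:
            if token_matches(a, b):
                unmatched.remove(b)
                hits += 1
                break
    # Dice-style ratio: matched tokens over the total number of tokens
    return 2.0 * hits / (len(tx) + len(ty))


if __name__ == "__main__":
    print(field_match_score("New Mexico Tech", "NMT"))
    print(field_match_score(
        "New Mexico Tech",
        "New Mexico Institute of Mining and Technology"))
```

Under these assumed rules, "NMT" is treated as an acronym of "New Mexico Tech", and "Tech" matches "Technology" as a prefix. A domain-independent algorithm such as the one the paper describes would replace these hand-picked rules with its own matching patterns.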
M. Wei et al., "A New Domain-Independent Field Matching Algorithm for Large Databases," Proceedings of the International Conference on Data Mining (2005, Las Vegas, NV), pp. 126-131, Jun 2005.
International Conference on Data Mining (2005: Jun. 20-23, Las Vegas, NV)
Geosciences and Geological and Petroleum Engineering
Keywords and Phrases
Data Cleaning; Domain Independent; Field Matching Algorithm; String Matching Patterns; Information Management; Information Theory
International Standard Book Number (ISBN)
Article - Conference proceedings