Abstract

Redundant or duplicate data are among the most troublesome problems in database management and applications. Approximate field matching is a key technique for resolving the problem: it identifies semantically equivalent string values that have syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework that generalizes the field matching problem across domains. Introducing the concept of String Matching Points (SMP) in string comparison improves string matching accuracy and efficiency compared with other commonly applied field matching algorithms. The paper discusses how field matching algorithms are developed from the general framework. The framework and the corresponding algorithm are tested on a public data set from the NASA publication abstract database. The approach can be applied to similar problems in other databases.
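
As a rough illustration of what token-based approximate field matching means in general, the sketch below compares two field values by the overlap of their token sets (Jaccard similarity). It is not the paper's SMP-based algorithm, which the abstract does not specify; the function names `tokenize`, `token_similarity`, and `is_probable_duplicate`, and the default threshold of 0.8, are assumptions made only for this example.

```python
import re


def tokenize(field: str) -> set[str]:
    """Lowercase a field value and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", field.lower()))


def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets (1.0 means identical tokens)."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty fields are treated as matching
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_probable_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two field values as likely duplicates (threshold is an assumed value)."""
    return token_similarity(a, b) >= threshold


# Same tokens, different order, case, and punctuation.
print(is_probable_duplicate("Data Science Journal", "journal, DATA science"))
# Extra token lowers the score below the assumed threshold.
print(token_similarity("Dept of Geology", "Geology Dept"))
```

A set-based measure like this ignores token order and repetition, which is why reordered values still match; any threshold would have to be tuned on the data set at hand.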

Recommended Citation

M. Wei et al., "Improving Database Quality through Eliminating Duplicate Records," Data Science Journal, vol. 5, pp. 127-142, Committee on Data for Science and Technology, Nov 2006.

The definitive version is available at https://doi.org/10.2481/dsj.5.127

Department(s)

Geosciences and Geological and Petroleum Engineering

Keywords and Phrases

Field Matching; General Field Matching Framework; String Matching Patterns; String Matching Points; Algorithms; Electronic Publishing; Problem Solving; Records Management; Semantics; Database Systems

International Standard Serial Number (ISSN)

1683-1470

Document Type

Article - Journal

Document Version

Final Version

File Type

text

Language(s)

English

Publication Date

01 Nov 2006

Included in

Geology Commons, Petroleum Engineering Commons

Geosciences and Geological and Petroleum Engineering Faculty Research & Creative Works

Improving Database Quality through Eliminating Duplicate Records
