T-plausibility: Semantic Preserving Text Sanitization

Abstract

Text documents play significant roles in decision making and scientific research. Under federal regulations, documents (e.g., pathology records) containing personally identifiable information cannot be shared freely unless they are properly sanitized. Generally speaking, document sanitization consists of two tasks: finding personally identifiable information and hiding it. The first task has received much attention from the research community, but the main strategy for the second has been simply to remove personal identifiers and very sensitive information (e.g., diseases and treatments). If important information (e.g., diagnoses and personal medical histories) is completely removed from pathology records, the records are no longer readable and, even worse, no longer contain sufficient information for research purposes. Observe, however, that the sensitive term "tuberculosis" can be replaced with the less sensitive term "infectious disease". That is, instead of being removed, sensitive terms can be replaced by more general but semantically related terms, which protects sensitive information without unnecessarily reducing the amount of information in the document. Based on this observation, the main contribution of this paper is a novel information-theoretic approach to text sanitization, together with efficient heuristics for sanitizing text documents.
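The replacement idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method or its heuristics: the `GENERALIZATIONS` table and the `sanitize` function are hypothetical names introduced here, and only the "tuberculosis" to "infectious disease" pair comes from the abstract. The sketch assumes a hand-written lookup table, whereas a real system would presumably draw generalizations from a term hierarchy.

```python
import re

# Hypothetical toy table mapping a sensitive term to a more general term.
# Only the "tuberculosis" entry is taken from the abstract; the other
# entry is an invented example added for illustration.
GENERALIZATIONS = {
    "tuberculosis": "infectious disease",
    "melanoma": "skin cancer",
}


def sanitize(text, generalizations=GENERALIZATIONS):
    """Replace each listed sensitive term with its more general term.

    Matching is case-insensitive and restricted to whole words, so a
    term embedded in a longer word is left untouched.
    """
    if not generalizations:
        return text
    # Try longer terms first so a multi-word term wins over a shorter
    # term that happens to be one of its substrings.
    terms = sorted(generalizations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    # Table keys are lowercase, so normalize the matched text before lookup.
    return pattern.sub(lambda m: generalizations[m.group(0).lower()], text)


record = "Patient was treated for tuberculosis in 2008."
print(sanitize(record))
# prints: Patient was treated for infectious disease in 2008.
```

Unlike outright removal, the sanitized sentence stays readable and still tells a researcher that the patient had an infectious disease, which is the trade-off the abstract describes.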

Recommended Citation

W. Jiang et al., "T-plausibility: Semantic Preserving Text Sanitization," Proceedings of the International Conference on Computational Science and Engineering, 2009, Institute of Electrical and Electronics Engineers (IEEE), Aug 2009.

The definitive version is available at https://doi.org/10.1109/CSE.2009.353

Meeting Name

International Conference on Computational Science and Engineering, 2009

Department(s)

Computer Science

Sponsor(s)

United States. Air Force. Office of Scientific Research
National Science Foundation (U.S.)

Keywords and Phrases

Data Privacy; Information Dissemination; Medical Administrative Data Processing

Document Type

Article - Conference proceedings

Document Version

Final Version

File Type

text

Language(s)

English

Publication Date

01 Aug 2009

Computer Science Faculty Research & Creative Works
