T-Plausibility: Generalizing Words to Desensitize Text

Abstract

De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete text redaction may not be necessary to prevent re-identification, and it can reduce the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after the removal of obvious identifiers. Observe that a diagnosis of "tuberculosis" is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be replaced by more general but semantically related terms to protect sensitive and identifying information, without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is to provide a novel information-theoretic approach to text sanitization and develop efficient heuristics to sanitize text documents.
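
The generalization idea in the abstract can be illustrated with a minimal sketch. This is not the paper's t-plausibility method or its heuristics; it only shows the basic step of replacing a sensitive term with a more general one. The `HYPERNYMS` table and the `generalize` and `sanitize` functions are hypothetical names introduced here, and the hierarchy is hand-built for the example rather than taken from any real ontology.

```python
import re

# Hypothetical, hand-built hierarchy: each term maps to a more general parent.
HYPERNYMS = {
    "tuberculosis": "infectious disease",
    "infectious disease": "disease",
}


def generalize(term, levels=1):
    """Climb up to `levels` steps in the hierarchy, stopping at the top."""
    for _ in range(levels):
        parent = HYPERNYMS.get(term.lower())
        if parent is None:
            break
        term = parent
    return term


def sanitize(text, sensitive_terms, levels=1):
    """Replace each sensitive term in `text` with a more general term."""
    for term in sensitive_terms:
        replacement = generalize(term, levels)
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash processing in the new text.
        text = pattern.sub(lambda match: replacement, text)
    return text


print(sanitize("The patient was diagnosed with tuberculosis.", ["tuberculosis"]))
```

Raising `levels` trades information for protection: each extra step yields a broader, less informative term. Deciding how far to generalize is the question the paper addresses, and this sketch does not attempt to answer it.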

Recommended Citation

B. Anandan et al., "T-Plausibility: Generalizing Words to Desensitize Text," Transactions on Data Privacy, vol. 5, no. 3, pp. 505-534, Association for Computing Machinery (ACM), Dec 2012.

Department(s)

Computer Science

Keywords and Phrases

Privacy; Text anonymization

International Standard Serial Number (ISSN)

2013-1631; 1888-5063

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 Dec 2012

Computer Science Faculty Research & Creative Works