Building a best-in-class automated de-identification tool for electronic health records through ensemble learning

Karthik Murugadoss; Ajit Rajasekharan; Bradley Malin; Vineet Agarwal; Sairam Bade; Jeff R. Anderson; Jason L. Ross; William A. Faubion; John D. Halamka; Venky Soundararajan; Sankar Ardhanari

doi:10.1016/j.patter.2021.100255

Research output: Contribution to journal › Article › peer-review

Abstract

The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.
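
The abstract reports recall and precision separately for each dataset. As an illustrative aid only, the sketch below combines each pair into an F1 score (the harmonic mean of precision and recall). The function name `f1_score` and the dictionary layout are assumptions made for this example. The derived F1 values are simple arithmetic on the figures quoted above and are not numbers stated in the abstract.

```python
# Illustrative only: derive F1 from the precision/recall pairs quoted in the abstract.
# The helper name and data layout are hypothetical, not taken from the paper.

def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the abstract
reported = {
    "i2b2 2014": (0.979, 0.992),
    "Mayo Clinic (10,000 notes)": (0.967, 0.994),
}

for dataset, (precision, recall) in reported.items():
    print(f"{dataset}: F1 = {f1_score(precision, recall):.3f}")
# i2b2 2014: F1 = 0.985
# Mayo Clinic (10,000 notes): F1 = 0.980
```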

Original language	English (US)
Article number	100255
Journal	Patterns
Volume	2
Issue number	6
DOIs	https://doi.org/10.1016/j.patter.2021.100255
State	Published - Jun 11 2021

Keywords

DSML 4: Production: Data science output is validated, understood, and regularly used for multiple domains/platforms
anonymization
de-identification
ensemble
mayo
nference
obfuscation

ASJC Scopus subject areas

General Decision Sciences

Cite this

@article{c4c208e655ec496e8ac575a517127d6c,

title = "Building a best-in-class automated de-identification tool for electronic health records through ensemble learning",

abstract = "The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.",

keywords = "DSML 4: Production: Data science output is validated, understood, and regularly used for multiple domains/platforms, anonymization, de-identification, ensemble, mayo, nference, obfuscation",

author = "Karthik Murugadoss and Ajit Rajasekharan and Bradley Malin and Vineet Agarwal and Sairam Bade and Anderson, {Jeff R.} and Ross, {Jason L.} and Faubion, {William A.} and Halamka, {John D.} and Venky Soundararajan and Sankar Ardhanari",

note = "Publisher Copyright: {\textcopyright} 2021 The Authors",

year = "2021",

month = jun,

day = "11",

doi = "10.1016/j.patter.2021.100255",

language = "English (US)",

volume = "2",

journal = "Patterns",

issn = "2666-3899",

publisher = "Cell Press",

number = "6",

}

TY - JOUR

T1 - Building a best-in-class automated de-identification tool for electronic health records through ensemble learning

AU - Murugadoss, Karthik

AU - Rajasekharan, Ajit

AU - Malin, Bradley

AU - Agarwal, Vineet

AU - Bade, Sairam

AU - Anderson, Jeff R.

AU - Ross, Jason L.

AU - Faubion, William A.

AU - Halamka, John D.

AU - Soundararajan, Venky

AU - Ardhanari, Sankar

PY - 2021/6/11

Y1 - 2021/6/11

N2 - The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.

AB - The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.

KW - DSML 4: Production: Data science output is validated, understood, and regularly used for multiple domains/platforms

KW - anonymization

KW - de-identification

KW - ensemble

KW - mayo

KW - nference

KW - obfuscation

UR - http://www.scopus.com/inward/record.url?scp=85107794852&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85107794852&partnerID=8YFLogxK

U2 - 10.1016/j.patter.2021.100255

DO - 10.1016/j.patter.2021.100255

M3 - Article

AN - SCOPUS:85107794852

SN - 2666-3899

VL - 2

JO - Patterns

JF - Patterns

IS - 6

M1 - 100255

ER -
