A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository

Dingcheng Li; Majid Rastegar-Mojarad; Ravikumar Komandur Elayavilli; Yanshan Wang; Saeed Mehrabi; Yue Yu; Sunghwan Sohn; Yanpeng Li; Naveed Afzal; Hongfang Liu

doi:10.1145/2808719.2808752

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Clinical natural language processing (NLP) has become indispensable in the secondary use of electronic medical records (EMRs). However, current clinical NLP tools port poorly across institutions. An ideal solution to this problem is cross-institutional data sharing, but the legal prohibition on disclosing protected health information (PHI) obstructs this practice, even with state-of-the-art de-identification tools available. In this paper, we investigate a frequency-filtering approach to extracting PHI-free sentences from the Enterprise Data Trust (EDT), a large collection of EMRs at Mayo Clinic. Our approach rests on the assumption that sentences appearing frequently tend to contain no PHI, an assumption that originates from the observation that the EDT contains a large number of redundant descriptions of similar patient conditions. Both manual and automatic evaluations found no PHI in the set of sentences with frequencies higher than one. These promising results demonstrate the potential of sharing highly frequent sentences among institutions.
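The core idea in the abstract, keeping only sentences that occur more than once, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regex sentence splitter, the lowercase/whitespace normalization, the function names, and the synthetic notes are all assumptions made for illustration, and the abstract does not describe how the paper handles these steps.

```python
import re
from collections import Counter


def split_sentences(note):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def normalize(sentence):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(sentence.lower().split())


def frequent_sentences(notes, min_count=2):
    # Count every normalized sentence across all notes, then keep those seen
    # at least min_count times (min_count=2 means "frequency higher than one").
    counts = Counter(
        normalize(s) for note in notes for s in split_sentences(note)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


# Synthetic notes, invented for this example (no real patient data).
notes = [
    "Patient denies chest pain. Follow up with John Smith on 3/4.",
    "Patient denies chest pain. No acute distress.",
    "No acute distress. Seen at home by her daughter Ann.",
]

kept = frequent_sentences(notes)
for sentence, count in kept.items():
    print(count, sentence)
```

In this toy input, the two sentences that mention a name occur only once and are filtered out, while the repeated generic findings are kept. A frequency threshold alone is a heuristic, so retained sentences would still need the kind of manual and automatic PHI review the abstract describes.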

Original language	English (US)
Title of host publication	BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher	Association for Computing Machinery, Inc
Pages	315-324
Number of pages	10
ISBN (Electronic)	9781450338530
DOIs	https://doi.org/10.1145/2808719.2808752
State	Published - Sep 9 2015
Event	6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015 - Atlanta, United States
Duration	Sep 9 2015 → Sep 12 2015

Publication series

Name	BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Other

Other	6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015
Country/Territory	United States
City	Atlanta
Period	9/9/15 → 9/12/15

Keywords

Cross-institutional data-sharing
EMR
Frequency-filtering strategy
PHI-free
Protected health information
Sentence frequency
Word bigram

ASJC Scopus subject areas

Software
Health Informatics
Computer Science Applications
Biomedical Engineering

Access to Document

10.1145/2808719.2808752

Cite this

Li, D., Rastegar-Mojarad, M., Elayavilli, R. K., Wang, Y., Mehrabi, S., Yu, Y., Sohn, S., Li, Y., Afzal, N., & Liu, H. (2015). A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository. In BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 315-324). (BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics). Association for Computing Machinery, Inc. https://doi.org/10.1145/2808719.2808752

@inproceedings{70fc1b7eab204377b150a3d182a37b15,

title = "A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository",

keywords = "Cross-institutional data-sharing, EMR, Frequency-filtering strategy, PHI-free, Protected health information, Sentence frequency, Word bigram",

author = "Dingcheng Li and Majid Rastegar-Mojarad and Elayavilli, {Ravikumar Komandur} and Yanshan Wang and Saeed Mehrabi and Yue Yu and Sunghwan Sohn and Yanpeng Li and Naveed Afzal and Hongfang Liu",

year = "2015",

month = sep,

day = "9",

doi = "10.1145/2808719.2808752",

language = "English (US)",

series = "BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics",

publisher = "Association for Computing Machinery, Inc",

pages = "315--324",

booktitle = "BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics",

}

TY - GEN

T1 - A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository

AU - Li, Dingcheng

AU - Rastegar-Mojarad, Majid

AU - Elayavilli, Ravikumar Komandur

AU - Wang, Yanshan

AU - Mehrabi, Saeed

AU - Yu, Yue

AU - Sohn, Sunghwan

AU - Li, Yanpeng

AU - Afzal, Naveed

AU - Liu, Hongfang

PY - 2015/9/9

Y1 - 2015/9/9

KW - Cross-institutional data-sharing

KW - EMR

KW - Frequency-filtering strategy

KW - PHI-free

KW - Protected health information

KW - Sentence frequency

KW - Word bigram

UR - http://www.scopus.com/inward/record.url?scp=84963554018&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84963554018&partnerID=8YFLogxK

U2 - 10.1145/2808719.2808752

DO - 10.1145/2808719.2808752

M3 - Conference contribution

AN - SCOPUS:84963554018

T3 - BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

SP - 315

EP - 324

BT - BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

PB - Association for Computing Machinery, Inc

T2 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015

Y2 - 9 September 2015 through 12 September 2015

ER -
