A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository

Dingcheng Li, Majid Rastegar-Mojarad, Ravikumar Komandur Elayavilli, Yanshan Wang, Saeed Mehrabi, Yue Yu, Sunghwan Sohn, Yanpeng Li, Naveed Afzal, Hongfang D Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Clinical natural language processing (NLP) has become indispensable in the secondary use of electronic medical records (EMRs). However, current clinical NLP tools port poorly across institutions. An ideal solution to this problem is cross-institutional data sharing, but the legal prohibition on disclosing protected health information (PHI) obstructs this practice, even with state-of-the-art de-identification tools available. In this paper, we investigate a frequency-filtering approach to extracting PHI-free sentences from the Enterprise Data Trust (EDT), a large collection of EMRs at Mayo Clinic. The approach rests on the assumption that sentences appearing frequently tend to contain no PHI, an assumption that originates from the observation that EDT contains a large number of redundant descriptions of similar patient conditions. Both manual and automatic evaluations of the set of sentences with frequency higher than one found no PHI. These promising results demonstrate the potential of sharing highly frequent sentences among institutions.
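
The frequency-filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `normalize` and `frequent_sentences`, the lowercasing and whitespace normalization, and the toy sentences are all assumptions made for illustration. Only the threshold (keep sentences whose frequency is higher than one) comes from the abstract.

```python
from collections import Counter
import re


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def frequent_sentences(sentences, min_count=2):
    """Return {normalized sentence: count} for sentences seen at least min_count times.

    min_count=2 corresponds to "frequency higher than one" in the abstract.
    """
    counts = Counter(normalize(s) for s in sentences if s.strip())
    return {s: n for s, n in counts.items() if n >= min_count}


# Made-up toy sentences, not real clinical data.
notes = [
    "No acute distress.",
    "no acute distress.",
    "Patient John Doe was seen on 3/4/2014.",
    "Lungs clear to auscultation bilaterally.",
    "Lungs clear to  auscultation bilaterally.",
]

# The sentence that occurs only once (and carries a name and a date) is dropped.
kept = frequent_sentences(notes)
```

Here `min_count` is exposed as a parameter so that a stricter threshold can be tried; which threshold is adequate for a given repository is a question the sketch does not answer.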

Original language: English (US)
Title of host publication: BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
Publisher: Association for Computing Machinery, Inc
Pages: 315-324
Number of pages: 10
ISBN (Print): 9781450338530
DOIs: https://doi.org/10.1145/2808719.2808752
State: Published - Sep 9 2015
Event: 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015 - Atlanta, United States
Duration: Sep 9 2015 to Sep 12 2015

Other

Other: 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015
Country: United States
City: Atlanta
Period: 9/9/15 to 9/12/15

Fingerprint

Health
Natural Language Processing
Electronic medical equipment
Electronic Health Records
Information Dissemination
Processing
Industry
Availability

Keywords

  • Cross-institutional data-sharing
  • EMR
  • Frequency-filtering strategy
  • PHI-free
  • Protected health information
  • Sentence frequency
  • Word bigram

ASJC Scopus subject areas

  • Software
  • Health Informatics
  • Computer Science Applications
  • Biomedical Engineering

Cite this

Li, D., Rastegar-Mojarad, M., Elayavilli, R. K., Wang, Y., Mehrabi, S., Yu, Y., ... Liu, H. D. (2015). A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository. In BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 315-324). Association for Computing Machinery, Inc. https://doi.org/10.1145/2808719.2808752
