TY - GEN
T1 - A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository
AU - Li, Dingcheng
AU - Rastegar-Mojarad, Majid
AU - Elayavilli, Ravikumar Komandur
AU - Wang, Yanshan
AU - Mehrabi, Saeed
AU - Yu, Yue
AU - Sohn, Sunghwan
AU - Li, Yanpeng
AU - Afzal, Naveed
AU - Liu, Hongfang
N1 - Publisher Copyright: Copyright 2015 ACM.
PY - 2015/9/9
Y1 - 2015/9/9
N2 - Clinical natural language processing (NLP) has become indispensable in the secondary use of electronic medical records (EMRs). However, current clinical NLP tools port poorly across institutions. An ideal solution to this problem is cross-institutional data sharing. However, legal prohibitions on disclosing protected health information (PHI) obstruct this practice, even with state-of-the-art de-identification tools available. In this paper, we investigate a frequency-filtering approach to extracting PHI-free sentences from the Enterprise Data Trust (EDT), a large collection of EMRs at Mayo Clinic. Our approach is based on the assumption that sentences appearing frequently tend to contain no PHI. This assumption originates from the observation that EDT contains a large number of redundant descriptions of similar patient conditions. Both manual and automatic evaluations of the set of sentences with frequency higher than one found no PHI. These promising results demonstrate the potential of sharing highly frequent sentences among institutions.
AB - Clinical natural language processing (NLP) has become indispensable in the secondary use of electronic medical records (EMRs). However, current clinical NLP tools port poorly across institutions. An ideal solution to this problem is cross-institutional data sharing. However, legal prohibitions on disclosing protected health information (PHI) obstruct this practice, even with state-of-the-art de-identification tools available. In this paper, we investigate a frequency-filtering approach to extracting PHI-free sentences from the Enterprise Data Trust (EDT), a large collection of EMRs at Mayo Clinic. Our approach is based on the assumption that sentences appearing frequently tend to contain no PHI. This assumption originates from the observation that EDT contains a large number of redundant descriptions of similar patient conditions. Both manual and automatic evaluations of the set of sentences with frequency higher than one found no PHI. These promising results demonstrate the potential of sharing highly frequent sentences among institutions.
KW - Cross-institutional data-sharing
KW - EMR
KW - Frequency-filtering strategy
KW - PHI-free
KW - Protected health information
KW - Sentence frequency
KW - Word bigram
UR - http://www.scopus.com/inward/record.url?scp=84963554018&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84963554018&partnerID=8YFLogxK
U2 - 10.1145/2808719.2808752
DO - 10.1145/2808719.2808752
M3 - Conference contribution
AN - SCOPUS:84963554018
T3 - BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
SP - 315
EP - 324
BT - BCB 2015 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
T2 - 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2015
Y2 - 9 September 2015 through 12 September 2015
ER -