A Frequency-based Strategy of Obtaining Sentences from Clinical Data Repository for Crowdsourcing

Dingcheng Li, Majid Rastegar Mojarad, Yanpeng Li, Sunghwan Sohn, Saeed Mehrabi, Ravikumar Komandur Elayavilli, Yue Yu, Hongfang D Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In clinical NLP, one major barrier to adopting crowdsourcing for annotation is the confidentiality of protected health information (PHI) in clinical narratives. In this paper, we investigated a frequency-based approach to extracting sentences without PHI. Our approach rests on the assumption that sentences appearing frequently tend to contain no PHI. Both manual and automatic evaluations of 500 of the 7.9 million sentences with a frequency higher than one found no PHI among them. These promising results suggest that such sentences could be released to obtain sentence-level NLP annotations via crowdsourcing.
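The frequency assumption in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the lowercase/whitespace normalization, and the toy corpus (including the made-up name and date) are assumptions for illustration only. The threshold `min_count=2` mirrors the abstract's "frequencies higher than one".

```python
from collections import Counter
import re


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match.

    This normalization is an assumption; the abstract does not describe one.
    """
    return re.sub(r"\s+", " ", sentence.strip().lower())


def frequent_sentences(sentences, min_count=2):
    """Return {normalized sentence: count} for sentences seen at least min_count times."""
    counts = Counter(normalize(s) for s in sentences if s.strip())
    return {s: c for s, c in counts.items() if c >= min_count}


# Hypothetical toy corpus; the sentence with a name and date occurs only once.
corpus = [
    "No acute distress.",
    "Patient denies chest pain.",
    "no acute  distress.",
    "Mr. Smith was seen on 3/4/2014.",
    "Patient denies chest pain.",
    "Follow up in two weeks.",
]

kept = frequent_sentences(corpus)
for sentence, count in kept.items():
    print(count, sentence)
```

In this toy input, the sentence carrying a name and a date is dropped because it appears once, while the two generic sentences are retained. Frequency is only a heuristic here, which is why the abstract pairs it with manual and automatic evaluation.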

Original language: English (US)
Title of host publication: Studies in Health Technology and Informatics
Publisher: IOS Press
Pages: 1033-1034
Number of pages: 2
Volume: 216
ISBN (Print): 9781614995630
DOIs: https://doi.org/10.3233/978-1-61499-564-7-1033
State: Published - 2015
Event: 15th World Congress on Health and Biomedical Informatics, MEDINFO 2015 - Sao Paulo, Brazil
Duration: Aug 19 2015 to Aug 23 2015

Publication series

Name: Studies in Health Technology and Informatics
Volume: 216
ISSN (Print): 0926-9630
ISSN (Electronic): 1879-8365

Other

Other: 15th World Congress on Health and Biomedical Informatics, MEDINFO 2015
Country: Brazil
City: Sao Paulo
Period: 8/19/15 to 8/23/15

Keywords

  • bigram filtering
  • clinical notes
  • crowdsourcing
  • de-identification
  • high-frequent sentences
  • patient health information

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management

Cite this

Li, D., Rastegar Mojarad, M., Li, Y., Sohn, S., Mehrabi, S., Komandur Elayavilli, R., ... Liu, H. D. (2015). A Frequency-based Strategy of Obtaining Sentences from Clinical Data Repository for Crowdsourcing. In Studies in Health Technology and Informatics (Vol. 216, pp. 1033-1034). (Studies in Health Technology and Informatics; Vol. 216). IOS Press. https://doi.org/10.3233/978-1-61499-564-7-1033
