A Frequency-based Strategy of Obtaining Sentences from Clinical Data Repository for Crowdsourcing

Dingcheng Li, Majid Rastegar Mojarad, Yanpeng Li, Sunghwan Sohn, Saeed Mehrabi, Ravikumar Komandur Elayavilli, Yue Yu, Hongfang Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In clinical NLP, one major barrier to adopting crowdsourcing for annotation is the confidentiality of protected health information (PHI) in clinical narratives. In this paper, we investigate a frequency-based approach to extracting sentences that contain no PHI. Our approach rests on the assumption that sentences appearing frequently tend to contain no PHI. Manual and automatic evaluations of 500 sentences, drawn from the 7.9 million sentences with frequency higher than one, found no PHI among them. These promising results suggest that such sentences could be released to obtain sentence-level NLP annotations via crowdsourcing.
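The core idea in the abstract, keeping only sentences whose frequency is higher than one, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `normalize` and `frequent_sentences`, the lowercasing and whitespace normalization, and the toy corpus are all assumptions made for illustration, and the abstract does not say how sentences were split or compared.

```python
import re
from collections import Counter


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match.

    This normalization step is an assumption; the abstract does not specify one.
    """
    return re.sub(r"\s+", " ", sentence.strip().lower())


def frequent_sentences(sentences, min_count=2):
    """Return {sentence: count} for sentences seen at least `min_count` times.

    min_count=2 corresponds to "frequency higher than one" in the abstract.
    """
    counts = Counter(normalize(s) for s in sentences if s.strip())
    return {s: n for s, n in counts.items() if n >= min_count}


# Hypothetical toy corpus; the names and dates are invented, not real patient data.
corpus = [
    "Patient denies chest pain.",
    "No known drug allergies.",
    "patient denies  chest pain.",
    "Mr. Smith was seen on 3/4/2015.",
    "No known drug allergies.",
    "Patient denies chest pain.",
]

kept = frequent_sentences(corpus)
for sentence, count in kept.items():
    print(count, sentence)
```

In this toy corpus the one sentence carrying a name and a date occurs only once and is therefore dropped, which is the intuition behind the frequency assumption. Any sentences retained this way would still need the kind of manual and automatic PHI evaluation the abstract describes.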

Original language: English (US)
Title of host publication: MEDINFO 2015
Subtitle of host publication: eHealth-Enabled Health - Proceedings of the 15th World Congress on Health and Biomedical Informatics
Editors: Andrew Georgiou, Indra Neil Sarkar, Paulo Mazzoncini de Azevedo Marques
Publisher: IOS Press
Pages: 1033-1034
Number of pages: 2
ISBN (Electronic): 9781614995630
State: Published - 2015
Event: 15th World Congress on Health and Biomedical Informatics, MEDINFO 2015 - Sao Paulo, Brazil
Duration: Aug 19 2015 to Aug 23 2015

Publication series

Name: Studies in Health Technology and Informatics
Volume: 216
ISSN (Print): 0926-9630
ISSN (Electronic): 1879-8365

Other

Other: 15th World Congress on Health and Biomedical Informatics, MEDINFO 2015
Country: Brazil
City: Sao Paulo
Period: 8/19/15 to 8/23/15

Keywords

  • bigram filtering
  • clinical notes
  • crowdsourcing
  • de-identification
  • high-frequent sentences
  • patient health information

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management
