Open Health Natural Language Processing Collaboratory

Project: Research project

Project Details


Project Summary One of the major barriers in leveraging Electronic Health Record (EHR) data for clinical and translational science is the prevalent use of unstructured or semi-structured clinical narratives for documenting clinical information. Natural Language Processing (NLP), which extracts structured information from narratives, has received great attention and has played a critical role in enabling secondary use of EHRs for clinical and translational research. As demonstrated by large scale efforts such as ACT (Accrual of patients for Clinical Trials), eMERGE, and PCORnet, using EHR data for research rests on the capabilities of a robust data and informatics infrastructure that allows the structuring of clinical narratives and supports the extraction of clinical information for downstream applications. Current successful NLP use cases often require a strong informatics team (with NLP experts) to work with clinicians to supply their domain knowledge and build customized NLP engines iteratively. This requires close collaboration between NLP experts and clinicians, not feasible at institutions with limited informatics support. Additionally, the usability, portability, and generalizability of the NLP systems are still limited, partially due to the lack of access to EHRs across institutions to train the systems. The limited availability of EHR data limits the training available to improve the workforce competence in clinical NLP. We aim to address the above challenges by extending our existing collaboration among multiple CTSA hubs on open health natural language processing (OHNLP) to share distributional information of NLP artifacts (i.e., words, n-grams, phrases, sentences, concept mentions, concepts, and text segments) acquired from real EHRs across multiple institutions. We will leverage the advanced privacy-preserving computing infrastructure of iDASH (integrating Data for Analysis, Anonymization, and SHaring) for privacy- preserving data analysis models and will partner with diverse communities including Observational Health Data Sciences and Informatics (OHDSI), Precision Medicine Initiative (PMI), PCORnet, and Rare Diseases Clinical Research Network (RDCRN) to demonstrate the utility of NLP for translational research. This CTSA innovation award RFA provides us with a unique opportunity to address the challenges faced with clinical NLP and through strong partnership with multiple research communities and leadership roles of the research team in clinical NLP, we envision that the successful delivery of this project will broaden the utilization of clinical NLP across the research community. There are four aims planned: i) obtain PHI-suppressed NLP artifacts with retained distribution information across multiple institutions and assess the privacy risk of accessing PHI- suppressed artifacts, ii) generate a synthetic text corpus for exploratory analysis of clinical narratives and assess its utility in NLP tasks leveraging various NLP challenges, iii) develop privacy-preserving computational phenotyping models empowered with NLP, and iv) partner with diverse communities to demonstrate the utility of our project for translational research.
StatusNot started