Evaluation of patient-level retrieval from electronic health record data for a cohort discovery task

Steven R. Chamberlin, Steven D. Bedrick, Aaron M. Cohen, Yanshan Wang, Andrew Wen, Sijia Liu, Hongfang Liu, William R. Hersh

Research output: Contribution to journal › Article › peer-review


Objective: Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well understood. The objective of this research was to assess patient-level information retrieval methods using electronic health records for different types of cohort definition retrieval.

Materials and Methods: We developed a test collection consisting of about 100 000 patient records and 56 test topics that characterized patient cohort requests for various clinical studies. Automated information retrieval tasks using word-based approaches were performed, varying 4 different parameters for a total of 48 permutations, with performance measured using B-Pref. We subsequently created structured Boolean queries for the 56 topics for performance comparisons. In addition, we performed a more detailed analysis of 10 topics.

Results: The best-performing word-based automated query parameter settings achieved a mean B-Pref of 0.167 across all 56 topics. The way a topic was structured (topic representation) had the largest impact on performance. Performance not only varied widely across topics, but there was also a large variance in sensitivity to parameter settings across the topics. Structured queries generally performed better than automated queries on measures of recall and precision but were still not able to recall all relevant patients found by the automated queries.

Conclusion: While word-based automated methods of cohort retrieval offer an attractive solution to the labor-intensive nature of this task currently used at many medical centers, we generally found suboptimal performance in those approaches, with better performance obtained from structured Boolean queries. Future work will focus on using the test collection to develop and evaluate new approaches to query structure, weighting algorithms, and application of semantic methods.
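For readers unfamiliar with the B-Pref measure named in the abstract, the following is a minimal sketch of how it can be computed for a single topic. It assumes the commonly used variant in which each retrieved relevant patient is penalized by the number of judged non-relevant patients ranked above it, and unjudged patients are ignored. The abstract does not state which variant the authors used. The function name `bpref` and the patient identifiers are hypothetical.

```python
def bpref(ranking, relevant, nonrelevant):
    """Compute B-Pref for one topic (illustrative variant, an assumption).

    ranking     -- list of patient IDs in ranked order, best first
    relevant    -- set of patient IDs judged relevant
    nonrelevant -- set of patient IDs judged non-relevant
    Patients in neither set are unjudged and are skipped.
    """
    num_rel = len(relevant)
    if num_rel == 0:
        return 0.0
    # Normalizer: the smaller of the two judged pools.
    denom = min(num_rel, len(nonrelevant))
    nonrel_seen = 0  # judged non-relevant patients encountered so far
    total = 0.0
    for pid in ranking:
        if pid in relevant:
            if denom == 0:
                # No judged non-relevant patients: no penalty is possible.
                total += 1.0
            else:
                # Penalize by the non-relevant patients ranked above,
                # capped at the number of relevant patients.
                total += 1.0 - min(nonrel_seen, num_rel) / denom
        elif pid in nonrelevant:
            nonrel_seen += 1
    # Relevant patients that were never retrieved contribute zero.
    return total / num_rel


# Hypothetical judgments for a single topic; "p9" is unjudged.
score = bpref(
    ranking=["p1", "p9", "p2", "p3", "p4"],
    relevant={"p1", "p3", "p5"},
    nonrelevant={"p2", "p4", "p6"},
)
```

A mean B-Pref such as the one reported in the abstract would then be the average of this per-topic score over all topics.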

Original language: English (US)
Pages (from-to): 395-404
Number of pages: 10
Journal: JAMIA Open
Issue number: 3
State: Published - 2020


  • Electronic health record
  • Information retrieval
  • Patient cohort discovery
  • Structured queries

ASJC Scopus subject areas

  • Health Informatics

