TY - JOUR
T1 - Evaluation of patient-level retrieval from electronic health record data for a cohort discovery task
AU - Chamberlin, Steven R.
AU - Bedrick, Steven D.
AU - Cohen, Aaron M.
AU - Wang, Yanshan
AU - Wen, Andrew
AU - Liu, Sijia
AU - Liu, Hongfang
AU - Hersh, William R.
N1 - Funding Information:
Although many medical centers, especially those funded by the Clinical & Translational Science Award program, offer patient cohort discovery tools, this function has not been well studied. This research evaluates patient-level cohort retrieval over a large extract of complete EHR data for an academic medical center, along with 56 diverse information needs. Our results found that structured Boolean queries, searching over unstructured and structured data, outperformed word-based automated methods over the same data. Substantial work remains for defining the best methods for cohort discovery from EHR data, especially in the development of methods that allow automated techniques that do not require users to construct Boolean queries themselves.
Publisher Copyright:
© The Author(s) 2020.
PY - 2020
Y1 - 2020
N2 - Objective: Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well understood. The objective of this research was to assess patient-level information retrieval methods using electronic health records for different types of cohort definition retrieval. Materials and Methods: We developed a test collection consisting of about 100 000 patient records and 56 test topics that characterized patient cohort requests for various clinical studies. Automated information retrieval tasks using word-based approaches were performed, varying 4 different parameters for a total of 48 permutations, with performance measured using B-Pref. We subsequently created structured Boolean queries for the 56 topics for performance comparisons. In addition, we performed a more detailed analysis of 10 topics. Results: The best-performing word-based automated query parameter settings achieved a mean B-Pref of 0.167 across all 56 topics. The way a topic was structured (topic representation) had the largest impact on performance. Performance not only varied widely across topics, but there was also a large variance in sensitivity to parameter settings across the topics. Structured queries generally performed better than automated queries on measures of recall and precision but were still not able to recall all relevant patients found by the automated queries. Conclusion: While word-based automated methods of cohort retrieval offer an attractive solution to the labor-intensive nature of this task currently used at many medical centers, we generally found suboptimal performance in those approaches, with better performance obtained from structured Boolean queries. Future work will focus on using the test collection to develop and evaluate new approaches to query structure, weighting algorithms, and application of semantic methods.
KW - Electronic health record
KW - Information retrieval
KW - Patient cohort discovery
KW - Structured queries
UR - http://www.scopus.com/inward/record.url?scp=85101337530&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101337530&partnerID=8YFLogxK
U2 - 10.1093/JAMIAOPEN/OOAA026
DO - 10.1093/JAMIAOPEN/OOAA026
M3 - Article
AN - SCOPUS:85101337530
VL - 3
SP - 395
EP - 404
JO - JAMIA Open
JF - JAMIA Open
SN - 2574-2531
IS - 3
ER -