Using large clinical corpora for query expansion in text-based cohort identification

Dongqing Zhu; Stephen Wu; Ben Carterette; Hongfang Liu

doi:10.1016/j.jbi.2014.03.010

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

24 Scopus citations

Abstract

In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP = 0.386 and above) is shown to improve over the baseline query likelihood model (MAP = 0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP = 0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common-sense approach of "use all available data" is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.
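
The abstract reports its results as mean average precision (MAP). As a reading aid, here is a minimal Python sketch of how MAP is conventionally computed from ranked result lists and relevance judgments. It is not the authors' evaluation code. The function names, the identifiers, and the toy data are hypothetical, and the sketch assumes binary relevance and no duplicate ids within a ranking.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over every query in qrels."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical toy data: two queries, each with a ranked list of visit ids.
runs = {"q1": ["v3", "v1", "v7"], "q2": ["v2", "v9"]}
qrels = {"q1": {"v1", "v7"}, "q2": {"v2"}}
print(mean_average_precision(runs, qrels))
```

Averaging over queries rather than over documents gives every query equal weight, which is why a few difficult queries can noticeably move a MAP score.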

Original language	English (US)
Pages (from-to)	275-281
Number of pages	7
Journal	Journal of Biomedical Informatics
Volume	49
DOIs	https://doi.org/10.1016/j.jbi.2014.03.010
State	Published - Jun 2014

Keywords

Clinical text
Cohort identification
Electronic medical records
Information retrieval
Query expansion

ASJC Scopus subject areas

Health Informatics
Computer Science Applications

Access to Document

10.1016/j.jbi.2014.03.010

Cite this

@article{5951e8ae07d34a9098e4931fc92af88c,

title = "Using large clinical corpora for query expansion in text-based cohort identification",

abstract = "In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP = 0.386 and above) is shown to improve over the baseline query likelihood model (MAP = 0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP = 0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common-sense approach of {"}use all available data{"} is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.",

keywords = "Clinical text, Cohort identification, Electronic medical records, Information retrieval, Query expansion",

author = "Dongqing Zhu and Stephen Wu and Ben Carterette and Hongfang Liu",

note = "Funding Information: This work was supported in part by the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data Cooperative Agreement from the HHS Office of the National Coordinator, Washington, DC. DHHS 90TR000201.",

year = "2014",

month = jun,

doi = "10.1016/j.jbi.2014.03.010",

language = "English (US)",

volume = "49",

pages = "275--281",

journal = "Journal of Biomedical Informatics",

issn = "1532-0464",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Using large clinical corpora for query expansion in text-based cohort identification

AU - Zhu, Dongqing

AU - Wu, Stephen

AU - Carterette, Ben

AU - Liu, Hongfang

N1 - Funding Information: This work was supported in part by the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data Cooperative Agreement from the HHS Office of the National Coordinator, Washington, DC. DHHS 90TR000201.

PY - 2014/6

Y1 - 2014/6

AB - In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP = 0.386 and above) is shown to improve over the baseline query likelihood model (MAP = 0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP = 0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common-sense approach of "use all available data" is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.

KW - Clinical text

KW - Cohort identification

KW - Electronic medical records

KW - Information retrieval

KW - Query expansion

UR - http://www.scopus.com/inward/record.url?scp=84902549853&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84902549853&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2014.03.010

DO - 10.1016/j.jbi.2014.03.010

M3 - Article

C2 - 24680983

AN - SCOPUS:84902549853

SN - 1532-0464

VL - 49

SP - 275

EP - 281

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

ER -
