Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts

Yanshan Wang; Majid Rastegar-Mojarad; Ravikumar Komandur-Elayavilli; Hongfang Liu

doi:10.1093/database/bax091

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.
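As an illustration of the query-expansion step mentioned in the abstract, the following is a minimal sketch of expanding a query with nearest neighbours in an embedding space. It is not the authors' implementation: the toy vectors in `TOY_EMBEDDINGS`, the function name `expand_query`, and the `top_k` and `min_sim` parameters are all hypothetical, and a real system would presumably load embeddings trained on a large biomedical corpus.

```python
import math

# Hypothetical 3-dimensional toy embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.1],
    "gene": [0.1, 0.9, 0.0],
    "protein": [0.15, 0.8, 0.2],
    "mouse": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_query(terms, embeddings, top_k=2, min_sim=0.95):
    """Append to each known query term its most similar vocabulary words.

    A candidate is added only if it is among the term's top_k neighbours
    and its cosine similarity is at least min_sim (an assumed threshold).
    Terms missing from the vocabulary are kept unchanged.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        # Score every other vocabulary word against the current term.
        scored = [
            (cosine(embeddings[term], vec), word)
            for word, vec in embeddings.items()
            if word != term
        ]
        scored.sort(reverse=True)
        for sim, word in scored[:top_k]:
            if sim >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


if __name__ == "__main__":
    print(expand_query(["cancer", "gene"], TOY_EMBEDDINGS))
```

The similarity threshold keeps unrelated neighbours out of the expanded query when a term has few close matches; the expanded term list would then be passed to whatever retrieval model is in use.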

Original language	English (US)
Journal	Database : the journal of biological databases and curation
Volume	2017
DOIs	https://doi.org/10.1093/database/bax091
State	Published - Jan 1 2017

ASJC Scopus subject areas

Information Systems
General Biochemistry, Genetics and Molecular Biology
General Agricultural and Biological Sciences

Cite this

@article{6b569bc964ba41318bee397c857415ce,

title = "Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts",

abstract = "The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.",

author = "Yanshan Wang and Majid Rastegar-Mojarad and Ravikumar Komandur-Elayavilli and Hongfang Liu",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2017. Published by Oxford University Press.",

year = "2017",

month = jan,

day = "1",

doi = "10.1093/database/bax091",

language = "English (US)",

volume = "2017",

journal = "Database : the journal of biological databases and curation",

issn = "1758-0463",

publisher = "Oxford University Press",

}

TY - JOUR

T1 - Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts

AU - Wang, Yanshan

AU - Rastegar-Mojarad, Majid

AU - Komandur-Elayavilli, Ravikumar

AU - Liu, Hongfang

PY - 2017/1/1

Y1 - 2017/1/1

N2 - The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.

AB - The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.

UR - http://www.scopus.com/inward/record.url?scp=85055929975&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055929975&partnerID=8YFLogxK

U2 - 10.1093/database/bax091

DO - 10.1093/database/bax091

M3 - Article

C2 - 31725862

AN - SCOPUS:85055929975

SN - 1758-0463

VL - 2017

JO - Database : the journal of biological databases and curation

JF - Database : the journal of biological databases and curation

ER -
