A comparison of word embeddings for the biomedical natural language processing

Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, Hongfang D Liu

Research output: Contribution to journalArticle

25 Citations (Scopus)

Abstract

Background: Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications due to the ability of the vector representations being able to capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpus) have been utilized in biomedical NLP to train word embeddings and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. Methods: In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings’ ability to capture medical semantics by measruing the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. Results: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts’ judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. Conclusion: Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts’ judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.

Original languageEnglish (US)
Pages (from-to)12-20
Number of pages9
JournalJournal of Biomedical Informatics
Volume87
DOIs
StatePublished - Nov 1 2018

Fingerprint

Natural Language Processing
Semantics
Health
Electronic Health Records
Processing
Information Storage and Retrieval
Information retrieval
Linguistics
Learning systems
Visualization
Aptitude
PubMed
Publications

Keywords

  • Information extraction
  • Information retrieval
  • Machine learning
  • Natural language processing
  • Word embeddings

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

A comparison of word embeddings for the biomedical natural language processing. / Wang, Yanshan; Liu, Sijia; Afzal, Naveed; Rastegar-Mojarad, Majid; Wang, Liwei; Shen, Feichen; Kingsbury, Paul; Liu, Hongfang D.

In: Journal of Biomedical Informatics, Vol. 87, 01.11.2018, p. 12-20.

Research output: Contribution to journalArticle

Wang, Y, Liu, S, Afzal, N, Rastegar-Mojarad, M, Wang, L, Shen, F, Kingsbury, P & Liu, HD 2018, 'A comparison of word embeddings for the biomedical natural language processing', Journal of Biomedical Informatics, vol. 87, pp. 12-20. https://doi.org/10.1016/j.jbi.2018.09.008
Wang, Yanshan ; Liu, Sijia ; Afzal, Naveed ; Rastegar-Mojarad, Majid ; Wang, Liwei ; Shen, Feichen ; Kingsbury, Paul ; Liu, Hongfang D. / A comparison of word embeddings for the biomedical natural language processing. In: Journal of Biomedical Informatics. 2018 ; Vol. 87. pp. 12-20.
@article{84a151364b7849629519aff065018f62,
title = "A comparison of word embeddings for the biomedical natural language processing",
abstract = "Background: Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications due to the ability of the vector representations being able to capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpus) have been utilized in biomedical NLP to train word embeddings and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. Methods: In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings’ ability to capture medical semantics by measruing the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. Results: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts’ judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. Conclusion: Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts’ judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.",
keywords = "Information extraction, Information retrieval, Machine learning, Natural language processing, Word embeddings",
author = "Yanshan Wang and Sijia Liu and Naveed Afzal and Majid Rastegar-Mojarad and Liwei Wang and Feichen Shen and Paul Kingsbury and Liu, {Hongfang D}",
year = "2018",
month = "11",
day = "1",
doi = "10.1016/j.jbi.2018.09.008",
language = "English (US)",
volume = "87",
pages = "12--20",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - A comparison of word embeddings for the biomedical natural language processing

AU - Wang, Yanshan

AU - Liu, Sijia

AU - Afzal, Naveed

AU - Rastegar-Mojarad, Majid

AU - Wang, Liwei

AU - Shen, Feichen

AU - Kingsbury, Paul

AU - Liu, Hongfang D

PY - 2018/11/1

Y1 - 2018/11/1

N2 - Background: Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications due to the ability of the vector representations being able to capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpus) have been utilized in biomedical NLP to train word embeddings and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. Methods: In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings’ ability to capture medical semantics by measruing the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. Results: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts’ judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. Conclusion: Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts’ judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.

AB - Background: Word embeddings have been prevalently used in biomedical Natural Language Processing (NLP) applications due to the ability of the vector representations being able to capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpus) have been utilized in biomedical NLP to train word embeddings and these word embeddings have been commonly leveraged as feature input to downstream machine learning models. However, there has been little work on evaluating the word embeddings trained from different textual resources. Methods: In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. For the intrinsic evaluation, we evaluated the word embeddings’ ability to capture medical semantics by measruing the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks. Results: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts’ judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task. Conclusion: Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit can capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts’ judgments than those trained from GloVe and Google News. Second, there does not exist a consistent global ranking of word embeddings for all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from the biomedical domain corpora do not necessarily have better performance than those trained from the general domain corpora for any downstream biomedical NLP task.

KW - Information extraction

KW - Information retrieval

KW - Machine learning

KW - Natural language processing

KW - Word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85054036088&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054036088&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2018.09.008

DO - 10.1016/j.jbi.2018.09.008

M3 - Article

VL - 87

SP - 12

EP - 20

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

ER -