Using machine learning for concept extraction on clinical documents from multiple data sources

Manabu Torii; Kavishwar Wagholikar; Hongfang Liu

doi:10.1136/amiajnl-2011-000155

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

84 Scopus citations

Abstract

Objective: Concept extraction is a process to identify phrases referring to concepts of interest in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources.

Methods: We trained machine learning taggers with BioTagger-GM, a system we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources.

Results: As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degree of degradation varied depending on the data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training.

Conclusion: Our study shows how the performance of machine learning taggers degrades when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
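The F score reported in the abstract combines precision and recall over extracted concept mentions. The following is a minimal sketch of that calculation, assuming exact-match scoring in which a predicted mention counts only if its span and concept type both match a gold annotation. The function name, the tuple layout, the concept type labels, and the toy data are illustrative assumptions; they are not taken from the paper or from the i2b2/VA evaluation scripts.

```python
from typing import Set, Tuple

# Hypothetical mention representation:
# (document id, start token, end token, concept type).
Mention = Tuple[str, int, int, str]


def f_score(gold: Set[Mention], predicted: Set[Mention]) -> float:
    """Balanced F score (F1) under exact span-and-type matching (an assumption)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
gold = {
    ("doc1", 3, 4, "problem"),
    ("doc1", 10, 11, "treatment"),
    ("doc2", 0, 1, "test"),
    ("doc2", 7, 9, "problem"),
}
predicted = {
    ("doc1", 3, 4, "problem"),
    ("doc1", 10, 11, "treatment"),
    ("doc2", 0, 1, "test"),
    ("doc2", 7, 8, "problem"),  # boundary error: not an exact match
    ("doc2", 12, 12, "test"),   # spurious mention
}

score = f_score(gold, predicted)
print(f"F score: {score:.3f}")
```

To study portability in the sense described above, the same function could be applied to each pairing of training source and evaluation source, giving a small matrix of scores to compare.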

Original language	English (US)
Pages (from-to)	580-587
Number of pages	8
Journal	Journal of the American Medical Informatics Association
Volume	18
Issue number	5
DOIs	https://doi.org/10.1136/amiajnl-2011-000155
State	Published - Sep 2011

ASJC Scopus subject areas

Health Informatics

Cite this

@article{e9049c0efa8f49249d368531831cff84,
  title = "Using machine learning for concept extraction on clinical documents from multiple data sources",
  abstract = "Objective: Concept extraction is a process to identify phrases referring to concepts of interest in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources. Methods: We trained machine learning taggers with BioTagger-GM, a system we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources. Results: As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degree of degradation varied depending on the data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training. Conclusion: Our study shows how the performance of machine learning taggers degrades when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.",
  author = "Manabu Torii and Kavishwar Wagholikar and Hongfang Liu",
  year = "2011",
  month = sep,
  doi = "10.1136/amiajnl-2011-000155",
  language = "English (US)",
  volume = "18",
  pages = "580--587",
  journal = "Journal of the American Medical Informatics Association",
  issn = "1067-5027",
  publisher = "Oxford University Press",
  number = "5",
}

TY  - JOUR
T1  - Using machine learning for concept extraction on clinical documents from multiple data sources
AU  - Torii, Manabu
AU  - Wagholikar, Kavishwar
AU  - Liu, Hongfang
PY  - 2011/9
Y1  - 2011/9
AB  - Objective: Concept extraction is a process to identify phrases referring to concepts of interest in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources. Methods: We trained machine learning taggers with BioTagger-GM, a system we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources. Results: As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degree of degradation varied depending on the data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training. Conclusion: Our study shows how the performance of machine learning taggers degrades when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
UR  - http://www.scopus.com/inward/record.url?scp=80053268343&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=80053268343&partnerID=8YFLogxK
U2  - 10.1136/amiajnl-2011-000155
DO  - 10.1136/amiajnl-2011-000155
M3  - Article
C2  - 21709161
AN  - SCOPUS:80053268343
SN  - 1067-5027
VL  - 18
SP  - 580
EP  - 587
JO  - Journal of the American Medical Informatics Association
JF  - Journal of the American Medical Informatics Association
IS  - 5
ER  - 
