TY - JOUR
T1 - Using machine learning for concept extraction on clinical documents from multiple data sources
AU - Torii, Manabu
AU - Wagholikar, Kavishwar
AU - Liu, Hongfang
PY - 2011/9
Y1 - 2011/9
N2 - Objective: Concept extraction is a process to identify phrases referring to concepts of interests in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources. Methods: We used BioTagger-GM to train machine learning taggers, which we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources. Results: As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degradation of the performance varied depending on data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training. Conclusion: Our study shows how the performance of machine learning taggers is degraded when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
AB - Objective: Concept extraction is a process to identify phrases referring to concepts of interests in unstructured text. It is a critical component in automated text processing. We investigate the performance of machine learning taggers for clinical concept extraction, particularly the portability of taggers across documents from multiple data sources. Methods: We used BioTagger-GM to train machine learning taggers, which we originally developed for the detection of gene/protein names in the biology domain. Trained taggers were evaluated using the annotated clinical documents made available in the 2010 i2b2/VA Challenge workshop, consisting of documents from four data sources. Results: As expected, performance of a tagger trained on one data source degraded when evaluated on another source, but the degradation of the performance varied depending on data sources. A tagger trained on multiple data sources was robust, and it achieved an F score as high as 0.890 on one data source. The results also suggest that performance of machine learning taggers is likely to improve if more annotated documents are available for training. Conclusion: Our study shows how the performance of machine learning taggers is degraded when they are ported across clinical documents from different sources. The portability of taggers can be enhanced by training on datasets from multiple sources. The study also shows that BioTagger-GM can be easily extended to detect clinical concept mentions with good performance.
UR - http://www.scopus.com/inward/record.url?scp=80053268343&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053268343&partnerID=8YFLogxK
U2 - 10.1136/amiajnl-2011-000155
DO - 10.1136/amiajnl-2011-000155
M3 - Article
C2 - 21709161
AN - SCOPUS:80053268343
SN - 1067-5027
VL - 18
SP - 580
EP - 587
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 5
ER -