Semantic Classification of Biomedical Concepts Using Distributional Similarity

Jung Wei Fan, Carol Friedman

Research output: Contribution to journal (Article)

31 Citations (Scopus)

Abstract

Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications.

Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure.

Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed.

Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively.

Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.
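The abstract names α-skew divergence as the similarity measure but does not give its form or parameters. As a minimal sketch, the block below uses the commonly cited definition s_α(q, r) = D(r ‖ αq + (1 − α)r), where D is the Kullback-Leibler divergence. The value α = 0.99, the assignment of concept and class to r and q, the function names, and the toy feature distributions are all assumptions made for illustration; none of them is taken from the paper, and this is not the authors' implementation.

```python
import math


def alpha_skew_divergence(q, r, alpha=0.99):
    """Return KL(r || alpha*q + (1-alpha)*r) for sparse distributions.

    q and r map feature names to probabilities. Mixing a little of r
    into q keeps the divergence finite when q lacks a feature of r.
    """
    total = 0.0
    for feature, r_p in r.items():
        if r_p == 0.0:
            continue  # a zero-probability term contributes nothing
        mix = alpha * q.get(feature, 0.0) + (1.0 - alpha) * r_p
        total += r_p * math.log(r_p / mix)
    return total


def rank_classes(concept, class_profiles, alpha=0.99):
    """Order class names from most to least similar (lowest divergence first)."""
    return sorted(
        class_profiles,
        key=lambda name: alpha_skew_divergence(class_profiles[name], concept, alpha),
    )


# Hypothetical syntactic-context distributions, invented for this example.
class_profiles = {
    "disorder": {"object_of:treat": 0.7, "subject_of:cause": 0.3},
    "drug": {"subject_of:treat": 0.8, "modified_by:dose": 0.2},
}
concept = {"object_of:treat": 0.6, "subject_of:cause": 0.4}

ranking = rank_classes(concept, class_profiles)
print(ranking)
```

Taking the first entry of the ranking corresponds to a top-prediction evaluation and taking the first two to a top-2 evaluation, which is how the two error rates in the Results are distinguished.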

Original language: English (US)
Pages (from-to): 467-477
Number of pages: 11
Journal: Journal of the American Medical Informatics Association
Volume: 14
Issue number: 4
DOI: 10.1197/jamia.M2314
State: Published - Jul 1 2007
Externally published: Yes

Fingerprint

Unified Medical Language System
Semantics
Natural Language Processing
National Library of Medicine (U.S.)

ASJC Scopus subject areas

  • Health Informatics

Cite this

Semantic Classification of Biomedical Concepts Using Distributional Similarity. / Fan, Jung Wei; Friedman, Carol.

In: Journal of the American Medical Informatics Association, Vol. 14, No. 4, 01.07.2007, p. 467-477.


@article{db84e340bf4a460893a4bf479f9957c0,
title = "Semantic Classification of Biomedical Concepts Using Distributional Similarity",
abstract = "Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications. Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure. Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed. Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively. Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.",
author = "Fan, Jung Wei and Friedman, Carol",
year = "2007",
month = "7",
day = "1",
doi = "10.1197/jamia.M2314",
language = "English (US)",
volume = "14",
pages = "467--477",
journal = "Journal of the American Medical Informatics Association : JAMIA",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "4"
}

TY - JOUR

T1 - Semantic Classification of Biomedical Concepts Using Distributional Similarity

AU - Fan, Jung Wei

AU - Friedman, Carol

PY - 2007/7/1

Y1 - 2007/7/1

N2 - Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications. Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure. Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed. Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively. Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.

AB - Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications. Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure. Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed. Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively. Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.

UR - http://www.scopus.com/inward/record.url?scp=34250744835&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34250744835&partnerID=8YFLogxK

U2 - 10.1197/jamia.M2314

DO - 10.1197/jamia.M2314

M3 - Article

VL - 14

SP - 467

EP - 477

JO - Journal of the American Medical Informatics Association : JAMIA

JF - Journal of the American Medical Informatics Association : JAMIA

SN - 1067-5027

IS - 4

ER -