Semantic Classification of Biomedical Concepts Using Distributional Similarity

Jung Wei Fan; Carol Friedman

doi:10.1197/jamia.M2314

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

32 Scopus citations

Abstract

Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications. Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure. Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed. Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively. Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.
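The abstract names α-skew divergence as the similarity measure and reports error rates for the top prediction and the top 2 predictions. A minimal Python sketch of both ideas follows. It assumes the commonly cited definition of skew divergence, KL(r || αq + (1 − α)r). The value α = 0.99, the toy feature distributions, the direction of comparison (class as q, concept as r), and the function names `skew_divergence`, `rank_classes`, and `top_k_error` are illustrative assumptions, not details taken from the article.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p[x] == 0 contribute nothing."""
    total = 0.0
    for x, px in p.items():
        if px > 0.0:
            total += px * math.log(px / q[x])
    return total


def skew_divergence(q, r, alpha=0.99):
    """Skew divergence KL(r || alpha*q + (1-alpha)*r).

    Mixing a little of r into q keeps the denominator positive wherever
    r is positive, so the value stays finite for alpha < 1.
    alpha=0.99 is an assumed setting, not one reported in the abstract.
    """
    keys = set(q) | set(r)
    mix = {k: alpha * q.get(k, 0.0) + (1 - alpha) * r.get(k, 0.0) for k in keys}
    r_full = {k: r.get(k, 0.0) for k in keys}
    return kl_divergence(r_full, mix)


def rank_classes(concept_dist, class_dists, alpha=0.99):
    """Return class labels ordered from most to least similar (lowest divergence first)."""
    scores = {
        label: skew_divergence(dist, concept_dist, alpha)
        for label, dist in class_dists.items()
    }
    return sorted(scores, key=scores.get)


def top_k_error(ranked_lists, gold_labels, k):
    """Fraction of items whose gold label is not among the first k predictions."""
    misses = sum(
        1 for ranked, gold in zip(ranked_lists, gold_labels) if gold not in ranked[:k]
    )
    return misses / len(gold_labels)


# Toy context-feature distributions (hypothetical values for illustration only).
concept = {"treat": 0.5, "dose": 0.5}
classes = {
    "drug": {"treat": 0.6, "dose": 0.4},
    "disease": {"diagnose": 0.7, "treat": 0.3},
}
ranking = rank_classes(concept, classes)
```

Under these assumptions, a concept is assigned the class whose feature distribution diverges least from its own, and the top-2 error counts a prediction as correct when the gold class is either of the two lowest-divergence classes.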

Original language	English (US)
Pages (from-to)	467-477
Number of pages	11
Journal	Journal of the American Medical Informatics Association
Volume	14
Issue number	4
DOIs	https://doi.org/10.1197/jamia.M2314
State	Published - Jul 2007

ASJC Scopus subject areas

Health Informatics

Cite this

@article{db84e340bf4a460893a4bf479f9957c0,

title = "Semantic Classification of Biomedical Concepts Using Distributional Similarity",

abstract = "Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications. Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure. Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed. Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively. Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.",

author = "Fan, {Jung Wei} and Friedman, Carol",

note = "Funding Information: This work was supported by Grants R01 LM7659 and R01 LM8635 from the National Library of Medicine.",

year = "2007",

month = jul,

doi = "10.1197/jamia.M2314",

language = "English (US)",

volume = "14",

pages = "467--477",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "4",

}

TY - JOUR

T1 - Semantic Classification of Biomedical Concepts Using Distributional Similarity

AU - Fan, Jung Wei

AU - Friedman, Carol

N1 - Funding Information: This work was supported by Grants R01 LM7659 and R01 LM8635 from the National Library of Medicine.

PY - 2007/7

Y1 - 2007/7

AB - Objective: To develop an automated, high-throughput, and reproducible method for reclassifying and validating ontological concepts for natural language processing applications. Design: We developed a distributional similarity approach to classify the Unified Medical Language System (UMLS) concepts. Classification models were built for seven broad biomedically relevant semantic classes created by grouping subsets of the UMLS semantic types. We used contextual features based on syntactic properties obtained from two different large corpora and used α-skew divergence as the similarity measure. Measurements: The testing sets were automatically generated based on the changes by the National Library of Medicine to the semantic classification of concepts from the UMLS 2005AA to the 2006AA release. Error rates were calculated and a misclassification analysis was performed. Results: The estimated lowest error rates were 0.198 and 0.116 when considering the correct classification to be covered by our top prediction and top 2 predictions, respectively. Conclusion: The results demonstrated that the distributional similarity approach can recommend high level semantic classification suitable for use in natural language processing.

UR - http://www.scopus.com/inward/record.url?scp=34250744835&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34250744835&partnerID=8YFLogxK

U2 - 10.1197/jamia.M2314

DO - 10.1197/jamia.M2314

M3 - Article

C2 - 17460124

AN - SCOPUS:34250744835

SN - 1067-5027

VL - 14

SP - 467

EP - 477

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 4

ER -
