Unified Medical Language System term occurrences in clinical notes: A large-scale corpus analysis

Stephen T. Wu, Hongfang D Liu, Dingcheng Li, Cui Tao, Mark A. Musen, Christopher G. Chute, Nigam H. Shah

Research output: Contribution to journal (Article)

47 Citations (Scopus)

Abstract

Objective: To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.

Design: Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.

Results: For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.

Conclusion: The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well-formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
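The length finding reported in the abstract (negligible numbers of mapped terms exceeding six words or 55 characters) suggests one simple kind of lexicon filter. The following is a minimal Python sketch of such a filter, not the authors' implementation: the function names, the whitespace-based word count, and the sample terms are all assumptions made for illustration, and only the two thresholds come from the abstract.

```python
from typing import Iterable, List

# Thresholds taken from the abstract: mapped terms with more than
# six words or more than 55 characters were negligible in the Mayo corpus.
MAX_WORDS = 6
MAX_CHARS = 55


def within_length_limits(term: str,
                         max_words: int = MAX_WORDS,
                         max_chars: int = MAX_CHARS) -> bool:
    """Return True if the term is short enough to keep in the lexicon.

    Words are counted by splitting on whitespace, which is an assumption;
    the study may have tokenised terms differently.
    """
    return len(term) <= max_chars and len(term.split()) <= max_words


def filter_lexicon(terms: Iterable[str]) -> List[str]:
    """Keep only the terms that pass the length filter, preserving order."""
    return [t for t in terms if within_length_limits(t)]


def retained_fraction(original: List[str], filtered: List[str]) -> float:
    """Fraction of the original lexicon that survives filtering."""
    return len(filtered) / len(original) if original else 0.0


if __name__ == "__main__":
    # Hypothetical sample terms, not drawn from the UMLS or the paper.
    sample = [
        "myocardial infarction",
        "chest pain",
        "one two three four five six seven",  # seven words: dropped
        "x" * 56,                             # 56 characters: dropped
    ]
    kept = filter_lexicon(sample)
    print(kept)
    print(retained_fraction(sample, kept))
```

A filter like this depends only on the term string, which is the kind of intrinsic feature the abstract says generalises easily across institutions; frequency-based filters would need re-estimating on local data.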

Original language: English (US)
Journal: Journal of the American Medical Informatics Association
Volume: 19
Issue number: E1
DOI: 10.1136/amiajnl-2011-000744
State: Published - Jun 2012

Fingerprint

Unified Medical Language System
Semantics
Terminology
Vocabulary
Information Storage and Retrieval
Language

ASJC Scopus subject areas

  • Health Informatics

Cite this

Unified Medical Language System term occurrences in clinical notes: A large-scale corpus analysis. / Wu, Stephen T.; Liu, Hongfang D; Li, Dingcheng; Tao, Cui; Musen, Mark A.; Chute, Christopher G.; Shah, Nigam H.

In: Journal of the American Medical Informatics Association, Vol. 19, No. E1, 06.2012.

Wu, Stephen T.; Liu, Hongfang D; Li, Dingcheng; Tao, Cui; Musen, Mark A.; Chute, Christopher G.; Shah, Nigam H. / Unified Medical Language System term occurrences in clinical notes: A large-scale corpus analysis. In: Journal of the American Medical Informatics Association. 2012; Vol. 19, No. E1.
@article{80fcdbdf5758420bb51fcdbd2f7c5623,
title = "Unified Medical Language System term occurrences in clinical notes: A large-scale corpus analysis",
abstract = "Objective To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. Design Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. Results For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08{\%} of term occurrences in Mayo data. Syntactically, over 90{\%} of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13{\%} of the size of the UMLS and only sees a 2{\%} reduction in matched terms. Conclusion The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well-formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.",
author = "Wu, {Stephen T.} and Liu, {Hongfang D} and Dingcheng Li and Cui Tao and Musen, {Mark A.} and Chute, {Christopher G.} and Shah, {Nigam H.}",
year = "2012",
month = "6",
doi = "10.1136/amiajnl-2011-000744",
language = "English (US)",
volume = "19",
journal = "Journal of the American Medical Informatics Association : JAMIA",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "E1",

}

TY - JOUR

T1 - Unified Medical Language System term occurrences in clinical notes

T2 - A large-scale corpus analysis

AU - Wu, Stephen T.

AU - Liu, Hongfang D

AU - Li, Dingcheng

AU - Tao, Cui

AU - Musen, Mark A.

AU - Chute, Christopher G.

AU - Shah, Nigam H.

PY - 2012/6

Y1 - 2012/6

AB - Objective To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. Design Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. Results For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms. Conclusion The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well-formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.

UR - http://www.scopus.com/inward/record.url?scp=84863537188&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863537188&partnerID=8YFLogxK

U2 - 10.1136/amiajnl-2011-000744

DO - 10.1136/amiajnl-2011-000744

M3 - Article

C2 - 22493050

AN - SCOPUS:84863537188

VL - 19

JO - Journal of the American Medical Informatics Association : JAMIA

JF - Journal of the American Medical Informatics Association : JAMIA

SN - 1067-5027

IS - E1

ER -