Towards a semantic lexicon for clinical natural language processing.

Hongfang D Liu, Stephen T. Wu, Dingcheng Li, Siddhartha Jonnalagadda, Sunghwan Sohn, Kavishwar Wagholikar, Peter J. Haug, Stanley M. Huff, Christopher G. Chute

Research output: Contribution to journal, Article

12 Citations (Scopus)

Abstract

A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.

Original language: English (US)
Pages (from-to): 568-576
Number of pages: 9
Journal: AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
Volume: 2012
State: Published - 2012

Fingerprint

Natural Language Processing
Semantics
Unified Medical Language System
Electronic Health Records
Terminology

ASJC Scopus subject areas

  • Medicine(all)

Cite this

Towards a semantic lexicon for clinical natural language processing. / Liu, Hongfang D; Wu, Stephen T.; Li, Dingcheng; Jonnalagadda, Siddhartha; Sohn, Sunghwan; Wagholikar, Kavishwar; Haug, Peter J.; Huff, Stanley M.; Chute, Christopher G.

In: AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, Vol. 2012, 2012, p. 568-576.

Liu, HD, Wu, ST, Li, D, Jonnalagadda, S, Sohn, S, Wagholikar, K, Haug, PJ, Huff, SM & Chute, CG 2012, 'Towards a semantic lexicon for clinical natural language processing.', AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, vol. 2012, pp. 568-576.
Liu, Hongfang D ; Wu, Stephen T. ; Li, Dingcheng ; Jonnalagadda, Siddhartha ; Sohn, Sunghwan ; Wagholikar, Kavishwar ; Haug, Peter J. ; Huff, Stanley M. ; Chute, Christopher G. / Towards a semantic lexicon for clinical natural language processing. In: AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium. 2012 ; Vol. 2012. pp. 568-576.
@article{18bcecbd1f3a488786992ed69d53ad4e,
title = "Towards a semantic lexicon for clinical natural language processing",
abstract = "A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.",
author = "Liu, {Hongfang D} and Wu, {Stephen T.} and Li, Dingcheng and Jonnalagadda, Siddhartha and Sohn, Sunghwan and Wagholikar, Kavishwar and Haug, {Peter J.} and Huff, {Stanley M.} and Chute, {Christopher G.}",
year = "2012",
language = "English (US)",
volume = "2012",
pages = "568--576",
journal = "AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium",
issn = "1559-4076",
publisher = "American Medical Informatics Association"
}

TY - JOUR

T1 - Towards a semantic lexicon for clinical natural language processing.

AU - Liu, Hongfang D

AU - Wu, Stephen T.

AU - Li, Dingcheng

AU - Jonnalagadda, Siddhartha

AU - Sohn, Sunghwan

AU - Wagholikar, Kavishwar

AU - Haug, Peter J.

AU - Huff, Stanley M.

AU - Chute, Christopher G.

PY - 2012

Y1 - 2012

AB - A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.

UR - http://www.scopus.com/inward/record.url?scp=84880802623&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880802623&partnerID=8YFLogxK

M3 - Article

VL - 2012

SP - 568

EP - 576

JO - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

JF - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

SN - 1559-4076

ER -