A comparison study on algorithms of detecting long forms for short forms in biomedical text

Manabu Torii, Zhang Zhi Hu, Min Song, Cathy H. Wu, Hongfang D Liu

Research output: Contribution to journal (Article)

22 Citations (Scopus)

Abstract

Motivation: As more research is devoted to literature mining in the biomedical domain, a growing number of systems is available to choose from when building literature mining applications. In this study, we focus on one specific literature mining task: detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We refer to acronyms, abbreviations, and symbols as short forms (SFs) and to their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: i) how well a system performs in detecting LFs in novel text, ii) how well various terminological knowledge bases cover SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases.

Method: We evaluated three publicly available systems for detecting LFs for SFs: i) ALICE, a handcrafted pattern/rule-based system by Ao and Takagi, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (the Unified Medical Language System), and ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases.

Results: We found that the detection systems agree with each other in most cases, and that the existing terminological knowledge bases provide good coverage of synonymy relationships for frequently defined LFs. The web interface allows people to detect SF definitions in text and to search several SF knowledge bases.
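To make the alignment-based approach mentioned in the Method more concrete, here is a minimal Python sketch of the general idea: find a parenthesized SF, then align its characters, right to left, against the words just before the parenthesis. It is a simplified illustration, not the program evaluated in the study. The function names (is_valid_short_form, best_long_form, find_pairs), the length limits, and the word-window heuristic are assumptions made for this example, and only the "LF (SF)" pattern is handled.

```python
import re
from typing import List, Optional, Tuple


def is_valid_short_form(sf: str) -> bool:
    """Heuristic filter (assumed limits): 2-10 characters, at most two
    words, at least one letter, and an alphanumeric first character."""
    return (
        2 <= len(sf) <= 10
        and len(sf.split()) <= 2
        and any(c.isalpha() for c in sf)
        and sf[0].isalnum()
    )


def best_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Align short_form against candidate from right to left.

    Each alphanumeric character of the short form must appear, in order,
    in the candidate text; the first character must also start a word.
    Returns the shortest matching suffix of candidate, or None.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            continue  # skip punctuation inside the short form
        # Move left until the character matches; for the first character
        # of the short form, also require a word boundary before it.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    # Extend back to the start of the word that holds the first match.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]{2,10})\)", text):
        sf = match.group(1).strip()
        if not is_valid_short_form(sf):
            continue
        # Assumed search window: a few words before the parenthesis.
        words = text[: match.start()].split()
        max_words = min(len(sf) + 5, len(sf) * 2)
        candidate = " ".join(words[-max_words:])
        lf = best_long_form(sf, candidate)
        if lf is not None:
            pairs.append((sf, lf))
    return pairs
```

Applied to the sentence "We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs).", find_pairs returns [("SFs", "short forms"), ("LFs", "long forms")]. The reverse pattern, as in "the UMLS (the Unified Medical Language System)", is deliberately not covered by this sketch and yields no pair.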

Original language: English (US)
Article number: S5
Journal: BMC Bioinformatics
Volume: 8
Issue number: SUPPL. 9
DOI: 10.1186/1471-2105-8-S9-S5
State: Published - Nov 27 2007
Externally published: Yes

Fingerprint

Knowledge Bases
Learning systems
Knowledge Base
Thesauri
Knowledge based systems
Unified Medical Language System
Controlled Vocabulary
Proteins
Abbreviation
Acronym
Mining
Coverage
Names
Form
Text
Thesaurus
Rule-based Systems
Research
Machine Learning

ASJC Scopus subject areas

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

A comparison study on algorithms of detecting long forms for short forms in biomedical text. / Torii, Manabu; Hu, Zhang Zhi; Song, Min; Wu, Cathy H.; Liu, Hongfang D.

In: BMC Bioinformatics, Vol. 8, No. SUPPL. 9, S5, 27.11.2007.

@article{ada68f733b5b4dfab4f4751a7c56103f,
title = "A comparison study on algorithms of detecting long forms for short forms in biomedical text",
author = "Manabu Torii and Hu, {Zhang Zhi} and Min Song and Wu, {Cathy H.} and Liu, {Hongfang D}",
year = "2007",
month = "11",
day = "27",
doi = "10.1186/1471-2105-8-S9-S5",
language = "English (US)",
volume = "8",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "SUPPL. 9",

}

TY - JOUR
T1 - A comparison study on algorithms of detecting long forms for short forms in biomedical text
AU - Torii, Manabu
AU - Hu, Zhang Zhi
AU - Song, Min
AU - Wu, Cathy H.
AU - Liu, Hongfang D
PY - 2007/11/27
Y1 - 2007/11/27
UR - http://www.scopus.com/inward/record.url?scp=38449110656&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38449110656&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-8-S9-S5
DO - 10.1186/1471-2105-8-S9-S5
M3 - Article
C2 - 18047706
AN - SCOPUS:38449110656
VL - 8
JO - BMC Bioinformatics
JF - BMC Bioinformatics
SN - 1471-2105
IS - SUPPL. 9
M1 - S5
ER -