A comparison study of biomedical short form definition detection algorithms

Manabu Torii, Hongfang Liu, Zhangzhi Hu, Cathy Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

With a growing body of research dedicated to literature mining in the biomedical domain, more and more systems are available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of recently published biomedical documents, ii) how well various knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems in detecting definitions for acronyms/abbreviations/symbols, namely ALICE (a handcrafted pattern/rule-based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program). We also evaluated the conceptual coverage of existing thesauri, namely the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly because most acronyms/abbreviations/symbols were formed through various kinds of initialization from their definitions. The precision and recall of the three systems are comparable. However, based on manual investigation of the results, we found that most systems have some difficulty detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on these symbols than the other two systems, possibly because it was fine-tuned for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.
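The abstract does not say how the three systems' results were combined. As a minimal sketch of one possible approach, the Python below combines (short form, definition) pairs by voting and scores each combination with precision and recall. The system names, the toy pairs, and the gold standard are hypothetical placeholders, not outputs of ALICE, the Chang et al. system, or the Schwartz and Hearst algorithm, and the voting scheme is an assumption rather than the authors' method.

```python
from collections import Counter


def normalize(pair):
    """Lowercase the definition and collapse whitespace so equal pairs match."""
    short_form, definition = pair
    return (short_form.strip(), " ".join(definition.lower().split()))


def combine(outputs, min_votes):
    """Keep pairs proposed by at least `min_votes` systems.

    min_votes=1 is the union, min_votes=len(outputs) is the intersection.
    """
    votes = Counter()
    for pairs in outputs.values():
        # Use a set so one system cannot vote twice for the same pair.
        votes.update({normalize(p) for p in pairs})
    return {pair for pair, count in votes.items() if count >= min_votes}


def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted pairs against gold pairs."""
    gold_set = {normalize(p) for p in gold}
    true_positives = len(predicted & gold_set)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical toy outputs, invented for illustration only.
outputs = {
    "system_a": [("UMLS", "Unified Medical Language System"),
                 ("ER", "estrogen receptor")],
    "system_b": [("UMLS", "unified medical language system")],
    "system_c": [("UMLS", "Unified Medical Language System"),
                 ("ER", "endoplasmic reticulum")],
}
gold = [("UMLS", "Unified Medical Language System"),
        ("ER", "estrogen receptor")]

union = combine(outputs, min_votes=1)
majority = combine(outputs, min_votes=2)

for name, combined in (("union", union), ("majority", majority)):
    p, r = precision_recall(combined, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, the union favors recall and the majority vote favors precision, which is the usual trade-off when choosing a vote threshold.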

Original language: English (US)
Title of host publication: CIKM 2006 Workshop - Proceedings of TMBIO 2006
Subtitle of host publication: ACM First International Workshop on Text Mining in Bioinformatics
Pages: 52-59
Number of pages: 8
State: Published - 2006
Event: TMBIO 2006: ACM 1st International Workshop on Text Mining in Bioinformatics, held in conjunction with the ACM 15th Conference on Information and Knowledge Management, CIKM 2006 - Arlington, VA, United States
Duration: Nov 10 2006 to Nov 10 2006

Publication series

Name: Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics

Other

Other: TMBIO 2006: ACM 1st International Workshop on Text Mining in Bioinformatics, held in conjunction with the ACM 15th Conference on Information and Knowledge Management, CIKM 2006
Country/Territory: United States
City: Arlington, VA
Period: 11/10/06 to 11/10/06

Keywords

  • Acronyms/abbreviations/symbols
  • Algorithm evaluation
  • Biomedical literature mining
  • Information extraction
  • Machine learning
  • Natural language processing
  • Rule-based systems

ASJC Scopus subject areas

  • General Biochemistry, Genetics and Molecular Biology
  • Bioengineering
  • General Computer Science
