A comparison study of biomedical short form definition detection algorithms

Manabu Torii, Hongfang D Liu, Zhangzhi Hu, Cathy Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

With a growing body of research dedicated to literature mining in the biomedical domain, more and more systems are available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of documents recently published in the biomedical domain, ii) how well various knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems, namely ALICE (a handcrafted pattern/rule-based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), in detecting definitions for acronyms/abbreviations/symbols, as well as the conceptual coverage of existing thesauri, namely the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly because most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precision and recall of the three systems are comparable. However, based on manual investigation of the results, we found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on these symbols than the other two, possibly due to fine-tuning of the system for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.

Original language: English (US)
Title of host publication: Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics
Pages: 52-59
Number of pages: 8
DOIs: https://doi.org/10.1145/1183535.1183548
State: Published - 2006
Externally published: Yes
Event: TMBIO 2006: ACM 1st International Workshop on Text Mining in Bioinformatics, held in conjunction with the ACM 15th Conference on Information and Knowledge Management, CIKM 2006 - Arlington, VA, United States
Duration: Nov 10 2006

Fingerprint

Controlled Vocabulary
Thesauri
Knowledge Bases
Proteins
Learning systems
Unified Medical Language System
Genes
Knowledge based systems
Names
Tuning
Research
Machine Learning

Keywords

  • Acronyms/abbreviations/symbols
  • Algorithm evaluation
  • Biomedical literature mining
  • Information extraction
  • Machine learning
  • Natural language processing
  • Rule-based systems

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology (all)
  • Bioengineering
  • Computer Science (all)

Cite this

Torii, M., Liu, H. D., Hu, Z., & Wu, C. (2006). A comparison study of biomedical short form definition detection algorithms. In Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics (pp. 52-59) https://doi.org/10.1145/1183535.1183548

@inproceedings{39f88af2fc934c8a906ec1982fc52d8b,
title = "A comparison study of biomedical short form definition detection algorithms",
abstract = "With a growing body of research dedicated to literature mining in the biomedical domain, more and more systems are available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of documents recently published in the biomedical domain, ii) how well various knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems, namely ALICE (a handcrafted pattern/rule-based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), in detecting definitions for acronyms/abbreviations/symbols, as well as the conceptual coverage of existing thesauri, namely the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94{\%} of all definitions detected), mainly because most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precision and recall of the three systems are comparable. However, based on manual investigation of the results, we found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on these symbols than the other two, possibly due to fine-tuning of the system for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.",
keywords = "Acronyms/abbreviations/symbols, Algorithm evaluation, Biomedical literature mining, Information extraction, Machine learning, Natural language processing, Rule-based systems",
author = "Manabu Torii and Liu, {Hongfang D} and Zhangzhi Hu and Cathy Wu",
year = "2006",
doi = "10.1145/1183535.1183548",
language = "English (US)",
isbn = "1595935266",
pages = "52--59",
booktitle = "Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics",

}

TY - GEN

T1 - A comparison study of biomedical short form definition detection algorithms

AU - Torii, Manabu

AU - Liu, Hongfang D

AU - Hu, Zhangzhi

AU - Wu, Cathy

PY - 2006

Y1 - 2006

AB - With a growing body of research dedicated to literature mining in the biomedical domain, more and more systems are available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of documents recently published in the biomedical domain, ii) how well various knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems, namely ALICE (a handcrafted pattern/rule-based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), in detecting definitions for acronyms/abbreviations/symbols, as well as the conceptual coverage of existing thesauri, namely the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly because most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precision and recall of the three systems are comparable. However, based on manual investigation of the results, we found that most systems have some difficulty in detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on these symbols than the other two, possibly due to fine-tuning of the system for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.

KW - Acronyms/abbreviations/symbols

KW - Algorithm evaluation

KW - Biomedical literature mining

KW - Information extraction

KW - Machine learning

KW - Natural language processing

KW - Rule-based systems

UR - http://www.scopus.com/inward/record.url?scp=34547670829&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34547670829&partnerID=8YFLogxK

U2 - 10.1145/1183535.1183548

DO - 10.1145/1183535.1183548

M3 - Conference contribution

AN - SCOPUS:34547670829

SN - 1595935266

SN - 9781595935267

SP - 52

EP - 59

BT - Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics

ER -