TY - GEN
T1 - A comparison study of biomedical short form definition detection algorithms
AU - Torii, Manabu
AU - Liu, Hongfang
AU - Hu, Zhangzhi
AU - Wu, Cathy
PY - 2006
Y1 - 2006
N2 - With a growing body of research dedicated to literature mining in the biomedical domain, more and more systems are available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of recently published biomedical documents, ii) how well various knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems in detecting definitions for acronyms/abbreviations/symbols, namely ALICE (a handcrafted pattern/rule-based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), and we assessed the conceptual coverage of existing thesauri, namely the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly because most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precision and recall of the three systems are comparable. However, manual investigation of the results showed that most systems have some difficulty detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on these symbols than the other two, possibly because it was fine-tuned for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.
AB - With a growing body of research dedicated to literature mining in the biomedical domain, more and more systems are available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms/abbreviations/symbols in biomedical text. The study was designed to answer the following questions: i) how well a system performs in detecting definitions when provided with a large set of recently published biomedical documents, ii) how well various knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and iii) how to combine results from various systems. We evaluated three publicly available systems in detecting definitions for acronyms/abbreviations/symbols, namely ALICE (a handcrafted pattern/rule-based system), a system by Chang et al. (a machine-learning system), and an algorithm by Schwartz and Hearst (a simple alignment-based program), and we assessed the conceptual coverage of existing thesauri, namely the UMLS (the Unified Medical Language System) and the BioThesaurus (a thesaurus of names for all UniProt protein records). We found that all three systems agreed on a large portion of the results (over 94% of all definitions detected), mainly because most acronyms/abbreviations/symbols were formed through various initializations from their definitions. The precision and recall of the three systems are comparable. However, manual investigation of the results showed that most systems have some difficulty detecting definitions for chemical/gene/protein symbols; ALICE performs relatively better on these symbols than the other two, possibly because it was fine-tuned for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.
KW - Acronyms/abbreviations/symbols
KW - Algorithm evaluation
KW - Biomedical literature mining
KW - Information extraction
KW - Machine learning
KW - Natural language processing
KW - Rule-based systems
UR - http://www.scopus.com/inward/record.url?scp=34547670829&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34547670829&partnerID=8YFLogxK
U2 - 10.1145/1183535.1183548
DO - 10.1145/1183535.1183548
M3 - Conference contribution
AN - SCOPUS:34547670829
SN - 1595935266
SN - 9781595935267
T3 - Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics
SP - 52
EP - 59
BT - CIKM 2006 Workshop - Proceedings of TMBIO 2006
T2 - TMBIO 2006: ACM 1st International Workshop on Text Mining in Bioinformatics, held in conjunction with the ACM 15th Conference on Information and Knowledge Management, CIKM 2006
Y2 - 10 November 2006 through 10 November 2006
ER -