TY - JOUR
T1 - Deriving a probabilistic syntacto-semantic grammar for biomedicine based on domain-specific terminologies
AU - Fan, Jung Wei
AU - Friedman, Carol
N1 - Funding Information:
We thank Dr. Wendy Chapman for help with access to the University of Pittsburgh NLP Repository. We thank Drs. Noémie Elhadad, Yang Huang, Herbert Chase, Chintan Patel, and Francis Morrison for their intellectual input in discussing the research ideas. This study was performed during the first author’s Ph.D. training in the Department of Biomedical Informatics, Columbia University, and was supported by Grant LM008635 from the National Library of Medicine.
PY - 2011/10
Y1 - 2011/10
AB - Biomedical natural language processing (BioNLP) is a useful technique that unlocks valuable information stored in textual data for practice and/or research. Syntactic parsing is a critical component of BioNLP applications that rely on correctly determining the sentence and phrase structure of free text. In addition to dealing with the vast number of domain-specific terms, a robust biomedical parser needs to model the semantic grammar to obtain viable syntactic structures. With either a rule-based or corpus-based approach, the grammar engineering process requires substantial time and knowledge from experts, and does not always yield a semantically transferable grammar. To reduce the human effort and to promote semantic transferability, we propose an automated method for deriving a probabilistic grammar based on a training corpus consisting of concept strings and semantic classes from the Unified Medical Language System (UMLS), a comprehensive terminology resource widely used by the community. The grammar is designed to specify noun phrases only due to the nominal nature of the majority of biomedical terminological concepts. Evaluated on manually parsed clinical notes, the derived grammar achieved a recall of 0.644, precision of 0.737, and average cross-bracketing of 0.61, which demonstrated better performance than a control grammar with the semantic information removed. Error analysis revealed shortcomings that could be addressed to improve performance. The results indicated the feasibility of an approach which automatically incorporates terminology semantics in the building of an operational grammar. Although the current performance of the unsupervised solution does not adequately replace manual engineering, we believe once the performance issues are addressed, it could serve as an aid in a semi-supervised solution.
KW - Biomedical terminology
KW - Natural language processing
KW - Probabilistic parsing
KW - Semantic grammar
UR - http://www.scopus.com/inward/record.url?scp=80052894088&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052894088&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2011.04.006
DO - 10.1016/j.jbi.2011.04.006
M3 - Article
C2 - 21549857
AN - SCOPUS:80052894088
SN - 1532-0464
VL - 44
SP - 805
EP - 814
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
IS - 5
ER -