Onto-BioThesaurus: ontological representation of gene/protein names for biomedica

Project: Research project

Project Details


The long-term goal of our research is to develop resources and natural language processing (NLP) systems for knowledge management in the biomedical domain. As biomedical data stored in disparate resources grow rapidly in both scale and complexity, ontology-based knowledge management is becoming increasingly popular because it provides explicit descriptions of biomedical entities and an approach to annotating and analyzing the results of biomedical research. Much of the information and knowledge relevant to biomedical research is still recorded in free-text format. In the past decade, NLP has been shown to have the potential to accelerate the biomedical knowledge management process. One critical component of NLP systems is identifying gene/protein names (i.e., gene/protein name identification) and normalizing them to standard representations (i.e., gene/protein name normalization). Gene/protein name identification has been tackled with good performance, but gene/protein name normalization remains challenging. First, there is a lack of standard representations for gene/protein names. Researchers have used structured databases, such as the protein database UniProtKB or the gene resource Entrez Gene, as the reference for names. However, it is problematic to associate names with individual records in those databases, since a name in text can be generic and refer to a group of records. Additionally, like other biomedical concepts such as diseases or lab procedures, genes and proteins usually appear in text as short forms abbreviated from their names or descriptions. The prevalent use of short forms is another challenge for NLP applications because short forms are highly ambiguous.

Specifically, the proposed research aims to:

1) Develop onto-BioThesaurus by enriching BioThesaurus, an existing gene/protein thesaurus, with gene/protein-related ontologies. Hypothesis: aligning gene/protein names to gene/protein-related ontologies can i) detect systematic ambiguity, ii) enable automatic reasoning during gene/protein named entity tagging, and iii) facilitate ontology-based knowledge management.

2) Enhance onto-BioThesaurus by harvesting short form knowledge from online resources and text. Hypothesis: harvesting synonyms, especially gene/protein short forms, is critical for resolving the ambiguity, synonymy, and novelty problems in gene/protein name normalization.

3) Normalize gene/protein names using onto-BioThesaurus. Hypothesis: using onto-BioThesaurus offers several advantages over traditional gene/protein name normalization (i.e., lowering ambiguity, handling novelty, and linking gene/protein concepts to biomedical ontologies), and we expect improved performance of various lookup and disambiguation methods.

4) Evaluate research methods and distribute research outcomes. Hypothesis: evaluating research methods and distributing research outcomes to the public are critical to advancing basic and applied biomedical science.
Effective start/end date: 9/1/09 to 8/31/10


  • U.S. National Library of Medicine: $608,650.00

