Onto-BioThesaurus: ontological representation of gene/protein names for biomedica

Project: Research project

Project Details

Description

The long-term goal of our research is to develop resources and natural language processing (NLP)
systems for knowledge management in the biomedical domain. As biomedical data stored in disparate
resources undergo a very rapid growth in both scale and complexity, ontology-based knowledge management
is becoming increasingly popular since it provides explicit descriptions of biomedical entities and an approach
to annotating and analyzing the results of biomedical research. Much of the information and knowledge relevant to
biomedical research is still recorded in free text format. In the past decade, NLP has been shown to have the
potential to accelerate the biomedical knowledge management process. One critical component in NLP
systems is identifying gene/protein names (i.e., gene/protein name identification) and normalizing them to
standard representations (i.e., gene/protein name normalization). Gene/protein name identification has been
tackled with good performance, but gene/protein name normalization remains challenging. First, there is a
lack of standard representations for gene/protein names. Researchers have used structured databases such as
the protein database UniProtKB or the gene resource Entrez Gene as the reference for names. But it is problematic to
associate names with individual records in those databases, since a name in text can be generic and refer to a
group of records. Additionally, like other biomedical concepts such as diseases or lab procedures, genes or
proteins usually appear in text as short forms abbreviated from their names or descriptions. The prevalent use
of short forms is another challenge faced by NLP applications because short forms are highly ambiguous.
Specifically, the proposed research aims to:
1) develop onto-BioThesaurus by enriching BioThesaurus, an existing gene/protein thesaurus, with
gene/protein-related ontologies. Hypothesis: aligning gene/protein names to gene/protein-related ontologies
can i) detect systematic ambiguity, ii) enable automatic reasoning during gene/protein named entity tagging, and
iii) facilitate ontology-based knowledge management;
2) enhance onto-BioThesaurus by harvesting short form knowledge from online resources and text.
Hypothesis: harvesting synonyms, especially gene/protein short forms, is critical for resolving the ambiguity,
synonymy, and novelty problems in gene/protein name normalization;
3) normalize gene/protein names using onto-BioThesaurus. Hypothesis: using onto-BioThesaurus offers several
advantages over traditional gene/protein name normalization (i.e., lowering ambiguity, handling novelty, and
linking gene/protein concepts to biomedical ontologies), and we expect improved performance of various
lookup and disambiguation methods; and
4) evaluate research methods and distribute research outcomes. Hypothesis: evaluating research methods
and distributing research outcomes to the public are critical to advancing basic and applied biomedical science.
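
The normalization problem described above can be illustrated with a minimal sketch. It is not the project's implementation: the thesaurus entries, record IDs, ontology class labels, and the `normalize` function are all hypothetical placeholders, chosen only to show how a lookup might distinguish a unique name, a generic name covering a group of records, an ambiguous short form, and a novel name.

```python
# Hypothetical toy thesaurus: normalized surface name -> set of record IDs.
# Names and IDs are invented placeholders, not real database entries.
TOY_THESAURUS = {
    "example kinase 1": {"REC001"},
    "ek1": {"REC001", "REC002"},             # short form shared by two records
    "example kinase": {"REC001", "REC003"},  # generic name covering a group
}

# Hypothetical mapping from record IDs to an ontology class label.
TOY_ONTOLOGY_CLASS = {
    "REC001": "kinase family (toy)",
    "REC002": "transporter family (toy)",
    "REC003": "kinase family (toy)",
}


def normalize(mention):
    """Look up a mention and classify the result of the lookup."""
    # Case-fold and collapse whitespace before the dictionary lookup.
    key = " ".join(mention.lower().split())
    records = sorted(TOY_THESAURUS.get(key, set()))
    classes = sorted({TOY_ONTOLOGY_CLASS[r] for r in records})

    if not records:
        status = "novel"       # name not in the thesaurus
    elif len(records) == 1:
        status = "unique"      # maps to exactly one record
    elif len(classes) == 1:
        status = "generic"     # several records, one shared ontology class
    else:
        status = "ambiguous"   # several records across different classes
    return {"records": records, "classes": classes, "status": status}


if __name__ == "__main__":
    for name in ["Example Kinase 1", "example kinase", "EK1", "unseen name"]:
        print(name, "->", normalize(name))
```

In this sketch, attaching an ontology class to each record is what lets a generic name resolve to a single concept even though it matches several records, while a short form that spans unrelated classes is flagged for disambiguation.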
Status: Finished
Effective start/end date: 9/1/09 to 9/29/13

Funding

  • U.S. National Library of Medicine: $614,000.00
  • U.S. National Library of Medicine: $608,650.00
