iProLINK: An integrated protein resource for literature mining

Zhang Z. Hu, Inderjeet Mani, Vincent Hermoso, Hongfang D Liu, Cathy H. Wu

Research output: Contribution to journal, Article

32 Citations (Scopus)

Abstract

The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies, but at the same time, the lack of adequate curated data for training and benchmarking, the Protein Information Resource (PIR) has developed a resource for protein literature mining - iProLINK (integrated Protein Literature INformation and Knowledge). As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter includes several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, as well as a protein ontology based on PIRSF protein family names. iProLINK is freely accessible at http://pir.georgetown.edu/iprolink, with hypertext links for all downloadable files.

Original language: English (US)
Pages (from-to): 409-416
Number of pages: 8
Journal: Computational Biology and Chemistry
Volume: 28
Issue number: 5-6
DOI: 10.1016/j.compbiolchem.2004.09.010
State: Published - Dec 2004
Externally published: Yes

Keywords

  • Literature mining
  • Natural language processing
  • Post-translation modifications
  • Protein annotation
  • PubMed
  • UniProt

ASJC Scopus subject areas

  • Biochemistry
  • Structural Biology
  • Analytical Chemistry
  • Physical and Theoretical Chemistry

Cite this

iProLINK: An integrated protein resource for literature mining. / Hu, Zhang Z.; Mani, Inderjeet; Hermoso, Vincent; Liu, Hongfang D; Wu, Cathy H.

In: Computational Biology and Chemistry, Vol. 28, No. 5-6, 12.2004, p. 409-416.

@article{a1c292d33c244e6db0c1e43b34d5f272,
title = "iProLINK: An integrated protein resource for literature mining",
keywords = "Literature mining, Natural language processing, Post-translation modifications, Protein annotation, Pubmed, Uniprot",
author = "Hu, {Zhang Z.} and Inderjeet Mani and Vincent Hermoso and Liu, {Hongfang D} and Wu, {Cathy H.}",
year = "2004",
month = "12",
doi = "10.1016/j.compbiolchem.2004.09.010",
language = "English (US)",
volume = "28",
pages = "409--416",
journal = "Computational Biology and Chemistry",
issn = "1476-9271",
publisher = "Elsevier Limited",
number = "5-6",

}

TY - JOUR

T1 - IProLINK

T2 - An integrated protein resource for literature mining

AU - Hu, Zhang Z.

AU - Mani, Inderjeet

AU - Hermoso, Vincent

AU - Liu, Hongfang D

AU - Wu, Cathy H.

PY - 2004/12

Y1 - 2004/12

KW - Literature mining

KW - Natural language processing

KW - Post-translation modifications

KW - Protein annotation

KW - Pubmed

KW - Uniprot

UR - http://www.scopus.com/inward/record.url?scp=9544257285&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=9544257285&partnerID=8YFLogxK

U2 - 10.1016/j.compbiolchem.2004.09.010

DO - 10.1016/j.compbiolchem.2004.09.010

M3 - Article

C2 - 15556482

AN - SCOPUS:9544257285

VL - 28

SP - 409

EP - 416

JO - Computational Biology and Chemistry

JF - Computational Biology and Chemistry

SN - 1476-9271

IS - 5-6

ER -