TY - JOUR
T1 - iProLINK
T2 - An integrated protein resource for literature mining
AU - Hu, Zhang Zhi
AU - Mani, Inderjeet
AU - Hermoso, Vincent
AU - Liu, Hongfang
AU - Wu, Cathy H.
N1 - Funding Information: The project is supported by grants ITR-0205470, DBI-0138188, and IIS-0430743 from the National Science Foundation and grant U01-HG02712 from the National Institutes of Health, USA.
PY - 2004/12
Y1 - 2004/12
N2 - The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies, but at the same time, the lack of adequate curated data for training and benchmarking, the Protein Information Resource (PIR) has developed a resource for protein literature mining - iProLINK (integrated Protein Literature INformation and Knowledge). As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter includes several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, as well as a protein ontology based on PIRSF protein family names. iProLINK is freely accessible at http://pir.georgetown.edu/iprolink, with hypertext links for all downloadable files.
KW - Literature mining
KW - Natural language processing
KW - Post-translational modifications
KW - Protein annotation
KW - PubMed
KW - UniProt
UR - http://www.scopus.com/inward/record.url?scp=9544257285&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=9544257285&partnerID=8YFLogxK
U2 - 10.1016/j.compbiolchem.2004.09.010
DO - 10.1016/j.compbiolchem.2004.09.010
M3 - Article
C2 - 15556482
AN - SCOPUS:9544257285
SN - 1476-9271
VL - 28
SP - 409
EP - 416
JO - Computational Biology and Chemistry
JF - Computational Biology and Chemistry
IS - 5-6
ER -