Mining terminological knowledge in large biomedical corpora.

Hongfang Liu; Carol Friedman

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

39 Scopus citations

Abstract

Terminological knowledge of the biomedical domain is important for natural language processing (NLP) and information retrieval (IR) applications, and a number of terminological knowledge sources, such as LocusLink, GenBank, and the UMLS, already exist. However, because of the tremendous amount of research activity in the field, new terms and symbols are continually being created; many of them are published in the literature but are not available in any of these resources. Therefore, effective mining of the literature for new terminology is critical for furthering NLP and IR applications. Abbreviations are widely used in the biomedical domain, and understanding them requires a terminological knowledge base that consists of abbreviations with their associated senses. In previous work, several methods have been developed for automatic construction of abbreviation knowledge bases from parenthetical expressions. However, these methods pair abbreviations and their expansions based on manually crafted patterns or rules. In this paper, we propose an automatic method that is based on collocations rather than on patterns or rules. It extracts a set of related terms from parenthetical expressions, including abbreviations associated with their expansions and other types of related terms such as synonyms or hyponyms. Our method is based on the observation that terms associated with parenthetical expressions i) are usually related, and ii) are often collocations because they tend to co-occur more often than expected by chance. Our method was applied to the collection of MEDLINE abstracts. The method and the results were evaluated using two collections: Berman's handcrafted abbreviation list and the LocusLink collection.
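To make the idea of "co-occurring more often than expected by chance" concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which collocation statistic was used, so pointwise mutual information (PMI) is swapped in here as one common choice. The names `candidate_pairs` and `score_pairs`, the `max_words` window, and the `min_count` threshold are all hypothetical, introduced only for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a parenthetical is any non-nested "(...)" span.
PAREN = re.compile(r"\(([^()]+)\)")
WORD = re.compile(r"[\w-]+")


def candidate_pairs(text, max_words=4):
    """Yield (outer phrase, inner term) candidates for each parenthetical.

    No abbreviation-specific rule is applied: every n-gram (1..max_words)
    that ends right before the parenthesis is treated as a candidate.
    """
    for match in PAREN.finditer(text):
        inner = match.group(1).strip()
        before = WORD.findall(text[:match.start()])
        for n in range(1, min(max_words, len(before)) + 1):
            yield " ".join(before[-n:]).lower(), inner


def score_pairs(texts, min_count=2, max_words=4):
    """Score candidate pairs by PMI over all candidate pair events.

    PMI = log( c(outer, inner) * N / (c(outer) * c(inner)) ), where N is
    the total number of candidate pairs. Pairs seen fewer than min_count
    times are dropped, since PMI is unreliable for rare events.
    """
    pair_counts = Counter()
    for text in texts:
        pair_counts.update(candidate_pairs(text, max_words))

    total = sum(pair_counts.values())
    outer_counts = Counter()
    inner_counts = Counter()
    for (outer, inner), count in pair_counts.items():
        outer_counts[outer] += count
        inner_counts[inner] += count

    scores = {}
    for (outer, inner), count in pair_counts.items():
        if count < min_count:
            continue
        expected = outer_counts[outer] * inner_counts[inner] / total
        scores[(outer, inner)] = math.log(count / expected)
    return scores
```

Calling `score_pairs` on a list of abstracts returns a dictionary keyed by (outer phrase, parenthetical term). Candidates that recur across texts, such as a full expansion followed by its abbreviation, survive the `min_count` filter, while one-off context words attached to the front of a phrase do not. PMI alone does not choose among nested candidates that always appear together; a real system would need an additional criterion for selecting the phrase boundary.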

Original language	English (US)
Pages (from-to)	415-426
Number of pages	12
Journal	Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
State	Published - 2003

ASJC Scopus subject areas

General Medicine

Cite this

@article{b4d3a686c2134df98ee1f19f72a69001,

title = "Mining terminological knowledge in large biomedical corpora",

abstract = "Terminological knowledge of the biomedical domain is important for natural language processing (NLP) and information retrieval (IR) applications, and a number of terminological knowledge sources, such as LocusLink, GeneBank, and the UMLS, already exist. However, because of the tremendous amount of research activity in the field, new terms and symbols are continually being created, many of which are published in the literature, but are not available in any of the other resources. Therefore, effective mining of the literature for new terminology is critical for furthering NLP and IR applications. Abbreviations are widely used in the biomedical domain, and the understanding of abbreviations requires a terminological knowledge base that consists of abbreviations with their associated senses. In previous work, several methods have been developed for automatic construction of abbreviation knowledge bases from parenthetical expressions. However, these methods pair abbreviations and their expansions based on manually crafted patterns or rules. In this paper, we propose an automatic method, which is not based on patterns or rules but is based on the use of collocations, to extract a set of related terms from parenthetical expressions including abbreviations associated with their expansions and other types of related terms such as synonyms, or hyponyms etc. Our method is based on the observation that terms associated with parenthetical expressions i) are usually related, and ii) are often collocations because they tend to co-occur more often than expected by chance. Our method was applied to the collection of MEDLINE abstracts. The method and the results were evaluated using two collections: Berman's handcrafted abbreviation list and the LocusLink collection.",

author = "Hongfang Liu and Carol Friedman",

year = "2003",

language = "English (US)",

pages = "415--426",

journal = "Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing",

issn = "2335-6936",

publisher = "World Scientific Publishing Co., Inc.",

}

TY - JOUR

T1 - Mining terminological knowledge in large biomedical corpora.

AU - Liu, Hongfang

AU - Friedman, Carol

PY - 2003

Y1 - 2003

N2 - Terminological knowledge of the biomedical domain is important for natural language processing (NLP) and information retrieval (IR) applications, and a number of terminological knowledge sources, such as LocusLink, GeneBank, and the UMLS, already exist. However, because of the tremendous amount of research activity in the field, new terms and symbols are continually being created, many of which are published in the literature, but are not available in any of the other resources. Therefore, effective mining of the literature for new terminology is critical for furthering NLP and IR applications. Abbreviations are widely used in the biomedical domain, and the understanding of abbreviations requires a terminological knowledge base that consists of abbreviations with their associated senses. In previous work, several methods have been developed for automatic construction of abbreviation knowledge bases from parenthetical expressions. However, these methods pair abbreviations and their expansions based on manually crafted patterns or rules. In this paper, we propose an automatic method, which is not based on patterns or rules but is based on the use of collocations, to extract a set of related terms from parenthetical expressions including abbreviations associated with their expansions and other types of related terms such as synonyms, or hyponyms etc. Our method is based on the observation that terms associated with parenthetical expressions i) are usually related, and ii) are often collocations because they tend to co-occur more often than expected by chance. Our method was applied to the collection of MEDLINE abstracts. The method and the results were evaluated using two collections: Berman's handcrafted abbreviation list and the LocusLink collection.

UR - http://www.scopus.com/inward/record.url?scp=0042128697&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0042128697&partnerID=8YFLogxK

M3 - Article

C2 - 12603046

AN - SCOPUS:0042128697

SN - 2335-6936

SP - 415

EP - 426

JO - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

JF - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

ER -
