Structuralizing biomedical abstracts with discriminative linguistic features

Sejin Nam; Senator Jeong; Sang Kyun Kim; Hong Gee Kim; Victoria Ngo; Nansu Zong

doi:10.1016/j.compbiomed.2016.10.026

Artificial Intelligence and Informatics

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Nearly 75% of the abstracts of MEDLINE papers are presented in an unstructured format. This study aims to automate the reformatting of unstructured abstracts into the Introduction, Methods, Results, and Discussion (IMRAD) format. The quality of this reformatting depends on the features used for sentence classification, so we explored which linguistic features are most effective for MEDLINE papers.

Methods: We constructed a feature set consisting of bag-of-words, linguistic, grammatical, and structural features. To evaluate effectiveness, defined as how well sentences can be classified using these features, we selected and constructed three datasets from the PubMed Central Open Access Subset: (1) structured abstracts (SA) for training, (2) unstructured RCT abstracts (UA-1), and (3) unstructured general abstracts (UA-2). F-score and accuracy were used to measure effectiveness at the level of individual IMRAD sections and for the overall classification.

Results: Adding linguistic features improved the accuracy of abstract sentence classification, with gains ranging from 1.2% to 35.8% across the three abstract datasets. The highest accuracies achieved were 91.7% on SA, 86.3% on UA-1, and 77.9% on UA-2. The linguistic features had far fewer dimensions (15) than bag-of-words (1541). All representative linguistic features (n-grams, verb phrases, and noun phrases) for each section are identified in our system (available at http://abstract.bike.re.kr).

Conclusion: Linguistic features can be used to classify sentences in MEDLINE abstracts effectively and with a low computational burden.
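The evaluation described in the abstract (overall accuracy plus an F-score for each IMRAD section) can be illustrated with a minimal sketch. It assumes the F-score is the balanced F1 measure, which the abstract does not specify. The function names and the toy labels are hypothetical and are not taken from the paper or its system.

```python
from collections import Counter

# The four IMRAD sections that each abstract sentence is assigned to.
SECTIONS = ("Introduction", "Methods", "Results", "Discussion")


def accuracy(gold, predicted):
    """Fraction of sentences whose predicted section matches the gold section."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f_scores(gold, predicted, labels=SECTIONS):
    """Per-section F1 (assumed here): 2*TP / (2*TP + FP + FN) for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted section received a wrong sentence
            fn[g] += 1  # gold section missed one of its sentences
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical toy labels for six sentences (not data from the paper).
gold = ["Introduction", "Methods", "Methods", "Results", "Results", "Discussion"]
pred = ["Introduction", "Methods", "Results", "Results", "Results", "Discussion"]

print(accuracy(gold, pred))
print(f_scores(gold, pred))
```

In this toy case one Methods sentence is misclassified as Results, which lowers recall for Methods and precision for Results while leaving the other two sections unaffected.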

Original language	English (US)
Pages (from-to)	276-285
Number of pages	10
Journal	Computers in Biology and Medicine
Volume	79
DOIs	https://doi.org/10.1016/j.compbiomed.2016.10.026
State	Published - Dec 1 2016

Keywords

Biomedical research paper
Discriminative linguistic features
IMRAD format
Sentence classification
Structured abstract

ASJC Scopus subject areas

Computer Science Applications
Health Informatics

Cite this

@article{567a9a2439b7406cad6eb715a49d6c28,

title = "Structuralizing biomedical abstracts with discriminative linguistic features",

abstract = "Objective: Nearly 75% of the abstracts of MEDLINE papers are presented in an unstructured format. This study aims to automate the reformatting of unstructured abstracts into the Introduction, Methods, Results, and Discussion (IMRAD) format. The quality of this reformatting depends on the features used for sentence classification, so we explored which linguistic features are most effective for MEDLINE papers. Methods: We constructed a feature set consisting of bag-of-words, linguistic, grammatical, and structural features. To evaluate effectiveness, defined as how well sentences can be classified using these features, we selected and constructed three datasets from the PubMed Central Open Access Subset: (1) structured abstracts (SA) for training, (2) unstructured RCT abstracts (UA-1), and (3) unstructured general abstracts (UA-2). F-score and accuracy were used to measure effectiveness at the level of individual IMRAD sections and for the overall classification. Results: Adding linguistic features improved the accuracy of abstract sentence classification, with gains ranging from 1.2% to 35.8% across the three abstract datasets. The highest accuracies achieved were 91.7% on SA, 86.3% on UA-1, and 77.9% on UA-2. The linguistic features had far fewer dimensions (15) than bag-of-words (1541). All representative linguistic features (n-grams, verb phrases, and noun phrases) for each section are identified in our system (available at http://abstract.bike.re.kr). Conclusion: Linguistic features can be used to classify sentences in MEDLINE abstracts effectively and with a low computational burden.",

keywords = "Biomedical research paper, Discriminative linguistic features, IMRAD format, Sentence classification, Structured abstract",

author = "Nam, Sejin and Jeong, Senator and Kim, {Sang Kyun} and Kim, {Hong Gee} and Ngo, Victoria and Zong, Nansu",

note = "Publisher Copyright: {\textcopyright} 2016 Elsevier Ltd",

year = "2016",

month = dec,

day = "1",

doi = "10.1016/j.compbiomed.2016.10.026",

language = "English (US)",

volume = "79",

pages = "276--285",

journal = "Computers in Biology and Medicine",

issn = "0010-4825",

publisher = "Elsevier Limited",

}

TY - JOUR

T1 - Structuralizing biomedical abstracts with discriminative linguistic features

AU - Nam, Sejin

AU - Jeong, Senator

AU - Kim, Sang Kyun

AU - Kim, Hong Gee

AU - Ngo, Victoria

AU - Zong, Nansu

PY - 2016/12/1

Y1 - 2016/12/1

AB - Objective: Nearly 75% of the abstracts of MEDLINE papers are presented in an unstructured format. This study aims to automate the reformatting of unstructured abstracts into the Introduction, Methods, Results, and Discussion (IMRAD) format. The quality of this reformatting depends on the features used for sentence classification, so we explored which linguistic features are most effective for MEDLINE papers. Methods: We constructed a feature set consisting of bag-of-words, linguistic, grammatical, and structural features. To evaluate effectiveness, defined as how well sentences can be classified using these features, we selected and constructed three datasets from the PubMed Central Open Access Subset: (1) structured abstracts (SA) for training, (2) unstructured RCT abstracts (UA-1), and (3) unstructured general abstracts (UA-2). F-score and accuracy were used to measure effectiveness at the level of individual IMRAD sections and for the overall classification. Results: Adding linguistic features improved the accuracy of abstract sentence classification, with gains ranging from 1.2% to 35.8% across the three abstract datasets. The highest accuracies achieved were 91.7% on SA, 86.3% on UA-1, and 77.9% on UA-2. The linguistic features had far fewer dimensions (15) than bag-of-words (1541). All representative linguistic features (n-grams, verb phrases, and noun phrases) for each section are identified in our system (available at http://abstract.bike.re.kr). Conclusion: Linguistic features can be used to classify sentences in MEDLINE abstracts effectively and with a low computational burden.

KW - Biomedical research paper

KW - Discriminative linguistic features

KW - IMRAD format

KW - Sentence classification

KW - Structured abstract

UR - http://www.scopus.com/inward/record.url?scp=84995655285&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84995655285&partnerID=8YFLogxK

U2 - 10.1016/j.compbiomed.2016.10.026

DO - 10.1016/j.compbiomed.2016.10.026

M3 - Article

C2 - 27838533

AN - SCOPUS:84995655285

SN - 0010-4825

VL - 79

SP - 276

EP - 285

JO - Computers in Biology and Medicine

JF - Computers in Biology and Medicine

ER -
