TY - JOUR
T1 - Structuralizing biomedical abstracts with discriminative linguistic features
AU - Nam, Sejin
AU - Jeong, Senator
AU - Kim, Sang Kyun
AU - Kim, Hong Gee
AU - Ngo, Victoria
AU - Zong, Nansu
N1 - Publisher Copyright: © 2016 Elsevier Ltd
PY - 2016/12/1
Y1 - 2016/12/1
AB - Objective: Nearly 75% of the abstracts of MEDLINE papers are presented in an unstructured format. This study aims to automate the reformatting of unstructured abstracts into the Introduction, Methods, Results, and Discussion (IMRAD) format. The quality of this reformatting depends on the features used in sentence classification. Therefore, we explored the most effective linguistic features in MEDLINE papers. Methods: We constructed a feature set consisting of bag-of-words, linguistic, grammatical, and structural features. To evaluate effectiveness, defined as the capability of sentence classification with these features, three datasets were selected and constructed from the PubMed Central Open Access Subset: (1) structured abstracts (SA) for training, (2) unstructured RCT abstracts (UA-1), and (3) unstructured general abstracts (UA-2). F-score and accuracy were used to measure effectiveness at the IMRAD section level and for the overall classification. Results: Adding linguistic features improved the accuracy of abstract sentence classification by 1.2% to 35.8% across the three abstract datasets. The highest accuracies achieved were 91.7% in SA, 86.3% in UA-1, and 77.9% in UA-2. Linguistic features (dimensions=15) had far fewer dimensions than bag-of-words (dimensions=1541). All representative linguistic features (n-grams, verb phrases, and noun phrases) for each section are identified in our system (available at http://abstract.bike.re.kr). Conclusion: Linguistic features can be used to classify sentences in MEDLINE abstracts effectively and with a low computational burden.
KW - Biomedical research paper
KW - Discriminative linguistic features
KW - IMRAD format
KW - Sentence classification
KW - Structured abstract
UR - http://www.scopus.com/inward/record.url?scp=84995655285&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84995655285&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2016.10.026
DO - 10.1016/j.compbiomed.2016.10.026
M3 - Article
C2 - 27838533
AN - SCOPUS:84995655285
SN - 0010-4825
VL - 79
SP - 276
EP - 285
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
ER -