Coreference analysis in clinical notes

A multi-pass sieve with alternate anaphora resolution modules

Siddhartha Reddy Jonnalagadda, Dingcheng Li, Sunghwan Sohn, Stephen Tze Inn Wu, Kavishwar Wagholikar, Manabu Torii, Hongfang D Liu

Research output: Contribution to journal (Article)

17 Citations (Scopus)

Abstract

Objective: This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity.

Materials and methods: The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve.

Results: The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set.

Discussion: A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts.

Conclusion: Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.
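
The multi-pass sieve idea in the abstract can be illustrated with a minimal Python sketch. It is not MedCoref's implementation: the three rules (`exact_match`, `head_match`, `pronoun_to_person`), the `Mention` type, and the example mentions are all hypothetical, chosen only to show rules applied from most to least precise while each pass sees the entity clusters built by earlier passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Mention:
    text: str
    sem_type: str  # "problem", "treatment", "test", "person" or "pronoun"


# A sieve answers: may this mention join this (already built) cluster?
Sieve = Callable[[Mention, List[Mention]], bool]


def exact_match(m: Mention, cluster: List[Mention]) -> bool:
    """Most precise pass: same semantic type and identical text."""
    return m.sem_type != "pronoun" and any(
        a.sem_type == m.sem_type and a.text.lower() == m.text.lower()
        for a in cluster
    )


def head_match(m: Mention, cluster: List[Mention]) -> bool:
    """Less precise pass: same type and same last word (crude head proxy)."""
    if m.sem_type == "pronoun":
        return False
    head = m.text.lower().split()[-1]
    return any(
        a.sem_type == m.sem_type and a.text.lower().split()[-1] == head
        for a in cluster
    )


PERSONAL_PRONOUNS = {"he", "she", "him", "her", "his"}


def pronoun_to_person(m: Mention, cluster: List[Mention]) -> bool:
    """Final pass: a personal pronoun joins a cluster containing a person."""
    return (
        m.sem_type == "pronoun"
        and m.text.lower() in PERSONAL_PRONOUNS
        and any(a.sem_type == "person" for a in cluster)
    )


def resolve(mentions: List[Mention], sieves: List[Sieve]) -> List[List[int]]:
    """Run the sieves in order; return clusters as sorted lists of indices."""
    owner = list(range(len(mentions)))  # cluster id of each mention
    clusters: Dict[int, List[int]] = {i: [i] for i in owner}
    for sieve in sieves:  # most precise rule first
        for i, m in enumerate(mentions):
            for j in range(i - 1, -1, -1):  # nearest antecedent first
                cur, cand = owner[i], owner[j]
                if cur == cand:
                    continue
                # The rule sees the whole candidate cluster, so information
                # gathered by earlier passes is available to later ones.
                if sieve(m, [mentions[k] for k in clusters[cand]]):
                    for k in clusters.pop(cur):
                        owner[k] = cand
                        clusters[cand].append(k)
                    break
    return sorted(sorted(c) for c in clusters.values())


notes = [
    Mention("the patient", "person"),
    Mention("chest pain", "problem"),
    Mention("He", "pronoun"),
    Mention("the chest pain", "problem"),
    Mention("aspirin", "treatment"),
    Mention("The patient", "person"),
]
print(resolve(notes, [exact_match, head_match, pronoun_to_person]))
# [[0, 2, 5], [1, 3], [4]]
```

Because the pronoun rule is just the last entry in the list of sieves, it could be swapped for a learned classifier with the same signature, which is the kind of alternative final module the abstract describes.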

Original language: English (US)
Pages (from-to): 867-874
Number of pages: 8
Journal: Journal of the American Medical Informatics Association
Volume: 19
Issue number: 5
DOI: 10.1136/amiajnl-2011-000766
State: Published - Sep 2012

Fingerprint

Semantics
Machine Learning
Supervised Machine Learning

ASJC Scopus subject areas

  • Health Informatics
  • Medicine(all)

Cite this

Coreference analysis in clinical notes: A multi-pass sieve with alternate anaphora resolution modules. / Jonnalagadda, Siddhartha Reddy; Li, Dingcheng; Sohn, Sunghwan; Wu, Stephen Tze Inn; Wagholikar, Kavishwar; Torii, Manabu; Liu, Hongfang D.

In: Journal of the American Medical Informatics Association, Vol. 19, No. 5, 09.2012, p. 867-874.

Jonnalagadda, Siddhartha Reddy; Li, Dingcheng; Sohn, Sunghwan; Wu, Stephen Tze Inn; Wagholikar, Kavishwar; Torii, Manabu; Liu, Hongfang D. / Coreference analysis in clinical notes: A multi-pass sieve with alternate anaphora resolution modules. In: Journal of the American Medical Informatics Association. 2012; Vol. 19, No. 5. pp. 867-874.
@article{4c5d6c805cdc41bd936f37233a33ae17,
title = "Coreference analysis in clinical notes: A multi-pass sieve with alternate anaphora resolution modules",
abstract = "Objective This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity. Materials and methods The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve. Results The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set. Discussion A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts. Conclusion Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.",
author = "Jonnalagadda, {Siddhartha Reddy} and Dingcheng Li and Sunghwan Sohn and Wu, {Stephen Tze Inn} and Kavishwar Wagholikar and Manabu Torii and Liu, {Hongfang D}",
year = "2012",
month = "9",
doi = "10.1136/amiajnl-2011-000766",
language = "English (US)",
volume = "19",
pages = "867--874",
journal = "Journal of the American Medical Informatics Association : JAMIA",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "5",

}

TY - JOUR

T1 - Coreference analysis in clinical notes

T2 - A multi-pass sieve with alternate anaphora resolution modules

AU - Jonnalagadda, Siddhartha Reddy

AU - Li, Dingcheng

AU - Sohn, Sunghwan

AU - Wu, Stephen Tze Inn

AU - Wagholikar, Kavishwar

AU - Torii, Manabu

AU - Liu, Hongfang D

PY - 2012/9

Y1 - 2012/9

N2 - Objective This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity. Materials and methods The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve. Results The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set. Discussion A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts. Conclusion Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref.

UR - http://www.scopus.com/inward/record.url?scp=84872240730&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84872240730&partnerID=8YFLogxK

U2 - 10.1136/amiajnl-2011-000766

DO - 10.1136/amiajnl-2011-000766

M3 - Article

VL - 19

SP - 867

EP - 874

JO - Journal of the American Medical Informatics Association : JAMIA

JF - Journal of the American Medical Informatics Association : JAMIA

SN - 1067-5027

IS - 5

ER -