A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment

Jonathan S. Ilgen, Irene W Y Ma, Rose Hatala, David Allan Cook

Research output: Contribution to journal › Article

100 Citations (Scopus)

Abstract

Context: The relative advantages and disadvantages of checklists and global rating scales (GRSs) have long been debated. To compare the merits of these scale types, we conducted a systematic review of the validity evidence for checklists and GRSs in the context of simulation-based assessment of health professionals.

Methods: We conducted a systematic review of multiple databases including MEDLINE, EMBASE and Scopus to February 2013. We selected studies that used both a GRS and checklist in the simulation-based assessment of health professionals. Reviewers working in duplicate evaluated five domains of validity evidence, including correlation between scales and reliability. We collected information about raters, instrument characteristics, assessment context, and task. We pooled reliability and correlation coefficients using random-effects meta-analysis.

Results: We found 45 studies that used a checklist and GRS in simulation-based assessment. All studies included physicians or physicians in training; one study also included nurse anaesthetists. Topics of assessment included open and laparoscopic surgery (n = 22), endoscopy (n = 8), resuscitation (n = 7) and anaesthesiology (n = 4). The pooled GRS-checklist correlation was 0.76 (95% confidence interval [CI] 0.69-0.81, n = 16 studies). Inter-rater reliability was similar between scales (GRS 0.78, 95% CI 0.71-0.83, n = 23; checklist 0.81, 95% CI 0.75-0.85, n = 21), whereas GRS inter-item reliabilities (0.92, 95% CI 0.84-0.95, n = 6) and inter-station reliabilities (0.80, 95% CI 0.73-0.85, n = 10) were higher than those for checklists (0.66, 95% CI 0-0.84, n = 4 and 0.69, 95% CI 0.56-0.77, n = 10, respectively). Content evidence for GRSs usually referenced previously reported instruments (n = 33), whereas content evidence for checklists usually described expert consensus (n = 26). Checklists and GRSs usually had similar evidence for relations to other variables.

Conclusions: Checklist inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate checklist. Compared with the checklist, the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise. Discuss ideas arising from the article at "www.mededuc.com discuss"
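The Methods note that correlation coefficients were pooled using random-effects meta-analysis. As an illustration only (not the authors' actual analysis code, and the specific estimator is an assumption), a minimal DerSimonian-Laird random-effects pooling of correlations via Fisher's z-transform might look like:

```python
import math

def fisher_z(r):
    """Fisher z-transform of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def pool_correlations(rs, ns):
    """Random-effects (DerSimonian-Laird) pooling of correlations.

    rs: per-study correlation coefficients
    ns: per-study sample sizes
    Returns the pooled r and its 95% CI, back-transformed from z.
    """
    zs = [fisher_z(r) for r in rs]
    vs = [1.0 / (n - 3) for n in ns]           # sampling variance of z is 1/(n-3)
    ws = [1.0 / v for v in vs]                 # fixed-effect (inverse-variance) weights
    z_fe = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    # Cochran's Q and the DL estimate of between-study variance tau^2
    q = sum(w * (z - z_fe) ** 2 for w, z in zip(ws, zs))
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - (len(rs) - 1)) / c) if c > 0 else 0.0
    ws_re = [1.0 / (v + tau2) for v in vs]     # random-effects weights
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1.0 / sum(ws_re))
    lo, hi = z_re - 1.96 * se, z_re + 1.96 * se
    return math.tanh(z_re), (math.tanh(lo), math.tanh(hi))

# Hypothetical example data (not the review's actual studies):
r_pooled, (ci_lo, ci_hi) = pool_correlations([0.70, 0.80, 0.75], [40, 60, 50])
```

The Fisher z-transform is used because correlations are bounded and skewed; pooling on the z scale and back-transforming keeps the CI inside [-1, 1].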

Original language: English (US)
Pages (from-to): 161-173
Number of pages: 13
Journal: Medical Education
Volume: 49
Issue number: 2
DOI: 10.1111/medu.12621
State: Published - Feb 1 2015


ASJC Scopus subject areas

  • Medicine (all)
  • Education

Cite this

A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment. / Ilgen, Jonathan S.; Ma, Irene W Y; Hatala, Rose; Cook, David Allan.

In: Medical Education, Vol. 49, No. 2, 01.02.2015, p. 161-173.

Research output: Contribution to journal › Article

@article{5683adf2c0f745eea46818f4547ee43b,
title = "A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment",
abstract = "Context: The relative advantages and disadvantages of checklists and global rating scales (GRSs) have long been debated. To compare the merits of these scale types, we conducted a systematic review of the validity evidence for checklists and GRSs in the context of simulation-based assessment of health professionals. Methods: We conducted a systematic review of multiple databases including MEDLINE, EMBASE and Scopus to February 2013. We selected studies that used both a GRS and checklist in the simulation-based assessment of health professionals. Reviewers working in duplicate evaluated five domains of validity evidence, including correlation between scales and reliability. We collected information about raters, instrument characteristics, assessment context, and task. We pooled reliability and correlation coefficients using random-effects meta-analysis. Results: We found 45 studies that used a checklist and GRS in simulation-based assessment. All studies included physicians or physicians in training; one study also included nurse anaesthetists. Topics of assessment included open and laparoscopic surgery (n = 22), endoscopy (n = 8), resuscitation (n = 7) and anaesthesiology (n = 4). The pooled GRS-checklist correlation was 0.76 (95{\%} confidence interval [CI] 0.69-0.81, n = 16 studies). Inter-rater reliability was similar between scales (GRS 0.78, 95{\%} CI 0.71-0.83, n = 23; checklist 0.81, 95{\%} CI 0.75-0.85, n = 21), whereas GRS inter-item reliabilities (0.92, 95{\%} CI 0.84-0.95, n = 6) and inter-station reliabilities (0.80, 95{\%} CI 0.73-0.85, n = 10) were higher than those for checklists (0.66, 95{\%} CI 0-0.84, n = 4 and 0.69, 95{\%} CI 0.56-0.77, n = 10, respectively). Content evidence for GRSs usually referenced previously reported instruments (n = 33), whereas content evidence for checklists usually described expert consensus (n = 26). Checklists and GRSs usually had similar evidence for relations to other variables. 
Conclusions: Checklist inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate checklist. Compared with the checklist, the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise. Discuss ideas arising from the article at {"}www.mededuc.com discuss{"}",
author = "Ilgen, {Jonathan S.} and Ma, {Irene W Y} and Rose Hatala and Cook, {David Allan}",
year = "2015",
month = "2",
day = "1",
doi = "10.1111/medu.12621",
language = "English (US)",
volume = "49",
pages = "161--173",
journal = "Medical Education",
issn = "0308-0110",
publisher = "Wiley-Blackwell",
number = "2",

}

TY - JOUR

T1 - A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment

AU - Ilgen, Jonathan S.

AU - Ma, Irene W Y

AU - Hatala, Rose

AU - Cook, David Allan

PY - 2015/2/1

Y1 - 2015/2/1

N2 - Context: The relative advantages and disadvantages of checklists and global rating scales (GRSs) have long been debated. To compare the merits of these scale types, we conducted a systematic review of the validity evidence for checklists and GRSs in the context of simulation-based assessment of health professionals. Methods: We conducted a systematic review of multiple databases including MEDLINE, EMBASE and Scopus to February 2013. We selected studies that used both a GRS and checklist in the simulation-based assessment of health professionals. Reviewers working in duplicate evaluated five domains of validity evidence, including correlation between scales and reliability. We collected information about raters, instrument characteristics, assessment context, and task. We pooled reliability and correlation coefficients using random-effects meta-analysis. Results: We found 45 studies that used a checklist and GRS in simulation-based assessment. All studies included physicians or physicians in training; one study also included nurse anaesthetists. Topics of assessment included open and laparoscopic surgery (n = 22), endoscopy (n = 8), resuscitation (n = 7) and anaesthesiology (n = 4). The pooled GRS-checklist correlation was 0.76 (95% confidence interval [CI] 0.69-0.81, n = 16 studies). Inter-rater reliability was similar between scales (GRS 0.78, 95% CI 0.71-0.83, n = 23; checklist 0.81, 95% CI 0.75-0.85, n = 21), whereas GRS inter-item reliabilities (0.92, 95% CI 0.84-0.95, n = 6) and inter-station reliabilities (0.80, 95% CI 0.73-0.85, n = 10) were higher than those for checklists (0.66, 95% CI 0-0.84, n = 4 and 0.69, 95% CI 0.56-0.77, n = 10, respectively). Content evidence for GRSs usually referenced previously reported instruments (n = 33), whereas content evidence for checklists usually described expert consensus (n = 26). Checklists and GRSs usually had similar evidence for relations to other variables. 
Conclusions: Checklist inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate checklist. Compared with the checklist, the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise. Discuss ideas arising from the article at "www.mededuc.com discuss"

AB - Context: The relative advantages and disadvantages of checklists and global rating scales (GRSs) have long been debated. To compare the merits of these scale types, we conducted a systematic review of the validity evidence for checklists and GRSs in the context of simulation-based assessment of health professionals. Methods: We conducted a systematic review of multiple databases including MEDLINE, EMBASE and Scopus to February 2013. We selected studies that used both a GRS and checklist in the simulation-based assessment of health professionals. Reviewers working in duplicate evaluated five domains of validity evidence, including correlation between scales and reliability. We collected information about raters, instrument characteristics, assessment context, and task. We pooled reliability and correlation coefficients using random-effects meta-analysis. Results: We found 45 studies that used a checklist and GRS in simulation-based assessment. All studies included physicians or physicians in training; one study also included nurse anaesthetists. Topics of assessment included open and laparoscopic surgery (n = 22), endoscopy (n = 8), resuscitation (n = 7) and anaesthesiology (n = 4). The pooled GRS-checklist correlation was 0.76 (95% confidence interval [CI] 0.69-0.81, n = 16 studies). Inter-rater reliability was similar between scales (GRS 0.78, 95% CI 0.71-0.83, n = 23; checklist 0.81, 95% CI 0.75-0.85, n = 21), whereas GRS inter-item reliabilities (0.92, 95% CI 0.84-0.95, n = 6) and inter-station reliabilities (0.80, 95% CI 0.73-0.85, n = 10) were higher than those for checklists (0.66, 95% CI 0-0.84, n = 4 and 0.69, 95% CI 0.56-0.77, n = 10, respectively). Content evidence for GRSs usually referenced previously reported instruments (n = 33), whereas content evidence for checklists usually described expert consensus (n = 26). Checklists and GRSs usually had similar evidence for relations to other variables. 
Conclusions: Checklist inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate checklist. Compared with the checklist, the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise. Discuss ideas arising from the article at "www.mededuc.com discuss"

UR - http://www.scopus.com/inward/record.url?scp=84921531029&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921531029&partnerID=8YFLogxK

U2 - 10.1111/medu.12621

DO - 10.1111/medu.12621

M3 - Article

C2 - 25626747

AN - SCOPUS:84921531029

VL - 49

SP - 161

EP - 173

JO - Medical Education

JF - Medical Education

SN - 0308-0110

IS - 2

ER -