Comparison of three information sources for smoking information in electronic health records

Liwei Wang; Xiaoyang Ruan; Ping Yang; Hongfang Liu

doi:10.4137/CIN.S40604

Research output: Contribution to journal › Review article › peer-review

7 Scopus citations

Abstract

Objective: The primary aim was to compare independent and joint performance of retrieving smoking status through different sources, including narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of retrieving smoking strength information (ie, heavy/light smoker) from narrative text and PPI. Materials and methods: Our study leveraged an existing lung cancer cohort for smoking status, amount, and strength information, which was manually chart-reviewed. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information. After that, heuristic rules were used to obtain smoking status-related information. Smoking information was also obtained from structured data sources based on diagnosis codes and PPI. Sensitivity, specificity, and accuracy were measured using patients with coverage (ie, the proportion of patients whose smoking status/strength can be effectively determined). Results: NLP alone has the best overall performance for smoking status extraction (patient coverage: 0.88; sensitivity: 0.97; specificity: 0.70; accuracy: 0.88); combining PPI with NLP further improved patient coverage to 0.96. ICD-9 does not provide additional improvement to NLP and its combination with PPI. For smoking strength, combining NLP with PPI has slight improvement over NLP alone. Conclusion: These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, the readily available structured data, could be used as a complementary source for more comprehensive patient coverage.
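
The evaluation described in the abstract (coverage first, then sensitivity, specificity, and accuracy over the covered patients only) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `evaluate` and `combine`, the use of `None` for "status could not be determined", the fallback rule for combining two sources, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def evaluate(predicted: Sequence[Optional[bool]], gold: Sequence[bool]) -> Dict[str, float]:
    """Score one source against chart-reviewed labels.

    predicted[i] is True (smoker), False (non-smoker), or None when the
    source could not determine a status (assumed encoding).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    # Only patients with a determined status count toward the metrics.
    covered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    tp = sum(1 for p, g in covered if p and g)
    tn = sum(1 for p, g in covered if not p and not g)
    fp = sum(1 for p, g in covered if p and not g)
    fn = sum(1 for p, g in covered if not p and g)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "coverage": ratio(len(covered), len(gold)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(covered)),
    }


def combine(primary: Sequence[Optional[bool]],
            secondary: Sequence[Optional[bool]]) -> List[Optional[bool]]:
    """Hypothetical combination rule: use the secondary source only where
    the primary source is undetermined."""
    return [p if p is not None else s for p, s in zip(primary, secondary)]


# Toy data, invented for illustration only.
gold = [True, True, True, False, False, True]
nlp = [True, True, None, False, True, None]
ppi = [None, True, True, None, False, False]

nlp_scores = evaluate(nlp, gold)
joint_scores = evaluate(combine(nlp, ppi), gold)
print(nlp_scores)
print(joint_scores)
```

On the toy data, adding the second source raises coverage while the other metrics may move in either direction, which is why coverage is worth reporting alongside them.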

Original language	English (US)
Pages (from-to)	237-242
Number of pages	6
Journal	Cancer Informatics
Volume	15
DOIs	https://doi.org/10.4137/CIN.S40604
State	Published - 2016

Keywords

ICD-9
Natural language processing
Patient-provided information
Smoking status
Smoking strength

ASJC Scopus subject areas

Oncology
Cancer Research

Cite this

@article{7198d78893b2417a931030f4fb3b226b,

title = "Comparison of three information sources for smoking information in electronic health records",

abstract = "Objective: The primary aim was to compare independent and joint performance of retrieving smoking status through different sources, including narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of retrieving smoking strength information (ie, heavy/light smoker) from narrative text and PPI. Materials and methods: Our study leveraged an existing lung cancer cohort for smoking status, amount, and strength information, which was manually chart-reviewed. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information. After that, heuristic rules were used to obtain smoking status-related information. Smoking information was also obtained from structured data sources based on diagnosis codes and PPI. Sensitivity, specificity, and accuracy were measured using patients with coverage (ie, the proportion of patients whose smoking status/strength can be effectively determined). Results: NLP alone has the best overall performance for smoking status extraction (patient coverage: 0.88; sensitivity: 0.97; specificity: 0.70; accuracy: 0.88); combining PPI with NLP further improved patient coverage to 0.96. ICD-9 does not provide additional improvement to NLP and its combination with PPI. For smoking strength, combining NLP with PPI has slight improvement over NLP alone. Conclusion: These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, the readily available structured data, could be used as a complementary source for more comprehensive patient coverage.",

keywords = "ICD-9, Natural language processing, Patient-provided information, Smoking status, Smoking strength",

author = "Liwei Wang and Xiaoyang Ruan and Ping Yang and Hongfang Liu",

note = "Publisher Copyright: {\textcopyright} the authors, publisher and licensee Libertas Academica Limited.",

year = "2016",

doi = "10.4137/CIN.S40604",

language = "English (US)",

volume = "15",

pages = "237--242",

journal = "Cancer Informatics",

issn = "1176-9351",

publisher = "Libertas Academica Ltd.",

}

TY - JOUR

T1 - Comparison of three information sources for smoking information in electronic health records

AU - Wang, Liwei

AU - Ruan, Xiaoyang

AU - Yang, Ping

AU - Liu, Hongfang

N1 - Publisher Copyright: © the authors, publisher and licensee Libertas Academica Limited.

PY - 2016

Y1 - 2016

AB - Objective: The primary aim was to compare independent and joint performance of retrieving smoking status through different sources, including narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of retrieving smoking strength information (ie, heavy/light smoker) from narrative text and PPI. Materials and methods: Our study leveraged an existing lung cancer cohort for smoking status, amount, and strength information, which was manually chart-reviewed. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information. After that, heuristic rules were used to obtain smoking status-related information. Smoking information was also obtained from structured data sources based on diagnosis codes and PPI. Sensitivity, specificity, and accuracy were measured using patients with coverage (ie, the proportion of patients whose smoking status/strength can be effectively determined). Results: NLP alone has the best overall performance for smoking status extraction (patient coverage: 0.88; sensitivity: 0.97; specificity: 0.70; accuracy: 0.88); combining PPI with NLP further improved patient coverage to 0.96. ICD-9 does not provide additional improvement to NLP and its combination with PPI. For smoking strength, combining NLP with PPI has slight improvement over NLP alone. Conclusion: These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, the readily available structured data, could be used as a complementary source for more comprehensive patient coverage.

KW - ICD-9

KW - Natural language processing

KW - Patient-provided information

KW - Smoking status

KW - Smoking strength

UR - http://www.scopus.com/inward/record.url?scp=85012241833&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85012241833&partnerID=8YFLogxK

U2 - 10.4137/CIN.S40604

DO - 10.4137/CIN.S40604

M3 - Review article

AN - SCOPUS:85012241833

SN - 1176-9351

VL - 15

SP - 237

EP - 242

JO - Cancer Informatics

JF - Cancer Informatics

ER -
