Stochastic identification of Malware with dynamic traces

Curtis Storlie, Blake Anderson, Scott Vander Wiel, Daniel Quist, Curtis Hash, Nathan Brown

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

A novel approach to malware classification is introduced based on analysis of instruction traces that are collected dynamically from the program in question. The method has been implemented online in a sandbox environment (i.e., a security mechanism for separating running programs) at Los Alamos National Laboratory, and is intended for eventual host-based use, provided the issue of sampling the instructions executed by a given process without disruption to the user can be satisfactorily addressed. The procedure represents an instruction trace with a Markov chain structure in which the transition matrix, P, has rows modeled as Dirichlet vectors. The malware class (malicious or benign) is modeled using a flexible spline logistic regression model with variable selection on the elements of P, which are observed with error. The utility of the method is illustrated on a sample of traces from malware and nonmalware programs, and the results are compared to other leading detection schemes (both signature and classification based). This article also has supplementary materials available online.
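As a rough illustration of the Markov chain representation described in the abstract, the sketch below estimates a transition matrix P from a single instruction trace, treating each row as a Dirichlet vector. It is not the authors' implementation: the symmetric prior weight `alpha`, the function name `transition_posterior_mean`, and the toy instruction names are all assumptions made for this example, and the spline logistic regression stage is not shown.

```python
def transition_posterior_mean(trace, states, alpha=1.0):
    """Posterior-mean estimate of a Markov transition matrix.

    Each row of P gets a symmetric Dirichlet(alpha) prior (an assumption
    made for this sketch), so the posterior mean of entry (i, j) is
    (count_ij + alpha) / (row_total_i + alpha * k).
    """
    index = {s: i for i, s in enumerate(states)}
    k = len(states)

    # Count observed transitions between consecutive instructions.
    counts = [[0] * k for _ in range(k)]
    for current, following in zip(trace, trace[1:]):
        counts[index[current]][index[following]] += 1

    # Smooth each row with the Dirichlet prior and normalize.
    matrix = []
    for row in counts:
        total = sum(row) + alpha * k
        matrix.append([(c + alpha) / total for c in row])
    return matrix


# Hypothetical toy trace; real traces would come from a sandboxed run.
states = ["mov", "push", "call", "pop"]
trace = ["mov", "push", "call", "mov", "push", "pop", "mov"]
P = transition_posterior_mean(trace, states)

for state, row in zip(states, P):
    print(state, [round(p, 3) for p in row])
```

Because the prior adds a positive pseudo-count to every cell, transitions never seen in a short trace still receive nonzero probability, which is one way to reflect that the elements of P are observed with error.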

Original language: English (US)
Pages (from-to): 1-18
Number of pages: 18
Journal: Annals of Applied Statistics
Volume: 8
Issue number: 1
DOI: https://doi.org/10.1214/13-AOAS703
State: Published - 2014
Externally published: Yes

Fingerprint

Malware
Trace
Logistic Regression Model
Transition Matrix
Signature Scheme
Variable Selection
Splines
Markov processes
Dirichlet
Spline
Logistics
Markov chain
Sampling

Keywords

  • Adaptive lasso
  • Classification
  • Elastic net
  • Empirical bayes
  • Logistic regression
  • Malware detection
  • Relaxed lasso
  • Splines

ASJC Scopus subject areas

  • Statistics, Probability and Uncertainty
  • Modeling and Simulation
  • Statistics and Probability

Cite this

Storlie, C., Anderson, B., Vander Wiel, S., Quist, D., Hash, C., & Brown, N. (2014). Stochastic identification of Malware with dynamic traces. Annals of Applied Statistics, 8(1), 1-18. https://doi.org/10.1214/13-AOAS703

@article{ab6a2f4787f1403c837677e46b49ee69,
title = "Stochastic identification of Malware with dynamic traces",
abstract = "A novel approach to malware classification is introduced based on analysis of instruction traces that are collected dynamically from the program in question. The method has been implemented online in a sandbox environment (i.e., a security mechanism for separating running programs) at Los Alamos National Laboratory, and is intended for eventual host-based use, provided the issue of sampling the instructions executed by a given process without disruption to the user can be satisfactorily addressed. The procedure represents an instruction trace with a Markov chain structure in which the transition matrix, P, has rows modeled as Dirichlet vectors. The malware class (malicious or benign) is modeled using a flexible spline logistic regression model with variable selection on the elements of P, which are observed with error. The utility of the method is illustrated on a sample of traces from malware and nonmalware programs, and the results are compared to other leading detection schemes (both signature and classification based). This article also has supplementary materials available online.",
keywords = "Adaptive lasso, Classification, Elastic net, Empirical bayes, Logistic regression, Malware detection, Relaxed lasso, Splines",
author = "Curtis Storlie and Blake Anderson and {Vander Wiel}, Scott and Daniel Quist and Curtis Hash and Nathan Brown",
year = "2014",
doi = "10.1214/13-AOAS703",
language = "English (US)",
volume = "8",
pages = "1--18",
journal = "Annals of Applied Statistics",
issn = "1932-6157",
publisher = "Institute of Mathematical Statistics",
number = "1",

}
