DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar
University of Utah
Background
System Event Log
Available on practically every computer system!
Automatic analysis?
Automatically detected anomaly
System Event Log → LOG PARSING → Structured Data (message type / log key)
Logging statement in source code:
printf("Started service %s on port %d", x, y);
Raw log messages:
Started service A on port 80
Executor updated: app-1 is now LOADING
……
Parsed log keys:
Started service * on port * (log key ID: 1)
Executor updated: * is now LOADING (log key ID: 2)
……
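The parsing step above can be sketched in a few lines. This is only an illustration: the templates are written by hand here, whereas a real log parser would derive them from the logs or the source code. The names `TEMPLATES`, `template_to_regex`, and `parse` are hypothetical.

```python
import re

# Hand-written templates for illustration only (an assumption of this sketch).
TEMPLATES = {
    1: "Started service * on port *",
    2: "Executor updated: * is now LOADING",
}

def template_to_regex(template):
    # Escape the literal text and turn each '*' wildcard into a capture group.
    parts = [re.escape(part) for part in template.split("*")]
    return re.compile("^" + r"(\S+)".join(parts) + "$")

COMPILED = {key_id: template_to_regex(t) for key_id, t in TEMPLATES.items()}

def parse(message):
    """Return (log key ID, parameter list), or (None, []) if nothing matches."""
    for key_id, pattern in COMPILED.items():
        match = pattern.match(message)
        if match:
            return key_id, list(match.groups())
    return None, []

print(parse("Started service A on port 80"))
print(parse("Executor updated: app-1 is now LOADING"))
```

Messages that match no template come back with a key of `None`, which a caller could treat as a previously unseen log key.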
System Event Log → LOG PARSING → Structured Data (message type / log key) → LOG ANALYSIS → Anomaly Detection
Message count vector:
Xu’SOSP09, Lou’ATC10, etc.
Problem: Offline batched processing
Build workflow model:
Lou’KDD10, Beschastnikh’ICSE14, Yu’ASPLOS16, etc.
Problem: Only for simple execution path anomalies
Common problem: Only log keys (message types) are considered.
DeepLog
Spell: a streaming log parser published in ICDM’16
log message → log key + parameters
Deletion of file1 complete. → Deletion of * complete. + [file1]
Deletion of file2 complete. → Deletion of * complete. + [file2]
log message (log key underlined) | log key | parameter value vector
t_1 Deletion of file1 complete | k_1 | [t_1 - t_0, file1]
t_2 Took 0.61 seconds to deallocate network … | k_2 | [t_2 - t_1, 0.61]
t_3 VM Stopped (Lifecycle Event) | k_3 | [t_3 - t_2]
… | … | …
Anomaly Detection
Diagnosis
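As a rough sketch of how the parameter value vectors in the table could be assembled, the function below prepends the time elapsed since the previous entry to each entry's parameters. The timestamps, the tuple layout, and the name `to_vectors` are assumptions made for illustration.

```python
def to_vectors(entries):
    """entries: list of (timestamp_seconds, log_key, params) in log order.

    Returns a list of (log_key, parameter value vector), where each vector
    starts with the time elapsed since the previous entry.
    """
    result = []
    previous_time = None
    for timestamp, log_key, params in entries:
        # The first entry has no predecessor in this sketch, so its elapsed
        # time is set to 0.0 (an assumption; the table uses t_1 - t_0).
        elapsed = 0.0 if previous_time is None else timestamp - previous_time
        result.append((log_key, [elapsed] + list(params)))
        previous_time = timestamp
    return result

# Hypothetical timestamps for the three example messages.
entries = [
    (10.0, "k1", ["file1"]),
    (10.5, "k2", [0.61]),
    (12.0, "k3", []),
]
vectors = to_vectors(entries)
print(vectors)
```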
DeepLog Architecture
[Figure: architecture diagram; recoverable labels: Training Stage, Detection Stage, MODELS]
Log Key Anomaly Detection model
Example log key sequence:
25 18 54 57 18 56 … 25 18 54 57 56 18 …
➢ a rigorous set of logic and control flows
➢ a (more structured) natural language
natural language modeling
multi-class classifier: history sequence => next key to appear
A log key is detected to be abnormal if it does not follow the prediction.
Use long short-term memory (LSTM) architecture
Training:
log key sequence: 25 18 54 57 18 56 … 25 18 54 57 56 18 …
history window size h=3
Detection:
In the detection stage, DeepLog checks whether the actual next log key is among its top g most probable predictions.
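The training windows and the top-g check can be illustrated without a neural network. In the sketch below, a frequency table stands in for the LSTM classifier; it is not the model from the talk. The sequence joins the two visible fragments of the slide's example and ignores the elided part, which is also an assumption.

```python
from collections import Counter, defaultdict

def windows(sequence, h):
    """Yield (history, next_key) pairs using a sliding window of size h."""
    for i in range(len(sequence) - h):
        yield tuple(sequence[i:i + h]), sequence[i + h]

def train(sequence, h):
    # Count which key follows each history window. This lookup table is a
    # simple stand-in for the LSTM multi-class classifier.
    model = defaultdict(Counter)
    for history, next_key in windows(sequence, h):
        model[history][next_key] += 1
    return model

def is_anomalous(model, history, actual, g):
    """Flag `actual` if it is not among the top-g predicted next keys."""
    counts = model.get(tuple(history), Counter())
    top_g = [key for key, _ in counts.most_common(g)]
    return actual not in top_g

normal = [25, 18, 54, 57, 18, 56, 25, 18, 54, 57, 56, 18]
model = train(normal, h=3)
print(list(windows(normal, 3))[:2])
print(is_anomalous(model, [25, 18, 54], 57, g=1))
print(is_anomalous(model, [25, 18, 54], 56, g=1))
```

A history that never appeared in training yields no predictions, so any key following it is flagged. A learned model would generalize instead of doing an exact lookup.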
Workflow Construction
Input: log key sequence
25 18 54 57 18 56 … 25 18 54 57 56 18 …
Output: workflow model [Figure]
Method 1: Using Log Key Anomaly Detection model
--- LSTM prediction probabilities
An example of concurrency detection: [Figure]
Method 2: A density-based clustering approach
Co-occurrence matrix of log keys (k_i, k_j) within distance d
f_d(k_i, k_j): the frequency of (k_i, k_j) appearing together within distance d
f(k_i): the frequency of k_i in the input sequence
p_d(i, j): the probability of (k_i, k_j) appearing together within distance d
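The quantities defined for Method 2 can be computed directly from a key sequence. The slide does not show how p_d is normalized, so this sketch assumes p_d(i, j) = f_d(k_i, k_j) / (d · f(k_i)) and counts only pairs where k_j follows k_i. Both choices are assumptions, and the clustering step itself is not shown.

```python
from collections import Counter

def cooccurrence(sequence, d):
    """Return p_d as a dict {(k_i, k_j): value} for pairs seen within distance d."""
    f = Counter(sequence)   # f(k_i): frequency of each key in the sequence
    f_d = Counter()         # f_d(k_i, k_j): pair frequency within distance d
    for pos, k_i in enumerate(sequence):
        for k_j in sequence[pos + 1:pos + 1 + d]:
            f_d[(k_i, k_j)] += 1
    # Assumed normalization: divide by d * f(k_i), since each occurrence of
    # k_i can pair with up to d following keys.
    return {pair: count / (d * f[pair[0]]) for pair, count in f_d.items()}

p = cooccurrence([25, 18, 54, 57, 18, 56], d=1)
print(p[(25, 18)])
print(p[(18, 54)])
```

Pairs with a high value are candidates for belonging to the same task, while low values suggest keys from unrelated, interleaved tasks.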
Parameter Value Anomaly Detection model
Example:
Log messages of a particular log key:
t_2: Took 0.61 seconds to deallocate network …
t'_2: Took 1.1 seconds to deallocate network …
….
Parameter value vectors over time:
[t_2 - t_1, 0.61], [t'_2 - t'_1, 1.1], ….
Multivariate time series data anomaly detection problem!
✓ Leverage LSTM-based approach;
✓ A parameter value vector is given as input at each time step;
✓ An anomaly is detected if the mean squared error (MSE) between prediction and actual data exceeds a threshold.
[Figure: value over time, showing history, prediction, and actual value; check: MSE > Threshold?]
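The MSE check can be sketched as follows. The predictions are taken as given, since the LSTM itself is out of scope here. Deriving the threshold from the mean and standard deviation of errors on normal validation data is an assumption of this sketch, as are the names `fit_threshold` and `z` and all of the numbers.

```python
import statistics

def mse(predicted, actual):
    """Mean squared error between two equal-length numeric vectors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def fit_threshold(validation_errors, z=2.0):
    # Assumption: treat validation errors as roughly Gaussian and place the
    # threshold z standard deviations above their mean.
    mean = statistics.mean(validation_errors)
    std = statistics.pstdev(validation_errors)
    return mean + z * std

def is_anomalous(predicted, actual, threshold):
    return mse(predicted, actual) > threshold

# Hypothetical errors observed on normal validation data.
threshold = fit_threshold([0.01, 0.02, 0.015, 0.012, 0.018])

# Vectors are [elapsed time, duration]; values are made up for illustration.
print(is_anomalous([0.5, 0.61], [0.5, 0.63], threshold))
print(is_anomalous([0.5, 0.61], [0.5, 1.10], threshold))
```

Only numeric parameters are handled. A categorical value such as a file name would need to be encoded or excluded first.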
LSTM model online update
Q: How to handle false positives?
Log sequence: history | current
history → model → prediction
prediction vs. current → Anomaly?
Anomaly? Yes → False positive?
False positive? Yes → update model using this case: “history -> current”
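The feedback loop above can be sketched with the same kind of frequency-table stand-in for the LSTM. With a real LSTM the update would presumably be further training on the corrected example. The `is_false_positive` callback represents operator feedback and is hypothetical.

```python
from collections import Counter, defaultdict

class KeyModel:
    """Frequency-table stand-in for the LSTM next-key predictor."""

    def __init__(self, g=1):
        self.g = g
        self.table = defaultdict(Counter)

    def update(self, history, current):
        # Incremental update with one observed "history -> current" case.
        self.table[tuple(history)][current] += 1

    def is_anomalous(self, history, current):
        counts = self.table.get(tuple(history), Counter())
        top_g = [key for key, _ in counts.most_common(self.g)]
        return current not in top_g

def process(model, history, current, is_false_positive):
    """Return True if an anomaly is reported; learn from false positives."""
    if not model.is_anomalous(history, current):
        return False
    if is_false_positive(history, current):
        model.update(history, current)  # fold the corrected case back in
        return False
    return True

model = KeyModel(g=2)
model.update([25, 18, 54], 57)
# Hypothetical feedback: the operator marks this alert as a false positive.
reported = process(model, [25, 18, 54], 56, lambda h, c: True)
still_flagged = model.is_anomalous([25, 18, 54], 56)
print(reported, still_flagged)
```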
Evaluation – log key anomaly detection
Evaluation results on HDFS log data [1] (over a million log entries with labeled anomalies).
[1] PCA (SOSP’09), IM (UsenixATC’10), N-gram (baseline language model)
[Figure: results chart; axis note: “Up is good”]
Evaluation – parameter value anomaly detection
Evaluation results on OpenStack cloud log with different confidence intervals (CIs)
OpenStack cloud log: generated on CloudLab; VM creation/deletion operations; injected performance anomalies.
MSE: mean square error
[Figure: MSE plot with annotations “thresholds”, “ANOMALY”, and “False Positive”]
Evaluation – LSTM model online update
Evaluation on Blue Gene/L log, with and without online model update.
Blue Gene/L log: HPC log with labeled anomalies; available at https://www.usenix.org/cfdr-data
[Figure: results chart; axis note: “Up is good”]
Evaluation – case study: network security log
Dataset: IEEE VAST Challenge 2011 (Mini Challenge 2 – Computer Networking Operations)
The dataset contains firewall log, IDS log, etc.
Detection results.
Annotation: Could be fixed with prior knowledge of “documented IP”
Evaluation – workflow construction
[Figure: Constructed workflow of VM Creation (previously generated OpenStack cloud log); annotations: “Parameter value anomaly”, “Time difference (performance) anomaly”]
How does it help to diagnose anomalies?
Injected anomaly: During VM creation, network speed from controller to compute node is throttled.
Identified anomaly: Instance took too long to build because of the transition from 52 -> 53
Summary
DeepLog
➢ A real-time system log anomaly detection framework.
➢ LSTM is used to model system execution paths and log parameter values.
➢ Workflow models are built to help anomaly diagnosis.
➢ It supports online model update.
Min Du
Feifei Li
Thank you!