1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Mool – Automated Root Cause Analysis using ML
Rohit Choudhary & Gaurav Nagar, Hortonworks
Introduction
HDP
– Cumulative Big Data package with 25+ certified open source Apache projects
– Source code arrives from both the community and internal engineering
QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; hundreds of new tests are added every release
Release Stability
– Determined by system test failure and pass percentages
– Once new features and system tests are at 100%, we call the release done!
Releases
– On-premise releases
– Cloud releases: HDI and HDC
Problem Statement
Test Suite Size
– System tests are organized as suites, also called splits (700)
– Several thousand test cases, executed in every run
Infrastructure
– YarnCloud infrastructure and OpenStack infrastructure
– 700 × 5-node+ HDP clusters: creation and teardown
– Test suites are run on each cluster and logs are collected
– Tests produce 1–1.5 TB of system logs across our stack every day
Failure Assessments and Subsequent Process
– Component owners take responsibility for identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)
Mool – Automated Log Analysis
Root cause across components in one click
– Identify common failure causes across components
Recommend actions, instead of assisted search, using
– Systemic knowledge/repository of errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems
Integrate with downstream process lifecycle
– Test analysis
– Ticketing system integration
Mool – Sanskrit for "root"
Past Industry Efforts – AALA @Siemens
Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the test system verdicts.”
RCA Analysis Process
1. Feature Extraction
– Log Message Feature Extraction
– Test Failure Feature Extraction
2. Enrichment
– Enriched with Test Execution Time
– Origin Components
3. Learning
– Error Categorization
– RCA Analysis
– Error Repository Upgrades
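The log message feature extraction step can be illustrated with a minimal sketch. It assumes log4j-style lines and uses a plain regular expression as a stand-in for the Grok parsers mentioned later in the talk; the line layout, the `extract_errors` helper, and the sample lines are hypothetical, not Mool's actual parsers.

```python
import re
from datetime import datetime

# Hypothetical layout: "<date> <time>,<ms> <LEVEL> <logger>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"(?P<level>[A-Z]+) +(?P<logger>\S+): (?P<message>.*)$"
)

def extract_errors(lines, origin):
    """Collect ERROR entries and attach the stack-trace lines that follow each."""
    errors, current = [], None
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            current = None  # a new log entry ends any open stack trace
            if m.group("level") == "ERROR":
                current = {
                    "occurrence_time": datetime.strptime(
                        m.group("ts"), "%Y-%m-%d %H:%M:%S"
                    ),
                    "origin": origin,
                    "message": m.group("message"),
                    "stacktrace": [],
                }
                errors.append(current)
        elif current is not None and line.strip():
            # Continuation lines (exception name, "at ..." frames) belong
            # to the most recent ERROR entry.
            current["stacktrace"].append(line.strip())
    return errors

# Made-up sample lines for illustration only.
sample = [
    "2017-04-01 10:15:02,123 INFO  server.Main: started",
    "2017-04-01 10:15:03,456 ERROR exec.Task: Job failed",
    "java.io.IOException: connection reset",
    "\tat example.Client.call(Client.java:42)",
    "2017-04-01 10:15:04,000 WARN  server.Main: retrying",
]
errors = extract_errors(sample, origin="hive")
```

Each extracted record carries the message, stack trace, occurrence time, and origin component, which is the information the enrichment and learning stages need.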
Algebra
[Diagram: test cases TC1–TC4 linked to errors E1–E4, keyed by Run ID, Component, and Suite]
Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}
Error – Test Case Correlation (Conversely)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}
Where Components = {C1, C2, C3, C4}
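The two correlations are inverses of each other, which a short sketch makes concrete. The helper names `invert` and `common_errors` are illustrative, not taken from the talk. Inverting the four test case sets exactly as listed places TC3 in E4's set as well, because TC3 contains E4.

```python
from collections import defaultdict

# Test case -> errors, as listed on the slide.
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}

def invert(mapping):
    """Build the error -> test cases view from the test case -> errors view."""
    inverse = defaultdict(set)
    for test_case, errs in mapping.items():
        for err in errs:
            inverse[err].add(test_case)
    return dict(inverse)

def common_errors(mapping, test_cases):
    """Errors shared by every one of the given test cases."""
    return set.intersection(*(mapping[tc] for tc in test_cases))

error_test_cases = invert(test_case_errors)
shared = common_errors(test_case_errors, ["TC1", "TC4"])
```

Errors that appear in many failing test cases, such as E1 and E4 here, are natural first candidates when looking for a common cause.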
Algebra - Explained
[Diagram: Single Cluster Run. Suites i, j, k, l, … n run between T = 0 and T = f; at a time T = t each suite has its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…).]
Algebra - Explained
[Diagram: Multi-cluster Run. Suite i is run in successive time windows (T = 0 to f, t1 to f2, t2 to f3, t3 to f4), sampled at T = t, t2, t3, t4; each run has its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…).]
Error Paths and Feature Extraction
[Diagram: call stacks exercised by the test suites (Hive Suite, Spark, Oozie Workflow), with errors E1, E2, E3 … En raised along each stack]
– Hive Server2, Yarn, ATS, HDFS
– Livy, Yarn, HDFS
– Pig, Hive, Yarn, HDFS
Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
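The two feature sets can be modeled as simple records, and the test execution time can then be used to link errors to failed test cases. This is a sketch under stated assumptions: the class names, the `"FAILED"` status value, the window-based linking rule, and all sample values are hypothetical; the talk only lists the feature names.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class CaseRecord:
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str

@dataclass(frozen=True)
class ErrorRecord:
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str

def errors_for_failures(test_cases, errors):
    """Map each failed test case to the errors logged during its execution window."""
    linked = {}
    for tc in test_cases:
        if tc.status != "FAILED":  # assumed status label
            continue
        linked[tc.name] = [
            e for e in errors
            if tc.start_time <= e.occurrence_time <= tc.end_time
        ]
    return linked

def at(hour, minute):
    return datetime(2017, 4, 1, hour, minute)

# Made-up records for illustration only.
cases = [
    CaseRecord("test_insert", "hive", at(10, 0), at(10, 5), "FAILED"),
    CaseRecord("test_select", "hive", at(10, 5), at(10, 10), "PASSED"),
]
errs = [
    ErrorRecord("java.io.IOException", "Job failed", at(10, 3), "yarn", "runtime", "yarn.log"),
    ErrorRecord("java.net.SocketTimeoutException", "Timeout", at(10, 7), "hdfs", "io", "hdfs.log"),
]
linked = errors_for_failures(cases, errs)
```

Here only `test_insert` failed, and only the Yarn error falls inside its execution window, so that error is the one linked to the failure.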
Salient Points: Failure Sample & Error Samples
Test Case Failures
System Interactions
Ensemble Modeling & Learning
Customer Reports
Data Pipeline
Source Code
Historical Error DB
Ticket Systems
Recommendations
Automated Actions
Metadata Store
Application Architecture
Ingestion
– Log Accumulation/release branch
– Grok parsers for HDP/Ambari components
Unsupervised Learning
– Identical Match (Stacktrace)
– Nearest Match (Levenshtein Adaptation)
– RCA/Associative Analysis
– Error Hierarchy Association
Outcome
– Automated Ticket Processing
– Recommendation Based on Recency
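The two matching steps can be sketched as follows. The talk does not describe its Levenshtein adaptation, so this uses the textbook edit distance normalized by length; the `fingerprint` hashing scheme, the 0.8 threshold, and the sample messages are assumptions for illustration.

```python
import hashlib

def fingerprint(stacktrace):
    """Identical match: hash of the whitespace-normalized stack trace."""
    normalized = "\n".join(
        line.strip() for line in stacktrace.splitlines() if line.strip()
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def levenshtein(a, b):
    """Textbook edit distance, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def nearest_match(message, known_messages, threshold=0.8):
    """Most similar known message, or None if nothing reaches the threshold."""
    best = max(known_messages, key=lambda k: similarity(message, k), default=None)
    if best is not None and similarity(message, best) >= threshold:
        return best
    return None

known = ["Connection refused to host node-07", "Disk quota exceeded"]
match = nearest_match("Connection refused to host node-12", known)
```

Exact fingerprints catch repeats of the same stack trace cheaply, while the nearest match groups messages that differ only in variable parts such as host names or IDs.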
Deployment Architecture
[Diagram: three tiers labeled Data Source, Data Processing, and Application]
– Test Clusters
– Log Daemon: log daemons push logs into HDFS
– Trigger analysis at end of run
– Split Processor
– Livy (Job Server)
– Spark Jobs
– Storage
– HDFS
– Metadata Store
– Web Application
– Manual input for selection/rejection of outcome
RCA Versus Error Graph Creation
Error Graph Creation Failed
– The FP-Growth algorithm did not yield the desired results
– Too many closed loops and cyclic dependencies
– Time as a split dimension was not enough
Moved towards RCAs
– Origin of the error chain was easier to find
– Accuracy was higher
– Enough data supporting multiple code flows
Easier to validate through our system analysts
– Unsupervised learning is hard to validate without manual intervention
RCA Rejections
False positives are very prevalent
– Dominating exceptions arise from frequently executed code paths
– They are repetitive and need to be ignored statistically, based on decile values
Priority versus Ignored versus Historical
– Historical RCAs, based on source code changes and recency, allow the final decision
– If corresponding tickets are open, those issues take priority
Common Exceptions or Common RCAs
– Prioritize the ones that are causing cross-component failures
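One way to read these rejection and priority rules is sketched below. The cut-off at the ninth decile, the sort order (open ticket, then number of affected components, then recency), and all field names and sample values are assumptions; the talk states the rules only in general terms.

```python
from statistics import quantiles

def dominating_errors(occurrences):
    """Errors whose occurrence count lies above the ninth decile cut point."""
    cut_points = quantiles(list(occurrences.values()), n=10)  # 9 cut points
    top_decile = cut_points[-1]
    return {err for err, count in occurrences.items() if count > top_decile}

def rank_candidates(candidates, ignored):
    """Drop ignored errors, then order by open ticket, reach, and recency."""
    kept = [c for c in candidates if c["error"] not in ignored]
    return sorted(
        kept,
        key=lambda c: (c["open_ticket"], len(c["components"]), c["last_seen_run"]),
        reverse=True,
    )

# Made-up occurrence counts: E1 comes from a very frequently executed code path.
occurrences = {
    "E1": 5000, "E2": 12, "E3": 9, "E4": 7, "E5": 6,
    "E6": 5, "E7": 4, "E8": 3, "E9": 2, "E10": 1,
}
candidates = [
    {"error": "E1", "open_ticket": False, "components": {"hive"}, "last_seen_run": 14410},
    {"error": "E2", "open_ticket": False, "components": {"hive", "yarn", "hdfs"}, "last_seen_run": 14400},
    {"error": "E3", "open_ticket": True, "components": {"pig"}, "last_seen_run": 14000},
    {"error": "E4", "open_ticket": False, "components": {"hive"}, "last_seen_run": 14409},
]
ignored = dominating_errors(occurrences)
ranked = [c["error"] for c in rank_candidates(candidates, ignored)]
```

In this example the dominating exception E1 is dropped, the error with an open ticket comes first, and the cross-component error E2 outranks the single-component E4.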
RCA Graph
Quick Stats
Item | Data
Total Run IDs Analyzed | 14410
Total Splits across components | 115 K
Raw errors parsed from logs | 120 M
Unique Errors | 45025
Total Test Case failures | 170 K
Errors related to Failed Test Cases | 592 K
Unique Errors related to Failed Test Cases | 30570
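A few ratios follow directly from these figures. The sketch below only does arithmetic on the numbers in the table; the "K" and "M" values are rounded on the slide, so the results are approximate, and the variable names are illustrative.

```python
# Figures from the Quick Stats table ("K" and "M" values are rounded).
raw_errors = 120_000_000
unique_errors = 45_025
failed_test_cases = 170_000
errors_on_failures = 592_000
unique_errors_on_failures = 30_570

# How strongly deduplication compresses the raw error stream.
raw_per_unique = raw_errors / unique_errors
# Average number of related errors per failed test case.
errors_per_failure = errors_on_failures / failed_test_cases
# Share of all unique errors that are tied to at least one failed test case.
unique_share = unique_errors_on_failures / unique_errors

print(f"raw errors per unique error: {raw_per_unique:.0f}")
print(f"errors per failed test case: {errors_per_failure:.1f}")
print(f"unique errors tied to failures: {unique_share:.1%}")
```

This works out to roughly 2,665 raw errors per unique error, about 3.5 related errors per failed test case, and about 68% of unique errors being tied to a failed test case.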
Adoption Challenges
Great for fast-changing code bases
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement
Log collection required multiple iterations
– Order of logs being written and collected
– Central log server issues
Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal, so recency parameters are harder to provide
Unsupervised learning verification is harder
– Very hard to judge model performance effectively without manual intervention
Future Work
Unsupervised learning validation using automated techniques
Online processing using Spark Streaming
Event-based error detection on live production clusters
Correlation with other log events/customer use cases
Thank You
Rohit Choudhary & Gaurav Nagar