Mool - Automated Log Analysis using Data Science and ML


Upload: dataworks-summithadoop-summit

Post on 13-Apr-2017


Page 1: Mool - Automated Log Analysis using Data Science and ML

1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved

Mool – Automated Root Cause Analysis using ML

Rohit Choudhary & Gaurav Nagar, Hortonworks

Page 2: Mool - Automated Log Analysis using Data Science and ML


Introduction

HDP
– Cumulative Big Data Package with 25+ Certified Open Source Apache Projects
– Source Code arrives from both the Community and Internal Engineering

QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; 100s of new tests are added every release

Release Stability
– Determined by System Test failure and pass percentages
– Once new features and System Tests are at 100%, we call the release done!

Releases
– On-premise Releases
– Cloud Releases – HDI and HDC

Page 3: Mool - Automated Log Analysis using Data Science and ML


Problem Statement

Test Suite Size
– System Tests are organized as Suites, also called Splits – 700
– Several 1000s of test cases, executed in every run

Infrastructure
– YarnCloud Infrastructure & OpenStack Infrastructure
– 700 x 5-Node+ HDP Clusters – Creation and Teardown
– Test Suites are run on each cluster and Logs are collected
– Tests produce 1-1.5 TB of System Logs across our stack every day

Failure Assessments and Subsequent Process
– Component owners take responsibility for identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)

Page 4: Mool - Automated Log Analysis using Data Science and ML


Mool – Automated Log Analysis

Root Cause across components in one click
– Identify common failure causes across components

Recommend Actions instead of assisted search, with
– Systemic Knowledge/Repository of Errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems

Integrate with downstream process lifecycle
– Test Analysis
– Ticketing system integration

Mool – Sanskrit, meaning "Root"

Page 5: Mool - Automated Log Analysis using Data Science and ML


Past Industry Efforts – AALA @Siemens

Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet

“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the test system verdicts.”

Page 6: Mool - Automated Log Analysis using Data Science and ML


RCA Analysis Process

1. Feature Extraction
– Log Message Feature Extraction
– Test Failure Feature Extraction

2. Enrichment
– Enriched with Test Execution Time
– Origin Components

3. Learning
– Error Categorization
– RCA Analysis
– Error Repository Upgrades
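As a rough illustration of the log message feature extraction step, the sketch below pulls ERROR entries and their stack traces out of raw log lines. The log layout, the regular expression, and the field names (`occurrence_time`, `origin`, `message`, `stacktrace`) are assumptions made for this example; the talk does not show the actual parsers.

```python
import re
from datetime import datetime

# Assumed log4j-style layout: "2017-04-13 10:15:31,456 ERROR [origin] message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<origin>[^\]]+)\]\s+(?P<message>.*)$"
)


def extract_errors(lines):
    """Return one dict per ERROR entry; continuation lines become its stack trace."""
    errors, current = [], None
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # A new log entry starts here, so any open stack trace is finished.
            current = None
            if match.group("level") == "ERROR":
                current = {
                    "occurrence_time": datetime.strptime(
                        match.group("ts"), "%Y-%m-%d %H:%M:%S"
                    ),
                    "origin": match.group("origin"),
                    "message": match.group("message"),
                    "stacktrace": [],
                }
                errors.append(current)
        elif current is not None and line.strip():
            # Lines without a timestamp belong to the preceding ERROR entry.
            current["stacktrace"].append(line.strip())
    return errors


sample = [
    "2017-04-13 10:15:30,123 INFO [HiveServer2] Starting query",
    "2017-04-13 10:15:31,456 ERROR [HiveServer2] Failed to execute query",
    "java.lang.RuntimeException: boom",
    "\tat org.example.Foo.bar(Foo.java:42)",
    "2017-04-13 10:15:32,000 INFO [HiveServer2] Done",
]
errors = extract_errors(sample)
```

Only the ERROR entry survives, with its two continuation lines attached as the stack trace. A real deployment would need one pattern per component log format.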

Page 7: Mool - Automated Log Analysis using Data Science and ML


Algebra

[Diagram: test cases TC1–TC4 linked to errors E1–E4, keyed by Run ID, Component, and Suite]

Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}

Error – Test Case Correlation (Conversely)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}

Where Components = {C1, C2, C3, C4}
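The converse correlation can be derived mechanically by inverting the test case to error mapping. The sketch below does that for the four sets on this slide; inverting them as written gives E4 = {TC1, TC3, TC4}, because TC3 also contains E4. Ranking errors by how many test cases they touch is an illustrative assumption, not a step described in the talk.

```python
from collections import defaultdict

# Test case -> errors, taken from the sets on the slide.
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(mapping):
    """Turn {test_case: errors} into {error: test_cases}."""
    inverse = defaultdict(set)
    for test_case, errors in mapping.items():
        for error in errors:
            inverse[error].add(test_case)
    return dict(inverse)


error_test_cases = invert(test_case_errors)

# Assumed heuristic: errors shared by more test cases come first; ties by name.
ranked = sorted(error_test_cases, key=lambda e: (-len(error_test_cases[e]), e))
```

With this data, E1 and E4 each appear in three test cases, E3 in two, and E2 in one.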

Page 8: Mool - Automated Log Analysis using Data Science and ML


Algebra - Explained

[Diagram: Single Cluster Run. A timeline from T = 0 to T = f with a marker at T = t; Suite i, Suite j, Suite k, Suite l, … Suite n each carry their own Errors (E1, E2, E3…) and Test Cases (TC1, TC2, TC3…).]

Page 9: Mool - Automated Log Analysis using Data Science and ML


Algebra - Explained

[Diagram: Multi-cluster Run. Suite i appears on four timelines (T = 0 to T = f, T = t1 to T = f2, T = t2 to T = f3, T = t3 to T = f4), each with its own Errors (E1, E2, E3…) and Test Cases (TC1, TC2, TC3…).]

Page 10: Mool - Automated Log Analysis using Data Science and ML


Error Paths and Feature Extraction

[Diagram: stack calls per test suite, with the errors collected along each path]

Test Suites and Stack Calls
– Hive Suite: Hive Server2, Yarn, ATS, HDFS
– Spark: Livy, Yarn, HDFS
– Oozie Workflow: Pig, Hive, Yarn, HDFS

Errors: E1, E2, E3; E1, E2, E3, E4, E5, E6; En…

Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
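The two feature sets lend themselves to simple record types. The sketch below models them and attributes errors to failed test cases by checking whether an error's `occurrence_time` falls inside a test's execution window. The class names, the `"FAILED"` status value, and the time-window rule are assumptions for illustration; the talk lists the features but not how they are joined.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TestCaseRecord:  # hypothetical name; fields follow the slide
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass(frozen=True)
class ErrorRecord:  # hypothetical name; fields follow the slide
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str


def errors_for_failed_tests(test_cases, errors):
    """Map each failed test case name to the errors logged while it was running."""
    result = {}
    for tc in test_cases:
        if tc.status != "FAILED":  # assumed status value
            continue
        result[tc.name] = [
            e for e in errors if tc.start_time <= e.occurrence_time <= tc.end_time
        ]
    return result


def at(second):
    """Helper for the example: a timestamp within one fixed minute."""
    return datetime(2017, 4, 13, 10, 0, second)


tests = [
    TestCaseRecord("tc1", "hive", at(0), at(10), "FAILED"),
    TestCaseRecord("tc2", "hive", at(11), at(20), "PASSED"),
]
errs = [
    ErrorRecord("trace-a", "msg a", at(5), "HiveServer2", "runtime", "hs2.log"),
    ErrorRecord("trace-b", "msg b", at(15), "Yarn", "runtime", "rm.log"),
]
attributed = errors_for_failed_tests(tests, errs)
```

Here only `tc1` failed, and only the first error falls inside its window, so it is the sole error attributed.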

Page 11: Mool - Automated Log Analysis using Data Science and ML


Salient Points: Failure Sample & Error Samples

Test Case Failures

Page 12: Mool - Automated Log Analysis using Data Science and ML


System Interactions

[Diagram components]
– Data Pipeline
– Source Code
– Customer Reports
– Ticket Systems
– Historical Error DB
– Metadata Store
– Ensemble Modeling & Learning
– Recommendations
– Automated Actions

Page 13: Mool - Automated Log Analysis using Data Science and ML


Application Architecture

Ingestion
– Log Accumulation/release branch
– Grok parsers for HDP/Ambari components

Unsupervised Learning
– Identical Match (Stacktrace)
– Nearest Match (Levenshtein Adaptation)
– RCA/Associative Analysis
– Error Hierarchy Association

Outcome
– Automated Ticket Processing
– Recommendation Based on Recency
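The identical-match and nearest-match stages could look roughly like the sketch below. It uses the plain Levenshtein edit distance, normalized by the longer string; the talk's "Levenshtein Adaptation" is not specified, so the normalization, the `max_ratio` threshold, and the function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def match_error(stacktrace, known, max_ratio=0.2):
    """Return (error_id, 'identical'), (error_id, 'nearest'), or (None, 'new')."""
    # Stage 1: identical stack trace.
    for error_id, known_trace in known.items():
        if known_trace == stacktrace:
            return error_id, "identical"
    # Stage 2: closest known trace by normalized edit distance.
    best_id, best_ratio = None, None
    for error_id, known_trace in known.items():
        longest = max(len(stacktrace), len(known_trace)) or 1
        ratio = levenshtein(stacktrace, known_trace) / longest
        if best_ratio is None or ratio < best_ratio:
            best_id, best_ratio = error_id, ratio
    if best_ratio is not None and best_ratio <= max_ratio:
        return best_id, "nearest"
    return None, "new"


known = {
    "E1": "java.io.IOException: Connection refused to host node-1",
    "E2": "java.lang.NullPointerException at Foo.bar",
}
same = match_error(known["E1"], known)
near = match_error("java.io.IOException: Connection refused to host node-2", known)
new = match_error("completely different text", known)
```

The second lookup differs from E1 by a single character, so it resolves as a nearest match, while unrelated text is treated as a new error. The quadratic cost of edit distance on long stack traces is one reason a production system would likely adapt it.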

Page 14: Mool - Automated Log Analysis using Data Science and ML


Deployment Architecture

[Diagram: layers labeled Data Source, Data Processing, Storage, and Application, containing:]
– Test Clusters
– Log Daemon (log daemons push logs into HDFS)
– Split Processor
– Trigger Analysis at End of Run
– Livy (Job Server)
– Spark Jobs
– HDFS
– MetaData Store
– Web Application
– Manual Input for Selection/Rejection of Outcome

Page 15: Mool - Automated Log Analysis using Data Science and ML


RCA Versus Error Graph Creation

Error Graphs Creation Failed
– FP Growth Algorithm did not yield desired results
– Too many closed loops, cyclic dependencies
– Time as a split dimension was not enough

Moved towards RCAs
– Origin of the error chain was easier to find out
– Accuracy was higher
– Enough data supporting multiple code-flows

Easier to validate through our system analysts
– Unsupervised Learning is hard to validate without manual intervention

Page 16: Mool - Automated Log Analysis using Data Science and ML


RCA Rejections

False Positives are very prevalent
– Dominating Exceptions because of frequent code path execution
– They are repetitive and need to be ignored statistically, based on decile values

Priority versus Ignored versus Historical
– Historical RCAs, based on source code changes and recency, allow the final decision
– If corresponding tickets are open, then those issues take priority

Common Exceptions or Common RCAs
– Prioritize the ones that are causing cross-component failures
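One way to read the rejection rules above is sketched below: errors whose occurrence count exceeds the ninth decile are treated as dominating and ignored, and the remaining candidates are ordered with open tickets first and wider cross-component reach next. The choice of the ninth decile, the candidate dictionary layout, and the function names are assumptions; the talk only states that decile values are used. `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def dominating_errors(occurrences, decile=9):
    """Return errors whose occurrence count exceeds the given decile cut point."""
    counts = list(occurrences.values())
    cut_points = statistics.quantiles(counts, n=10)  # nine cut points: deciles 1..9
    threshold = cut_points[decile - 1]
    return {error for error, count in occurrences.items() if count > threshold}


def prioritize(candidates, ignored):
    """Drop ignored errors; order the rest by open ticket, then component reach."""
    kept = [c for c in candidates if c["error"] not in ignored]
    return sorted(kept, key=lambda c: (not c["open_ticket"], -len(c["components"])))


# Hypothetical occurrence counts: E10 fires on a very frequent code path.
occurrences = {
    "E1": 5, "E2": 7, "E3": 6, "E4": 4, "E5": 8,
    "E6": 5, "E7": 6, "E8": 7, "E9": 9, "E10": 5000,
}
ignored = dominating_errors(occurrences)

candidates = [
    {"error": "E10", "open_ticket": False, "components": {"Hive"}},
    {"error": "E2", "open_ticket": False, "components": {"Hive", "Yarn"}},
    {"error": "E5", "open_ticket": True, "components": {"HDFS"}},
]
ordered = [c["error"] for c in prioritize(candidates, ignored)]
```

In this example E10 is filtered out as a dominating exception, E5 comes first because it has an open ticket, and E2 follows because it spans two components.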

Page 17: Mool - Automated Log Analysis using Data Science and ML


RCA Graph

Page 18: Mool - Automated Log Analysis using Data Science and ML


Quick Stats

Item                                          Data
Total Run Ids Analyzed                        14410
Total Splits across components                115 K
Raw errors parsed from logs                   120 M
Unique Errors                                 45025
Total Test Case failure                       170 K
Errors related to Failed Test Cases           592 K
Unique Errors related to Failed Test Cases    30570
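A few ratios follow directly from these figures. The short calculation below derives them; the "K" and "M" values are rounded on the slide, so the results are approximate, and the variable names are chosen for this example only.

```python
# Figures from the Quick Stats slide ("K"/"M" values are rounded there).
unique_errors = 45025
unique_errors_failed = 30570
raw_errors = 120_000_000
failed_test_cases = 170_000
errors_for_failed = 592_000

# Share of unique errors that relate to at least one failed test case.
share_unique_in_failures = unique_errors_failed / unique_errors
# How many raw log errors collapse into one unique error, on average.
dedup_factor = raw_errors / unique_errors
# Average number of related errors per failed test case.
errors_per_failed_test = errors_for_failed / failed_test_cases

print(f"{share_unique_in_failures:.1%} of unique errors relate to failed test cases")
print(f"~{dedup_factor:,.0f} raw errors per unique error")
print(f"~{errors_per_failed_test:.1f} related errors per failed test case")
```

The deduplication ratio illustrates how strongly repeated log lines collapse once identical errors are merged.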

Page 19: Mool - Automated Log Analysis using Data Science and ML


Adoption Challenges

Great for fast changing code base
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement

Log collection required multiple iterations
– Order of logs being written and collected
– Central Log server issues

Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal/recency parameters are harder to provide

Unsupervised learning verification is harder
– Very hard to effectively judge performance of models without manual intervention

Page 20: Mool - Automated Log Analysis using Data Science and ML


Future Work
– Unsupervised learning validation using automated techniques
– Online processing using Spark Streaming
– Event based error detection on live production clusters
– Correlation with other log events/customer use cases

Page 21: Mool - Automated Log Analysis using Data Science and ML


Thank You

Rohit Choudhary & Gaurav Nagar