Mool - Automated Log Analysis using Data Science and ML

© Hortonworks Inc. 2011 – 2017 All Rights Reserved

Mool – Automated Root Cause Analysis using ML

Rohit Choudhary & Gaurav Nagar, Hortonworks


Introduction

HDP
– Cumulative Big Data package with 25+ certified open source Apache projects
– Source code arrives from both the community and internal engineering

QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; 100s of new tests are added every release

Release Stability
– Determined by system test failure and pass percentages
– Once new features and system tests are at 100%, we call the release done!

Releases
– On-premise releases
– Cloud releases: HDI and HDC


Problem Statement

Test Suite Size
– System tests are organized into 700 suites, also called splits
– Several 1000s of test cases are executed in every run

Infrastructure
– YarnCloud infrastructure & OpenStack infrastructure
– 700 × 5+ node HDP clusters: creation and teardown
– Test suites are run on each cluster and logs are collected
– Tests produce 1-1.5 TB of system logs across our stack every day

Failure Assessments and Subsequent Process
– Component owners take responsibility for identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)


Mool – Automated Log Analysis

Root cause across components in one click
– Identify common failure causes across components

Recommend actions, instead of assisted search, using
– Systemic knowledge: a repository of errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems

Integrate with the downstream process lifecycle
– Test analysis
– Ticketing system integration

Mool – Sanskrit for "root"


Past Industry Efforts – AALA @Siemens

Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet

“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the test system verdicts.”


RCA Analysis Process

1. Feature Extraction
– Log message feature extraction
– Test failure feature extraction

2. Enrichment
– Enriched with test execution time
– Origin components

3. Learning
– Error categorization
– RCA analysis
– Error repository upgrades


Algebra

[Diagram: test cases TC1, TC2, TC3, TC4 and errors E1, E2, E3, E4, keyed by Run ID, Component, and Suite]

Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}

Error – Test Case Correlation (conversely)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}

Where Components = {C1, C2, C3, C4}
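The converse mapping can be derived mechanically from the test case sets. The following is a minimal sketch, not Mool's actual code; the names `test_case_errors` and `invert` are illustrative assumptions.

```python
from collections import defaultdict

# Test case -> errors seen during that test case (the four sets from the slide).
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(tc_to_errors):
    """Build the converse mapping: error -> set of test cases that hit it."""
    error_to_tcs = defaultdict(set)
    for tc, errors in tc_to_errors.items():
        for err in errors:
            error_to_tcs[err].add(tc)
    return dict(error_to_tcs)


error_test_cases = invert(test_case_errors)

# Errors shared by more than one test case are candidates for a common cause.
shared = sorted(e for e, tcs in error_test_cases.items() if len(tcs) > 1)
```

Here `error_test_cases["E1"]` is `{"TC1", "TC2", "TC4"}`, and `shared` lists the errors that appear in more than one test case.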


Algebra - Explained

[Figure: Single Cluster Run. Suites i, j, k, l, … n run on one cluster along a timeline marked T = 0, T = t, and T = f. For each suite the figure lists Errors (E1, E2, E3…) and Test Cases (TC1, TC2, TC3…).]

Algebra - Explained

[Figure: Multi-cluster Run. Suite i appears four times, each with its own timeline: (T = 0, t, f), (T = t1, t2, f2), (T = t2, t3, f3), and (T = t3, t4, f4). For each run the figure lists Errors (E1, E2, E3…) and Test Cases (TC1, TC2, TC3…).]

Error Paths and Feature Extraction

[Diagram: component call stacks exercised by the test suites (Hive Suite, Spark, Oozie Workflow)]
– Hive Server2, Yarn, ATS, HDFS
– Livy, Yarn, HDFS
– Pig, Hive, Yarn, HDFS
Errors (E1, E2, E3, … En) surface along each call stack.

Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
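The two feature sets can be pictured as plain records. The sketch below uses the field names from the slide; the class names, the sample values, and the idea of attaching errors to a test case by its execution time window are assumptions for illustration, not a description of Mool's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TestCaseRecord:
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass
class ErrorRecord:
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str


def errors_during(test: TestCaseRecord, errors: List[ErrorRecord]) -> List[ErrorRecord]:
    """Return errors logged inside the test's execution window, oldest first."""
    hits = [e for e in errors
            if test.start_time <= e.occurrence_time <= test.end_time]
    return sorted(hits, key=lambda e: e.occurrence_time)


# Hypothetical sample data.
t0 = datetime(2017, 1, 1, 10, 0, 0)
test = TestCaseRecord("tc_example", "hive_suite", t0, t0 + timedelta(minutes=5), "FAILED")
errors = [
    ErrorRecord("trace-a", "container killed", t0 + timedelta(minutes=3), "yarn", "runtime", "yarn.log"),
    ErrorRecord("trace-b", "block missing", t0 + timedelta(minutes=1), "hdfs", "storage", "hdfs.log"),
    ErrorRecord("trace-c", "unrelated", t0 + timedelta(minutes=9), "hive", "runtime", "hive.log"),
]
origins = [e.origin for e in errors_during(test, errors)]
```

Sorting by `occurrence_time` puts the earliest error in the window first, which is one simple way to look for the origin of an error chain.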


Salient Points: Failure Sample & Error Samples

Test Case Failures


System Interactions

[Diagram components]
– Data Pipeline
– Source Code
– Ticket Systems
– Customer Reports
– Historical Error DB
– Metadata Store
– Ensemble Modeling & Learning
– Recommendations
– Automated Actions


Application Architecture

Ingestion
– Log accumulation per release branch
– Grok parsers for HDP/Ambari components

Unsupervised Learning
– Identical match (stacktrace)
– Nearest match (Levenshtein adaptation)
– RCA/associative analysis
– Error hierarchy association

Outcome
– Automated ticket processing
– Recommendation based on recency
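The identical-match and nearest-match steps can be sketched as follows. The slide does not say how Levenshtein distance was adapted, so this sketch uses the plain edit distance with a length-normalized threshold; the function names, the `max_ratio` value, and the sample stack traces are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, computed with a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_error(stacktrace, known, max_ratio=0.2):
    """Return (known stacktrace, kind), where kind is 'identical', 'nearest' or 'new'."""
    if stacktrace in known:
        return stacktrace, "identical"
    best, best_dist = None, None
    for candidate in known:
        dist = levenshtein(stacktrace, candidate)
        if best_dist is None or dist < best_dist:
            best, best_dist = candidate, dist
    # Accept the nearest candidate only if the edits are a small share of the text.
    if best is not None and best_dist / max(len(stacktrace), len(best)) <= max_ratio:
        return best, "nearest"
    return None, "new"


# Hypothetical error repository: stacktrace text -> error id.
known = {"java.io.IOException: Connection refused: host-17": "E1"}

same = match_error("java.io.IOException: Connection refused: host-17", known)
near = match_error("java.io.IOException: Connection refused: host-42", known)
other = match_error("NullPointerException", known)
```

Comparing every new stack trace against every known one is quadratic in the worst case, so a real pipeline would likely try the exact match first (as here) and limit the nearest-match search to a smaller candidate set.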


Deployment Architecture

[Diagram layers: Data Source, Data Processing, Application]

Components
– Test Clusters
– Log Daemon
– Split Processor
– Storage
– HDFS
– Metadata Store
– Livy (Job Server)
– Spark Jobs
– Web Application

Flows
– Log daemons push logs into HDFS
– Analysis is triggered at the end of a run
– Manual input for selection/rejection of the outcome


RCA Versus Error Graph Creation

Error graph creation failed
– The FP-Growth algorithm did not yield the desired results
– Too many closed loops and cyclic dependencies
– Time as a split dimension was not enough

Moved towards RCAs
– Origin of the error chain was easier to find
– Accuracy was higher
– Enough data supporting multiple code flows

Easier to validate through our system analysts
– Unsupervised learning is hard to validate without manual intervention


RCA Rejections

False positives are very prevalent
– Dominating exceptions arise from frequently executed code paths
– They are repetitive and need to be ignored statistically, based on decile values

Priority versus ignored versus historical
– Historical RCAs, based on source code changes and recency, allow the final decision
– If corresponding tickets are open, those issues take priority

Common exceptions or common RCAs
– Prioritize the ones that cause cross-component failures
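The rejection and prioritization rules can be sketched in a few lines. The slides do not state which decile was used as the cutoff or how candidates were scored, so the top-decile cutoff, the field names (`open_ticket`, `components`), and the sample counts below are assumptions made for illustration.

```python
from statistics import quantiles


def split_dominating(counts):
    """Split errors into (kept, ignored) using the top-decile occurrence count."""
    cutoff = quantiles(counts.values(), n=10)[-1]  # last of the nine decile cut points
    ignored = {e for e, c in counts.items() if c > cutoff}
    kept = set(counts) - ignored
    return kept, ignored


def prioritize(candidates):
    """Order RCA candidates: open tickets first, then wider component spread."""
    return sorted(candidates,
                  key=lambda c: (c["open_ticket"], len(c["components"])),
                  reverse=True)


# Hypothetical occurrence counts per unique error.
counts = {"E1": 5000, "E2": 12, "E3": 9, "E4": 15, "E5": 7,
          "E6": 11, "E7": 3, "E8": 14, "E9": 8, "E10": 10}
kept, ignored = split_dominating(counts)

# Hypothetical RCA candidates that survived the filter.
candidates = [
    {"error": "E2", "open_ticket": False, "components": {"hive", "yarn"}},
    {"error": "E3", "open_ticket": True, "components": {"hdfs"}},
    {"error": "E4", "open_ticket": False, "components": {"pig"}},
]
ranked = [c["error"] for c in prioritize(candidates)]
```

In this sample, the one error that occurs far more often than the rest falls above the cutoff and is ignored, while the remaining candidates are ordered by open ticket and then by the number of components they touch.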


RCA Graph


Quick Stats

Item                                          Data
Total run IDs analyzed                        14410
Total splits across components                115 K
Raw errors parsed from logs                   120 M
Unique errors                                 45025
Total test case failures                      170 K
Errors related to failed test cases           592 K
Unique errors related to failed test cases    30570


Adoption Challenges

Great for a fast-changing code base
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement

Log collection required multiple iterations
– Order of logs being written and collected
– Central log server issues

Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal, so recency parameters are harder to provide

Unsupervised learning verification is harder
– Very hard to judge model performance effectively without manual intervention


Future Work

– Unsupervised learning validation using automated techniques
– Online processing using Spark Streaming
– Event-based error detection on live production clusters
– Correlation with other log events/customer use cases


Thank You

Rohit Choudhary & Gaurav Nagar
