Hadoop Eagle: Full-stack realtime monitoring framework for eBay Hadoop


Upload: hao-chen

Post on 28-Jul-2015


TRANSCRIPT

HADOOP EAGLE: Full-stack realtime monitoring framework for eBay Hadoop

Edward Zhang @yonzhang2012 | Hao Chen @ihaoch

Background – initial use cases

Use case: Detect node anomalies by analyzing the task failure ratio across all nodes

Assumption: The task failure ratio should be approximately equal on every node

Algorithm: Node-by-node comparison (symmetry violation) plus per-node trend

HADOOP EAGLE – EBAY INC
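The node-by-node comparison can be sketched roughly as below. This is only an illustration, not Eagle's implementation: the class name, the choice of the cluster median as the baseline, and the `factor`, `floor` and `minTasks` parameters are all assumptions, and the per-node trend part of the algorithm is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: flags nodes whose task failure ratio breaks symmetry
// with the rest of the cluster. Not part of the Eagle code base.
class TaskFailureAnomalySketch {

    // Per-node counters; the field names are illustrative.
    static final class NodeStats {
        final long failedTasks;
        final long totalTasks;

        NodeStats(long failedTasks, long totalTasks) {
            this.failedTasks = failedTasks;
            this.totalTasks = totalTasks;
        }

        double failureRatio() {
            return totalTasks == 0 ? 0.0 : (double) failedTasks / totalTasks;
        }
    }

    // Returns the hosts whose failure ratio exceeds max(factor * median, floor).
    // The floor keeps a near-zero median from flagging every node.
    static List<String> findAnomalousNodes(Map<String, NodeStats> stats,
                                           double factor, double floor, long minTasks) {
        double[] ratios = stats.values().stream()
                .filter(s -> s.totalTasks >= minTasks)
                .mapToDouble(NodeStats::failureRatio)
                .sorted()
                .toArray();
        List<String> anomalies = new ArrayList<>();
        if (ratios.length == 0) {
            return anomalies;
        }
        int mid = ratios.length / 2;
        double median = ratios.length % 2 == 1
                ? ratios[mid]
                : (ratios[mid - 1] + ratios[mid]) / 2.0;
        double threshold = Math.max(median * factor, floor);
        for (Map.Entry<String, NodeStats> e : stats.entrySet()) {
            NodeStats s = e.getValue();
            if (s.totalTasks >= minTasks && s.failureRatio() > threshold) {
                anomalies.add(e.getKey());
            }
        }
        return anomalies;
    }

    public static void main(String[] args) {
        Map<String, NodeStats> stats = new TreeMap<>();
        stats.put("node1", new NodeStats(2, 100));
        stats.put("node2", new NodeStats(3, 100));
        stats.put("node3", new NodeStats(40, 100));
        stats.put("node4", new NodeStats(1, 100));
        System.out.println(findAnomalousNodes(stats, 5.0, 0.05, 50));
    }
}
```

A median baseline is used here because a single bad node barely moves it, whereas a mean would be pulled toward the outlier.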

Host: Task failure based anomaly host detection

Anomaly Detection & Alerting | Analysis | Auto-Remediation

Scale Challenges @ eBay Hadoop Monitoring

• 10+ large Hadoop clusters
• 10,000+ data nodes
• 50,000+ jobs per day
• 50,000,000+ tasks per day
• 500+ types of Hadoop/HBase native metrics
• Billions of audit events and metrics per day

Use case challenges @ eBay Hadoop Monitoring

• Host
  • Task failure ratio based machine anomaly detection
• Job monitoring across its lifetime
  • Real-time running job performance analysis
  • Near real-time job history analytics
  • Data skew detection
• Hadoop native metrics
  • HDFS
  • HBase
  • M/R
• Logs
  • GC log
  • Hadoop daemon log
  • Audit log
  • HDFS image file
• YARN framework
  • Queue

Engineering Challenges @ eBay Hadoop Monitoring

• Variety of data sources: M/R job history, running jobs, GC log, NameNode log, Hadoop native metrics, YARN queue, audit log, HDFS image file, etc.
• Variety of data collectors: pull from HDFS, pull from the YARN API, ship logs, …
• Complex business logic: join with outside data, pre-aggregations, memory window, …
• Alert rules can't be hot deployed
• Scalability issue with a single process

Job History Performance Analyzer

• Monitor job history files in near real-time
• Crawl each job history file immediately after the job completes
• Apply expert rules to produce job performance suggestions
• Job history trend for the same type of job

[Diagram: Job Start Event -> Task Start Event -> Task End Event -> Task roll-up -> Task2 Start Event -> Task2 End Event -> Task roll-up -> Job End Event -> Job Suggestion Rules]

Job real-time monitoring

• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
  • CPU, HDFS I/O, disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Sliding window based alerts
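A sliding window based alert over minute-level snapshots might look like the sketch below. The class, the window-average rule and the threshold value are assumptions made for illustration; the slides do not say which window function Eagle applies.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a sliding-window alert over minute-level snapshots.
class SlidingWindowAlertSketch {
    private final int windowSize;    // number of most recent snapshots to keep
    private final double threshold;  // alert when the window average exceeds this
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    SlidingWindowAlertSketch(int windowSize, double threshold) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Adds one snapshot value. Returns true only when the window is full
    // and its average is above the threshold.
    boolean offer(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
        return window.size() == windowSize && sum / windowSize > threshold;
    }

    public static void main(String[] args) {
        SlidingWindowAlertSketch alert = new SlidingWindowAlertSketch(3, 80.0);
        double[] cpuPercent = {50, 85, 90, 95, 60, 40};
        for (double v : cpuPercent) {
            System.out.println(v + " -> alert=" + alert.offer(v));
        }
    }
}
```

Averaging over the window means a single spike does not fire the alert; only sustained load across several snapshots does.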

Service: GC Log / Server Log

• GC event detection and prediction
• Log metrics statistics
• Real-time log anomaly detection

Why Eagle Monitoring Framework

Programming Paradigm and Abstraction

• Data collector -> data processing -> metric pre-aggregation / alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of a monitoring system

Eagle as a Framework

As a framework, Eagle does not assume:
• Data source (where, what)
• Business logic execution path (how)
• Policy engine implementation (how)
• Data sink (where, what)

As a framework, Eagle provides the following:
• SQL-like service API
• High-performing query framework
• Lightweight streaming process Java API
• Extensible policy engine implementation
• Scalable and distributed rule evaluation
• Native HBase data storage support
• Metadata driven stream processing
• Data source extensibility
• Data sink extensibility
• Interactive dashboard

Eagle Overall Architecture

Eagle Monitoring Framework Internals

• Lightweight Streaming Process Framework
• Extensible & Scalable Policy Framework for Alert
• Eagle Query Framework
• Interactive Dashboards

Lightweight Streaming Process Framework

Facts
• Computation is based on single events, which form an endless continuous stream
• Computation can be aggregation, time window, length window, join with outside data, etc.
• The filter design pattern was used to modularize code at the beginning

Abstraction
• Inspired by the Cascading framework, we abstract a lightweight streaming programming API that is independent of the execution environment
• A streaming process is a directed acyclic graph
• This layer of indirection is for code modularization, code reuse and avoiding coupling with a specific execution environment
• Runs on a single process, Storm or other streaming technology such as Spark

Eagle Stream Data Processing API

Step 1: Task DAG graph setup
Step 2: Inter-task data exchange protocol

    @Override
    protected void buildDependency(FlowDef def, DataProcessConfig config) {
        // Head of the flow: the word source
        Task header = Task.newTask("wordgenerator")
                .setExecutor(source)
                .completeBuild();
        // Second task, fed by the head task
        Task uppertask = Task.newTask("uppercase")
                .setExecutor(new UppercaseExecutor())
                .connectFrom(header)
                .completeBuild();
        // Third task, fed by the uppercase task
        Task groupbyUppercaseTask = Task.newTask("groupby_uppercase")
                .setExecutor(new GroupbyCountExecutor())
                .connectFrom(uppertask)
                .completeBuild();
        // Mark the last task of the flow
        def.endBy(groupbyUppercaseTask);
    }
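To show what "independent of the execution environment" can mean in practice, here is a rough single-process stand-in for the same three-task chain. `SketchTask` and `StageExecutor` are hypothetical names invented for this sketch and are not the Eagle `Task` or executor classes; passing data between tasks as `List<String>` is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a task chain like the one on the slide, run in a single
// process. These are illustrative classes, not part of Eagle.
class SketchTask {

    // One processing step; input and output are simple lists in this sketch.
    interface StageExecutor {
        List<String> process(List<String> input);
    }

    private final String name;
    private final StageExecutor executor;
    private final SketchTask upstream; // null for the head of the chain

    SketchTask(String name, StageExecutor executor, SketchTask upstream) {
        this.name = name;
        this.executor = executor;
        this.upstream = upstream;
    }

    String name() {
        return name;
    }

    // Pulls data through the upstream tasks first, then applies this task.
    List<String> run(List<String> seed) {
        List<String> input = (upstream == null) ? seed : upstream.run(seed);
        return executor.process(input);
    }

    // Builds wordgenerator -> uppercase -> groupby_uppercase and returns the tail.
    static SketchTask buildWordCountFlow() {
        SketchTask header = new SketchTask("wordgenerator", words -> words, null);
        SketchTask upper = new SketchTask("uppercase", words -> {
            List<String> out = new ArrayList<>();
            for (String w : words) {
                out.add(w.toUpperCase(Locale.ROOT));
            }
            return out;
        }, header);
        return new SketchTask("groupby_uppercase", words -> {
            Map<String, Integer> counts = new TreeMap<>();
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
            }
            List<String> out = new ArrayList<>();
            counts.forEach((w, c) -> out.add(w + "=" + c));
            return out;
        }, upper);
    }

    public static void main(String[] args) {
        SketchTask tail = buildWordCountFlow();
        System.out.println(tail.name() + ": "
                + tail.run(Arrays.asList("eagle", "Eagle", "hadoop")));
    }
}
```

Because each step only sees its input list, the same step definitions could in principle be wired into a different runner, which is the decoupling the abstraction slide describes.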

Execution Graph development, compile and deploy

Development / Compile Phase | Deployment / Runtime Phase


Extensible & Scalable Policy Framework

Scalability
• Dynamic policy partitioning across compute nodes based on a configurable partition class
• Dynamic policy deployment
• Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)

Extensibility
• Supports new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.

Features
• Policy CRUD
• Stream metadata (event attribute name, attribute type, attribute value resolver, …)

Dynamic Policy Partitioning
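One way to picture policy partitioning with a configurable partition class is the sketch below: each evaluator instance loads only the policies that map to its own partition index. The `PolicyPartitioner` interface, the hash strategy and every name here are assumptions for illustration, not the Eagle API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of hash-based policy partitioning.
// The interface and class names are illustrative, not the Eagle API.
class PolicyPartitionSketch {

    // Stand-in for a "configurable partition class".
    interface PolicyPartitioner {
        int partition(int numPartitions, String policyId);
    }

    // Example strategy: stable, non-negative hash of the policy id.
    static final class HashPartitioner implements PolicyPartitioner {
        @Override
        public int partition(int numPartitions, String policyId) {
            return Math.floorMod(policyId.hashCode(), numPartitions);
        }
    }

    // Returns the policies that the evaluator with the given index should load.
    static List<String> policiesFor(int partitionIndex, int numPartitions,
                                    List<String> policyIds, PolicyPartitioner partitioner) {
        List<String> mine = new ArrayList<>();
        for (String id : policyIds) {
            if (partitioner.partition(numPartitions, id) == partitionIndex) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> policies = Arrays.asList("p1", "p2", "p3", "p4", "p5", "p6");
        int numPartitions = 3;
        for (int i = 0; i < numPartitions; i++) {
            System.out.println("evaluator " + i + " -> "
                    + policiesFor(i, numPartitions, policies, new HashPartitioner()));
        }
    }
}
```

With M policies spread over several evaluators, each event is checked against only a slice of the policies on any one node, which is how the N events * M policies cost gets divided.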

Scalability of Policy Evaluation

Extensibility of policy framework

    public interface PolicyEvaluatorServiceProvider {
        public String getPolicyType();
        public Class<? extends PolicyEvaluator> getPolicyEvaluator();
        public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
        public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
        public List<Module> getBindingModules();
    }

The policy evaluator provider uses SPI to register policy engine implementations
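SPI registration in Java is commonly done with `java.util.ServiceLoader`, so discovery of policy engines could look roughly like this. `EngineProvider` is a simplified, hypothetical stand-in for `PolicyEvaluatorServiceProvider`, and the "siddhi" type string is only an example value; whether Eagle uses `ServiceLoader` exactly this way is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical sketch of SPI discovery with java.util.ServiceLoader.
// EngineProvider is a simplified stand-in for PolicyEvaluatorServiceProvider.
class PolicyEngineRegistrySketch {

    interface EngineProvider {
        String getPolicyType();
    }

    // Indexes providers by policy type, so a policy definition can be routed
    // to the engine that understands it.
    static Map<String, EngineProvider> index(Iterable<EngineProvider> providers) {
        Map<String, EngineProvider> byType = new HashMap<>();
        for (EngineProvider p : providers) {
            byType.put(p.getPolicyType(), p);
        }
        return byType;
    }

    // Discovers implementations listed in a provider-configuration file under
    // META-INF/services, named after the provider interface's binary name.
    static Map<String, EngineProvider> discover() {
        return index(ServiceLoader.load(EngineProvider.class));
    }

    public static void main(String[] args) {
        // With no provider-configuration file on the classpath, nothing is found.
        System.out.println("discovered: " + discover().keySet());
        // Providers can also be indexed directly, for example in tests.
        EngineProvider example = () -> "siddhi";
        System.out.println("manual: " + index(Collections.singletonList(example)).keySet());
    }
}
```

A new engine is then added by shipping a jar that contains an implementation plus its provider-configuration file, with no change to the framework code.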


Eagle Query Framework

A lightweight, metadata-driven store layer that serves the storage and query requirements commonly shared by most monitoring systems

Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized structure
• …

Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• …

Features
• Simple API
• Powerful query
• High performance
• Scalability
• Pluggable
• …

• Metadata definition ORM
• High-performance RESTful API supporting CRUD
• SQL-like declarative query syntax
• Generic service client library
• Native support for HBase and RDBMS
• Interactive and customizable dashboard

Metadata definition ORM

• Annotations are metadata attached to an entity
• Metadata-driven query compiling and response rendering
• Metadata-driven serialization/deserialization
• Rename columns to shorter strings (HBase)
• Entity metadata primitives
  • Table
  • ColumnFamily
  • Prefix (the very first partition key)
  • Service (entity identifier)
  • Partition
  • Tags
  • Indexes
  • Column

    @Table("alertdef")
    @ColumnFamily("f")
    @Prefix("alertdef")
    @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
    @TimeSeries(false)
    @Partition({"cluster", "datacenter"})
    @Tags({"programId", "alertExecutorId", "policyId", "policyType"})
    @Indexes({
        @Index(name = "Index_1_alertExecutorId", columns = { "alertExecutorID" })
    })
    public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
        @Column("a")
        private String desc;
        @Column("b")
        private String policyDef;
        @Column("c")
        private String dedupeDef;
        @Column("d")
        private String notificationDef;
        @Column("e")
        private String remediationDef;
        @Column("f")
        private boolean enabled;
        // ...
    }
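How annotations can drive storage mapping is easiest to see with plain Java reflection. The sketch below defines its own minimal `@Table` and `@Column` annotations and a cut-down entity; these are hypothetical stand-ins, not Eagle's annotation classes, and the lookup methods are assumptions about how such metadata might be read.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: reading entity metadata from annotations at runtime.
// The annotations below are local stand-ins, not Eagle's own classes.
class EntityMetadataSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Table {
        String value();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Column {
        String value();
    }

    // Cut-down example entity.
    @Table("alertdef")
    static class AlertDefinition {
        @Column("a")
        private String desc;
        @Column("b")
        private String policyDef;
        private String notPersisted; // no @Column, so it is skipped
    }

    // Resolves the table name declared on the entity class.
    static String tableOf(Class<?> entity) {
        Table t = entity.getAnnotation(Table.class);
        if (t == null) {
            throw new IllegalArgumentException(entity.getName() + " has no @Table");
        }
        return t.value();
    }

    // Maps each annotated Java field name to its short column qualifier.
    static Map<String, String> columnsOf(Class<?> entity) {
        Map<String, String> columns = new TreeMap<>();
        for (Field f : entity.getDeclaredFields()) {
            Column c = f.getAnnotation(Column.class);
            if (c != null) {
                columns.put(f.getName(), c.value());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(tableOf(AlertDefinition.class));
        System.out.println(columnsOf(AlertDefinition.class));
    }
}
```

The short qualifiers ("a", "b", …) keep stored column names small, while the Java field names stay readable for queries and rendering.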

Generic RESTful API & Query

eagle-service/rest/entities?query=

<Query> ::= <EntityName> "[" <FilterCondition> "]" "<" <GroupbyFields> ">" "{" <AggregatedFunctions> "}" [ "." "{" <SortbyOptions> "}" ]

Generic RESTful API Query Syntax

Search Query
query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]{@startTime,@numTotalMaps}&startTime=&endTime=&pageSize=100

Aggregate Query
Aggregation Query ::= <EntityName> [QueryCondition]<GroupbyFields>{AggregatedFunctions}.{SortbyOptions}
query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]<@user>{count, min(endTime-startTime)}&startTime=&endTime=&pageSize=100

TimeSeries Histogram Query
query=GenericMetricService[@cluster="ares" AND @datacenter="lvs"]<@user>{sum(value)}.{sum(value) desc}&timeSeries=true&intervmin=1440&pageSize=10000000&startTime=2014-07-01 00:00:00&endTime=2014-08-01 00:00:00&metricName=eagle.hdfs.spacesize.cluster

Operators
CONTAINS, IN, !=, =, <, <=, >, >=

Numeric Filters
query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100

Pagination
query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100&startRowkey=BgVz-9R…….
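Queries in this syntax contain characters such as `[`, `@`, `"` and `{` that must be URL-encoded before being sent. The helper below is a rough sketch of assembling and encoding such a query; the class, its method names and the host in `main` are assumptions, and only the `rest/entities?query=` path and `pageSize` parameter come from the slides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that assembles a query string in the syntax shown on the
// slides and URL-encodes it. Class and method names are assumptions.
class EagleQueryUrlSketch {

    // Builds EntityName[filter]<groupBy>{outputs}; groupBy may be null for a plain search.
    static String buildQuery(String entity, String filter, String groupBy, String outputs) {
        StringBuilder q = new StringBuilder(entity);
        q.append('[').append(filter).append(']');
        if (groupBy != null) {
            q.append('<').append(groupBy).append('>');
        }
        q.append('{').append(outputs).append('}');
        return q.toString();
    }

    // Appends the encoded query and page size to the service base URL.
    static String buildUrl(String baseUrl, String query, int pageSize) {
        try {
            return baseUrl + "/rest/entities?query="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&pageSize=" + pageSize;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = buildQuery("JobExecutionService",
                "@cluster=\"xyz\" AND @datacenter=\"abc\"", "@user", "count");
        System.out.println(query);
        // The host and port below are placeholders.
        System.out.println(buildUrl("http://localhost:8080/eagle-service", query, 100));
    }
}
```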

Generic Eagle Service Client Library

• Basic CRUD
• Fluent DSL
• Metric Builder API
• Parallel Client
• Asynchronous Client

    client.metric("unit.test.metrics")
        .batch(5)
        .tags(tags)
        .send("unit.test.metrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
        .send(System.currentTimeMillis(), 0.1)
        .send(System.currentTimeMillis(), 0.1, 0.2)
        .send(System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
        .send("unit.test.anothermetrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
        .flush();

    client.search("GenericMetricService[@cluster=\"cluster4ut\" AND @datacenter = \"datacenter4ut\"]<@cluster>{sum(value)}")
        .startTime(0)
        .endTime(System.currentTimeMillis() + 24 * 3600 * 1000)
        .metricName("unit.test.metrics")
        .pageSize(1000)
        .send();

HBase Storage Design

Uniform rowkey design

Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …

• Metric
  Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
• Entity
  Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Log
  Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
  Rowvalue ::= Log Content
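A composite rowkey in this layout could be assembled as in the sketch below. The byte encoding is an assumption: the slides give only the field order, so storing each string as a 4-byte hash, the timestamp as an 8-byte long, and sorting tags by name are choices made here for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a composite rowkey:
//   prefix | partition keys | timestamp | tagName | tagValue | ...
// The hash-based encoding is an assumption, not Eagle's documented format.
class RowkeySketch {

    static byte[] build(String prefix, List<String> partitionValues,
                        long timestamp, Map<String, String> tags) {
        // Sort tags by name so the same tag set always yields the same key.
        Map<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4 + 4 * partitionValues.size() + 8 + 8 * sortedTags.size();
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(prefix.hashCode());
        for (String partition : partitionValues) {
            buf.putInt(partition.hashCode());
        }
        buf.putLong(timestamp);
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buf.putInt(tag.getKey().hashCode());
            buf.putInt(tag.getValue().hashCode());
        }
        return buf.array();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("user", "alice");
        tags.put("queue", "default");
        byte[] key = build("eagle.hdfs.spacesize.cluster",
                Arrays.asList("cluster1", "datacenter1"), System.currentTimeMillis(), tags);
        System.out.println("rowkey length in bytes: " + key.length);
    }
}
```

Putting the prefix and partition keys first keeps rows for one metric and one cluster contiguous, so a time-range query becomes a bounded scan.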

HBase Coprocessor

com.ebay.eagle.coprocessor.AggregateProtocol

[Chart: avg, count, max, min and sum aggregations, comparing no coprocessor in a single region, coprocessor in a single region, and estimated in cluster; axis from 0 to 20000]

Tuning for HBase Storage

• Uniform HBase rowkey design for all types of monitoring data sources
• Logically partition data by tags, which are defined in the annotation @Partition({"cluster", "datacenter"})
• Physically shard data using HBase native features: rowkey ranges and region mapping
• Write throughput optimized by using HBase multi-put
• Coprocessor to maximize query performance
• Push evaluation of numeric filters down to HBase
• Secondary index support
• Inspection of RESTful resources and entity metadata
• Numeric filters
• Expression evaluation in output fields
• Rowkey inspection


Generic Dashboard Analytics for Eagle Store

• Interactive: IPython notebook-like interactive visualization, analysis and troubleshooting
• Dashboard: customizable dashboard layout and drill-down path, which can be persisted and shared

Open Source Soon …

• First use case: Eagle to secure the Hadoop platform, based on the Eagle framework
• Work closely with Hortonworks, Dataguise, …
• Share with the community and get the community's support
• Continue to open source job monitoring, GC monitoring, etc.

Q & A