Posted on 06-Jan-2017

HADOOP PLATFORM AT YAHOO: A YEAR IN REVIEW

SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms

Agenda

1. Platform Overview
2. Infrastructure and Metrics
3. CaffeOnSpark for Distributed DL
4. Compute and Sketches
5. HBase and Omid
6. Oozie
7. Ease of Use
8. Q&A

2

[Chart: Hadoop footprint by year, 2006–2016; # servers on one axis (0–50,000) and raw HDFS in PB on the other (0–800).]

- Yahoo! commits to scaling Hadoop for production use
- Research workloads in search and advertising
- Production (modeling) with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Next-gen Hadoop (H 0.23, YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache H2.7 (scalable ML, latency, utilization, productivity)

Platform Evolution

3

Deployment Models

On-premise:
- Private (dedicated) clusters: large, demanding use cases; new technology not yet platformized; data movement and regulation issues
- Hosted multi-tenant (private cloud) clusters: source of truth for all of an org's data; operational efficiency and cost savings through economies of scale
- Purpose-built big data clusters: built for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools

Public cloud:
- Hosted compute clusters: when more cost-effective than on-premise; when time to market / results matter; when data is already in the public cloud; app delivery agility

4

Platform Today

Apache / open source projects and Yahoo projects, by layer:
- Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
- Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
- Compute: YARN, CS, MR, Tez, Spark, Storm
- Storage / messaging: HDFS, HBase, HCat, Kafka, CMS, DH

5

Technology Stack Assembly

- HDFS (file system)
- YARN (scheduling, resource management)
- Common: RHEL6 64-bit, JDK8

Legend: platformized tech with production support vs. in-progress, unmet needs, or Apache alignment.

6

Common Backplane

[Diagram: per-cluster worker daemons (DataNode/NodeManager, DataNodes/RegionServers, Supervisor) under their masters (NameNode, RM, HBase Master, Nimbus), plus ZooKeeper pools, HTTP/HDFS/GDM load proxies, Oozie server, HS2/HCat, data feeds and data stores, tied together by administration, management, and monitoring over a common network backplane.]

7

Research Cluster Consolidation

One-month sample (2015), total vs. used compute (TB):
- Cluster 1 (2,000 servers): HDFS 12 PB, compute 23 TB, avg. util. 26%
- Cluster 2 (3,100 servers): HDFS 21 PB, compute 52 TB, avg. util. 40%
- Cluster 3 (5,400 servers): HDFS 36 PB, compute 70 TB, avg. util. 59%

8

Consolidated Research Cluster Characteristics

One-month sample (2016), total vs. used compute (TB):
- Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. util. 70%
- Before: 10,500 servers; after: 2,200 servers
- 40% decrease in TCO, 65% increase in compute capacity, 50% increase in avg. utilization

9
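The headline gains are consistent with the per-cluster figures from the 2015 sample; the following is plain arithmetic on the numbers above (a sanity check, not data from any other source):

```java
public class ConsolidationCheck {
    public static void main(String[] args) {
        // 2015 sample: per-cluster compute (TB) and average utilization
        double[] computeTb = {23, 52, 70};
        double[] avgUtil = {0.26, 0.40, 0.59};

        double totalBefore = 0, usedBefore = 0;
        for (int i = 0; i < computeTb.length; i++) {
            totalBefore += computeTb[i];
            usedBefore += computeTb[i] * avgUtil[i];
        }

        // 2016 sample: consolidated cluster
        double totalAfter = 240, utilAfter = 0.70;

        double capacityGain = (totalAfter - totalBefore) / totalBefore;
        double utilBefore = usedBefore / totalBefore;     // capacity-weighted
        double utilGain = (utilAfter - utilBefore) / utilBefore;

        // Both round to the slide's headline figures (~65% and ~50%)
        System.out.printf("compute capacity: +%.1f%%%n", capacityGain * 100);
        System.out.printf("avg. utilization: +%.1f%%%n", utilGain * 100);
    }
}
```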

Common Hadoop Cluster Configuration

[Diagram: racks 1 through N of CPU servers with JBODs and 10GbE, connected over the network backplane.]

10

New Hadoop Cluster Configuration

[Diagram: racks 1 through N of CPU servers with JBODs and 10GbE over the network backplane, plus GPU servers and hi-mem servers on 100 Gbps InfiniBand.]

11

YARN Node Labels

[Diagram: a Hadoop cluster with nodes labeled x and y; Queue 1 (40%, label x), Queue 2 (40%, labels x, y), and Queue 3 (20%, no label) running jobs J1–J4.]

yarn.scheduler.capacity.root.<queue-name>.accessible-node-labels = <label-name>
yarn.scheduler.capacity.root.<queue-name>.default-node-label-expression sets the default label asked for by the queue

12
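The matching rule those properties encode can be sketched as a toy model (illustrative Java, not YARN code; the class and method names are invented): a queue may only request labels in its accessible set, and a labeled request is only satisfied by nodes carrying that label.

```java
import java.util.Set;

// Toy model of YARN node-label matching (not actual YARN code):
// a request is placeable on a node iff the requested label is in the
// queue's accessible-node-labels and matches the node's label; an empty
// label expression targets the unlabeled (default) partition.
public class NodeLabelDemo {
    static boolean canPlace(Set<String> accessibleLabels, String requested, String nodeLabel) {
        if (requested.isEmpty()) return nodeLabel.isEmpty(); // default partition
        return accessibleLabels.contains(requested) && requested.equals(nodeLabel);
    }

    public static void main(String[] args) {
        Set<String> queue1 = Set.of("x");      // Queue 1 (40%): label x
        Set<String> queue2 = Set.of("x", "y"); // Queue 2 (40%): labels x, y
        Set<String> queue3 = Set.of();         // Queue 3 (20%): default partition

        System.out.println(canPlace(queue1, "x", "x")); // true
        System.out.println(canPlace(queue1, "y", "y")); // false: y not accessible
        System.out.println(canPlace(queue2, "y", "y")); // true
        System.out.println(canPlace(queue3, "", ""));   // true: unlabeled node
    }
}
```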


CaffeOnSpark – Distributed Deep Learning

- CaffeOnSpark for DL
- MLlib for non-DL
- Hive or SparkSQL
- Spark
- YARN (RM and scheduling)
- HDFS (datasets)

14

A Few Use Cases – Yahoo Weather

15

A Few Use Cases – Flickr Facial Recognition

16

A Few Use Cases – Flickr Scene Detection

17

CaffeOnSpark Architecture – Common Cluster

[Diagram: a Spark driver coordinating several Spark executors (for data feeding and control), each running Caffe (enhanced with multi-GPU/CPU) and a model synchronizer (across nodes) against HDFS datasets; model output lands on HDFS; synchronization runs over MPI on RDMA / TCP.]

18

CaffeOnSpark Architecture – Incremental Learning

Feature engineering: deep learning

cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL

19

CaffeOnSpark Architecture – Incremental Learning

Feature engineering: deep learning

cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL

Train classifiers: non-deep learning

lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                    .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
...

20

CaffeOnSpark Architecture – Single Command

spark-submit --num-executors #Exes --class CaffeOnSpark my-caffe-on-spark.jar \
    -devices #GPUs -model dl_model_file -output lr_model_file

21

Distributed Deep Learning

CaffeOnSpark (github.com/yahoo/caffeonspark):
- Apache license
- Runs on existing clusters
- Powerful DL platform
- Fully distributed
- High-level API
- Incremental learning

22


Hadoop Compute Sources

- HDFS (file system and storage)
- YARN (resource management and scheduling)
- MapReduce (legacy), Tez (execution engine for Pig and Hive), Spark (alternate exec engine)
- Pig (scripting), Hive (SQL), Java MR APIs
- Data processing, ML, custom apps on Slider
- Oozie, data management

24

Compute Growth

[Chart: # MR, Tez, Spark jobs (in millions) per month, Mar-13 through Mar-16, growing from 13.3 to 39.1 million (labeled points: 13.3, 20.4, 23.8, 27.2, 32.3, 34.1, 39.1).]

25

Pushing Batch Compute Boundaries

[Chart: % of total compute (memory-sec) by engine, Q1 2016. MapReduce fell from 78% in Jan to 67% in Mar, with Tez and Spark taking the remainder (Jan: 8% and 14%; Mar: 21% and 12%). 112 million batch jobs ran in Q1'16.]

26

Multi-tenant Apache Storm

27

Recent Apache Storm Developments at Yahoo

- MT & RA scheduler
- Distributed cache API
- 8x throughput
- Improved debuggability
- Pacemaker server
- Streaming benchmark¹

¹ github.com/yahoo/streaming-benchmarks

28

Data Sketches Algorithms

Data Sketches Algorithms Library (datasketches.github.io): good-enough approximate answers for problem queries.

Characteristics:
- Streamable
- Approximate with predictable error
- Sub-linear in size
- Mergeable / additive
- Highly parallelizable
- Maven deployable

29

Distinct Count Sketch, High-level View

Basic sketch elements: a big data stream is transformed into a compact data structure; an estimator reads that structure and returns a result ± ε; the discarded detail behaves as white noise.

30
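The idea behind a distinct-count sketch can be shown with a toy K-Minimum-Values (KMV) estimator, the ancestor of the library's Theta sketches (an illustrative Java sketch, not the library's API; the hash function and k are invented choices): hash each item into the unit interval, keep only the k smallest hashes, and read the density off the k-th smallest.

```java
import java.util.TreeSet;

// Toy K-Minimum-Values (KMV) distinct-count sketch, illustrating the idea
// behind Theta sketches. Not the DataSketches library API; the hash below
// is an arbitrary illustrative choice.
public class KmvSketch {
    private final int k;
    private final TreeSet<Double> minHashes = new TreeSet<>();

    KmvSketch(int k) { this.k = k; }

    void update(String item) {
        // Map the item to a pseudo-uniform value in [0, 1). Duplicates hash
        // identically, so re-seeing an item cannot change the sketch.
        long h = item.hashCode() * 0x9E3779B97F4A7C15L;
        double u = ((h >>> 1) % 1_000_000_007L) / 1_000_000_007.0;
        minHashes.add(u);
        if (minHashes.size() > k) minHashes.pollLast(); // keep only the k smallest
    }

    double estimate() {
        if (minHashes.size() < k) return minHashes.size(); // exact while small
        // If the k-th smallest of n uniform values sits at v, then v ~ k/n,
        // so n ~ (k - 1) / v (the -1 makes the estimator unbiased).
        return (k - 1) / minHashes.last();
    }

    public static void main(String[] args) {
        KmvSketch s = new KmvSketch(256);
        for (int i = 0; i < 100_000; i++) s.update("user-" + i);
        for (int i = 0; i < 100_000; i++) s.update("user-" + i); // duplicates ignored
        System.out.printf("estimate: %.0f (true distinct count: 100000)%n", s.estimate());
    }
}
```

The sketch is sub-linear (k values regardless of stream size), streamable (one pass, one update per item), and mergeable (union two sketches by merging their min-hash sets and re-trimming to k), matching the characteristics listed above.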

Data Sketches Algorithms

Data Sketches Algorithms Library

datasketches.github.io

31


Apache HBase at Yahoo

Security

Isolated Deployment

Multi-tenant

Region Server Group

Namespace

Unsupported Features

33

Security

- Authentication: Kerberos (users, processes); delegation tokens (MapReduce, YARN, etc.)
- Authorization: HBase ACLs (Read, Write, Create, Admin); grant permissions to a user or Unix group; ACLs at table, column-family, or column level

34

Region Server Groups

- Dedicated region servers for a set of tables
- Resource isolation (CPU, memory, IO, etc.)

35

Namespaces

- Analogous to a "database"
- Namespace ACL to create tables
- Default group
- Quota
- Contains tables, which contain regions

36

Split Meta to Spread Load and Avoid Large Regions

37

Favored Nodes for HDFS Locality

38

Humongous Tables

39

Scaling HBase to Handle Millions of Regions on a Cluster

Region Server Groups

Split Meta

Split ZK

Favored Nodes

Humongous Tables

40

Transactions on HBase with Omid¹

- Highly performant and fault-tolerant ACID transactional framework
- New Apache Incubator project: incubator.apache.org/projects/omid.html
- Handles millions of transactions per day for search and personalization products

¹ Omid means "hope" in Persian.

41
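Omid's design centers on a central timestamp oracle plus write-set conflict detection at commit time; the gist can be sketched as a toy (illustrative only, not the Omid API; all names are invented): a transaction gets a start timestamp at begin, and commits only if no key in its write set was committed by someone else after that start timestamp.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of lock-free snapshot isolation in the style of Omid's
// timestamp oracle (illustrative, not the Omid API): begin() hands out a
// start timestamp; commit() succeeds only if no key in the write set was
// committed after this transaction's start timestamp.
public class TinyTso {
    private long clock = 0;
    private final Map<String, Long> lastCommit = new HashMap<>(); // key -> commit ts

    synchronized long begin() { return ++clock; }

    synchronized long commit(long startTs, Iterable<String> writeSet) {
        for (String key : writeSet) {
            Long ts = lastCommit.get(key);
            if (ts != null && ts > startTs) return -1; // write-write conflict: abort
        }
        long commitTs = ++clock;
        for (String key : writeSet) lastCommit.put(key, commitTs);
        return commitTs;
    }

    public static void main(String[] args) {
        TinyTso tso = new TinyTso();
        long t1 = tso.begin();
        long t2 = tso.begin();
        System.out.println(tso.commit(t1, java.util.List.of("row1")) > 0); // true: commits
        System.out.println(tso.commit(t2, java.util.List.of("row1")) > 0); // false: aborts
    }
}
```

Centralizing timestamps and conflict checks in one lightweight server is what lets the data path (HBase reads and writes) stay lock-free.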

Omid Components

42

Omid Data Model

43


Oozie Data Pipelines

[Diagram: a data producer writes to HDFS (distcp, Pig, M/R, ...) at /data/click/2014/06/02 and updates metadata: ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'. Oozie (1) queries/polls the partition and (2) registers a topic with HCatalog; HCatalog (3) pushes a <new partition> notification onto the message bus, which (4) notifies Oozie of the new partition, and Oozie starts the workflow.]

45

Large Scale Data Pipeline Requirements

Administrative
- One should be able to start, stop, and pause all related pipelines at the same time

Dependency Management
- Output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency)
- If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
- Start as soon as mandatory data is available; other feeds are optional
- Data is not guaranteed; start processing even if only partial data is available

SLA Management
- Monitor pipeline processing to take immediate action on failures or SLA misses
- Pipeline owners should get notified if an SLA is missed

Multiple Providers
- If data is available from multiple providers, specify the provider priority
- Combine datasets from multiple providers to fill the gaps a single provider may have

46


BCP And Mandatory / Optional Feeds

Pull data from A or B: specify the dataset as AorB. The action starts as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Dataset B is optional: Oozie starts processing as soon as A is available, taking all of A plus whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>

48

Data Not Guaranteed / Priority Among Dataset Instances

A has higher precedence than B, and B higher precedence than C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>

49

Oozie starts processing if at least 10 instances of A are available; min can also be combined with wait (as shown for dataset B).

<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>

Combining Datasets From Multiple Providers

The combine function first takes instances from A, then goes to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>

50


Automated Onboarding / Collaboration Portal

52

Built for Tenant Transparency

53

Queue Utilization Dashboard

54

Data Discovery and Access

55

Audits, Compliance, and Efficiency

Starling, the log warehouse, aggregates log sources:
- FS, job, and task logs from clusters 1..n
- CF, region, action, and query stats from HBase clusters 1..n
- DB, table, partition, and column access stats from metastores MS 1..n
- Data definition, flow, feed, and source from GDM feeds F 1..n

56

Audits, Compliance, and Efficiency (cont'd)

Data discovery and access under governance classification: public and non-sensitive data carry no additional requirements; financial ($) data is restricted, with LMS integration, Stock Admin integration, and an approval flow.

57

Hosted UI – Hue as a Service

[Diagram: users hit a VIP fronting two hot Hue instances (Hue-1.Cluster-1, Hue-2.Cluster-1) running under WSGI, which serve pages and static content (jQuery, Bootstrap, Knockout.js, love). Authentication is SAML against an IdP; cookies, saved queries, workflows, etc. live in an HA Hue MySQL DB. Hue talks REST/Thrift to the Hadoop cluster's HS2, HCat metastore, Oozie server, YARN RM, WebHDFS, and NMs. Full-stack HA.]

58

Going Forward

- Increased intelligence
- Greater speed
- Higher efficiency
- Necessary scale

59

Increased Intelligence

- ML libraries (proven to work at scale, solve complex problems): GBDT, FTRL, SGD, deep learning, random forests
- Applications: click prediction, search ranking, keyword auctions, ad relevance, abuse detection
- YARN (resource manager): heterogeneous scheduling, long-running services, GPUs, large-memory support, core grid enhancements
- Parameter server: globally shared parameters
- Compute engines: distributed processing

60

Greater Speed

Productivity dimensions: data management, ease of use, and analytics/BI & reporting.

Improvements:
- Data management: real-time pipelines, unified metadata & lineage, fine-grained access control, self-serve data movement
- Ease of use: SLA & cost transparency, intuitive UIs, planning & collaboration tools, central grid portal
- Analytics, BI & reporting: query times < 1 sec, 4x speedups in ETL, SQL on HBase, limitless BI clients

61

Higher Efficiency

Achieve five-nines availability and 70% average compute utilization across clusters.

62

Hadoop Users at Yahoo

Slingstone & Aviate, Mail Anti-Spam, Gemini Campaign Mgmt., Search Assist, Audience Analytics, Flickr, YAM+ & Targeting, Membership Abuse ... and many more.

63

Yahoo at the Apache Software Foundation

[Slide: committer and PMC counts across eight Apache projects (project logos not preserved in this transcript): 10 committers (6 PMC), 3 committers (3 PMC), 3 committers (2 PMC), 6 committers (5 PMC), 1 committer, 3 committers (2 PMC), 7 committers (6 PMC), 1 committer.]

64

Join Us @ yahoohadoop.tumblr.com

65

THANK YOU

SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms

Icon Courtesy – iconfinder.com (under Creative Commons)
