Posted on 06-Jan-2017

HADOOP PLATFORM AT YAHOO: A YEAR IN REVIEW

SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms

Agenda

1. Platform Overview
2. Infrastructure and Metrics
3. CaffeOnSpark for Distributed DL
4. Compute and Sketches
5. HBase and Omid
6. Oozie
7. Ease of Use
8. Q&A

2

[Chart: Hadoop footprint by year, 2006–2016; # servers on one axis (0–50,000) and raw HDFS in PB on the other (0–800).]

- Yahoo! commits to scaling Hadoop for production use
- Research workloads in search and advertising
- Production (modeling) with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Next-gen Hadoop (H 0.23, YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache H2.7 (scalable ML, latency, utilization, productivity)

Platform Evolution

3

Deployment Models

On-premise:
- Private (dedicated) clusters: large, demanding use cases; new technology not yet platformized; data movement and regulation issues
- Hosted multi-tenant (private cloud) clusters: source of truth for all of an org's data; operational efficiency and cost savings through economies of scale
- Purpose-built big data clusters: built for performance and tighter integration with the tech stack; value-added services such as monitoring, alerts, tuning, and common tools

Public cloud:
- Hosted compute clusters: when more cost-effective than on-premise; when time to market / results matter; when data is already in the public cloud; app delivery agility

4

Platform Today

Apache / open source projects and Yahoo projects, by layer:
- Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
- Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
- Compute: YARN, CS, MR, Tez, Spark, Storm
- Storage / messaging: HDFS, HBase, HCat, Kafka, CMS, DH

5

Technology Stack Assembly

- HDFS (file system)
- YARN (scheduling, resource management)
- Common: RHEL6 64-bit, JDK8

Legend: platformized tech with production support vs. in-progress, unmet needs, or Apache alignment.

6

Common Backplane

[Diagram: per-cluster worker daemons (DataNode/NodeManager, DataNodes/RegionServers, Supervisor) under their masters (NameNode, RM, HBase Master, Nimbus), plus ZooKeeper pools, HTTP/HDFS/GDM load proxies, Oozie server, HS2/HCat, data feeds and data stores, tied together by administration, management, and monitoring over a common network backplane.]

7

Research Cluster Consolidation

One-month sample (2015), total vs. used compute (TB):
- Cluster 1 (2,000 servers): HDFS 12 PB, compute 23 TB, avg. util. 26%
- Cluster 2 (3,100 servers): HDFS 21 PB, compute 52 TB, avg. util. 40%
- Cluster 3 (5,400 servers): HDFS 36 PB, compute 70 TB, avg. util. 59%

8

Consolidated Research Cluster Characteristics

One-month sample (2016), total vs. used compute (TB):
- Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. util. 70%
- Before: 10,500 servers; after: 2,200 servers
- 40% decrease in TCO, 65% increase in compute capacity, 50% increase in avg. utilization

9
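The headline gains are consistent with the per-cluster figures from the 2015 sample; the following is plain arithmetic on the numbers above (a sanity check, not data from any other source):

```java
public class ConsolidationCheck {
    public static void main(String[] args) {
        // 2015 sample: per-cluster compute (TB) and average utilization
        double[] computeTb = {23, 52, 70};
        double[] avgUtil = {0.26, 0.40, 0.59};

        double totalBefore = 0, usedBefore = 0;
        for (int i = 0; i < computeTb.length; i++) {
            totalBefore += computeTb[i];
            usedBefore += computeTb[i] * avgUtil[i];
        }

        // 2016 sample: consolidated cluster
        double totalAfter = 240, utilAfter = 0.70;

        double capacityGain = (totalAfter - totalBefore) / totalBefore;
        double utilBefore = usedBefore / totalBefore;     // capacity-weighted
        double utilGain = (utilAfter - utilBefore) / utilBefore;

        // Both round to the slide's headline figures (~65% and ~50%)
        System.out.printf("compute capacity: +%.1f%%%n", capacityGain * 100);
        System.out.printf("avg. utilization: +%.1f%%%n", utilGain * 100);
    }
}
```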

Common Hadoop Cluster Configuration

[Diagram: racks 1 through N of CPU servers with JBODs and 10GbE, connected over the network backplane.]

10

New Hadoop Cluster Configuration

[Diagram: racks 1 through N of CPU servers with JBODs and 10GbE over the network backplane, plus GPU servers and hi-mem servers on 100 Gbps InfiniBand.]

11

YARN Node Labels

[Diagram: a Hadoop cluster with nodes labeled x and y; Queue 1 (40%, label x), Queue 2 (40%, labels x, y), and Queue 3 (20%, no label) running jobs J1–J4.]

yarn.scheduler.capacity.root.<queue-name>.accessible-node-labels = <label-name>
yarn.scheduler.capacity.root.<queue-name>.default-node-label-expression sets the default label asked for by the queue

12
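The matching rule those properties encode can be sketched as a toy model (illustrative Java, not YARN code; the class and method names are invented): a queue may only request labels in its accessible set, and a labeled request is only satisfied by nodes carrying that label.

```java
import java.util.Set;

// Toy model of YARN node-label matching (not actual YARN code):
// a request is placeable on a node iff the requested label is in the
// queue's accessible-node-labels and matches the node's label; an empty
// label expression targets the unlabeled (default) partition.
public class NodeLabelDemo {
    static boolean canPlace(Set<String> accessibleLabels, String requested, String nodeLabel) {
        if (requested.isEmpty()) return nodeLabel.isEmpty(); // default partition
        return accessibleLabels.contains(requested) && requested.equals(nodeLabel);
    }

    public static void main(String[] args) {
        Set<String> queue1 = Set.of("x");      // Queue 1 (40%): label x
        Set<String> queue2 = Set.of("x", "y"); // Queue 2 (40%): labels x, y
        Set<String> queue3 = Set.of();         // Queue 3 (20%): default partition

        System.out.println(canPlace(queue1, "x", "x")); // true
        System.out.println(canPlace(queue1, "y", "y")); // false: y not accessible
        System.out.println(canPlace(queue2, "y", "y")); // true
        System.out.println(canPlace(queue3, "", ""));   // true: unlabeled node
    }
}
```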


CaffeOnSpark – Distributed Deep Learning

- CaffeOnSpark for DL
- MLlib for non-DL
- Hive or SparkSQL
- Spark
- YARN (RM and scheduling)
- HDFS (datasets)

14

A Few Use Cases – Yahoo Weather

15

A Few Use Cases – Flickr Facial Recognition

16

A Few Use Cases – Flickr Scene Detection

17

CaffeOnSpark Architecture – Common Cluster

[Diagram: a Spark driver coordinating several Spark executors (for data feeding and control), each running Caffe (enhanced with multi-GPU/CPU) and a model synchronizer (across nodes) against HDFS datasets; model output lands on HDFS; synchronization runs over MPI on RDMA / TCP.]

18

CaffeOnSpark Architecture – Incremental Learning

Feature engineering: deep learning

cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL

19

CaffeOnSpark Architecture – Incremental Learning

Feature engineering: deep learning

cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL

Train classifiers: non-deep learning

lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                    .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
...

20

CaffeOnSpark Architecture – Single Command

spark-submit --num-executors #Exes --class CaffeOnSpark my-caffe-on-spark.jar \
    -devices #GPUs -model dl_model_file -output lr_model_file

21

Distributed Deep Learning

CaffeOnSpark (github.com/yahoo/caffeonspark):
- Apache license
- Runs on existing clusters
- Powerful DL platform
- Fully distributed
- High-level API
- Incremental learning

22


Hadoop Compute Sources

- HDFS (file system and storage)
- YARN (resource management and scheduling)
- MapReduce (legacy), Tez (execution engine for Pig and Hive), Spark (alternate exec engine)
- Pig (scripting), Hive (SQL), Java MR APIs
- Data processing, ML, custom apps on Slider
- Oozie, data management

24

Compute Growth

[Chart: # MR, Tez, Spark jobs (in millions) per month, Mar-13 through Mar-16, growing from 13.3 to 39.1 million (labeled points: 13.3, 20.4, 23.8, 27.2, 32.3, 34.1, 39.1).]

25

Pushing Batch Compute Boundaries

[Chart: % of total compute (memory-sec) by engine, Q1 2016. MapReduce fell from 78% in Jan to 67% in Mar, with Tez and Spark taking the remainder (Jan: 8% and 14%; Mar: 21% and 12%). 112 million batch jobs ran in Q1'16.]

26

Multi-tenant Apache Storm

27

Recent Apache Storm Developments at Yahoo

- MT & RA scheduler
- Distributed cache API
- 8x throughput
- Improved debuggability
- Pacemaker server
- Streaming benchmark¹

¹ github.com/yahoo/streaming-benchmarks

28

Data Sketches Algorithms

Data Sketches Algorithms Library (datasketches.github.io): good-enough approximate answers for problem queries.

Characteristics:
- Streamable
- Approximate with predictable error
- Sub-linear in size
- Mergeable / additive
- Highly parallelizable
- Maven deployable

29

Distinct Count Sketch, High-level View

Basic sketch elements: a big data stream is transformed into a compact data structure; an estimator reads that structure and returns a result ± ε; the discarded detail behaves as white noise.

30
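The idea behind a distinct-count sketch can be shown with a toy K-Minimum-Values (KMV) estimator, the ancestor of the library's Theta sketches (an illustrative Java sketch, not the library's API; the hash function and k are invented choices): hash each item into the unit interval, keep only the k smallest hashes, and read the density off the k-th smallest.

```java
import java.util.TreeSet;

// Toy K-Minimum-Values (KMV) distinct-count sketch, illustrating the idea
// behind Theta sketches. Not the DataSketches library API; the hash below
// is an arbitrary illustrative choice.
public class KmvSketch {
    private final int k;
    private final TreeSet<Double> minHashes = new TreeSet<>();

    KmvSketch(int k) { this.k = k; }

    void update(String item) {
        // Map the item to a pseudo-uniform value in [0, 1). Duplicates hash
        // identically, so re-seeing an item cannot change the sketch.
        long h = item.hashCode() * 0x9E3779B97F4A7C15L;
        double u = ((h >>> 1) % 1_000_000_007L) / 1_000_000_007.0;
        minHashes.add(u);
        if (minHashes.size() > k) minHashes.pollLast(); // keep only the k smallest
    }

    double estimate() {
        if (minHashes.size() < k) return minHashes.size(); // exact while small
        // If the k-th smallest of n uniform values sits at v, then v ~ k/n,
        // so n ~ (k - 1) / v (the -1 makes the estimator unbiased).
        return (k - 1) / minHashes.last();
    }

    public static void main(String[] args) {
        KmvSketch s = new KmvSketch(256);
        for (int i = 0; i < 100_000; i++) s.update("user-" + i);
        for (int i = 0; i < 100_000; i++) s.update("user-" + i); // duplicates ignored
        System.out.printf("estimate: %.0f (true distinct count: 100000)%n", s.estimate());
    }
}
```

The sketch is sub-linear (k values regardless of stream size), streamable (one pass, one update per item), and mergeable (union two sketches by merging their min-hash sets and re-trimming to k), matching the characteristics listed above.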

Data Sketches Algorithms

Data Sketches Algorithms Library

datasketches.github.io

31


Apache HBase at Yahoo

Security

Isolated Deployment

Multi-tenant

Region Server Group

Namespace

Unsupported Features

33

Security

- Authentication: Kerberos (users, processes); delegation tokens (MapReduce, YARN, etc.)
- Authorization: HBase ACLs (Read, Write, Create, Admin); grant permissions to a user or Unix group; ACLs at table, column-family, or column level

34

Region Server Groups

- Dedicated region servers for a set of tables
- Resource isolation (CPU, memory, IO, etc.)

35

Namespaces

- Analogous to a "database"
- Namespace ACL to create tables
- Default group
- Quota
- Contains tables, which contain regions

36

Split Meta to Spread Load and Avoid Large Regions

37

Favored Nodes for HDFS Locality

38

Humongous Tables

39

Scaling HBase to Handle Millions of Regions on a Cluster

Region Server Groups

Split Meta

Split ZK

Favored Nodes

Humongous Tables

40

Transactions on HBase with Omid¹

- Highly performant and fault-tolerant ACID transactional framework
- New Apache Incubator project: incubator.apache.org/projects/omid.html
- Handles millions of transactions per day for search and personalization products

¹ Omid means "hope" in Persian.

41
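Omid's design centers on a central timestamp oracle plus write-set conflict detection at commit time; the gist can be sketched as a toy (illustrative only, not the Omid API; all names are invented): a transaction gets a start timestamp at begin, and commits only if no key in its write set was committed by someone else after that start timestamp.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of lock-free snapshot isolation in the style of Omid's
// timestamp oracle (illustrative, not the Omid API): begin() hands out a
// start timestamp; commit() succeeds only if no key in the write set was
// committed after this transaction's start timestamp.
public class TinyTso {
    private long clock = 0;
    private final Map<String, Long> lastCommit = new HashMap<>(); // key -> commit ts

    synchronized long begin() { return ++clock; }

    synchronized long commit(long startTs, Iterable<String> writeSet) {
        for (String key : writeSet) {
            Long ts = lastCommit.get(key);
            if (ts != null && ts > startTs) return -1; // write-write conflict: abort
        }
        long commitTs = ++clock;
        for (String key : writeSet) lastCommit.put(key, commitTs);
        return commitTs;
    }

    public static void main(String[] args) {
        TinyTso tso = new TinyTso();
        long t1 = tso.begin();
        long t2 = tso.begin();
        System.out.println(tso.commit(t1, java.util.List.of("row1")) > 0); // true: commits
        System.out.println(tso.commit(t2, java.util.List.of("row1")) > 0); // false: aborts
    }
}
```

Centralizing timestamps and conflict checks in one lightweight server is what lets the data path (HBase reads and writes) stay lock-free.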

Omid Components

42

Omid Data Model

43


Oozie Data Pipelines

[Diagram: a data producer writes to HDFS (distcp, Pig, M/R, ...) at /data/click/2014/06/02 and updates metadata: ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'. Oozie (1) queries/polls the partition and (2) registers a topic with HCatalog; HCatalog (3) pushes a <new partition> notification onto the message bus, which (4) notifies Oozie of the new partition, and Oozie starts the workflow.]

45

Large Scale Data Pipeline Requirements

Administrative
- One should be able to start, stop, and pause all related pipelines at the same time

Dependency Management
- Output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency)
- If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
- Start as soon as mandatory data is available; other feeds are optional
- Data is not guaranteed; start processing even if only partial data is available

SLA Management
- Monitor pipeline processing to take immediate action on failures or SLA misses
- Pipeline owners should get notified if an SLA is missed

Multiple Providers
- If data is available from multiple providers, specify the provider priority
- Combine datasets from multiple providers to fill the gaps a single provider may have

46


BCP And Mandatory / Optional Feeds

Pull data from A or B: specify the dataset as AorB. The action starts as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Dataset B is optional: Oozie starts processing as soon as A is available, taking all of A plus whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>

48

Data Not Guaranteed / Priority Among Dataset Instances

A has higher precedence than B, and B higher precedence than C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>

49

Oozie starts processing if at least 10 instances of A are available; min can also be combined with wait (as shown for dataset B).

<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>

Combining Datasets From Multiple Providers

The combine function first takes instances from A, then goes to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>

50


Automated Onboarding / Collaboration Portal

52

Built for Tenant Transparency

53

Queue Utilization Dashboard

54

Data Discovery and Access

55

Audits, Compliance, and Efficiency

Starling, the log warehouse, aggregates log sources:
- FS, job, and task logs from clusters 1..n
- CF, region, action, and query stats from HBase clusters 1..n
- DB, table, partition, and column access stats from metastores MS 1..n
- Data definition, flow, feed, and source from GDM feeds F 1..n

56

Audits, Compliance, and Efficiency (cont'd)

Data discovery and access under governance classification: public and non-sensitive data carry no additional requirements; financial ($) data is restricted, with LMS integration, Stock Admin integration, and an approval flow.

57

Hosted UI – Hue as a Service

[Diagram: users hit a VIP fronting two hot Hue instances (Hue-1.Cluster-1, Hue-2.Cluster-1) running under WSGI, which serve pages and static content (jQuery, Bootstrap, Knockout.js, love). Authentication is SAML against an IdP; cookies, saved queries, workflows, etc. live in an HA Hue MySQL DB. Hue talks REST/Thrift to the Hadoop cluster's HS2, HCat metastore, Oozie server, YARN RM, WebHDFS, and NMs. Full-stack HA.]

58

Going Forward

- Increased intelligence
- Greater speed
- Higher efficiency
- Necessary scale

59

Increased Intelligence

- ML libraries (proven to work at scale, solve complex problems): GBDT, FTRL, SGD, deep learning, random forests
- Applications: click prediction, search ranking, keyword auctions, ad relevance, abuse detection
- YARN (resource manager): heterogeneous scheduling, long-running services, GPUs, large-memory support, core grid enhancements
- Parameter server: globally shared parameters
- Compute engines: distributed processing

60

Greater Speed

Productivity dimensions: data management, ease of use, and analytics/BI & reporting.

Improvements:
- Data management: real-time pipelines, unified metadata & lineage, fine-grained access control, self-serve data movement
- Ease of use: SLA & cost transparency, intuitive UIs, planning & collaboration tools, central grid portal
- Analytics, BI & reporting: query times < 1 sec, 4x speedups in ETL, SQL on HBase, limitless BI clients

61

Higher Efficiency

Achieve five-nines availability and 70% average compute utilization across clusters.

62

Hadoop Users at Yahoo

Slingstone & Aviate, Mail Anti-Spam, Gemini Campaign Mgmt., Search Assist, Audience Analytics, Flickr, YAM+ & Targeting, Membership Abuse ... and many more.

63

Yahoo at the Apache Software Foundation

[Slide: committer and PMC counts across eight Apache projects (project logos not preserved in this transcript): 10 committers (6 PMC), 3 committers (3 PMC), 3 committers (2 PMC), 6 committers (5 PMC), 1 committer, 3 committers (2 PMC), 7 committers (6 PMC), 1 committer.]

64

Join Us @ yahoohadoop.tumblr.com

65

THANK YOU

SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms

Icon Courtesy – iconfinder.com (under Creative Commons)
