
Page 1: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Grid Computing at Yahoo!

Sameer Paranjpye

Mahadev Konar

Yahoo!

Page 2: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Condor Week 2007

Outline

• Introduction
  – What do we mean by 'Grid'?
  – Technology Overview

• Technologies
  – HDFS
  – Hadoop Map-Reduce
  – Hadoop on Demand

Page 3: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Introduction

Page 4: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


What do we mean by ‘Grid’?

• Computing platform that can support many distributed applications
  – Runs on dedicated clusters of commodity PCs (a Grid)
  – Hardware can be dynamically allocated to a "job"
  – Plan to support many applications per Grid

• Good for batch data processing
  – Log processing
  – Document analysis and indexing
  – Web graphs and crawling

• Large scale a primary design goal
  – 10,000 PCs per Grid a design goal (working at 1,000 now)
  – Very large data (10 petabytes of storage a design goal)
  – 100+ TB inputs to a single job
  – Bandwidth to data is a significant design driver (rough estimate below)

• Large production deployments
  – The number of CPUs that can be applied gates what you can do
  – Several clusters of 1,000s of nodes
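As a rough, back-of-the-envelope illustration of why bandwidth to data drives the design (the ~50 MB/s per-disk figure is an assumption about commodity disks of the era, not a number from the talk): scanning a 100 TB input through a single disk at ~50 MB/s takes about 2,000,000 seconds, i.e. over three weeks. Spread the same data across 1,000 nodes each reading a local disk in parallel and the same scan takes roughly half an hour. That is the case for spreading data over every node and moving computation to it.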

Page 5: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


Technology Overview

• Hadoop (our primary Grid project)
  – An open source Apache project, started by Doug Cutting
  – HDFS, a distributed file system
  – Implementation of the Map-Reduce programming model
  – http://lucene.apache.org/hadoop

• HOD (Hadoop-on-Demand)
  – Adaptor that runs Hadoop tools on batch systems
  – Hadoop expressed as a parallel job
  – Manages setup, startup, shutdown, and cleanup of Hadoop
  – Currently supports Condor and Torque

Page 6: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Technologies

Page 7: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


HDFS - Hadoop Distributed FS

• Very large distributed file system
  – We plan to support 10k nodes and 10 PB of data
  – Current deployment of 1k+ nodes, 1 PB of data

• Assumes commodity hardware that fails
  – Files are replicated to handle hardware failure
  – Checksums for corruption detection and recovery
  – Continues operation as nodes / racks are added / removed

• Optimized for fast batch processing
  – Data location is exposed so computation can move to the data (see the client sketch below)
  – Stores data in chunks on every node in the cluster
  – Provides very high aggregate bandwidth
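To make the client side concrete, here is a minimal sketch (not from the talk) of writing and reading a file through Hadoop's Java FileSystem API and asking the Namenode where a file's blocks live, i.e. the exposed data-location information mentioned above. The class name, paths, and replication factor are illustrative, and method names vary slightly across Hadoop releases:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        // Picks up the Namenode address (fs.default.name) from the Hadoop config files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks (3 copies here) across Datanodes.
        Path file = new Path("/user/sameerp/example.txt");   // illustrative path
        FSDataOutputStream out =
            fs.create(file, true, 4096, (short) 3, fs.getDefaultBlockSize());
        out.writeUTF("hello grid");
        out.close();

        // Read it back; block data streams directly from the Datanodes that hold it.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        // Metadata-only call to the Namenode: which hosts hold each block of the file?
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          for (String host : loc.getHosts()) {
            System.out.println("block replica on " + host);
          }
        }
      }
    }

A Map-Reduce job does not normally query block locations itself; the framework uses this information to schedule map tasks on nodes that already hold the data.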

Page 8: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


Hadoop DFS Architecture

[Architecture diagram: clients send metadata operations to a single Namenode, which tracks per-file metadata such as name and replication factor (e.g. /home/sameerp/foo, 3 replicas; /home/sameerp/docs, 4 replicas). File I/O goes directly between clients and the Datanodes, which are spread across racks (Rack 1, Rack 2).]

Page 9: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


Hadoop Map-Reduce

• Implementation of the Map-Reduce programming model
  – Framework for distributed processing of large data sets
  – Resilient to nodes failing and joining during a job
  – Great for web data and log processing

• Pluggable user code runs in a generic, reusable framework
  – Input records are transformed, sorted, and combined to produce a new output
  – All actions pluggable / configurable

• A reusable design pattern:
  Input | Map | Shuffle | Reduce | Output

  (for example, the Unix pipeline)
  cat * | grep | sort | uniq -c > file
  (a Hadoop word-count sketch of the same pattern follows below)
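For concreteness, here is a minimal word-count sketch of the same Input | Map | Shuffle | Reduce | Output pattern, written against Hadoop's original org.apache.hadoop.mapred Java API (a sketch only: the exact job-setup calls vary across early Hadoop releases, and the input/output paths are illustrative):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

      // Map: emit (word, 1) for every token in an input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            out.collect(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word (the shuffle has grouped values by key).
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          out.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // local pre-aggregation before the shuffle
        conf.setReducerClass(Reduce.class);
        // Input and output paths are illustrative.
        FileInputFormat.setInputPaths(conf, new Path("/user/sameerp/data"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/sameerp/wordcounts"));
        JobClient.runJob(conf);
      }
    }

In the Unix analogy above, the map step plays the role of grep, the shuffle plays sort, and the reduce plays uniq -c.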

Page 10: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


HOD (Hadoop on Demand)

• Adaptor that enables Hadoop use with batch schedulers
  – Provisions Hadoop clusters on demand
  – Scheduling is handled by resource managers like Condor
  – Requests N nodes from a resource manager and provisions them with a Hadoop cluster

• Condor interaction
  – User specifies: number of nodes, workload to launch
  – HOD generates class-ads for the Hadoop master and slaves and submits them as Condor jobs
  – Cluster comes up when the jobs start running
  – HOD launches the workload

Page 11: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


HOD (Hadoop on Demand)

• HOD shell
  – The user interface to HOD is a command shell
  – Workloads are specified as command lines
  – Example:
      % bin/hod -c hodconf -n 100
      >> run hadoop-streaming.jar -mapper 'grep condor' -reducer 'uniq -c'
             -input /user/sameerp/data -output /user/sameerp/condor

• Work in progress
  – Data affinity for workloads
  – Implementation of elastic workloads
  – Software distribution via BitTorrent

Page 12: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


Hadoop on Condor

• Clients launch jobs
• Condor dynamically allocates clusters
• HOD is used to start Hadoop Map-Reduce on the cluster
• Map-Reduce reads/writes data from HDFS
• When done:
  – Results are stored in HDFS and/or returned to the client
  – Condor reclaims the nodes

[Diagram: Client 1 and Client 2 submit jobs to Condor, which brings up dynamic Hadoop Map-Reduce clusters that read and write data from HDFS.]

Page 13: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


Other things in the works

• Record I/O
  – Define a structure once, use it in C, Java, Python…
  – Export it in a binary or XML format

• Streaming
  – A simple way to use existing Unix filters and/or stdin/stdout programs in any language with Map-Reduce

• Pig (Y! Research)
  – Higher-level data manipulation language that uses Hadoop
  – Data analysis tasks are expressed as queries, in the style of SQL or relational algebra
  – http://research.yahoo.com/project/pig

Page 14: Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!


The end
