HW09: Hadoop Applications at Yahoo!
TRANSCRIPT
Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
Hadoop at Yahoo!
What we want you to know about Yahoo! and Hadoop
• Yahoo! is:
  – The largest contributor to Hadoop
  – The largest tester of Hadoop
  – The largest user of Hadoop
• 4000 node clusters!
  – A great place to do Hadoop development, do internet scale science and change the world!
• Also:
  – We release “The Yahoo! Distribution of Hadoop”
  – We contribute all of our Hadoop work to the Apache Foundation as open source
  – We continue to aggressively invest in Hadoop
  – We do not sell Hadoop service or support!
• We use Hadoop services to run Yahoo!
• The majority of all patches to Hadoop have come from Yahoo!s
  – 72% of core patches
    • Core = HDFS, Map-Reduce, Common
  – Metric is an underestimate
    • Some Yahoo!s use apache.org accounts
    • Patch sizes vary
• Yahoo! is the largest employer of Hadoop contributors, by far!
  – We contribute ALL of our Hadoop development work back to Apache!
• We are hiring!
  – Sunnyvale, Bangalore, Beijing
  – See http://hadoop.yahoo.com
The Largest Hadoop Contributor
[Charts: breakdown of All Patches and Core Patches by contributor]
The Largest Hadoop Tester
• Every release of The Yahoo! Distribution of Hadoop goes through multiple levels of testing before we declare it stable
• 4 tiers of Hadoop clusters:
  – Development, Testing and QA (~10% of our hardware)
    • Continuous integration / testing of new code
  – Proof of Concepts and Ad-Hoc work (~10% of our hardware)
    • Runs the latest version of Hadoop – currently 0.20 (RC6)
  – Science and Research (~60% of our hardware)
    • Runs more stable versions – currently 0.20 (RC5)
  – Production (~20% of our hardware)
    • The most stable version of Hadoop – currently 0.18.3
• We continue to grow our Quality Engineering team
  – We are hiring!!
The Largest Hadoop User
[Chart: cumulative total nodes and cumulative storage at Yahoo!, 2006 to now. Total Nodes = 24,146; Total Storage = 82 PB. Series: Hardware, Internal Hadoop Users. Annotations: >82 PB of disk and >25,000 nodes today; building a new datacenter.]
Collaborations around the globe
• The Apache Foundation – http://hadoop.apache.org
  – Hundreds of contributors and thousands of users of Hadoop!
  – See http://wiki.apache.org/hadoop/PoweredBy
• The Yahoo! Distribution of Hadoop – opening up our testing
  – Cloudera
• M45 – Yahoo!’s shared research supercomputing cluster
  – Carnegie Mellon University
  – The University of California at Berkeley
  – Cornell University
  – The University of Massachusetts at Amherst
• Partners in India
  – Computational Research Laboratories (CRL), India's Tata Group
  – Universities – IIT-B, IISc, IIIT-H, PSG
• Open Cirrus™ – cloud computing research & education
  – The University of Illinois at Urbana-Champaign
  – Infocomm Development Authority (IDA) in Singapore
  – The Karlsruhe Institute of Technology (KIT) in Germany
  – HP, Intel
  – The Russian Academy of Sciences, Electronics & Telecomm.
  – Malaysian Institute of Microelectronic Systems
Usage of Hadoop
Why Hadoop @ Yahoo!?
• Massive scale
  – 500M+ unique users per month
  – Billions of “transactions” per day
  – Many petabytes of data
• Analysis and data processing are key to our business
  – Need to process all of that data in a timely manner
  – Lots of ad hoc investigation to look for patterns, run reports…
• Need to do this cost effectively
  – Use low cost commodity hardware
  – Share resources among multiple projects
  – Try new things quickly at huge scale
  – Handle the failure of some of that hardware every day
• The Hadoop infrastructure provides these capabilities (a minimal sketch of such an ad hoc job follows below)
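As a rough illustration of the kind of ad hoc report described above, here is a minimal Hadoop Streaming job in Python that counts page views per user from tab-separated access logs. The log layout (user ID in the first field) and all file names are assumptions made for this sketch, not Yahoo!'s actual data or code.

```python
#!/usr/bin/env python
# mapper.py -- emit (user_id, 1) for every log line.
# Assumes tab-separated access logs with the user ID in the first field (illustrative).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each user.
# Hadoop Streaming delivers lines sorted by key, so each user's records arrive grouped.
import sys

current_user, count = None, 0
for line in sys.stdin:
    user, _, value = line.rstrip("\n").partition("\t")
    if not value:
        continue
    if user != current_user:
        if current_user is not None:
            print("%s\t%d" % (current_user, count))
        current_user, count = user, 0
    count += int(value)
if current_user is not None:
    print("%s\t%d" % (current_user, count))
```

A job like this would typically be launched with the streaming jar that ships with Hadoop, e.g. `hadoop jar hadoop-streaming.jar -input logs/ -output report/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths illustrative).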
Yahoo! front page - Case Study
[Diagram, built up over three slides: Yahoo! front page components backed by Hadoop – Search Index, Ads Optimization, Content Optimization, Machine Learned Spam Filters, RSS Feeds]
Large Applications: 2008 vs. 2009
• Webmap
  – 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
  – 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output, on +55% hardware
• Sort benchmarks (Jim Gray contest)
  – 2008: 1 terabyte sorted in 209 seconds on 900 nodes
  – 2009: 1 terabyte sorted in 62 seconds on 1,500 nodes; 1 petabyte sorted in 16.25 hours on 3,700 nodes
• Largest cluster
  – 2008: 2,000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
  – 2009: 4,000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
Tremendous Impact on Productivity
• Makes developers & scientists more productive
  – Research questions answered in days, not months
  – Projects move from research to production easily
  – Easy to learn! “Even our rocket scientists use it!”
• The major factors
  – You don’t need to find new hardware to experiment
  – You can work with all your data!
  – Production and research based on same framework
  – No need for R&D to do I.T., the clusters just work
Example: Search Assist™

                     Before Hadoop   After Hadoop
Time                 26 days         20 minutes
Language             C++             Python
Development Time     2-3 weeks       2-3 days

• Database for Search Assist™ is built using Hadoop
• 3 years of log-data
• 20 steps of map-reduce (an illustrative sketch of such a chained pipeline follows below)
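The slide describes a 20-step map-reduce pipeline written in Python. Yahoo!'s actual pipeline is not shown here, but as a hedged sketch, a chain like that is often driven by a small script that launches one Hadoop Streaming job per step and feeds each step's output into the next. The streaming jar location, step scripts, and HDFS paths below are all assumptions for illustration.

```python
#!/usr/bin/env python
# pipeline.py -- illustrative driver for a multi-step Hadoop Streaming pipeline.
# Each step is a (mapper, reducer) pair; step N reads the output of step N-1.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # assumed location

# Hypothetical steps of a query-suggestion build; the real pipeline had ~20 such steps.
STEPS = [
    ("extract_queries.py", "dedupe_queries.py"),
    ("count_suffixes.py", "sum_counts.py"),
    ("rank_suggestions.py", "top_k.py"),
]

def run_step(step_no, mapper, reducer, input_path, output_path):
    """Launch one streaming job and fail fast if Hadoop reports an error."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,
        "-file", reducer,
    ]
    print("step %d: %s" % (step_no, " ".join(cmd)))
    subprocess.check_call(cmd)

def main():
    input_path = "logs/raw/"            # assumed HDFS input directory
    for i, (mapper, reducer) in enumerate(STEPS):
        output_path = "searchassist/step%02d/" % (i + 1)
        run_step(i + 1, mapper, reducer, input_path, output_path)
        input_path = output_path        # chain: next step reads this step's output

if __name__ == "__main__":
    main()
```

Driving long chains by hand is exactly the kind of task that Oozie, the workflow and scheduling system mentioned under Futures below, is meant to take over.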
Futures
Current Yahoo! Development
• Hadoop
  – Backwards compatibility
    • New APIs in 0.21, Avro
  – Append, Sync, Flush
  – Improved scheduling
    • Capacity Scheduler, hierarchical queues in 0.21
  – Security – Kerberos authentication
  – GridMix3 / Mumak simulator / better trace & log analysis
• Pig
  – SQL and metadata
  – Column-oriented storage access layer – Zebra
  – Multi-query, lots of other optimizations
• Oozie
  – New workflow and scheduling system
• Quality Engineering
  – Many more tests
  – Stress tests, fault injection, functional test coverage, performance regression…
Questions?
Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
For more information:
http://hadoop.apache.org/
http://hadoop.yahoo.com/ (including job openings)