HW09: Hadoop Applications at Yahoo!
TRANSCRIPT
Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
Hadoop at Yahoo!
What we want you to know about Yahoo! and Hadoop
• Yahoo! is:
  – The largest contributor to Hadoop
  – The largest tester of Hadoop
  – The largest user of Hadoop
• 4000 node clusters!
  – A great place to do Hadoop development, do internet scale science and change the world!
• Also:
  – We release “The Yahoo! Distribution of Hadoop”
  – We contribute all of our Hadoop work to the Apache Foundation as open source
  – We continue to aggressively invest in Hadoop
  – We do not sell Hadoop service or support!
• We use Hadoop services to run Yahoo!
• The majority of all patches to Hadoop have come from Yahoo!s
  – 72% of core patches
    • Core = HDFS, Map-Reduce, Common
  – Metric is an underestimate
    • Some Yahoo!s use apache.org accounts
    • Patch sizes vary
• Yahoo! is the largest employer of Hadoop contributors, by far!
  – We contribute ALL of our Hadoop development work back to Apache!
• We are hiring!
  – Sunnyvale, Bangalore, Beijing
  – See http://hadoop.yahoo.com
The Largest Hadoop Contributor
[Charts: breakdown of All Patches and Core Patches by contributor]
The Largest Hadoop Tester
• Every release of The Yahoo! Distribution of Hadoop goes through multiple levels of testing before we declare it stable
• 4 tiers of Hadoop clusters:
  – Development, Testing and QA (~10% of our hardware)
    • Continuous integration / testing of new code
  – Proof of Concepts and Ad-Hoc work (~10% of our hardware)
    • Runs the latest version of Hadoop – currently 0.20 (RC6)
  – Science and Research (~60% of our hardware)
    • Runs more stable versions – currently 0.20 (RC5)
  – Production (~20% of our hardware)
    • The most stable version of Hadoop – currently 0.18.3
• We continue to grow our Quality Engineering team
  – We are hiring!!
The Largest Hadoop User
[Chart: cumulative total nodes and cumulative storage at Yahoo!, 2006 to now. Total Nodes = 24,146; Total Storage = 82 PB. Series: Hardware, Internal Hadoop Users. Annotations: >82 PB of disk and >25,000 nodes today; building a new datacenter.]
Collaborations around the globe
• The Apache Foundation – http://hadoop.apache.org
  – Hundreds of contributors and thousands of users of Hadoop!
  – See http://wiki.apache.org/hadoop/PoweredBy
• The Yahoo! Distribution of Hadoop – opening up our testing
  – Cloudera
• M45 – Yahoo!’s shared research supercomputing cluster
  – Carnegie Mellon University
  – The University of California at Berkeley
  – Cornell University
  – The University of Massachusetts at Amherst
• Partners in India
  – Computational Research Laboratories (CRL), India's Tata Group
  – Universities – IIT-B, IISc, IIIT-H, PSG
• Open Cirrus™ – cloud computing research & education
  – The University of Illinois at Urbana-Champaign
  – Infocomm Development Authority (IDA) in Singapore
  – The Karlsruhe Institute of Technology (KIT) in Germany
  – HP, Intel
  – The Russian Academy of Sciences, Electronics & Telecomm.
  – Malaysian Institute of Microelectronic Systems
Usage of Hadoop
Why Hadoop @ Yahoo!?
• Massive scale
  – 500M+ unique users per month
  – Billions of “transactions” per day
  – Many petabytes of data
• Analysis and data processing are key to our business
  – Need to process all of that data in a timely manner
  – Lots of ad hoc investigation to look for patterns, run reports…
• Need to do this cost effectively
  – Use low cost commodity hardware
  – Share resources among multiple projects
  – Try new things quickly at huge scale
  – Handle the failure of some of that hardware every day
• The Hadoop infrastructure provides these capabilities (a minimal sketch of such an ad hoc job follows below)
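As a rough illustration of the kind of ad hoc report described above, here is a minimal Hadoop Streaming job in Python that counts page views per user from tab-separated access logs. The log layout (user ID in the first field) and all file names are assumptions made for this sketch, not Yahoo!'s actual data or code.

```python
#!/usr/bin/env python
# mapper.py -- emit (user_id, 1) for every log line.
# Assumes tab-separated access logs with the user ID in the first field (illustrative).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each user.
# Hadoop Streaming delivers lines sorted by key, so each user's records arrive grouped.
import sys

current_user, count = None, 0
for line in sys.stdin:
    user, _, value = line.rstrip("\n").partition("\t")
    if not value:
        continue
    if user != current_user:
        if current_user is not None:
            print("%s\t%d" % (current_user, count))
        current_user, count = user, 0
    count += int(value)
if current_user is not None:
    print("%s\t%d" % (current_user, count))
```

A job like this would typically be launched with the streaming jar that ships with Hadoop, e.g. `hadoop jar hadoop-streaming.jar -input logs/ -output report/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths illustrative).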
Yahoo! front page - Case Study
[Diagram, built up over three slides: Yahoo! front page components backed by Hadoop – Search Index, Ads Optimization, Content Optimization, Machine Learned Spam Filters, RSS Feeds]
Large Applications: 2008 vs. 2009
• Webmap
  – 2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output
  – 2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output, on +55% hardware
• Sort benchmarks (Jim Gray contest)
  – 2008: 1 terabyte sorted in 209 seconds on 900 nodes
  – 2009: 1 terabyte sorted in 62 seconds on 1,500 nodes; 1 petabyte sorted in 16.25 hours on 3,700 nodes
• Largest cluster
  – 2008: 2,000 nodes, 6 PB raw disk, 16 TB of RAM, 16K cores
  – 2009: 4,000 nodes, 16 PB raw disk, 64 TB of RAM, 32K cores (40% faster too!)
Tremendous Impact on Productivity
• Makes developers & scientists more productive
  – Research questions answered in days, not months
  – Projects move from research to production easily
  – Easy to learn! “Even our rocket scientists use it!”
• The major factors
  – You don’t need to find new hardware to experiment
  – You can work with all your data!
  – Production and research based on same framework
  – No need for R&D to do I.T., the clusters just work
Example: Search Assist™

                     Before Hadoop   After Hadoop
Time                 26 days         20 minutes
Language             C++             Python
Development Time     2-3 weeks       2-3 days

• Database for Search Assist™ is built using Hadoop
• 3 years of log-data
• 20 steps of map-reduce (an illustrative sketch of such a chained pipeline follows below)
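The slide describes a 20-step map-reduce pipeline written in Python. Yahoo!'s actual pipeline is not shown here, but as a hedged sketch, a chain like that is often driven by a small script that launches one Hadoop Streaming job per step and feeds each step's output into the next. The streaming jar location, step scripts, and HDFS paths below are all assumptions for illustration.

```python
#!/usr/bin/env python
# pipeline.py -- illustrative driver for a multi-step Hadoop Streaming pipeline.
# Each step is a (mapper, reducer) pair; step N reads the output of step N-1.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"  # assumed location

# Hypothetical steps of a query-suggestion build; the real pipeline had ~20 such steps.
STEPS = [
    ("extract_queries.py", "dedupe_queries.py"),
    ("count_suffixes.py", "sum_counts.py"),
    ("rank_suggestions.py", "top_k.py"),
]

def run_step(step_no, mapper, reducer, input_path, output_path):
    """Launch one streaming job and fail fast if Hadoop reports an error."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,
        "-file", reducer,
    ]
    print("step %d: %s" % (step_no, " ".join(cmd)))
    subprocess.check_call(cmd)

def main():
    input_path = "logs/raw/"            # assumed HDFS input directory
    for i, (mapper, reducer) in enumerate(STEPS):
        output_path = "searchassist/step%02d/" % (i + 1)
        run_step(i + 1, mapper, reducer, input_path, output_path)
        input_path = output_path        # chain: next step reads this step's output

if __name__ == "__main__":
    main()
```

Driving long chains by hand is exactly the kind of task that Oozie, the workflow and scheduling system mentioned under Futures below, is meant to take over.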
Futures
Current Yahoo! Development
• Hadoop
  – Backwards compatibility
    • New APIs in 0.21, Avro
  – Append, Sync, Flush
  – Improved scheduling
    • Capacity Scheduler, hierarchical queues in 0.21
  – Security – Kerberos authentication
  – GridMix3 / Mumak simulator / better trace & log analysis
• Pig
  – SQL and metadata
  – Column-oriented storage access layer – Zebra
  – Multi-query, lots of other optimizations
• Oozie
  – New workflow and scheduling system
• Quality Engineering
  – Many more tests
  – Stress tests, fault injection, functional test coverage, performance regression…
Questions?
Eric Baldeschwieler, VP Hadoop Software Development, Yahoo!
For more information:
http://hadoop.apache.org/
http://hadoop.yahoo.com/ (including job openings)