4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

16
Apache Hadoop 0.23 What it takes and what it means… Page 1 Arun C. Murthy Founder/Architect, Hortonworks @acmurthy (@hortonworks)

Upload: nitish-bhardwaj

Post on 19-Nov-2014

46 views

Category:

Documents


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Apache Hadoop 0.23What it takes and what it means…

Page 1

Arun C. Murthy

Founder/Architect, Hortonworks

@acmurthy (@hortonworks)

Page 2: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Hello! I’m Arun

Page 2

• Founder/Architect at Hortonworks Inc.– Formerly, Architect Hadoop MapReduce, Yahoo– Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes

footprint)– Yes, I took the 3am calls!

• Apache Hadoop, ASF– VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)– Long-term Committer/PMC member (full time ~6 years)– Release Manager - hadoop-0.23

Page 3: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Releases so far…

Page 3

• Started for Nutch… Yahoo picked it up in early 2006, hired Doug Cutting• Initially, we did monthly releases (0.1, 0.2 …)• Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009…• hadoop-0.20 is still the basis of all current, stable, Hadoop distributions

– Apache Hadoop 0.20.2xx– CDH3.*– HDP1.*

• hadoop-0.20.203 (security) – 05/2011• hadoop-0.20.205 (security + append -> hbase) – 10/2011

2006 2009 2012

hadoop-0.1.0 hadoop-0.10.0 hadoop-0.20.0 hadoop-0.23.0hadoop-0.20.205

Page 4: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

hadoop-0.23

Page 4

• First stable release off Apache Hadoop trunk in over 30 months…• Currently alpha (hadoop-0.23.0) is under voting by the Hadoop PMC• Significant major features• Several, several enhancements

Page 5: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

HDFS - Federation

Page 5

• Significant scaling… • Separation of Namespace mgmt and Block mgmt• Suresh Srinivas (Hortonworks) – Wed 11am

Page 6: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

MapReduce - YARN

Page 6

• NextGen Hadoop Data Processing Framework• Support MR and other paradigms• Mahadev Konar (Hortonworks) – Tue 4.30pm

Page 7: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Performance

Page 7

• 2x+ across the board• HDFS read/write

– CRC32– fadvise– Shortcut for local reads

• MapReduce– Unlock lots of improvements from Terasort record (Owen/Arun, 2009)– Shuffle 30%+– Small Jobs – Uber AM

• Todd Lipcon (Cloudera) – Wed 10am

Page 8: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

HDFS NameNode HA

Page 8

• The famous SPOF• https://issues.apache.org/jira/browse/HDFS-1623• Well on the way to fix in hadoop-0.23.½ • Suresh Srinivas (Hortonworks), Aaron Myers (Cloudera) – Tue 2.15pm

Page 9: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

More…

Page 9

• HDFS Write pipeline improvements for Hbase– Append/flush etc.

• Build - Full Mavenization• EditLogs re-write

– https://issues.apache.org/jira/browse/HDFS-1073

• Tonnes more …

Page 10: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Deployment goals

Page 10

• Clusters of 6,000 machines– Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks– 200+ PB (raw) per cluster– 100,000+ concurrent tasks– 10,000 concurrent jobs

• Yahoo: 50,000+ machines

Page 11: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

What does it take to get there?

Page 11

• Testing, *lots* of it• Benchmarks – At least as good as the last one• Integration testing

– HBase– Pig– Hive– Oozie

• Deployment discipline

Page 12: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Testing

Page 12

• Why is it hard?– MapReduce is, effectively, very wide api– Add Streaming– Add Pipes– Oh, Pig/Hive etc. etc.

• Functional tests– Nightly– Nearly 1000 functional tests for MapReduce alone– Several hundred for Pig/Hive etc.

• Scale tests– Simulation

• Longevity tests• Stress tests

Page 13: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Benchmarks

Page 13

• Benchmark every part of the HDFS & MR pipeline– HDFS read/write throughput– NN operations– Scan, Shuffle, Sort

• GridMixv3– Run production traces in test clusters– Thousands of jobs– Stress mode v/s Replay mode

Page 14: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Integration Testing

Page 14

• Several projects in the ecosystem– HBase– Pig– Hive– Oozie

• Cycle– Functional– Scale– Rinse, repeat

Page 15: 4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02

Deployment

Page 15

• Alpha/Test (early UAT)– Starting Nov, 2011– Small scale (500-800 nodes)

• Alpha– Jan, 2012– Majority of users– 2000 nodes per cluster, > 10,000 nodes in all

• Beta– Misnomer: 100s of PB, Millions of user applications– Significantly wide variety of applications and load– 4000+ nodes per cluster, > 20000 nodes in all– Late Q1, 2012

• Production– Well, it’s production– Mid-to-late Q2 2012