Download - 20091030nasajpl
![Page 1: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/1.jpg)
Thursday, October 29, 2009
![Page 2: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/2.jpg)
Hadoop and ClouderaManaging Petabytes with Open Source
Jeff HammerbacherChief Scientist and Vice President of Products, ClouderaOctober 30, 2009
Thursday, October 29, 2009
![Page 3: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/3.jpg)
Why You Should CareHadoop in the Sciences▪ Crossbow: Genotyping from short reads using cloud computing▪ “Crossbow shows how Hadoop can be a enabling technology for
computational biology”▪ SMARTS substructure searching using the CDK and Hadoop▪ “The Hadoop framework makes handling large data problems pretty much
trivial”▪ Tier II data storage for LHC▪ “The shift from dCache to Hadoop has been a pleasant transition”
▪ Scaling the Sky with MapReduce/Hadoop▪ “This research project is focused on developing new algorithms for
indexing, accessing and analyzing astronomical images.”
Thursday, October 29, 2009
![Page 4: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/4.jpg)
My BackgroundThanks for Asking
▪ [email protected]▪ Studied Mathematics at Harvard▪ Worked as a Quant on Wall Street▪ Conceived, built, and led Data team at Facebook▪ Nearly 30 amazing engineers and data scientists▪ Several open source projects and research papers
▪ Founder of Cloudera▪ Vice President of Products and Chief Scientist (other titles)▪ Also, check out the book “Beautiful Data”
Thursday, October 29, 2009
![Page 5: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/5.jpg)
Presentation Outline▪ What is Hadoop?▪ HDFS▪ MapReduce▪ Hive, Pig, Avro, Zookeeper, and friends
▪ Solving big data problems with Hadoop at Facebook and Yahoo!▪ Short history of Facebook’s Data team▪ Hadoop applications at Yahoo!, Facebook, and Cloudera▪ Other examples: LHC, smart grid, genomes
▪ Questions and Discussion
Thursday, October 29, 2009
![Page 6: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/6.jpg)
What is Hadoop?▪ Apache Software Foundation project, mostly written in Java▪ Inspired by Google infrastructure▪ Software for programming warehouse-scale computers (WSCs)▪ Hundreds of production deployments▪ Project structure▪ Hadoop Distributed File System (HDFS)▪ Hadoop MapReduce▪ Hadoop Common▪ Other subprojects
▪ Avro, HBase, Hive, Pig, Zookeeper
Thursday, October 29, 2009
![Page 7: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/7.jpg)
Anatomy of a Hadoop Cluster▪ Commodity servers▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
▪ Typically arranged in 2 level architecture▪ 40 nodes per rack
▪ Inexpensive to acquire and maintain
ApacheCon US 2008
Commodity Hardware Cluster
•! Typically in 2 level architecture
–! Nodes are commodity Linux PCs
–! 40 nodes/rack
–! Uplink from rack is 8 gigabit
–! Rack-internal is 1 gigabit all-to-all
Thursday, October 29, 2009
![Page 8: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/8.jpg)
HDFS▪ Pool commodity servers into a single hierarchical namespace▪ Break files into 128 MB blocks and replicate blocks▪ Designed for large files written once but read many times▪ Files are append-only
▪ Two major daemons: NameNode and DataNode▪ NameNode manages file system metadata▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression▪ Throughput scales nearly linearly with node cluster size▪ Access from Java, C, command line, FUSE, or Thrift
Thursday, October 29, 2009
![Page 9: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/9.jpg)
HDFSHDFS distributes file blocks among servers
!"#$%&'%(#)**+,%-&.%/#$#%&0%$"1%20$13+3&'1% !"
!"#$%&#"'()*+%,"-'./0('
#$%&&'"()*+,%-."$"/$,+010&+-2$)0".0&2$3-".4.0-5"*$++-%"06-"
#$%&&'"7(.02(8,0-%"9(+-":4.0-5;"&2"#79:<"#79:"(."$8+-"0&".0&2-"
6,3-"$5&,)0."&/"()/&25$0(&);".*$+-",'"()*2-5-)0$++4"$)%"
.,2=(=-"06-"/$(+,2-"&/".(3)(/(*$)0"'$20."&/"06-".0&2$3-"
()/2$.02,*0,2-">(06&,0"+&.()3"%$0$<"
#$%&&'"*2-$0-."456'$13'"&/"5$*6()-."$)%"*&&2%()$0-.">&2?"
$5&)3"06-5<"@+,.0-2."*$)"8-"8,(+0">(06"()-A'-).(=-"*&5',0-2.<"
B/"&)-"/$(+.;"#$%&&'"*&)0(),-."0&"&'-2$0-"06-"*+,.0-2">(06&,0"
+&.()3"%$0$"&2"()0-22,'0()3">&2?;"84".6(/0()3">&2?"0&"06-"
2-5$()()3"5$*6()-."()"06-"*+,.0-2<"
#79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3"
/(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?."
2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;"
#79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6"
'(-*-"0&"062--"%(//-2-)0".-2=-2.E"
"
"
!"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.'
"
#79:"6$.".-=-2$+",.-/,+"/-$0,2-.<"B)"06-"=-24".(5'+-"-A$5'+-"
.6&>);"$)4"0>&".-2=-2."*$)"/$(+;"$)%"06-"-)0(2-"/(+-">(++".0(++"8-"
$=$(+$8+-<"#79:")&0(*-.">6-)"$"8+&*?"&2"$")&%-"(."+&.0;"$)%"
*2-$0-."$")->"*&'4"&/"5(..()3"%$0$"/2&5"06-"2-'+(*$."(0"
F"
!"
G"
H"
I"
!"
I"
H"
F"
!"
H"
F"
G"
I"
!"
G"
I"
F"
G"
H"
#79:"
"
" "
"
" "
7#8*3%90$1301$%
+3*+13$&1'%5&:1%;**.51<%
=>#?*0<%@#41A**:%#0)%
B#"**C%"#D1%+&*01131)%
$"1%6'1%*E%01$F*3:'%*E%
&01G+10'&D1%4*>+6$13'%
E*3%5#3.1H'4#51%)#$#%
'$*3#.1%#0)%
+3*41''&0.I%(/@J%6'1'%
$"1'1%$14"0&K61'%$*%
'$*31%10$13+3&'1%)#$#I%
Thursday, October 29, 2009
![Page 10: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/10.jpg)
Hadoop MapReduce▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems▪ Key/value data model▪ Two major daemons: JobTracker and TaskTracker▪ Many client interfaces▪ Java▪ C++▪ Streaming▪ Pig▪ SQL (Hive)
Thursday, October 29, 2009
![Page 11: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/11.jpg)
MapReduceMapReduce pushes work out to the data
!"#$%&'%(#)**+,%-&.%/#$#%&0%$"1%20$13+3&'1% !"
"
!"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+'
"
#$%%&%'"()*"+%+,-.&."/%"()*"%/0*."()+("+1($+,,-".(/2*"()*"0+(+"
0*,&3*2."4$1)"4$1)"5*((*2"6*27/24+%1*"()+%"2*+0&%'"0+(+"
/3*2"()*"%*(8/29"72/4"+".&%',*"1*%(2+,&:*0".*23*2;""<+0//6"
4/%&(/2."=/5."0$2&%'"*>*1$(&/%?"+%0"8&,,"2*.(+2("8/29",/.("0$*"
(/"%/0*"7+&,$2*"&7"%*1*..+2-;"@%"7+1(?"&7"+"6+2(&1$,+2"%/0*"&."
2$%%&%'"3*2-".,/8,-?"<+0//6"8&,,"2*.(+2("&(."8/29"/%"+%/()*2"
.*23*2"8&()"+"1/6-"/7"()*"0+(+;"
!"##$%&'
<+0//6A."B+6#*0$1*"+%0"<CDE"$.*".&46,*?"2/5$.("(*1)%&F$*."
/%"&%*>6*%.&3*"1/46$(*2".-.(*4."(/"0*,&3*2"3*2-")&')"0+(+"
+3+&,+5&,&(-"+%0"(/"+%+,-:*"*%/24/$."+4/$%(."/7"&%7/24+(&/%"
F$&19,-;"<+0//6"/77*2."*%(*262&.*."+"6/8*27$,"%*8"(//,"7/2"
4+%+'&%'"5&'"0+(+;"
D/2"4/2*"&%7/24+(&/%?"6,*+.*"1/%(+1("G,/$0*2+"+(H"
" &%7/I1,/$0*2+;1/4"
" JKLMNOLPMQLO!RR"
" )((6HSS888;1,/$0*2+;1/4S"
K"
P"
N"
K"
P"
!"
Q"
P"
!"
K"
Q"
N"
Q"
!"
N"
(#)**+%$#41'%
#)5#0$#.1%*6%(/789%
)#$#%)&'$3&:;$&*0%
'$3#$1.<%$*%+;'"%=*34%
*;$%$*%>#0<%0*)1'%&0%#%
?@;'$13A%B"&'%#@@*='%
#0#@<'1'%$*%3;0%&0%
+#3#@@1@%#0)%1@&>&0#$1'%
$"1%:*$$@101?4'%
&>+*'1)%:<%>*0*@&$"&?%
'$*3#.1%'<'$1>'A%
Thursday, October 29, 2009
![Page 12: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/12.jpg)
Hadoop Subprojects▪ Avro▪ Cross-language framework for RPC and serialization
▪ HBase▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive▪ SQL interface to structured data stored in HDFS
▪ Pig▪ Language for data flow programming; also Owl, Zebra, SQL
▪ Zookeeper▪ Coordination service for distributed systems
Thursday, October 29, 2009
![Page 13: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/13.jpg)
Hadoop Community Support▪ 185+ contributors to the open source code base▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC▪ Hadoop World Beijing later this month
▪ Three books (O’Reilly, Apress, Manning)▪ Training videos free online▪ Regular user group meetups in many cities▪ University courses across the world▪ Growing consultant and systems integrator expertise▪ Commercial training, certification, and support from Cloudera
Thursday, October 29, 2009
![Page 14: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/14.jpg)
Hadoop Project Mechanics▪ Trademark owned by ASF; Apache 2.0 license for code▪ Rigorous unit, smoke, performance, and system tests▪ Release cycle of 3 months (-ish)▪ Last major release: 0.20.0 on April 22, 2009▪ 0.21.0 will be last release before 1.0; nearly complete▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines▪ Releases made available as tarballs on Apache and mirrors▪ Cloudera packages own release for many platforms▪ RPM and Debian packages; AMI for Amazon’s EC2
Thursday, October 29, 2009
![Page 15: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/15.jpg)
Hadoop at FacebookEarly 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier▪ Intensive historical analysis difficult▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
Thursday, October 29, 2009
![Page 16: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/16.jpg)
Facebook Data Infrastructure2007
Oracle Database Server
Data Collection Server
MySQL TierScribe Tier
Thursday, October 29, 2009
![Page 17: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/17.jpg)
Facebook Data Infrastructure2008
MySQL TierScribe Tier
Hadoop Tier
Oracle RAC Servers
Thursday, October 29, 2009
![Page 18: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/18.jpg)
Major Data Team Workloads▪ Data collection▪ server logs▪ application databases▪ web crawls
▪ Thousands of multi-stage processing pipelines▪ Summaries consumed by external users▪ Summaries for internal reporting▪ Ad optimization pipeline▪ Experimentation platform pipeline
▪ Ad hoc analyses
Thursday, October 29, 2009
![Page 19: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/19.jpg)
Workload StatisticsFacebook 2009
▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage▪ 4 TB of compressed new data added per day▪ 135TB of compressed data scanned per day▪ 7,500+ Hive jobs on per day▪ 80K compute hours per day▪ Around 200 people per month run Hive jobs
(data from Ashish Thusoo’s Hadoop World NYC presentation)
Thursday, October 29, 2009
![Page 20: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/20.jpg)
Hadoop at Yahoo!▪ Jan 2006: Hired Doug Cutting▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds▪ Aug 2008: Deployed 4,000 node Hadoop cluster▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds▪ Sorted 1 PB on 3,658 nodes in 16.25 hours
▪ Other data points▪ Over 25,000 nodes running Hadoop across 17 clusters▪ Hundreds of thousands of jobs per day from over 600 users▪ 82 PB of data
Thursday, October 29, 2009
![Page 21: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/21.jpg)
Example Hadoop Applications▪ Yahoo!▪ Yahoo! Search Webmap▪ Content and ad targeting optimization
▪ Facebook▪ Fraud and abuse detection▪ Lexicon (text mining)
▪ Cloudera▪ Facial recognition for automatic tagging▪ Genome sequence analysis▪ Financial services, government, telco, scientific data
Thursday, October 29, 2009
![Page 22: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/22.jpg)
Cloudera OfferingsOnly One Slide, I Promise
▪ Two software products▪ Cloudera’s Distribution for Hadoop▪ Cloudera Desktop▪ ...more on the way
▪ Training and Certification▪ For Developers, Operators, and Managers
▪ Support▪ Professional services
Thursday, October 29, 2009
![Page 23: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/23.jpg)
Cloudera DesktopBig Data can be Beautiful
Thursday, October 29, 2009
![Page 24: 20091030nasajpl](https://reader034.vdocuments.us/reader034/viewer/2022052618/54c6fc3d4a795949388b459c/html5/thumbnails/24.jpg)
(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0
Thursday, October 29, 2009