20100128ebay
DESCRIPTION
TRANSCRIPT
![Page 1: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/1.jpg)
Thursday, January 28, 2010
![Page 2: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/2.jpg)
Hadoop, Cloudera, and eBayManaging Petabytes with Open Source
Jeff HammerbacherChief Scientist and Vice President of Products, ClouderaJanuary 28, 2010
Thursday, January 28, 2010
![Page 3: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/3.jpg)
My BackgroundThanks for Asking
▪ [email protected]▪ Studied Mathematics at Harvard▪ Worked as a Quant on Wall Street▪ Conceived, built, and led Data team at Facebook▪ Nearly 30 amazing engineers and data scientists▪ Several open source projects and research papers
▪ Founder of Cloudera▪ Vice President of Products and Chief Scientist▪ Also, check out the book “Beautiful Data”
Thursday, January 28, 2010
![Page 4: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/4.jpg)
Presentation Outline▪ What is Hadoop?▪ HDFS and MapReduce▪ Hive, Pig, Avro, Zookeeper, HBase
▪ From Steve▪ Why Hadoop?▪ Hadoop for machine learning and modeling▪ Other things I find interesting
▪ What we’re building at Cloudera▪ Questions and Discussion
Thursday, January 28, 2010
![Page 5: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/5.jpg)
What is Hadoop?▪ Apache Software Foundation project, mostly written in Java▪ Inspired by Google infrastructure▪ Software for programming warehouse-scale computers (WSCs)▪ Hundreds of production deployments▪ Project structure▪ Hadoop Distributed File System (HDFS)▪ Hadoop MapReduce▪ Hadoop Common▪ Other subprojects
▪ Avro, HBase, Hive, Pig, Zookeeper
Thursday, January 28, 2010
![Page 6: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/6.jpg)
Anatomy of a Hadoop Cluster▪ Commodity servers▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC▪ Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA
▪ Typically arranged in 2 level architecture▪ Inexpensive to acquire and maintain
ApacheCon US 2008
Commodity Hardware Cluster
•! Typically in 2 level architecture
–! Nodes are commodity Linux PCs
–! 40 nodes/rack
–! Uplink from rack is 8 gigabit
–! Rack-internal is 1 gigabit all-to-all
Thursday, January 28, 2010
![Page 7: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/7.jpg)
HDFS▪ Pool commodity servers into a single hierarchical namespace▪ Break files into 128 MB blocks and replicate blocks▪ Designed for large files written once but read many times▪ Files are append-only via a single writer
▪ Two major daemons: NameNode and DataNode▪ NameNode manages file system metadata▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression▪ Throughput scales nearly linearly with node cluster size
Thursday, January 28, 2010
![Page 8: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/8.jpg)
HDFSHDFS distributes file blocks among servers
!"#$%&'%(#)**+,%-&.%/#$#%&0%$"1%20$13+3&'1% !"
!"#$%&#"'()*+%,"-'./0('
#$%&&'"()*+,%-."$"/$,+010&+-2$)0".0&2$3-".4.0-5"*$++-%"06-"
#$%&&'"7(.02(8,0-%"9(+-":4.0-5;"&2"#79:<"#79:"(."$8+-"0&".0&2-"
6,3-"$5&,)0."&/"()/&25$0(&);".*$+-",'"()*2-5-)0$++4"$)%"
.,2=(=-"06-"/$(+,2-"&/".(3)(/(*$)0"'$20."&/"06-".0&2$3-"
()/2$.02,*0,2-">(06&,0"+&.()3"%$0$<"
#$%&&'"*2-$0-."456'$13'"&/"5$*6()-."$)%"*&&2%()$0-.">&2?"
$5&)3"06-5<"@+,.0-2."*$)"8-"8,(+0">(06"()-A'-).(=-"*&5',0-2.<"
B/"&)-"/$(+.;"#$%&&'"*&)0(),-."0&"&'-2$0-"06-"*+,.0-2">(06&,0"
+&.()3"%$0$"&2"()0-22,'0()3">&2?;"84".6(/0()3">&2?"0&"06-"
2-5$()()3"5$*6()-."()"06-"*+,.0-2<"
#79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3"
/(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?."
2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;"
#79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6"
'(-*-"0&"062--"%(//-2-)0".-2=-2.E"
"
"
!"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.'
"
#79:"6$.".-=-2$+",.-/,+"/-$0,2-.<"B)"06-"=-24".(5'+-"-A$5'+-"
.6&>);"$)4"0>&".-2=-2."*$)"/$(+;"$)%"06-"-)0(2-"/(+-">(++".0(++"8-"
$=$(+$8+-<"#79:")&0(*-.">6-)"$"8+&*?"&2"$")&%-"(."+&.0;"$)%"
*2-$0-."$")->"*&'4"&/"5(..()3"%$0$"/2&5"06-"2-'+(*$."(0"
F"
!"
G"
H"
I"
!"
I"
H"
F"
!"
H"
F"
G"
I"
!"
G"
I"
F"
G"
H"
#79:"
"
" "
"
" "
7#8*3%90$1301$%
+3*+13$&1'%5&:1%;**.51<%
=>#?*0<%@#41A**:%#0)%
B#"**C%"#D1%+&*01131)%
$"1%6'1%*E%01$F*3:'%*E%
&01G+10'&D1%4*>+6$13'%
E*3%5#3.1H'4#51%)#$#%
'$*3#.1%#0)%
+3*41''&0.I%(/@J%6'1'%
$"1'1%$14"0&K61'%$*%
'$*31%10$13+3&'1%)#$#I%
Thursday, January 28, 2010
![Page 9: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/9.jpg)
Hadoop MapReduce▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems▪ Key/value data model▪ Two major daemons: JobTracker and TaskTracker▪ Many client interfaces▪ Java▪ C++▪ Streaming▪ Pig▪ SQL (Hive)
Thursday, January 28, 2010
![Page 10: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/10.jpg)
MapReduceMapReduce pushes work out to the data
!"#$%&'%(#)**+,%-&.%/#$#%&0%$"1%20$13+3&'1% /�
�!"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+'
�
��������������"������������� ������������"����������� �� � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � ������������� ������������������������#� �������&��� ������������������ �������!�������$�� � ����������� ��������� ���� � � � � � � � � � � � � �����������"&�������$���������������� ��� � ������������"���� �"$�� ���� ��������������� ��������������� � � � � � � � � � � � � � � " � � � � � � � � � &�
!"##$%&'
� ���(������ ����� ��������������� $������� �������������� ����!������������������"��������� ����������"������ ����������" �� ������"#����������������� � � � � � � � � � � � � � � �������"&� � �������������������������� �������� ���������������� ����� � &�
�������������������$������������������ ����% �
� � � � � * � � � � � � & � � ��� 3, '10+ '.1- '+/22 �� ����%)) &���� ��&���) �
,�
.�
0�
,�
.�
/�
-�
.�
/�
,�
-�
0�
-�
/�
0�
(#)**+%$#41'%
#)5#0$#.1%*6%(/789%
)#$#%)&'$3&:;$&*0%
'$3#$1.<%$*%+;'"%=*34%
*;$%$*%>#0<%0*)1'%&0%#%
?@;'$13A%B"&'%#@@*='%
#0#@<'1'%$*%3;0%&0%
+#3#@@1@%#0)%1@&>&0#$1'%
$"1%:*$$@101?4'%
&>+*'1)%:<%>*0*@&$"&?%
'$*3#.1%'<'$1>'A%
Thursday, January 28, 2010
![Page 11: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/11.jpg)
Hadoop Subprojects▪ Avro▪ Cross-language framework for RPC and serialization
▪ HBase▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive▪ SQL interface to structured data stored in HDFS
▪ Pig▪ Language for data flow programming; also Owl, Zebra, SQL
▪ Zookeeper▪ Coordination service for distributed systems
Thursday, January 28, 2010
![Page 12: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/12.jpg)
Hadoop Community Support▪ 185+ contributors to the open source code base▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC▪ Regular user group meetups in many cities▪ Bay Area Meetup group has 534 members
▪ Three books (O’Reilly, Apress, Manning)▪ Training videos free online▪ University courses across the world▪ Growing consultant and systems integrator expertise▪ Training, certification, support, and services from Cloudera
Thursday, January 28, 2010
![Page 13: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/13.jpg)
Hadoop Project Mechanics▪ Trademark owned by ASF; Apache 2.0 license for code▪ Rigorous unit, smoke, performance, and system tests▪ Release cycle of 9 months▪ Last major release: 0.20.0 on April 22, 2009▪ 0.21.0 will be last release before 1.0; nearly complete▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines▪ Releases made available as tarballs on Apache and mirrors▪ Cloudera packages a distribution for many platforms▪ RPM and Debian packages; AMI for Amazon’s EC2
Thursday, January 28, 2010
![Page 14: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/14.jpg)
Hadoop at FacebookEarly 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier▪ Intensive historical analysis difficult▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
Thursday, January 28, 2010
![Page 15: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/15.jpg)
Facebook Data Infrastructure2007
Oracle Database Server
Data Collection Server
MySQL TierScribe Tier
Thursday, January 28, 2010
![Page 16: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/16.jpg)
Facebook Data Infrastructure2008
MySQL TierScribe Tier
Hadoop Tier
Oracle RAC Servers
Thursday, January 28, 2010
![Page 17: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/17.jpg)
Major Data Team Workloads▪ Data collection▪ server logs▪ application databases▪ web crawls
▪ Thousands of multi-stage processing pipelines▪ Summaries consumed by external users▪ Summaries for internal reporting▪ Ad optimization pipeline▪ Experimentation platform pipeline
▪ Ad hoc analyses
Thursday, January 28, 2010
![Page 18: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/18.jpg)
Workload StatisticsFacebook 2010
▪ Largest cluster running Hive: 8,400 cores, 12.5 PB of storage▪ 12 TB of compressed new data added per day▪ 135TB of compressed data scanned per day▪ 7,500+ Hive jobs on per day▪ 80K compute hours per day▪ Around 200 people per month run Hive jobs
(data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)
Thursday, January 28, 2010
![Page 19: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/19.jpg)
Why Did Facebook Choose Hadoop?1. Demonstrated effectiveness for primary workload2. Proven ability to scale past any commercial vendor3. Easy provisioning and capacity planning with commodity nodes4. Data access for all: engineers, business analysts, sales managers5. Single system to manage XML/JSON, text, and relational data6. No schemas enabled data collection without involving Data team7. Simple, modular architecture8. Easy to build, deploy, and monitor9. Apache-licensed open source code granted to ASF
Thursday, January 28, 2010
![Page 20: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/20.jpg)
Why Did Facebook Choose Hadoop?▪ Most importantly: the community▪ Broad and deep commitment to future development from
multiple organizations▪ Interaction with a community often useful for recruiting▪ Growing body of users and operators with prior expertise
meant lower cost of training new users▪ Learn about best practices from other organizations▪ Widely available public materials for improving skills▪ Not then, but now
▪ Commercial training, certification, support, and services▪ Growing body of complementary software
Thursday, January 28, 2010
![Page 21: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/21.jpg)
Hadoop and Machine Learning/Modeling▪ Data preparation using familiar programming tools▪ Scalable historical storage of data for training and validation▪ Field coding, aggregation, and data quality assertions▪ Feature extraction over massive or complex data sets▪ Efficient sampling and extraction to other tools▪ Combination with other data sets▪ Extensible metadata for organizing data sets
▪ Fundamental operations▪ Matrix multiplication and other linear algebra▪ Statistical tests of significance
Thursday, January 28, 2010
![Page 22: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/22.jpg)
Hadoop and Machine Learning/Modeling▪ Scoring▪ eHarmony matching users▪ Fraud detection for billing platforms
▪ Genetic Algorithms▪ Mailchimp’s Project Omnivore▪ Xavier Llorà’s research
▪ Collaborative filtering▪ Google News personalization▪ Yahoo! front page personalization (Cokeheads)
Thursday, January 28, 2010
![Page 23: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/23.jpg)
Hadoop and Machine Learning/Modeling▪ Model fitting▪ EM algorithm and HMMs (Jimmy Lin)
▪ Graph analysis▪ Finding largest connected component (Jeff Hodges)▪ Social graph analysis (Jake Hofman)
▪ Document analysis▪ Named entity extraction (Evri)▪ Document similarity (Jimmy Lin)
▪ Image similarity: Google paper
Thursday, January 28, 2010
![Page 24: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/24.jpg)
Hadoop and Machine Learning/Modeling▪ Classification▪ Google’s PLANET for building decision trees▪ eBay’s linear Poisson regression for behavioral targeting▪ Sessionization of clickstream logs and path prediction
▪ Bioinformatics▪ Cloudburst▪ Crossbow
▪ Computer vision▪ Face detection▪ Face recognition
Thursday, January 28, 2010
![Page 25: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/25.jpg)
Hadoop and Machine Learning/Modeling▪ Simulation▪ Protein folding▪ Particle-swarm optimization
▪ Crazy stuff▪ Factoring integers▪ Solving Boggle▪ Generating fractals
▪ Books and conferences▪ MDAC 2010▪ “Data Intensive Text Processing with MapReduce”
Thursday, January 28, 2010
![Page 26: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/26.jpg)
Hadoop at ClouderaCloudera’s Distribution for Hadoop
▪ Open source distribution of Apache Hadoop for enterprise use▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper▪ Ensures cross-subproject compatibility▪ Adds backported patches and customer-specific patches▪ Adds Cloudera utilities like MRUnit and Sqoop▪ Better integration with daemon administration utilities▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout▪ Tools for automatically generating a configuration▪ Packaged as RPM, DEB, AMI, or tarball
Thursday, January 28, 2010
![Page 27: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/27.jpg)
Hadoop at ClouderaTraining and Certification
▪ Free online training▪ Basic, Intermediate (including Hive and Pig), and Advanced▪ Includes a virtual machine with software and exercises
▪ Live training sessions▪ One live session per month somewhere in the world▪ If you have a large group, we may come to you
▪ Certification▪ Exams for Developers, Administrators, and Managers▪ Administered online or in person
Thursday, January 28, 2010
![Page 28: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/28.jpg)
Hadoop at ClouderaServices and Support
▪ Professional Services▪ Get Hadoop up and running in your environment▪ Optimize an existing Hadoop infrastructure▪ Design new algorithms to make the most of your data
▪ Support▪ Unlimited questions for Cloudera’s technical team▪ Access to our Knowledge Base▪ Help prioritize feature development for CDH▪ Early access to upcoming Cloudera software products
Thursday, January 28, 2010
![Page 29: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/29.jpg)
Hadoop at ClouderaCommercial Software
▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis
▪ Current products▪ Cloudera Desktop▪ Extensible interface for users of Cloudera software
▪ Upcoming products for data collection▪ Talk to me offline
Thursday, January 28, 2010
![Page 30: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/30.jpg)
Cloudera DesktopBig Data can be Beautiful
Thursday, January 28, 2010
![Page 31: 20100128ebay](https://reader034.vdocuments.us/reader034/viewer/2022051515/54c6f17e4a7959f3488b45b4/html5/thumbnails/31.jpg)
(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0
Thursday, January 28, 2010