Integrating Hadoop in Your Existing DW and BI Environment
TRANSCRIPT
Presentation Outline
1. The standard model
2. The 3 stages of Hadoop adoption
3. Cloudera partnerships
4. Analytics at eBay
Questions and Discussion
Wednesday, November 17, 2010
1. The Standard Model: Data Warehousing and Business Intelligence
[Diagram, built up over successive slides: Application Requests → Application Database → ETL → Data Warehouse → Business Intelligence and Analytics]
2. The 3 Stages of Hadoop Adoption
Stage 1: Off the Critical Path
Stage 1: Copy or Archive
[Diagram: the standard pipeline (Application Requests → Application Database → ETL → Data Warehouse → Business Intelligence and Analytics), with Hadoop added alongside the warehouse as a copy/archive target]
Stage 1: Add Unstructured Data
[Diagram: the standard pipeline with Hadoop alongside the warehouse, now also taking in unstructured data]
Stage 1: Consolidate Multiple Data Warehouses
[Diagram: two Application Database → ETL → Data Warehouse pipelines, both feeding a shared Hadoop cluster]
Stage 2: On the Critical Path
Stage 2: Structure and Store
[Diagram: Application Requests → Application Database → Hadoop → Data Warehouse → Business Intelligence and Analytics; Hadoop now sits on the critical path, taking over the structuring work of the ETL step]
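Stage 2 puts Hadoop on the critical path: raw application data is structured in Hadoop before it lands in the warehouse. A minimal Hadoop Streaming-style sketch of that structuring step in Python; the four-field log layout and field names are invented for illustration, not taken from the talk:

```python
def structure(line):
    """Parse one raw request-log line into a tab-separated record a
    warehouse loader can ingest. The assumed layout (timestamp, user,
    url, status) is a hypothetical example."""
    fields = line.split()
    if len(fields) != 4:
        return None  # drop malformed lines rather than failing the job
    return "\t".join(fields)

def run(lines):
    """Mapper body: yield structured records, skipping unparsable input."""
    for line in lines:
        rec = structure(line)
        if rec is not None:
            yield rec
```

Under Hadoop Streaming this would iterate over sys.stdin and print each record; a pure structuring pass like this needs no reduce step.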
Stage 3: Ad Hoc Query Support
[Diagram: the stage-2 pipeline, with Hadoop + Hive also serving a second set of Business Intelligence and Analytics consumers directly]
Cloudera’s Distribution for Hadoop: The Industry-Leading Hadoop Distribution
3. Cloudera Partnerships
Cloudera Partnerships: Cloud, Hardware, and OS
• Processor: AMD, Intel
• Server: Acer, HP, Supermicro
• OS: Canonical
• Cloud: VMware vCloud
• CDH runs on AWS and Rackspace Cloud as well
Cloudera Partnerships: Data Integration
• Informatica
• Talend
• Pentaho Data Integration
Cloudera Partnerships: Database
• Aster Data
• Greenplum
• Membase
• Netezza
• Quest Software (OraOop)
• Teradata
• Vertica
Cloudera Partnerships: Business Intelligence
• Jaspersoft
• MicroStrategy
• Pentaho BI Suite
4. Analytics at eBay
eBay’s Data Scale
• eBay manages…
  • Over 90 million active users worldwide
  • Over 220 million items for sale
  • Over 10 billion URL requests per day
• …in a dynamic environment
  • Tens of new features each week
  • Roughly 10% of items are listed or ended every day
• Collect everything
  • eBay processes 40 TB of new, incremental data per day
  • eBay analyzes 40 PB of data per day
  • Store every historical item and purchase
eBay has one of the largest EDW systems and is building one of the world’s largest Hadoop clusters.
Where it fits in our Data Platform…
Integration into Existing Warehouse
[Diagram: sources (Click Stream, EDW, Images, Search Indices, Item Description) feed a Data Acquisition → BI Generation → Insight Delivery flow that produces Analytics Reporting and Algorithmic Models]
Data Sourcing Patterns
Source: Click Stream (Session, Event, Session Container)
• Preparation: Session/Event data streamed as Gzip/binary, prepared as LZO/text. Session Container (a join of Session and its corresponding Event data) prepared as SequenceFiles.
• Pattern: for Session/Event data, build an index and use LzoTextInputFormat for splits; for the Session Container, a secondary sort with a reduce-side join.
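The "secondary sort with reduce-side join" pattern can be sketched in plain Python. In a real MapReduce job the shuffle does the sorting on a composite (session_id, tag) key so each session record reaches the reducer ahead of its events; here the sort happens in-process to illustrate the idea, and the record shapes are invented:

```python
from itertools import groupby

def map_session(rec):          # rec: (session_id, session_data)
    sid, data = rec
    return (sid, 0, data)      # tag 0 sorts first

def map_event(rec):            # rec: (session_id, event_data)
    sid, data = rec
    return (sid, 1, data)      # tag 1 sorts after the session record

def reduce_join(tagged):
    """Group by session_id; the first record in each group (tag 0) is the
    session, the rest are its events. Emit one 'session container' each."""
    out = {}
    for sid, group in groupby(sorted(tagged), key=lambda t: t[0]):
        group = list(group)
        session = group[0][2] if group[0][1] == 0 else None
        events = [d for _, tag, d in group if tag == 1]
        out[sid] = (session, events)
    return out
```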
Source: EDW (Item, Transaction, User, Feedback, Bids)
• Preparation: incremental feed streamed and maintained as GZIP/text (a smaller data set, so keep it in the original format); prepare a snapshot as a SequenceFile.
• Pattern: rebuild the daily snapshot from the previous snapshot plus the day’s incremental data. Build a Hive table on the snapshot data: create an external Hive table that points at the SequenceFile.
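The daily snapshot rebuild can be sketched as a key-wise merge, assuming records are held as keyed dicts (an assumption for illustration; the real job does this as a MapReduce pass writing a new SequenceFile):

```python
def rebuild_snapshot(previous, incremental):
    """Overlay today's incremental feed onto yesterday's snapshot;
    the newest record wins per key. Both arguments are {key: row} dicts."""
    snapshot = dict(previous)      # copy rather than mutate yesterday's data
    snapshot.update(incremental)   # inserts new keys, overwrites changed ones
    return snapshot
```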
HBase
a) Leverage TotalOrderPartitioner with RandomSampler to identify partition ranges for reducers.
b) Create HBase regions using HFile.
c) Update RegionServers using the Ruby script loadtable.rb.
Learnings
a) Incremental data is not temporal/sparse, hence not suitable as versions in a column-oriented DB.
b) HBase insert vs. append performance: 120K vs. 12K rows per second.
c) HFile flush durability issues (HBASE-1923).
Hadoop Ecosystem
• Hadoop Core (HDFS, Common)
• MapReduce (Java, Streaming, Pipes, Scala)
• Data Access (HBase, Pig, Hive)
• Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)
• Monitoring & Alerting (Ganglia, Nagios)

• MapReduce: data sourcing primarily in Java; applications in Perl, Scala, Python…
• Data Access frameworks: Pig for data pipelines, Hive for ad hoc queries, MQL (Mobius Query Language)
• Monitoring & Alerting: Ganglia, Nagios, Cloudera Enterprise
• Tools & Libraries: HUE/Mobius for the lifecycle of user jobs, UC4 for scheduling, Oozie for user workflows and data pipelines, Mahout for data mining
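The Streaming interface listed above lets mappers and reducers be ordinary processes exchanging key-TAB-value lines on stdin/stdout, with the framework sorting by key in between. A hedged Python sketch of that contract, counting requests per URL (an invented example, not a job from the talk):

```python
from itertools import groupby

def mapper(lines):
    # Emit one "url<TAB>1" line per request, keyed by the first field.
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"

def reducer(sorted_lines):
    # Streaming guarantees the reducer sees its input sorted by key;
    # sum the counts over each run of identical keys.
    rows = (line.split("\t") for line in sorted_lines)
    for url, group in groupby(rows, key=lambda kv: kv[0]):
        yield f"{url}\t{sum(int(v) for _, v in group)}"
```

Run under Hadoop, the two functions would be separate scripts reading sys.stdin; the sorted() call in a local test stands in for the shuffle.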
Metadata ‐ Data Discovery & Management
[Architecture diagram: clients reach HDFS through a Data Access Layer (Hive and Java, Pig schemas, Pig load UDFs, Hive tables, Java POJOs, HBase tables); a Metadata layer (logical type system, metadata store, provisioning tools) backs data discovery and data monitoring; data sourcing covers extract, transform, validation, and load]
Administration
• Groups
  • Cloudera Enterprise
  • Workload management: allocation, weights, preemption, speculative execution, data locality
• Security
  • Integrate the Hadoop security spec with corporate policies
  • Authentication
    • HUE: custom module to use corporate credentials
    • Command-line interface: custom PAM module
  • Authorization
    • Establish roles based on data classification and access patterns
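The workload-management knobs named above (allocation, weights, preemption) map onto the fair scheduler’s allocation file in Hadoop of this era. A sketch of what such a file might look like; the pool names and every value are purely illustrative, not eBay’s actual configuration:

```xml
<?xml version="1.0"?>
<!-- Hypothetical fair-scheduler allocation file; pools and values
     are illustrative only. -->
<allocations>
  <pool name="analytics">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <weight>2.0</weight>
    <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
  </pool>
  <pool name="adhoc">
    <weight>1.0</weight>
    <maxRunningJobs>10</maxRunningJobs>
  </pool>
</allocations>
```

Weights bias each pool’s fair share, minimum shares guarantee capacity, and the preemption timeout bounds how long a starved pool waits before reclaiming slots.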
Platform & Metrics

Metrics:
• Data Sourcing: latency, data load status, integrity, quality, availability
• Consumption: Cloudera’s Bean Counter, job statistics, system consumption
• Budgeting: resource allocation models, forecasting, chargeback
• Utilization: Cloudera’s Activity Monitor, efficiency, performance

Platform:
• Availability: standby nodes (Checkpoint, Backup, Avatar Node), SLAs
• Manageability: installation, provisioning, de-provisioning, version upgrades
• Scalability: federated NameNode, metadata replication, ZooKeeper
• Data Movement: publish/subscribe ETL tools, low latency, self-service
• Storage: consistency, partitioning, compression, replication
• Workload: concurrency, resource sharing, schedulers, allocation
• Policies: retention, archival, backup, quotas
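The budgeting row (resource allocation models, chargeback) can be made concrete with a small sketch: split a period’s cluster cost across groups in proportion to the slot-seconds their jobs consumed. The proportional model, the field names, and the numbers are assumptions for illustration, not from the deck:

```python
def chargeback(job_stats, cluster_cost):
    """Allocate cluster_cost across groups proportionally to usage.
    job_stats: iterable of (group, slot_seconds) per finished job."""
    usage = {}
    for group, slot_seconds in job_stats:
        usage[group] = usage.get(group, 0) + slot_seconds
    total = sum(usage.values())
    if total == 0:
        return {g: 0.0 for g in usage}  # no usage, nothing to charge
    return {g: cluster_cost * s / total for g, s in usage.items()}
```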