2 December 2011

Hadoop in Three Use Cases

Joey Echeverria | Solutions Architect
joey@cloudera.com | @fwiffo

©2011 Cloudera, Inc. All Rights Reserved.

About Joey

• Solutions Architect
• 6 months
• 3+ years
• Local

Cloudera’s Distribution including Apache Hadoop

• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume*, Apache Sqoop*
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow: Apache Oozie*
• Scheduling: Apache Oozie*
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: Hue
• SDK: Hue SDK

*currently under incubation in the Apache Software Foundation

Extract, Transform, and Load

ETL before Hadoop

Difficult to maintain, not scalable

[Diagram: Logs, Files, and Relational Databases feed Custom ETL Scripts, which load the Enterprise Data Warehouse.]

ETL before Hadoop

May be scalable, expensive

[Diagram: Logs, Files, and Relational Databases load into the Enterprise Data Warehouse, where SQL turns the raw table into warehouse tables.]

ETL with Hadoop

Managed, flexible, scalable

[Diagram: sources are Logs, Files, and Relational Databases; the target is the Enterprise Data Warehouse. The Hadoop components in between are added one at a time on the slides that follow.]

Steps

1. In
2. Process
3. Out

Flume

[Diagram build: ETL with Hadoop. Sources: Logs, Files, Relational Databases. Target: Enterprise Data Warehouse. Hadoop components so far: Flume.]

HDFS

[Diagram: a Client sends open("file.txt") to the NameNode, which answers with DataNodes 02, 06, 10; the Client then reads the data from those DataNodes. The cluster shown has DataNodes 01 through 12.]

• Distributed
• Replication
• Bulk I/O
• Fault tolerant
• Scalable
• Append only
• Not POSIX
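The read path in the HDFS diagram, together with the replication and fault-tolerance points above, can be sketched as a toy model. This is a minimal illustration in plain Python, not the Hadoop client API: the class names, `read_file`, and the one-block-per-file simplification are all assumptions made for the example.

```python
# Toy model of an HDFS-style read. All names are illustrative assumptions;
# real HDFS splits files into blocks and has a very different client API.

class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}              # path -> bytes (one block per file here)

    def store(self, path, data):
        self.blocks[path] = data

    def read(self, path):
        return self.blocks[path]


class NameNode:
    """Holds metadata only: which DataNodes hold a replica of each file."""

    def __init__(self):
        self.locations = {}           # path -> list of DataNode ids

    def register(self, path, node_ids):
        self.locations[path] = list(node_ids)

    def open(self, path):
        return self.locations[path]


def read_file(path, name_node, data_nodes, down=()):
    """Ask the NameNode for replica locations, then read the first live one."""
    for node_id in name_node.open(path):
        if node_id not in down:
            return data_nodes[node_id].read(path)
    raise IOError("no live replica for " + path)


# Twelve DataNodes as in the diagram; file.txt is replicated on 02, 06, 10.
data_nodes = {i: DataNode(i) for i in range(1, 13)}
name_node = NameNode()
for node_id in (2, 6, 10):
    data_nodes[node_id].store("file.txt", b"data")
name_node.register("file.txt", (2, 6, 10))

print(name_node.open("file.txt"))                              # [2, 6, 10]
print(read_file("file.txt", name_node, data_nodes))            # b'data'
print(read_file("file.txt", name_node, data_nodes, down={2}))  # b'data'
```

The last call shows why replication gives fault tolerance: with DataNode 02 marked as down, the read still succeeds from another replica.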

[Diagram build: ETL with Hadoop. Sources: Logs, Files, Relational Databases. Target: Enterprise Data Warehouse. Hadoop components so far: Flume, HDFS.]

FUSE-DFS

• FUSE
  – User space
  – File systems
• FUSE-DFS
  – /hdfs
  – Mostly transparent

[Diagram build: ETL with Hadoop. Sources: Logs, Files, Relational Databases. Target: Enterprise Data Warehouse. Hadoop components so far: Flume, FUSE-DFS, HDFS.]

Sqoop

• SQL to Hadoop
• Parallel import
• File formats
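One way to picture "parallel import" is to divide a table's key range into contiguous chunks and give each chunk to a separate worker. The sketch below shows only that splitting idea in plain Python; it is not Sqoop's code, and the `orders` table, `id` column, and `split_ranges` helper are hypothetical names chosen for the example.

```python
def split_ranges(min_id, max_id, num_splits):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    num_splits = min(num_splits, total)       # never create empty chunks
    base, extra = divmod(total, num_splits)
    ranges, start = [], min_id
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each chunk becomes one bounded query that a worker could run on its own.
for lo, hi in split_ranges(1, 100, 4):
    print(f"SELECT * FROM orders WHERE id BETWEEN {lo} AND {hi}")
```

Because the chunks do not overlap and together cover the whole range, the workers can import in parallel without reading any row twice.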

[Diagram build: ETL with Hadoop. Sources: Logs, Files, Relational Databases. Target: Enterprise Data Warehouse. Hadoop components so far: Flume, FUSE-DFS, Sqoop, HDFS.]

Pig

• Scripting language
• Generates MapReduce jobs
• Perl for Hadoop
• Great for ETL

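-- Load rows with three integer fields.
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group rows by f1; each group carries a bag named A.
B = GROUP A BY f1;
-- COUNT takes a bag, so count the rows in each group's bag.
C = FOREACH B GENERATE COUNT(A);
DUMP C;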
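For readers who do not know Pig, the script above loads three-column rows, groups them by the first column, and counts the rows in each group. A rough single-machine equivalent in Python is sketched below; the sample rows are made up for illustration, and nothing here runs on Hadoop.

```python
from collections import defaultdict

# Hypothetical rows standing in for the 'data' file: (f1, f2, f3).
rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Group the rows by their first field, f1.
groups = defaultdict(list)
for row in rows:
    groups[row[0]].append(row)

# Count the rows in each group.
counts = {key: len(bag) for key, bag in groups.items()}
print(sorted(counts.items()))   # [(1, 1), (4, 2), (7, 1), (8, 2)]
```

Pig expresses the same grouping and counting but compiles it into MapReduce jobs, so it can run over data far larger than one machine's memory.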

[Diagram build: ETL with Hadoop. Sources: Logs, Files, Relational Databases. Target: Enterprise Data Warehouse. Hadoop components so far: Flume, FUSE-DFS, Sqoop, HDFS, Pig.]

Sqoop with connectors

• MySQL*
• PostgreSQL*
• Teradata*
• Netezza*
• Oracle*
• Couchbase*
• Microsoft SQL Server
• VoltDB

*Cloudera certified connector

[Diagram, complete: ETL with Hadoop. Sources: Logs, Files, Relational Databases. Target: Enterprise Data Warehouse. Hadoop components: Flume, FUSE-DFS, Sqoop, HDFS, Pig, and Sqoop again (it appears twice in the diagram).]

Recommendations

Recommendations with Hadoop

[Diagram: sources are Logs and Relational Databases; the consumer is a Web Application used by customers. The Hadoop components in between are added one at a time on the slides that follow.]

Flume

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume.]

HDFS

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume, HDFS.]

Sqoop

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume, Sqoop, HDFS.]

Pig

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume, Sqoop, HDFS, Pig.]

Mahout

• Scalable machine learning algorithms
  – Collaborative Filtering
  – User and Item based recommenders
  – K-Means, Fuzzy K-Means clustering
  – Mean Shift clustering
  – Singular value decomposition
  – Complementary Naive Bayes classifier …
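To give a feel for what an item-based recommender does, here is a minimal sketch that scores items by how often they co-occur with items a user already has. It is plain Python on made-up data, not Mahout's API or its actual similarity measures; `history`, `cooccurrence`, and `recommend` are names invented for this example.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical purchase history: user -> set of items.
history = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"lamp", "pen", "mug"},
}


def cooccurrence(history):
    """Count how often each ordered pair of items shows up for the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(user, history, top_n=2):
    """Rank items the user lacks by co-occurrence with items the user has."""
    counts = cooccurrence(history)
    seen = history[user]
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("bob", history))   # ['lamp', 'mug']
```

Bob has the book and the pen; the lamp co-occurs with those three times across users and the mug once, so the lamp ranks first.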

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume, Sqoop, HDFS, Pig, Mahout.]

MapReduce

[Diagram: map → shuffle → reduce. In the map phase, toOne() turns each of nine input records into a key:1 pair. The shuffle groups the pairs by key into the lists [1,1,1,1], [1,1], [1,1], and [1]. In the reduce phase, count() sums each list, giving 4, 2, 2, and 1.]

• Distributed
• Code to data
• Reliable
• Scalable
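The map, shuffle, and reduce phases from the MapReduce diagram can be imitated on one machine in a few lines. The sketch below reuses the diagram's function names (written as `to_one` and `count`), but the input records are made up and nothing here is the Hadoop MapReduce API.

```python
from collections import defaultdict


def to_one(record):
    """Map: emit a (key, 1) pair for each input record."""
    return (record, 1)


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def count(key, values):
    """Reduce: sum the ones collected for a key."""
    return (key, sum(values))


# Hypothetical input: nine records, matching the nine ":1" pairs in the diagram.
records = ["a", "b", "a", "c", "a", "b", "c", "a", "d"]

mapped = [to_one(r) for r in records]
grouped = shuffle(mapped)
reduced = dict(count(k, v) for k, v in grouped.items())

print(dict(grouped))   # {'a': [1, 1, 1, 1], 'b': [1, 1], 'c': [1, 1], 'd': [1]}
print(reduced)         # {'a': 4, 'b': 2, 'c': 2, 'd': 1}
```

In a real cluster the map calls run on the nodes that hold the input ("code to data"), and the shuffle moves the pairs across the network so that each reducer sees every value for its keys.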

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume, Sqoop, HDFS, Pig, Mahout, MapReduce.]

Oozie

• Workflows
• Coordinator
  – Triggers

[Diagram build: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components so far: Flume, Sqoop, HDFS, Pig, Mahout, MapReduce, Oozie.]

HBase

• Key/value store
• Data stored in HDFS
• Access model is get/put/del
  – Plus range scans and versions
• Random reads and writes for Hadoop
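The access model above (get/put/del, range scans, versions) can be mimicked with a small in-memory table. This is a simplified sketch in plain Python under loud assumptions: `ToyTable` and its methods are invented for the example, there are no column families or regions, and it is not the HBase client API.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a sorted, versioned key/value table."""

    def __init__(self):
        self._rows = {}    # row key -> list of (timestamp, value), oldest first
        self._keys = []    # row keys kept sorted so range scans are cheap

    def put(self, key, value, ts):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        self._rows[key].append((ts, value))
        self._rows[key].sort()

    def get(self, key, versions=1):
        """Return up to `versions` values for a key, newest first."""
        cells = self._rows.get(key, [])
        return [value for _, value in reversed(cells)][:versions]

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)

    def scan(self, start, stop):
        """Yield (key, newest value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key][-1][1]


table = ToyTable()
table.put("user#1", "v1", ts=1)
table.put("user#1", "v2", ts=2)
table.put("user#2", "x", ts=1)
table.put("video#9", "y", ts=1)

print(table.get("user#1"))                  # ['v2']
print(table.get("user#1", versions=2))      # ['v2', 'v1']
print(list(table.scan("user#", "user$")))   # [('user#1', 'v2'), ('user#2', 'x')]
```

Keeping the row keys sorted is what makes the range scan cheap, which is why row key design matters so much for tables of this kind.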

[Diagram, complete: Recommendations with Hadoop. Sources: Logs, Relational Databases. Consumer: Web Application (customers). Hadoop components: Flume, Sqoop, HDFS, Pig, Mahout, MapReduce, Oozie, HBase.]

Business Intelligence

Business Intelligence with Hadoop

[Diagram: sources are Logs and Relational Databases; the consumers are BI / Analytics tools used by analysts. The Hadoop components in between are added one at a time on the slides that follow.]

Flume

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume.]

HDFS

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume, HDFS.]

Sqoop

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume, Sqoop, HDFS.]

Hive

• Data warehouse
• Ad-hoc queries
  – Not real-time (minutes)
• SQL
• Tables
• Joins

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume, Sqoop, HDFS, Hive.]

MapReduce

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume, Sqoop, HDFS, Hive, MapReduce.]

Oozie

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume, Sqoop, HDFS, Hive, MapReduce, Oozie.]

HBase

[Diagram build: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components so far: Flume, Sqoop, HDFS, HBase, Hive, MapReduce, Oozie.]

Hive

Hive for Business Intelligence

• JDBC
  – JasperReports*
  – Pentaho*
• ODBC
  – MicroStrategy*^

* Vendor certified connector
^ Cloudera certified connector

[Diagram, complete: Business Intelligence with Hadoop. Sources: Logs, Relational Databases. Consumer: BI / Analytics (analysts). Hadoop components: Flume, Sqoop, HDFS, HBase, Hive, MapReduce, Oozie (Hive appears twice in the diagram).]

CDH

• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume*, Apache Sqoop*
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow: Apache Oozie*
• Scheduling: Apache Oozie*
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: Hue
• SDK: Hue SDK

*currently under incubation in the Apache Software Foundation

What’s next?

• Cloudera Training Videos
• CDH Virtual Machines
• Hadoop: The Definitive Guide, 2nd Edition
• Cloudera University
  – Developer Training in Columbia, MD
    • Dec 13-16, Feb 13-16
  – Administrator Training in Herndon, VA
    • Jan 4-6
  – Private Training

We’re Hiring!

• http://www.cloudera.com/company/careers/
• Customer Operations
  – Customer Operations Engineer
  – Customer Operations Tools Developer
• Customer Solutions
  – Solutions Architect
• Engineering
  – Senior Data Integration Developer
  – Senior Distributed Systems Engineer
  – Senior UI Engineer
  – Software Quality Engineer
  – Technical Writer
• IT/Operations
  – Systems Administrator
