big data with apache hadoop

49
Veldkant 33A, Kontich [email protected] www.infofarm.be Data Science Company Big Data with Apache Hadoop 15/06/2022

Upload: infofarm

Post on 04-Jun-2015

327 views

Category:

Software


2 download

DESCRIPTION

Slidedeck from our seminar about Hadoop (08/10/2014) Topics covered: - What is Big Data? - About Apache Hadoop - HDFS - MapReduce - Pig - Hive - HBase - Mahout & Machine Learning - Other tooling: Sqoop, Oozie, ... - Hadoop deployment options - Real-life cases

TRANSCRIPT

Page 1: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Big Data with Apache Hadoop

12/04/2023

Page 2: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Who am I

BEN VERMEERSCHBig Data Consultant

Cloudera Certified Developer for Apache

Hadoop

@[email protected]

Page 3: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

About InfoFarm

Data Science

Big Data

Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business

value from it.

Page 4: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

About InfoFarm

2 Data Scientists 4 Big Data Consultants

1 Infrastructure Specialist

Page 5: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Java

PHP E-Commerce

Mobile

Web Developmen

t

Page 6: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Page 7: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Agenda

• 09:30 – What is Big Data?• 09:45 – Hadoop – HDFS &

MapReduce• 10:00 – HDFS & MapReduce in

Practice• 10:30 – The Hadoop Ecosystem• 11:30 – Examples• 12:00 – Wrap up and Lunch

Page 8: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

What is Big Data?

Page 9: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

What is Big Data not?

Page 10: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

What is Big Data not?

• a technology• a solution (certainly not a silver-

bullet) to any IT problem• a replacement for an RDBMs• a cloud storage system• …

Page 11: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Big Data definition attempt

“a description of a problem domain with specific challenges and solutions which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data”

Page 12: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

The 3 V’s

Page 13: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Page 14: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Working the (Hadoop) Big Data way• Bringing data processing to the data

(vs centralized db)• Using unstructured or semi-

structured data• Store first, process later• Simple techniques applied at

massive scale• Your hardware will fail!

Page 15: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

OozieWorkflow

Hadoop (limited) overview

HDFSDistributed File System Amazon S3 Local FS

YARNDistributed Data Processing

MapReduce

HBaseNoSQL

HiveData Mart

PigScripting

SqoopSQL

ImportExport

MahoutMachine Learning

Page 16: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

HDFS

Page 17: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

HDFS Rack Topology

Page 18: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

MapReduce• A method for distributing tasks across multiple

nodes• Data is processed where it is stored (where

possible)• Two phases:– Map– Reduce

• Both fases have key-value pairs as input and output that may be chosen by the programmer

• The output from the mappers is used by the reducers

Page 19: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Map & Reduce

Mapper input Mapper output Reducer input Reducer output

Page 20: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Map functionInput.txt

Block 1

Block 2

Block 3

Node 1

Block 1

Block 2

Node 2

Block 2

Block 3

Node 3

Block 1

Block 3

Page 21: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Shuffle and sort

• Hadoop automatically sorts and merges output from all map tasks

This intermediate process is known as the shuffle and sort The result is supplied to reduce tasks

Page 22: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Reduce function• Reducer input comes from the shuffle and sort process receives one record at a time receives all records for a given key emit zero or more output records

• Example: A reduce function sums total per person and emits employee name (key) and total (value) as output

Page 23: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

MapReduce under the hood

Client ResourceManager

AppMasterNode 1

Node 2

Node 3

HDFS

Page 24: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

HDFS & MapReduce

DEMO

Page 25: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Joining

User Name

1 John

2 Maria

3 Jane

User Comment

1 Cool

2 Nonono

2 Hi there

3 Hadoop is awesome

Mapper Mapper

Key Value

1 AJohn

2 AMaria

3 AJane

Key Value

1 BCool

2 BNonono

2 BHi there

3 BHadoop is awesome

Page 26: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

JoiningKey Value

1 AJohn

2 AMaria

3 AJane

Key Value

1 BCool

2 BNonono

2 BHi there

3 BHadoop is awesome

Reducer

Shuffle/Sort

Key Values

1 AJohn; BCool

2 AMaria; BNonono; BHi there

3 AJane; BHadoop is awesome

Page 27: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Joining

Reducer

Key Values

1 AJohn; BCool

2 AMaria; BNonono; BHi there

3 AJane; BHadoop is awesome

Userid Name Comment

1 John Cool

2 Maria Nonono

2 Maria Hi there

3 Jane Hadoop is awesome

Page 28: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

MapReduce Design Patterns

• More info:

• Frameworks on top of MapReduce like Hive or Pig make this easier

Page 29: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

OozieWorkflow

The Hadoop Ecosystem

HDFSDistributed File System Amazon S3 Local FS

YARNDistributed Data Processing

MapReduce

HBaseNoSQL

HiveData Mart

PigScripting

SqoopSQL

ImportExport

MahoutMachine Learning

Page 30: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Apache Pig

• Processing framework for (large) datasets

• Pig Latin• Runs on Hadoop

(or local) with MapReduce

• Extensible with UDFs

Page 31: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Apache Pig

DEMO

Page 32: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Apache Hive

• SQL-like querying on Hadoop datasets

• Translates to MapReduce under the hood

• Originally developed at Facebook

• Now Apache Top Level project

Page 33: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Hive <-> Traditional RDBMS

• Schema on read• Fast initial load• Flexible schema• No update or

delete (only insert into)

• HiveQL (subset of SQL)

• Schema on write

• Slow initial load• Fixed schema• Updates,

deletes, inserts all possible

• SQL compliant

Page 34: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Apache Hive

DEMO

Page 35: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

HBase

• Column-oriented Data Store• Distributed• Type of NoSQL-DB• Based on Google BigTable

Page 36: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

HBase

• Lots and lots of data

• Large amount of clients

• Single selects• Range scan by

key• Variable schema

• Not Traditional RDBMS– Transactions– Group by– Join–Where– Like

Page 37: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

HBase

DEMO

Page 38: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Sqoop• Import data from structured data source

(typically RDBMS) into Hadoop• Export data into structured data sources from

Hadoop• sqoop import --connect jdbc:mysql://localhost/salesdb --table orders

• sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by ‘\t’

Page 39: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Mahout

• Scalable Machine Learning

Recommendation

Classification

Clustering

Page 40: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Recommendation

Page 41: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Classification

Mammal Reptile Bird

Page 42: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Clustering

Page 43: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

More information:

• Free seminar: Machine Learning in practice

• Fri 7th of November 2014 12:00 – 16:00

• Kontich

• http://www.buzzberry.be/events/

Page 44: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Integrating Hadoop in your IT landscape

JDBC

Page 45: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Tools – BigData – IT options• Hadoop is not a trivial piece of software to manage!

• On-premise– Commodity Hardware– Advantage: full control & performance– Disadvantage: required skills, migrations, backup, ...

• Cloud – Amazon AWS

– EMR (Elastic Map Reduce)– Storage in S3– Very competitive offering financially– Manageability and flexibility

• Cloud - IBM SoftLayer• Hardware options (performance)

Page 46: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Beyond MapReduce

Page 47: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

There is more…

Page 48: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Oak3 Courses

• Data Science• Hadoop• Hbase

• http://www.oak3.be/

Page 49: Big Data with Apache Hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.beData Science Company

Questions?