big data with apache hadoop

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Big Data with Apache Hadoop

12/04/2023


Who am I

BEN VERMEERSCHBig Data Consultant

Cloudera Certified Developer for Apache

Hadoop

@[email protected]


About InfoFarm

Data Science

Big Data

Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business

value from it.


About InfoFarm

2 Data Scientists 4 Big Data Consultants

1 Infrastructure Specialist


Java

PHP E-Commerce

Mobile

Web Developmen

t


Agenda

• 09:30 – What is Big Data?• 09:45 – Hadoop – HDFS &

MapReduce• 10:00 – HDFS & MapReduce in

Practice• 10:30 – The Hadoop Ecosystem• 11:30 – Examples• 12:00 – Wrap up and Lunch


What is Big Data?


What is Big Data not?


What is Big Data not?

• a technology• a solution (certainly not a silver-

bullet) to any IT problem• a replacement for an RDBMs• a cloud storage system• …


Big Data definition attempt

“a description of a problem domain with specific challenges and solutions which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data”


The 3 V’s


Working the (Hadoop) Big Data way• Bringing data processing to the data

(vs centralized db)• Using unstructured or semi-

structured data• Store first, process later• Simple techniques applied at

massive scale• Your hardware will fail!


OozieWorkflow

Hadoop (limited) overview

HDFSDistributed File System Amazon S3 Local FS

YARNDistributed Data Processing

MapReduce

HBaseNoSQL

HiveData Mart

PigScripting

SqoopSQL

ImportExport

MahoutMachine Learning

…


HDFS


HDFS Rack Topology


MapReduce• A method for distributing tasks across multiple

nodes• Data is processed where it is stored (where

possible)• Two phases:– Map– Reduce

• Both fases have key-value pairs as input and output that may be chosen by the programmer

• The output from the mappers is used by the reducers


Map & Reduce

Mapper input Mapper output Reducer input Reducer output


Map functionInput.txt

Block 1

Block 2

Block 3

Node 1

Block 1

Block 2

Node 2

Block 2

Block 3

Node 3

Block 1

Block 3


Shuffle and sort

• Hadoop automatically sorts and merges output from all map tasks

This intermediate process is known as the shuffle and sort The result is supplied to reduce tasks


Reduce function• Reducer input comes from the shuffle and sort process receives one record at a time receives all records for a given key emit zero or more output records

• Example: A reduce function sums total per person and emits employee name (key) and total (value) as output


MapReduce under the hood

Client ResourceManager

AppMasterNode 1

Node 2

Node 3

HDFS


HDFS & MapReduce

DEMO


Joining

User Name

1 John

2 Maria

3 Jane

User Comment

1 Cool

2 Nonono

2 Hi there

3 Hadoop is awesome

Mapper Mapper

Key Value

1 AJohn

2 AMaria

3 AJane

Key Value

1 BCool

2 BNonono

2 BHi there

3 BHadoop is awesome


JoiningKey Value

1 AJohn

2 AMaria

3 AJane

Key Value

1 BCool

2 BNonono

2 BHi there

3 BHadoop is awesome

Reducer

Shuffle/Sort

Key Values

1 AJohn; BCool

2 AMaria; BNonono; BHi there

3 AJane; BHadoop is awesome


Joining

Reducer

Key Values

1 AJohn; BCool

2 AMaria; BNonono; BHi there

3 AJane; BHadoop is awesome

Userid Name Comment

1 John Cool

2 Maria Nonono

2 Maria Hi there

3 Jane Hadoop is awesome


MapReduce Design Patterns

• More info:

• Frameworks on top of MapReduce like Hive or Pig make this easier


OozieWorkflow

The Hadoop Ecosystem

HDFSDistributed File System Amazon S3 Local FS

YARNDistributed Data Processing

MapReduce

HBaseNoSQL

HiveData Mart

PigScripting

SqoopSQL

ImportExport

MahoutMachine Learning

…


Apache Pig

• Processing framework for (large) datasets

• Pig Latin• Runs on Hadoop

(or local) with MapReduce

• Extensible with UDFs


Apache Pig

DEMO


Apache Hive

• SQL-like querying on Hadoop datasets

• Translates to MapReduce under the hood

• Originally developed at Facebook

• Now Apache Top Level project


Hive <-> Traditional RDBMS

• Schema on read• Fast initial load• Flexible schema• No update or

delete (only insert into)

• HiveQL (subset of SQL)

• Schema on write

• Slow initial load• Fixed schema• Updates,

deletes, inserts all possible

• SQL compliant


Apache Hive

DEMO


HBase

• Column-oriented Data Store• Distributed• Type of NoSQL-DB• Based on Google BigTable


HBase

• Lots and lots of data

• Large amount of clients

• Single selects• Range scan by

key• Variable schema

• Not Traditional RDBMS– Transactions– Group by– Join–Where– Like


HBase

DEMO


Sqoop• Import data from structured data source

(typically RDBMS) into Hadoop• Export data into structured data sources from

Hadoop• sqoop import --connect jdbc:mysql://localhost/salesdb --table orders

• sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by ‘\t’


Mahout

• Scalable Machine Learning

Recommendation

Classification

Clustering


Recommendation


Classification

Mammal Reptile Bird


Clustering


More information:

• Free seminar: Machine Learning in practice

• Fri 7th of November 2014 12:00 – 16:00

• Kontich

• http://www.buzzberry.be/events/


Integrating Hadoop in your IT landscape

JDBC


Tools – BigData – IT options• Hadoop is not a trivial piece of software to manage!

• On-premise– Commodity Hardware– Advantage: full control & performance– Disadvantage: required skills, migrations, backup, ...

• Cloud – Amazon AWS

– EMR (Elastic Map Reduce)– Storage in S3– Very competitive offering financially– Manageability and flexibility

• Cloud - IBM SoftLayer• Hardware options (performance)


Beyond MapReduce


There is more…


Oak3 Courses

• Data Science• Hadoop• Hbase

• http://www.oak3.be/

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.beData Science Company

Questions?

big data with apache hadoop

Software

data science companybig

hadoop big data way

there3 hadoop

business data andthe

data vscentralized db

infofarm2 data scientists

cloud storage system

hadoop hdfs mapreduce