Intro to Big Data using Hadoop

By Sergejus Barinovas (sergejus.blogas.lt)

Upload: sergejus-barinovas

Posted on 11-Nov-2014


DESCRIPTION

Introduction to Big Data and the Apache Hadoop project, with a MapReduce visualization.

TRANSCRIPT

Page 1: Intro to Big Data using Hadoop

Intro to Big Data using Hadoop

Sergejus Barinovas

sergejus.blogas.lt

fb.com/ITishnikai

@sergejusb

Page 2: Intro to Big Data using Hadoop

Information is powerful… but it is how we use it that will define us.

Page 3: Intro to Big Data using Hadoop

Data Explosion

picture from Big Data Integration

relational, text, audio, video, images

Page 4: Intro to Big Data using Hadoop

Big Data (globally)

– creates over 30 billion pieces of content per day
– stores 30 petabytes of data
– produces over 90 million tweets per day

Page 5: Intro to Big Data using Hadoop

Big Data (our example)

– logs over 300 gigabytes of transactions per day
– stores more than 1.5 terabytes of aggregated data

Page 6: Intro to Big Data using Hadoop

4 Vs of Big Data

volume, velocity, variety, variability

Page 7: Intro to Big Data using Hadoop

Big Data Challenges

Sort 10 TB on 1 node = 2.5 days
Sort 10 TB on a 100-node cluster = 35 mins
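The slide's numbers are a quick back-of-the-envelope check of near-linear scaling (an illustrative sketch, assuming sorting parallelizes with negligible coordination overhead, which real clusters only approximate):

```javascript
// Ideal linear speedup: divide single-node time by node count.
// Assumption: no coordination or shuffle overhead, so this is a
// lower bound on what a real 100-node cluster would need.
const singleNodeMinutes = 2.5 * 24 * 60; // 2.5 days = 3600 minutes
const nodes = 100;
const clusterMinutes = singleNodeMinutes / nodes;
console.log(clusterMinutes); // 36, close to the ~35 mins on the slide
```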

Page 8: Intro to Big Data using Hadoop

Big Data Challenges

“Fat” servers imply high cost
– use cheap commodity nodes instead

A large number of cheap nodes implies frequent failures
– leverage automatic fault tolerance

Page 9: Intro to Big Data using Hadoop

Big Data Challenges

We need a new data-parallel programming model for clusters of commodity machines.

Page 10: Intro to Big Data using Hadoop

MapReduce to the rescue!

Page 11: Intro to Big Data using Hadoop

MapReduce

Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters

Popularized by the Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, etc.

Page 12: Intro to Big Data using Hadoop

MapReduce

Who got it?

Page 13: Intro to Big Data using Hadoop

Word Count Example

Input lines: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”

Pipeline: Input → Map → Shuffle & Sort → Reduce → Output

Final output from the two reducers:
(the, 3), (brown, 2), (fox, 2), (how, 1), (now, 1)
(quick, 1), (ate, 1), (mouse, 1), (cow, 1)

Page 14: Intro to Big Data using Hadoop

Word Count Example

Input lines: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”

Pipeline: Input → Map → Shuffle & Sort → Reduce → Output

Map outputs (one mapper per line):
(the, 1), (quick, 1), (brown, 1), (fox, 1)
(the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
(how, 1), (now, 1), (brown, 1), (cow, 1)

Pairs routed to the two reducers:
(the, 1), (brown, 1), (fox, 1), (the, 1), (fox, 1), (the, 1), (how, 1), (now, 1), (brown, 1)
(quick, 1), (ate, 1), (mouse, 1), (cow, 1)

Page 15: Intro to Big Data using Hadoop

Word Count Example

Input lines: “the quick brown fox”, “the fox ate the mouse”, “how now brown cow”

Pipeline: Input → Map → Shuffle & Sort → Reduce → Output

After shuffle & sort, values are grouped per key:
(the, [1,1,1]), (brown, [1,1]), (fox, [1,1]), (how, [1]), (now, [1])
(quick, [1]), (ate, [1]), (mouse, [1]), (cow, [1])

Reduce outputs:
(the, 3), (brown, 2), (fox, 2), (how, 1), (now, 1)
(quick, 1), (ate, 1), (mouse, 1), (cow, 1)

Ta da!
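The three stages walked through above can be simulated in a few lines of plain JavaScript (an illustrative sketch, not Hadoop API code; the function names are made up): map emits (word, 1) pairs, shuffle groups the values by key, and reduce sums each group.

```javascript
// Map: split each line into words and emit (word, 1) per occurrence.
function mapPhase(lines) {
    const pairs = [];
    for (const line of lines) {
        for (const word of line.split(/[^a-zA-Z]+/)) {
            if (word !== "") pairs.push([word.toLowerCase(), 1]);
        }
    }
    return pairs;
}

// Shuffle & sort: group all emitted 1s under their word.
function shufflePhase(pairs) {
    const groups = new Map(); // word -> [1, 1, ...]
    for (const [word, one] of pairs) {
        if (!groups.has(word)) groups.set(word, []);
        groups.get(word).push(one);
    }
    return groups;
}

// Reduce: sum the grouped values for each word.
function reducePhase(groups) {
    const counts = new Map(); // word -> total
    for (const [word, ones] of groups) {
        counts.set(word, ones.reduce((a, b) => a + b, 0));
    }
    return counts;
}

const input = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"];
const counts = reducePhase(shufflePhase(mapPhase(input)));
console.log(counts.get("the")); // 3, matching the slide's output
```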

Page 16: Intro to Big Data using Hadoop

MapReduce philosophy

– hide complexity
– make it scalable
– make it cheap

Page 17: Intro to Big Data using Hadoop

MapReduce popularized by the Apache Hadoop project

Page 18: Intro to Big Data using Hadoop

Hadoop Overview

Open source implementation of
– the Google MapReduce paper
– the Google File System (GFS) paper

First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter, Amazon, etc.

Page 19: Intro to Big Data using Hadoop

Hadoop Core

MapReduce (Job Scheduling / Execution System)

Hadoop Distributed File System (HDFS)

Page 20: Intro to Big Data using Hadoop

Hadoop Core (HDFS)

MapReduce (Job Scheduling / Execution System)

Hadoop Distributed File System (HDFS)

• Name Node stores file metadata
• files are split into 64 MB blocks
• blocks are replicated across 3 Data Nodes
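The storage arithmetic behind those bullets is simple; as an illustration (`hdfsFootprint` is a made-up helper, and real HDFS stores only the actual bytes of a partial last block):

```javascript
// Illustrative HDFS storage arithmetic: a file is split into
// 64 MB blocks, and each block is replicated to 3 Data Nodes.
const BLOCK_SIZE_MB = 64;
const REPLICATION = 3;

function hdfsFootprint(fileSizeMB) {
    return {
        blocks: Math.ceil(fileSizeMB / BLOCK_SIZE_MB), // partial last block still counts as one
        rawStorageMB: fileSizeMB * REPLICATION,        // total bytes held across the cluster
    };
}

const f = hdfsFootprint(300); // a hypothetical 300 MB file
console.log(f); // { blocks: 5, rawStorageMB: 900 }
```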

Page 21: Intro to Big Data using Hadoop

Hadoop Core (HDFS)

MapReduce (Job Scheduling / Execution System)

Hadoop Distributed File System (HDFS)

Name Node Data Node

Page 22: Intro to Big Data using Hadoop

Hadoop Core (MapReduce)

• Job Tracker distributes tasks and handles failures

• tasks are assigned based on data locality

• Task Trackers can execute multiple tasks

MapReduce (Job Scheduling / Execution System)

Hadoop Distributed File System (HDFS)

Name Node Data Node


Page 23: Intro to Big Data using Hadoop

Hadoop Core (MapReduce)

MapReduce (Job Scheduling / Execution System)

Hadoop Distributed File System (HDFS)

Name Node Data Node

Job Tracker Task Tracker

Page 24: Intro to Big Data using Hadoop

Hadoop Core (Job submission)

Name Node Data Node

Job Tracker Task Tracker

Client

Page 25: Intro to Big Data using Hadoop

Hadoop Ecosystem

Hadoop Distributed File System (HDFS)
MapReduce (Job Scheduling / Execution System)
Pig (ETL), Hive (BI), Sqoop (RDBMS)
HBase, Avro, ZooKeeper

Page 26: Intro to Big Data using Hadoop

JavaScript MapReduce

var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
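To try those two functions outside Hadoop, one can mimic the `context` and `values` objects they expect. This harness is a sketch under that assumption (the mock objects are not any Hadoop API) and pushes the deck's three example lines through map, shuffle, and reduce:

```javascript
// The map and reduce functions from the slide, unchanged.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

// Mock context: collects emitted (key, value) pairs into an array.
function makeContext(out) {
    return { write: function (k, v) { out.push([k, v]); } };
}

// Java-style iterator over an array, as reduce expects.
function makeIterator(arr) {
    var i = 0;
    return {
        hasNext: function () { return i < arr.length; },
        next: function () { return arr[i++]; }
    };
}

var lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"];
var mapped = [];
lines.forEach(function (line) { map(null, line, makeContext(mapped)); });

// Shuffle: group the emitted 1s by word.
var groups = {};
mapped.forEach(function (pair) {
    (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
});

var reduced = [];
Object.keys(groups).forEach(function (word) {
    reduce(word, makeIterator(groups[word]), makeContext(reduced));
});
// reduced holds pairs such as ["the", 3] and ["brown", 2]
```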

Page 27: Intro to Big Data using Hadoop

Pig

words = LOAD '/example/count' AS (word: chararray, count: int);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;

Page 28: Intro to Big Data using Hadoop

Hive

CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;

Page 29: Intro to Big Data using Hadoop

Demo

Hadoop in the Cloud

Über Demo

Page 30: Intro to Big Data using Hadoop

Thanks!

Questions?