Intro to Big Data using Hadoop
DESCRIPTION
Introduction to Big Data and the Apache Hadoop project. MapReduce visualization.
TRANSCRIPT
Intro to Big Data using Hadoop
Sergejus Barinovas
sergejus.blogas.lt
fb.com/ITishnikai
@sergejusb
Information is powerful…
but it is how we use it that will define us.
Data Explosion
[picture from Big Data Integration: relational data, text, audio, video, images]
Big Data (globally)
– creates over 30 billion pieces of content per day
– stores 30 petabytes of data
– produces over 90 million tweets per day
Big Data (our example)
– logs over 300 gigabytes of transactions per day
– stores more than 1.5 terabytes of aggregated data
4 Vs of Big Data
volume, velocity, variety, variability
Big Data Challenges
Sort 10 TB on 1 node = 2.5 days
Sort 10 TB on a 100-node cluster = 35 mins (near-linear speed-up: 2.5 days ÷ 100 ≈ 36 minutes)
Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines.
MapReduce
to the rescue!
MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by the Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, …
Who got it?
Word Count Example
Input: “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
Pipeline: Input → Map → Shuffle & Sort → Reduce → Output
Reduce output: the, 3; brown, 2; fox, 2; how, 1; now, 1 | quick, 1; ate, 1; mouse, 1; cow, 1
Word Count Example
Input: “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
Pipeline: Input → Map → Shuffle & Sort → Reduce → Output
Map output: (the, 1) (quick, 1) (brown, 1) (fox, 1) | (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1) | (how, 1) (now, 1) (brown, 1) (cow, 1)
Partitioned to reducers: the, 1; brown, 1; fox, 1; the, 1; fox, 1; the, 1; how, 1; now, 1; brown, 1 | quick, 1; ate, 1; mouse, 1; cow, 1
Word Count Example
Input: “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
Pipeline: Input → Map → Shuffle & Sort → Reduce → Output
Shuffle & Sort output: the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1] | quick, [1]; ate, [1]; mouse, [1]; cow, [1]
Reduce output: the, 3; brown, 2; fox, 2; how, 1; now, 1 | quick, 1; ate, 1; mouse, 1; cow, 1
Ta da!
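The whole word-count flow can be simulated in a few lines of plain JavaScript. This is an illustrative sketch, not Hadoop API code: map emits (word, 1) pairs, shuffle & sort groups the values by key, and reduce sums each group.

```javascript
// Simulate the MapReduce word-count flow -- plain JavaScript sketch,
// not the Hadoop API.
function mapPhase(lines) {
  // Map: emit a (word, 1) pair for every word in every input line.
  var pairs = [];
  lines.forEach(function (line) {
    line.split(/\s+/).forEach(function (word) {
      if (word !== "") pairs.push([word, 1]);
    });
  });
  return pairs;
}

function shuffle(pairs) {
  // Shuffle & Sort: group all values by key, e.g. "the" -> [1, 1, 1].
  var groups = {};
  pairs.forEach(function (pair) {
    var key = pair[0];
    (groups[key] = groups[key] || []).push(pair[1]);
  });
  return groups;
}

function reducePhase(groups) {
  // Reduce: sum the grouped values, e.g. "the" -> 3.
  var counts = {};
  Object.keys(groups).forEach(function (key) {
    counts[key] = groups[key].reduce(function (a, b) { return a + b; }, 0);
  });
  return counts;
}

var input = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"];
var counts = reducePhase(shuffle(mapPhase(input)));
// counts.the === 3, counts.brown === 2, counts.fox === 2
```

In the real framework the three phases run on different machines, but the data flow is exactly this.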
MapReduce philosophy
– hide complexity
– make it scalable
– make it cheap
MapReduce popularized by the Apache Hadoop project
Hadoop Overview
Open source implementation of
– Google MapReduce paper
– Google File System (GFS) paper
First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter,
Amazon, etc.
Hadoop Core
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Hadoop Core (HDFS)
• Name Node stores file metadata
• files are split into 64 MB blocks
• blocks are replicated across 3 Data Nodes
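To make those numbers concrete, here is an illustrative sketch (assuming the classic 64 MB block size and replication factor 3) of how a file maps onto blocks and raw storage:

```javascript
// Illustrative sketch: HDFS block count and raw storage for a file,
// assuming the classic 64 MB block size and replication factor 3.
var BLOCK_MB = 64;
var REPLICATION = 3;

function hdfsFootprint(fileSizeMb) {
  // The file is split into 64 MB blocks (the last block may be partial).
  var blocks = Math.ceil(fileSizeMb / BLOCK_MB);
  return {
    blocks: blocks,                  // number of blocks tracked by the Name Node
    rawMb: fileSizeMb * REPLICATION  // each block is stored on 3 Data Nodes
  };
}

var f = hdfsFootprint(300); // a 300 MB file
// f.blocks === 5, f.rawMb === 900
```

So a single 300 MB file occupies 5 blocks and roughly 900 MB of raw cluster storage once replicated.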
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
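Data locality can be sketched as follows. This is a toy illustration, not the actual Job Tracker algorithm: prefer a free node that already holds a replica of the task's input block, and only fall back to a remote node (which must read the block over the network).

```javascript
// Toy sketch of locality-aware task assignment -- not the real
// Job Tracker scheduler. Each task knows which nodes hold replicas
// of its input block.
function assignTask(task, freeNodes) {
  for (var i = 0; i < freeNodes.length; i++) {
    if (task.replicaNodes.indexOf(freeNodes[i]) !== -1) {
      return freeNodes[i]; // data-local: the task runs where its block lives
    }
  }
  return freeNodes[0]; // fallback: block will be read over the network
}

var task = { replicaNodes: ["node2", "node5", "node7"] };
var chosen = assignTask(task, ["node1", "node5", "node9"]);
// chosen === "node5" -- a node holding a replica
```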
Hadoop Core (Job submission)
[diagram: Client submits a job to the Job Tracker / Name Node; tasks run on Task Trackers / Data Nodes]
Hadoop Ecosystem
Hadoop Distributed File System (HDFS)
MapReduce (Job Scheduling / Execution System)
HBase, Pig (ETL), Hive (BI), Sqoop (RDBMS), Avro, ZooKeeper
JavaScript MapReduce

var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
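The map/reduce pair above targets a streaming-style API in which `context.write` emits key/value pairs and `values` is a `hasNext()`/`next()` iterator. A minimal mock of that API (an assumption for local experimentation, not the real Hadoop runtime) lets the functions run standalone:

```javascript
// Minimal mock of the context/values API so the word-count map and
// reduce functions can run outside Hadoop. The map/reduce bodies are
// repeated here so this sketch is self-contained.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Mock context: collect emitted pairs, grouping values by key.
var emitted = {};
var mapContext = {
  write: function (key, value) {
    (emitted[key] = emitted[key] || []).push(value);
  }
};

// Run the map phase over one line of input.
map(null, "the fox ate the mouse", mapContext);

// Wrap each key's value list in the iterator shape reduce expects.
function iter(list) {
  var i = 0;
  return {
    hasNext: function () { return i < list.length; },
    next: function () { return list[i++]; }
  };
}

// Run the reduce phase per key, collecting final counts.
var result = {};
Object.keys(emitted).forEach(function (key) {
  reduce(key, iter(emitted[key]), {
    write: function (k, v) { result[k] = v; }
  });
});
// result.the === 2, result.fox === 1
```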
Pig

words = LOAD '/example/count' AS (word: chararray, count: int);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
Hive

CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
Demo
Hadoop in the Cloud
Über Demo
Thanks!
Questions?