An introduction to Big Data processing using Hadoop


Page 1: An introduction to Big-Data processing applying hadoop

An introduction to Big Data processing using Hadoop

A. Sedighi
hexican.com

Page 2: An introduction to Big-Data processing applying hadoop

Big Data: Definition

No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Page 3: An introduction to Big-Data processing applying hadoop

Information is powerful… but it is how we use it that will define us.

Page 4: An introduction to Big-Data processing applying hadoop

Data Explosion

- relational
- text
- audio
- video
- images

Page 5: An introduction to Big-Data processing applying hadoop

Big Data Era

- creates over 30 billion pieces of content per day

- stores 30 petabytes of data

- produces over 90 million tweets per day

Page 6: An introduction to Big-Data processing applying hadoop

Log Files

- Log files contain data.

- Each banking transaction should be logged at different levels.

How much log data does a banking solution generate per day?

Page 7: An introduction to Big-Data processing applying hadoop

Big Data: 3 V's

Page 8: An introduction to Big-Data processing applying hadoop

Big Data: 3 V's

volume, velocity, variety

Page 9: An introduction to Big-Data processing applying hadoop

Some Make It 4 V's

Page 10: An introduction to Big-Data processing applying hadoop

What is driving the Big Data industry?

Big Data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, from many sources
- Very large datasets
- More of a real-time focus

Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Page 11: An introduction to Big-Data processing applying hadoop

Big Data Challenges

Page 12: An introduction to Big-Data processing applying hadoop

Big Data Challenges

Sorting 10 TB of data:
- on 1 node takes about 2.5 days
- on 100 nodes takes about 35 minutes

(The total work is O(N log N) either way; spreading it across 100 machines divides the running time.)
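A rough sanity check on those numbers, assuming near-linear scaling and ignoring coordination overhead:

2.5 days ≈ 60 hours = 3,600 minutes
3,600 minutes / 100 nodes = 36 minutes

which matches the quoted ~35-minute figure.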

Page 13: An introduction to Big-Data processing applying hadoop

Big Data Challenges

Problem: “fat” servers imply high cost.

Solution: use cheap commodity nodes instead.

Problem: a large number of cheap nodes implies frequent failures.

Solution: leverage automatic fault tolerance.

Page 14: An introduction to Big-Data processing applying hadoop

Big Data Challenges

We need a new data-parallel programming model for clusters of commodity machines.

Page 15: An introduction to Big-Data processing applying hadoop

What Technology Do We Have for Big Data?

Page 17: An introduction to Big-Data processing applying hadoop

MapReduce

Page 18: An introduction to Big-Data processing applying hadoop

MapReduce

Published in 2004 by Google; popularized by the Apache Hadoop project.

Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.

Page 19: An introduction to Big-Data processing applying hadoop

Word Count Example
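A minimal sketch of word count using the Hadoop Java MapReduce API, essentially the example that ships with Hadoop: the mapper emits a (word, 1) pair for every word, and the reducer sums the counts for each word. Input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged in a jar, it would run as something like: hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name and paths here are placeholders).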

Page 20: An introduction to Big-Data processing applying hadoop

MapReduce philosophy

- hide complexity

- make it scalable

- make it cheap

Page 21: An introduction to Big-Data processing applying hadoop

MapReduce was popularized by the Apache Hadoop project.

Page 22: An introduction to Big-Data processing applying hadoop

Hadoop Overview

- Open-source implementation of Google's MapReduce and the Google File System (GFS)

- First release in 2008 by Yahoo!

- Wide adoption by Facebook, Twitter, Amazon, etc.

Page 24: An introduction to Big-Data processing applying hadoop

Everything Started by Searching

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

Page 25: An introduction to Big-Data processing applying hadoop

Hadoop Subprojects - 1

Page 26: An introduction to Big-Data processing applying hadoop

Hadoop Subprojects - 2

Page 27: An introduction to Big-Data processing applying hadoop

Hadoop Distributed File System (HDFS) - 1

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters on commodity hardware.

- “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Page 28: An introduction to Big-Data processing applying hadoop

Hadoop Distributed File System (HDFS) - 2

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters on commodity hardware.

- HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.

Page 29: An introduction to Big-Data processing applying hadoop

Hadoop Distributed File System (HDFS) - 3

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters on commodity hardware.

- HDFS is designed to carry on working, without noticeable interruption to the user, in the face of hardware failure.

Page 30: An introduction to Big-Data processing applying hadoop

Where HDFS doesn't work well:

● Low-latency data access

● Lots of small files

● Multiple writers, arbitrary file modifications

Page 31: An introduction to Big-Data processing applying hadoop

MapReduce and HDFS

Page 32: An introduction to Big-Data processing applying hadoop

HDFS Concepts: Blocks

HDFS uses a large block size: typically 64 MB, 128 MB, or 256 MB.

If the seek time is around 10 ms, and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB.
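Spelling out that arithmetic:

target transfer time = seek time / 0.01 = 10 ms / 0.01 = 1 s
block size = transfer rate × transfer time = 100 MB/s × 1 s = 100 MB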

Page 33: An introduction to Big-Data processing applying hadoop

Anatomy of a File Read
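From the client's side, the read path is hidden behind the ordinary FileSystem API: the client asks the namenode for block locations, then streams each block from the closest datanode holding a replica. A minimal sketch, assuming a hypothetical namenode address and file path:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the cluster (address is hypothetical).
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    InputStream in = null;
    try {
      // open() fetches block locations from the namenode; reads then
      // stream directly from datanodes.
      in = fs.open(new Path("/data/sample.txt"));  // hypothetical path
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}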

Page 34: An introduction to Big-Data processing applying hadoop

Anatomy of a File Write
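The write path uses the same FileSystem API: the namenode allocates blocks, and the data is pipelined through the datanodes that will hold each replica. A minimal sketch under the same hypothetical addresses and paths as above:

import java.io.OutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs =
        FileSystem.get(URI.create("hdfs://namenode:8020/"), new Configuration());
    // create() asks the namenode to allocate blocks; writes are pipelined
    // through the datanodes holding each replica.
    try (OutputStream out = fs.create(new Path("/data/out.txt"))) {  // hypothetical path
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }  // close() returns once the replication pipeline acknowledges the data
  }
}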

Page 36: An introduction to Big-Data processing applying hadoop

Replica Placement

Page 37: An introduction to Big-Data processing applying hadoop

Machine Learning - 1

Mahout's goal is to build scalable machine learning libraries. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.

Page 38: An introduction to Big-Data processing applying hadoop

Machine Learning - 2

Mahout can be used as a recommender engine on top of Hadoop clusters.
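As a sketch of the recommender idea, here is Mahout's Taste API in its simplest, in-memory form (Mahout also ships Hadoop-based distributed recommender jobs for data that doesn't fit on one machine). The ratings.csv file, its userID,itemID,preference format, and the user ID are hypothetical:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv (hypothetical): one userID,itemID,preference triple per line
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // How alike are two users' rating histories?
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Consider each user's 10 most similar neighbors.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 recommendations for user 1 (hypothetical ID).
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item);
    }
  }
}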

Page 39: An introduction to Big-Data processing applying hadoop

Using Hadoop for:

● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care