Big Data with Hadoop

TRANSCRIPT

Page 1: Big data with hadoop
Page 2: Big data with hadoop

1- Introduction.
2- What is big data?
3- The characteristics of big data.
4- Handling data.
5- Building a successful big data management cycle.
6- Big data applications.
7- History of Hadoop.
8- The core of Apache Hadoop.
9- Workflow and data movement.
10- The Apache Hadoop ecosystem.

2

Page 3: Big data with hadoop

INTRODUCTION

“Big data” became more than just a technical term for scientists,
engineers, and other technologists.

The term entered the mainstream on a myriad of fronts, becoming a
household word in news, business, health care, and people’s personal
lives.

The term even became synonymous with intelligence gathering and
spycraft.

3

Page 4: Big data with hadoop

These days, the rate of data generation has increased, and so have the
sources that generate such data; thus, the data becomes huge: “big
data”.

Traditional data was generated by employees; now, in the era of
massive data, it comes from:

- Employees.

- Users.

- Machines.

All of these constantly generate large amounts of data of different
types.

4

Page 5: Big data with hadoop

What is big data?

Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate.

5

Page 6: Big data with hadoop

Example

Every day, we create 2.5 quintillion bytes of data — so much that
90% of the data in the world today was created in the last two years
alone.

This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals, to name a
few. This data is big data.

6

Page 7: Big data with hadoop

Big data is not a single technology but a combination of old and new
technologies that helps companies gain actionable insight.

Big data is the capability to manage a huge volume of disparate data:

- at the right speed;

- within the right time frame to allow real-time analysis and reaction.

Why is big data important?

7

Page 8: Big data with hadoop

QUESTIONS

1- What are examples of machines that generate big data?

2- What are examples of employees that generate big data?

3- What are examples of users that generate big data?

8

Page 9: Big data with hadoop

Examples of machines that generate big data

1- GPS data:

GPS data records the exact position of a device at a specific moment
in time. GPS events can easily be transformed into position and
movement information.

Example: vehicles on a road network rely on the accurate and
sophisticated processing of GPS information.

9

Page 10: Big data with hadoop

2- Sensor data:

The availability of low-cost, intelligent sensors, coupled with the
latest 3G and 4G wireless technology, has driven a dramatic increase
in the volume of sensor data, as well as the need to extract
operational intelligence from that data in real time.

Examples include industrial automation plants and smart metering.

10

Page 11: Big data with hadoop

Examples of employees that generate big data

11

Page 12: Big data with hadoop

Examples of users that generate big data

12

Page 13: Big data with hadoop

THE CHARACTERISTICS OF BIG DATA

13

Page 14: Big data with hadoop

Big Data is typically broken down by three

characteristics:

✓ Volume: How much data

✓ Velocity: How fast that data is processed

✓ Variety: The various types of data

Big Data

14

Page 15: Big data with hadoop

The Variety

- Big data combines all kinds of data:

- structured data.

- unstructured data.

- semi-structured data.

This kind of data management requires that companies leverage both
their structured and unstructured data.

15

Page 16: Big data with hadoop

The characteristics of big data

16

Page 17: Big data with hadoop

[Diagram: The variety of big data. Big data spans structured data
(enterprise systems such as CRM and ERP, data warehouses, databases),
semi-structured data (XML, EDI, e-mail), and unstructured data (analog
data, audio/video streams, GPS tracking information).]

17

Page 18: Big data with hadoop

Looking at semi-structured data

Semi-structured data is a kind of data that:

- falls between structured and unstructured data.

- does not necessarily conform to a fixed schema.

- may be self-describing, with simple label/value pairs.

18

Page 19: Big data with hadoop

Looking at semi-structured data

For example, label/value pairs might include:

<family>=Jones, <mother>=Jane, and

<daughter>=Sarah.

Examples of semi-structured data include:

EDI, SWIFT, and XML.

You can think of them as sort of payloads for processing complex

events.

19

Page 20: Big data with hadoop

Traditional data & Big data

Traditional Data

- Documents.

- Finances .

- Stock records.

- Personal files.

Big Data

- Photographs .

- Audio & video .

- 3D models .

- Simulation .

- Location data.

20

Page 21: Big data with hadoop

[Diagram: The velocity and volume of big data. Velocity ranges from
real time, near real time, hourly, daily, weekly, and monthly down to
batch, and so on; volume ranges from megabytes and gigabytes to
terabytes, petabytes, and more.]

21

Page 22: Big data with hadoop

The Volume

The benefit gained from the ability to process large amounts of
information is the main attraction of big data analytics.

This volume presents the most immediate challenge to conventional IT
structures. It calls for scalable storage and a distributed approach
to querying.

22

Page 23: Big data with hadoop

The Velocity

- It’s not just the velocity of the incoming data: it’s possible to
stream fast-moving data into bulk storage for later batch processing.

- The velocity of big data, coupled with its variety, causes a move
toward real-time observations, allowing better decision making and
quicker action.

23

Page 24: Big data with hadoop

Example

- The importance lies in the speed of the feedback loop, taking data
from input through to decision.

- A commercial from IBM makes the point that you wouldn’t cross the
road if all you had was a five-minute-old snapshot of traffic
location. There are times when you simply won’t be able to wait for a
report to run or a Hadoop job to complete.

24

Page 25: Big data with hadoop

Categories for handling streaming data (“Velocity”)

Product categories for handling streaming data divide into:

1- Established proprietary products, such as IBM’s InfoSphere Streams.

2- Less-polished, still-emergent open source frameworks originating in
the web industry, such as Twitter’s Storm and Yahoo! S4.

25

Page 26: Big data with hadoop

Practical examples of big data

Example 1

Example 2

Example 3

- These are good web sites for grasping how much data is generated in
the world.

26

Page 27: Big data with hadoop

Different approaches to handling data

Different approaches to handling data exist based on whether:

- it is data in motion.

- it is data at rest.

27

Page 28: Big data with hadoop

Here’s a quick example of each:

- Data at rest would be used by a business analyst to better understand
customers’ current buying patterns based on all aspects of the customer
relationship, including sales, social media data, and customer service
interactions.

- Data in motion would be used if a company is able to analyze the
quality of its products during the manufacturing process to avoid
costly errors.

28

Page 29: Big data with hadoop

Managing Big data

With big data, it is now possible to virtualize data so that it can be:

- stored efficiently, utilizing cloud-based storage.

- stored more cost-effectively.

Improvements in network speed and reliability have removed other
physical limitations, making it possible to manage massive amounts of
data at an acceptable pace.

29

Page 30: Big data with hadoop

Building a Successful Big Data Management

Big data management should begin with:

- capture.

- organize.

- integrate.

- analyze.

- act.

This is the cycle of big data management.

30

Page 31: Big data with hadoop

Building a Successful Big Data Management

- Data must first be captured.

- Then it is organized and integrated.

- After this phase is successfully implemented, the data is analyzed
based on the problem being addressed.

- Finally, management takes action based on the outcome of that
analysis.

31

Page 32: Big data with hadoop

The importance of big data in our world & our future

Big data provides a competitive advantage for organizations.

- It helps to make decisions, thus increasing efficiency and profit
and reducing loss.

- Its benefits extend to areas including energy, education, health,
and huge scientific projects like the Human Genome Project (the study
of the entire human genetic material).

32

Page 33: Big data with hadoop

Big data applications

Some of the emerging applications are in areas such as:

- Healthcare.

- Manufacturing.

- Management.

- Traffic management.

They rely on huge volumes, velocities, and varieties of data to
transform the behavior of a market.

33

Page 34: Big data with hadoop

Example 1

In healthcare, a big data application might be able to monitor
premature infants to determine when the data indicates that
intervention is needed.

Example 2

In manufacturing, a big data application can be used to prevent a
machine from shutting down during a production run.

34

Page 35: Big data with hadoop

Let’s summarize some benefits of Big data

Some of the benefits of big data are:

- Increased storage capacity = scalable.

- Increased processing power = real-time.

- Availability of data = fault tolerant.

- Less cost = commodity hardware.

35

Page 36: Big data with hadoop

History of Hadoop

Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
Cutting, who was working at Yahoo! at the time, named it after his
son's toy elephant. It was originally developed to support
distribution for the Nutch search engine project.

36

Page 37: Big data with hadoop

Hadoop is designed to process huge amounts of structured and
unstructured data (terabytes to petabytes) and is implemented on racks
of commodity servers as a Hadoop cluster.

Hadoop is designed to parallelize data processing across computing
nodes to speed computations and hide latency.

37

Page 38: Big data with hadoop

What is Hadoop?

Apache Hadoop is a set of algorithms:

- An open source software framework written in Java.

- Distributed storage.

- Distributed processing.

- Built from commodity hardware.

- Files are replicated to handle hardware failure.

- It detects failures and recovers from them.

38

Page 39: Big data with hadoop

Some of Hadoop users

- Facebook.

- IBM.

- Google.

- Yahoo!.

- New York Times.

- Amazon/A9.

- And there are others

39

Page 40: Big data with hadoop

The Core of Apache Hadoop

At its core, Hadoop has two primary components:

1- The storage part: the Hadoop Distributed File System (HDFS), which
can support petabytes of data.

2- The processing part: MapReduce, which computes results in batch.

40

Page 41: Big data with hadoop

Hadoop Distributed File System (HDFS)

HDFS:

- Stores large files across a commodity cluster, typically in the
range of gigabytes to terabytes.

- Is a scalable and portable file system written in Java for the
Hadoop framework.

- Replicates data across multiple hosts.

41

Page 42: Big data with hadoop

HDFS:

- With the default replication value, 3, data is stored on three
nodes: two on the same rack, and one on a different rack.

- Data nodes can talk to each other:

- to rebalance data.

- to move copies around.

- to keep the replication of data high.

42
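As an illustration of the replication setting described above, here is
a minimal sketch using the standard Hadoop FileSystem API; the NameNode
address and file path are hypothetical placeholders, not from the
original deck.

// Minimal sketch: adjusting HDFS replication with the standard
// org.apache.hadoop.fs.FileSystem API. The fs.defaultFS address and
// file path are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        conf.set("dfs.replication", "3");                 // applies to newly written files

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication of one important existing file to 5 copies.
        Path important = new Path("/data/important.log");
        fs.setReplication(important, (short) 5);

        fs.close();
    }
}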

Page 43: Big data with hadoop

Question

- Why do we need to replicate files in HDFS?

To achieve reliability: if a failure occurs on any node, we can
continue processing.

43

Page 44: Big data with hadoop

HDFS

HDFS works by breaking large files into smaller pieces called blocks.

- The blocks are stored on data nodes.

- It is the NameNode's responsibility to know which blocks on which
data nodes make up the complete file; it "keeps track of where data is
physically stored".

- The NameNode acts as a "traffic cop," managing all access to the
files.

44
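To make the NameNode's block bookkeeping concrete, here is a sketch
that asks HDFS where the blocks of a file physically live; the file
path is a hypothetical placeholder.

// Minimal sketch: listing the physical locations of a file's blocks.
// The NameNode answers this query.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/data/big-file.txt"));

        // One BlockLocation per block: which data nodes hold a copy.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + " hosts: " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}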

Page 45: Big data with hadoop

[Figure: How a Hadoop cluster is mapped to hardware.]

45

Page 46: Big data with hadoop

The responsibility of the NameNode

The NameNode acts as a "traffic cop," managing all access to the
files, including:

1- reading data blocks on the data nodes.

2- writing data blocks on the data nodes.

3- creating data blocks on the data nodes.

4- deleting data blocks on the data nodes.

5- replicating data blocks on the data nodes.

46

Page 47: Big data with hadoop

The Relationship Between NameNode & DataNodes

The NameNode and data nodes operate in a "loosely coupled" fashion
that allows the cluster elements to behave dynamically, adding (or
subtracting) servers as demand increases (or decreases).

[Figure: How a Hadoop cluster is mapped to hardware.]

47

Page 48: Big data with hadoop

Are DataNodes also smart?

The NameNode is very smart.

Data nodes are not very smart.

48

Page 49: Big data with hadoop

Are DataNodes also smart?

- Data nodes are not very smart, but the NameNode is.

- The DataNodes constantly ask the NameNode whether there is anything
for them to do.

- This also tells the NameNode which data nodes are out there and how
busy they are.

- The NameNode is so critical for correct operation of the cluster
that it can and should be replicated to guard against a single point
of failure.

49

Page 50: Big data with hadoop

MapReduce

Map: distributes a computational problem across a cluster.

Reduce: the master node collects the answers to all the sub-problems
and combines them.

[Diagram: a master node distributing copies of the problem to worker
nodes.]

50

Page 51: Big data with hadoop

An example of an inverted index being created in MapReduce

Page 52: Big data with hadoop

52

The following shows the mapper code:

public static class Map extends Mapper<LongWritable, Text, Text, Text> {

    private Text documentId;
    private Text word = new Text();

    @Override
    protected void setup(Context context) {
        String filename =
            ((FileSplit) context.getInputSplit()).getPath().getName();
        documentId = new Text(filename);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : StringUtils.split(value.toString())) {
            word.set(token);
            context.write(word, documentId);
        }
    }
}

Page 53: Big data with hadoop

53

public static class Map extends Mapper<LongWritable, Text, Text, Text> {

When you extend the MapReduce mapper class, you specify the key/value
types for your inputs and outputs. You use the MapReduce default
InputFormat for your job, which supplies keys as byte offsets into the
input file, and values as each line in the file. Your map emits Text
key/value pairs.

The following shows the mapper code

Page 54: Big data with hadoop

54

private Text documentId;

A Text object to store the document ID (filename) for your input.

private Text word = new Text();

To cut down on object creation, you create a single Text object, which
you'll reuse.
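The deck shows only the mapper. For completeness, here is a sketch of
a matching reducer (not in the original): for each word it gathers the
distinct document IDs emitted by the mappers. It assumes the usual
imports (org.apache.hadoop.mapreduce.Reducer, java.util, and the same
commons-lang StringUtils the mapper uses).

// Hypothetical companion reducer for the inverted index: emits each
// word with a comma-separated list of the documents that contain it.
public static class Reduce extends Reducer<Text, Text, Text, Text> {

    private Text docIds = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Use a set so a document appearing many times is listed once.
        Set<String> unique = new TreeSet<String>();
        for (Text docId : values) {
            unique.add(docId.toString());
        }
        docIds.set(StringUtils.join(unique, ","));
        context.write(key, docIds);   // e.g. "hadoop" -> "doc1.txt,doc7.txt"
    }
}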

Page 55: Big data with hadoop

Hadoop MapReduce

- The InputFormat decides how the file is going to be broken into
smaller pieces for processing, using a function called InputSplit.

- It then assigns a RecordReader to transform the raw data for
processing by the map.

- The map then requires two inputs: a key and a value.

Workflow and data movement in a small Hadoop cluster.

55
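A typical driver that wires these pieces together might look like the
following sketch; the class names are hypothetical, and Map and Reduce
stand for the mapper and reducer shown earlier (in a real file they
would be nested classes referenced through their outer class).
TextInputFormat is the default InputFormat whose splits feed a
RecordReader that hands the map byte-offset keys and line values.

// Hypothetical driver for the inverted index job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverted index");
        job.setJarByClass(InvertedIndexDriver.class);

        job.setMapperClass(Map.class);      // mapper from the earlier slide
        job.setReducerClass(Reduce.class);  // reducer sketched above
        job.setInputFormatClass(TextInputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}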

Page 56: Big data with hadoop

The mapping begins

- Your data is now in a form acceptable to map.

- For each input pair, a distinct instance of map is called to process
the data.

- map and reduce need to work together to process your data.

- The OutputCollector collects the output from the independent
mappers.

56

Page 57: Big data with hadoop

The mapping begins (cont.)

- A Reporter function provides information gathered from map tasks, so
that you know when or if the map tasks are complete.

- All this work is being performed on multiple nodes in the Hadoop
cluster simultaneously.

57

Page 58: Big data with hadoop

Workflow and data movement

- Some of the output may be on a node different from the node where
the reducers for that specific output will run.

- A partitioner and a sort gather and shuffle the intermediate
results.

- Map tasks deliver their results to a specific partition as inputs to
the reduce tasks.

- After all the map tasks are complete, the intermediate results are
gathered in the partition, and a shuffle and sort occur, preparing the
output for optimal processing by reduce.

58
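The routing of map output to partitions can be made concrete with a
small sketch. This is not from the deck: Hadoop's default
HashPartitioner does essentially this, and a custom Partitioner lets
you control which reducer receives which keys.

// Sketch: how map output is routed to a reduce partition.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Same key -> same partition, so one reduce task sees all
        // values for a given word.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// Enabled in the driver with: job.setPartitionerClass(WordPartitioner.class);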

Page 59: Big data with hadoop

Reduce & Combine

- For each output pair, reduce is called to perform its task.

- Reduce gathers its output while all the tasks are processing.

- Reduce can't begin until all the mapping is done, and it isn't
finished until all instances are complete.

- The output of reduce is a key and a value.

- The OutputFormat takes the key-value pair and organizes the output
for writing to HDFS.

- The RecordWriter takes the OutputFormat data and writes it to HDFS.

59

Page 60: Big data with hadoop

The Benefits of MapReduce

- Hadoop MapReduce is the heart of the Hadoop system.

- MapReduce provides the capabilities you need to break big data into
manageable chunks.

- MapReduce processes data in parallel on your cluster.

- MapReduce makes the data available for user consumption.

- MapReduce does all this work in a highly resilient, fault-tolerant
manner.

60

Page 61: Big data with hadoop

61

Page 62: Big data with hadoop

The major utilities of Hadoop

- Apache Hive: SQL-like language and metadata repository.

- Apache Pig: high-level language for expressing data analysis
programs.

- Apache HBase: the Hadoop database; random, real-time read/write
access.

- Hue: browser-based desktop interface for interacting with Hadoop.

- Oozie: server-based workflow engine for Hadoop activities.

- Sqoop: integrating Hadoop with RDBMS.

- Apache Whirr: library for running Hadoop in the cloud.

- Flume: distributed service for collecting and aggregating log and
event data.

- Apache Zookeeper: highly reliable distributed coordination service.

62

Page 63: Big data with hadoop

Apache HBase

- A distributed, nonrelational (columnar) database that utilizes HDFS
as its persistence store.

- Modeled after Google BigTable.

- Layered on Hadoop clusters.

- Capable of hosting very large tables (billions of columns/rows).

- Provides random, real-time read/write access.

- Highly configurable, providing the flexibility to address huge
amounts of data efficiently.

63
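To illustrate the random, real-time read/write access mentioned above,
here is a minimal sketch using the classic HBase Java client API of
that era; the table, row, and column names are hypothetical.

// Sketch: one random write and one random read against HBase.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "sensor_readings"); // hypothetical table

        // Random real-time write: one cell in column family "d".
        Put put = new Put(Bytes.toBytes("device42-20150716"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
        table.put(put);

        // Random real-time read of the same row.
        Result row = table.get(new Get(Bytes.toBytes("device42-20150716")));
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));

        table.close();
    }
}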

Page 64: Big data with hadoop

64

Page 65: Big data with hadoop

Mining Big Data with Hive

Hive is a batch-oriented, data-warehousing layer:

- built on the core elements of Hadoop (HDFS and MapReduce).

- It provides users who know SQL with HiveQL.

- Hive queries can take several minutes or even hours depending on
complexity.

- Hive is best used for data mining and deeper analytics.

- It relies on the Hadoop foundation.

- It is very extensible, scalable, and resilient.

65

Page 66: Big data with hadoop

Hive uses three mechanisms for data organization:

✓ Tables.

✓ Partitions.

✓ Buckets.

Hive supports multitable queries and inserts by sharing the input data
within a single HiveQL statement.

66

Page 67: Big data with hadoop

Tables

Hive tables are the same as RDBMS tables, consisting of rows and
columns.

- Tables are mapped to directories in the file system.

- Hive also supports tables stored in other native file systems.

67

Page 68: Big data with hadoop

Partitions

- A Hive table can support one or more partitions.

- Partitions are mapped to subdirectories in the underlying file
system and represent the distribution of data throughout the table.

For example: if a table is called autos, with a key value of 12345 and
a maker value of Ford, the path to the partition would be
/hivewh/autos/kv=12345/Ford.

68

Page 69: Big data with hadoop

Buckets

- In turn, data may be divided into buckets.

- Buckets are stored as files in the partition directory in the
underlying file system.

- Buckets are based on the hash of a column in the table.

In the preceding example, you might have a bucket called Focus,
containing all the attributes of a Ford Focus auto.

69
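A sketch of how the autos example might be declared, issued here
through HiveServer2's JDBC driver: the connection URL, the column
names, and the bucket count are assumptions, while PARTITIONED BY and
CLUSTERED BY are standard HiveQL clauses for the partitions and
buckets just described.

// Sketch: creating a partitioned, bucketed Hive table from Java.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAutosExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "", ""); // hypothetical URL
        Statement stmt = conn.createStatement();

        // Partitions become subdirectories (e.g. .../autos/kv=12345/...);
        // bucketing by model hashes rows into a fixed number of files.
        stmt.execute(
            "CREATE TABLE autos (model STRING, price DOUBLE) " +
            "PARTITIONED BY (kv INT, maker STRING) " +
            "CLUSTERED BY (model) INTO 16 BUCKETS");

        stmt.close();
        conn.close();
    }
}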

Page 70: Big data with hadoop

Pig and Pig Latin

Pig makes Hadoop more approachable and usable by non-developers.

- It is an interactive, or script-based, execution environment
supporting Pig Latin, a language used to express data flows.

- The Pig Latin language supports loading and processing input data to
transform it and produce the desired output.

70

Page 71: Big data with hadoop

The Pig execution environment has two modes:

✓ Local mode: all scripts are run on a single machine; Hadoop
MapReduce and HDFS are not required.

✓ Hadoop: also called MapReduce mode, all scripts are run on a given
Hadoop cluster.

Pig and Pig Latin

71

Page 72: Big data with hadoop

- The Pig Latin language provides an abstract way to get answers from
big data, by focusing on the data and not on the structure of a custom
software program.

- Pig makes prototyping very simple.

For example, you can run a Pig script on a small representation of
your big data environment to ensure that you are getting the desired
results before committing to processing all the data.

Pig and Pig Latin. cont.

72

Page 73: Big data with hadoop

- Pig programs can run in three different ways, all compatible with
local and Hadoop mode:

✓ Script.

✓ Grunt.

✓ Embedded.

Pig and Pig Latin. cont.

73

Page 74: Big data with hadoop

✓ Script: simply a file containing Pig Latin commands,

- identified by the .pig suffix (for example, file.pig or
myscript.pig).

- The commands are interpreted by Pig and executed in sequential
order.

Script

74

Page 75: Big data with hadoop

✓ Grunt: Grunt is a command interpreter.

- You can type Pig Latin on the grunt command line.

- Grunt executes the commands on your behalf.

- It is very useful for prototyping and “what if” scenarios.

✓ Embedded: Pig programs can be executed as part of a Java program, as
the sketch after this slide shows.

Grunt & Embedded

75
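Here is a sketch of the "Embedded" mode: running Pig Latin from a Java
program with Pig's PigServer class. The word-count script and the
input/output paths are hypothetical.

// Sketch: embedded Pig in local mode (no cluster required).
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery adds one Pig Latin statement to the plan.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE " +
                          "FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // store() triggers execution and writes the result.
        pig.store("counts", "output");
        pig.shutdown();
    }
}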

Page 76: Big data with hadoop

Pig Latin has a rich syntax. It supports operators for the following
operations:

✓ Loading and storing of data.
✓ Streaming data.
✓ Filtering data.
✓ Grouping and joining data.
✓ Sorting data.
✓ Combining and splitting data.

Pig Latin also supports a wide variety of types, expressions,
functions, diagnostic operators, macros, and file system commands.

Pig and Pig Latin. cont.

76

Page 77: Big data with hadoop

77

Page 78: Big data with hadoop

Apache Sqoop

- Sqoop (SQL-to-Hadoop) is a tool that offers the capability to
extract data from non-Hadoop data stores, transform the data into a
form usable by Hadoop, and then load the data into HDFS.

- This process is called ETL, for Extract, Transform, and Load.

- Sqoop commands are executed one at a time.

78

Page 79: Big data with hadoop

Key features of Sqoop

✓ Bulk import:

- Sqoop can import individual tables or entire databases into HDFS.

- The data is stored in the native directories and files of the HDFS
file system.

✓ Direct input:

- Sqoop can import and map SQL databases directly into Hive and HBase.

79

Page 80: Big data with hadoop

✓ Data interaction:

- Sqoop can generate Java classes so that you can interact with the
data programmatically.

✓ Data export:

- Sqoop can export data directly from HDFS into a relational database.

Key features of Sqoop (cont.)

80

Page 81: Big data with hadoop

How Apache Sqoop works

- Sqoop works by looking at the database you want to import.

- It selects an appropriate import function for the source data.

- It recognizes the input, then reads the metadata for the table or
database.

- It then creates a class definition of your input requirements.

81
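A sketch of a bulk import, assuming Sqoop 1's org.apache.sqoop.Sqoop
entry point, whose runTool arguments mirror the command-line flags.
The JDBC URL, credentials, table, and target directory are
hypothetical.

// Sketch: kicking off a Sqoop import from Java.
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",  // source database
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",                       // table to import
            "--target-dir", "/data/orders"             // destination in HDFS
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}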

Page 82: Big data with hadoop

82

Page 83: Big data with hadoop

Apache Zookeeper

- Zookeeper is Hadoop’s way of coordinating all the elements of
distributed applications.

- It is simple, but its features are powerful.

- It manages groups of nodes in service to a single distributed
application.

- It is best implemented across racks.

83

Page 84: Big data with hadoop

Some of the capabilities of Zookeeper are as follows:

✓ Process synchronization .

✓ Configuration management .

✓ Self-election .

✓ Reliable messaging .

The capabilities of Zookeeper

Zookeeper

84

Page 85: Big data with hadoop

The capabilities of Zookeeper. Cont.

✓ Process synchronization:

- Zookeeper coordinates the starting and stopping of multiple nodes in
the cluster.

- This ensures that all processing occurs in the intended order.

- Only when an entire process group is complete can subsequent
processing occur.

85

Page 86: Big data with hadoop

✓ Configuration management:

- Zookeeper can be used to send configuration attributes to any or all
nodes in the cluster.

- When processing is dependent on particular resources being available
on all the nodes, Zookeeper ensures the consistency of the
configurations.

The capabilities of Zookeeper. Cont.

86

Page 87: Big data with hadoop

✓ Self-election:

- Zookeeper understands the makeup of the cluster.

- It can assign a "leader" role to one of the nodes.

- The leader/master handles all client requests on behalf of the
cluster.

- Should the leader node fail, another leader will be elected from the
remaining nodes.

The capabilities of Zookeeper. Cont.

87
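The classic ZooKeeper leader-election recipe can be sketched as
follows: every node creates an ephemeral sequential znode under a
parent such as /election, and the node with the lowest sequence number
acts as leader. The connection string is hypothetical, and the sketch
assumes the /election parent znode already exists.

// Sketch: ZooKeeper self-election via ephemeral sequential znodes.
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) { /* ignored in this sketch */ }
        });

        // Ephemeral: the znode disappears if this node dies, triggering
        // re-election. Sequential: ZooKeeper appends a unique counter.
        String me = zk.create("/election/node-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);

        // Lowest sequence number wins the leader role.
        boolean leader = me.endsWith(candidates.get(0));
        System.out.println(leader ? "I am the leader" : "I am a follower");
    }
}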

Page 88: Big data with hadoop

✓ Reliable messaging:

- Even though workloads in Zookeeper are loosely coupled, Zookeeper
offers a publish/subscribe capability that allows the creation of a
queue.

- The queue guarantees message delivery even in the case of a node
failure.

The capabilities of Zookeeper. Cont.

88

Page 89: Big data with hadoop

The Benefits of Hadoop

- Hadoop represents one of the most pragmatic ways to allow companies
to manage huge volumes of data easily.

- It allows big problems to be broken down into smaller elements, so
analysis can be done quickly and cost-effectively.

- Big data is processed in parallel.

- The small pieces of processed information are then regrouped to
present the results.

89

Page 90: Big data with hadoop

90

Page 91: Big data with hadoop

Any Questions?

91

Page 92: Big data with hadoop

92