
Page 1: Big Data - A brief introduction

Big Data

A brief introduction into Big Data & Hadoop

01/01/16 F. v. Noort

Page 2: Big Data - A brief introduction


Big Data – A definition

• Big data usually includes data sets (both structured and unstructured) with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.

• Doug Laney (2001) 3V’s: “data growth challenges and opportunities defined as being three-dimensional, i.e. increasing Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources)”


Page 3: Big Data - A brief introduction


Big Data – A definition

• Gartner (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."


Page 4: Big Data - A brief introduction

Big Data – Characterization

The original 3 V's have been expanded into the following more complete set of characteristics:

• Volume: the quantity of generated & stored data
• Velocity: the speed at which data is generated & processed
• Variety: the type and nature of the data
• Variability: inconsistency of the data set can hamper the processes that handle and manage it
• Veracity: the quality of captured data can vary greatly, affecting accurate analysis
• Complexity: managing data coming from multiple sources can be very challenging; data must be linked, connected, and correlated so users can query and process it effectively

Page 5: Big Data - A brief introduction


Big Data versus Business Intelligence (BI)

• Business Intelligence uses descriptive statistics with high information density data to measure things, detect trends, etc.

• Big data analytics uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets with low information density, in order to reveal relationships and dependencies or to predict outcomes and behaviors.


Page 6: Big Data - A brief introduction


Architecture: Client Server

[Diagram: a single server connected to two rows of clients]

Clients can always overwhelm the system!

Page 7: Big Data - A brief introduction


Architecture: Storage Area Network

[Diagram: clients connect to a central contact point that fronts a bank of servers]

Clients can always overwhelm the system!

Page 8: Big Data - A brief introduction


Architecture: Google File System (GFS)

• Instead of having a giant file storage appliance sitting in the back end, use industry-standard hardware on a large scale

• Drive high performance through the sheer number of components

• Reliability through redundancy & replication

• Computation is done where the data is

[Diagram: a large grid of commodity nodes, each pairing storage with compute]

Page 9: Big Data - A brief introduction


Hadoop
• Based on work from the Google File System + MapReduce papers
• Doug Cutting & Mike Cafarella created their own version: Hadoop (named after the toy elephant of Doug's son)
• Current distributions are based on open-source and vendor work:
  – Apache Hadoop
  – Cloudera – CDH4 w/ Impala
  – Hortonworks
  – MapR
  – AWS
  – Windows Azure HDInsight


Page 10: Big Data - A brief introduction


Why use Hadoop?

• Scalability: scales to petabytes or more
• Fault tolerant
• Faster: parallel data processing
• Better: suited for particular types of Big Data problems
• Open source
• Low cost: can be deployed on commodity hardware


Page 11: Big Data - A brief introduction


Hadoop Core Architecture

The Hadoop core comprises:

• A distributed file system
  – HDFS: Hadoop Distributed File System (based on GFS)
  – File sharing & data protection across physical servers

• A processing paradigm
  – MapReduce: distributed computing across physical servers

[Diagram: MapReduce layered on top of HDFS]

Page 12: Big Data - A brief introduction


HDFS (1/2)

Hadoop Distributed File System
• Written in Java
• Sits on top of the native file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file; large files are split into blocks
• Built on x86 standards
• Lots of flexibility: reference architectures for many types of servers

[Diagram: the Hadoop Distributed File System spanning multiple x86 servers]

Page 13: Big Data - A brief introduction


HDFS (2/2)

• HDFS file blocks
  – 64 MB (default), 128 MB (recommended)
  – One HDFS block is backed by multiple operating system (OS) blocks
• Blocks are replicated (default 3x) to multiple nodes
• Allows for node failure without data loss
• Two key services
  – Master NameNode
  – Many DataNodes
• Checkpoint Node (Secondary NameNode)
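As a back-of-the-envelope illustration of the block model, the sketch below (plain Python, not Hadoop code; the function name and the sample file size are ours) computes how many blocks a file occupies and how much raw capacity the default 3x replication consumes:

```python
# Illustrative sketch: splitting a file into fixed-size HDFS blocks
# and accounting for replication. Not Hadoop code.
def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total MB stored including replicas)."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return num_blocks, file_size_mb * replication

# A 1 GB (1024 MB) file with the 64 MB default block size:
blocks, stored = hdfs_blocks(1024)
print(blocks, stored)  # 16 blocks, 3072 MB of raw capacity
```

Note the last (partial) block only occupies as much OS storage as it actually contains, which this rough sketch ignores.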


Page 14: Big Data - A brief introduction


MapReduce

“Take your task, which is data oriented; chunk it up and distribute it over the network such that every piece of work is done within the network by the machine that holds the piece of data that needs to be worked on.”

MapReduce:
• Processing paradigm that pairs with HDFS
• Distributed computation algorithm that pushes the compute down to each of the x86 servers
• Fault tolerant
• Parallelized (scalable) processing
• Combination of a Map and a Reduce procedure:
  – Map procedure: performs filtering and sorting of the data
  – Reduce procedure: performs summary operations
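The Map/Reduce split above can be sketched in a few lines of plain Python. This is an illustrative in-memory model, not the Hadoop API; the function names are ours:

```python
# Minimal in-memory sketch of the MapReduce idea: map emits (key, value)
# pairs, a shuffle groups them by key, and reduce summarizes each group.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values)           # reduce phase, one call per key
            for key, values in sorted(groups.items())}

# Word count, the canonical example:
lines = ["big data big hadoop", "hadoop big"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=sum)
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop the map calls run on the DataNodes that hold the input blocks, and the shuffle moves data over the network between the map and reduce phases.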


Page 15: Big Data - A brief introduction


Other Hadoop tools/frameworks

• Data access: Hive, Pig, Mahout

• Tools: Sqoop, Flume


Page 16: Big Data - A brief introduction


Hadoop Architecture

Main node types in Hadoop

• Hadoop Distributed File System (HDFS) nodes
  – NameNode
  – DataNode

• MapReduce nodes
  – JobTracker
  – TaskTracker


Page 17: Big Data - A brief introduction


HDFS - NameNode

• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file-to-block-to-location mappings in the namespace (manages the file system namespace and metadata)
• Don't use inexpensive commodity hardware for this node
• Large memory requirements: keeps the entire file system metadata in memory


Page 18: Big Data - A brief introduction


HDFS - DataNode

• Many per Hadoop cluster
• Stores blocks on local disk
• Manages data blocks and serves them to clients
• Checksums on blocks => a fault-tolerant data store
• Clients connect to DataNodes for I/O
• Sends frequent heartbeats (“hey, I'm alive” pings, by default every 3 seconds) to the NameNode
• Sends block reports to the NameNode


Page 19: Big Data - A brief introduction


HDFS Write operation

[Diagram: HDFS write operation — the HDFS client divides File 1 into Blocks 1–3 and contacts the NameNode to write; the NameNode answers with target nodes (e.g. DN1, DN7, DN15), and each block ends up on three DataNodes spread over Racks 1–3]

• DataNodes replicate data blocks, orchestrated by the NameNode
• Default 3 replicas
• Rack-aware system!
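The rack-aware placement above can be sketched as follows. This is a simplified model of HDFS's default policy (first replica on the writer's node, the other two on two different nodes of one remote rack); the rack and node names are hypothetical:

```python
# Simplified sketch of HDFS's default rack-aware replica placement.
# Not Hadoop code; real HDFS also weighs load and available space.
import random

def place_replicas(racks, writer_node):
    """racks: {rack_name: [node, ...]}; returns 3 chosen nodes."""
    writer_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    first = writer_node                                  # replica 1: local node
    other_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[other_rack], 2)  # replicas 2+3: one remote rack
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2", "dn3"],
         "rack2": ["dn4", "dn5", "dn6"],
         "rack3": ["dn7", "dn8", "dn9"]}
print(place_replicas(racks, "dn1"))  # e.g. ['dn1', 'dn5', 'dn6']
```

Placing two replicas on a remote rack survives the loss of a whole rack while keeping cross-rack write traffic to a single transfer.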

Page 20: Big Data - A brief introduction


HDFS Read operation

[Diagram: HDFS read operation — the HDFS client contacts the NameNode to read; the NameNode answers with the DataNodes holding each block replica, and the client reads Blocks 1–3 directly from DataNodes across Racks 1–3]

Page 21: Big Data - A brief introduction


HDFS 2.0 Features

• NameNode high availability
  – Two redundant NameNodes in an active/passive configuration
  – Manual or automated failover

• NameNode federation
  – Multiple independent NameNodes sharing the same collection of DataNodes
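An active/passive NameNode pair is declared in hdfs-site.xml. The fragment below is a sketch; the nameservice name `mycluster` and the hostnames are placeholders (not from the slides), and a full HA setup also needs shared edit-log storage and, for automated failover, ZooKeeper:

```xml
<!-- hdfs-site.xml sketch: two NameNodes behind one logical nameservice -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- true enables automated failover (requires ZooKeeper) -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Clients then address the nameservice (`mycluster`) rather than a specific host, so a failover is transparent to them.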


Page 22: Big Data - A brief introduction


Hadoop MapReduce

• Moves the code to the data
• JobTracker
  – Master service to monitor jobs
• TaskTracker
  – Multiple services to run tasks
  – Runs on the same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts


Page 23: Big Data - A brief introduction


MapReduce JobTracker

• One per Hadoop cluster
• Receives job requests submitted by clients
• Schedules jobs in FIFO order
• Schedules & monitors MapReduce jobs on TaskTrackers
• Issues task attempts to TaskTrackers
• Single point of failure for MapReduce


Page 24: Big Data - A brief introduction


MapReduce TaskTracker

• Runs on the same node as the DataNode service
• Many per Hadoop cluster
• Sends heartbeats and task reports to the JobTracker
• Executes MapReduce operations
• Configurable number of map and reduce slots
• Runs map and reduce task attempts


Page 25: Big Data - A brief introduction


HDFS Architecture: Master & Slave

[Diagram: HDFS master/slave architecture — the HDFS client talks to the master services (NameNode, Secondary NameNode, JobTracker); each slave node runs a DataNode plus a TaskTracker, together providing MapReduce distributed data processing]

Note:
• Hadoop 1.0 has only one NameNode
• Hadoop 2.0 has an active & a passive NameNode

Page 26: Big Data - A brief introduction


How MapReduce works (1/3)

• MapReduce is a combination of a Map and a Reduce procedure:
  – Map procedure: performs filtering and sorting of the data
  – Reduce procedure: performs summary operations


Page 27: Big Data - A brief introduction


How MapReduce works (2/3)

Scenario: get the sum of sales grouped by ZipCode.

[Diagram: input records (CustId, ZipCode, Amount) are spread over DataNodes 1–3; two map jobs process them in parallel]

Map phase output, as (ZipCode, Amount) pairs:

6654FD €75, 1534CD €60, 5734CD €30, 1184AN €15, 5734CD €65, 6654FD €22,
5734CD €15, 4484AN €10, 1534CD €95, 4484AN €55, 4484AN €25, 1184AN €15

Page 28: Big Data - A brief introduction


How MapReduce works (3/3)

Scenario: get the sum of sales grouped by ZipCode.

[Diagram: shuffle phase — the mapped (ZipCode, Amount) pairs are sorted and grouped by ZipCode; three reducers then sum each group]

Reducer output:

5734CD €110, 1534CD €155, 4484AN €90, 1184AN €30, 6654FD €97
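The whole scenario can be replayed in plain Python (an in-memory model, not Hadoop code), starting from the map output listed on the previous slide:

```python
# The "sum of sales grouped by ZipCode" scenario: shuffle groups the
# mapped (ZipCode, Amount) pairs by key, then each reducer sums a group.
from collections import defaultdict

# Map output: (ZipCode, Amount in EUR) pairs.
mapped = [("6654FD", 75), ("1534CD", 60), ("5734CD", 30), ("1184AN", 15),
          ("5734CD", 65), ("6654FD", 22), ("5734CD", 15), ("4484AN", 10),
          ("1534CD", 95), ("4484AN", 55), ("4484AN", 25), ("1184AN", 15)]

# Shuffle phase: group the pairs by ZipCode.
groups = defaultdict(list)
for zipcode, amount in mapped:
    groups[zipcode].append(amount)

# Reduce phase: sum each group (in Hadoop, one reducer call per key).
totals = {zipcode: sum(amounts) for zipcode, amounts in groups.items()}
print(totals)
# {'6654FD': 97, '1534CD': 155, '5734CD': 110, '1184AN': 30, '4484AN': 90}
```

The totals match the reducer output on the slide; in a real cluster the groups would be partitioned across the three reducers instead of computed in one process.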

Page 29: Big Data - A brief introduction


The Hadoop Ecosystem

• Data access:
  – Hive
  – Pig
  – Mahout

• Tools:
  – Sqoop
  – Flume


Page 30: Big Data - A brief introduction


Hive

• Declarative language
• Allows users to write SQL-like queries (not ANSI SQL)
• Analytics area
• Structures data
• Data lives in tables
• Tables persist

[Diagram: Hive sits on top of MapReduce and HDFS]

Page 31: Big Data - A brief introduction


PIG

• Procedural language (Pig Latin)
• Generates one or more MapReduce jobs
• Efficient computing
• Handles structured and unstructured data
• Data lives in variables
• Variables may not retain values

[Diagram: Pig joins Hive on top of MapReduce and HDFS]

Page 32: Big Data - A brief introduction


Mahout

• Library for scalable machine learning (written in Java)
• Classification, clustering, pattern mining, etc.

[Diagram: Mahout joins Hive and Pig on top of MapReduce and HDFS]

Page 33: Big Data - A brief introduction


Sqoop

• Transfers data to and from a relational database
• Data compression is a supported feature

[Diagram: Sqoop joins the stack on top of MapReduce and HDFS]

Page 34: Big Data - A brief introduction


Flume

• An application for moving streaming data into a Hadoop cluster
• A massively distributable framework for event-based data

[Diagram: the full stack — Hive, Pig, Mahout, Sqoop, and Flume on top of MapReduce and HDFS]