
Page 1: Big Data - A brief introduction

Big Data

A brief introduction into Big Data & Hadoop

01/01/16 F. v. Noort

Page 2: Big Data - A brief introduction


Big Data – A definition

• Big data usually includes data sets (both structured and unstructured) with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.

• Doug Laney (2001) 3V’s: “data growth challenges and opportunities defined as being three-dimensional, i.e. increasing Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources)”


Page 3: Big Data - A brief introduction


Big Data – A definition

• Gartner (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."


Page 4: Big Data - A brief introduction

Big Data – Characterization

The original 3 V's have been expanded into the following more complete set of characteristics:

• Volume: the quantity of generated & stored data
• Velocity: the speed at which data is generated & processed
• Variety: the type and nature of the data
• Variability: inconsistency of the data set can hamper the processes that handle and manage it
• Veracity: the quality of captured data can vary greatly, affecting accurate analysis
• Complexity: managing data coming from multiple sources can be very challenging; data must be linked, connected, and correlated so users can query and process it effectively

Page 5: Big Data - A brief introduction


Big Data versus Business Intelligence (BI)

• Business Intelligence uses descriptive statistics with high information density data to measure things, detect trends, etc.

• Big data analytics uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets with low information density, in order to reveal relationships and dependencies or to predict outcomes and behaviors.


Page 6: Big Data - A brief introduction


Architecture: Client Server

[Diagram: a single server connected to two rows of clients]

Clients can always overwhelm the system!

Page 7: Big Data - A brief introduction


Architecture: Storage Area Network

[Diagram: clients connect to a central contact point that fronts a bank of servers]

Clients can always overwhelm the system!

Page 8: Big Data - A brief introduction


Architecture: Google File System (GFS)

• Instead of having a giant file storage appliance sitting in the back end, use industry-standard hardware on a large scale

• Drive high performance through the sheer number of components

• Reliability through redundancy & replication

• Computation is done where the data is

[Diagram: a large grid of commodity nodes, each pairing storage with compute]

Page 9: Big Data - A brief introduction


Hadoop
• Based on work from the Google File System + MapReduce papers
• Doug Cutting & Mike Cafarella created their own version: Hadoop (named after the toy elephant of Doug's son)
• Current distributions are based on open-source and vendor work:
  – Apache Hadoop
  – Cloudera – CDH4 w/ Impala
  – Hortonworks
  – MapR
  – AWS
  – Windows Azure HDInsight


Page 10: Big Data - A brief introduction


Why use Hadoop?

• Scalability: scales to petabytes or more
• Fault tolerant
• Faster: parallel data processing
• Better: suited for particular types of Big Data problems
• Open source
• Low cost: can be deployed on commodity hardware


Page 11: Big Data - A brief introduction


Hadoop Core Architecture

The Hadoop core comprises:

• A distributed file system
  – HDFS: Hadoop Distributed File System (based on GFS)
  – File sharing & data protection across physical servers

• A processing paradigm
  – MapReduce: distributed computing across physical servers

[Diagram: MapReduce layered on top of HDFS]

Page 12: Big Data - A brief introduction


HDFS (1/2)

Hadoop Distributed File System
• Written in Java
• Sits on top of the native file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file; large files are split into blocks
• Built on x86 standards
• Lots of flexibility: reference architectures for many types of servers

[Diagram: the Hadoop Distributed File System spanning multiple x86 servers]

Page 13: Big Data - A brief introduction


HDFS (2/2)

• HDFS file blocks
  – 64 MB (default), 128 MB (recommended)
  – One HDFS block is backed by multiple operating system (OS) blocks
• Blocks are replicated (default 3x) to multiple nodes
• Allows for node failure without data loss
• Two key services
  – Master NameNode
  – Many DataNodes
• Checkpoint Node (Secondary NameNode)
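As a back-of-the-envelope illustration of the block model, the sketch below (plain Python, not Hadoop code; the function name and the sample file size are ours) computes how many blocks a file occupies and how much raw capacity the default 3x replication consumes:

```python
# Illustrative sketch: splitting a file into fixed-size HDFS blocks
# and accounting for replication. Not Hadoop code.
def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total MB stored including replicas)."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return num_blocks, file_size_mb * replication

# A 1 GB (1024 MB) file with the 64 MB default block size:
blocks, stored = hdfs_blocks(1024)
print(blocks, stored)  # 16 blocks, 3072 MB of raw capacity
```

Note the last (partial) block only occupies as much OS storage as it actually contains, which this rough sketch ignores.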


Page 14: Big Data - A brief introduction


MapReduce

“Take your task, which is data oriented; chunk it up and distribute it over the network such that every piece of work is done within the network by the machine that holds the piece of data that needs to be worked on.”

MapReduce:
• Processing paradigm that pairs with HDFS
• Distributed computation algorithm that pushes the compute down to each of the x86 servers
• Fault tolerant
• Parallelized (scalable) processing
• Combination of a Map and a Reduce procedure:
  – Map procedure: performs filtering and sorting of the data
  – Reduce procedure: performs summary operations
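The Map/Reduce split above can be sketched in a few lines of plain Python. This is an illustrative in-memory model, not the Hadoop API; the function names are ours:

```python
# Minimal in-memory sketch of the MapReduce idea: map emits (key, value)
# pairs, a shuffle groups them by key, and reduce summarizes each group.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values)           # reduce phase, one call per key
            for key, values in sorted(groups.items())}

# Word count, the canonical example:
lines = ["big data big hadoop", "hadoop big"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=sum)
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In real Hadoop the map calls run on the DataNodes that hold the input blocks, and the shuffle moves data over the network between the map and reduce phases.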


Page 15: Big Data - A brief introduction


Other Hadoop tools/frameworks

• Data access: Hive, Pig, Mahout

• Tools: Sqoop, Flume


Page 16: Big Data - A brief introduction


Hadoop Architecture

Main node types in Hadoop

• Hadoop Distributed File System (HDFS) nodes
  – NameNode
  – DataNode

• MapReduce nodes
  – JobTracker
  – TaskTracker


Page 17: Big Data - A brief introduction


HDFS - NameNode

• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file-to-block-to-location mappings in the namespace (manages the file system namespace and metadata)
• Don't use inexpensive commodity hardware for this node
• Large memory requirements: keeps the entire file system metadata in memory


Page 18: Big Data - A brief introduction


HDFS - DataNode

• Many per Hadoop cluster
• Stores blocks on local disk
• Manages data blocks and serves them to clients
• Checksums on blocks => a fault-tolerant data store
• Clients connect to DataNodes for I/O
• Sends frequent heartbeats (“hey, I'm alive” pings, by default every 3 seconds) to the NameNode
• Sends block reports to the NameNode


Page 19: Big Data - A brief introduction


HDFS Write operation

[Diagram: HDFS write operation — the HDFS client divides File 1 into Blocks 1–3 and contacts the NameNode to write; the NameNode answers with target nodes (e.g. DN1, DN7, DN15), and each block ends up on three DataNodes spread over Racks 1–3]

• DataNodes replicate data blocks, orchestrated by the NameNode
• Default 3 replicas
• Rack-aware system!
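The rack-aware placement above can be sketched as follows. This is a simplified model of HDFS's default policy (first replica on the writer's node, the other two on two different nodes of one remote rack); the rack and node names are hypothetical:

```python
# Simplified sketch of HDFS's default rack-aware replica placement.
# Not Hadoop code; real HDFS also weighs load and available space.
import random

def place_replicas(racks, writer_node):
    """racks: {rack_name: [node, ...]}; returns 3 chosen nodes."""
    writer_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    first = writer_node                                  # replica 1: local node
    other_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[other_rack], 2)  # replicas 2+3: one remote rack
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2", "dn3"],
         "rack2": ["dn4", "dn5", "dn6"],
         "rack3": ["dn7", "dn8", "dn9"]}
print(place_replicas(racks, "dn1"))  # e.g. ['dn1', 'dn5', 'dn6']
```

Placing two replicas on a remote rack survives the loss of a whole rack while keeping cross-rack write traffic to a single transfer.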

Page 20: Big Data - A brief introduction


HDFS Read operation

[Diagram: HDFS read operation — the HDFS client contacts the NameNode to read; the NameNode answers with the DataNodes holding each block replica, and the client reads Blocks 1–3 directly from DataNodes across Racks 1–3]

Page 21: Big Data - A brief introduction


HDFS 2.0 Features

• NameNode high availability
  – Two redundant NameNodes in an active/passive configuration
  – Manual or automated failover

• NameNode federation
  – Multiple independent NameNodes sharing the same collection of DataNodes
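An active/passive NameNode pair is declared in hdfs-site.xml. The fragment below is a sketch; the nameservice name `mycluster` and the hostnames are placeholders (not from the slides), and a full HA setup also needs shared edit-log storage and, for automated failover, ZooKeeper:

```xml
<!-- hdfs-site.xml sketch: two NameNodes behind one logical nameservice -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- true enables automated failover (requires ZooKeeper) -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Clients then address the nameservice (`mycluster`) rather than a specific host, so a failover is transparent to them.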


Page 22: Big Data - A brief introduction


Hadoop MapReduce

• Moves the code to the data
• JobTracker
  – Master service to monitor jobs
• TaskTracker
  – Multiple services to run tasks
  – Runs on the same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts


Page 23: Big Data - A brief introduction


MapReduce JobTracker

• One per Hadoop cluster
• Receives job requests submitted by clients
• Schedules jobs in FIFO order
• Schedules & monitors MapReduce jobs on TaskTrackers
• Issues task attempts to TaskTrackers
• Single point of failure for MapReduce


Page 24: Big Data - A brief introduction


MapReduce TaskTracker

• Runs on the same node as the DataNode service
• Many per Hadoop cluster
• Sends heartbeats and task reports to the JobTracker
• Executes MapReduce operations
• Configurable number of map and reduce slots
• Runs map and reduce task attempts


Page 25: Big Data - A brief introduction


HDFS Architecture: Master & Slave

[Diagram: HDFS master/slave architecture — the HDFS client talks to the master services (NameNode, Secondary NameNode, JobTracker); each slave node runs a DataNode plus a TaskTracker, together providing MapReduce distributed data processing]

Note:
• Hadoop 1.0 has only one NameNode
• Hadoop 2.0 has an active & a passive NameNode

Page 26: Big Data - A brief introduction


How MapReduce works (1/3)

• MapReduce is a combination of a Map and a Reduce procedure:
  – Map procedure: performs filtering and sorting of the data
  – Reduce procedure: performs summary operations


Page 27: Big Data - A brief introduction


How MapReduce works (2/3)

Scenario: get the sum of sales grouped by ZipCode.

[Diagram: input records (CustId, ZipCode, Amount) are spread over DataNodes 1–3; two map jobs process them in parallel]

Map phase output, as (ZipCode, Amount) pairs:

6654FD €75, 1534CD €60, 5734CD €30, 1184AN €15, 5734CD €65, 6654FD €22,
5734CD €15, 4484AN €10, 1534CD €95, 4484AN €55, 4484AN €25, 1184AN €15

Page 28: Big Data - A brief introduction


How MapReduce works (3/3)

Scenario: get the sum of sales grouped by ZipCode.

[Diagram: shuffle phase — the mapped (ZipCode, Amount) pairs are sorted and grouped by ZipCode; three reducers then sum each group]

Reducer output:

5734CD €110, 1534CD €155, 4484AN €90, 1184AN €30, 6654FD €97
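The whole scenario can be replayed in plain Python (an in-memory model, not Hadoop code), starting from the map output listed on the previous slide:

```python
# The "sum of sales grouped by ZipCode" scenario: shuffle groups the
# mapped (ZipCode, Amount) pairs by key, then each reducer sums a group.
from collections import defaultdict

# Map output: (ZipCode, Amount in EUR) pairs.
mapped = [("6654FD", 75), ("1534CD", 60), ("5734CD", 30), ("1184AN", 15),
          ("5734CD", 65), ("6654FD", 22), ("5734CD", 15), ("4484AN", 10),
          ("1534CD", 95), ("4484AN", 55), ("4484AN", 25), ("1184AN", 15)]

# Shuffle phase: group the pairs by ZipCode.
groups = defaultdict(list)
for zipcode, amount in mapped:
    groups[zipcode].append(amount)

# Reduce phase: sum each group (in Hadoop, one reducer call per key).
totals = {zipcode: sum(amounts) for zipcode, amounts in groups.items()}
print(totals)
# {'6654FD': 97, '1534CD': 155, '5734CD': 110, '1184AN': 30, '4484AN': 90}
```

The totals match the reducer output on the slide; in a real cluster the groups would be partitioned across the three reducers instead of computed in one process.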

Page 29: Big Data - A brief introduction


The Hadoop Ecosystem

• Data access:
  – Hive
  – Pig
  – Mahout

• Tools:
  – Sqoop
  – Flume


Page 30: Big Data - A brief introduction


Hive

• Declarative language
• Allows users to write SQL-like queries (not ANSI SQL)
• Analytics area
• Structures data
• Data lives in tables
• Tables persist

[Diagram: Hive sits on top of MapReduce and HDFS]

Page 31: Big Data - A brief introduction


PIG

• Procedural language (Pig Latin)
• Generates one or more MapReduce jobs
• Efficient computing
• Handles structured and unstructured data
• Data lives in variables
• Variables may not retain values

[Diagram: Pig joins Hive on top of MapReduce and HDFS]

Page 32: Big Data - A brief introduction


Mahout

• Library for scalable machine learning (written in Java)
• Classification, clustering, pattern mining, etc.

[Diagram: Mahout joins Hive and Pig on top of MapReduce and HDFS]

Page 33: Big Data - A brief introduction


Sqoop

• Transfers data to and from a relational database
• Data compression is a supported feature

[Diagram: Sqoop joins the stack on top of MapReduce and HDFS]

Page 34: Big Data - A brief introduction


Flume

• An application for moving streaming data into a Hadoop cluster
• A massively distributable framework for event-based data

[Diagram: the full stack — Hive, Pig, Mahout, Sqoop, and Flume on top of MapReduce and HDFS]