BIG DATA: HADOOP
(Transcript of a PowerPoint presentation)
Background
The exponential growth of data – challenges for Google, Yahoo, Amazon & Microsoft in web search and indexing
• The volume of data being made publicly available increases every year. Organizations' success in the future will be dictated to a large extent by their ability to extract value from other organizations' data.
• Volume, Velocity and Variety of data – the three Vs (V3)
• Data Storage & Analysis – The storage capacity of hard drives has increased, but access speeds have not kept up. A 1-terabyte disk is now the norm, while transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data from a single disk; reading zettabytes of data takes far longer.
– The alternative solution: read from multiple disks in parallel.
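The arithmetic above can be checked in a few lines of Python (the 100-disk cluster size is an illustrative assumption, not from the slides):

```python
# Time to read 1 TB from one disk at 100 MB/s,
# versus reading it in parallel from many disks.
TB = 10**12          # bytes
SPEED = 100 * 10**6  # bytes/second per disk

single_disk_hours = TB / SPEED / 3600
print(f"one disk:  {single_disk_hours:.2f} hours")      # 2.78 hours

disks = 100          # assumed number of disks, for illustration
parallel_minutes = TB / (SPEED * disks) / 60
print(f"{disks} disks: {parallel_minutes:.2f} minutes")  # 1.67 minutes
```

This is the core motivation for Hadoop: spread the data over many disks and read them all at once.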
Background
• Data Storage & Analysis – problems with reading from and writing to multiple disks:
– More hardware pieces means more failures, so the probability of data loss is high
– The solution to data loss is replication; RAID, for example, works through replication
– Analysis needs to combine data from many disks, which brings further challenges
– What is needed is a reliable, shared storage and analysis system
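A back-of-the-envelope sketch of why replication addresses data loss (the per-disk failure probability is an illustrative assumption; 3 is HDFS's default replication factor):

```python
# If each disk independently fails with probability p during some window,
# a replicated block is lost only when *every* replica fails.
p = 0.01       # assumed per-disk failure probability
replicas = 3   # HDFS's default replication factor

loss_without_replication = p
loss_with_replication = p ** replicas  # all replicas must fail together

print(loss_without_replication)  # 0.01
print(loss_with_replication)     # about one in a million
```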
• Hello, Hadoop!
• The Nutch project, by Doug Cutting
• Google's GFS and MapReduce: distributed data storage and processing
• Development continued as a Yahoo project
• Doug Cutting released Apache Hadoop as an open-source framework
• "Hadoop" is a made-up name
Hadoop vs. Other Systems
HADOOP
• Best fit for ad hoc analysis
• Write once, read many times
• Variety of data
• Petabytes of data
• Batch analysis
• Dynamic schema
• Data locality
• Data flow is implicit
• Shared-nothing architecture
• Scales out on commodity hardware
• Key/value pairs
RDBMS
• Good for low-latency data access
• Organized/structured data
• Gigabytes of data
• Interactive and batch
• Static schema
• Scaling is expensive
• Table structure
HPC, GRID & VOLUNTEER COMPUTING
• Distribution of work across the cluster
• Data-intensive applications are limited by network bandwidth
• Compute nodes sit idle while waiting for data
• MPI (Message Passing Interface) gives flexibility, but at the cost of complex data-flow handling
• SETI@home is an example of volunteer computing
• Volunteers donate CPU cycles, not bandwidth
• Volunteer computing uses untrusted computers and has no data locality
HADOOP ARCHITECTURE
Hadoop = HDFS + MapReduce (MR)

HDFS is designed for:
• Very large files
• Streaming data access patterns
• Commodity hardware
• High throughput rather than low latency
• Running on top of the existing file system

HDFS is a poor fit for:
• Lots of small files
• Low-latency data access
• Multiple writers
MapReduce (MR):
1) MAP
2) REDUCE
3) The user writes the code for the MR job
4) Automatic parallelization
5) Fault tolerance
• Jobs can be written in Java, Python, etc.
• Housekeeping is built in
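The map/shuffle/reduce flow above can be sketched in plain Python, with no Hadoop cluster required (the word-count job and input lines are illustrative assumptions; in real Hadoop the shuffle, partitioning, and fault tolerance are handled by the framework):

```python
from collections import defaultdict

# MAP: emit (key, value) pairs -- here, (word, 1) for each word in a line.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

# REDUCE: combine all values seen for one key -- here, sum the counts.
def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(lines):
    # SHUFFLE/SORT: group mapper output by key. In Hadoop this
    # "housekeeping" (partitioning, sorting, moving data) is built in.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

counts = run_mapreduce(["hello hadoop", "hello big data"])
print(counts)  # {'big': 1, 'data': 1, 'hadoop': 1, 'hello': 2}
```

The user supplies only `map_fn` and `reduce_fn` (point 3 above); everything between them is the framework's job.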