TRANSCRIPT
HADOOP (A BIG DATA INITIATIVE) - Mansi Mehra
AGENDA
What is the problem? What is the solution?
HDFS, MapReduce, HBase, Pig, Hive, ZooKeeper, Spark
DEFINING THE PROBLEM – 3V
Volume - Lots and lots of data. Datasets are so large and complex that relational databases cannot handle them. Challenges: capture, curation, storage, search, sharing, transfer, analysis and visualization.
DEFINING THE PROBLEM – 3V (CONTD.)
Velocity - Huge amounts of data generated at incredible speed. The NYSE generates about 1 TB of new trade data per day; AT&T's anonymized Call Detail Records (CDRs) top out at around 1 GB per hour.
Variety - Differently formatted data sets from different sources: Twitter keeps track of tweets, Facebook produces posts-and-likes data, YouTube streams videos.
WHY TRADITIONAL STORAGES DON’T WORK
Unstructured data is exploding; little of the data produced is relational in nature.
No redundancy
High computational cost
Capacity limit for structured data (costly hardware)
Expensive licenses
Data type                                      Nature
XML                                            Semi-structured
Word docs, PDF files etc.                      Unstructured
Email body                                     Unstructured
Data from Enterprise Systems (ERP, CRM etc.)   Structured
HOW DOES HADOOP WORK?
HDFS (HADOOP 1.0 VS 2.0)
HDFS (HADOOP 2.0)
YARN (2.0) - YET ANOTHER RESOURCE NEGOTIATOR
A computing framework for Hadoop. YARN's Resource Manager manages and allocates cluster resources, improving performance and Quality of Service.
MAP REDUCE
Programming model in Java
Works on large amounts of data
Provides redundancy & fault tolerance
Runs the code on each data node
MAP REDUCE (CONTD.)
Steps for MapReduce:
1. Read in lots of data.
2. Map: extract something you care about from each record/line.
3. Shuffle and sort.
4. Reduce: aggregate, summarize, filter or transform.
5. Write results.
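The steps above can be sketched as a toy single-process word count in plain Python (in a real job, the map and reduce functions would run in parallel as Hadoop tasks on the data nodes):

```python
# Toy single-process sketch of the MapReduce steps (word count).
# In a real job, mappers and reducers run in parallel on data nodes.
from collections import defaultdict

def map_phase(lines):
    # Map: extract (word, 1) pairs from each line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle and sort: group all values by key, ordered by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in groups}

lines = ["big data", "big cluster", "data data"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
print(counts)  # {'big': 2, 'cluster': 1, 'data': 3}
```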
HIVE (OVERVIEW)
A data warehouse infrastructure built on top of Hadoop. Compiles SQL queries into MapReduce jobs and runs them on the cluster. Brings structure to unstructured data.
Key building principles:
Structured data with rich data types (structs, lists and maps)
Directly query data in different formats (text/binary) and file formats (flat/sequence)
SQL as a familiar programming tool and for standard analytics
Types of applications: summarization (daily/weekly aggregations), ad hoc analysis, data mining, spam detection and many more.
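To illustrate the compilation idea, the sketch below (plain Python; the `employees` rows and the query are made up for illustration) shows how a HiveQL aggregation such as `SELECT dept, AVG(salary) FROM employees GROUP BY dept` decomposes into map and reduce phases:

```python
# Hypothetical illustration of how Hive lowers
#   SELECT dept, AVG(salary) FROM employees GROUP BY dept
# into map and reduce phases. Data and names are made up.
from collections import defaultdict

employees = [
    {"dept": "eng", "salary": 100},
    {"dept": "eng", "salary": 120},
    {"dept": "ops", "salary": 90},
]

# Map: emit (group-by key, aggregated column) per row.
mapped = [(row["dept"], row["salary"]) for row in employees]

# Shuffle: group values by key (the framework does this in a real job).
groups = defaultdict(list)
for dept, salary in mapped:
    groups[dept].append(salary)

# Reduce: compute the aggregate for each group.
result = {dept: sum(s) / len(s) for dept, s in groups.items()}
print(result)  # {'eng': 110.0, 'ops': 90.0}
```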
PIG (OVERVIEW)
High-level dataflow language with its own syntax (preferable for people with a programming background). A compiler produces sequences of MapReduce programs, and the structure is amenable to substantial parallelization.
Key properties of Pig:
Ease of programming: trivial to achieve parallel execution of simple, parallel data analysis tasks.
Optimization opportunities: the system optimizes execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: users can create their own functions to do special-purpose processing.
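As a rough sketch of Pig's dataflow style, the Python below mimics a short LOAD → FILTER → GROUP → FOREACH pipeline; the Pig Latin shown in the comments, and the sample records, are hypothetical:

```python
# Hypothetical Python analogue of a short Pig dataflow; each step mirrors
# the Pig Latin statement shown in the comment above it. Relation and
# field names are made up.
from itertools import groupby

# records = LOAD 'visits' AS (user, url);
records = [("ann", "/home"), ("bob", "/home"), ("ann", "/docs"),
           ("bob", "/docs"), ("ann", "/home")]

# home_hits = FILTER records BY url == '/home';
home_hits = [r for r in records if r[1] == "/home"]

# by_user = GROUP home_hits BY user;
key = lambda r: r[0]
by_user = groupby(sorted(home_hits, key=key), key=key)

# counts = FOREACH by_user GENERATE group, COUNT(home_hits);
counts = {user: len(list(rows)) for user, rows in by_user}
print(counts)  # {'ann': 2, 'bob': 1}
```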
HBASE (OVERVIEW)
HBase is a distributed, column-oriented data store built on top of HDFS. Data is logically organized into tables, rows and columns.
HDFS is good for batch processing (scans over big files), but not for record lookup, incremental addition of small batches, or updates.
HBase is designed to address these points efficiently:
Fast record lookup
Support for record-level insertion
Support for updates (not in place): updates are done by creating new versions of values.
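The "updates create new versions" point can be sketched as a tiny versioned key-value store in plain Python (the VersionedStore class, row keys and column names are hypothetical, not the HBase API):

```python
# Toy sketch of HBase-style versioned cells: an update never overwrites
# in place -- it appends a new (timestamp, value) version, and a read
# returns the latest one. VersionedStore is hypothetical, not the HBase API.

class VersionedStore:
    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, ts):
        # Record-level insert/update: append a new version of the cell.
        self.cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        # Fast record lookup: return the most recent version.
        return max(self.cells[(row, column)])[1]

store = VersionedStore()
store.put("row1", "info:city", "Pune", ts=1)
store.put("row1", "info:city", "Delhi", ts=2)   # update = new version
print(store.get("row1", "info:city"))           # Delhi
print(len(store.cells[("row1", "info:city")]))  # 2 (both versions kept)
```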
ZOOKEEPER (OVERVIEW)
ZooKeeper is a distributed, open-source coordination service for distributed applications.
It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and group membership and naming.
Coordination services are notoriously hard to get right; they are prone to errors such as race conditions and deadlock.
The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
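One classic primitive-built service is leader election via sequential znodes: each client creates a sequentially numbered node, and the client holding the lowest number is the leader. The toy in-memory sketch below illustrates the idea only; the ZNodeStore class and paths are hypothetical, and real code would use a ZooKeeper client library against a live ensemble:

```python
# Toy in-memory sketch of ZooKeeper-style leader election using
# sequential znodes. ZNodeStore and the paths are hypothetical; a real
# client talks to a ZooKeeper ensemble and also watches for node deletion.

class ZNodeStore:
    """Minimal stand-in for a ZooKeeper namespace."""
    def __init__(self):
        self.nodes = {}   # path -> data
        self.counter = 0  # monotonically increasing sequence number

    def create_sequential(self, prefix, data):
        # ZooKeeper appends a zero-padded sequence number to the name.
        path = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.nodes[path] = data
        return path

    def children(self, prefix):
        return sorted(p for p in self.nodes if p.startswith(prefix))

def elect_leader(store, prefix="/election/node-"):
    # The client owning the znode with the smallest sequence number wins.
    return store.nodes[store.children(prefix)[0]]

store = ZNodeStore()
for client in ["alice", "bob", "carol"]:
    store.create_sequential("/election/node-", client)

print(elect_leader(store))  # alice registered first -> leader
```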
SPARK (OVERVIEW)
Motivation: the MapReduce programming model transforms data flowing from stable storage to stable storage (disk to disk). Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
Iterative algorithms
Interactive data mining
Spark makes working sets a first-class concept to efficiently support these applications.
Goal: provide distributed memory abstractions for clusters to support apps with working sets, while retaining the attractive properties of MapReduce:
Fault tolerance
Data locality
Scalability
Approach: augment the data flow model with “resilient distributed datasets” (RDDs).
SPARK (OVERVIEW CONTD.)
Resilient distributed datasets (RDDs):
Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost.
Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...).
Can be cached across parallel operations.
Parallel operations on RDDs: reduce, collect, count, save, ...
Restricted shared variables: accumulators, broadcast variables.
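The RDD idea above can be sketched in plain Python: an immutable dataset that records its lineage (the transformations that derive it) rather than eagerly materializing results, so a lost partition can be rebuilt by replaying the lineage. The MiniRDD class is a hypothetical toy, not the Spark API (real code would use pyspark):

```python
# Toy sketch of the RDD idea: immutable, transformations recorded as
# lineage, results recomputed from stable storage on demand -- which is
# also how a lost partition would be rebuilt after a failure.
# MiniRDD is hypothetical; real applications use pyspark.

class MiniRDD:
    def __init__(self, source, transforms=()):
        self.source = source          # data in "stable storage"
        self.transforms = transforms  # lineage: ordered transformations

    def map(self, f):
        # Transformations return a NEW RDD; the original is immutable.
        return MiniRDD(self.source, self.transforms + (("map", f),))

    def filter(self, p):
        return MiniRDD(self.source, self.transforms + (("filter", p),))

    def collect(self):
        # Replay the lineage from the source to materialize the result.
        data = list(self.source)
        for kind, fn in self.transforms:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

squares = MiniRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.collect())  # [0, 4, 16]
```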
SPARK (OVERVIEW CONTD.)
A fast MapReduce-like engine that uses in-memory cluster computing. Compatible with the Hadoop storage API, with APIs in Scala, Java and Python. Useful for large datasets and iterative algorithms; up to 40x faster than MapReduce.
Support for:
Spark SQL: Hive on Spark
MLlib: machine learning library
GraphX: graph processing