Big Data Workshop, February 2015

Big Data Workshop, by Avinash Ramineni and Shekhar Vemuri


TRANSCRIPT

Page 1: Bigdata workshop  february 2015

Big Data Workshop

By Avinash Ramineni and Shekhar Vemuri

Page 2: Bigdata workshop  february 2015

Big Data

• Big Data

– More than a single machine can process

– 100s of GB, TB, even PB!

• Why the buzz now?

– Low Storage Costs

– Increased Computing power

– Potential for Tremendous Insights

– Data for Competitive Advantage

– Velocity, Variety, Volume

• Structured

• Unstructured - Text, Images, Video, Voice, etc

Page 3: Bigdata workshop  february 2015

Volume, Velocity & Variety

Page 4: Bigdata workshop  february 2015

Big Data Use Cases

Page 5: Bigdata workshop  february 2015
Page 6: Bigdata workshop  february 2015

Traditional Large-Scale Computation

• Computation has been processor- and memory-bound

– Relatively small amounts of data
– Faster processors, more RAM

• Distributed systems

– Multiple machines for a single job
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– Difficult to deal with partial failures

• Data is stored on a SAN
• Data is copied to compute nodes

Page 7: Bigdata workshop  february 2015

Data Becomes the Bottleneck

• Moore's Law has held firm for over 40 years

– Processing power doubles every two years
– Processing speed is no longer the problem

• Getting the data to the processors becomes the bottleneck

• Quick math

– Typical disk data transfer rate: 75 MB/s
– Time taken to transfer 100 GB of data to the processor: 100 GB ÷ 75 MB/s ≈ 1,365 s, roughly 22 minutes!

• Assuming sustained reads
• Actual time will be worse, since most servers have less than 100 GB of RAM

Page 8: Bigdata workshop  february 2015

Need for a New Approach

• System must support partial failure

– Failure of a component should result in a graceful degradation of application performance

• If a component of the system fails, the workload needs to be picked up by other components
– Failure should not result in the loss of any data

• Components should be able to rejoin upon recovery
• Component failures during the execution of a job should not affect the outcome of the job

• Scalability

– Increasing resources should support a proportional increase in load capacity

Page 9: Bigdata workshop  february 2015

Traditional vs Hadoop Approach

• Traditional

– Application server: where the application resides
– Database server: where the data resides
– Data is retrieved from the database and sent to the application for processing
– Can become network- and I/O-intensive for GBs of data

• Hadoop

– Distribute data onto multiple commodity machines (data nodes)
– Send the application to the data nodes instead and process the data in parallel

Page 10: Bigdata workshop  february 2015

What is Hadoop

• Hadoop is a framework for distributed data processing with a distributed filesystem

• Consists of two core components

– The Hadoop Distributed File System (HDFS)

– MapReduce

• Map and then Reduce

• A set of machines running HDFS and MapReduce is known as a Hadoop Cluster

• Individual machines are known as nodes

Page 11: Bigdata workshop  february 2015

Core Hadoop Concepts

• Data is distributed as it is stored into the system

• Nodes talk to each other as little as possible

– Shared-nothing architecture

• Computation happens where the data is stored.

• Data is replicated multiple times on the system for increased availability and reliability

• Fault Tolerance

Page 12: Bigdata workshop  february 2015

HDFS - 1

• HDFS is a file system written in Java

– Sits on top of a native filesystem
• Such as ext3, ext4, or xfs

• Designed for storing very large files with streaming data access patterns

• Write-once, read-many-times pattern

• No Random writes to files are allowed

– High Latency, High Throughput

– Provides redundant storage by replicating

– Not recommended for lots of small files or for low-latency requirements

Page 13: Bigdata workshop  february 2015

HDFS - 2

• Files are split into blocks of size 64 MB or 128 MB

• Data is distributed across machines at load time

• Blocks are replicated across multiple machines, known as DataNodes

– Default replication is 3 times

• Two types of nodes in an HDFS cluster (master-worker pattern)

– NameNode (master)
• Keeps track of which blocks make up a file and where those blocks are located

– DataNodes (workers)
• Hold the actual blocks
• Stored as standard files in a set of directories specified in the configuration

Page 14: Bigdata workshop  february 2015

HDFS

[Diagram: a 150 MB file, Myfile.txt, is split into Block1, Block2, and Block3 and replicated across DataNodes (DataNode 1: Block1, Block2, Block3; DataNode 2: Block1, Block2; DataNode 3: Block1, Block3; DataNode 4: Block2, Block3). The NameNode records the mapping "Myfile.txt: block1, block2, block3"; a Secondary NameNode stands alongside it.]

Page 15: Bigdata workshop  february 2015

HDFS - 3

• HDFS Daemons

– NameNode

• Keeps track of which blocks make up a file and where those blocks are located

• Cluster is not accessible without a NameNode

– Secondary NameNode

• Takes care of housekeeping activities for the NameNode (not a backup)

– DataNode

• Regularly report their status to the NameNode via heartbeats

• Send a block report to the NameNode every hour, or at startup

Page 16: Bigdata workshop  february 2015

HDFS - 4

• Blocks are replicated across multiple machines, known as DataNodes

• Read and write HDFS files via the Java API (see the sketch below)

• Access to HDFS from the command line is achieved with the hadoop fs command

– File system commands: ls, cat, put, get, rm
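
As a rough illustration of the Java API mentioned above, here is a minimal sketch that writes a file to HDFS and reads it back through org.apache.hadoop.fs.FileSystem; the path is a made-up placeholder and the configuration is assumed to come from the cluster's core-site.xml on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS and other settings from the Hadoop configuration
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/training/example.txt");  // hypothetical HDFS path

            // Write once
            try (FSDataOutputStream out = fs.create(path)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many times
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                System.out.println(in.readLine());
            }
        }
    }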

Page 17: Bigdata workshop  february 2015

hadoop fs
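
This and the next two slides originally showed terminal screenshots; as a stand-in, a few representative hadoop fs commands (the local and HDFS paths are placeholders):

    hadoop fs -ls /user/training
    hadoop fs -put purchases.txt /user/training/purchases.txt   # local file -> HDFS
    hadoop fs -cat /user/training/purchases.txt
    hadoop fs -get /user/training/purchases.txt localcopy.txt   # HDFS -> local file
    hadoop fs -rm /user/training/purchases.txt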

"

Page 18: Bigdata workshop  february 2015

hadoop fs -get

"

Page 19: Bigdata workshop  february 2015

hadoop fs -put

"

Page 20: Bigdata workshop  february 2015

Hands On: Using HDFS

Page 21: Bigdata workshop  february 2015

MapReduce

• OK, you have data in HDFS; what next?

• Processing via MapReduce

• Parallel processing for large amounts of data

• Map function, Reduce function

• Data Locality

• Code is shipped to the node closest to the data

• Java, Python, Ruby, Scala…

Page 22: Bigdata workshop  february 2015

MapReduce Features

• Automatic parallelization and distribution

• Fault-tolerance (Speculative Execution)

• Status and monitoring tools

• Clean abstraction for programmers

– MapReduce programs are usually written in Java (other languages can be used via Hadoop Streaming)

• Abstracts all the housekeeping activities away from developers

Page 23: Bigdata workshop  february 2015

WordCount MapReduce
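
The code on this slide is an image in the transcript; below is a minimal sketch of the classic WordCount Mapper and Reducer using the org.apache.hadoop.mapreduce API. It follows the standard example, so details may differ from the code shown in the workshop.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Mapper: for every input line, emit (word, 1) for each word
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reducer: sum the 1s emitted for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }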

Page 24: Bigdata workshop  february 2015

Hadoop Daemons

• Hadoop consists of five separate daemons

• NameNode

• Secondary NameNode

• DataNode

• JobTracker

– Manages MapReduce jobs, distributes individual tasks to machines

• TaskTracker

– Instantiates and monitors individual Map and Reduce tasks

Page 25: Bigdata workshop  february 2015

MapReduce

Page 26: Bigdata workshop  february 2015

MapReduce Classes

Page 27: Bigdata workshop  february 2015

Hadoop Cluster

[Diagram: a Hadoop cluster with a master node running the JobTracker (MapReduce) and the NameNode (HDFS), and worker processes on nodes 1, 2, 3 … n, each running a TaskTracker and a DataNode.]

Page 28: Bigdata workshop  february 2015

MapReduce

• MR1

• only MapReduce

• a.k.a. batch mode

• MR2

• YARN-based

• MR is one implementation on top of YARN

• batch, interactive, online, streaming…

Page 29: Bigdata workshop  february 2015

Hadoop 2.0

Page 30: Bigdata workshop  february 2015

[Diagram: in the Hadoop 1 stack, MapReduce (JobTracker, TaskTracker) runs directly on HDFS (NameNode, DataNode); in the Hadoop 2 stack, YARN (ResourceManager, NodeManager, ApplicationMaster, Containers) sits between HDFS and MapReduce, alongside other processing frameworks.]

Page 31: Bigdata workshop  february 2015

MapReduce

Page 32: Bigdata workshop  february 2015

Hands On: Running a MR Job

Page 33: Bigdata workshop  february 2015

How do you write MapReduce?

• MapReduce is a basic building block, a programming approach for solving larger problems

• In most real-world applications, multiple MR jobs are chained together

• Processing pipeline: the output of one job is the input to another (a sketch follows below)
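
As a hedged sketch of such a pipeline, the driver below runs two jobs back to back, feeding the first job's output directory into the second. It reuses the WordCount classes sketched earlier; the second job's mapper and reducer are left out for brevity and all class names and paths are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobsDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input        = new Path(args[0]);
        Path intermediate = new Path(args[1]);  // output of job 1, input of job 2
        Path output       = new Path(args[2]);

        // Job 1: the WordCount sketched earlier
        Job first = Job.getInstance(conf, "word count");
        first.setJarByClass(ChainedJobsDriver.class);
        first.setMapperClass(WordCount.TokenizerMapper.class);
        first.setReducerClass(WordCount.IntSumReducer.class);
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
          System.exit(1);  // stop the pipeline if the first stage fails
        }

        // Job 2: consumes the intermediate directory; its mapper and reducer
        // (for example a "top N words" step) are omitted here for brevity
        Job second = Job.getInstance(conf, "second stage");
        second.setJarByClass(ChainedJobsDriver.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
      }
    }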


Page 35: Bigdata workshop  february 2015

Anatomy of Map Reduce Job (WordCount)
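
A hedged sketch of the driver that ties the WordCount Mapper and Reducer together into a complete job; input and output paths come from the command line, and the class names follow the earlier sketch rather than the workshop's exact code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        // args[0] = HDFS input path, args[1] = HDFS output path (must not already exist)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /user/training/input /user/training/output (jar name and paths are placeholders).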

Page 36: Bigdata workshop  february 2015

Hands On: Writing a MR Program

Page 37: Bigdata workshop  february 2015

Hadoop Ecosystem

• Rich Ecosystem around Hadoop

• We have used many:

• Sqoop, Flume, Oozie, Mahout, Hive, Pig, Cascading

• Cascading, Lingual

• More out there!

• Impala, Tez, Drill, Stinger, Shark, Spark, Storm…….

Page 38: Bigdata workshop  february 2015

Big Data Roles

Data Engineer vs Data Scientist vs Data Analyst

Page 39: Bigdata workshop  february 2015

Hive and Pig

• MapReduce code is typically written in Java
• Requires:

– A good Java programmer
– Who understands MapReduce
– Who understands the problem
– Who will be available to maintain and update the code in the future as requirements change

Page 40: Bigdata workshop  february 2015

Hive and Pig

• Many organizations have only a few developers who can write good MapReduce code

• Many other people want to analyze data

– Business analysts
– Data scientists
– Statisticians
– Data analysts

• Need a higher-level abstraction on top of MapReduce

– Ability to query the data without needing to know MapReduce intimately
– Hive and Pig address these needs

Page 41: Bigdata workshop  february 2015

Hive

Page 42: Bigdata workshop  february 2015

Hive

• Abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
• Uses the HiveQL language
– Very similar to SQL

• The Hive interpreter runs on a client machine
– Turns HiveQL queries into MapReduce jobs
– Submits those jobs to the cluster

Page 43: Bigdata workshop  february 2015

Hive Data Model

• Hive ‘layers’ table definitions on top of data in HDFS

• Tables

– Typed columns

• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• STRING
• BINARY
• BOOLEAN, TIMESTAMP
• ARRAY, MAP, STRUCT
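
To make the type list concrete, a hedged HiveQL sketch of a table layered over a delimited file in HDFS; the table and column names are invented for illustration.

    CREATE TABLE page_views (
      user_id     BIGINT,
      url         STRING,
      duration_ms INT,
      properties  MAP<STRING, STRING>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
    STORED AS TEXTFILE;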

Page 44: Bigdata workshop  february 2015

Hive Metastore

• Hive's Metastore is a database containing table definitions and other metadata

– By default, stored locally on the client machine in a Derby database
– Usually MySQL if the metastore needs to be shared across multiple people
– Table definitions are layered on top of data in HDFS

Page 45: Bigdata workshop  february 2015

Hive Examples
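
The example slides are screenshots in the original deck; in their spirit, a couple of hedged HiveQL statements against the hypothetical page_views table defined earlier.

    -- Load a file already in HDFS into the table's storage location
    LOAD DATA INPATH '/user/training/page_views.tsv' INTO TABLE page_views;

    -- Ordinary SQL-style query; the Hive interpreter turns it into MapReduce jobs
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;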

Page 46: Bigdata workshop  february 2015

Hive Examples

Page 47: Bigdata workshop  february 2015

Hive Examples

Page 48: Bigdata workshop  february 2015

Hive Limitations

• Not all ‘standard’ SQL is supported

– Subqueries are only supported in the FROM clause
• No correlated subqueries

– No support for UPDATE or DELETE
– No support for INSERTing single rows
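
For example, a derived table in the FROM clause is fine (a hedged sketch against the hypothetical page_views table), whereas subqueries elsewhere, and correlated subqueries, are not accepted.

    -- Allowed: subquery used as a derived table in the FROM clause
    SELECT t.url, t.views
    FROM (
      SELECT url, COUNT(*) AS views
      FROM page_views
      GROUP BY url
    ) t
    WHERE t.views > 100;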

Page 49: Bigdata workshop  february 2015

Hands-On: Hive

Page 50: Bigdata workshop  february 2015

Apache Pig

Page 51: Bigdata workshop  february 2015

Pig

• Originally created at Yahoo
• High-level platform for data analysis and processing on Hadoop
• Alternative to writing low-level MapReduce code
• Relatively simple syntax
• Features

– Joining datasets
– Grouping data
– Loading non-delimited data
– Creation of user-defined functions written in Java
– Relational operations
– Support for custom functions and data formats
– Complex data structures

Page 52: Bigdata workshop  february 2015

Pig

• No shared metadata, unlike Hive
• Installation requires no modification to the cluster
• Components of Pig

– Data flow language: Pig Latin
• Scripting language for writing data flows

– Interactive shell: the Grunt shell
• Type and execute Pig Latin statements
• Use Pig interactively

– Pig interpreter
• Interprets and executes the Pig Latin
• Runs on the client machine

Page 53: Bigdata workshop  february 2015

Grunt Shell
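
The Grunt screenshot is an image in the original; a hedged sketch of what an interactive session might look like, with made-up file and field names.

    grunt> users  = LOAD '/user/training/users.csv' USING PigStorage(',') AS (name:chararray, age:int);
    grunt> adults = FILTER users BY age >= 18;
    grunt> DESCRIBE adults;
    grunt> DUMP adults;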

Page 54: Bigdata workshop  february 2015

Pig Interpreter

1. Preprocess and parse Pig Latin
2. Check data types
3. Make optimizations
4. Plan execution
5. Generate MapReduce jobs
6. Submit jobs to Hadoop
7. Monitor progress

Page 55: Bigdata workshop  february 2015

Pig Concepts

• A single element of data is an atom

• A collection of atoms – such as a row or a partial row – is a tuple

• Tuples are collected together into bags

• A Pig Latin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has

Page 56: Bigdata workshop  february 2015

Sample Pig Script
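
The sample script is shown as an image in the deck; a hedged Pig Latin sketch in the same spirit, loading a dataset into a bag, grouping it, and storing the result back into HDFS (paths and field names are assumptions).

    -- Load a comma-delimited web log into a bag of tuples
    logs    = LOAD '/user/training/weblogs.csv' USING PigStorage(',')
              AS (user:chararray, url:chararray, time_spent:int);

    -- Group the tuples by url and total the time spent per group
    by_url  = GROUP logs BY url;
    totals  = FOREACH by_url GENERATE group AS url, SUM(logs.time_spent) AS total_time;

    -- Keep only the busiest pages and write the result back to HDFS
    busy    = FILTER totals BY total_time > 1000;
    STORE busy INTO '/user/training/busy_pages';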

Page 57: Bigdata workshop  february 2015

Sample Pig Script

Page 58: Bigdata workshop  february 2015

Hands-On: Pig Exercise

Page 59: Bigdata workshop  february 2015

References

• Hadoop: The Definitive Guide, by Tom White (a must-read!)

• http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

• http://hortonworks.com/hadoop/yarn/

Page 60: Bigdata workshop  february 2015

Questions?

[email protected]@clairvoyantsoft.com