Big data with hadoop


Post on 16-Jul-2015





Name: Samar Husain
Student no.: 130501020

1- Introduction.
2- What is big data?
3- The characteristics of big data.
4- Handling data.
5- Building successful big data management.
6- Big data applications.
7- History of Hadoop.
8- The core of Apache Hadoop.
9- Workflow and data movement.
10- The Apache Hadoop ecosystem.

Introduction

Big data has become more than just a technical term for scientists, engineers, and other technologists. The term has entered the mainstream on a myriad of fronts, becoming a household word in news, business, health care, and people's personal lives. It has even become synonymous with intelligence gathering and spycraft.

These days, both the rate of data generation and the number of sources that generate data have increased, so the data has become huge: big data. Traditional data was generated by employees. In the era of massive data, it comes from:
- Employees.
- Users.
- Machines.
All of them constantly generate large amounts of data of different types.

What is big data?

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.

Example

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
- Data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Why is big data important?

Big data is not a single technology but a combination of old and new technologies that helps companies gain actionable insight. Big data is the capability to manage a huge volume of data at the right speed and within the right time frame to allow real-time analysis and reaction.

Questions
1- What are examples of machines that generate big data?
2- What are examples of employees that generate big data?
3- What are examples of users that generate big data?


Examples of machines that generate big data

1- GPS data: records the exact position of a device at a specific moment in time. GPS events can be transformed easily into position and movement information. Example: vehicles on a road network rely on the accurate and sophisticated processing of GPS information.

2- Sensor data: the availability of low-cost, intelligent sensors, coupled with the latest 3G and 4G wireless technology, has driven a dramatic increase in the volume of sensor data, and with it the need to extract operational intelligence from the data in real time. Examples include industrial automation plants and smart metering.


Examples of employees that generate big data

Examples of users that generate big data

The characteristics of big data

Big data is typically broken down by three characteristics:
- Volume: how much data there is.
- Velocity: how fast that data is processed.
- Variety: the various types of data.

The variety

Big data combines all kinds of data:
- Structured data.
- Unstructured data.
- Semi-structured data.
This kind of data management requires that companies leverage both their structured and unstructured data.

[Figure: the variety of big data (structured, unstructured, semi-structured). Labels shown: analog data, XML, enterprise systems (CRM, ERP, etc.), data warehouses, audio/video streams, GPS tracking information, databases, EDI, e-mail.]

Looking at semi-structured data

Semi-structured data is a kind of data that falls between structured and unstructured data. Semi-structured data does not necessarily conform to a fixed schema, but it may be self-describing and may have simple label/value pairs.

For example, label/value pairs might include: =Jones, =Jane, and =Sarah. Examples of semi-structured data include EDI, SWIFT, and XML. You can think of them as a sort of payload for processing complex events.

Traditional data & big data

Traditional data:
- Documents.
- Finances.
- Stock records.
- Personal files.

Big data:
- Photographs.
- Audio & video.
- 3D models.
- Simulation.
- Location data.

[Figure: velocity (real time, near real time, batch, and so on) and volume.]

The volume

The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage and a distributed approach to querying.

The velocity

It is not just the velocity of the incoming data that matters: it is possible to stream fast-moving data into bulk storage for later batch processing. The velocity of big data, coupled with its variety, causes a move toward real-time observations, allowing better decision making or quick action.

- The importance lies in the speed of the feedback loop, taking data from input through to decision.

Example
- A commercial from IBM makes the point that you wouldn't cross the road if all you had was a five-minute-old snapshot of traffic location. There are times when you simply won't be able to wait for a report to run or a Hadoop job to complete.

Categories for handling streaming data

Product categories for handling streaming data divide into:
1- Established proprietary products, such as IBM's InfoSphere Streams.
2- Less polished, still emergent open source frameworks originating in the web industry, such as Twitter's Storm and Yahoo S4.

Practice example on big data

Example 1

Example 2

Example 3

- These are good web sites for getting a sense of how much data is generated in the world.

Different approaches to handling data

Different approaches to handling data exist based on whether:
- It is data in motion.
- It is data at rest.

Here's a quick example of each:
- Data at rest would be used by a business analyst to better understand customers' current buying patterns based on all aspects of the customer relationship, including sales, social media data, and customer service interactions.
- Data in motion would be used if a company is able to analyze the quality of its products during the manufacturing process to avoid costly errors.

Managing big data

With big data, it is now possible to virtualize data so that it can be:
- Stored efficiently.
- Stored more cost-effectively, utilizing cloud-based storage.
In addition, improvements in network speed and reliability have removed other physical limitations on managing massive amounts of data at an acceptable pace.

Building successful big data management

Big data management should begin with the cycle of big data management: capture, organize, integrate, analyze, act.
- Data must first be captured.
- Then it is organized and integrated.
- After this phase is successfully implemented, the data is analyzed based on the problem being addressed.
- Finally, management takes action based on the outcome of that analysis.

The importance of big data in our world and our future
- Big data provides a competitive advantage for organizations.
- It helps in making decisions, thus increasing efficiency and profit and reducing loss.
- Its benefits extend to areas including energy, education, health, and huge scientific projects like the Human Genome Project (the study of the entire genetic material of human beings).

Big data applications

Some of the emerging applications are in areas such as:
- Healthcare.
- Manufacturing.
- Management.
- Traffic management.
They rely on huge volumes, velocities, and varieties of data to transform the behavior of a market.

Example 1
In healthcare, a big data application might be able to monitor premature infants to determine when data indicates that intervention is needed.

Example 2
In manufacturing, a big data application can be used to prevent a machine from shutting down during a production run.

Let's summarize some benefits of big data

Some of the benefits of big data are:
- Increased storage capacity = scalable.
- Increased processing power = real-time.
- Availability of data = fault tolerant.
- Lower cost = commodity hardware.

History of Hadoop

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Hadoop is designed to process huge amounts of structured and unstructured data (terabytes to petabytes) and is implemented on racks of commodity servers as a Hadoop cluster.

Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency.

What is Apache Hadoop?

Apache Hadoop:
- Is a set of algorithms.
- Is an open source software framework written in Java.
- Provides distributed storage.
- Provides distributed processing.
- Is built from commodity hardware.
- Replicates files to handle hardware failure.
- Detects failures and recovers from them.

Some Hadoop users
- Facebook.
- IBM.
- Google.
- Yahoo!.
- New York Times.
- Amazon/A9.
- And there are others.


The core of Apache Hadoop

At its core, Hadoop has two primary components:
1- Storage part: the Hadoop Distributed File System (HDFS), which can support petabytes of data.
2- Processing part: MapReduce, which computes results in batch.


Hadoop Distributed File System (HDFS)

HDFS:
- Stores large files, typically in the range of gigabytes to terabytes, across a commodity cluster.
- Is a scalable and portable file system.
- Is written in Java for the Hadoop framework.
- Replicates data across multiple hosts.

With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.

Data nodes can talk to each other:
- To rebalance data.
- To move copies around.
- To keep the replication of data high.
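The replication rule above (three copies, two on one rack and one on another) can be illustrated with a small sketch. This is a simplification that follows the slide's description only; it is not HDFS's real placement policy, and the class name, method name, rack names, and node names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "two replicas on one rack, one on another".
// It does not use or reproduce any Hadoop class.
class ReplicaPlacementSketch {

    // racks maps a rack name to the data nodes on that rack.
    // Returns three node names: two from localRack, one from some other rack.
    static List<String> chooseReplicas(Map<String, List<String>> racks, String localRack) {
        List<String> local = racks.get(localRack);
        if (local == null || local.size() < 2) {
            throw new IllegalArgumentException("local rack needs at least two nodes");
        }
        List<String> chosen = new ArrayList<>();
        chosen.add(local.get(0));
        chosen.add(local.get(1));
        // Take the first node found on any other non-empty rack.
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            if (!rack.getKey().equals(localRack) && !rack.getValue().isEmpty()) {
                chosen.add(rack.getValue().get(0));
                return chosen;
            }
        }
        throw new IllegalStateException("no second rack available");
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new TreeMap<>();
        racks.put("rackA", Arrays.asList("a1", "a2", "a3"));
        racks.put("rackB", Arrays.asList("b1", "b2"));
        System.out.println(chooseReplicas(racks, "rackA"));
    }
}
```

Keeping one copy on a second rack means the data survives the loss of a whole rack, while the two same-rack copies keep most replication traffic local.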


Question
- Why do we need to replicate files in HDFS?

To achieve reliability: if a failure occurs on any node, processing can continue.

HDFS
- HDFS works by breaking large files into smaller pieces called blocks.
- The blocks are stored on data nodes.
- It is the NameNode's responsibility to know which blocks on which data nodes make up the complete file; it keeps track of where data is physically stored.
- The NameNode acts as a traffic cop, managing all access to the files.
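The block bookkeeping described above can be sketched in a few lines. This is a toy model under stated assumptions, not NameNode code: the class and method names are hypothetical, sizes are in arbitrary units, and blocks are spread round-robin purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the block map a NameNode-like component keeps.
class BlockMapSketch {

    // Number of blocks needed to hold fileSize units (rounded up).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Maps each block index to the data nodes holding a copy of it.
    // Copies are assigned round-robin, which is an illustrative choice only.
    static Map<Long, List<String>> assign(long fileSize, long blockSize,
                                          List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("not enough nodes for distinct copies");
        }
        Map<Long, List<String>> blockMap = new LinkedHashMap<>();
        long blocks = blockCount(fileSize, blockSize);
        for (long b = 0; b < blocks; b++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                holders.add(nodes.get((int) ((b + r) % nodes.size())));
            }
            blockMap.put(b, holders);
        }
        return blockMap;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        // A 300-unit file with 128-unit blocks needs three blocks.
        System.out.println(assign(300, 128, nodes, 3));
    }
}
```

Reading the file back means looking up each block index in the map and fetching it from any one of the listed nodes, which is why a single node failure does not stop processing.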


How a Hadoop cluster is mapped to hardware.

The responsibility of the NameNode

The NameNode acts as a traffic cop, managing all access to the files, including:
1- Reads of data blocks on the data nodes.
2- Writes of data blocks on the data nodes.
3- Creation of data blocks on the data nodes.
4- Deletion of data blocks on the data nodes.
5- Replication of data blocks on the data nodes.

NameNode and data nodes:
- They operate in a loosely coupled fashion that allows the cluster elements to behave dynamically, adding (or subtracting) servers as the demand increases (or decreases).


The relationship between NameNode and DataNodes

Are DataNodes also smart?
- Data nodes are not very smart, but the NameNode is.
- The data nodes constantly ask the NameNode whether there is anything for them to do.
- This also tells the NameNode what data nodes are out there and how busy they are.
- Because the NameNode is so critical for correct operation of the cluster, it can and should be replicated to guard against a single point of failure.

MapReduce
- Map: distributes a computational problem across a cluster.
- Reduce: a master node collects the answers to all the sub-problems and combines them.

[Figure: MapReduce, showing a master and three copies.]
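The map and reduce roles can be shown without a cluster. The sketch below is plain Java, not the Hadoop API: each call to the hypothetical `mapChunk` stands in for one worker counting words in its own piece of the input, and the hypothetical `reduce` stands in for the step that combines the partial answers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the map/reduce idea.
class MapReduceIdeaSketch {

    // "Map": solve the sub-problem for one chunk (count its words).
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce": combine the partial answers into one result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = mapChunk("big data");
        Map<String, Integer> second = mapChunk("big hadoop");
        System.out.println(reduce(Arrays.asList(first, second)));
    }
}
```

In a real cluster the chunks would be processed on different nodes at the same time; here they run one after another only to keep the example self-contained.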


An example of an inverted index being created in MapReduce

The following shows the mapper code:

    // Imports for the enclosing source file.
    import java.io.IOException;
    import org.apache.commons.lang.StringUtils; // assumed: Commons Lang, whose split(String) splits on whitespace
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text documentId;
        private Text word = new Text();

        // Called once per task: remember the input file name as the document ID.
        @Override
        protected void setup(Context context) {
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            documentId = new Text(filename);
        }

        // Called once per input line: emit (word, documentId) for every token.
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : StringUtils.split(value.toString())) {
                word.set(token);
                context.write(word, documentId);
            }
        }
    }

Notes on the mapper code:

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {

When you extend the MapReduce Mapper class, you specify the key/value types for your inputs and outputs. You use the MapReduce default InputFormat for your job, which supplies keys as byte offsets into the input file and values as each line in the file. Your map emits Text key/value pairs.

    private Text word = new Text();

To cut down on object creation, you create a single Text object, which you'll reuse.

    private Text documentId;

A Text object to store the document ID (filename) for your input.

- InputFormat decides how the file is going to be broken into smaller pieces for processing, using a function called InputSplit.
- It then assigns a RecordReader to transform the raw data for processing by the map.
- The map then requires two inputs: a key and a value.
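To make the "byte offset as key, line as value" input concrete, here is a sketch of what a line-oriented record reader produces. It is a plain-Java illustration, not Hadoop's RecordReader; the class and method names are hypothetical, and it assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turn file content into (byte offset, line) records.
class LineRecordSketch {

    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        int start = 0; // byte offset where the current line begins
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            if (atEnd || bytes[i] == '\n') {
                // Skip the empty tail that follows a final newline.
                if (!(atEnd && start == i)) {
                    String line = new String(bytes, start, i - start, StandardCharsets.UTF_8);
                    out.put((long) start, line);
                }
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each entry is one (key, value) pair that a map call would receive.
        records("big data\nhadoop\n").forEach(
            (offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```

The mapper above ignores the offset key and only tokenizes the line value, which is why its input key type is LongWritable but the key is never read.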

Workflow and data movement in a small Hadoop cluster (Hadoop MapReduce)

The mapping begins
- Your data is now in a form acceptable to map.
- For each input pair, a distinct instance of map is called to process the data.
- map and reduce need to work together to process your data.
- OutputCollector collects the output from the independent mappers.
- A Reporter function provides information gathered from the map tasks, so that you know when or if the map tasks are complete.
- All this work is being performed on multiple nodes in the Hadoop cluster simultaneously.

Workflow and data movement
- Some of the output may be on a node different from the node where the reducers for that specific output will run.
- A partitioner and a sort gather and shuffle the intermediate results.
- Map tasks deliver their results to a specific partition as inputs to the reduce tasks.
- After all the map tasks are complete, the intermediate results are gathered in the partition, and a shuffle occurs, sorting the output for optimal processing by reduce.

Reduce & combine
- For each output pair, reduce is called to perform its task.
- Reduce gathers its output while all the tasks are processing.
- Reduce can't begin until all the mapping is done, and it isn't finished until all instances are complete.
- The output of reduce is also a key and a value.
- OutputFormat takes the key-value pair and organizes the output for writing to HDFS.
- RecordWriter takes the OutputFormat data and writes it to HDFS.
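The partition, shuffle, sort, and reduce steps can be sketched for the inverted-index example. This is a single-process illustration, not Hadoop code: the class and method names are hypothetical, the partition function borrows the familiar "hash modulo number of reducers" idea but is applied here to `String.hashCode()`, and sorted maps and sets stand in for the sort phase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of partition + shuffle/sort + reduce for an inverted index.
class ShuffleSketch {

    // Picks the reducer for a key; the same key always lands in the same partition.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // pairs holds {word, documentId} outputs from the mappers.
    // Returns one sorted map per reducer: word -> sorted set of document IDs.
    static List<Map<String, TreeSet<String>>> shuffleAndReduce(List<String[]> pairs,
                                                               int numReducers) {
        List<Map<String, TreeSet<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String[] pair : pairs) {
            partitions.get(partition(pair[0], numReducers))
                      .computeIfAbsent(pair[0], k -> new TreeSet<>())
                      .add(pair[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
            new String[] {"big", "doc1"},
            new String[] {"data", "doc1"},
            new String[] {"big", "doc2"});
        System.out.println(shuffleAndReduce(mapOutput, 2));
    }
}
```

Because every pair for a given word goes to the same partition, each reducer sees the complete list of documents for its words and can write its part of the index independently.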

The benefits of MapReduce
- Hadoop MapReduce is the heart of the Hadoop system.
- MR provides the capabilities you need to break big data into manageable chunks.
- MR processes data in parallel on your cluster.
- MR makes the data available for user consumption.
- MapReduce does all this work in a highly resilient, fault-tolerant manner.


Apache Hive
Apache Pig
Apache HBase
SQL-like language and metadata repos...