Big Data Hadoop (Overview)


BIG DATA

What is Big Data?

Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Where does it come from?

Social media
Business transactions
Smart phones
Vehicles
Satellites
Log files
Smart devices
Sensors

Fact Check

3 V’s of Big Data

Velocity: the rate at which data is generated and changed.
Variety: the number of different data sources and types.
Volume: the sheer quantity of data generated and stored.

Importance of Big Data

The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

Determining root causes of failures, issues and defects in near-real time.
Generating coupons at the point of sale based on the customer’s buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.

Applications of Big Data

A 360-degree view of a customer
Internet of Things
Healthcare
Information security
E-commerce
Data warehouse optimization

Emergence of Hadoop

Hadoop began as part of Nutch, an open-source search engine project and the brainchild of Doug Cutting and Mike Cafarella. It aimed at returning web search results faster by distributing data and calculations across different computers so that multiple tasks could be accomplished simultaneously.

In 2006, Cutting joined Yahoo. The Nutch project was divided – the web crawler portion remained as Nutch and the distributed computing and processing portion became Hadoop.

In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.

Importance

Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

Importance (contd.)

Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Hadoop Distributed File System (HDFS)

MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
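To make the two phases concrete, here is a minimal plain-Python sketch of the idea. It does not use the real Hadoop API; the names map_phase and reduce_phase are made up for illustration, the in-memory grouping stands in for Hadoop's shuffle/sort step, and word counting is used as a stand-in dataset.

```python
# Illustrative only: a plain-Python imitation of the two MapReduce phases,
# not the actual Hadoop API. Function and variable names are invented here.
from collections import defaultdict

def map_phase(record):
    """Map job: turn one input record into a list of (key, value) tuples."""
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    """Reduce job: combine all values seen for one key into a smaller result."""
    return key, sum(values)

records = ["big data hadoop", "big data analytics", "hadoop cluster"]

# Map every record, then group the intermediate tuples by key
# (Hadoop performs this grouping in its shuffle/sort step).
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce each group; as the name implies, reduce always runs after map.
results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # {'big': 2, 'data': 2, 'hadoop': 2, 'analytics': 1, 'cluster': 1}
```

In Hadoop itself, the map and reduce logic would be supplied as Mapper and Reducer classes (or as streaming scripts), and the framework would handle splitting the input, grouping by key, and distributing the work across the cluster.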

MapReduce: Example

Let’s look at a simple example. Assume you have three files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city for the various measurement days. Of course we’ve made this example very simple so it’s easy to follow. You can imagine that a real application won’t be quite so simple, as it’s likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all; in fact, no matter how big or small the amount of data you need to analyze, the key principles we’re covering here remain the same. Either way, in this example, city is the key and temperature is the value.

MapReduce: Example (contd.)

File 1
Key         Value
Toronto     20
Whitby      25
New York    22
Rome        32
Toronto     14
Rome        33
New York    18

File 2
Key         Value
Toronto     18
Whitby      22
New York    25
Rome        35
Toronto     22
Rome        38
New York    21

File 3
Key         Value
Toronto     22
Whitby      26
New York    24
Rome        36
Toronto     12
Rome        35
New York    19

Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). 

MapReduce: Example (contd.)

After mapping, each file returns the data shown below, called the mapped data. For each file, the map task emits one (city, highest temperature in that file) pair per city, which is why each city now appears only once per file.

File 1 (mapped)
Key         Value
Toronto     20
Whitby      25
New York    22
Rome        33

File 2 (mapped)
Key         Value
Whitby      22
New York    25
Toronto     22
Rome        38

File 3 (mapped)
Key         Value
Toronto     22
Whitby      26
New York    24
Rome        36

MapReduce: Example (contd.)

After mapping, the reduce phase is performed and produces the final result. For each key, the mapped values from the three files are compared to find the highest temperature.

The final result will be as follows:

Key         Value
Toronto     22
Whitby      26
New York    25
Rome        38
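As a rough sanity check, the sketch below walks through the same example in plain Python (again, not Hadoop code; the helper name map_file is invented here). The map step reduces each file to one (city, local maximum) pair per city, matching the mapped data above, and the reduce step takes the overall maximum for each city across the three files.

```python
# Illustrative plain-Python walk-through of the temperature example above;
# this is not Hadoop code, just the same logic in miniature.
from collections import defaultdict

# (city, temperature) records hard-coded from File 1, File 2 and File 3 above.
files = [
    [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
     ("Toronto", 14), ("Rome", 33), ("New York", 18)],
    [("Toronto", 18), ("Whitby", 22), ("New York", 25), ("Rome", 35),
     ("Toronto", 22), ("Rome", 38), ("New York", 21)],
    [("Toronto", 22), ("Whitby", 26), ("New York", 24), ("Rome", 36),
     ("Toronto", 12), ("Rome", 35), ("New York", 19)],
]

def map_file(records):
    """Map step: reduce one file to a (city, local max temperature) pair per city."""
    local_max = {}
    for city, temp in records:
        local_max[city] = max(temp, local_max.get(city, temp))
    return local_max.items()

# Group the mapped output by city, then reduce to the overall maximum.
grouped = defaultdict(list)
for f in files:
    for city, temp in map_file(f):
        grouped[city].append(temp)

result = {city: max(temps) for city, temps in grouped.items()}
print(result)  # {'Toronto': 22, 'Whitby': 26, 'New York': 25, 'Rome': 38}
```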

Conclusion

Big data is changing the way people within organizations work together. It is creating a culture in which business and IT leaders must join forces to realize value from all data. Insights from big data can enable all employees to make better decisions—deepening customer engagement, optimizing operations, preventing threats and fraud, and capitalizing on new sources of revenue. But escalating demand for insights requires a fundamentally new approach to architecture, tools and practices.

Competitive advantage
Better decision making
Value of data