Big Data with Hadoop - Introduction

Post on 16-Apr-2017





Big Data
Hadoop
Tomy Rhymond | Sr. Consultant | HMB Inc. | 614.432.9492

"Torture the data, and it will confess to anything." - Ronald Coase, Economics, Nobel Prize Laureate
"The goal is to turn data into information, and information into insight." - Carly Fiorina
"Data are becoming the new raw material of business." - Craig Mundie, Senior Advisor to the CEO at Microsoft
"In God we trust. All others must bring data." - W. Edwards Deming, statistician, professor, author, lecturer, and consultant


Agenda
- Data
- Big Data
- Hadoop
- Microsoft Azure HDInsight
- Hadoop Use Cases
- Demo
  - Configure Hadoop Cluster / Azure Storage
  - C# MapReduce
  - Load and Analyze with Hive
  - Use Pig Script to Analyze Data
  - Excel Power Query


Houston, we have a Data problem.
- IDC estimates put the size of the digital universe at 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010.
- By 2020, emerging markets will supplant the developed world as the main producer of the world's data.
- This flood of data is coming from many sources:
  - The New York Stock Exchange generates about 1 terabyte of trade data per day.
  - Facebook hosts approximately one petabyte of storage.
  - The Large Hadron Collider produces about 15 petabytes of data per year.
  - The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
- Mobile devices and social networks contribute to the exponential growth of data.

1 ZB = 1 billion TB; 1 PB = 1,000 TB.

- Grocery chains know what you are buying every week.
- Restaurants know what you eat.
- Cable companies know what you watch.
- Search engines know what you browse.
- Retailers know what you like to wear and what gadgets you are interested in.
- Social networks know what is on your mind.

You are no longer just User123 in some database; you now have a profile. This gives a 360-degree view.


85% Unstructured, 15% Structured
- The data as we know it is structured.
- Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable.
- Not all the data we collect conforms to a specific, pre-defined data model. It tends to be human-generated, people-oriented content that does not fit neatly into database tables.
- 85 percent of business-relevant information originates in unstructured form, primarily text.
- Lack of structure makes compilation a time- and energy-consuming task.
- This data is so large and complex that it becomes difficult to process using on-hand management tools or traditional data processing applications.
- This type of data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it.

What are some examples of Unstructured Data?
- E-mails
- Reports
- Excel files
- Word documents
- PDF documents
- Images (e.g., .jpg or .gif)
- Media (e.g., .mp3, .wma, or .wmv)
- Text files
- PowerPoint presentations
- Social media
- Internet forums

Data Types
- Relational data: SQL data
- Unstructured data: Twitter feed
- Semi-structured data: JSON
- Unstructured data: Amazon review
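As a rough illustration of the three categories named on this slide, the sketch below holds the same kind of information in each form. The table name, field names, and sample values are invented for the example and do not come from the deck.

```python
import json
import sqlite3

# Structured: a fixed schema is declared up front and queried with SQL.
# (Hypothetical "reviews" table, made up for this illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product TEXT, stars INTEGER)")
conn.execute("INSERT INTO reviews VALUES (?, ?)", ("headphones", 4))
structured_stars = conn.execute(
    "SELECT stars FROM reviews WHERE product = ?", ("headphones",)
).fetchone()[0]

# Semi-structured: JSON describes itself, but fields may be nested or absent,
# so lookups have to tolerate missing keys.
doc = json.loads(
    '{"product": "headphones", "review": {"stars": 4, "tags": ["audio"]}}'
)
semi_structured_stars = doc.get("review", {}).get("stars")

# Unstructured: free text. Nothing marks where the rating is, so any
# "field" has to be inferred from the words themselves.
text = "Bought these headphones last week. Sound is great, I'd give them 4 stars."
mentions_stars = "stars" in text.lower()

print(structured_stars, semi_structured_stars, mentions_stars)
```

The further down the example goes, the more work the reader of the data has to do to recover the same fact, which is the practical difference between the categories.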


So What is Big Data?
- Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
- Capturing and managing lots of information; working with many new types of data.
- Exploiting these masses of information and new data types in applications to extract meaningful value from big data.
- The process of applying serious computing to seriously massive and often highly complex sets of information.
- Big data is arriving from multiple sources at an alarming velocity, volume and variety. More data can lead to more accurate analyses, and more accurate analyses may lead to more confident decision making.


The 4 Vs of Big Data

Volume: We currently see exponential growth in data storage, as data is now more than text. There are videos, music and large images on our social media channels. It is very common for enterprises to have storage systems holding terabytes or petabytes.

Velocity: Velocity describes the frequency at which data is generated, captured and shared. Recent developments mean that not only consumers but also businesses generate more data in much shorter cycles.

Variety: Today's data no longer fits into neat, easy-to-consume structures. New types include content, geo-spatial, hardware data points, location based, log data, machine data, metrics, mobile, physical data points, process, RFID, etc.

Veracity: This refers to the uncertainty of the data available. Veracity isn't just about data quality, it's about data understandability. Veracity has an impact on confidence in the data.

Volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage, and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

Velocity: The increasing rate at which data flows into an organization has followed a similar pattern to that of volume. The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider.

Variety: Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse and doesn't fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these things come ready for integration into an application.

Veracity (Uncertainty): The lack of certainty. A state of having limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome.

Big Data vs Traditional Data

             Traditional                 Big Data
Data Size    Gigabytes                   Petabytes
Access       Interactive and Batch       Batch
Updates      Read and Write many times   Write once, read many times
Structure    Static Schema               Dynamic Schema
Integrity    High                        Low
Scaling      Nonlinear                   Linear


Data Storage
- The storage capacity of hard drives has increased massively over the years.
- On the other hand, the access speeds of drives have not kept up.
- A drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so it could read all its data in about 5 minutes.
- Today, one-terabyte drives are the norm, but the transfer rate is around 100 MB/s.
- It takes more than two and a half hours to read all the data.
- Writing is even slower.
- The obvious way to reduce the time is to read from multiple disks at once.
- With 100 disks, each holding one hundredth of the data and working in parallel, we could read all the data in under 2 minutes.
- Move computing to the data rather than bringing the data to the computing.
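The read times quoted on this slide follow from dividing capacity by transfer rate. The short sketch below reproduces that arithmetic; it assumes 1 TB = 1,000,000 MB and an even split of the data across disks, and the function name is only illustrative.

```python
def read_time_seconds(capacity_mb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Time to read everything when the data is split evenly across `disks`
    drives that are all read in parallel at the same rate."""
    return capacity_mb / disks / rate_mb_per_s


# 1990 drive: 1,370 MB at 4.4 MB/s
drive_1990 = read_time_seconds(1370, 4.4)

# Today's drive: 1 TB (taken as 1,000,000 MB) at 100 MB/s
drive_1tb = read_time_seconds(1_000_000, 100)

# The same 1 TB spread over 100 disks read in parallel
parallel_100 = read_time_seconds(1_000_000, 100, disks=100)

print(f"1990 drive:        {drive_1990 / 60:.1f} minutes")
print(f"1 TB, one disk:    {drive_1tb / 3600:.2f} hours")
print(f"1 TB, 100 disks:   {parallel_100 / 60:.2f} minutes")
```

The single 1 TB drive needs 10,000 seconds (about 2.8 hours), while 100 drives in parallel need 100 seconds. This ignores seek time and coordination overhead, so real numbers would be somewhat worse.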


Why big data should matter to you
The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable:
- cost reductions
- time reductions
- new product development and optimized offerings
- smarter business decision making

By combining big data and high-powered analytics, it is possible to:
- Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
- Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
- Quickly identify the customers who matter the most.
- Generate retail coupons at the point of sale based on the customer's current and past purchases.
- Optimize routes for many thousands of package delivery vehicles while they are on the road.
- Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
- Recalculate entire risk portfolios in minutes.
- Use clickstream analysis and data mining to detect fraudulent behavior.

OK, I Got Big Data, Now What?
- The huge influx of data raises many challenges.
- Analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information.
- To analyze and extract meaningful value from these massive amounts of data, we need optimal processing power.
- We need parallel processing, which requires many pieces of hardware.
- When we use many pieces of hardware, the chance that one will fail is fairly high.
- A common way of avoiding data loss is replication: redundant copies of the data are kept.
- Data analysis tasks need to combine data: the data from one disk may need to be combined with data from 99 other disks.
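The point about hardware failure and replication can be made concrete with a little probability. The sketch below assumes failures are independent and uses a made-up 1% chance that any one machine fails in a given period; the function names, the failure rate, and the replica count of three are illustrative assumptions, not figures from the slides.

```python
def p_any_failure(p_single: float, machines: int) -> float:
    """Probability that at least one of `machines` fails,
    assuming each fails independently with probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** machines


def p_data_lost(p_single: float, replicas: int) -> float:
    """Probability that every machine holding a copy fails,
    again assuming independent failures."""
    return p_single ** replicas


P_SINGLE = 0.01  # hypothetical per-machine failure probability

for n in (1, 10, 100, 1000):
    print(f"{n:>5} machines: P(at least one fails) = {p_any_failure(P_SINGLE, n):.3f}")

for r in (1, 2, 3):
    print(f"{r} replica(s): P(all copies lost) = {p_data_lost(P_SINGLE, r):.6f}")
```

Under these assumptions, a 100-machine cluster sees at least one failure roughly 63% of the time, while keeping three copies of a piece of data pushes the chance of losing all of them down to about one in a million. Real failures are often correlated (for example, a whole rack losing power), so these figures are an optimistic bound.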


Challenges of Big Data
- Information Growth: Over 80% of the data in the enterprise is unstructured, and it is growing at a much faster pace than traditional data.
- Processing Power: The approach of using a single, expensive, powerful computer to crunch information doesn't scale for Big Data.
- Physical Storage: Capturing and managing all this information can consume enormous resources.
- Data Issues: Lack of data mobility, proprietary formats and interoperability obstacles can make working with Big Data complicated.
- Costs: Extract, transform and load (ETL) processes for Big Data can be expensive and time consuming.



- Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.
- It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
- All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
- Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine and the Hadoop Distributed File