Big Data with Hadoop - Introduction

Big Data and Hadoop

Tomy Rhymond | Sr. Consultant | HMB Inc. | 614.432.9492

“Torture the data, and it will confess to anything.” – Ronald Coase, Nobel Prize laureate in economics

"The goal is to turn data into information, and information into insight." – Carly Fiorina

"Data are becoming the new raw material of business." – Craig Mundie, Senior Advisor to the CEO at Microsoft.

“In God we trust. All others must bring data.” – W. Edwards Deming, statistician, professor, author, lecturer, and consultant.


Agenda

• Data
• Big Data
• Hadoop
• Microsoft Azure HDInsight
• Hadoop Use Cases
• Demo
  • Configure Hadoop Cluster / Azure Storage
  • C# MapReduce
  • Load and analyze with Hive
  • Use a Pig script to analyze data
  • Excel Power Query

Houston, we have a “Data” problem.

• IDC estimates put the size of the “digital universe” at 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010.
• By 2020, emerging markets will supplant the developed world as the main producer of the world’s data.
• This flood of data is coming from many sources.
  • The New York Stock Exchange generates about 1 terabyte of trade data per day.
  • Facebook hosts approximately one petabyte of storage.
  • The Hadron Collider produces about 15 petabytes of data per year.
  • The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
• Mobile devices and social networks contribute to the exponential growth of the data.

85% Unstructured, 15% Structured

• The data we know best is structured.
• Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and the data is readily searchable.
• Not all the data we collect conforms to a specific, pre-defined data model.
  • It tends to be the human-generated and people-oriented content that does not fit neatly into database tables.
  • 85 percent of business-relevant information originates in unstructured form, primarily text.
• Lack of structure makes compilation a time- and energy-consuming task.
• These data sets are so large and complex that they become difficult to process using on-hand management tools or traditional data processing applications.
• This type of data is being generated by everything around us at all times.
  • Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it.

Data Types

• Relational Data – SQL data
• Unstructured Data – Twitter feed
• Semi-Structured Data – JSON
• Unstructured Data – Amazon review

So What is Big Data?

• Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

• Capturing and managing lots of information; working with many new types of data.

• Exploiting these masses of information and new data types in applications to extract meaningful value from big data.

• The process of applying serious computing to seriously massive and often highly complex sets of information.

• Big data is arriving from multiple sources at an alarming velocity, volume and variety.

• More data leads to more accurate analyses. More accurate analyses may lead to more confident decision making.

Big Data (infographic)

• Volume: scale of data
• Variety: different forms of data
• Velocity: analysis of data
• Veracity: uncertainty of data

• The Hadron Collider generates 1 petabyte of data per year
• An estimated 100 terabytes of data per US company
• IDC estimate: 40 zettabytes of data by 2020
• 500 million tweets per day
• 100 million videos; 600 years of video; 13 hours of video uploaded per minute
• 20 billion network connections by 2016
• The NY Stock Exchange generates 1 terabyte of trade data per day
• Poor data quality costs businesses 600 billion a year
• 30% of data collected by marketers is not usable for real-time decision making
• Poor data across business and the government costs the US economy 3.1 trillion dollars a year
• 1 in 3 leaders don't trust the information they use to make decisions
• 200 billion photos
• Facebook has 1 petabyte of storage
• 1.8 billion smartphones
• An estimated 6 billion people have a cell phone
• Global healthcare data: 150 exabytes, growing by 2.4 exabytes per year
• 2.5 quintillion bytes of data are created each day
• Map → Reduce → Result

The 4 V’s of Big Data

Volume: We currently see exponential growth in data storage, as data is now more than text. There are videos, music and large images on our social media channels. It is very common for enterprise storage systems to hold terabytes or petabytes.

Velocity: Velocity describes the frequency at which data is generated, captured and shared. Recent developments mean that not only consumers but also businesses generate more data in much shorter cycles.

Variety: Today’s data no longer fits into neat, easy to consume structures. New types include content, geo-spatial, hardware data points, location based, log data, machine data, metrics, mobile, physical data points, process, RFID etc.

Veracity: This refers to the uncertainty of the data available. Veracity isn’t just about data quality, it’s about data understandability. Veracity has an impact on confidence in the data.

Big Data vs Traditional Data

             Traditional                  Big Data
Data Size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear

Data Storage

• The storage capacity of hard drives has increased massively over the years.
• On the other hand, the access speeds of the drives have not kept up.
• A drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s.
  • It could read all of its data in about 5 minutes.
• Today one-terabyte drives are the norm, but the transfer rate is around 100 MB/s.
  • It takes more than two and a half hours to read all the data.
  • Writing is even slower.
• The obvious way to reduce the time is to read from multiple disks at once.
  • With 100 disks, each holding one hundredth of the data and working in parallel, we could read all the data in under 2 minutes.
• Move computing to the data rather than bringing the data to the computing.
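The read times quoted on this slide can be checked with a few lines of arithmetic. The sketch below uses only the slide's figures; it assumes 1 TB = 1,000,000 MB and that data is spread evenly across disks with no coordination overhead, and the function name is illustrative.

```python
def read_time_seconds(capacity_mb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Time to read capacity_mb when it is spread evenly over `disks` drives read in parallel."""
    return capacity_mb / disks / rate_mb_per_s

# Figures from the slide; 1 TB is taken as 1,000,000 MB (an assumption).
drive_1990 = read_time_seconds(1370, 4.4)
drive_today = read_time_seconds(1_000_000, 100)
parallel = read_time_seconds(1_000_000, 100, disks=100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"1 TB drive: {drive_today / 3600:.2f} hours")
print(f"100 disks in parallel: {parallel:.0f} seconds")
```

Capacity grew by a factor of roughly 700 while transfer speed grew by a factor of roughly 20, which is why a full scan of a single drive now takes hours instead of minutes.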

Why big data should matter to you

• The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable:
  • cost reductions
  • time reductions
  • new product development and optimized offerings
  • smarter business decision making
• By combining big data and high-powered analytics, it is possible to:
  • Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
  • Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
  • Quickly identify customers who matter the most.
  • Generate retail coupons at the point of sale based on the customer's current and past purchases.

OK, I Got Big Data. Now What?

• The huge influx of data raises many challenges.
• Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information.
• To analyze and extract meaningful value from these massive amounts of data, we need optimal processing power.
• We need parallel processing, which requires many pieces of hardware.
• When we use many pieces of hardware, the chance that one will fail is fairly high.
  • A common way of avoiding data loss is replication.
  • Redundant copies of the data are kept.
• Data analysis tasks need to combine data.
  • The data from one disk may need to be combined with data from 99 other disks.
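To see why failure becomes likely as hardware is added, and why replication helps, here is a minimal sketch. It assumes machines fail independently, and the 1% failure probability and the three replicas are illustrative values, not figures from the slides.

```python
def p_any_failure(n_machines: int, p_single: float) -> float:
    """Probability that at least one of n independent machines fails."""
    return 1 - (1 - p_single) ** n_machines

def p_block_lost(replicas: int, p_single: float) -> float:
    """Probability that every replica of a block is lost (independent failures assumed)."""
    return p_single ** replicas

P_FAIL = 0.01  # hypothetical chance that one machine fails in some time window

for n in (1, 10, 100, 1000):
    print(f"{n:>5} machines: P(at least one failure) = {p_any_failure(n, P_FAIL):.3f}")

print(f"P(block lost with 3 replicas) = {p_block_lost(3, P_FAIL):.6f}")
```

Under these assumptions a failure somewhere in a 100-machine cluster is more likely than not, while losing all copies of a replicated block remains very unlikely.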

Challenges of Big Data

• Information Growth
  • Over 80% of the data in the enterprise is unstructured, and it is growing at a much faster pace than traditional data.
• Processing Power
  • The approach of using a single, expensive, powerful computer to crunch information doesn’t scale for Big Data.
• Physical Storage
  • Capturing and managing all this information can consume enormous resources.
• Data Issues
  • Lack of data mobility, proprietary formats and interoperability obstacles can make working with Big Data complicated.
• Costs
  • Extract, transform and load (ETL) processes for Big Data can be expensive and time consuming.

Hadoop

• Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.
• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
• Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS).

History of Hadoop

• Hadoop is not an acronym; it’s a made-up name.
• It is named after a stuffed yellow elephant belonging to the son of Doug Cutting, the project's creator.
• 2002-2004: Nutch project - a web-scale, open source, crawler-based search engine.
• 2003-2004: Google published papers describing GFS (Google File System) and MapReduce.
• 2005-2006: Implementations of a GFS-style file system and MapReduce were added to Nutch.
• 2006-2008: Yahoo hired Doug Cutting and his team. They spun out the storage and processing parts of Nutch to form Hadoop.
• 2009: Sorted 500 GB in 59 seconds (on 1,400 nodes) and 100 TB in 173 minutes (on 3,400 nodes).

Hadoop Modules

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
• Other related modules:
  • Cassandra - A scalable multi-master database with no single points of failure.
  • HBase - A scalable, distributed database that supports structured data storage for large tables.
  • Pig - A high-level data-flow language and execution framework for parallel computation.
  • Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • ZooKeeper - A high-performance coordination service for distributed applications.

HDFS – Hadoop Distributed File System

• The heart of Hadoop is HDFS.
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
• HDFS is designed on the following assumptions and goals:
  • Hardware failure is the norm rather than the exception.
  • HDFS is designed for batch processing rather than interactive use by users.
  • Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
  • HDFS applications use a write-once-read-many access model. A file, once created, written and closed, need not be changed.
  • A computation requested by an application is much more efficient if it is executed near the data it operates on. In other words, moving computation is cheaper than moving data.
  • HDFS is easily portable from one platform to another.

Hadoop Architecture

• NameNode:
  • The NameNode is the node that stores the filesystem metadata, i.e. which file maps to which block locations and which blocks are stored on which DataNode.
• Secondary NameNode:
  • The NameNode is a single point of failure. The Secondary NameNode periodically checkpoints the NameNode's metadata; it is not a hot standby.
• DataNode:
  • The DataNode is where the actual data resides.
  • All DataNodes send a heartbeat message to the NameNode every 3 seconds to say that they are alive.
  • The DataNodes can talk to each other to rebalance data, move and copy data around, and keep the replication level high.
• Job Tracker / Task Tracker:
  • The primary function of the Job Tracker is resource management (managing the Task Trackers), tracking resource availability and task life cycle management (tracking progress, fault tolerance, etc.).
  • The Task Tracker has the simple function of following the orders of the Job Tracker and periodically updating the Job Tracker with its progress status.
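The NameNode's bookkeeping described above can be pictured with a toy model. This is an illustrative sketch only, not Hadoop's implementation: the class, its methods, the file path, the node names and the 30-second timeout are all hypothetical; only the 3-second heartbeat interval comes from the slide.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 3   # from the slide
DEAD_AFTER_S = 30          # hypothetical timeout chosen for this sketch only

@dataclass
class ToyNameNode:
    file_to_blocks: dict = field(default_factory=dict)   # path -> [block ids]
    block_to_nodes: dict = field(default_factory=dict)   # block id -> {datanode names}
    last_heartbeat: dict = field(default_factory=dict)   # datanode name -> last time seen

    def add_file(self, path, placements):
        """placements maps each block id to the datanodes holding a replica."""
        self.file_to_blocks[path] = list(placements)
        for block, nodes in placements.items():
            self.block_to_nodes[block] = set(nodes)

    def heartbeat(self, node, now):
        """Record that a datanode reported in at time `now` (seconds)."""
        self.last_heartbeat[node] = now

    def live_nodes(self, now):
        return {n for n, t in self.last_heartbeat.items() if now - t <= DEAD_AFTER_S}

    def locate(self, path, now):
        """Return, per block, the live datanodes a client could read from."""
        live = self.live_nodes(now)
        return {b: sorted(self.block_to_nodes[b] & live) for b in self.file_to_blocks[path]}

nn = ToyNameNode()
nn.add_file("/logs/a.txt", {"B1": ["dn1", "dn2", "dn3"], "B2": ["dn2", "dn3", "dn4"]})
for node in ("dn1", "dn2", "dn3", "dn4"):
    nn.heartbeat(node, now=0)
for node in ("dn1", "dn2", "dn3"):      # dn4 stops reporting
    nn.heartbeat(node, now=27)
located = nn.locate("/logs/a.txt", now=40)
print(located)
```

In this sketch the client only ever asks the NameNode where blocks live; the block contents themselves would be read directly from the DataNodes.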

[Figure: HDFS architecture. A Name Node and a Secondary Name Node hold the metadata; Data Nodes on Rack 1 and Rack 2 hold replicated blocks (B1 to B6). Clients send metadata operations to the Name Node and read blocks from the Data Nodes; blocks are replicated between Data Nodes.]

HDFS - InputSplit

• InputFormat
  • Splits the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing.
• RecordReader
  • A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs.
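The relationship between splits and records can be sketched in plain Python. This is a simplified illustration, not Hadoop's InputFormat or RecordReader API: the function names are hypothetical, and the rule used here (a line belongs to the split in which it starts) is a simplification of how line-oriented readers handle records that cross a split boundary.

```python
def make_splits(total_len: int, split_size: int):
    """Cut [0, total_len) into logical (start, end) byte ranges."""
    return [(s, min(s + split_size, total_len)) for s in range(0, total_len, split_size)]

def read_records(data: bytes, start: int, end: int):
    """Yield (byte offset, line) for every line that begins inside [start, end)."""
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # The split begins mid-line; that line belongs to the previous split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        yield pos, data[pos:line_end].decode()
        pos = line_end + 1

data = b"alpha\nbeta\ngamma\ndelta\n"
splits = make_splits(len(data), 8)
records = [rec for (start, end) in splits for rec in read_records(data, start, end)]
print(splits)
print(records)
```

Each line is emitted exactly once even though the split boundaries fall in the middle of lines, and the key of each pair is the line's byte offset in the file.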

MapReduce

• Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
• The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
  • The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
  • The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
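The two tasks can be illustrated with a word count run entirely in memory. This is a sketch of the programming model only, not of Hadoop's API: the function names and input lines are made up for illustration, and the grouping step stands in for the shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: break a line into (word, 1) key/value pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values):
    """Reduce: combine all values for one key into a single tuple."""
    return key, sum(values)

lines = ["big data big insight", "data beats opinion"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could run on different machines, which is what lets the model scale across a cluster.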

[Figure: Hadoop Map-Reduce. Big data flows through Map and Reduce to a result; example key/value pair: (tattoo, 1).]

Hadoop Distributions

• Microsoft Azure HDInsight
• IBM InfoSphere BigInsights
• Hortonworks
• Amazon Elastic MapReduce
• Cloudera CDH

Hadoop Meets The Mainframe

• BMC
  • Control-M for Hadoop is an extension of BMC’s larger Control-M product suite, which was born in 1987 as an automated mainframe job scheduler.
• Compuware
  • APM is an application performance management suite that also spans the arc of enterprise data center computing, from the mainframe to distributed commodity servers.
• Syncsort
  • Syncsort offers Hadoop Connectivity to move data between Hadoop and other platforms including the mainframe, Hadoop Sort Acceleration, and Hadoop ETL for cross-platform data integration.
• Informatica
  • HParser can run its data transformation services as a distributed application on Hadoop’s MapReduce engine.

Azure HDInsight

• HDInsight makes Apache Hadoop available as a service in the cloud.
• Process, analyze, and gain new insights from big data using the power of Apache Hadoop.
• Drive decisions by analyzing unstructured data with Azure HDInsight, a big data solution powered by Apache Hadoop.
• Build and run Hadoop clusters in minutes.
• Analyze results with Power Pivot and Power View in Excel.
• Choose your language, including Java and .NET. Query and transform data through Hive.

Azure HDInsight

• Scale elastically on demand
• Crunch all data – structured, semi-structured, unstructured
• Develop in your favorite language
• No hardware to acquire or maintain
• Connect on-premises Hadoop clusters with the cloud
• Use Excel to visualize your Hadoop data
• Includes NoSQL transactional capabilities

HDInsight Ecosystem

HDFS (Hadoop Distributed File System)

MapReduce (Job Scheduling / Execution)

Pig (Data Flow) Hive (SQL) Sqoop

ETL Tools BI Tools RDBMS

HDInsight

• The combination of Azure Storage and HDInsight provides a complete framework for running MapReduce jobs.
• Creating an HDInsight cluster is quick and easy: log in to Azure, select the number of nodes, name the cluster, and set permissions.
• The cluster is available on demand, and once a job is completed, the cluster can be deleted while the data remains in Azure Storage.
• Use PowerShell to submit MapReduce jobs.
• Use C# to create MapReduce programs.
• Supports Pig Latin, Avro, Sqoop and more.

Use cases

• A 360-degree view of the customer
  • Businesses want to know how to use social media postings to improve revenue.
• Utilities: Predict power consumption
• Marketing: Sentiment analysis
• Customer service: Call monitoring
• Retail and marketing: Mobile data and location-based targeting
• Internet of Things (IoT)
• Big Data Service Refinery

Demo

• Configure HDInsight Cluster
• Create Mapper and Reducer programs using Visual Studio C#
• Upload data to Blob Storage using Azure Storage Explorer
• Run Hadoop job
• Export output to Power Query for Excel
• Hive example with HDInsight
• Pig script with HDInsight


Resources for HDInsight for Windows Azure

Microsoft: HDInsight
• Welcome to Hadoop on Windows Azure - the welcome page for the Developer Preview for the Apache Hadoop-based Services for Windows Azure. https://www.hadooponazure.com/
• Apache Hadoop-based Services for Windows Azure How To Guide - Hadoop on Windows Azure documentation. http://social.technet.microsoft.com/wiki/contents/articles/6206.hadoop-based-services-on-windows-azure-how-to-guide.aspx
• Big Data and Windows Azure - Big Data scenarios that explore what you can build with Windows Azure. http://www.windowsazure.com/en-us/home/scenarios/big-data/

Microsoft: Windows and SQL Database
• Windows Azure home page - scenarios, free trial sign up, development tools and documentation that you need to get started building applications. https://www.windowsazure.com/en-us/
• MSDN SQL - MSDN documentation for SQL Database. http://msdn.microsoft.com/en-us/library/windowsazure/ee336279.aspx
• Management Portal for SQL Database - a lightweight and easy-to-use database management tool for managing SQL Database in the cloud. http://msdn.microsoft.com/en-us/library/windowsazure/gg442309.aspx
• Adventure Works for SQL Database - download page for the SQL Database sample database. http://msftdbprodsamples.codeplex.com/releases/view/37304

Microsoft: Business Intelligence
• Microsoft BI PowerPivot - a powerful data mashup and data exploration tool. http://www.microsoft.com/en-us/bi/PowerPivot.aspx
• SQL Server 2012 Analysis Services - build comprehensive, enterprise-scale analytic solutions that deliver actionable insights. http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/SQL-Server-2012-analysis-services.aspx
• SQL Server 2012 Reporting - a comprehensive, highly scalable solution that enables real-time decision making across the enterprise. http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/SQL-Server-2012-reporting-services.aspx

Apache Hadoop
• Apache Hadoop - software library providing a framework that allows for the distributed processing of large data sets across clusters of computers. http://hadoop.apache.org/
• HDFS - Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. http://hadoop.apache.org/hdfs/
• Map Reduce - a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. http://hadoop.apache.org/mapreduce/

Hortonworks
• Sandbox - a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. http://hortonworks.com/products/hortonworks-sandbox/

About Me

Tomy Rhymond
Sr. Consultant, HMB, Inc.

http://tomyrhymond.wordpress.com/
