lesson 1 - hadoop and big data overview

8/10/2019 Lesson 1 - Hadoop and Big Data Overview

1/57

Hadoop Developer Day

Nicolas Morales, IBM Big [email protected]

@NicolasJMorales


2/57

Big Data Developers

- FREE monthly events in San Jose & Foster City
- Full-day Developer Days
- Afternoon & evening hackathons
- Past meetups covered: Text Analytics, Real-time Analytics, SQL for Hadoop, HBase, Social Media Analytics, Machine Data Analytics, Security and Privacy
- Development environment provided
- Live streaming
- Topic suggestions welcome

http://www.meetup.com/BigDataDevelopers/

NEXT MEETUP: Streams Developer Day on Thursday, April 17.
Coming Soon: Big R, Watson, Big Data in the Cloud, Big SQL, MongoDB & more!

© 2013 IBM Corporation


3/57



4/57

Agenda: Hadoop Developer Day

Time | Subject
8:00 AM - 9:00 AM | Registration & Breakfast
9:00 AM - 9:30 AM | Introduction to Hadoop
9:30 AM - 11:00 AM | Hadoop Architecture and HDFS + Hands-on Lab
11:00 AM - 11:45 AM | Introduction to MapReduce
11:45 AM - 12:45 PM | Lunch
12:45 PM - 2:00 PM | MapReduce Hands-on Lab
2:00 PM - 4:00 PM | Using Hive for Data Warehousing + Hands-on Lab
4:00 PM - 6:00 PM | SQL for Hadoop + Hands-on Lab
6:00 PM | Closing Remarks


5/57

Big Data University: www.bigdatauniversity.com



6/57

Big Data University: www.bigdatauniversity.com



7/57

Quick Start Edition VM

Download: http://ibm.co/QuickStart (.tar.gz archive; unpack using WinRAR, 7-Zip, etc.)



8/57

Your Feedback is Important, please complete your Survey



9/57

Introduction to Hadoop


Rafael Coss, IBM Big [email protected]

@racoss


10/57

Executive Summary

What's Big Data?
- More Analytics on More Data for More People
- More than just Hadoop

What's Hadoop?
- Distributed computing framework that is:
-- Cost Effective
-- Flexible
-- Fault Tolerant

What's a Hadoop Distribution?
- Common set of Apache Projects
- Install
- Unique Value Add


11/57

Key Business-driven Use Cases Improve Business Outcomes

- Enrich Your Information Base with Big Data Exploration
- Improve Customer Interaction with Enhanced 360 View of the Customer
- Help Reduce Risk and Prevent Fraud with Security and Intelligence Extension
- Optimize Infrastructure and Monetize Data with Operations Analysis
- Gain IT Efficiency and Scale with Data Warehouse Modernization

[Figure: customer result callouts. Values: 42TB, 1,100, 99%, 40X, 60K. Labels: Acoustic Data Analyzed; Gain in Analysis Performance; Metered Customers in Five States; Publishing Partnerships; In Time Required for Analysis.]


12/57



13/57

Why is Big Data important?

Data AVAILABLE to an organization

Data an organization can PROCESS

Enterprises are more blind to new opportunities.

Organizations are able to process less and less of the available data.

100 million tweets are posted every day and 35 hours of video are uploaded every minute. In 2011, 6.1 x 10^12 text messages were sent and 247 x 10^9 e-mails passed through the net, 80% of them spam and viruses. => Prefiltering is more and more important.


14/57

What is Big Data?

Transactional & Application Data, Machine Data, Social Data, Enterprise Content

More Analytics on More Data for More People

- Volume
- Velocity
- Variety
- Veracity

Structured, Semi-structured, Highly unstructured

Throughput, Ingestion



15/57

Every Industry can Leverage Big Data and Analytics

Insurance
- 360 View of Domain or Subject
- Catastrophe Modeling
- Fraud & Abuse
- Producer Performance Analytics
- Analytics Sandbox

Banking
- Optimizing Offers and Cross-sell
- Customer Service and Call Center Efficiency
- Fraud Detection & Investigation
- Credit & Counterparty Risk

Telco
- Pro-active Call Center
- Network Analytics
- Location Based Services

Energy & Utilities
- Smart Meter Analytics
- Distribution Load Forecasting/Scheduling
- Condition Based Maintenance
- Create & Target Customer Offerings

Media & Entertainment
- Business process transformation
- Audience & Marketing Optimization
- Multi-Channel Enablement
- Digital commerce optimization

Retail / Travel & Transport / Consumer Products / Government / Healthcare
- Actionable Customer Insight
- Merchandise Optimization
- Dynamic Pricing
- Customer Analytics & Loyalty Marketing
- Predictive Maintenance Analytics
- Capacity & Pricing Optimization
- Shelf Availability
- Promotional Spend Optimization
- Merchandising Compliance
- Promotion Exceptions & Alerts
- Civilian Services
- Defense & Intelligence
- Tax & Treasury Services
- Measure & Act on Population Health Outcomes
- Engage Consumers in their Healthcare

Automotive
- Advanced Condition Monitoring
- Data Warehouse Optimization
- Actionable Customer Intelligence

Life Sciences
- Increase visibility into drug safety and effectiveness

Chemical & Petroleum
- Operational Surveillance, Analysis & Optimization
- Data Warehouse Consolidation, Integration & Augmentation
- Big Data Exploration for Interdisciplinary Collaboration

Aerospace & Defense
- Uniform Information Access Platform
- Data Warehouse Optimization
- Airliner Certification Platform
- Advanced Condition Monitoring (ACM)

Electronics
- Customer/Channel Analytics
- Advanced Condition Monitoring



16/57

Big data adoption

Big Data use study


2012 Big Data @ Work Study surveying 1144 business and IT professionals in 95 countries

When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors


17/57

Warehouse Modernization Has Two Themes

Traditional Analytics: Structured & Repeatable
Structure built to store data
- Business Users: Determine
- IT Team: Builds System To Answer Known Questions
- Capacity constrained down sampling of available information
- Carefully cleanse all information before any analysis
- Figure labels: Available Information, Analyzed Information

Big Data Analytics: Iterative & Exploratory
Data is the structure
- IT Team: Delivers Data On Flexible
- Business Users: Explore and Ask Any Question
- Analyze ALL Available Information
- Whole population analytics connects the dots
- Analyze information as is & cleanse as needed
- Figure labels: Analyzed Information


18/57

Warehouse Modernization Has Two Themes

Traditional Analytics: Structured & Repeatable
Structure built to store data
- Start with hypothesis; test against selected data
- Figure labels: Question, Hypothesis, Data, Answer

Big Data Analytics: Iterative & Exploratory
Data is the structure
- Data leads the way; explore all data, identify correlations
- Figure labels: All Information, Data, Exploration, Correlation, Actionable Insight, Analyzed Information

Analyze after landing / Analyze in motion


19/57

Getting the Value from Big Data: Why a Platform?

Almost all big data use cases require an integrated set of big data technologies to address the business pain completely

The Whole is Greater than the Sum of the Parts

- Reduce time and cost and provide quick ROI by leveraging pre-integrated components
- Provide both out of the box and standards-based services
- Start small with a single project and progress to others over your big data journey

BIG DATA PLATFORM
- Accelerators
- Discovery, Application Development, Systems Management
- Data Warehouse, Stream Computing, Hadoop System
- Information Integration & Governance
- Data, Media, Content, Machine, Social



20/57

Watson Foundations

Data types
- Transaction & application data
- Machine and sensor data
- Enterprise content
- Image and video
- Third-party data

Platform
- Real-time processing & analytics (STREAMS & DATA REPLICATION)
- Operational systems
- Exploration, landing and archive
- Trusted data
- Reporting & interactive analysis
- Deep analytics & modeling
- Information Integration & Governance

Actionable insight
- Decision management
- Predictive analytics & modeling
- Reporting, analysis, content analytics
- Discovery and exploration

Watson Foundations Differentiators

1. More than Hadoop
- Greater resiliency and recoverability
- Advanced workload management, multi-tenancy
- Enhanced, flexible storage management (GPFS)
- Enhanced data access (BigSQL, Search)
- Analytics accelerators & visualization
- Enterprise-ready security framework

2. Data in Motion
- Enterprise class stream processing & analytics

3. Analytics Everywhere
- Richest set of analytics capabilities
- Ability to analyze data in place

4. Governance Everywhere
- Complete integration & governance capabilities
- Ability to govern all data wherever it is

5. Complete Portfolio
- End-to-end capabilities to address all needs
- Ability to grow and address future needs
- Remains open to work with existing investments


21/57

IBM Watson Foundations

IBM Big Data & Analytics
- All Data
- New/Enhanced Applications

Questions answered
- What is happening?
- Why did it happen?
- What could happen?
- What action should I take?

Capabilities
- Real-time Data Processing & Analytics
- Discovery and exploration
- Reporting and analysis
- Predictive analytics and modeling
- Decision management
- Cognitive Fabric

Data zones
- Landing, Exploration and Archive data zone
- EDW and data mart zone
- Operational data zone
- Deep Analytics

Information Integration & Governance

IBM Big Data & Analytics Infrastructure
- Systems, Security, Storage
- On premise, Cloud, As a service


22/57

What is Hadoop?

- Apache open source software framework for reliable, scalable, distributed computing of massive amounts of data
-- Hides underlying system details and complexities from user
-- Developed in Java
- Core sub projects:
-- MapReduce
-- . . .
-- Hadoop Common
- Supported by several Hadoop-related projects:
-- HBase
-- Zookeeper
-- Avro
-- Etc.
- Meant for heterogeneous commodity hardware


23/57

Design principles of Hadoop

- New way of storing and processing the data:
-- Let system handle most of the issues automatically: failures, scalability
-- Reduce communications
-- Distribute data and processing power to where the data is
-- Make parallelism part of operating system
-- Relatively inexpensive hardware ($2-4K)
- Hadoop = HDFS + MapReduce infrastructure + ...
- Optimized to handle:
-- Massive amounts of data through parallelism
-- A variety of data (structured, unstructured, semi-structured)
-- Using inexpensive commodity hardware
- Reliability provided through replication


24/57

Hadoop is not for all types of work

Not to process transactions (random access)

Not good when work cannot be parallelized

Not good for low latency data access

Not good for processing lots of small files


Not good for intensive calculations with little data

Big Data Solution


25/57

Who uses Hadoop?



26/57

Map-Reduce

Hadoop

BigInsights



27/57

What is Apache Hadoop?

- Flexible, enterprise-class support for processing large volumes of data
-- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
-- Initiated at Yahoo; originally built to address scalability problems of Nutch, an open source Web search technology
-- Well-suited to batch-oriented, read-intensive applications
- Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
-- CPU + disks = node
-- Nodes can be combined into clusters
-- New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written


28/57

Hadoop Open Source Projects

Hadoop is supplemented by an ecosystem of open source projects



29/57

How do I leverage Hadoop to create new value for my enterprise?

Hadoop, Pig, Hive, Zookeeper, Jaql, HBase, Oozie, Flume

HDFS, MapReduce, AQL

Machine learning, Log analysis, Sentiment analysis, CDRs, . . .

Terabytes, Petabytes, Exabytes


30/57

What's a Hadoop Distribution?

What's a Linux Distribution?
- Linux Kernel
- Open Source Tools around Kernel
- Installer
- Administration UI

Open Source Distribution Formula
- Core Projects around Kernel
- Value Add
- Test Components
- Installer
- Administration UI
- Apps

WebSphere WAS 25 > Apache Projects + Additional Open Source + installer + IBM Value Add


31/57

BigInsights: Value Beyond Open Source

Enterprise Capabilities
- Visualization & Exploration
- Development Tools
- Advanced Engines
- Connectors
- Workload Optimization
- Administration & Security

Open source components
- IBM-certified Apache Hadoop

Key differentiators
- Built-in analytics
-- Text engine, annotators, Eclipse tooling
-- Interface to project R (statistical platform)
- Enterprise software integration
- Spreadsheet-style analysis
- Integrated installation of supported open source and other components
- Web Console for admin and application access
- Platform enrichment: additional security, performance features, . . .
- World-class support
- Full open source compatibility

Business benefits
- Quicker time-to-value due to IBM technology and support
- Reduced operational risk
- Enhanced business knowledge with flexible analytical platform
- Leverages and complements existing software


32/57

From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs

Standard Edition

Enterprise class

Enterprise Edition

- Spreadsheet-style tool

- Accelerators

-- GPFS FPO

-- Adaptive MapReduce

- Text analytics

- Enterprise Integration

-- Monitoring and alerts


Breadth of capabilities

Quick Start: Free. Non-production

-- Web console

-- Dashboards

- Pre-built applications

-- Eclipse tooling

-- RDBMS connectivity

-- Big SQL

-- Jaql

-- Platform enhancements

-- . . .


-- InfoSphere Streams*

-- Watson Explorer*

-- Cognos BI*

-- . . .

-* Limited use license

Apache

Hadoop


33/57

IBM Enriches Hadoop

- Scalable: New nodes can be added on the fly
- Affordable: Massively parallel computing on commodity servers
- Flexible: Hadoop is schema-less, and can absorb any type of data
- Fault Tolerant: Through MapReduce software framework

Enterprise Hardening of Hadoop
- Performance & reliability: Adaptive MapReduce, Compression, Indexing, Flexible Scheduler, +++
- Productivity Accelerators: Web-based UIs and tools, End-user visualization, Analytic Accelerators, +++
- Enterprise Integration: To extend & enrich your information supply chain


34/57

Big Database Vendors Adopt Hadoop

IBM Internal Use Only


35/57

Competing Hadoop Distribution Vendors

Cloudera
- Cloudera makes it easy to run open source Hadoop in production
- Focus on deriving business value from all your data instead of worrying about managing Hadoop

Hortonworks
- Make Hadoop easier to consume for enterprises and technology vendors
- Provide expert support by the leading contributors to the Apache Hadoop open source projects

EMC Greenplum HD ** Pivotal HD **
- Provides a complete platform including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution

MapR
- High Performance Hadoop, up to 2-5 times faster performance than Apache-based distributions
- The first distribution to provide true high availability at all levels, making it more dependable

Amazon Elastic MapReduce
- Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit



36/57

Capabilities Required for Hadoop Style Workloads

- Visualization & Discovery
- Analytics Engines
- Application Support and Development Tooling
- Runtime
- Cluster and Workload Management
- Data Ingest
- File System
- Data Store
- Security


37/57

Open Source Hadoop Components

Layers: Visualization & Discovery, Data Ingest, Analytics Engines, Application Support and Development Tooling, Cluster Optimization and Management, Runtime, File System, Data Store, Security

Open Source components: MapReduce, Pig, Hive, Lucene, Oozie, ZooKeeper, Sqoop, HCatalog, Flume, Avro, Derby

- Runtime: MapReduce
- File System: HDFS
- Data Store: HBase


38/57

Open Source Components Across Distributions

Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop    | 1.0.3  | 1.1.2  | 0.20.2 | 1.0.3  | 0.20.2 | 2.0.0*
HBase     | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive      | 0.9.0  | 0.10.0 | 0.9.0  | 0.8.1  | 0.7.1  | 0.8.1
Pig       | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2  | 0.8.1  | 0.9.2
Zookeeper | 3.4.3  | 3.4.5  | X      | 3.3.5  | 3.3.5  | 3.4.3
Oozie     | 3.2.0  | 3.2.0  | 3.1.0  | X      | 2.3.2  | 3.1.3
Avro      | 1.6.3  | X      | X      | X      | X      | X
Flume     | 0.9.4  | 1.3.0  | 1.2.0  | X      | 0.9.4  | 1.1.0
Sqoop     | 1.4.1  | 1.4.2  | 1.4.1  | X      | 1.3.0  | 1.4.1
HCatalog  | 0.4.0  | 0.5.0  | 0.4.0  | X      | X      | X


39/57


40/57

Two Key Aspects of Hadoop

Hadoop Distributed File System = HDFS
- Where Hadoop stores data
- A file system that spans all the nodes in a Hadoop cluster
- It links together the file systems on many local nodes to make them into one big file system

MapReduce framework
- How Hadoop understands and assigns work to the nodes (machines)


41/57

What is the Hadoop Distributed File System?

- HDFS stores data across multiple nodes
- HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
- The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.
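The replication idea on this slide can be illustrated with a toy model. The sketch below is not the HDFS API and does not reflect how a real NameNode chooses nodes; the block size, replication factor, and round-robin placement are assumptions picked only to make the arithmetic visible.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block placement with replication; not the real HDFS API.
class BlockPlacementSketch {

    // Hypothetical parameters chosen for illustration only.
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assumed block size in bytes
    static final int REPLICATION = 3;                  // assumed copies per block

    // Number of blocks needed to hold a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Round-robin placement: each block is copied to REPLICATION distinct nodes.
    static List<List<Integer>> placeBlocks(long fileSizeBytes, int nodeCount) {
        if (nodeCount < REPLICATION) {
            throw new IllegalArgumentException("need at least " + REPLICATION + " nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        long blocks = blockCount(fileSizeBytes);
        for (long b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % nodeCount));
            }
            placement.add(replicas);
        }
        return placement;
    }

    // True if every block still has at least one replica on a node other than the failed one.
    static boolean readableAfterFailure(List<List<Integer>> placement, int failedNode) {
        for (List<Integer> replicas : placement) {
            boolean alive = false;
            for (int node : replicas) {
                if (node != failedNode) {
                    alive = true;
                }
            }
            if (!alive) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Integer>> placement = placeBlocks(1_000_000_000L, 5);
        System.out.println("blocks: " + placement.size());
        System.out.println("placement: " + placement);
        System.out.println("readable with node 2 down: " + readableAfterFailure(placement, 2));
    }
}
```

With three copies of every block on distinct nodes, losing any single node leaves at least two copies of each block, which is the sense in which replication provides reliability.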


42/57

MapReduce

- Take a large problem and divide it into sub-problems
-- Break data set down into small chunks
- Perform the same function on all sub-problems (MAP)
-- DoWork() DoWork() DoWork()
- Combine the output from all sub-problems (REDUCE)
-- Output
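The divide, apply, and combine steps can be tried without a cluster. The following is a minimal in-memory sketch, assuming word counting as the problem; the class and method names are hypothetical and nothing here uses Hadoop itself.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map, shuffle, and reduce steps; no Hadoop involved.
class MiniWordCount {

    // MAP: turn one chunk of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // SHUFFLE: group all values that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // REDUCE: combine the values for one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in sequence over a list of chunks.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String chunk : chunks) {
            allPairs.addAll(map(chunk));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(allPairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {big=2, clusters=1, data=3}
        System.out.println(wordCount(Arrays.asList("big data", "big clusters", "data data")));
    }
}
```

Because `map` only ever sees one chunk and `reduce` only ever sees one key, each call could run on a different machine; that independence is what lets the real framework parallelize the work.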


43/57

MapReduce Example

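Hadoop computation model
- Data stored in a distributed file system spanning many inexpensive computers
- Bring function to the data
- Distribute application to the compute resources where the data is stored

Scalable to thousands of nodes and petabytes of data

MapReduce Application (runs on the Hadoop Data Nodes)
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set

    // Word-count fragment; both classes are nested inside the job's driver class.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (word, 1) for every token in the input line.
      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      // The reducer body is not shown on the slide.
    }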


44/57


45/57

So What Does This Result In?

Easy To Scale

Fault Tolerant and Self-Healing


Data Agnostic

Extremely Flexible


46/57

Resources

- bigdatauniversity.com
- youtube.com/ibmBigData
- Quick Start Editions: Ibm.co/quickstart, Ibm.co/streamsqs
- ibm.meetup.com
- ibmdw.net/streamsdev
- ibm.co/streamscon
- ibmbigdatahub.com
- ibm.co/bigdatadev
- http://tinyurl.com/biginsights (links to demos, papers, forum, downloads, etc.)


47/57

Thank You

Your feedback is important! Please fill out the survey.


48/57

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

Copyright IBM Corporation 2014. All rights reserved.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.


49/57

Backup


Global TLE Framework


50/57

Implications of Big Data

- Just reading 100 terabytes is slow
-- Standard computer (100 MBPS): ~11 days
-- Across 10 Gbit link (high end storage): 1 day
-- 1000 standard computers: 15 minutes!
- Seek times for random disk access are a problem
-- 1 TB data set with 10^10 100-byte records
-- Updates to 1% would require 1 month
-- Reading and rewriting the whole data set would take 1 day*
- One node is not enough! Need to scale out, not up!

* From the Hadoop mailing list
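The scan-time figures on this slide can be checked with simple division. The sketch below assumes decimal units (1 TB = 10^12 bytes) and reads "100 MBPS" as 100 megabytes per second; both are assumptions, as are the class and method names.

```java
// Back-of-the-envelope scan times for the figures on the slide (decimal units assumed).
class ScanTimeEstimate {

    static final double TERABYTE = 1e12; // bytes, decimal convention (assumption)

    // Seconds needed to read `bytes` at `bytesPerSecond` per reader, spread over `machines` readers.
    static double scanSeconds(double bytes, double bytesPerSecond, int machines) {
        return bytes / (bytesPerSecond * machines);
    }

    public static void main(String[] args) {
        double data = 100 * TERABYTE;
        double standard = 100e6;    // 100 MB/s, reading "MBPS" as megabytes per second (assumption)
        double fastLink = 10e9 / 8; // 10 Gbit/s expressed in bytes per second

        System.out.printf("1 standard computer: %.1f days%n", scanSeconds(data, standard, 1) / 86_400);
        System.out.printf("10 Gbit link:        %.1f days%n", scanSeconds(data, fastLink, 1) / 86_400);
        System.out.printf("1000 computers:      %.1f minutes%n", scanSeconds(data, standard, 1000) / 60);
    }
}
```

Under these assumptions the three cases work out to roughly 11.6 days, 0.9 days, and 16.7 minutes, which agrees with the rounded numbers on the slide. Only the third option gets faster by adding more of the same hardware.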



51/57

Scaling out

- Bad news: nodes fail, especially if you have many
-- Mean time between failures for 1 node = 3 years, 1000 nodes = 1 day
-- Super-fancy hardware still fails and commodity machines give better performance per dollar
- Bad news II: distributed programming is hard
-- Communication, synchronization, and deadlocks
-- Recovering from machine failure
-- Debugging
-- Optimization
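The mean-time-between-failures figure follows from dividing the per-node value by the number of nodes. This is a rough sketch that assumes nodes fail independently and at the same rate, which real clusters only approximate; the names are hypothetical.

```java
// Rough cluster failure arithmetic, assuming independent nodes with identical failure rates.
class ClusterMtbf {

    // Expected time between failures anywhere in the cluster, in days.
    static double clusterMtbfDays(double nodeMtbfDays, int nodes) {
        return nodeMtbfDays / nodes;
    }

    public static void main(String[] args) {
        double nodeMtbfDays = 3 * 365.0; // 3 years per node, the figure quoted on the slide
        for (int nodes : new int[] {1, 10, 100, 1000}) {
            System.out.printf("%5d nodes: a failure about every %.1f days%n",
                    nodes, clusterMtbfDays(nodeMtbfDays, nodes));
        }
    }
}
```

For 1000 nodes this gives about 1.1 days, so at cluster scale a failure is a daily event that the software has to treat as normal.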





52/57

A new model is needed

- It's all about the right level of abstraction
- Hide system-level details from the developers
-- No more race conditions, lock contention, etc.
- Separating the what from the how
-- Developer specifies the computation that needs to be performed
-- Execution framework (runtime) handles actual execution
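One way to picture the split between the what and the how is a tiny runtime that owns all the threading. The sketch below is a toy running on one machine with a thread pool, not Hadoop's execution framework; `TinyRuntime` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Toy "execution framework": the caller says what to compute, run() decides how.
class TinyRuntime {

    static <I, O> O run(List<I> inputs, Function<I, O> mapFn,
                        O identity, BinaryOperator<O> reduceFn, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // The runtime owns threads, scheduling, and result collection.
            List<Future<O>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> mapFn.apply(input);
                pending.add(pool.submit(task));
            }
            // Results are combined in input order on the calling thread.
            O result = identity;
            for (Future<O> f : pending) {
                result = reduceFn.apply(result, f.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("job interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "d e", "f");
        // Developer code: two small functions, no threads or locks in sight.
        Integer total = run(lines, line -> line.split(" ").length, 0, Integer::sum, 4);
        System.out.println("total words: " + total); // prints "total words: 6"
    }
}
```

The developer-facing part is the single call in `main`; swapping the thread pool for a cluster scheduler would not change that call, which is the point of hiding system-level details.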




53/57

MapReduce


Traditional computing

MapReduce computing



54/57

MapReduce, the reality


Many nodes, little communication between the nodes, some stragglers and failures


55/57

Big Difference: Schema on Run

Regular database: Schema on load
- Raw data -> Schema to filter -> Storage (pre-filtered data)

Big Data (Hadoop): Schema on run
- Raw data -> Storage (unfiltered, raw data) -> Schema to filter -> Output


56/57

Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS)
- Schema must be defined before any data is loaded
- An explicit load operation has to take place which transforms data to internal DB structure
- New columns must be added explicitly before new data for such columns can be loaded into the database
- Pros: Read Fast, Standard/Governance

Schema-on-Read (Hadoop)
- Data is copied to the file store, no transformation is needed
- A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
- New data can start flowing anytime and will appear retroactively once SerDe is updated to parse it
- Pros: Load Fast, Flexibility/Agility
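Late binding is easier to see with a small example. The sketch below stands in for a SerDe with a plain function over comma-separated lines; it is not Hive's SerDe interface, and the column names and sample records are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Schema-on-read in miniature: raw lines are stored untouched, columns are bound at read time.
class SchemaOnReadSketch {

    // Stand-in for a SerDe: maps one raw comma-separated line onto named columns.
    static Map<String, String> deserialize(String rawLine, List<String> columns) {
        String[] fields = rawLine.split(",", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            // Columns missing from older records read as null rather than failing.
            row.put(columns.get(i), i < fields.length ? fields[i] : null);
        }
        return row;
    }

    // Applies the read-time schema to every stored line.
    static List<Map<String, String>> read(List<String> rawStore, List<String> columns) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : rawStore) {
            rows.add(deserialize(line, columns));
        }
        return rows;
    }

    public static void main(String[] args) {
        // "Load" is just a copy: no transformation, no schema check.
        List<String> rawStore = Arrays.asList("alice,42", "bob,17,DE");

        // First reader only knows two columns; the extra field is ignored.
        System.out.println(read(rawStore, Arrays.asList("name", "score")));

        // Updated schema: the third column appears retroactively for data already stored.
        System.out.println(read(rawStore, Arrays.asList("name", "score", "country")));
    }
}
```

The stored lines never change between the two reads; only the column list passed to the reader does. A schema-on-write system would instead have rejected or reshaped the second record at load time.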



57/57

Scalability: Scalable Software Deployment

