The Big Deal About Big Data


Page 1: The Big Deal About Big Data

© 2012 IBM Corporation

The Big Deal About Big Data

Dean Compher
Data Management Technical Professional for UT, NV
[email protected]

@db2Dean

facebook.com/db2Dean

Slides Created and Provided by:
• Paul Zikopoulos
• Tom Deutsch

www.db2Dean.com

Page 2: The Big Deal About Big Data


Why Big Data: How We Got Here

Page 3: The Big Deal About Big Data

In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster

Page 4: The Big Deal About Big Data


An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…

1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!

Page 5: The Big Deal About Big Data


350B Transactions/Year

Meter Reads every 15 min.

3.65B – meter reads/day
120M – meter reads/month

Page 6: The Big Deal About Big Data


In August of 2010, Adam Savage, of "MythBusters," took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account, including the phrase "Off to work."

Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken.

By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.

Page 7: The Big Deal About Big Data


The Social Layer in an Instrumented Interconnected World

2+ billion people on the Web by end 2011

30 billion RFID tags today (1.3B in 2005)

4.6 billion camera phones worldwide

100s of millions of GPS-enabled devices sold annually

76 million smart meters in 2009… 200M by 2014

12+ TBs of tweet data every day

25+ TBs of log data every day

? TBs of data every day

Page 8: The Big Deal About Big Data


Twitter Tweets per Second Record Breakers of 2011

Page 9: The Big Deal About Big Data


Extract Intent, Life Events, Micro Segmentation Attributes

(Figure: sample social posts from users such as "Jo Jobs", "Tina Mu", "Tom Sit" and "Pauline", annotated with labels including Name/Birthday/Family, Not Relevant – Noise, Monetizable Intent, Monetizable Intent – Relocation, Location Wishful Thinking, and SPAMbots)

Page 10: The Big Deal About Big Data


Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible

Big Data includes any of the following characteristics:

Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text

Velocity: Streaming data and large volume data movement

Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)

Page 11: The Big Deal About Big Data


Bigger and Bigger Volumes of Data

Retailers collect click-stream data from Web site interactions and loyalty card data
– This traditional POS information is used by the retailer for shopping basket analysis, inventory replenishment, +++
– But data is also being provided to suppliers for customer buying analysis

Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized

Science is increasingly dominated by big-science initiatives
– Large-scale experiments generate over 15 PB of data a year and can't be stored within the data center; it is sent to laboratories

Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading

Improved instrument and sensor technology
– The Large Synoptic Survey Telescope's GPixel camera generates 6 PB+ of image data per year; or consider the Oil and Gas industry

Page 12: The Big Deal About Big Data


Data AVAILABLE to an organization

Data an organization can PROCESS

The Big Data Conundrum: the percentage of available data an enterprise can analyze is decreasing proportionately to the amount of data available to it

Quite simply, this means as enterprises, we are getting “more naive” about our business over time

We don’t know what we could already know….

Page 13: The Big Deal About Big Data


Why Not All of Big Data Before? We Didn't Have the Tools

Page 14: The Big Deal About Big Data


Applications for Big Data Analytics

Homeland Security

Finance

Smarter Healthcare

Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics

Fraud and Risk

Log Analysis

Search Quality

Retail: Churn, NBO

Page 15: The Big Deal About Big Data


Most Requested Uses of Big Data

Log Analytics & Storage

Smart Grid / Smarter Utilities

RFID Tracking & Analytics

Fraud / Risk Management & Modeling

360° View of the Customer

Warehouse Extension

Email / Call Center Transcript Analysis

Call Detail Record Analysis

+++

Page 16: The Big Deal About Big Data


So What Is Hadoop?

Page 17: The Big Deal About Big Data


Hadoop Background

Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google Map/Reduce and Google File System papers.

Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses.

Hadoop is a paradigm that says that you send your application to the data rather than sending the data to the application

Page 18: The Big Deal About Big Data


What Hadoop Is Not

It is not a replacement for your Database & Warehouse strategy
– Customers need hybrid database/warehouse & Hadoop models

It is not a replacement for your ETL strategy
– Existing data flows aren't typically changed; they are extended

It is not designed for real-time complex event processing like Streams
– Customers are asking for Streams & BigInsights integration

Page 19: The Big Deal About Big Data


So What Is Really New Here?

Cost effective / linear scalability
– Hadoop brings massively parallel computing to commodity servers. You can start small and scale linearly as your work requires
– Storage and modeling at Internet scale rather than small sampling
– Cost profile for supercomputer-level compute capabilities
– Cost per TB of storage enables a superset of information to be modeled

Mixing structured and unstructured data
– Hadoop is schema-less, so it doesn't care what form the stored data is in, and thus allows a superset of information to be commonly stored. Further, MapReduce can be run effectively on any type of data and is really limited only by the creativity of the developer
– Structure can be introduced at MapReduce run time based on the keys and values defined in the MapReduce program. Developers can create jobs that run against structured, semi-structured, and even unstructured data

Inherently flexible in what is modeled and which analytics are run
– Ability to change direction literally on a moment's notice without any design or operational changes
– Since Hadoop is schema-less and can introduce structure on the fly, the type of analytics and the nature of the questions being asked can be changed as often as needed without upfront cost or latency

Page 20: The Big Deal About Big Data


Break It Down For Me Here…

Hadoop is a platform and framework, not a database
– It uses both the CPU and disk of single commodity boxes, or nodes
– Boxes can be combined into clusters
– New nodes can be added as needed, without needing to change:
• Data formats
• How data is loaded
• How jobs are written
• The applications on top

Page 21: The Big Deal About Big Data


So How Does It Do That?

At its core, Hadoop is made up of:

Map/Reduce
– How Hadoop understands and assigns work to the nodes (machines)

Hadoop Distributed File System (HDFS)
– Where Hadoop stores data
– A file system that runs across the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system

Page 22: The Big Deal About Big Data


What is HDFS

The HDFS file system stores data across multiple machines

HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
– Default is 3 copies
• Two on the same rack, and one on a different rack

The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS
– They also serve the data over HTTP, allowing access to all content from a web browser or other client
– Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high
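As a brief illustrative sketch (not part of the original slides), the Hadoop FileSystem Java API below reads one logical file from the cluster-wide namespace; the block fetching and replication described above are handled for the client transparently. The file path used here is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Reads the cluster settings (core-site.xml) so the client talks to HDFS,
    // not the local disk.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file in the cluster-wide namespace.
    Path file = new Path("/data/meter_reads/sample.txt");

    // Optionally ask HDFS to keep 3 replicas of this file (the default noted above).
    fs.setReplication(file, (short) 3);

    // Read the file as one stream; HDFS fetches the underlying blocks from
    // whichever data nodes currently hold them.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(reader.readLine());
    }
  }
}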

Page 23: The Big Deal About Big Data


File System on my Laptop

Page 24: The Big Deal About Big Data


HDFS File System Example

Page 25: The Big Deal About Big Data


Map/Reduce Explained

"Map" step: – The program is chopped up into many smaller sub-

problems.• A worker node processes some subset of the smaller

problems under the global control of the JobTracker node and stores the result in the local file system where a reducer is able to access it.

"Reduce" step:– Aggregation

• The reduce aggregates data from the map steps. There can be multiple reduce tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.

Page 26: The Big Deal About Big Data


The MapReduce Programming Model

"Map" step: – Program split into pieces – Worker nodes process individual pieces in parallel (under

global control of the Job Tracker node) – Each worker node stores its result in its local file system

where a reducer is able to access it

"Reduce" step:– Data is aggregated (‘reduced” from the map steps) by

worker nodes (under control of the Job Tracker) – Multiple reduce tasks can parallelize the aggregation

Page 27: The Big Deal About Big Data


Map/Reduce Job Example

Page 28: The Big Deal About Big Data


Input splits:
Murray 38, Salt Lake 39, Bluffdale 35, Sandy 32, Salt Lake 42, Murray 31
Bluffdale 32, Sandy 40, Murray 27, Salt Lake 25, Bluffdale 37, Sandy 32, Salt Lake 23, Murray 30

Map (highest reading per city within each split):
Murray 38, Bluffdale 35, Sandy 32, Salt Lake 42
Sandy 40, Salt Lake 25, Bluffdale 37, Murray 30

Shuffle (values grouped by city across splits):
Murray 38, Bluffdale 35, Bluffdale 37, Murray 30
Sandy 40, Salt Lake 25, Sandy 32, Salt Lake 42

Reduce (overall maximum per city):
Murray 38, Bluffdale 37
Sandy 40, Salt Lake 42
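The flow above is the familiar maximum-value-per-key pattern. The Java sketch below is invented for illustration only (class and field names such as MaxReadingMR are not from the deck) and shows roughly how the Map and Reduce steps would be written, assuming input lines of the form city<TAB>reading. Registering the same reducer class as a combiner would also yield the per-split maxima shown in the Map stage.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxReadingMR {

  // Map step: emit (city, reading) for every input line "city<TAB>reading".
  public static class MaxMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
      }
    }
  }

  // Reduce step: the framework has already grouped values by city (the shuffle);
  // keep only the largest reading seen for each city.
  public static class MaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new IntWritable(max));
    }
  }
}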

Page 29: The Big Deal About Big Data


MapReduce In more Detail

Map-Reduce applications specify the input/output locations and supply map and reduce functions via implementations of appropriate Hadoop interfaces, such as Mapper and Reducer.

These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker

The JobTracker then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

The Map/Reduce framework operates exclusively on <key, value> pairs — that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The vast majority of Map-Reduce applications executed on the Grid do not directly implement the low-level Map-Reduce interfaces; rather they are implemented in a higher-level language, such as Jaql, Pig or BigSheets
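To make the job configuration and submission described above concrete, here is a hedged sketch of a driver class that wires the Mapper and Reducer from the previous example into a job and hands it to the cluster; the job name and the input/output paths taken from the command line are illustrative assumptions, not from the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxReadingDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The job configuration: input/output locations plus the map and reduce
    // implementations, as described on the slide.
    Job job = Job.getInstance(conf, "max reading per city");
    job.setJarByClass(MaxReadingDriver.class);
    job.setMapperClass(MaxReadingMR.MaxMapper.class);
    job.setReducerClass(MaxReadingMR.MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submitting hands the jar and configuration to the cluster
    // (the JobTracker in Hadoop 1.x), which schedules and monitors the tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}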

Page 30: The Big Deal About Big Data


JobTracker and TaskTrackers

Map/Reduce requests are handed to the Job Tracker, which is a master controller for the map and reduce tasks
– Each worker node contains a Task Tracker process which manages work on the local node
– The Job Tracker pushes work out to the Task Trackers on available worker nodes, striving to keep the work as close to the data as possible
– The Job Tracker knows which node contains the data, and which other machines are nearby
– If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack; this reduces network traffic on the main backbone network
– If a Task Tracker fails or times out, that part of the job is rescheduled

Page 31: The Big Deal About Big Data


How To Create Map/Reduce Jobs

Map/Reduce development in Java
– Hard; few resources that know this

Pig
– Open source language / Apache sub-project
– Becoming a "standard"

Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop

Jaql
– Invented by IBM Research
– More powerful than Pig when dealing with loosely structured data
– Visa has been a development partner

BigSheets
– BigInsights browser-based application
– Little development required
– You'll use this most often

(Options listed roughly from most to least skill required)

Page 32: The Big Deal About Big Data


Taken Together - What Does This Result In?

Easy To Scale
– Simply add machines as your data and jobs require

Fault Tolerant and Self-Healing
– Hadoop runs on commodity hardware and provides fault tolerance through software
– Hardware losses are expected and tolerated
– When you lose a node, the system just redirects work to another copy of the data and nothing stops, nothing breaks; jobs, applications and users don't even know

Hadoop Is Data Agnostic
– Hadoop can absorb any type of data, structured or not, from any number of sources
– Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide
– Hadoop results can be consumed by any system necessary if the output is structured appropriately

Hadoop Is Extremely Flexible
– Start small, scale big
– You can turn nodes "off" and use them for other needs if required (really)
– Throw any data, in any form or format, you want at it
– What you use it for can be changed on a whim

Page 33: The Big Deal About Big Data


The IBM Big Data Platform

Page 34: The Big Deal About Big Data


Analytic Sandboxes – aka “Production”

Hadoop capabilities exposed to LOB with some notion of IT support

Not really production in an IBM sense

Really "just" ad hoc, made visible to more users in the organization

Formal declaration of direction as part of the architecture

"Use it, but don't count on it"

Not built for security

Page 35: The Big Deal About Big Data


Production Usage with SLAs

SLA-driven workloads
– Guaranteed job completion
– Job completion within operational windows

Data Security Requirements
– Problematic if it fails or loses data
– True DR becomes a requirement
– Data quality becomes an issue
– Secure Data Marts become a hard requirement

Integration With The Rest of the Enterprise
– Workload integration becomes an issue

Efficiency Becomes A Hot Topic
– Inefficient utilization on 20 machines isn't an issue; on 500 or 1000+ it is

Relatively few are really here yet outside of Facebook, Yahoo, LinkedIn, etc.

Few are thinking of this, but it is inevitable

Page 36: The Big Deal About Big Data


IBM – Delivers a Platform, Not a Product

Hardened Environment
– Removes single points of failure
– Security
– All Components Tested Together
– Operational Processes
– Ready for Production

Mature / Pervasive Usage

Deployed and Managed Like Other Mature Data Center Platforms

BigInsights
– Text Analytics, Data Mining, Streams, Others

Page 37: The Big Deal About Big Data


The IBM Big Data Platform

InfoSphere BigInsights – Hadoop-based low latency analytics for variety and volume

IBM Netezza High Capacity Appliance – Queryable archive for structured data

IBM Netezza 1000 – BI + ad hoc analytics on structured data

IBM Smart Analytics System – Operational analytics on structured data

IBM Informix Timeseries – Time-structured analytics

IBM InfoSphere Warehouse – Large volume structured data analytics

InfoSphere Streams – Low latency analytics for streaming data

InfoSphere Information Server – High volume data integration and transformation

(Platform layers shown in the diagram: Hadoop, Stream Computing, MPP Data Warehouse, Information Integration)

Page 38: The Big Deal About Big Data


What Does a Big Data Platform Do?

Analyze Information in Motion
– Streaming data analysis
– Large volume data bursts and ad hoc analysis

Analyze a Variety of Information
– Novel analytics on a broad set of mixed information that could not be analyzed before

Discover and Experiment
– Ad hoc analytics, data discovery and experimentation

Analyze Extreme Volumes of Information
– Cost-efficiently process and analyze PBs of information
– Manage & analyze high volumes of structured, relational data

Manage and Plan
– Enforce data structure, integrity and control to ensure consistency for repeatable queries

Page 39: The Big Deal About Big Data


Big Data Enriches the Information Management Ecosystem

Who Ran What, Where, and When? – Audit MapReduce jobs and tasks

Managing a Governance Initiative

OLTP Optimization (SAP, checkout, +++)

Master Data Enrichment via Life Events, Hobbies, Roles, +++

Establishing Information as a Service

Active Archive Cost Optimization

Page 40: The Big Deal About Big Data


Get More Information…

Page 41: The Big Deal About Big Data


www.bigdatauniversity.com

Page 42: The Big Deal About Big Data


Get the Book

Page 43: The Big Deal About Big Data
