1_mi_is_az_a_big_data.ppt

© 2013 IBM Corporation January 2013 IBM Big Data Platform Overview Martin Pavlík +420 731 435 691 [email protected]

Upload: sharadvasista

Post on 09-Feb-2016

DESCRIPTION

Big data presentation

TRANSCRIPT

Page 1: 1_Mi_is_az_a_big_data.ppt

© 2013 IBM Corporation, January 2013

IBM Big Data Platform Overview

Martin Pavlík, +420 731 435 691, [email protected]

Page 2: 1_Mi_is_az_a_big_data.ppt

© 2012 IBM Corporation

Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data

Cost-effectively manage and analyze all available data in its native form:

unstructured, structured, streaming

ERP, CRM, RFID

Website

Network Switches

Social Media

Billing

Page 3: 1_Mi_is_az_a_big_data.ppt

BIG DATA is not just HADOOP

Manage and store huge volumes of any data: Hadoop File System, MapReduce

Manage streaming data: Stream Computing

Analyze unstructured data: Text Analytics Engine

Structure and control data: Data Warehousing

Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

Understand and navigate federated big data sources: Federated Discovery and Navigation

Page 4: 1_Mi_is_az_a_big_data.ppt

Business-Centric Big Data Enables You to Start With a Critical Business Pain and Expand the Foundation for Future Requirements

“Big data” isn’t just a technology—it’s a business strategy for capitalizing on information resources

Getting started is crucial

Success at each entry point is accelerated by products within the Big Data platform

Build the foundation for future requirements by expanding further into the big data platform

Page 5: 1_Mi_is_az_a_big_data.ppt

1 – Unlock Big Data

Customer need

• Understand existing data sources

• Search and navigate data within existing systems

• No copying of data

Value statement

• Get up and running quickly

• Discover and retrieve big data

• Work even with big data sources – by business users

Solution

• Vivisimo Velocity, renamed to IBM InfoSphere DataDiscovery

Page 6: 1_Mi_is_az_a_big_data.ppt

2 – Analyze Raw Data

Customer need

• Ingest data as-is into Hadoop

• Combine it with data from the DWH

• Process very large volumes of data

Value statement

• Gain new insight

• Overcome the high cost of converting data from unstructured to structured format

• Experiment with analysis on different data and combine them with other sources

Solution

• IBM InfoSphere BigInsights

Page 7: 1_Mi_is_az_a_big_data.ppt

Merging the Traditional and Big Data Approaches

Traditional Approach: Structured & Repeatable Analysis

Business Users: Determine what question to ask

IT: Structures the data to answer that question

Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory Analysis

IT: Delivers a platform to enable creative discovery

Business: Explores what questions could be asked

Brand sentiment, Product strategy, Maximum asset utilization

Page 8: 1_Mi_is_az_a_big_data.ppt

InfoSphere BigInsights is more than just HADOOP

IBM InfoSphere BigInsights

• Is much more than Hadoop

IBM Big Data platform

• Includes much more than IBM InfoSphere BigInsights

Page 9: 1_Mi_is_az_a_big_data.ppt

Hadoop

Open-source software framework from Apache

Inspired by

– Google MapReduce

– GFS (Google File System)

HDFS, Map/Reduce
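
The Map/Reduce model named on this slide can be illustrated with a minimal, single-process sketch. This is plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real Hadoop job would distribute the same three steps across many nodes reading from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on in-memory input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data in motion"]))
```

Because each map call looks at one line only and each reduce call looks at one key only, both steps can run in parallel on separate machines, which is the property Hadoop exploits.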

Page 10: 1_Mi_is_az_a_big_data.ppt

InfoSphere BigInsights

Platform for volume, variety, velocity

– Enhanced Hadoop foundation

Analytics

– Text analytics & tooling

– Application accelerators

Usability

– Web console

– Spreadsheet-style tool

– Ready-made “apps”

Enterprise Class

– Storage, security, cluster management

Integration

– Connectivity to Netezza, DB2, JDBC databases, etc.

[Chart: editions plotted by breadth of capabilities against enterprise class]

Apache Hadoop

Basic Edition: Free download, Integrated install, Online InfoCenter, BigData Univ.

Enterprise Edition (Licensed): Application accelerators, Pre-built applications, Text analytics, Spreadsheet-style tool, RDBMS and warehouse connectivity, Administrative tools, security, Eclipse development tools, Performance enhancements, . . . .

Can run also on top of

Page 11: 1_Mi_is_az_a_big_data.ppt

Spreadsheet-style Analysis

Web-based analysis and visualization

Spreadsheet-like interface

– Define and manage long-running data collection jobs

– Analyze the content of the text on the pages that have been retrieved

Page 12: 1_Mi_is_az_a_big_data.ppt

Build a Big Data Program – MapReduce example

Eclipse tools

For Jaql, Hive, Pig, Java MapReduce, BigSheets plug-ins, text analytics, etc.

Page 13: 1_Mi_is_az_a_big_data.ppt

JAQL – IBM’s programming language in the Hadoop world

Jaql is a complete solutions environment supporting all other BigInsights components

Integration point for various analytics

– Text analytics

– Statistical analysis

– Machine learning

– Ad-hoc analysis

Integration point for various data sources

– Local and distributed file systems

– NoSQL databases

– Content repositories

– Relational sources (warehouses, operational databases)

[Diagram labels: BigInsights components: Text Analytics; Statistical Analysis (R module); Machine Learning (SystemML); Ad-Hoc Analysis (BigSheets); (Integration) DB2, Netezza, Streams, …. Jaql layers: Jaql Modules, Jaql Core Operators, Jaql I/O. Storage: DFS, NoSQL, RDBMS, File System]

Page 14: 1_Mi_is_az_a_big_data.ppt

BigInsights and the data warehouse

[Diagram labels: BigInsights; Data warehouse; Big Data analytic applications; Traditional analytic tools; Filter, Transform, Aggregate]
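
The filter, transform and aggregate steps named on this slide can be sketched in a few lines. This is a hypothetical, in-memory illustration only: the record layout in `RAW_EVENTS` and the three function names are invented here and are not part of BigInsights or of any warehouse product.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical raw records, invented for illustration only.
RAW_EVENTS: List[Dict[str, str]] = [
    {"customer": "A", "type": "call", "minutes": "12"},
    {"customer": "B", "type": "sms", "minutes": "0"},
    {"customer": "A", "type": "call", "minutes": "3"},
    {"customer": "C", "type": "call", "minutes": "bad-value"},
]


def filter_events(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Filter: keep only call records whose duration is a plain number."""
    return [e for e in events if e["type"] == "call" and e["minutes"].isdigit()]


def transform_events(events: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Transform: turn text fields into typed (customer, minutes) pairs."""
    return [(e["customer"], int(e["minutes"])) for e in events]


def aggregate_minutes(pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Aggregate: total minutes per customer, a compact result to load elsewhere."""
    totals: Counter = Counter()
    for customer, minutes in pairs:
        totals[customer] += minutes
    return dict(totals)


if __name__ == "__main__":
    print(aggregate_minutes(transform_events(filter_events(RAW_EVENTS))))
```

The point of the pattern is that the bulky raw records stay on the big data side and only the small aggregated result needs to reach the warehouse.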

Page 15: 1_Mi_is_az_a_big_data.ppt

3 – Simplify your warehouse

Customer need – SIGNIFICANTLY

• Make performance of the DWH better

• Reduce DWH administration costs

Value statement

• Speed: 10–100x better performance

• Simplicity: Administration costs reduced by 75%–90%

• Scalability

• Smart system

• In-database analytics

• Out-of-the-box integration with SPSS

Solution

• IBM Netezza, renamed to PureData System for Analytics

Page 16: 1_Mi_is_az_a_big_data.ppt

Analyst: I need to evaluate the possible relationship between client salary and overdrafts.

IT: OK. We have to evaluate a lot of statistics, set the correct db indexes and db partitioning. It will take us 5 days.

Page 17: 1_Mi_is_az_a_big_data.ppt

IT: Done. You can run your analytical query.

Analyst: Great. Thanks a lot. I’m going to check the results.

Page 18: 1_Mi_is_az_a_big_data.ppt

Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.

IT: Ohhh, welcome dear friend. Understood. So, it’s … another 5 days of our work.

Analyst: Noooo!!! It’s not possible to work here!

Page 19: 1_Mi_is_az_a_big_data.ppt

And now with Netezza ...

Page 20: 1_Mi_is_az_a_big_data.ppt

Analyst: I need to evaluate the possible relationship between client salary and overdrafts. I will use Netezza.

Page 21: 1_Mi_is_az_a_big_data.ppt

Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective. With Netezza I can run the query immediately.

The response time will be the same.

IT can do something else – much more useful.

Page 22: 1_Mi_is_az_a_big_data.ppt

Page 23: 1_Mi_is_az_a_big_data.ppt

Built-In Expertise Makes This as Simple as an Appliance

Dedicated device

Optimized for purpose

Complete solution

Fast installation

Very easy operation

Standard interfaces

Low cost

Page 24: 1_Mi_is_az_a_big_data.ppt

IBM Netezza was renamed to IBM PureData System for Analytics in October 2012

Page 25: 1_Mi_is_az_a_big_data.ppt

Netezza Genesis in T-Mobile CZ

Proof-of-Concept Project

– New Enterprise Data Warehouse platform selection

– Comparison of existing and other platforms

– Selection Criteria

• Performance

• Operational Savings

… and the winner was: Netezza

Page 26: 1_Mi_is_az_a_big_data.ppt

Netezza Genesis in T-Mobile CZ

Expectations

– Significant response improvement

• Faster platform means better report response

– Direct Data Availability

• Higher trust in data, one version of truth

• Aggregation reduction

• Any attribute available

– Operational Benefits

• Storage savings (no data replicas)

• Administration cost reduction (DBA)

– Infrastructure Simplification

• Lower environment complexity

Page 27: 1_Mi_is_az_a_big_data.ppt

Netezza Genesis in T-Mobile CZ

Project Implementation

– EDW platform migration

• Netezza platform implementation

• ETL graphs/processes redesign

– BI Front-End Tool Migration

• SAP BusinessObjects implementation

• All reports redesign

Main Integration Partner: T-Systems CZ

Page 28: 1_Mi_is_az_a_big_data.ppt

Netezza Genesis in T-Mobile CZ

Current Status

– All relevant ETL processing redesigned

– Parallel run of the original and Netezza platforms finished

– Netezza is now the only primary platform

Page 29: 1_Mi_is_az_a_big_data.ppt

Real Netezza experience from T-Mobile Czech Rep.

Report | Original Platform | Netezza
Workflow Reporting | 2 hours | 1 minute
Invoicing and Payments reporting: | |
Payment discipline of current month invoices | 33 minutes | 17 seconds
Overdue Debt of Invoices – in Current Month | 10 hours | 23 seconds
Average Monthly Invoice Figures | 50 minutes | 38 seconds

RESPONSE TIME MASSIVELY IMPROVED
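
To put the response times reported on this slide into perspective, the speed-up factor for each report can be worked out by converting both figures to seconds and dividing. The sketch below does only that arithmetic; the dictionary and function names are illustrative, and the input values are the ones quoted on the slide.

```python
from typing import Dict, Tuple

# (original platform, Netezza) response times from the slide, in seconds.
RESPONSE_TIMES: Dict[str, Tuple[int, int]] = {
    "Workflow Reporting": (2 * 3600, 60),
    "Payment discipline of current month invoices": (33 * 60, 17),
    "Overdue Debt of Invoices - in Current Month": (10 * 3600, 23),
    "Average Monthly Invoice Figures": (50 * 60, 38),
}


def speedup(original_seconds: int, netezza_seconds: int) -> float:
    """Return how many times faster the new response time is."""
    return original_seconds / netezza_seconds


def speedups(times: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Speed-up factor per report, rounded to the nearest whole number."""
    return {report: round(speedup(o, n)) for report, (o, n) in times.items()}


if __name__ == "__main__":
    for report, factor in speedups(RESPONSE_TIMES).items():
        print(f"{report}: about {factor}x faster")
```

On these figures the improvement ranges from roughly 79x (Average Monthly Invoice Figures) to roughly 1565x (Overdue Debt of Invoices).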

Page 30: 1_Mi_is_az_a_big_data.ppt

4 – Reduce costs with Hadoop

Customer need – SIGNIFICANTLY

• Too much data => too expensive to store and to maintain

• A big portion is kept “just in case”

• The amount of data is still growing => it gets more expensive

• => too expensive to keep all data in a standard DWH

Value statement

• Leverage the parallel processing architecture of Hadoop

• Hadoop uses cheap commodity HW

• Enable business users to keep working in the same or a similar way

Solution

• IBM InfoSphere BigInsights

Page 31: 1_Mi_is_az_a_big_data.ppt

BigInsights and the data warehouse

BigInsights

• Query-ready archive for “cold” warehouse data

Data Warehouse

Big Data analytic applications

Traditional analytic tools

From Cognos BI via Hive JDBC

Page 32: 1_Mi_is_az_a_big_data.ppt

Future: The SQL interface . . . .

Rich SQL query capabilities

– SQL '92 and 2011 features

– Correlated subqueries

– Windowed aggregates

SQL access to all data stored in InfoSphere BigInsights

Robust JDBC/ODBC support

Take advantage of key features of each data source

Leverage MapReduce parallelism OR achieve low latency

[Diagram labels: Application; SQL Language; JDBC / ODBC Driver; JDBC / ODBC Server; SQL interface Engine; InfoSphere BigInsights; Data Sources: Hive tables, HBase tables, CSV files]

Page 33: 1_Mi_is_az_a_big_data.ppt

5 – Analyze Streaming Data

Customer need

• Process and leverage streaming data

• Select valuable data from the data stream for future processing

• Quickly process data that becomes useless if it is not processed immediately

Value statement

• React in real time to seize an opportunity before it expires

• Periodically adjust streaming models based on analysis of data at rest

Solution

• IBM InfoSphere Streams

[Diagram labels: Streaming Data Sources; Streams Computing; ACTION]

Page 34: 1_Mi_is_az_a_big_data.ppt

Why and when to use InfoSphere Streams?

Applications needing on-the-fly processing, filtering and analysis of streaming data

At least 2 criteria from the list below should be fulfilled

Sensors: Environmental, Industrial, GPS, …, Images, Videos, …

Data Exhaust: Network data; system logs (web server, app server), …

High-rate transaction data: Financial transactions, CDRs

Isolation: Processing in isolation … or in limited windows (time / number of records)

Non-traditional formats included: Spatial data, images, text, voice, …

Integration challenges: Different connection methods, different data rates, different processing requirements

Multiple processing nodes: Volume / rate very high => scalability required

Sub-millisecond latency: Immediate analysis and response

Store & mine approach doesn’t work: Because of the very high volume of data (and its rates)
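
The criterion of processing in limited windows (by time or by number of records) can be illustrated with a small count-based sliding window. This is a generic Python sketch, not InfoSphere Streams code: the function name `windowed_average` and its parameters are assumptions made for this example, and it runs on a single machine rather than across multiple processing nodes.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def windowed_average(values: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the last `window_size` values after each arrival.

    Only the current window is kept in memory, so the input can be an
    unbounded stream that is never stored in full.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window: Deque[float] = deque(maxlen=window_size)
    total = 0.0
    for value in values:
        if len(window) == window.maxlen:
            total -= window[0]  # the oldest value is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


if __name__ == "__main__":
    readings = [1, 2, 3, 4]
    print(list(windowed_average(readings, window_size=2)))
```

A time-based window follows the same idea, except that records are evicted when their timestamp falls outside the window rather than when the record count is exceeded.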

Page 35: 1_Mi_is_az_a_big_data.ppt

Streams and BigInsights - Integrated Analytics on Data in Motion & Data at Rest

[Diagram: data and control flow between InfoSphere Streams and InfoSphere BigInsights, Database & Warehouse]

1. Data Ingest

2. Bootstrap/Enrich

3. Adaptive Analytics Model

InfoSphere Streams: Data ingest, preparation, online analysis, model validation

InfoSphere BigInsights, Database & Warehouse: Data integration, data mining, machine learning, statistical modeling

Visualization of real-time and historical insights

Page 36: 1_Mi_is_az_a_big_data.ppt

The Platform Advantage

[Diagram labels: Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics. IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance]

BENEFITS IN DETAIL

Increase over time

– By moving from entry to a 2nd and 3rd project

Lowering deployment costs

– Shared components

– Integration

Points of leverage

– Shared text analytics for Streams and BigInsights

– HDFS connectors (data integration (ETL, …), Streams)

– Accelerators built across multiple engines

Page 37: 1_Mi_is_az_a_big_data.ppt

IBM big data

THINK