2012.04.26 big insights streams im forum2

Big Data Plattform der IBM

InfoSphere BigInsights und InfoSphere Streams

Big Data Plattform der IBM

InfoSphere BigInsights und InfoSphere Streams

Wilfried Hoge – Leading Technical Sales Professional [email protected] twitter.com/wilfriedhoge

New analytic applications drive the requirements for a big data platform •  Integrate and manage the full

variety, velocity and volume of data

•  Apply advanced analytics to information in its native form

•  Visualize all available data for ad-hoc analysis

•  Development environment for building new analytic applications

•  Workload optimization and scheduling

•  Security and Governance

IBM Big Data Strategy: Move the Analytics Closer to the Data

BI / Reporting

Exploration / Visualization

Functional App

Industry App

Predictive Analytics

Content Analytics

Analytic Applications

IBM Big Data Platform Systems

Management Application

Development Visualization & Discovery

Accelerators

Information Integration & Governance

Hadoop System

Stream Computing

Data Warehouse

Up to 10,000 Times larger

Up to 10,000 times faster

Traditional Data Warehouse and Business Intelligence

Dat

a Sc

ale

Dat

a Sc

ale

yr mo wk day hr min sec … ms µs

Exa

Peta

Tera

Giga

Mega

Kilo

Decision Frequency Occasional Frequent Real-time

Data in Motion

Dat

a at

Res

t

Volume and Velocity – two dimensions for Big Data

Telco Promotions

100,000 records/sec, 6B/day 10 ms/decision 270TB for Deep Analytics

DeepQA 100s GB for Deep Analytics 3 sec/decision Power7, 15TB memory

Wind Turbine Placement & Operation PBs of data Analysis time to 3 days from 3 weeks 1220 IBM iDataPlex nodes

Security

600,000 records/sec, 50B/day 1-2 ms/decision 320TB for Deep Analytics

26.04.2012 © Copyright IBM Corporation 2012 4

Based on open source & IBM technologies

Distinguishing characteristics •  Built-in analytics . . . enhances business

knowledge

•  Enterprise software integration . . . complements and extends existing capabilities

•  Production-ready platform with tooling for analysts, developers, and administrators. . . speeds time-to-value and simplifies development/maintenance

IBM advantage •  Combination of software, hardware,

services and advanced research

BigInsights – analytical platform for persistent “Big Data”

BI / Reporting


Functional App

Industry App


Content Analytics





Accelerators


Stream Computing

Data Warehouse

Hadoop System

Flexible, enterprise-class support for processing large volumes of data •  Based on Google’s MapReduce technology

•  Inspired by Apache Hadoop; compatible with its ecosystem and distribution

•  Well-suited to batch-oriented, read-intensive applications

•  Supports wide variety of data

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner •  CPU + disks = “node”

•  Nodes can be combined into clusters

•  New nodes can be added as needed without changing

•  Data formats

•  How data is loaded

•  How jobs are written

About the BigInsights Platform

Hadoop computation model •  Data stored in a distributed file system spanning many inexpensive computers

•  Bring function to the data

•  Distribute application to the compute resources where the data is stored

Scalable to thousands of nodes and petabytes of data

Hadoop Explained – Map Reduce

MapReduce Application

1.  Map Phase (break job into small parts)

2.  Shuffle (transfer interim output for final processing)

3.  Reduce Phase (boil all output down to a single result set)

Return a single result set Result Set

Shuffle

public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text();

public void map(Object key, Text val, Context StringTokenizer itr = new StringTokenizer(val.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWrita private IntWritable result = new Intritable();

public void reduce(Text key, Iterable<IntWritable> val, Context context){ int sum = 0; for (IntWritable v : val) { sum += v.get(); . . .

Distribute map tasks to cluster

Hadoop Data Nodes

Technical differentiators •  Built-in analytics

•  Text processing engine, annotators, Eclipse tooling •  Statistical and predictive analysis •  Interface to project R (statistical platform)

•  Enterprise software integration (DBMS, warehouse) •  Spreadsheet-style analytical tool for analysts •  Ready-made business process accelerators •  Integrated installation of supported open source and IBM components •  Web Console for administration and application access •  Platform enrichment: additional security, performance features, . . . •  Standard IBM licensing agreement and world-class support

Business benefits •  Quicker time-to-value due to IBM technology and support •  Reduced operational risk •  Enhanced business knowledge with flexible analytical platform •  Leverages and complements existing software assets

BigInsights – Value Beyond Open Source

Seamless process for single node and cluster environments

Integrated installation of all selected components

Post-install validation of IBM and open source components

Web Installation Tool

No need to iteratively download, configure, and test multiple open source projects and their pre-requisite software.

Manage BigInsights •  Inspect system health

•  Add / drop nodes

•  Start / stop services

•  Run / monitor jobs (applications)

•  Explore / modify file system

Launch applications •  Spreadsheet-like analysis tool

•  Pre-built applications (IBM supplied or user developed)

Publish applications

Leverage community resources

Web Console

BigSheets is a visual tool for data manipulation and prototyping •  Allows more users to do more work, more quickly

•  Simply stated, growing an army of MapReduce developers is not cost effective

•  In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights

Sample Uses •  Data exploration and visualization

•  Visual job creation

BigSheets

BigSheets – Spreadsheet-style Data Analysis and Discovery

BigSheets – Visualization

Reusable software assets based on customer engagements •  Useful for starting point for various applications

•  Can be customized by BigInsights application developers as needed

•  Accessible through Web console

Available assets •  Data export (to relational DBMS, files, HBase)

•  Data import (from relational DBMS, files)

•  Web crawler, Twitter crawler

•  Boardreader.com support (Web forum search engine)

•  Ad hoc queries for Jaql, Hive, Pig

•  TeraGen-TeraSort, WordCount sample applications

Quick start applications or “apps”

Running Applications from the Web Console

Develop Hive with the SQL Editor and view results

Build a Big Data Program – Map Reduce example

Eclipse based development tools For JAQL, Hive, Java MapReduce, Text Analytics

Text analytics – Distill structured information from unstructured data •  Rich annotator library supports multiple languages

•  Declarative Information Extraction (IE) system based on an algebraic framework

•  Richer, cleaner rule semantics

•  Better performance through optimization

Developed at IBM Research since 2004

Embedded in several IBM products •  Lotus Notes

•  Cognos Consumer Insights

•  InfoSphere Streams

•  Compose operators to build complex annotators

Text Analytics in BigInsights

Pre-configured text annotators ready for distributed processing on Big Data •  City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent,

EmailAddress, Person, Organizaion, DateTime, URL, Compane Names, Merger, Acquisition, Alliance, etc..

Support for native languages including double-byte

Turns disparate words into measurable insights

Identify positive or negative sentiment,

NLP-based analytics, define variables, macros

and rules.

Physically assemble data, standardize

formats, address auto-identify language,

process punctuation and non-grammatical

characters, standardize spelling.

Part-of-speech identification, standard and

customized extraction dictionaries, proper noun

identification, concept categorization, synonyms,

exclusions, multi-terms, regular expressions, fuzzy-

matching

Iterative classification using automated and manual techniques.

Concept derivation & inclusion, semantic networks and co-occurrence rules

Reporting/Monitoring social commentary, combination w/structured data, clustering,

associated concepts, correlated concepts, auto-

classification of documents, sites, posts.

How it works •  Parses text and detects meaning with

annotators

•  Understands the context in which the text is analyzed

•  Hundreds of pre-built annotators for names, addresses, phone numbers, along others

Accuracy •  Highly accurate in deriving meaning

from complex text

Performance •  AQL language optimized for

MapReduce

Text Analytics – highly accurate analysis of textual content

Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.

Unstructured text (document, email, etc)

Classification and Insight

BigInsights Text Analytics Development – AQL

Text Analytics Tooling

Result Viewer AQL Editor

Runtime Explain

Framework for machine learning (ML) implementations on Big Data •  Large, sparse data sets, e.g. 5B non-zero values

•  Runs on large BigInsights clusters with 1000s of nodes

Productivity •  Build and enhance predictive models directly on Big Data

•  High-level language – Declarative Machine Learning Language (DML)

•  E.g. 1500 lines of Java code boils down to 15 lines of DML code

•  Parallel SPSS data mining algorithms implementable in DML

Optimization •  Compile algorithms into optimized parallel code

•  For different clusters and different data characteristics

•  E.g. 1 hr. execution (hand-coded) down to 10 mins

Statistical and Predictive Analysis

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000

# non zeros (million)

Exe

cutio

n Ti

me

(sec

)Java Map-Reduce SystemML Single node R

Optimized performance for big data analytic workloads

Workload Optimization

Task Map (break task into small parts)

Adaptive Map (optimization — order small units of work)

Reduce (many results to a single result set)

Adaptive MapReduce

§  Algorithm to optimize execution time of multiple small jobs

§  Performance gains of 30% reduce overhead of task startup

Hadoop System Scheduler

§  Identifies small and large jobs from prior experience

§  Sequences work to reduce overhead

InfoSphere BigInsights – Embrace and Extend Hadoop

HDFS

Storage HBase

GPFS-SNC

Application

AdaptiveMR

Zook

eepe

r

Avro

Pig Hive Jaql

MapReduce

Flume

Data Sources/ Connectors

JDBC

Netezza BoardReader

DB2

Streams

Web Crawler

Oozie

Analytics Text Analytics ML Analytics Interface

Lucene

R

CSV / XML / JSON Data Stage SPSS

IBM

LZO

Com

pres

sion

BigSheets

BigIndex FLEX

Open Source

IBM

Web console •  Monitor cluster health •  Add / remove nodes •  Start / stop services •  Inspect job status •  Inspect workflow status •  Deploy apps •  Launch apps / jobs •  Work with distrib. file system •  Work with spreadsheet interface •  Support REST-based API •  . . .

Eclipse plug-ins •  Text analytics •  MapReduce programming •  Jaql development •  Hive query development

In the Cloud •  Via RightScale, or directly on Amazon, Rackspace, IBM

Smart Enterprise Cloud, or on private clouds.

•  Pay only for the resources used.

In the Virtual Classroom •  Free Hadoop Fundamentals training course

www.bigdatauniversity.com

•  e.g. BD105EN - Text Analytics Essentials

On Your Cluster •  Download Basic Edition from ibm.com.

In the Classroom •  Enroll in the InfoSphere BigInsights Essentials course.

Ways to get started with BigInsights

Free links to papers, demos, discussion forum, and more

http://www.ibm.com/developerworks/wiki/biginsights/

Visit the BigInsights technical portal . . . .

Built to analyze data in motion •  Multiple concurrent input streams

•  Massive scalability

Process and analyze a variety of data •  Structured, unstructured content, video,

audio

•  Advanced analytic operators

Streams – analytical platform for in-motion “Big Data”

BI / Reporting


Functional App

Industry App


Content Analytics





Accelerators


Hadoop System

Data Warehouse

Stream Computing

Current fact finding

Analyze data in motion – before it is stored

Low latency paradigm, push model

Data driven – bring the data to the query

Historical fact finding

Find and analyze information stored on disk

Batch paradigm, pull model

Query-driven: submits queries to static data

Traditional Computing Stream Computing

Query Data Results Data Query Results

Stream Computing – Analyze Data in Motion

Applications that require on-the-fly processing, filtering and analysis of streaming data •  Sensors: environmental, industrial, surveillance video, GPS, …

•  “Data exhaust”: network/system/web server/app server log files

•  High-rate transaction data: financial transactions, call detail records

Criteria: two or more of the following •  Messages are processed in isolation or in limited data windows

•  Sources include non-traditional data (spatial, imagery, text, …)

•  Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges

•  Data rates/volumes require the resources of multiple processing nodes

•  Analysis and response are needed with sub-millisecond latency

•  Data rates and volumes are too great for store-and-mine approaches

Why InfoSphere Streams?

Linear Scalability

§  Clustered deployments – unlimited scalability

Automated Deployment

§  Automatically optimize operator deployment across clusters

Performance Optimization

§  JVM Sharing – minimize memory use

§  Fuse operators on same cluster

§  Telco client – 25 Million messages per second

Analytics on Streaming Data

§  Analytic accelerators for a variety of data types

§  Optimized for real-time performance

Massively Scalable Stream Analytics

Visualization

Streams Runtime

Deployments

Sync Adapters

Analytic Operators

Source Adapters

Automated and Optimized Deployment

Streaming Data Sources

Streams Studio IDE

Streams approach illustrated

directory: ”/img" filename: “farm”

directory: ”/img" filename: “bird”

directory: ”/opt" filename: “java”

directory: ”/img" filename: “cat”

tuple

height: 640 width: 480 data:



Easy to extend: •  Built in adaptors •  Users add capability

with familiar C++ and Java

InfoSphere Streams for superior real time analytic processing

Compile groups of operators into single processes: •  Efficient use of cores •  Distributed execution •  Very fast data exchange •  Can be automatic or tuned •  Scaled with push of a button

Streams Processing Language (SPL) built for Streaming applications: •  Reusable operators •  Rapid application development •  Continuous “pipeline” processing

Flexible and high performance transport: •  Very low latency •  High data rates

Use the data that gives you a competitive advantage: •  Can handle virtually

any data type •  Use data that is too

expensive and time sensitive for traditional approaches

Easy to manage: •  Automatic placement •  Extend applications incrementall

without downtime •  Multi-user / multiple applications

Dynamic analysis: •  Programmatically change

topology at runtime •  Create new subscriptions •  Create new port properties

Streams Studio Integrated Development Environment

34

Operator Fusion •  Fine-grained operators

•  From small parts, make larger ones that fit

Code generation •  Generates code to match the underlying

runtime environment

•  Number of cores

•  Interconnect characteristics

•  Architecture-specific instructions

•  Driven by automatic profiling

•  Compiler-based optimization

•  Driven by incremental learning of application characteristics

Compiler Framework Logical app view

Physical app view

Enables scoring of real-time data in a Streams application •  Scoring is performed against a predefined model

•  Supports a variety of model types and scoring algorithms

Models represented in Predictive Model Markup Language (PMML) •  Standard for statistical and data mining models

•  XML Representation

Toolkit provides four Streams operators to enable scoring •  Classification

•  Clustering

•  Regression

•  Associations

The toolkit supports dynamic replacement of the PMML model used by an operator.

Streams Data Mining Toolkit

Without a Big Data Platform You Code…

IBM Big Data Platform

Streams provides development, deployment, runtime, and infrastructure services

“TerraEchos developers can deliver applications 45% faster due to the agility

of Streams Processing Language…” – Alex Philip, CEO and President, TerraEchos

Multithreading

Custom SQL and

Scripts

Performance Optimization

Debug

Application Management

Event Handling

Connectors

Check Pointing

Security

HA Accelerators and

Toolkits

Over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators

redbooks.ibm.com/abstracts/sg247970.html

This book is intended for professionals that require an understanding of how to process high volumes of streaming data or need information about how to implement systems to satisfy those requirements.

Streams Redbook

Traditional / Relational

Data Sources

Streams

Internet Scale

Traditional Warehouse

In-Motion Analytics

Data Analytics, Data Operations & Model

Building

Results Internet Scale

Database & Warehouse

At-Rest Data Analytics

Results

Ultra Low Latency Results

InfoSphere Big Insights

Non-Traditional/ Non-Relational Data Sources

Non-Traditional / Non-Relational Data Sources

Traditional/Relational Data

Sources

• Three routes to analytics

• Application and workload optimized appliances and systems

• Fast data movement and integration

Right-time actions are taken in the new BI/BA ecosystem

26.04.2012 © Copyright IBM Corporation 2012 39

Example of 360° customer view

Master Data Management

Business Processes"

Big Data Platform

Call Detail Records Call Behavior and

Experience Insight

Data Warehouse

Website Logs Social Media

Streaming Analytics

Internet Scale Analytics

Web Traffic and Social Media Insight

Events and Alerts

Information Integration

Cognos Consumer Insight

Campaign Management

Big Data Plattform der IBM InfoSphere BigInsights und InfoSphere Streams

2012.04.26 big insights streams im forum2

Self Improvement

available data

data formats

deep analytics data

big data plattform

ibm big data strategy

bday traditional data

value hadoop stream

hadoop computation model