2012.04.26 big insights streams im forum2
DESCRIPTION
TRANSCRIPT
Big Data Plattform der IBM
InfoSphere BigInsights und InfoSphere Streams
Big Data Plattform der IBM
InfoSphere BigInsights und InfoSphere Streams
Wilfried Hoge – Leading Technical Sales Professional [email protected] twitter.com/wilfriedhoge
New analytic applications drive the requirements for a big data platform • Integrate and manage the full
variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance
IBM Big Data Strategy: Move the Analytics Closer to the Data
BI / Reporting
Exploration / Visualization
Functional App
Industry App
Predictive Analytics
Content Analytics
Analytic Applications
IBM Big Data Platform Systems
Management Application
Development Visualization & Discovery
Accelerators
Information Integration & Governance
Hadoop System
Stream Computing
Data Warehouse
Up to 10,000 Times larger
Up to 10,000 times faster
Traditional Data Warehouse and Business Intelligence
Dat
a Sc
ale
Dat
a Sc
ale
yr mo wk day hr min sec … ms µs
Exa
Peta
Tera
Giga
Mega
Kilo
Decision Frequency Occasional Frequent Real-time
Data in Motion
Dat
a at
Res
t
Volume and Velocity – two dimensions for Big Data
Telco Promotions
100,000 records/sec, 6B/day 10 ms/decision 270TB for Deep Analytics
DeepQA 100s GB for Deep Analytics 3 sec/decision Power7, 15TB memory
Wind Turbine Placement & Operation PBs of data Analysis time to 3 days from 3 weeks 1220 IBM iDataPlex nodes
Security
600,000 records/sec, 50B/day 1-2 ms/decision 320TB for Deep Analytics
26.04.2012 © Copyright IBM Corporation 2012 4
Based on open source & IBM technologies
Distinguishing characteristics • Built-in analytics . . . enhances business
knowledge
• Enterprise software integration . . . complements and extends existing capabilities
• Production-ready platform with tooling for analysts, developers, and administrators. . . speeds time-to-value and simplifies development/maintenance
IBM advantage • Combination of software, hardware,
services and advanced research
BigInsights – analytical platform for persistent “Big Data”
BI / Reporting
Exploration / Visualization
Functional App
Industry App
Predictive Analytics
Content Analytics
Analytic Applications
IBM Big Data Platform Systems
Management Application
Development Visualization & Discovery
Accelerators
Information Integration & Governance
Stream Computing
Data Warehouse
Hadoop System
Flexible, enterprise-class support for processing large volumes of data • Based on Google’s MapReduce technology
• Inspired by Apache Hadoop; compatible with its ecosystem and distribution
• Well-suited to batch-oriented, read-intensive applications
• Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner • CPU + disks = “node”
• Nodes can be combined into clusters
• New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
About the BigInsights Platform
Hadoop computation model • Data stored in a distributed file system spanning many inexpensive computers
• Bring function to the data
• Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data
Hadoop Explained – Map Reduce
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set Result Set
Shuffle
public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text();
public void map(Object key, Text val, Context StringTokenizer itr = new StringTokenizer(val.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWrita private IntWritable result = new Intritable();
public void reduce(Text key, Iterable<IntWritable> val, Context context){ int sum = 0; for (IntWritable v : val) { sum += v.get(); . . .
Distribute map tasks to cluster
Hadoop Data Nodes
Technical differentiators • Built-in analytics
• Text processing engine, annotators, Eclipse tooling • Statistical and predictive analysis • Interface to project R (statistical platform)
• Enterprise software integration (DBMS, warehouse) • Spreadsheet-style analytical tool for analysts • Ready-made business process accelerators • Integrated installation of supported open source and IBM components • Web Console for administration and application access • Platform enrichment: additional security, performance features, . . . • Standard IBM licensing agreement and world-class support
Business benefits • Quicker time-to-value due to IBM technology and support • Reduced operational risk • Enhanced business knowledge with flexible analytical platform • Leverages and complements existing software assets
BigInsights – Value Beyond Open Source
Seamless process for single node and cluster environments
Integrated installation of all selected components
Post-install validation of IBM and open source components
Web Installation Tool
No need to iteratively download, configure, and test multiple open source projects and their pre-requisite software.
Manage BigInsights • Inspect system health
• Add / drop nodes
• Start / stop services
• Run / monitor jobs (applications)
• Explore / modify file system
Launch applications • Spreadsheet-like analysis tool
• Pre-built applications (IBM supplied or user developed)
Publish applications
Leverage community resources
Web Console
BigSheets is a visual tool for data manipulation and prototyping • Allows more users to do more work, more quickly
• Simply stated, growing an army of MapReduce developers is not cost effective
• In your BI environments you have a ratio of 30+ report users for every complex SQL developer. We need to support the same ratios with BigInsights
Sample Uses • Data exploration and visualization
• Visual job creation
BigSheets
BigSheets – Spreadsheet-style Data Analysis and Discovery
BigSheets – Visualization
Reusable software assets based on customer engagements • Useful for starting point for various applications
• Can be customized by BigInsights application developers as needed
• Accessible through Web console
Available assets • Data export (to relational DBMS, files, HBase)
• Data import (from relational DBMS, files)
• Web crawler, Twitter crawler
• Boardreader.com support (Web forum search engine)
• Ad hoc queries for Jaql, Hive, Pig
• TeraGen-TeraSort, WordCount sample applications
Quick start applications or “apps”
Running Applications from the Web Console
Develop Hive with the SQL Editor and view results
Build a Big Data Program – Map Reduce example
Eclipse based development tools For JAQL, Hive, Java MapReduce, Text Analytics
Text analytics – Distill structured information from unstructured data • Rich annotator library supports multiple languages
• Declarative Information Extraction (IE) system based on an algebraic framework
• Richer, cleaner rule semantics
• Better performance through optimization
Developed at IBM Research since 2004
Embedded in several IBM products • Lotus Notes
• Cognos Consumer Insights
• InfoSphere Streams
• Compose operators to build complex annotators
Text Analytics in BigInsights
Pre-configured text annotators ready for distributed processing on Big Data • City, County, Zipcode, Address, Maplocation, StateOrProvince, Country, Continent,
EmailAddress, Person, Organizaion, DateTime, URL, Compane Names, Merger, Acquisition, Alliance, etc..
Support for native languages including double-byte
Turns disparate words into measurable insights
Identify positive or negative sentiment,
NLP-based analytics, define variables, macros
and rules.
Physically assemble data, standardize
formats, address auto-identify language,
process punctuation and non-grammatical
characters, standardize spelling.
Part-of-speech identification, standard and
customized extraction dictionaries, proper noun
identification, concept categorization, synonyms,
exclusions, multi-terms, regular expressions, fuzzy-
matching
Iterative classification using automated and manual techniques.
Concept derivation & inclusion, semantic networks and co-occurrence rules
Reporting/Monitoring social commentary, combination w/structured data, clustering,
associated concepts, correlated concepts, auto-
classification of documents, sites, posts.
How it works • Parses text and detects meaning with
annotators
• Understands the context in which the text is analyzed
• Hundreds of pre-built annotators for names, addresses, phone numbers, along others
Accuracy • Highly accurate in deriving meaning
from complex text
Performance • AQL language optimized for
MapReduce
Text Analytics – highly accurate analysis of textual content
Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
Unstructured text (document, email, etc)
Classification and Insight
BigInsights Text Analytics Development – AQL
Text Analytics Tooling
Result Viewer AQL Editor
Runtime Explain
Framework for machine learning (ML) implementations on Big Data • Large, sparse data sets, e.g. 5B non-zero values
• Runs on large BigInsights clusters with 1000s of nodes
Productivity • Build and enhance predictive models directly on Big Data
• High-level language – Declarative Machine Learning Language (DML)
• E.g. 1500 lines of Java code boils down to 15 lines of DML code
• Parallel SPSS data mining algorithms implementable in DML
Optimization • Compile algorithms into optimized parallel code
• For different clusters and different data characteristics
• E.g. 1 hr. execution (hand-coded) down to 10 mins
Statistical and Predictive Analysis
0
500
1000
1500
2000
2500
3000
3500
4000
4500
0 500 1000 1500 2000
# non zeros (million)
Exe
cutio
n Ti
me
(sec
)Java Map-Reduce SystemML Single node R
Optimized performance for big data analytic workloads
Workload Optimization
Task Map (break task into small parts)
Adaptive Map (optimization — order small units of work)
Reduce (many results to a single result set)
Adaptive MapReduce
§ Algorithm to optimize execution time of multiple small jobs
§ Performance gains of 30% reduce overhead of task startup
Hadoop System Scheduler
§ Identifies small and large jobs from prior experience
§ Sequences work to reduce overhead
InfoSphere BigInsights – Embrace and Extend Hadoop
HDFS
Storage HBase
GPFS-SNC
Application
AdaptiveMR
Zook
eepe
r
Avro
Pig Hive Jaql
MapReduce
Flume
Data Sources/ Connectors
JDBC
Netezza BoardReader
DB2
Streams
Web Crawler
Oozie
Analytics Text Analytics ML Analytics Interface
Lucene
R
CSV / XML / JSON Data Stage SPSS
IBM
LZO
Com
pres
sion
BigSheets
BigIndex FLEX
Open Source
IBM
Web console • Monitor cluster health • Add / remove nodes • Start / stop services • Inspect job status • Inspect workflow status • Deploy apps • Launch apps / jobs • Work with distrib. file system • Work with spreadsheet interface • Support REST-based API • . . .
Eclipse plug-ins • Text analytics • MapReduce programming • Jaql development • Hive query development
In the Cloud • Via RightScale, or directly on Amazon, Rackspace, IBM
Smart Enterprise Cloud, or on private clouds.
• Pay only for the resources used.
In the Virtual Classroom • Free Hadoop Fundamentals training course
www.bigdatauniversity.com
• e.g. BD105EN - Text Analytics Essentials
On Your Cluster • Download Basic Edition from ibm.com.
In the Classroom • Enroll in the InfoSphere BigInsights Essentials course.
Ways to get started with BigInsights
Free links to papers, demos, discussion forum, and more
http://www.ibm.com/developerworks/wiki/biginsights/
Visit the BigInsights technical portal . . . .
Built to analyze data in motion • Multiple concurrent input streams
• Massive scalability
Process and analyze a variety of data • Structured, unstructured content, video,
audio
• Advanced analytic operators
Streams – analytical platform for in-motion “Big Data”
BI / Reporting
Exploration / Visualization
Functional App
Industry App
Predictive Analytics
Content Analytics
Analytic Applications
IBM Big Data Platform Systems
Management Application
Development Visualization & Discovery
Accelerators
Information Integration & Governance
Hadoop System
Data Warehouse
Stream Computing
Current fact finding
Analyze data in motion – before it is stored
Low latency paradigm, push model
Data driven – bring the data to the query
Historical fact finding
Find and analyze information stored on disk
Batch paradigm, pull model
Query-driven: submits queries to static data
Traditional Computing Stream Computing
Query Data Results Data Query Results
Stream Computing – Analyze Data in Motion
Applications that require on-the-fly processing, filtering and analysis of streaming data • Sensors: environmental, industrial, surveillance video, GPS, …
• “Data exhaust”: network/system/web server/app server log files
• High-rate transaction data: financial transactions, call detail records
Criteria: two or more of the following • Messages are processed in isolation or in limited data windows
• Sources include non-traditional data (spatial, imagery, text, …)
• Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges
• Data rates/volumes require the resources of multiple processing nodes
• Analysis and response are needed with sub-millisecond latency
• Data rates and volumes are too great for store-and-mine approaches
Why InfoSphere Streams?
Linear Scalability
§ Clustered deployments – unlimited scalability
Automated Deployment
§ Automatically optimize operator deployment across clusters
Performance Optimization
§ JVM Sharing – minimize memory use
§ Fuse operators on same cluster
§ Telco client – 25 Million messages per second
Analytics on Streaming Data
§ Analytic accelerators for a variety of data types
§ Optimized for real-time performance
Massively Scalable Stream Analytics
Visualization
Streams Runtime
Deployments
Sync Adapters
Analytic Operators
Source Adapters
Automated and Optimized Deployment
Streaming Data Sources
Streams Studio IDE
Streams approach illustrated
directory: ”/img" filename: “farm”
directory: ”/img" filename: “bird”
directory: ”/opt" filename: “java”
directory: ”/img" filename: “cat”
tuple
height: 640 width: 480 data:
height: 1280 width: 1024 data:
height: 640 width: 480 data:
Easy to extend: • Built in adaptors • Users add capability
with familiar C++ and Java
InfoSphere Streams for superior real time analytic processing
Compile groups of operators into single processes: • Efficient use of cores • Distributed execution • Very fast data exchange • Can be automatic or tuned • Scaled with push of a button
Streams Processing Language (SPL) built for Streaming applications: • Reusable operators • Rapid application development • Continuous “pipeline” processing
Flexible and high performance transport: • Very low latency • High data rates
Use the data that gives you a competitive advantage: • Can handle virtually
any data type • Use data that is too
expensive and time sensitive for traditional approaches
Easy to manage: • Automatic placement • Extend applications incrementall
without downtime • Multi-user / multiple applications
Dynamic analysis: • Programmatically change
topology at runtime • Create new subscriptions • Create new port properties
Streams Studio Integrated Development Environment
34
Operator Fusion • Fine-grained operators
• From small parts, make larger ones that fit
Code generation • Generates code to match the underlying
runtime environment
• Number of cores
• Interconnect characteristics
• Architecture-specific instructions
• Driven by automatic profiling
• Compiler-based optimization
• Driven by incremental learning of application characteristics
Compiler Framework Logical app view
Physical app view
Enables scoring of real-time data in a Streams application • Scoring is performed against a predefined model
• Supports a variety of model types and scoring algorithms
Models represented in Predictive Model Markup Language (PMML) • Standard for statistical and data mining models
• XML Representation
Toolkit provides four Streams operators to enable scoring • Classification
• Clustering
• Regression
• Associations
The toolkit supports dynamic replacement of the PMML model used by an operator.
Streams Data Mining Toolkit
Without a Big Data Platform You Code…
IBM Big Data Platform
Streams provides development, deployment, runtime, and infrastructure services
“TerraEchos developers can deliver applications 45% faster due to the agility
of Streams Processing Language…” – Alex Philip, CEO and President, TerraEchos
Multithreading
Custom SQL and
Scripts
Performance Optimization
Debug
Application Management
Event Handling
Connectors
Check Pointing
Security
HA Accelerators and
Toolkits
Over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators
redbooks.ibm.com/abstracts/sg247970.html
This book is intended for professionals that require an understanding of how to process high volumes of streaming data or need information about how to implement systems to satisfy those requirements.
Streams Redbook
Traditional / Relational
Data Sources
Streams
Internet Scale
Traditional Warehouse
In-Motion Analytics
Data Analytics, Data Operations & Model
Building
Results Internet Scale
Database & Warehouse
At-Rest Data Analytics
Results
Ultra Low Latency Results
InfoSphere Big Insights
Non-Traditional/ Non-Relational Data Sources
Non-Traditional / Non-Relational Data Sources
Traditional/Relational Data
Sources
• Three routes to analytics
• Application and workload optimized appliances and systems
• Fast data movement and integration
Right-time actions are taken in the new BI/BA ecosystem
26.04.2012 © Copyright IBM Corporation 2012 39
Example of 360° customer view
Master Data Management
Business Processes"
Big Data Platform
Call Detail Records Call Behavior and
Experience Insight
Data Warehouse
Website Logs Social Media
Streaming Analytics
Internet Scale Analytics
Web Traffic and Social Media Insight
Events and Alerts
Information Integration
Cognos Consumer Insight
Campaign Management
Big Data Plattform der IBM InfoSphere BigInsights und InfoSphere Streams