ComputerWorld Big Data Forum 2015
TRANSCRIPT
© 2013 IBM Corporation © 2015 IBM Corporation
ComputerWorld HK 2015
Systems of Insights - making big data analytics more consumable
Steven Sit
Director of Products
Open Source Based Analytic Systems
Extend & Integrate
Big Data and Analytics
Actionable Insights @Point of Impact
Operational Systems
• Smarter Infrastructure • Security Intelligence • Enterprise Applications
Systems of Engagement
• Mobile Commerce
• Call Center • Social Business
Leaders are Leveraging Big Data to Deliver Actionable Insights at the Point of Impact
Information Ingestion and Operational Information
Landing and Archive Zone
Real-time In-memory Zone
Enterprise Warehouse & Mart Zone
Governance, Security and Business Continuity
Analytic Appliances
Big Data Platform Capabilities
All Data Sources Advanced Analytics Applications
Streaming Data
Text Data
Applications Data
Time Series
Geo Spatial
Relational
Social Network
Video & Image
Automated Process
Case Management
Analytic Applications
Cognitive: Learn Dynamically?
Prescriptive: Best Outcomes?
Predictive: What Could Happen?
Descriptive: What Has Happened?
Exploration & Discovery: What Do You Have?
Watson
Cloud Services
ISV Solutions
Alerts
Common Big Data Use Cases
Big Data Exploration: Find, visualize, understand all big data to improve decision making
Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real time
Big Data to Improve Health Care for Millions of Patients
Independence Blue Cross is leveraging Big Data to improve the lifestyle for those who are ill
• Map complex referral networks to identify cost-efficient providers
• Identify patients who are at the highest risk of being re-hospitalized within a short period of time
• Improve the effectiveness of chronic disease management, such as early detection of diabetes
• Analyze clinical data and scanned images to identify hip implant patients who have high-risk exposure to complications such as metallosis, infection and dislocation
Rain detected *vibrate windshield*
Rubbing sound * publish event*
Part failure prediction
75%
Drivers reduced fuel consumption with better driving behavior
• Improve safety & lower insurance premiums
• Preventive maintenance to improve car reliability
• Provide services such as accident and weather alerts
2.7M leads for upsell based on customer profiles
• Better accuracy in targeted marketing activities based on individual interests, intentions and goals
• Protect individual privacy in profiles
IBM Research
Counterparty Relationships from Regulatory Filings
Annual Report Loan Agreement
Proxy Statement Insider Transaction
Counterparty Relationships
Loan Exposure
Company
Person
Extract Integrate
Over 2,500 financial companies
Over 33,000 key officials in financial companies
Sample set of 1 million documents
Filing timeline: 2005 to 2014
SEC/FDIC Filings of Financial Companies
(Forms 10-K, 8-K, 10-Q, DEF 14A, 3/4/5, 13F, SC 13D, SC 13G, FDIC Call Reports)
• Regulatory
• Systemic Risk
• Investment Decisions
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Mr?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253
64.242.88.10 - - [07/Mar/2004:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem HTTP/1.1" 200 11382
64.242.88.10 - - [07/Mar/2004:16:24:16 -0800] "GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1" 200 4924
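The web server log entries above are semi-structured text. As a minimal sketch of how such entries might be turned into structured records before analysis, the following Python uses a regular expression. The field names, the `parse_log_line` helper and the choice of regex are illustrative assumptions, not something the slides prescribe. It also assumes each entry ends at the byte count that precedes the next IP address.

```python
import re
from datetime import datetime

# Hypothetical field names for an Apache-style access log entry.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamp looks like 07/Mar/2004:16:05:49 -0800
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# The first two entries from the slide.
SAMPLE_LINES = [
    '64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] '
    '"GET /twiki/bin/edit/Mr?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846',
    '64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] '
    '"GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523',
]

records = [parse_log_line(line) for line in SAMPLE_LINES]
```

Once parsed, fields such as `status` and `size` can be filtered or aggregated like ordinary table columns.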
SELECT FNAME, LNAME,
       (SELECT NAME
        FROM DEPARTMENT AS DEP
        WHERE DEP.DEPT_ID = EMP.DEPT_ID)
FROM EMPLOYEE AS EMP
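The query above uses a correlated scalar subquery to look up each employee's department name. To see what it returns, here is a small sketch that runs it against an in-memory SQLite database from Python. The table definitions, the sample rows and the added `ORDER BY` are assumptions made for illustration. The slides name only the tables and columns, and the engine in the talk is not SQLite.

```python
import sqlite3

# The query from the slide, with an ORDER BY added so the output is deterministic.
QUERY = """
SELECT FNAME, LNAME,
       (SELECT NAME
        FROM DEPARTMENT AS DEP
        WHERE DEP.DEPT_ID = EMP.DEPT_ID)
FROM EMPLOYEE AS EMP
ORDER BY FNAME
"""


def run_example():
    """Build hypothetical sample tables and return the query result rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE DEPARTMENT (DEPT_ID INTEGER PRIMARY KEY, NAME TEXT);
        CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, DEPT_ID INTEGER);
        INSERT INTO DEPARTMENT VALUES (1, 'Research');
        INSERT INTO DEPARTMENT VALUES (2, 'Sales');
        INSERT INTO EMPLOYEE VALUES ('Ada', 'Lee', 1);
        INSERT INTO EMPLOYEE VALUES ('Ben', 'Wong', 2);
        INSERT INTO EMPLOYEE VALUES ('Cy', 'Chan', NULL);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


rows = run_example()
```

An employee with no matching department still appears in the result, with a NULL (Python `None`) department name. That is the practical difference between this form and an inner join.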
The current ecosystem is challenged and slowed by fragmented and duplicated efforts.
The ODP Core will take the guesswork out of the process and accelerate many use cases by running on a common platform.
This frees enterprises and ecosystem vendors to focus on building business-driven applications.
Open Data Platform – Stakeholders Across the Hadoop Spectrum
Representation across the Hadoop ecosystem…
• Hadoop distribution vendors
• Software application providers
• Systems integrators and consultants
• Hardware vendors
… who all believe in the need for a community-based effort to standardize Hadoop, which will lead to improved adoption
The Open Data Platform Initiative Will Benefit Customers
ODP consortium’s goal: anyone using a Hadoop distribution based on the ODP Core will be able to develop Hadoop products or apps with assurances of seamless deployment and compatibility.
Apache Hadoop Open Source Ecosystem
HBase
Spark
Flume
Hive Pig
Sqoop
HCatalog
Solr/Lucene
HDFS
YARN
MapReduce
Ambari
Initial ODP Scope
Zookeeper
Oozie
Knox
Slider
Open Platform 4.0 with Apache Hadoop
Pivotal HD 3.0
HDP 2.2
R, Text, SQL, Sheets, Match
Built on the IBM Open Platform with Apache Hadoop
POSIX Distributed Filesystem
Multi-workload, Multi-tenant scheduling
IBM BigInsights Enterprise Management
IBM BigInsights Analyst
Big SQL
BigSheets
IBM BigInsights Data Scientist
Big SQL
BigSheets
Big R + ML
Text Analytics
In-Memory Performance
Ease of Development
Easier APIs: Python, Scala, Java
Resilient Distributed Datasets
Unify processing
Batch, Interactive, Iterative Algorithms, Micro-batch
Polyglot Workloads
Reliability, Resiliency, Security
Multiple data sources and applications
Multiple users
Unlimited Scale
Enterprise Platform
Wide Range of Applications
Files, Databases, Semi-structured
CHALLENGES
Intensive Java programming and knowledge of parallelism are required
Few abstractions are available, and the ones that do exist perform poorly
No in-memory framework: when tasks complete, data sets are no longer in memory
Each map task is a disk write, and each new map is a disk read, in a workflow
Suitable for batch, which is what it was built for, but use cases are changing
Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
- Fast
• Leverages aggressively cached in-memory distributed computing and JVM threads
• Faster than MapReduce
- Generality
• Covers a wide range of workloads
• Provides SQL, streaming and complex analytics
- Ease of use (for programmers)
• Spark is written in Scala, an object-oriented, functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or cloud
Logistic regression in Hadoop and Spark
from http://spark.apache.org
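Logistic regression is a useful benchmark here because it is iterative: every training step makes a full pass over the same data set. The following is a minimal sketch in plain Python (not Spark code, and not the benchmark behind the chart) showing that access pattern. The function names, the toy data and the hyperparameters are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent; points is a list of (features, label) with label 0 or 1."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    for _ in range(iterations):
        # Every iteration re-reads the whole data set. If the data lives in
        # memory this is cheap; if each pass reloads it from disk, it is not.
        gradient = [0.0] * dim
        for features, label in points:
            z = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights


def predict(weights, features):
    z = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical toy data: each feature vector is [bias term, x].
SAMPLE_POINTS = [
    ([1.0, -2.0], 0),
    ([1.0, -1.0], 0),
    ([1.0, 1.0], 1),
    ([1.0, 2.0], 1),
]

weights = train_logistic_regression(SAMPLE_POINTS)
```

In this sketch the data set is an ordinary Python list, so it stays in memory across all 200 passes. A workflow that writes to and reads from disk between steps would pay that I/O cost on each pass, which is the behaviour the preceding challenges describe.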
Spark Resilient Distributed Datasets
Slave node 1
c3 d2
a2 b1
partition3
partition1
partition2
Slave node 2
c2 d1
a1 b2
partition1
partition3
Slave node 3
c1 d2
a3 b3
partition2
partition2
partition1
RDD1
RDD2
RDD3
Spark RDD: In-memory distribution
HDFS: On-disk distribution
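To make the idea of a partitioned, in-memory, resilient data set more concrete, here is a toy Python sketch. It is not the Spark API. `ToyRDD` and its methods are hypothetical names, and the sketch runs in a single process. It assumes the commonly described RDD behaviour: transformations are recorded lazily, partitions are cached in memory once computed, and a lost partition is recomputed from its parent rather than restored from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions are computed lazily and cached in memory."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> list of records
        self._cache = {}         # partition index -> materialized records

    @classmethod
    def from_list(cls, records, num_partitions):
        # Round-robin split of the source records into partitions.
        chunks = [records[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda index: list(chunks[index]))

    def map(self, func):
        # A transformation only records lineage (parent + function); nothing runs yet.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda index: [func(x) for x in parent.partition(index)],
        )

    def partition(self, index):
        # Materialize on first access, then serve from the in-memory cache.
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a node failure that drops one cached partition.
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
squared = base.map(lambda x: x * x)
first_result = squared.collect()

# Drop one partition, then read it again: it is rebuilt from the parent.
squared.lose_partition(1)
rebuilt = squared.partition(1)
```

The point of the design is that recovery needs only the recorded transformation and the parent partition, not a second copy of the derived data.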
IBM Software Defined Infrastructure
Example Applications
High Performance Analytics (Low Latency Parallel)
Homegrown
Hadoop / Big Data
Application Frameworks (Long Running Services)
High Performance Computing (Batch, Serial, MPI, Workflow)
Homegrown
IBM Platform Computing
Workload Engines
Resource Management
Scheduling & Acceleration With Infrastructure Sharing
MapReduce (Symphony)
Symphony
Application Service Controller
LSF
Create Apps Quickly with Prebuilt Services
Security Services
Web and application services
Cloud Integration Services
Mobile Services
Database Services
Big Data Services
Internet of Things Services
Watson Services
DevOps Services
A full range of capabilities to suit any great idea
Choice: runtimes, services, and tooling up to you
Industry Leading IBM Capabilities
– Services leveraging the depth of IBM software
– Full range of capabilities
Completeness
– Open source platform and services
– Third party to enable key use cases
Analytic Services:
– dashDB, DataWorks, EHaaS,
– Watson, Cloudant, DBaaS
– Spark as a Service (SaaS)
– +++
IBM Announces First 20 Industry Analytics Solutions
Behavior-based Customer Insight for Banking
Multi-Channel Fraud Analytics for Banking
Behavior-based Client Insight for Wealth Management
Trade Compliance Analytics for Financial Markets
Regulatory Compliance & Control for Financial Markets
AML Monitoring & Analytics for Financial Markets
Behavior-based Customer Insight for Insurance
Producer Lifecycle & Credential Management for Insurance
Property & Casualty Claims Fraud for Insurance
Lift Analytics for Retail
Social Merchandising for Retail
Behavior-based Customer Insight for Telecommunications
Asset Analytics for Transmission & Distribution for Energy & Utilities
Asset Analytics for Rotational Equipment for Oil & Gas
Asset Analytics for Robotics Equipment for Automotive
Threat Intelligence Analysis for National Security & Defense
COPLINK on Cloud for Law Enforcement
Behavior-based Audience Insight for Media & Entertainment
Social Merchandising for Consumer Products
Customer Experience Analytics for Telecommunications