an analytics platform for connected vehicles
Predictive Maintenance in Connected Vehicles
Frank McQuillan June 21, 2016
Big Data and Analytics Platform
"The primary goal at the moment is predictive maintenance, being able to detect defects at the earliest stage. We have to find the right correlation patterns … and incoming data to predict upcoming malfunctions and their consequences."
– Dirk Ruger, Head of after-sale analytics and digital processes at BMW
http://www.v3.co.uk/v3-uk/news/2407083/big-data-analytics-driving-predictive-car-maintenance-at-bmw
1. This is a system design problem
1. This is a hard system design problem
2. Open source is the starting point
2. Open source is just the starting point
3. There is no single best design
3. There is an acceptable design
(Back End) Platform Characteristics
• Connectivity to multiple data sources
• Data ingestion
• Real-time streaming analytics
• Persist data to big data store
• Tools for data exploration, build/score data science models
• Build applications that consume model outputs
• Deploy and manage those applications in the cloud (operationalization)
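The "real-time streaming analytics" characteristic can be illustrated with a minimal sketch. It assumes a single numeric sensor reading per message and flags values that deviate strongly from a short rolling window. The class name, window size, threshold and sample readings are all hypothetical and are not taken from the talk.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags readings that deviate strongly from a rolling window (illustrative only)."""

    def __init__(self, window=5, threshold=3.0):
        self.window = deque(maxlen=window)  # most recent readings
        self.threshold = threshold          # z-score cut-off (assumed value)

    def observe(self, value):
        """Return True if value is an outlier relative to the current window."""
        is_anomaly = False
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly


# Hypothetical engine temperature stream (degrees Celsius).
readings = [90.0, 91.0, 89.5, 90.5, 90.0, 90.2, 120.0, 90.1]
detector = RollingAnomalyDetector(window=5, threshold=3.0)
flags = [detector.observe(r) for r in readings]
print(flags)
```

In a real platform this logic would sit behind the ingestion layer and run per vehicle; the sketch only shows the windowed computation itself.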
Reference Architecture
[Diagram. Labels: vehicle data (streaming); connector; Rabbit/Kafka source; Spring Cloud Data Flow; exploratory data analysis & model training; training (offline); scoring (online); %%publish model info.; Microservices (Spring Boot) with /load_model and /score_model; web or mobile app dashboard]
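The split between offline training and online scoring can be sketched as two handlers mirroring the /load_model and /score_model endpoints named in the diagram. The talk places these in Spring Boot microservices; the version below is plain Python for illustration only. It assumes the published model is a logistic regression serialized as JSON. The payload layout, feature names and coefficient values are hypothetical.

```python
import json
import math

# In-memory model cache, filled by load_model (hypothetical design).
_MODEL = {}


def load_model(model_json):
    """Sketch of a /load_model handler: cache published coefficients."""
    model = json.loads(model_json)
    _MODEL["intercept"] = float(model["intercept"])
    _MODEL["coef"] = {name: float(w) for name, w in model["coef"].items()}
    return sorted(_MODEL["coef"])


def score_model(features):
    """Sketch of a /score_model handler: failure probability for one reading."""
    if not _MODEL:
        raise RuntimeError("no model loaded")
    z = _MODEL["intercept"] + sum(
        w * features.get(name, 0.0) for name, w in _MODEL["coef"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


# Model info as it might be published after offline training (made-up values).
published = json.dumps(
    {"intercept": -4.0, "coef": {"engine_temp_c": 0.03, "vibration_g": 1.5}}
)
loaded = load_model(published)
risk = score_model({"engine_temp_c": 100.0, "vibration_g": 1.0})
print(loaded, round(risk, 4))
```

Keeping the scoring path free of any database call is one plausible reason for publishing coefficients to the service: each streaming record can then be scored from memory.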
Apache HAWQ (incubating)
Pivotal HDB
What is Apache HAWQ / Pivotal HDB?
http://hawq.incubator.apache.org/
Journey to Open Source
[Timeline, 1986 to 2015. Events labeled on the slide:]
• Michael Stonebraker develops Postgres at UCB
• Postgres adds support for SQL
• Open Source PostgreSQL
• PostgreSQL 7.0 released
• PostgreSQL 8.0 released
• Greenplum forks PostgreSQL
• Hadoop 1.0 Released
• MADlib launched
• HAWQ launched
• Hadoop 2.0 Released
• HAWQ & MADlib go Apache
• Greenplum open sourced
HAWQ / Pivotal HDB Advantages
• Advanced Analytics Performance: Exceptional MPP performance, low latency, ACID reliability, data federation
• Most Complete Language Compliance: Higher degree of SQL compatibility, SQL-92, 99, 2003, OLAP (leverage existing SQL skills)
• Advanced Query Optimizer: Maximize performance and do advanced queries with confidence
• Elastic Architecture for Scalability: Scale-up/down or scale-in/out, expand/shrink clusters on the fly
• Integrated w/ MADlib Machine Learning: Advanced MPP analytics, data science at scale, directly on Hadoop data
HAWQ Extension Framework (PXF)
• Enables connectivity between Pivotal HDB and other stores (Hive, HBase, HDFS files)
• Provides an extensible framework to add support for custom services
• Operates as a separate service in Hadoop
• Low latency on large data sets
• Considers cost model of federated sources
[Diagram. Labels: HAWQ; PXF; HDFS (Hadoop Distributed File System); Hive; HBase; Services]
Greenplum Database
Greenplum
Greenplum Database
• SQL Based:
  – Load and query like any SQL database
  – MPP shared-nothing parallelization
  – Automatic data distribution without tuning
• Linear Scalability:
  – Linear scaling of capacity, loading, users and concurrency
• Analytics Optimized:
  – Analytics-oriented query optimization, write locking, storage management, data compression, etc.
• Extensible for Analytics:
  – MADlib machine learning library
http://greenplum.org/
MPP Shared Nothing Architecture
• Flexible framework for processing large datasets
• Master Host and Standby Master Host: Master coordinates work with Segment Hosts
• Interconnect: High speed interconnect for continuous pipelining of data processing
• Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
[Diagram. Labels: SQL; Master Host; Standby Master; Interconnect; Segment Hosts node1, node2, node3, … nodeN, each with four Segment Instances]
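The shared-nothing idea (each segment owns its slice of the data, and the master only coordinates and merges) can be sketched in a few lines. This is a deliberately simplified illustration, not Greenplum's implementation. It assumes rows are placed by a CRC32 hash of a distribution key, uses threads as stand-ins for segment instances, and uses made-up vehicle readings.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # assumed segment count, for illustration only


def segment_for(key):
    """Pick a segment from a hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Place each (vehicle_id, value) row on exactly one segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for vehicle_id, value in rows:
        segments[segment_for(vehicle_id)].append((vehicle_id, value))
    return segments


def segment_aggregate(rows):
    """Work done locally on one segment: per-vehicle sum and count."""
    partial = defaultdict(lambda: [0.0, 0])
    for vehicle_id, value in rows:
        partial[vehicle_id][0] += value
        partial[vehicle_id][1] += 1
    return dict(partial)


def master_query(segments):
    """Master role: fan the query out to segments, then merge the results."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_aggregate, segments))
    result = {}
    for partial in partials:
        # All rows for a key share one segment, so partials never overlap.
        for vehicle_id, (total, count) in partial.items():
            result[vehicle_id] = total / count
    return result


rows = [("car-a", 90.0), ("car-b", 70.0), ("car-a", 110.0),
        ("car-c", 80.0), ("car-b", 90.0)]
segments = distribute(rows)
averages = master_query(segments)
print(averages)
```

Because the grouping key equals the distribution key, no data has to move between segments. A query grouping on a different column would need a redistribution step over the interconnect, which this sketch omits.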
Apache MADlib (Incubating)
Distributed In-Database Machine Learning
Scalable, In-Database Machine Learning
• Open source: https://github.com/apache/incubator-madlib
• Downloads and docs: http://madlib.incubator.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/
History
The MADlib project was initiated in 2011 by EMC/Greenplum architects and Joe Hellerstein from the University of California, Berkeley.
UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
Functions (April 2016)

Linear Systems
• Sparse and Dense Solvers
• Linear Algebra

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Ordinal Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects

Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes

Descriptive Statistics
• Sketch-Based Estimators
  – CountMin (Cormode-Muth.)
  – FM (Flajolet-Martin)
  – MFV (Most Frequent Values)
• Correlation and Covariance
• Summary

Utility Modules
• Array and Matrix Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• Data Preparation
• PMML Export
• Conjugate Gradient
• Stemming

Inferential Statistics
• Hypothesis Tests

Time Series
• ARIMA

Path Functions
• Operations on Pattern Matches
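As an illustration of what a sketch-based estimator such as CountMin does, here is a minimal Count-Min sketch in plain Python. It is not MADlib's implementation or API. The table dimensions, the SHA-256 based hashing and the sample diagnostic codes are assumptions chosen for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table (illustrative only)."""

    def __init__(self, width=64, depth=4):
        self.width = width    # counters per row (assumed size)
        self.depth = depth    # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical stream of diagnostic trouble codes reported by vehicles.
codes = ["P0300"] * 5 + ["P0171"] * 2 + ["P0420"]
sketch = CountMinSketch()
for code in codes:
    sketch.add(code)
print(sketch.estimate("P0300"), sketch.estimate("P0171"))
```

The estimate never falls below the true count and may exceed it when items collide. Memory stays fixed regardless of how many distinct items arrive, which is the property that makes such estimators attractive on large data sets.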
MADlib Features
• Better parallelism – Algorithms designed to leverage MPP and Hadoop architecture
• Better scalability – Algorithms scale as your data set scales
• Better predictive accuracy – Can use all data, not a sample
• ASF open source (incubating) – Active and growing community
Supported Platforms
• Greenplum Database
• PostgreSQL
• Apache HAWQ (incubating)
Scale-out machine learning on open source, MPP execution engines.
Spring Cloud Data Flow
https://cloud.spring.io/spring-cloud-dataflow/
Apache Geode (incubating)
Pivotal Gemfire
An in-memory, distributed database with strong consistency built to support low latency transactional applications at extreme scale.
Apache Geode / Pivotal Gemfire
http://geode.incubator.apache.org/
Apache Geode / Pivotal Gemfire
• Cloud-ready, infrastructure agnostic
• Horizontal scalability
• Automatic failover
• Reliable eventing model
• Multi-site high availability
• Seamless integration with analytical databases
Pivotal Big Data Suite
Open source data management portfolio

Complete platform
• Hadoop Native SQL
• Deployment options
• Based on open source
• Flexible licensing
• Advanced data services

PIVOTAL GREENPLUM DATABASE: Data warehouse database based on open source Greenplum Database
PIVOTAL HDB: Open source analytical database for Apache Hadoop based on Apache HAWQ
PIVOTAL GEMFIRE: Open source application and transaction data grid based on Apache Geode
Other Architectures
• https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/2016-06-06_berlin_buzzwords_nat_poc2indus_nonne.pdf
• http://enterprise.microsoft.com/en-us/industries/discrete-manufacturing/learning-leaders-manufacturers-using-iot-reimagine-connected-services-customer-experiences/
• https://www.ge.com/digital/sites/default/files/predix-platform-brief-ge-digital.pdf
• https://www.mapr.com/developercentral/lambda-architecture
Platform Challenges
• Managing complexity
• Integration
• Open source: how to choose and keep up?
• Data security and lineage
• IT and car development cycles are not in sync
• Multiple vendors involved (e.g., carmakers need mobile partners)
Platform Challenges (2)
• Car dealers need to connect to the platform
• Who will pay for connected car services?
• Fleet management
This is a system design problem that YOU can help solve