TRANSCRIPT
Automated Analytics at Scale: Model Management in Streaming Big Data Architectures
Chris Kang
Copyright © 2016 Accenture All rights reserved.
• Machine learning allows organizations to proactively discover patterns and predict outcomes for their operations, and improving those insights requires deploying better analytical models on their data.
• Finding the best analytical model requires running thousands of hypotheses on various datasets and comparing the resulting models in a brute-force approach.
• Currently, no model management framework exists: that is, an agnostic tool or framework that manages and orchestrates the entire lifecycle of a model.
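As a concrete illustration of that brute-force comparison, the sketch below trains every hyperparameter combination and keeps the lowest-error candidate. It is a minimal, pure-Python sketch: the `train` function, its toy error surface, and the parameter names are all stand-ins for a real training call (e.g., a Spark MLlib job).

```python
import itertools

def train(alpha, depth, data):
    """Stand-in for a real training call; returns a validation error.
    Here the error is a toy quadratic surface so the sketch is runnable."""
    return (alpha - 0.3) ** 2 + (depth - 4) ** 2 / 100 + 0 * len(data)

def brute_force_search(data, alphas, depths):
    """Train every hyperparameter combination, keep the lowest-error model."""
    results = [((a, d), train(a, d, data))
               for a, d in itertools.product(alphas, depths)]
    return min(results, key=lambda r: r[1])  # (best_params, best_error)

best_params, best_error = brute_force_search(
    data=[1, 2, 3],
    alphas=[0.1, 0.3, 1.0],
    depths=[2, 4, 8],
)
print(best_params)  # (0.3, 4) minimizes the toy error surface
```

In a real system each `train` call would be a submitted job and the loop would run candidates in parallel, but the comparison logic is the same.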
Real-time Analytics at Scale
Challenges of Model Management
Model Management Framework operationalizes analytics to ease development and deployment of analytical models
The framework provides key benefits to operationalize and democratize access to analytical modeling at scale
• Captures and templates analytical models created by expert data scientists for easy reuse
• Faster development of analytical models, with rapid iteration of training and comparing models using a brute-force approach
• Presents a champion-challenger view to visually compare and promote trained models
• Reduces complexity for data scientists to train and deploy models
• Enables business analysts and others to participate in the modeling process
Model Management Framework is essential for the Internet of Things platform
The Internet of Things platform exposes thousands of sensors whose models must be automatically managed and maintained, while providing easy access to the predicted results
Identify desired insights: Identify insights for operationalizing devices and machinery for various purposes: anomaly detection, predictive maintenance, budget and resource optimization
Collect data: Collect various types of data (time series or static) and store them in the databases that best fit each data type
Analyze: Train the analytical models using the model management framework, or using other analytical tools such as R, then onboard them to the framework
Actuate and optimize: Set up rules to act on predicted results from thousands of sensors, e.g., schedule maintenance or lower the temperature on a device
Background
Organizations today have an unprecedented amount of data available because of the Internet of Things, the web, and social media
In order to take advantage of this massive set of data, organizations must build analytics platforms
Source: IBM, Big Data Hub, 2013
Traditional analytics platforms use big data technologies to process and analyze large amounts of data
“Excited by big data technology capabilities to store more data, more diverse data and more real-time data, (companies) focus on data collection. Rapidly growing data stores put increasing pressure on figuring out what to do with this data. Determining the value of the collected data becomes the top challenge in all industries.” Source: Svetlana Sicular, Gartner, October 30, 2015
Example Technologies
The steps to derive value out of the data include collecting, processing, and analyzing the data using a variety of big data tools.
• Data Collection: Store huge volumes of data in multiple data stores, in a variety of data types, for processing.
• Data Processing: Process the data by filtering, transforming, and applying machine learning algorithms using computing engines.
• Analytics and Visualization: Create ad hoc reports on processed data using business intelligence and visualization tools.
Enterprises need access to both historical and real-time data to gain the most value out of big data analytics
• Real-time data is processed in sub-seconds to seconds from the time it arrives to when the results are derived.
• Batch processing technologies alone are insufficient: in the time it takes to process a batch (hours, days), real-time data has accumulated and is missed, which means lost opportunities for proactive decision making.
Storing data in a fault-tolerant, replicated historic store, processing a large batch of data, and storing the processed data using batch writes incurs delays that make real-time analytics infeasible
Queries are only directed at stale data of up to hours or days. The lack of real-time data limits the analytics to ad-hoc summarizations and aggregations.
Because of the batch processing delay, by the time the captured data is available for queries, it is stale
Real-time data is missed by the time analytics begins
[Diagram: batch-only pipeline: historic data store (storage) → batch processing → batch write → serving → query; real-time data arrives but is missed]
The Lambda Architecture empowers real-time analytics by handling data at scale and in real time using a hybrid architecture
• Designed by Nathan Marz, the creator of the Apache Storm project and previously a lead engineer at Twitter, with the goal of building a general architecture to process big data at scale.
• The architecture separates batch processing on historical data from stream processing on the real-time flow of data, allowing for analytics that combine the most up-to-date data with historical data views.
Real-time analytics can now be performed on the most up-to-date data combined with historical views
BATCH LAYER focuses on processing historical data views for queries
SPEED LAYER handles the complexity of real-time data collection and analysis
[Diagram: Lambda pipeline: data flows into both the batch layer (historic data store → batch processing → batch write → serving) and the speed layer (queue → speed processing → random write); queries hit the merged serving views]
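The batch/speed split can be pictured as a query-time merge of two views. A minimal sketch in plain Python, assuming per-sensor event counts as the materialized view; all names are illustrative:

```python
from collections import Counter

# Batch layer: recomputed periodically over the full historic store.
batch_view = Counter({"sensor-a": 1000, "sensor-b": 750})

# Speed layer: incremental counts for events since the last batch run.
realtime_view = Counter({"sensor-a": 12, "sensor-c": 3})

def query(sensor_id):
    """Serving layer: merge the stale-but-complete batch view with the
    fresh-but-partial real-time view at query time."""
    return batch_view[sensor_id] + realtime_view[sensor_id]

print(query("sensor-a"))  # 1012: historical total plus real-time increments
```

When the next batch run completes, its view absorbs the accumulated events and the speed layer's increments are discarded; that swap is what keeps the merge cheap.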
In the Internet of Things, predictive modeling on sensor data allows organizations to discover patterns and predict outcomes for their operations
[Diagram: pipeline from sensors at field sites through Data Collection (NoSQL for unstructured data), Data Processing (computing engines and stream processors), Predictive Modeling (machine learning algorithms, model runtime environments), and Proactive Decision Making (predictive results, notifications and alerts, remediation)]

Oil & Gas Producer:
• Collects data from over 190,000 sensors
• Injects 6,000 rows/second and 11 billion rows of data per month, a larger analytics platform than Twitter
• Has over 3,500 models analyzing data using various algorithms
• Enables the company to examine huge sets of data and discover trends to predict outcomes in operation and exploration efforts

Water Utility Client:
• Collects data from sensors placed along pipes in a water distribution network
• Processes data for water flow rate and pressure
• Applies predictive models to project forward in time, spotting spikes or falls that exhibit warning signs of failure
• Uses results from the predictive models to proactively reduce pressure spikes, avoiding leaks, prolonging the longevity of assets, and reducing disruption to customers
• The real value of big data is the insight via the analytics, not just the collection of the data.
• Predictive modeling is the primary means by which companies can discover trends and make proactive, as opposed to reactive, decisions on data.
The modeling process is iterative and its lifetime spans both batch-mode model training and real-time prediction
In general, a model creates an output for an unknown target value given a defined set of inputs. In a time-series model, the target value also depends on time as an input
Build Model
• Identify required data and how to get it
• Design and validate specific analytic models
• Verify approach through an initial set of insights on particular environments
Example: Analyzes a variety of machine learning algorithms, identifies the logistic regression model as the most suitable for the problem, and codes the model as a .jar file.

Train Model
• Prepare historical data for training
• Select model input parameters and runtime environment
• Train the model on data from the historical batch and/or real-time stream in the runtime environment
Example: Selects input parameters, such as the regularization parameter for the logistic regression model, and submits the model to Spark to train on historical data in HDFS.

Monitor Execution
• Monitor the status of the model training in the runtime environment (e.g., running, succeeded, failed)
• Troubleshoot issues in the runtime environment if necessary
Example: Opens a terminal, SSHes into the Hadoop cluster, and runs commands to verify the status of the model as it trains.

Compare Models
• Compare trained models in champion-challenger fashion
• Use a brute-force approach to find the best-of-breed model for deployment to the live stream
Example: After iteratively training many models, selects the best-of-breed as the model with the lowest mean square error.

Operationalize Model
• Deploy the best model on the live stream of data
• Generate predicted results for automated or manual proactive decision making
• Observe results to feed back and fine-tune the model
Example: Submits the model to Spark Streaming to be applied to streaming data ingested from Kafka; the model predicts in real time whether a sensor will fail.

Data Scientist: "I want to deploy a model that can detect if a sensor is faulty in real-time"

[Diagram axis: the stages span from Data Science to System Administration]
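The five stages above reduce to a simple loop once the runtime calls are abstracted away. A hedged sketch: `mse` stands in for real model scoring, and the "models" are just candidate slopes compared by brute force, with the champion chosen by lowest mean square error as on the Compare Models step:

```python
def mse(model, data):
    """Stand-in scorer: treat each 'model' as a slope and score it against
    (x, y) pairs. A real system would score a trained model on held-out data."""
    return sum((model * x - y) ** 2 for x, y in data) / len(data)

def lifecycle(candidates, history):
    trained = []
    for model in candidates:          # Build + Train each candidate
        error = mse(model, history)   # Monitor: training done, record its error
        trained.append((model, error))
    # Compare: champion-challenger, lowest MSE wins
    champion = min(trained, key=lambda t: t[1])
    return champion                   # Operationalize: deploy the champion

history = [(1, 2.0), (2, 4.1), (3, 5.9)]
champion, err = lifecycle(candidates=[1.0, 2.0, 3.0], history=history)
print(champion)  # 2.0 fits y ≈ 2x best
```

In the real pipeline each iteration is a submitted training job and "deploy" pushes the champion to the stream processor, but the control flow is this loop.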
Challenges with Analytical Modeling in the Current State
Building, training, and deploying analytical models require a rare combination of data science and engineering skills
The ability to complete the modeling process is limited to specialized individuals who are experts in both data science and engineering
“The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” Source: McKinsey Global Institute analysis
Full Set of Skills Needed for Model Building and Deployment
• Understanding of a variety of machine learning algorithms and pattern recognition, as well as expertise in a domain
• Ability to build and code accurate models based on the problem space
• System administrator skills, and a deep understanding of big data systems, to deploy models in a runtime environment

Data Scientist
• Traditional strengths: Mathematics, statistics, machine learning, data mining, pattern recognition, predictive algorithms, domain expertise
• Potential hurdles: Troubleshooting and running a runtime environment such as Spark requires advanced system engineering skills, which a data scientist may not be trained in. This can potentially lead to slower development and deployment of predictive models.

Business Analyst
• Traditional strengths: Domain expertise, business processes, requirements gathering
• Potential hurdles: Traditional business analysts may lack core skills in data science or data engineering, having little experience building, training, or deploying models

Dual Data Scientist and Engineer
• Traditional strengths: Combination of data science skills with the software engineering and system administrator skills for big data systems
• Potential hurdles: May lack domain expertise, in which case it may take longer to build and train relevant models for the use case
Analytical models are not easily reusable or shareable, resulting in siloed analytics work
There is no standard method for sharing models that lets users leverage models created by other data scientists, so analytics work is siloed. This is true both for freshly built models and for models already trained on a dataset
Predictive models duplicate and sprawl as data scientists build and train their own individual library of models that are not shared.
No standard for sharing or viewing other data scientists’ models
Individual Libraries of Models: Data scientists primarily leverage their own libraries of models, and the datasets they have previously worked with, to select an algorithm and build a model for the current problem
Model Duplication: As models are built and trained, the same types of models may be built by more than one data scientist, particularly when those model types are common in the industry’s use cases
Model Sprawl: Over time, as more data scientists build and train more models, the models sprawl and duplicate unnecessarily, making central management of models more difficult
Train and deploy individual models
Runtime execution environments for model training and deployment
Without a framework, the current approach is too inflexible to support multiple runtime execution environments
It is impractical to scale the number of runtime environments used to train and deploy models with a manual approach
[Diagram: a data scientist cycles through Learn → Test → Update across runtime environments with a Spark model that has R dependencies; one environment matches all dependencies and can support the model, another is missing Spark functionality, another is missing a specific R dependency, and another supports all R libraries and can execute the model]

Data Scientist: "I have a model, but I don’t know which runtime environment can support it." "I’m only familiar with R, so I need to learn all the environments to test my model." "I have a new type of model, so I need to learn another runtime environment."

Runtime environments often cannot support all types of models. As a result, data scientists must spend time learning environments instead of using that time for analytical modeling.
• Data scientist needs to acquire the system administration skills to operate the runtime environments
• Each runtime environment is unique and requires time and energy
• In the worst case, the data scientist must try every runtime environment before successfully finding a match for the predictive model
• As more model types are needed, additional runtime environments must be learned
• Learning additional environments becomes a time-consuming endeavor
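The dependency-matching problem behind this worst case can be sketched as a set comparison, which is essentially what an automated runtime verifier would do on the data scientist's behalf. All environment names and dependency tags below are illustrative assumptions:

```python
# Hypothetical capability sets: what each runtime environment can execute.
ENVIRONMENTS = {
    "spark":        {"spark-core", "mllib"},
    "spark-with-r": {"spark-core", "mllib", "r-base", "r-glmnet"},
    "r":            {"r-base", "r-glmnet", "r-forecast"},
}

def compatible_environments(model_deps, environments=ENVIRONMENTS):
    """Return every environment whose capabilities cover all model
    dependencies, so no one has to try environments one by one."""
    return sorted(name for name, caps in environments.items()
                  if model_deps <= caps)  # subset test: deps all covered

# The slide's example: a Spark model with R dependencies.
print(compatible_environments({"spark-core", "r-glmnet"}))  # ['spark-with-r']
```

A real verifier would inspect packaged artifacts and query live environments, but the core check is this subset test.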
Lack of engineering abstraction makes it difficult to quickly train predictive models on data
Data scientists lose productivity because the process of training models is manual: it requires manually checking the status of a model in the environment, plus system administration work to troubleshoot the model in that environment
• The need for abstraction grows as the number of model types and runtime environments increases
• Wasted productivity: time is spent on data engineering instead of comparing models to find the best-of-breed for deployment
• There are no abstractions for training or monitoring models on runtime environments

[Diagram: manual process. The data scientist builds many models on various algorithms, then, for hundreds of models on various runtime environments, repeats the cycle of train model → check status of model → troubleshoot model while trying different input parameters and algorithms to find the best-of-breed model. More time is spent on system administration, less on building predictive models.]
Model Management Framework for Automated Analytics at Scale
Model Management Framework simplifies the training, deployment, and management of a large number of models for a Lambda architecture
Model management is a framework for data scientists and other users to more easily train and deploy analytical models in various runtime environments on the Lambda architecture. It abstracts the system administration, reduces the complexity of training and deployment, and shares models in a way that is consumable across your organization, enabling users such as business analysts to take part in the modeling process.
The framework in this reference architecture proposes:
• Model Store and Trained Model Store: A library of models of commonly used machine learning algorithms that can be trained on the user's historical datasets, as well as trained models that are available to be deployed.
• Model Interface Templates: Interfaces that abstract away the complexity of the machine learning algorithm, allowing users to specify the inputs and outputs of the model.
• Deployment and Scheduler: Automatic training, deployment, and scheduling of models on runtime environments, so that users do not need to operate the runtime environments themselves.
• Runtime Verifier: The ability to determine which runtime environments can support a model prior to execution, enabling faster development of trained models.
• Monitoring Service and Metadata Store: A service that monitors the status of the model during its execution on the runtime environment, and stores metadata about that execution.
• API: Exposes these capabilities as API endpoints for users to verify, train, deploy, and monitor models on runtime environments.
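A toy, in-memory version of the stores can illustrate how these components divide responsibilities. Real stores would be databases behind the API, and every field name here is an assumption, not the framework's actual schema:

```python
import datetime

model_store = {}      # templated, untrained models keyed by name
trained_store = {}    # trained models ready for deployment
metadata_store = []   # execution records captured by the monitoring service

def register_model(name, algorithm):
    """Add a reusable model template to the model store."""
    model_store[name] = {"algorithm": algorithm}

def record_training(name, status, mse):
    """What the monitoring service might persist for one training run;
    successful runs also promote the model into the trained store."""
    metadata_store.append({
        "model": name, "status": status, "mse": mse,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    if status == "succeeded":
        trained_store[name] = {**model_store[name], "mse": mse}

register_model("pressure-lr", algorithm="logistic_regression")
record_training("pressure-lr", status="succeeded", mse=0.042)
print(sorted(trained_store))  # ['pressure-lr']
```

The point of the split: templates are shared and reusable, trained artifacts are deployable, and metadata is queryable for monitoring and champion-challenger comparison.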
[Diagram: users (data scientists, business analysts) → API and Model Interface Templates → model management components (Deployment and Scheduler, Runtime Verifier, Model Store, Trained Model Store, Metadata Store, Monitoring Service) → runtime environments (distributed computing, scientific computing) feeding real-time analytics]
• Design for seamless interfaces: the method of connecting the various stages of the modeling pipeline, supporting domain experts and data scientists in creating and updating models and business analysts in extracting data insights.
• Model management at scale: specific to large-scale data analytics, which requires distributed resource allocation and communication with various data stores.
Model Management Framework provides seamless interfaces along the data analytics pipeline for model creation, deployment, and scheduling
The framework in this technical architecture proposes:
• Runtime Environments: Backend runtime environments such as Spark, MapReduce, R, and more interact with distributed resources (e.g., Hadoop) to train and deploy models
• Historical Data Store: Data virtualization interacts with various databases (e.g., Cassandra, Redshift, S3)
• Training, Prediction, Model Runtime Services: Framework services interact with the runtime service to deploy and allocate resources for models, as well as to verify models for execution
• APIs: APIs interact with the framework services for the various functionalities
• Online Message Queue: The message queue is injected with real-time data
[Diagram: data scientists and business analysts → user interface → API → training service, prediction service, and resource allocation service → model runtime service and runtime environments; backed by the model store, model metadata store, results store, historical data storage, and an online message queue]
Demo
Model Management Framework covers a number of features to support various perspectives
The framework provides the following features to better serve domain experts, data scientists, and business analysts
• Automatic model deployment on multiple runtime environments: Automatically prepares a trained model (the saved .jar file) to serve real-time data on multiple runtime environments, with pre-verification prior to execution
• Modeling algorithm library: A library of machine learning and statistical learning algorithms
• Model metadata: A model profile describing the configuration parameters, paths to input/output data, model version, and resource consumption
• Heterogeneous data stores: Data can be stored in various databases
• Champion-challenger model: Multiple models, with the best-performing model as the champion and the rest as challengers
• Batch mode and real-time mode: A combination of model training and serving the model on real-time data
• Model update: Retraining of the current model or re-selection of the champion model
• Job completion time estimation: An estimate of how soon a job can complete given the current resources
• Prediction results query and UI: Access to the prediction results from applying the trained model to real-time data, for dashboard display
• Algorithm parameter tuning: Automatic fine-tuning of algorithm parameters to achieve the best model quality
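Of these features, the job completion time estimate is the simplest to sketch: remaining work divided by aggregate throughput. This is a naive illustration, not the framework's actual estimator; a real scheduler would also account for cluster load and contention:

```python
def estimate_completion_seconds(rows_remaining, rows_per_second, parallel_workers=1):
    """Naive completion estimate: remaining work over aggregate throughput."""
    if rows_per_second <= 0 or parallel_workers <= 0:
        raise ValueError("throughput and worker count must be positive")
    return rows_remaining / (rows_per_second * parallel_workers)

# Using the 6,000 rows/second ingest rate cited earlier in the deck:
print(estimate_completion_seconds(1_200_000, 6_000))     # 200.0 seconds
print(estimate_completion_seconds(1_200_000, 6_000, 4))  # 50.0 seconds with 4 workers
```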
Deploy Accenture's Model Management Framework on-premise to operationalize analytics in a big data analytics platform
At Accenture Labs, we have a patent-protected invention on the model management framework that showcases its unique capabilities. If you have analytical models running in a big data analytics platform, we can help deploy our model manager in your environment before problems arise as the number of model types and runtime environments you need to support increases
Simplified modeling process for data scientists: Abstracts the data engineering and presents a champion-challenger view, so your data scientists can more quickly train, compare, and promote their models for deployment.
Provides analytics for Internet of Things use cases: Processing data from heterogeneous data stores allows data from thousands of sensors to be sent through the modeling pipeline, leveraging the existing platform's analytical capabilities.
Enabled for real-time analytics: The model manager can deploy prediction jobs that ingest streaming data and apply a trained model for real-time predictions.
Greater coverage of runtime environments and models: Extends the capability to support additional runtime environments, increasing the number of model types you can use in your data pipeline.
Democratized access to analytics: A shared library of models created by experts allows other data scientists and business analysts to leverage those models for their use cases.
Contact Information
Accenture Labs
Teresa Tung, Technology Labs Fellow, [email protected]
Carl Dukatz, R&D Senior Principal, [email protected]
Chris Kang, R&D Associate Principal, [email protected]
Appendix
The solution: A new Model Management Framework
Simplifying model deployment at scale
A simplified interface that:
• Comprises a model building service, a prediction service, and a resource allocation service
• Supports end-to-end analytical modeling at scale using the Lambda Architecture
• Hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts

RESULTS
• Enables a catalog approach to finding analytics
• Simplified onboarding of new analytics
• Brute-force approach to retraining and comparing models
Benefits of the new framework
Unlocking the power of Lambda for data scientists, domain experts, and business analysts
Data scientists and domain experts who generate the models can:
• Select from already-captured modeling approaches or onboard their own
• Easily compare models in a champion-challenger fashion

Business analysts who rely on the models' results can select from a catalog of models created by experts
Model Management Framework differs from other approaches by enabling big data capability with heterogeneity and scalability
Other analytics approaches focus on designing and fine-tuning machine learning algorithms to improve accuracy, using modeling tools that are hard to scale or speed up. For example, the WEKA libraries provide comprehensive machine learning algorithms but lack the capability to integrate with big data or manage thousands of models. Similarly, Apache Mahout works with Hadoop MapReduce, which suffers slowdowns from frequent writes to disk.
Comparison Examples

Model Management Framework
• I want to run my analytics on a distributed dataset of terabytes or petabytes that is geographically distributed and stored in various databases
• I want to deploy multiple models on distributed resources and let the framework automatically select the best model based on the metrics I have defined
• I want to specify the prediction interval and query the results by calling API endpoints
• I want to always use an up-to-date model by having the framework retrain the current model or select a new champion model

Other Model Management
• I want to improve my SVM classification algorithm's accuracy by 3% on my 300MB dataset residing on my local disk
• I want to try various algorithms and fine-tune parameters to see how the accuracy can be improved
• I want to apply the trained model to new data for prediction by calling the modeling method and specifying where to store the results; I need to try multiple prediction intervals to see which works
• I want to see the prediction results by plotting the data from the file where the results are saved