Introduction to Real-Time Predictive Modeling

Post on 13-Feb-2017


[Diagram labels: Structured Data; Factors; Scoring Rules; Scores / Classes; User Inputs; Prediction or Selection]

EXAMPLES

Predictive Modeling Applications

• Crime mapping

“The core innovation that Zillow offers are its advanced statistical predictive products, including the Zestimate®, the Rent Zestimate and the ZHVI® family of real estate indexes. By using R in production as well as research, Zillow maximizes flexibility and minimizes the latency in rolling out updates and new products.”

• Statistical forecasting

[Map: datacenter regions; legend: Operational / Announced]

• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney

* Operated by 21Vianet

http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/

REAL TIME

BIG DATA

PREDICTIVE ANALYTICS

Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0

"CLOCK" by Heiko Klingele flickr.com/photos/divdax/3458668053/ CC-BY 2.0

[Diagram labels: Structured Data; Log Files; Sensor Streams; Language Text; Extraction; Ingestion; Historical Data]

"IO VAPOURA" by Jaya Prime flickr.com/photos/sanjayaprime/4924462993 CC-BY 2.0

Factors

• User ID
• Browser
• Time/Date / Location
• Previous purchases
• Friend data
• Any known information

Scoring Rules

• Decision Tree
• Logistic Regression
• Neural Network
• K-means clustering
• Ensemble Model

Prediction or Selection (Scores / Classes)

• Product of most interest
• Offer of most likely sale
• Most relevant link
• Forecast sale value
• Optimal Bid
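As a rough illustration of how scoring rules turn known factors about a user into a prediction or selection, the Python sketch below applies one logistic regression rule per offer and picks the offer with the highest score. The factor names, coefficients, and offer names are invented for illustration and do not come from the presentation.

```python
import math

# Hypothetical scoring rules: one set of logistic regression
# coefficients per offer. All names and numbers are illustrative.
SCORING_RULES = {
    "offer_a": {"intercept": -1.0, "previous_purchases": 0.8, "is_mobile_browser": -0.3},
    "offer_b": {"intercept": -0.5, "previous_purchases": 0.2, "is_mobile_browser": 0.9},
}


def score(factors, rule):
    """Return the predicted probability of a sale for one offer."""
    z = rule["intercept"]
    for name, value in factors.items():
        z += rule.get(name, 0.0) * value  # factors without a coefficient add nothing
    return 1.0 / (1.0 + math.exp(-z))


def select_offer(factors, rules=SCORING_RULES):
    """Score every offer and return (best_offer, all_scores)."""
    scores = {offer: score(factors, rule) for offer, rule in rules.items()}
    best = max(scores, key=scores.get)
    return best, scores


user_inputs = {"previous_purchases": 3, "is_mobile_browser": 0}
best_offer, offer_scores = select_offer(user_inputs)
print(best_offer, offer_scores)
```

Because scoring is only a handful of multiplications and one exponential per offer, this step is cheap enough to run on each request; the expensive work of estimating the coefficients happens elsewhere.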

• Feature Selection
• Sampling
• Aggregation
• Variable Transformation
• Model Estimation
• Model Refinement
• Model Comparison / Benchmarking

[Diagram labels: Known Factors; Known Outcomes; Predictive Model]
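The path from known factors and known outcomes to a predictive model can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the data is synthetic, the model is a one-variable logistic regression fitted by plain gradient descent, and the benchmark is a majority-class guess. None of these choices come from the presentation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, epochs=300):
    """Model estimation: fit weight w and bias b by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def accuracy(predictions, outcomes):
    return sum(p == y for p, y in zip(predictions, outcomes)) / len(outcomes)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "known factors" and "known outcomes" (illustrative data only).
factors = [rng.uniform(-3.0, 3.0) for _ in range(200)]
outcomes = [1 if x + rng.gauss(0.0, 0.5) > 0 else 0 for x in factors]

# Sampling: shuffle, then hold out 30% of the rows for model comparison.
rows = list(zip(factors, outcomes))
rng.shuffle(rows)
train, test = rows[:140], rows[140:]

w, b = fit_logistic([x for x, _ in train], [y for _, y in train])

# Benchmarking: compare against always predicting the majority class.
test_x = [x for x, _ in test]
test_y = [y for _, y in test]
model_acc = accuracy([1 if sigmoid(w * x + b) >= 0.5 else 0 for x in test_x], test_y)
majority = 1 if sum(y for _, y in train) * 2 >= len(train) else 0
baseline_acc = accuracy([majority] * len(test_y), test_y)
print(f"model={model_acc:.2f} baseline={baseline_acc:.2f}")
```

The fitted pair `(w, b)` is the "scoring rule" that would later be deployed; the held-out comparison is what justifies deploying it.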

[Diagram: Hadoop cluster. HDFS: one Name Node and five Data Nodes. MapReduce: one Job Tracker and five Task Trackers.]
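To make the MapReduce idea concrete, here is a single-process Python sketch of the map, shuffle, and reduce phases. It only imitates the programming model; it does not use Hadoop, and the log records and function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log records, split into blocks the way a distributed
# file system spreads one file across data nodes. The data is made up.
blocks = [
    ["user1,purchase", "user2,view", "user1,view"],
    ["user2,purchase", "user1,purchase"],
    ["user3,view"],
]


def map_phase(record):
    """Emit (user, 1) for every purchase event; ignore other events."""
    user, event = record.split(",")
    if event == "purchase":
        yield user, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key."""
    return key, sum(values)


# Each block could be mapped independently (on a cluster, one task per block).
mapped = chain.from_iterable(map_phase(r) for block in blocks for r in block)
purchases = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(purchases)
```

The point of the pattern is that `map_phase` never needs to see more than one record and `reduce_phase` never needs more than one key, which is what lets the work be spread over many nodes.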

[Diagram labels: Factors; Score; Structured Data]

[Diagram labels: Factors; Scores; Actual Outcomes; Structured Data]
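Once scores are stored alongside actual outcomes, the two can be compared to check how the model is performing. The Python sketch below is one simple way to do that, assuming a 0.5 decision threshold and a handful of invented records.

```python
# Hypothetical scored records joined with what actually happened.
records = [
    {"score": 0.91, "actual": 1},
    {"score": 0.15, "actual": 0},
    {"score": 0.62, "actual": 0},
    {"score": 0.48, "actual": 1},
    {"score": 0.80, "actual": 1},
]


def confusion_counts(rows, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for row in rows:
        predicted = row["score"] >= threshold
        if predicted and row["actual"]:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif row["actual"]:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


counts = confusion_counts(records)
observed_accuracy = (counts["tp"] + counts["tn"]) / len(records)
print(counts, observed_accuracy)
```

A falling accuracy over time is a common signal that the model should be re-estimated on more recent data.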

Phase | “Big Data” | “Real Time”
Unstructured Data | Petabytes (or Exabytes!) | Minutes to Hours
Advanced Analytics | Gigabytes to Terabytes | Minutes
Deployment | Megabytes/second | Milliseconds
Consumption | Kilobytes | Seconds

powerbi.microsoft.com/en-us/industries/airline

Data

• SQL Server 2016: Big-data R analytics integrated with SQL Server database
• HDInsight: Cloud-based Hadoop clusters

Develop

• Microsoft R Server: Big-data R with distributed and in-database computing
• Visual Studio: R Tools for Visual Studio, an integrated development environment for R

Deploy

• Azure ML Studio: ML, Python and R in cloud-based Experiment workflows
• Cortana Analytics Suite: Cloud-based R APIs and Virtual Machines

Consume

• PowerBI: Computations and charts from R scripts in dashboards
• Excel: With Azure ML Web Services plug-in

• Cloud computing: 5x increase, 2011 to 2016
• Data science: universities filling 300,000 US talent gap
• Big data: 90% of the data in the world today has been created in the last two years alone
• Open source: including R, Linux, Hadoop

Getting Started with R tutorials:

• http://mran.microsoft.com/documents/getting-started/

Import/export data from SQL tables

• RODBC package: http://mran.microsoft.com/packages/info/?RODBC

Machine Learning Task View

• http://mran.microsoft.com/taskview/info/?MachineLearning

Applied Predictive Modeling (Kuhn & Johnson, 2013)

• http://appliedpredictivemodeling.com/ & R “caret” package

https://datainsightssummit.hubb.me/

http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/

Building a genetic disease risk application with R

Data

• Public genome data from 1000 Genomes

• About 2TB of raw data

Analytics Development

• Microsoft R Server

• VariantTools variant caller in R

Factors & Scores

• DNA Sample / genetic variations

• Risk association

Deployment and Consumption

• Expose as API

• Web page, phone app, etc

Data Platform

• HDInsight Hadoop 1800 Nodes

• Raw genome sequence data in HDFS
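The "Expose as API" step above can be sketched with Python's standard library. This is only an illustration of wrapping a scoring function in an HTTP endpoint: the variant names, weights, and additive risk formula are placeholders invented here, not the method used in the application described, and a real deployment would sit behind a managed service rather than `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical risk weights per genetic variant; placeholders, not real data.
RISK_WEIGHTS = {"variant_a": 0.25, "variant_b": 0.5}


def risk_score(variants):
    """Sum the weights of the variants present, capped at 1.0."""
    return min(1.0, sum(RISK_WEIGHTS.get(v, 0.0) for v in variants))


def handle_request(body):
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(body)
    return json.dumps({"risk": risk_score(payload.get("variants", []))})


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, score it, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start the server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Keeping `handle_request` separate from the HTTP handler means the same scoring logic can back a web page, a phone app, or any other consumer.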


Like What You Heard?

Join David Smith again at the PASS BA Conference in the session:

“Power BI Desktop Deep Dive including R Integration”

May 2 – 4, 2016

San Jose, CA
