TRANSCRIPT
A World of Data
Raghu Ramakrishnan
CTO for Data, Technical Fellow Microsoft
Content Optimization
Agarwal et al., "Content Recommendation on Web Portals," CACM 56(6):92-101 (2013)
Key Features
Package Ranker (CORE)
Ranks packages by expected CTR based on data collected every 5 minutes
Dashboard (CORE)
Provides real-time insights into performance by package, segment, and property
Mix Management (Property)
Ensures editorial voice is maintained and user gets a variety of content
Package rotation (Property)
Tracks which stories a user has seen and rotates them after user has seen them for a certain period of time
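The Package Ranker and Package rotation features described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PackageStats` record, the additive smoothing constants, and the 600-second exposure cutoff are hypothetical and do not come from the slides, which only state that packages are ranked by expected CTR from 5-minute data and rotated after a user has seen them for some period.

```python
from dataclasses import dataclass


@dataclass
class PackageStats:
    """Clicks and views for one package in the latest 5-minute window (hypothetical record)."""
    package_id: str
    clicks: int
    views: int


def smoothed_ctr(stats, prior_clicks=1.0, prior_views=20.0):
    """Estimate CTR with additive smoothing so low-traffic packages are not over-ranked.

    The prior values are illustrative assumptions, not figures from the talk.
    """
    return (stats.clicks + prior_clicks) / (stats.views + prior_views)


def rank_packages(window, seen_seconds, max_seen_seconds=600.0):
    """Return package ids ordered by estimated CTR, highest first.

    Packages the user has already been exposed to for max_seen_seconds or
    longer are rotated out (skipped) before ranking.
    """
    eligible = [
        s for s in window
        if seen_seconds.get(s.package_id, 0.0) < max_seen_seconds
    ]
    eligible.sort(key=smoothed_ctr, reverse=True)
    return [s.package_id for s in eligible]


window = [
    PackageStats("a", clicks=30, views=1000),
    PackageStats("b", clicks=5, views=50),
    PackageStats("c", clicks=80, views=1000),
]
seen = {"c": 900.0}  # this user has already seen package "c" for 15 minutes
ranking = rank_packages(window, seen)
print(ranking)
```

Here package "c" is rotated out for this user because its exposure exceeds the cutoff, and the remaining packages are ordered by smoothed CTR.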
Key Performance Indicators
Lifts in quantitative metrics
Editorial Voice Preserved
Recommended links News Interests Top Searches
Estimate P(response | user, item, context)
Statistical Methods for Recommender Systems, Agarwal and Chen, CUP, 2016
Modeling Overview
Offline Modeling
• Exploratory data analysis
• Regression, feature selection, collaborative filtering (factorization)
• Seed online models & explore/exploit methods at good initial points
• Reduce the set of candidate items
Online Learning
• Online regression models, time-series models
• Model the temporal dynamics
• Provide fast learning for per-item models
Explore/Exploit
• Multi-armed bandits
• Find the best way of collecting real-time user feedback (for new items)
Large amount of historical data
(user event streams)
Near real-time user feedback
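As one illustration of the explore/exploit idea, the sketch below uses Thompson sampling with a Beta-Bernoulli model per item. This is a commonly used multi-armed bandit method chosen here as an assumption; the slides do not say which bandit scheme is used. The class name, the simulated click-through rates, and the number of rounds are all hypothetical.

```python
import random


class BetaBernoulliBandit:
    """Thompson sampling over items with click / no-click feedback."""

    def __init__(self, n_items, seed=0):
        # Beta(1, 1) prior per item, i.e. uniform over possible CTRs.
        self.successes = [1.0] * n_items
        self.failures = [1.0] * n_items
        self.rng = random.Random(seed)

    def choose(self):
        """Sample a plausible CTR for each item and show the item with the highest sample."""
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, item, clicked):
        """Fold one observed response into the item's posterior."""
        if clicked:
            self.successes[item] += 1.0
        else:
            self.failures[item] += 1.0


def simulate(true_ctrs, rounds=5000, seed=0):
    """Run the bandit against simulated users and count how often each item is shown."""
    bandit = BetaBernoulliBandit(len(true_ctrs), seed=seed)
    env = random.Random(seed + 1)
    pulls = [0] * len(true_ctrs)
    for _ in range(rounds):
        item = bandit.choose()
        clicked = env.random() < true_ctrs[item]
        bandit.update(item, clicked)
        pulls[item] += 1
    return pulls


# Hypothetical true CTRs; the last item is the best one.
pulls = simulate([0.02, 0.05, 0.15])
print(pulls)
```

New items start with a wide posterior, so they are shown often enough to gather feedback (explore); as evidence accumulates, traffic shifts toward the items with the highest estimated response rate (exploit).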
Store any data relations
Do any analysis SQL queries
Hive,
At any speed Batch
Hive
At any scale … elastic!
Anywhere
Data to Intelligent Action
Windows
SMSG
Live Ads
CRM/Dynamics Windows Phone
Xbox Live
Office365
STB Malware Protection Microsoft Stores
STB Commerce Risk
Messenger LCA
Exchange
Yammer Skype
Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
Azure Data Lake
“1st party = 3rd party”
Hadoop and OSS
Cloud
Data
Intelligence IaaS
PaaS
SaaS
Relational
Document
Data Lake
In-Memory
Operational Analytics
(Algorithms, IoT…)
Our Axes of Innovation
Azure
34 Azure regions, 2x as many as AWS
More certifications than any other cloud provider
Industry leader for customer advocacy and privacy protection
Unique data residency guarantees
+
Applications
Management
App Frameworks
Databases & Middleware
Infrastructure
Linux
Microsoft Contributions to OSS Apache YARN
• Amoeba and Rayon
  – Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  – Status: Now in Apache Hadoop trunk!
• Federation
  – Status: prototype and JIRA
• Framework-level Pooling
  – Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  – Status: spec
Hybrid Management + Security
Log analytics Automation Backup DR and data protection Security
[Chart: MQ leader quadrants; values shown: 17, 3, 6, 0; labels include Competitor 1, Competitor 2, Competitor 3]
Cortana Intelligence Suite
Data and Analytics – 3 Pillars
SQL Server 2016
Azure DB
Azure DW
SQL Server R Services
On-prem and cloud (Windows, Linux)
Cortana Intelligence Suite
Azure Big Data and Analytics: Hadoop, Data Lake, Machine Learning, Power BI, Data Factory, Streaming, Perceptual Intelligence
On-prem connectivity
Microsoft
R server Analytics
Hadoop
Teradata
On-prem and cloud
(Windows, Linux)
Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016
Personalized Offers
Data sources: Web logs, Omniture logs; on-premises SQL Server (customer and product data); in-store activity with Kinect sensors; social data; diagnostic streaming
Pipeline stages: INGEST, PREPARE, ANALYZE, PUBLISH, CONSUME
Components: Event Hubs, Stream Analytics, Machine Learning, Azure Data Lake, HDInsight, Azure SQL Data Warehouse, Power BI, Cortana, Web/LOB Dashboards
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
Churn Prediction
Tacoma Public Schools wanted to use data to predict student dropout risk and increase graduation rates
Personalized Healthcare
• Data from sensors and devices such as blood-pressure cuffs and activity trackers
• Cortana Analytics dashboard where registered nurses have a single view of each customer’s personalized care plan
Many of the solutions currently on the market give physicians access to raw data; that’s not as useful as actionable intelligence to help them make a diagnosis. When you start looking at tools such as ImagineCare that have intelligence built in, I think that’s a big deal for providers.
DR. ETHAN BERKE, Medical Director for Clinical Design and Innovation
Service Analytics
• Replace the manual threshold method of monitoring dynamic telemetry data with intelligent anomaly detection.
• Detect small trends or level changes early, so that investigations and actions can start in time to prevent potential incidents.
• Learn automatically from both historical and real-time data to scale the monitoring.
• SQL Azure uses Anomaly Detection models to track hundreds of different service exceptions that cannot be tracked just by setting thresholds.
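To illustrate why a small level change can slip past a fixed threshold yet still be detectable, here is a minimal one-sided CUSUM sketch. CUSUM is a standard change-detection technique used here as an assumption; the slides do not state which models the Anomaly Detection service uses. The function name, the telemetry values, and the `target`, `slack`, and `threshold` settings are hypothetical.

```python
def cusum_upward(values, target, slack, threshold):
    """Return the index at which an upward level shift is detected, or None.

    The statistic accumulates how far each value exceeds target + slack and
    resets to zero whenever the running sum would go negative, so small but
    persistent deviations add up while isolated noise does not.
    """
    s = 0.0
    for i, v in enumerate(values):
        s = max(0.0, s + (v - target - slack))
        if s > threshold:
            return i
    return None


# Hypothetical telemetry: a stable level of 10.0, then a small shift to 10.5.
telemetry = [10.0] * 30 + [10.5] * 30

# A fixed alert threshold of 11.0 never fires on this series.
fixed_threshold_hits = [i for i, v in enumerate(telemetry) if v > 11.0]

# CUSUM flags the shift a few samples after it starts at index 30.
detected_at = cusum_upward(telemetry, target=10.0, slack=0.25, threshold=1.0)
print(fixed_threshold_hits, detected_at)
```

In practice the target and slack would be learned from historical data rather than set by hand, which is the scaling point made in the bullets above.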
Revenue Forecasting
This helps us triangulate internal forecasts and gives us more confidence in the forward-looking revenue ranges we provide to Wall Street.
VANDANA GANGAWAR, Senior Director, Microsoft Central Finance Planning and Operations
Azure Data Lake Store
Fully managed cloud data store designed for analytics
Supports HDFS compliant analytics applications and tools
Petabyte files, unlimited account size
High throughput for analytics performance
Low latency ingestion with read as you write
AAD-based authentication, access auditing
File and folder-level ACLs, Encryption at rest
Azure Data Lake Analytics
An elastic analytics service built on Apache YARN that processes all data, at any size
Pay PER QUERY & Scale PER QUERY
- No need to create a cluster
No limits to SCALE
Includes U-SQL, unifying the benefits of SQL with the expressive power of C#
- In future: Hive, Spark
Optimized to work with ADL STORE
FEDERATED QUERY across Azure data sources
ENTERPRISE GRADE role-based access control and auditing
Azure HDInsight: Linux and Windows
Managed, Monitored, Supported
• Cluster customization – Install your favorite project
• Harness existing .NET & Java skills to write custom extensions
• Supports broad ecosystem of ISVs (Hadoop and Traditional)
Full Apache Hadoop
• Batch – MapReduce, Pig, Hive, Spark
• Stream Processing and Analytics – Storm, Spark Streaming
• Interactive SQL – Hive (Tez), and SparkSQL
• Table Serving – HBase
• Machine Learning – SparkML, Mahout
Azure HDInsight
The Best of Hadoop
Batch: MapReduce, Pig, Hive, Spark
Interactive SQL: Hive (Tez), SparkSQL
Stream Analytics: Storm, Spark Streaming
Machine Learning: SparkML, Mahout
Table Serving: HBase
Exploratory Visualization: Jupyter, Zeppelin

Interactive SQL: SQL DW
Stream Analytics: Azure Stream Analytics
Machine Learning: Azure ML
Table Serving: Azure SQL DB
Exploratory Visualization: Power BI
High-performance open source R plus:
Enterprise Scale & Performance
– Scales from workstations to large clusters
– Scales to large data sizes
– Growing portfolio of Parallelized algorithms
Secure, Scalable R Deployment/Operationalization
Write Once Deploy Anywhere for multiple platforms
IDE for data scientists and developers
Enterprise Class Support
DistributedR
DeployR
DevelopR
ScaleR
ConnectR
Code Portability Across Platforms
Azure VM Azure HDI, Spark Azure ML PowerBI Office 365 …
Linux Windows
Teradata, SQL Server
Hortonworks Cloudera MapR
In the Cloud
Workstations & Servers
EDW
Hadoop
SQL Server 2016: Everything Built-In
The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Consistent experience from on-premises to cloud
In-memory across all workloads
TPC-H non-clustered 10TB: SQL Server is #1, #2, and #3; Oracle is #4
[Chart: yearly values, 2010 to 2015; series: SQL Server, Oracle, MySQL2, SAP HANA]
TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
at massive scale
National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015
Self-service BI per user: Microsoft $120, Tableau $480, Oracle $2,230
In-Database Advanced Analytics
No need to move the data
Open source R with in-memory & massive scale – multi-threading & massive parallel processing
Data Scientist: Interact directly with data
R built-in to SQL Server
Data Developer/DBA: Manage data and analytics together
Example Solutions
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
• Credit risk protection
Extensibility
R Integration
Relational data
Analytic Library
T-SQL interface
New R scripts
Microsoft Azure Marketplace
Real-time operational analytics without moving the data
End-to-end mobile BI Advanced Analytics Mission critical OLTP
SMS (Twilio)
Skype Consumer
… … …
Groupme
Active Directory Bot
Securely access people in a company (+ files, topics, data) from anywhere, via conversation
Intelligence: Cognitive Services
Image: Face, Age, Gender, Emotion
Academic Knowledge
Conversation channels
Language Understanding
Slack
O365 APIs
…beginning with conversation, people & bots