Three buzzwords of our time
download.microsoft.com/documents/hk/technet/techdays2015...
Post on 01-Apr-2020
Three Buzzwords of our time
IOT - Big Data - Predictive Analytics
A fourth – The Cloud
Agility, Control
Elastically Scale Storage & Compute
Data Management & Governance
Access Control
Rich Stores & Compute Options
Discovery in Familiar Tools
Informed Decisions on Time
Re-imagining modern data analytics by balancing agility and control
Business
• Innovate Faster
• Discover New Opportunities
• Reliable Information
IT
• Lower Costs
• Minimize Complexity
• Improve Efficiency
• Control/Reduce Risks
Connect
Collect
Enrich
Transform
Publish
Data Consumption
Data Production
Information Production
Data Production
Operational Dashboards, etc
BI & Analytics
I need to learn big data technologies to develop the pipeline
Do I need to start developing every pipeline from scratch?
What does it take to deploy the pipeline once developed?
Do I need to touch the code of every step when metadata changes?
How do I monitor health and execution status of the pipeline?
How do I ensure an update does not break the entire pipeline?
How do I deal with streaming and batch requirements?
How do I ensure reliable execution and fault tolerance?
How do I make it available to consumers?
Authoring Operating Managing Lifecycle Publishing
Typical Azure Data Architecture
[Diagram: Event & data producers (Web logs, Kinect In-Store Activity, Social Data); Ingest and Transform (Event Hubs, Stream Analytics, HDInsight, Azure Data Factory); DW / Long-term storage (Azure SQL DB, Azure Blob Storage, APS); Predictive analytics (Azure Machine Learning); Present & decide (Power BI, Web dashboards, Mobile devices)]
Cloud-scale telemetry ingestion from websites, apps, and devices
• Log millions of events per second in near real time
• Connect devices with flexible authorization and throttling
• Time-based event buffering
• Managed service with elastic scale
• Broad platform reach with native client libraries
• Pluggable adapters for other cloud services
Supports small number of queries that arrive at high volume
Streaming in Azure
Project Codename NRT
[Diagram: Sensors & Devices and Data Stores feed an Event Hub; the NRT Cloud Service (Input Adapter, Complex Event Processor, Output Adapter) writes to Data Stores, an Event Hub, and Dashboards & Alerts]
Small number of high volume queries
Complex Event Processing
(aggregation, reduction, cleanup)
Predictable & repeatable results
SQL-like queries
Ingress Azure blobs and Event Hub
Egress to Azure DB, blobs, Event Hub
Dashboarding
Alerting & Bind notification
Anomaly detection
Compute datasets
Orchestrate data movement, machine learning, Hadoop (via HDInsight) for on-premise and cloud data
Plan workflow dependencies and scheduling
Publish to Power BI users as a searchable data view
Lifecycle management, monitoring
Operationalize information production & governance
Orchestration and Data Production in Azure
Project Codename MDP
Support HBase as NoSQL columnar database on Azure Blobs
Support Storm as streaming
Hadoop in Azure
HDInsight
[Diagram: Name Node and Job Tracker over four Data Nodes, each with a Task Tracker; HMaster and Coordination over four Region Servers]
HBase as a columnar NoSQL transactional database running on Azure Blobs
Storm as a streaming service for near real time processing
Hadoop 2.4 support for 100x query gains on Hive queries
Mahout support for machine learning + Hadoop
Graphical User Interface for HIVE queries
Enjoy unprecedented efficiencies via a near-zero maintenance database-as-a-service
Ensure predictable performance and elastic scale from one to thousands of databases
Support business continuity policies with self-service restore and disaster recovery
Drive DevOps tasks via programmatic APIs
Migrate LOB apps for reduced CAPEX & OPEX; drive database administration efficiencies at scale
SQL Database service
Relational database-as-a-service designed for devs & architects
Azure SQL Database
Enable collaborative data science work with anyone, anywhere via a personal Windows Azure Machine Learning Studio workspace
Bring in cloud data sources with the ease of a drop down menu
Utilize the same best in class algorithms in ML Studio that run Xbox and Bing
Quickly deploy models as Azure web services with Machine Learning API service
TBs of scalability via HDInsight
ML SDK enabling partners to build and monetize ML web services
Easily create sophisticated models using numerous languages including R & Python
Deploy predictive models into production in minutes instead of days or weeks
Connect seamlessly with Excel for results visualization
Use historical data to predict future outcomes using cloud based machine learning
Azure Machine Learning
Power BI mobile app support for iOS devices
New data visualizations and self-service predictive analytics for forecasting and population plotting
Enhanced data source and data refresh support
Enhanced data management and governance built into Power BI
• Connectivity to on-premises data source
• Mobile access to Power BI reports
Powerful new ways to work with data with Excel and Power BI for Office 365
Self-service analysis in Power View
Power BI
Intake millions of events per second
• Process data from connected devices/apps
• Integrated with highly-scalable publish-subscriber ingestor
Easy processing on continuous streams of data
• Transform, augment, correlate, temporal operations
• Detect patterns and anomalies in streaming data
• Correlate streaming with reference data
Guaranteed event delivery
• Guaranteed not to lose events or produce incorrect output
• Preserves event order on per-device basis
Guaranteed business continuity
• Guaranteed uptime (three nines of availability)
• Auto-recovery from failures
• Built-in state management for fast recovery
Elasticity of the cloud for scale up or scale down
• Spin up any number of resources on demand
• Scale from small to large when required
• Distributed, scale-out architecture
• Scale using a slider in the Azure Portal instead of writing code
Low startup costs
• Provision and run a streaming solution for as low as $25/month
• Pay only for the resources you use
• Ability to incrementally add resources
• Reduce costs when business needs change
End-to-End Architecture Overview
Data Source, Collect, Process, Consume, Deliver
Event Inputs
- Event Hub
- Azure Blob
Reference Data
- Azure Blob
- …
Azure “NRT”
• Temporal Semantics
• Guaranteed delivery
• Guaranteed up time
Transform
- Temporal joins
- Filter
- Aggregates
- Projections
- Windows
- Etc.
Enrich
Correlate
Upcoming: Call ML models
Outputs
- SQL Azure
- Azure Blobs
- Event Hub
Upcoming outputs
- PowerBI (in Private Preview)
- Azure Tables
BI Dashboards
Predictive Analytics
Azure Storage
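The Enrich and Correlate steps join each streaming event against slowly changing reference data. A minimal Python sketch of that idea follows. It is not Stream Analytics code: the field names (`device_id`, `store`, `temp`) and the in-memory dictionary standing in for a reference blob are assumptions made for illustration, and unmatched events are passed through unchanged (a left join).

```python
def enrich(events, reference, key="device_id"):
    """Attach reference attributes to each event (left join on `key`).

    `events` is an iterable of dicts; `reference` maps a key value to a
    dict of extra attributes. Both shapes are hypothetical.
    """
    for event in events:
        extra = reference.get(event[key], {})
        # Reference fields are merged into a copy of the event.
        yield {**event, **extra}


# Hypothetical reference data, e.g. loaded once from a blob.
reference = {"d1": {"store": "Central"}}

events = [
    {"device_id": "d1", "temp": 21},
    {"device_id": "d2", "temp": 19},  # no reference row for this device
]

enriched = list(enrich(events, reference))
```

Here the first event gains a `store` field, while the second is emitted as-is because no reference row matches it.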
Every event that flows through the system has a timestamp
SELECT FROM TIMESTAMP BY
SELECT FROM
Projecting timestamp into payload
SELECT System.Timestamp AS FROM
SELECT TimeZone, COUNT(*) AS Count
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY TimeZone, TumblingWindow(second, 10)
Tell me the count of tweets per time zone every 10 seconds
[Figure: A 10-second Tumbling Window, with events grouped into consecutive windows along a time axis (secs)]
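To make the tumbling-window grouping concrete, here is a small Python simulation of the query above. It is a sketch, not the service's implementation: the sample tweets are invented, and windows are treated as half-open intervals [start, start + 10), which is a simplifying assumption about boundary handling.

```python
from collections import Counter


def tumbling_window_counts(events, size_seconds=10):
    """Count events per (window_start, key) in fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    """
    counts = Counter()
    for timestamp, key in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // size_seconds) * size_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Invented sample: (seconds since start, time zone)
tweets = [(1, "PST"), (4, "PST"), (7, "UTC"), (12, "PST"), (18, "UTC"), (19, "UTC")]
result = tumbling_window_counts(tweets)
```

Because the windows do not overlap, each tweet is counted exactly once.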
SELECT Topic, COUNT(*) AS TotalTweets, AVG(SentimentScore)
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second, 10, 5)
Every 5 seconds give me the count of tweets and the average sentiment score over the last 10 seconds
[Figure: A 10-second Hopping Window with a 5-second “Hop”, with overlapping windows along a time axis (secs)]
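A hopping window differs from a tumbling window in that windows overlap, so one event can contribute to several results. The Python sketch below simulates the query above under stated assumptions: the sample data is invented, results are keyed by the window's end time, and each window is treated as the half-open interval [end - 10, end).

```python
from collections import defaultdict


def hopping_window_stats(events, size_seconds=10, hop_seconds=5):
    """Return {(window_end, topic): (count, average_score)}.

    `events` is an iterable of (timestamp_seconds, topic, score) tuples.
    """
    scores = defaultdict(list)
    for timestamp, topic, score in events:
        # Latest window start (a multiple of the hop) at or before the event.
        start = (timestamp // hop_seconds) * hop_seconds
        # Walk back through every window that still covers the event.
        while start + size_seconds > timestamp:
            scores[(start + size_seconds, topic)].append(score)
            start -= hop_seconds
    return {k: (len(v), sum(v) / len(v)) for k, v in scores.items()}


# Invented sample: (seconds, topic, sentiment score)
tweets = [(1, "azure", 0.5), (4, "azure", 1.0), (7, "azure", 0.0), (12, "azure", 1.0)]
stats = hopping_window_stats(tweets)
```

With a 10-second size and a 5-second hop, every event lands in two windows, which is why the counts across windows add up to twice the number of tweets.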
SELECT Topic, COUNT(*)
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, SlidingWindow(second, 10)
HAVING COUNT(*) > 10
Give me the count of tweets for all topics which are tweeted more than 10 times in the last 10 seconds
[Figure: A 10-second Sliding Window, with the window moving along a time axis (secs)]
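The sliding-window query can be approximated in Python by keeping, per topic, only the timestamps from the last 10 seconds. This is a simplified sketch: it evaluates the window only when an event arrives, assumes events are already ordered by timestamp, and uses invented data with a lower threshold so the example stays short.

```python
from collections import defaultdict, deque


def sliding_window_alerts(events, size_seconds=10, threshold=10):
    """Return (timestamp, topic, count) whenever count exceeds the threshold.

    `events` is an iterable of (timestamp_seconds, topic) pairs,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)
    alerts = []
    for timestamp, topic in events:
        window = recent[topic]
        window.append(timestamp)
        # Drop timestamps that fell out of the last `size_seconds`.
        while window[0] <= timestamp - size_seconds:
            window.popleft()
        if len(window) > threshold:
            alerts.append((timestamp, topic, len(window)))
    return alerts


# Invented sample; threshold lowered to 2 for brevity.
tweets = [(1, "a"), (2, "a"), (3, "a"), (20, "a"), (21, "b")]
alerts = sliding_window_alerts(tweets, threshold=2)
```

Only the third tweet on topic "a" triggers an alert; by second 20 the earlier tweets have left the window.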
Data sources
Consumed by BI
Integrated with Apps
Coordination and management
• Build and manage a network of data pipelines
• From a single pane of glass:
• See full data and operational lineage
• Monitor pipeline and dataset health
• Control data production policy
Data stores and processing environments
• Work with your data
• On-premises SQL Server
• Azure DB, Azure Blobs, Azure table
• Compose and orchestrate data processing
• HDInsight, Custom Code, etc.
AZURE DATA FACTORY
Relational and non-relational
On-premise or cloud
Batch or Stream
Hadoop (Hive, Pig, etc.)
Custom code
Data movement
Manage and monitor
Data and operational lineage
Coordination and scheduling
Policy
DATA PIPELINES
Activity Activity
Produce trusted information from raw data
Information assets Raw data Orchestrate, monitor
Data Factory
A platform for developers to compose data processing, storage and
movement services to create & operationalize analytics Pipelines
Pipeline
Pipelines are groups of data movement and/or processing Activities that
accept N input Datasets and produce N output Datasets. Pipelines can be
executed once or on a flexible range of schedules (hourly, daily, weekly,
etc…).
Dataset
A Dataset is a named view of data. The data being described can vary
from simple bytes, semi-structured data like CSV files all the way to Tables
or Models.
Activity
An Activity is the unit of execution within the pipeline that can
perform data movement or transformation. It can import/export data
from disparate Data Stores (DB, files, SaaS services, etc) used by the
organization into a Data Hub
Data Hub
A Data Hub is a pairing of collocated data storage and associated
compute services. For example, a Hadoop cluster (HDFS as storage,
Hive/Pig/etc. as compute) is a Data Hub. Similarly, an EDW can be
modelled as a Data Hub (DB as storage, Sprocs and/or ETL tool as
compute services).
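The relationship between Pipelines, Activities, and Datasets defined above can be sketched as a small in-memory model. This is illustrative Python only, not the Azure Data Factory API: the class names mirror the slide's terms, while `run`, `execute`, the dictionary used as a data store, and the sample dataset names are all assumptions. Activities run in the order they are listed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Dataset:
    """A named view of data."""
    name: str


@dataclass
class Activity:
    """A unit of execution that reads input Datasets and writes outputs."""
    name: str
    inputs: List[Dataset]
    outputs: List[Dataset]
    run: Callable[[Dict[str, list]], Dict[str, list]]


@dataclass
class Pipeline:
    """A group of Activities executed together."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, store: Dict[str, list]) -> Dict[str, list]:
        for activity in self.activities:
            # Hand each activity only the datasets it declared as inputs.
            inputs = {d.name: store[d.name] for d in activity.inputs}
            produced = activity.run(inputs)
            for d in activity.outputs:
                store[d.name] = produced[d.name]
        return store


raw, clean, counts = Dataset("raw_logs"), Dataset("clean_logs"), Dataset("daily_counts")

pipeline = Pipeline("web_logs", [
    Activity("clean", [raw], [clean],
             lambda d: {"clean_logs": [r.strip().lower() for r in d["raw_logs"] if r.strip()]}),
    Activity("count", [clean], [counts],
             lambda d: {"daily_counts": [len(d["clean_logs"])]}),
])

store = pipeline.execute({"raw_logs": [" GET /a ", "", "GET /b"]})
```

The second activity consumes the dataset produced by the first, which is the dependency a real orchestrator would also have to track and schedule.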
[Diagram, traditional ETL: Extract (Original Data), Transform, Load (Transformed Data) using an ETL Tool (SSIS, etc) into an EDW (SQL Svr, Teradata, etc), then Data Marts and BI Tools]
[Diagram, Data Lake(s): Ingest (EL) Original Data and Streaming data into Scale-out Storage & Compute (HDFS, Blob Storage, etc), then Transform & Load (C#, MapReduce, Hive, Pig, Stored Procedures) for Dashboards and Apps]
New Azure service for data developers and IT
Compose data processing, storage, and movement services to create and manage
analytics pipelines
Rich, simple end-to-end pipeline monitoring and management
Initially focused on Azure and hybrid movement to/from on-premises SQL Server. Over time
it will expand to more storage and processing systems
HDInsight Hadoop for the Cloud
Azure HDInsight
Hadoop Meets the Cloud Microsoft’s cloud Hadoop offering
100% open source Apache Hadoop
Built on the latest releases across Hadoop (2.4)
Up and running in minutes with no hardware to deploy
Harness existing .NET and Java skills to write MapReduce
Utilize familiar BI tools for analysis including Microsoft Excel
Demo: Getting Started With HDInsight
Hadoop 2.0
• Fully managed: No software to install, no hardware to manage, and one portal to view and update.
• Integrated: Simple drag, drop and connect interface for Data Science. No need for programming for common tasks.
• Best in Class Algorithms + R: Built-in collection of best of breed algorithms. Support for R and popular CRAN packages.
• Deploy in minutes: Operationalize models with a single click. Monetize in Machine Learning Marketplace.
Drag & Drop + Best in Class Algorithms
Live Connectivity to SQL Server Analysis Services
Live Query
http://aka.ms/DBI233
Session Evaluation