Azure Data Lake intro (SQLBits 2016)
TRANSCRIPT
SQLBits 2016
Azure Data Lake & U-SQL
Michael Rys, @MikeDoesBigData
http://www.azure.com/datalake
{mrys, usql}@microsoft.com
The Data Lake Approach
[Figure: Growth of data, 1985 to 2020. Labels: analog, digital, internet connected, mobile, cloud]
Traditional data warehousing approach
Understand Corporate Strategy
Gather Requirements
Business Requirements
Technical Requirements
Implement Data Warehouse
Reporting & Analytics Development
Reporting & Analytics Design
Physical Design
Dimension Modelling
ETL Development
ETL Design
Install and Tune
Setup Infrastructure
Data sources
ETL
Data warehouse
BI and analytics
Dashboards
Reporting
The Data Lake approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
MICROSOFT DOUBLES SEARCH SHARE
Search share by year: 2009: 9%, 2010: 11%, 2011: 15%, 2012: 16%, 2013: 18%, 2014: 19%, 2015: 20%
Source: ComScore 2009-2015 Search Report US
How Microsoft has used Big Data
We needed to better leverage data and analytics to win in search
We changed our approach
• More experiments by more people!
So we…
• Built an Exabyte-scale data lake for everyone to put their data.
• Built tools approachable by any developer.
• Built machine learning tools for collaborating across large experiment models.
Introducing Azure Data Lake
Big Data Made Easy
Business Scenarios
Recommendations, customer churn, forecasting, etc.
Perceptual Intelligence
Face, vision, speech, text
Personal Digital Assistant
Cortana
Dashboards and Visualizations
Power BI
Machine Learning and Analytics
Azure Machine Learning
Azure Stream Analytics
Cortana Analytics Suite
Big Data & Advanced Analytics
DATA
Business apps
Custom apps
Sensors and devices
INTELLIGENCE
ACTION
People
Automated Systems
Information Management
Azure Data Factory
Azure Data Catalog
Azure Event Hub
Big Data Stores
Azure SQL Data Warehouse
Azure Data Lake Store
Azure Data Lake Analytics
Azure Data Lake
Managed clusters
Analytics
Storage
HDInsight (“managed clusters”)
Azure Data Lake Analytics
Azure Data Lake Storage
Azure Data Lake
Azure Data Lake Storage Service
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
ENTERPRISE GRADE access control, encryption at rest
Optimized for analytic workload PERFORMANCE
Azure Data Lake Store
A hyper scale repository for big data analytics workloads
IN PREVIEW
Data Lake Store: Built for the cloud
Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Native format: Must permit data to be stored in its ‘native format’ to track lineage and for data provenance.
Low latency: Must have low latency for high-frequency operations.
Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, machine learning, etc.). No one analytic framework can work for all data and all types of analysis.
Details: Must be able to store data with all details; aggregation may lead to loss of details.
Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
Reliable: Must be highly available and reliable (no permanent loss of data).
Scalable: Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up.
All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
Four pillars of security and compliance
Authentication: Azure Active Directory, OAuth
Authorization: Role-based access control, POSIX ACLs
Auditing: Audit logs, Forensic analysis
Data Protection: Transparent encryption, Key Mgmt.
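To make the "POSIX ACLs" item under Authorization concrete, here is a minimal Python sketch of how a POSIX-style ACL check can work. It is a simplified illustration, not the Data Lake Store implementation: it ignores the ACL mask and the owning group, and the principal names (`contoso_admin`, `acme_analyst`, `contoso_it`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    scope: str  # "user", "group", or "other"
    name: str   # empty string means the owning user (or "other")
    perms: str  # three characters such as "r-x"


def parse_acl(spec: str) -> list:
    """Parse a comma-separated ACL such as 'user::rwx,user:alice:r-x,other::---'."""
    entries = []
    for part in spec.split(","):
        scope, name, perms = part.strip().split(":")
        entries.append(AclEntry(scope, name, perms))
    return entries


def is_allowed(acl, user, groups, owner, want):
    """Simplified POSIX-style check (assumption: no mask, no owning group).

    Order of evaluation: owner entry, then named-user entry,
    then any matching named-group entry, then 'other'.
    """
    if user == owner:
        return any(want in e.perms for e in acl
                   if e.scope == "user" and e.name == "")
    named = [e for e in acl if e.scope == "user" and e.name == user]
    if named:
        return want in named[0].perms
    group_entries = [e for e in acl if e.scope == "group" and e.name in groups]
    if group_entries:
        return any(want in e.perms for e in group_entries)
    return any(want in e.perms for e in acl if e.scope == "other")


# Hypothetical ACL on a folder of Contoso data.
acl = parse_acl("user::rwx,user:acme_analyst:r-x,group:contoso_it:rwx,other::---")
can_read = is_allowed(acl, "acme_analyst", set(), "contoso_admin", "r")
can_write = is_allowed(acl, "acme_analyst", set(), "contoso_admin", "w")
```

With this ACL the external analyst can read but not write, while any principal without a matching entry falls through to `other` and is denied.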
Scenario: Securing a Big Data pipeline
Social
Clickstream
Web
Contoso
Acme.com
CONTOSO
• Retail company with large market presence.
• Records sales transactions, user interactions on website, social data, etc.
RETAIL ANALYTICS (Acme.com)
• Provides social media-based sentiment analysis.
• Hired by Contoso to:
  • Develop insights from social-media information combined with user activity on Contoso portal.
TASK FOR CONTOSO IT ADMIN:
• Provide Acme.com employees access to Contoso data.
• Allow Acme.com employees to submit U-SQL jobs.
FULLY SUPPORTED Hadoop for the cloud
Available on LINUX and WINDOWS
Works on AZURE STORAGE or DATA LAKE STORE
100% OPEN SOURCE Apache Hadoop (HDP 2.3)
Clusters up and RUNNING IN MINUTES
Use familiar BI TOOLS FOR ANALYSIS like Excel
Azure HDInsight
Hadoop Platform as a Service on Azure
Azure Data Lake Analytics Service
WebHDFS
YARN
U-SQL
ADL Analytics
ADL HDInsight
Store
Hive
Analytics
Storage
Azure Data Lake (Store, HDInsight, Analytics)
ADLA complements HDInsight
Target the same scenarios, tools, and customers
HDInsight
For developers familiar with open source: Java, Eclipse, Hive, etc.
Clusters offer customization, control, and flexibility in a managed Hadoop cluster
ADLA
Enables customers to leverage existing experience with C#, SQL & PowerShell
Offers convenience, efficiency, automatic scale, and management in a “job service” form factor
No limits to SCALE
Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
Optimized to work with ADL STORE
FEDERATED QUERY across Azure data sources
ENTERPRISE GRADE role-based access control and auditing
Pay PER QUERY and scale PER QUERY
Azure Data Lake Analytics
A distributed analytics service built on Apache YARN that dynamically scales to your needs
IN PREVIEW
ADL and SQLDW
XML
JSON
TEXT
Preparation
• Pre-process
• Transpose
• Re-format
Model
• Load
• Transform
• Aggregate
• Consume
High Value Data
Unknown Value Data
Batch
Ad-hoc
Batch
Work across all cloud data
Azure Data Lake Analytics
Azure SQL DW
Azure SQL DB
Azure Storage Blobs
Azure Data Lake Store
SQL DB in an Azure VM
Demo
Show me ADL!
Simplified management and administration
• Web-based management in Azure Portal
• Automate tasks using PowerShell
• Role-based access control with Azure AD
• Monitor service operations and activity
Get started
1. Log in to Azure
2. Create an ADLA account
3. Write and submit an ADLA job with U-SQL (or Hive/Pig)
4. The job reads and writes data from storage
30 seconds
ADLS, Azure Blobs, Azure DB…
Azure Data Lake SDK/CLI
ADL Store (ADLS) feature set (not exhaustive)
Account Management
• Create new account
• List accounts
• Update account properties
• Delete account
Transferring Data
• Upload into store from local disk
• Download from store to local disk
Files and Folders
• List contents of folder
• Create
• Move
• Delete
• Does file exist
Security
• Get ACLs
• Update ACLs
• Get Owner
• Set Owner
File Content
• Set file content
• Append file content
• Get file content
• Merge files
ADL Analytics (ADLA) feature set (not exhaustive)
Account Management
• Create new account
• List accounts
• Update account properties
• Delete account
Data Sources
• Add a data source
• List data sources
• Update data source
• Delete data source
Compute
• List jobs
• Submit job
• Cancel job
Catalog Items
• List items in U-SQL catalog
• Update item
Catalog Secrets
• Create catalog secret
• List catalog secrets
• Delete catalog secrets
SDKs/APIs: development options
ADL .NET SDKs
Azure and ADL REST APIs
ADL PowerShell
ADL XPlat CLI
ADL Node.js SDK
ADL Java SDK
Your application
Five REST API endpoints
Analytics
• Management: Create and manage ADLA accounts
• Jobs: Submit and manage jobs
• Catalog: Explore catalog items
Store
• Management: Create and manage ADLS accounts
• File System (WebHDFS): Upload, download, list, delete, rename, append
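Because the Store file system endpoint is WebHDFS-compatible, each file operation listed above maps onto a WebHDFS `op` parameter. The following Python sketch only builds request URLs and does not send anything. The operation names and HTTP methods follow the Apache Hadoop WebHDFS REST API; the host suffix `azuredatalakestore.net`, the account name `contoso`, and the helper names are assumptions made for illustration and do not come from the slides.

```python
from urllib.parse import quote, urlencode

# Slide operation -> (HTTP method, WebHDFS op), per the Hadoop WebHDFS REST API.
WEBHDFS_OPS = {
    "upload": ("PUT", "CREATE"),
    "download": ("GET", "OPEN"),
    "list": ("GET", "LISTSTATUS"),
    "delete": ("DELETE", "DELETE"),
    "rename": ("PUT", "RENAME"),
    "append": ("POST", "APPEND"),
}


def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for a store account (host suffix is an assumption)."""
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    query = urlencode({"op": op, **params})
    return f"{base}{quote(path.lstrip('/'))}?{query}"


def describe(account, action, path, **params):
    """Return the (method, url) pair for one of the slide's file system actions."""
    method, op = WEBHDFS_OPS[action]
    return method, webhdfs_url(account, path, op, **params)


# Hypothetical account and folder.
method, url = describe("contoso", "list", "/clickstream/2016")
print(method, url)
```

A real call would also carry an `Authorization: Bearer <token>` header obtained from Azure Active Directory.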
Developer landscape: .NET SDKs
Analytics .NET SDK
• Management
• Catalog
• Jobs
Store .NET SDK
• Management
• Filesystem
• Uploader
SDKs NuGet packages
Available from NuGet (nuget.org – search for DataLake)
Microsoft.Azure.Management.DataLake.Store
Microsoft.Azure.Management.DataLake.Analytics
Microsoft.Azure.Management.DataLake.Uploader
Using the SDKs
Workflow
1. Authenticate using OAuth 2.0 grant flow
   Get an OAuth token from Azure Active Directory (Azure AD, AAD)
2. Setup
   Create a service client object
3. Do work
   Call methods on the client object
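The first workflow step, getting an OAuth token from Azure AD, can be sketched in Python with the standard library alone. The slides do not say which grant flow is used, so this sketch assumes the client credentials grant against the Azure AD token endpoint. The tenant, client ID, secret, and the default `resource` value are placeholders or assumptions, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id, client_id, client_secret,
                        resource="https://management.core.windows.net/"):
    """Step 1: build a client-credentials token request for Azure AD.

    Assumption: the service principal flow; interactive flows differ.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": resource,
    }).encode("ascii")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


def auth_header(access_token):
    """Steps 2 and 3: every later call on a client carries the bearer token."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder values; sending the request would need real credentials.
req = build_token_request("my-tenant-id", "my-client-id", "my-secret")
print(req.get_method(), req.full_url)
```

The JSON response to such a request contains an `access_token` field; an SDK client object typically wraps that token so that the calls in step 3 do not handle it directly.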
http://aka.ms/AzureDataLake