Analyzing StackExchange data with Azure Data Lake
TRANSCRIPT
![Page 1: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/1.jpg)
Sponsored & Brought to you by
Analyzing StackExchange data with Azure Data Lake
Tom Kerkhove
http://www.twitter.com/TomKerkhove
https://be.linkedin.com/in/tomkerkhove
![Page 2: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/2.jpg)
Analysing StackExchange data with Azure Data Lake
![Page 3: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/3.jpg)
Nice to meet you
Tom KERKHOVE
➔ Integration Professional
➔ IoT Competency Lead
➔ Windows Development & Microsoft Azure MVP
github.com/tomkerkhove
![Page 4: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/4.jpg)
Agenda
• Why should we care about Big Data?
• Big Data in Azure
• Azure Data Lake
• Demo
• Q & A
![Page 5: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/5.jpg)
![Page 6: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/6.jpg)
Integration of Things
Internet of Things
![Page 7: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/7.jpg)
Connect and scale with efficiency
Analyze and act on new data
Integrate and transform business processes
Business Systems
![Page 8: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/8.jpg)
Event producers & gateways
Ingestion & transformation
Report, Act, Predict
![Page 9: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/9.jpg)
Microsoft Patterns & Practices – IoT Journey
![Page 10: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/10.jpg)
![Page 11: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/11.jpg)
Cluster Management
![Page 12: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/12.jpg)
Languages
![Page 13: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/13.jpg)
Platform Services
Web and Mobile: Web Apps, Mobile Apps, API Management, API Apps, Logic Apps, Notification Hubs
Media & CDN: Content Delivery Network (CDN), Media Services
Integration: BizTalk Services, Hybrid Connections, Service Bus, Storage Queues
Hybrid Operations: Backup, StorSimple, Azure Site Recovery, Import/Export, Azure AD Health Monitoring, AD Privileged Identity Management, Operational Analytics
Data: SQL Database, DocumentDB, Redis Cache, Azure Search, Storage Tables, Data Warehouse
Compute Services: Cloud Services, Batch, RemoteApp, Service Fabric
Developer Services: Visual Studio, App Insights, Azure SDK, VS Online
Analytics & IoT: HDInsight, Machine Learning, Stream Analytics, Data Factory, Event Hubs, Mobile Engagement, Data Lake, IoT Hub, Data Catalog
Security & Management: Azure Active Directory, Multi-Factor Authentication, Automation, Portal, Key Vault, Store/Marketplace, VM Image Gallery & VM Depot, Azure AD B2C, Scheduler, Domain Services
Infrastructure Services
Compute: Virtual Machines, Container Service
Storage: BLOB Storage, Azure Files, Premium Storage
Networking: Virtual Network, ExpressRoute, Traffic Manager, App Gateway, DNS, VPN Gateway, Load Balancer
OS/Server, Compute, Storage
Datacenter Infrastructure (24 Regions, 22 Online)
![Page 14: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/14.jpg)
Overview in Azure
Data Ingestion, Data Storage, Data Pipelines, Machine Learning, Data Analytics
Event Hubs, IoT Hub, Storage, SQL Database, DocumentDB, SQL Data Warehouse, Data Factory, Stream Analytics, HDInsight, Data Lake (Store & Analytics), Virtual Machine
![Page 15: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/15.jpg)
Personal Digital Assistant – Cortana
Perceptual Intelligence
Preconfigured Solutions
Dashboards and Visualizations
Machine Learning and Analytics
Big Data Store
Information Management
Cortana Analytics Suite
![Page 16: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/16.jpg)
![Page 17: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/17.jpg)
Analysing Big Data in Azure
Azure Data Lake Family
HDInsight
• Managed cluster service
• Open-source technology
• Runs on Windows or Linux
Data Lake Store
• Unlimited storage
• WebHDFS Store
Data Lake Analytics
• Managed job service
• U-SQL batch-processing
![Page 18: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/18.jpg)
Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
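As a rough illustration of what "WebHDFS compatible" means in practice, the sketch below builds REST URLs for two standard WebHDFS operations (`LISTSTATUS` and `OPEN`). The account name `examplelake` and the folder layout are hypothetical, and the `azuredatalakestore.net` endpoint is assumed here. Real requests would also need an Azure Active Directory bearer token, which is not shown.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for a Data Lake Store account.

    `account` and `path` are placeholders chosen for this example;
    `op` is a WebHDFS operation name such as LISTSTATUS or OPEN.
    """
    query = urlencode({"op": op, **params})
    # quote() leaves "/" untouched, so folder separators survive encoding.
    clean_path = quote(path.lstrip("/"))
    return f"https://{account}.azuredatalakestore.net/webhdfs/v1/{clean_path}?{query}"


# List a folder, then open a file inside it (hypothetical names).
list_url = webhdfs_url("examplelake", "/stackexchange", "LISTSTATUS")
open_url = webhdfs_url("examplelake", "/stackexchange/Posts.xml", "OPEN")
print(list_url)
print(open_url)
```

Because the store speaks the WebHDFS protocol, tools that already target HDFS over REST can in principle be pointed at URLs of this shape.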
![Page 19: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/19.jpg)
Characteristics
➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has already been transformed
Data Lake vs Data Warehousing
➔ Data Lake
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, turns into a data swamp pretty fast
![Page 21: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/21.jpg)
Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
  ➔ Don’t worry about scale
➔ Written in U-SQL
  ➔ SQL Syntax
  ➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only
![Page 22: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/22.jpg)
Writing U-SQL scripts
Extract from data source by using built-in or custom extractors.
Transform / Analyse the data using SQL-syntax, in-line C# or C# method calls.
Output the result to a data source by using built-in or custom outputters.
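The three steps can be mimicked in a few lines of Python. This is only an analogy of the extract / transform / output flow, not U-SQL and not how Data Lake Analytics runs jobs. The column names (`PostId`, `UserId`) and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def extract(text: str) -> list[dict]:
    """Extract: parse tab-separated text into rows (stand-in for an extractor)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def transform(rows: list[dict]) -> list[tuple[str, int]]:
    """Transform: count posts per user, similar in spirit to GROUP BY + COUNT."""
    counts = Counter(row["UserId"] for row in rows)
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def output(result: list[tuple[str, int]]) -> str:
    """Output: serialise the result as CSV (stand-in for an outputter)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["UserId", "Posts"])
    writer.writerows(result)
    return buf.getvalue()


sample = "PostId\tUserId\n1\talice\n2\tbob\n3\talice\n"
report = output(transform(extract(sample)))
print(report)
```

In a real U-SQL script each stage would be a statement in one script, and the service decides how to parallelise it.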
![Page 23: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/23.jpg)
![Page 24: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/24.jpg)
Data Lake Analytics - Data Sources
[Figure: Azure Data Lake Analytics running U-SQL queries against Azure Storage Blobs, Azure Data Lake Store, Azure SQL Database, Azure SQL Data Warehouse and Azure SQL in VMs; arrows labelled Query and Write]
![Page 25: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/25.jpg)
![Page 26: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/26.jpg)
Meet StackExchange
➔ Over 280 sub-sites
➔ 150+ GB of open-source data
➔ Different kinds of data
  ➔ Posts
  ➔ Users
  ➔ Votes
  ➔ ...
➔ A big data sample data set
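To get a feel for the data before moving it anywhere, a small local script helps. The sketch below assumes the public dump layout in which each file (for example Posts.xml) holds one `<row ... />` element per record with the values stored as attributes; the three sample rows are invented. It streams the file instead of loading it whole, which matters at this size.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny invented sample in the assumed dump layout.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="1" />
</posts>
"""


def count_post_types(source) -> Counter:
    """Stream <row> elements and count them per PostTypeId attribute."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            counts[elem.get("PostTypeId")] += 1
            elem.clear()  # release the row's attributes to keep memory low
    return counts


counts = count_post_types(io.BytesIO(SAMPLE))
print(counts)
```

`count_post_types` also accepts a file path, so the same function could be pointed at a downloaded Posts.xml.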
![Page 27: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/27.jpg)
What Are We Going To Do?
Acquiring The Data
• Downloading the original data set
Moving The Data
• Upload data set to Azure
• Determine what service to use
Visualizing The Data
• Visualize what we’ve learned
![Page 28: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/28.jpg)
Azure Data Lake tools for Visual Studio
➔ Projects / Solutions / Source control
➔ Store Explorer
  ➔ Browse store
  ➔ Download complete / subset of file
  ➔ Preview
➔ Job Visualizer
  ➔ Determine bottlenecks by using heatmaps
  ➔ Playback jobs based on telemetry
➔ Query optimization
➔ Job Profiler
➔ Off-Line execution
![Page 29: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/29.jpg)
Integration with Azure Services
➔ Integrate in your data pipelines in Azure Data Factory
  ➔ Move data from Azure Data Lake Store to another store
  ➔ Move data to Azure Data Lake Store
  ➔ Run U-SQL query within pipeline
➔ Integration with Azure Data Catalog
  ➔ Register your Azure Data Lake Store assets
![Page 30: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/30.jpg)
Pricing
➔ Data Lake Store
  ➔ $0.08/GB stored per month
  ➔ $0.14 per 1M transactions
    • 1 transaction is a block of up to 128 kB
  ➔ Egress will be billed, but the price is not known yet
➔ Data Lake Analytics
  ➔ $0.05 per job
  ➔ $0.05 per minute per Analytics Unit for processing time
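A quick way to sanity-check a budget is to plug the slide's prices into a small calculator. The sketch below uses only the figures quoted on this slide; it assumes transactions are billed pro rata per million and ignores egress, whose price was not announced. The workload numbers in the example are invented.

```python
from decimal import Decimal

# Prices as quoted on the slide (USD).
STORE_PER_GB_MONTH = Decimal("0.08")
STORE_PER_MILLION_TX = Decimal("0.14")
ANALYTICS_PER_JOB = Decimal("0.05")
ANALYTICS_PER_AU_MINUTE = Decimal("0.05")


def store_monthly_cost(gb_stored: int, transactions: int) -> Decimal:
    """Monthly Data Lake Store cost: storage plus transactions (egress excluded)."""
    storage = Decimal(gb_stored) * STORE_PER_GB_MONTH
    tx = Decimal(transactions) / Decimal(1_000_000) * STORE_PER_MILLION_TX
    return storage + tx


def analytics_job_cost(minutes: int, analytics_units: int) -> Decimal:
    """Cost of one Data Lake Analytics job: flat fee plus AU-minutes."""
    processing = ANALYTICS_PER_AU_MINUTE * Decimal(minutes) * Decimal(analytics_units)
    return ANALYTICS_PER_JOB + processing


# Invented workload: 150 GB stored, 2M transactions, one 10-minute job on 5 AUs.
store_cost = store_monthly_cost(150, 2_000_000)
job_cost = analytics_job_cost(10, 5)
print(f"Store: ${store_cost} per month, job: ${job_cost}")
```

`Decimal` is used instead of floats so that cent amounts add up exactly.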
![Page 31: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/31.jpg)
Azure Data Lake Store vs Blob Storage
No Limitations: Store whatever you want in any format
Security: Built-in Azure Active Directory support
Pricing: More expensive than Storage RA-GRS
Redundancy: It’s there but no control over it
Built for Scale: Optimized for high-scale reads
Integration: With Data Factory, Data Catalog & HDInsight
![Page 32: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/32.jpg)
![Page 33: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/33.jpg)
Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
  ➔ Analyse today & explore tomorrow
  ➔ Data Swamps
➔ Data Lake Analytics
  ➔ No cluster management
  ➔ Re-use existing skills
  ➔ Pay for what we use
➔ Big Data in Azure? Azure Data Lake family and it’s easy!
![Page 34: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/34.jpg)
![Page 35: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/35.jpg)
![Page 36: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/36.jpg)
![Page 37: Analyzing StackExchange data with Azure Data Lake](https://reader036.vdocuments.us/reader036/viewer/2022070516/5871641f1a28ab58758b4fad/html5/thumbnails/37.jpg)