Data Lake Organization
A Hadoop Eco-System
Jan Cordtz, Microsoft Denmark
Cloud Solution Architect
42 Azure regions
Hyper scale
Infrastructure
100+ datacenters across 42 regions worldwide
Learn more: Microsoft.com/datacenter
Top 3 networks in the world
Azure Search
Hybrid Cloud
Backup
StorSimple
Azure Site Recovery
Import/Export
Azure AD Health Monitoring
AD Privileged Identity Management
Operational Analytics
Domain Services
SQL Database
DocumentDB
Redis Cache
Storage Tables
SQL Data Warehouse
SQL Server Stretch Database
Visual Studio
Application Insights
VS Team Services
Xamarin
HockeyApp
Mobile Engagement
Cognitive Services
Bot Framework
Cortana
Security & Management
Azure Active Directory
Multi-Factor Authentication
Automation
Portal
Key Vault
Store/Marketplace
VM Image Gallery & VM Depot
Azure AD B2C
Scheduler
Security Center
Web Apps
Mobile Apps
API Apps
Notification Hubs
Cloud Services
Service Fabric
Functions
Batch
RemoteApp
Container Service
VM Scale Sets
BizTalk Services
Service Bus
Logic Apps
API Management
Content Delivery Network
Media Services
Media Analytics
HDInsight
Machine Learning
Stream Analytics
Data Factory
Event Hubs
Data Lake Analytics Service
IoT Hub
Data Catalog
Power BI Embedded
Data Lake Store
Data Center
Infrastructure as a Service
Platform as a Service
Trusted
HIPAA / HITECH Act
FERPA
GxP 21 CFR Part 11
ISO 27001
SOC 1 Type 2
ISO 27018
CSA STAR Self-Assessment
Singapore MTCS
UK G-Cloud
Australia IRAP/CCSL
FISC Japan
New Zealand GCIO
China GB 18030
EU Model Clauses
ENISA IAF
Argentina PDPA
Japan CS Mark Gold
CDSA
Shared Assessments
Japan My Number Act
FACT UK
GLBA
Spain ENS
PCI DSS Level 1
MARS-E
FFIEC
China TRUCS
SOC 2 Type 2
SOC 3
Canada Privacy Laws
MPAA
Privacy Shield
ISO 22301
India MeitY
Germany IT Grundschutz workbook
Spain DPA
CSA STAR Certification
CSA STAR Attestation
HITRUST
IG Toolkit UK
China DJCP
ISO 27017
Categories: Global, Industry, Regional
More certifications than any cloud provider
Analytics, Data, Cloud
How to (easily) disrupt a Data Warehouse
Digital transformation/disruption
Planning is dead
Open and hybrid
How to use and organize….
How to do it?
Innovation - Mode 2
[Diagram labels: Dev, Test, Pre-Prod, Prod; Functionality; Data Lake; SQL; Cube; Repository like VSTS; Blob/Disk; Data Ingestion]
Storage – from a functionality point of view
File storage, Database, Cube, Data Lake
Functionality and cost
ETL, Data Factory
Principles for the organization
• Is very simple to use for an end user or application.
• Is as cost-effective as sensible/possible.
• Does not compromise security.
• Fits well into a DevOps scenario.
• Supports both of Gartner's mode 1 and mode 2 implementation scenarios.
• Has a well-defined path for the information needed to support effective auditing and logging.
Copy
Organizing the Azure Data Lake
Azure Data Lake
Landing Zone
Work
Landing Zone System Account(s): read/write
Work System Account(s): read
Work System Account(s): read/write
Publish
Users in Groups
Read/Write
Read Only, except Work System Account(s)
Folder per "something"
Analytics
Users in Groups
Read Only, except Work System Account(s)
Read/Write "All data"
Transform
Transform & Anonymize
Archive
Data Catalog
Data Ingestion
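The zone layout above can be read as a small access matrix. The following is a minimal sketch in Python, assuming one reading of the diagram: which permission line belongs to which zone is an interpretation, and the principal names (`landing_system`, `work_system`, `user_groups`) and the `can` helper are hypothetical, not part of any Azure API.

```python
# Hypothetical access matrix for the zones named on the slide.
# The zone-to-permission mapping is one interpretation of the diagram.
READ, WRITE = "read", "write"

ZONE_ACCESS = {
    # Landing zone: its own system accounts write, work accounts only read.
    "landing": {"landing_system": {READ, WRITE}, "work_system": {READ}},
    # Work zone: only the work system accounts operate here.
    "work": {"work_system": {READ, WRITE}},
    # Publish and Analytics: read only, except the work system accounts.
    "publish": {"work_system": {READ, WRITE}, "user_groups": {READ}},
    "analytics": {"work_system": {READ, WRITE}, "user_groups": {READ}},
}


def can(principal: str, action: str, zone: str) -> bool:
    """Return True if the principal holds the action on the zone."""
    return action in ZONE_ACCESS.get(zone, {}).get(principal, set())


if __name__ == "__main__":
    print(can("user_groups", WRITE, "publish"))
    print(can("work_system", WRITE, "publish"))
```

Keeping the matrix as plain data makes it easy to review in a repository and to compare against the permissions actually set on the lake.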
Data Ingestion
"Gatekeeper"
Validation
Standardization
SSIS
Data Factory
Database
FTP
File Storage
Firewall
AD control
Items like: date formats (yyyymmdd), number formats (",." or ".,")
"Are you allowed to come in?"
"Is the content you are bringing in accordance with what we have agreed?"
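The standardization step of the gatekeeper can be sketched as follows. This is a minimal Python illustration, assuming incoming dates arrive in one of a few known formats and that each source declares its decimal separator. The function names `standardize_date` and `standardize_number` and the list of accepted input formats are assumptions for illustration, not part of SSIS or Data Factory.

```python
from datetime import datetime
from decimal import Decimal

# Assumed input formats a source system might deliver; adjust per agreement.
ACCEPTED_DATE_FORMATS = ("%Y%m%d", "%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y")


def standardize_date(text: str) -> str:
    """Return the date as yyyymmdd, or raise ValueError if no format matches."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def standardize_number(text: str, decimal_sep: str) -> Decimal:
    """Parse a number whose decimal separator is ',' or '.'.

    The other character of the pair is treated as the thousands separator.
    """
    if decimal_sep not in (",", "."):
        raise ValueError("decimal_sep must be ',' or '.'")
    thousands_sep = "." if decimal_sep == "," else ","
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)
```

Rejecting values that match no agreed format, rather than guessing, is what lets the gatekeeper answer the second question: whether the content is in accordance with what was agreed.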
Organize an area in the lake
Example: "by department"
Publish
Users in Groups
Read/Write
Read Only
Directory per "department"
Organization top level
Department A, Function Area A
Department B
Department C
User level
User A, Own Directory A
User B
User C
Read Only
Read/Write
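A directory-per-department layout with per-user directories can be sketched as a few path-building rules. The following Python sketch assumes a root folder named `/publish` and lower-cased folder names; those naming choices and the helper names are hypothetical. It encodes one reading of the slide: users get read-only access at the organization level and read/write access in their own directory.

```python
from posixpath import join

PUBLISH_ROOT = "/publish"  # assumed root folder name


def department_dir(department: str) -> str:
    """Organization-level directory shared by a department."""
    return join(PUBLISH_ROOT, "departments", department.lower())


def user_dir(user: str) -> str:
    """User-level directory owned by a single user."""
    return join(PUBLISH_ROOT, "users", user.lower())


def access_for(user: str, path: str) -> str:
    """Return 'read/write' inside the user's own directory, else 'read only'."""
    own = user_dir(user)
    # Compare on a path boundary so 'usera' does not match 'userab'.
    if path == own or path.startswith(own + "/"):
        return "read/write"
    return "read only"
```

For example, `access_for("UserA", user_dir("UserA"))` yields read/write, while the same user on `department_dir("Finance")` yields read only.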
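[Diagram: three Prod and Dev/test layouts, labeled A, B, and C. One shows a data lake and a DW in each environment, with a Mirror; one shows a single data lake with a DW per environment; one shows two data lakes with a DW per environment.]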
What is Azure Data Lake then….
Built on Open Standards
Built on YARN
Store lets all HDFS-compliant analytic applications, such as those from Hortonworks, Cloudera, and MapR, connect to it
HDInsight is 100% Apache Hadoop
Microsoft continues to contribute tens of thousands of lines of code and engineering hours to open source
HDFS
YARN
U-SQL
Analytics Service
HDInsight
HDFS
Store
Any type of analytics: batch, streaming, interactive
Batch, interactive, streaming, machine learning
Allows for exploratory analytics over your data
Do analytics with Hadoop and Microsoft solutions
Azure Data Lake analytics
Store data of any size, optimized for high performance
Store has no fixed limits on file sizes (PB-sized files)
Ultra-fast read/write access
No code rewrite as you increase the size of data stored
Optimized for large analytic systems, with massive throughput
Optimized for IoT, with a high volume of small writes
[Figure labels: GB, TB, PB]
U-SQL
Be productive with U-SQL, a simple and powerful language
Simple and familiar, easily extensible
Unifies the declarative nature of SQL with the expressive power of C#
Familiar syntax to millions of .NET developers
Manage and secure your data assets
Auditing, alerting, access control - all from within a single web-based portal
Azure Active Directory integration for identity and access management
Cortana Intelligence Suite
Microsoft Data solution Overview (CIS)
People
Automated Systems
Apps
Web
Mobile
Bots
Intelligence
Capabilities
Dashboards & Visualizations
Cortana
Bot Framework
Cognitive Services
Power BI
Data Sources
Apps
Sensors and devices
Machine Learning and Analytics
HDInsight (Hadoop and Spark)
Stream Analytics
Data Lake Analytics
Machine Learning
Action
Data
Intelligence
Information Management
Event Hubs
Data Catalog
Data Factory
Big Data Stores
SQL Data Warehouse
Data Lake Store
Thank you