Clean Your Data Swamp By Migration off Hadoop


Page 1: Clean Your Data Swamp By Migration off Hadoop

Clean Your Data Swamp By Migration off Hadoop

Page 2: Clean Your Data Swamp By Migration off Hadoop

Speaker

Ron Guerrero, Senior Solutions Architect

Page 3: Clean Your Data Swamp By Migration off Hadoop

Agenda

● Why modernize?
● Planning your migration off of Hadoop
● Top migration topics

Page 4: Clean Your Data Swamp By Migration off Hadoop

Why migrate off of Hadoop and onto Databricks?

Page 5: Clean Your Data Swamp By Migration off Hadoop

History of Hadoop

● Created in 2005
● Open source distributed processing and storage platform running on commodity hardware
● Originally consisted of HDFS and MapReduce, but now incorporates numerous open source projects (Hive, HBase, Spark)
● On-prem and in the cloud

Page 6: Clean Your Data Swamp By Migration off Hadoop

Today Hadoop is very hard

COMPLEX
● Many tools: need to understand multiple technologies.
● Real-time and batch ingestion to build AI models requires integrating many components.

FIXED
● 24/7 clusters.
● Fixed capacity: CPU + RAM + disk.
● Costly to upgrade.

MAINTENANCE INTENSIVE
● The Hadoop ecosystem is complex, hard to manage, and prone to failures.

The result: slow innovation, prohibitive cost, and low productivity.

Page 7: Clean Your Data Swamp By Migration off Hadoop

Enterprises Need a Modern Data Analytics Architecture

CRITICAL REQUIREMENTS

Cost-effective scale and performance in the cloud

Easy to manage and highly reliable for diverse data

Predictive and real-time insights to drive innovation

Page 8: Clean Your Data Swamp By Migration off Hadoop

[Diagram: the Lakehouse Platform. Data Engineering, BI & SQL Analytics, Real-time Data Applications, and Data Science & Machine Learning sit on top of Data Management & Governance and an Open Data Lake that handles structured, semi-structured, unstructured, and streaming data. SIMPLE, OPEN, COLLABORATIVE.]

Page 9: Clean Your Data Swamp By Migration off Hadoop

Planning your migration off of Hadoop and onto Databricks

Page 10: Clean Your Data Swamp By Migration off Hadoop

Migration Planning
● Internal Questions
● Assessment
● Technical Planning
● Enablement and Evaluation
● Migration Execution

Page 11: Clean Your Data Swamp By Migration off Hadoop

Migration Planning: Internal Questions
● Why?
● Who?
● Desired start and end dates
● Internal stakeholders
● Cloud strategy

Page 12: Clean Your Data Swamp By Migration off Hadoop

Migration Planning: Assessment
● Environment inventory
  ○ Compute, data, tooling
● Use case prioritization
● Workload analysis
● Existing TCO
● Projected TCO
● Migration timelines

Page 13: Clean Your Data Swamp By Migration off Hadoop

Migration Planning: Technical Planning
● Target state architecture
● Data migration
● Workload migration
  ○ Lift and shift, transformative, hybrid
● Data governance approach
● Automated deployment
● Monitoring and operations

Page 14: Clean Your Data Swamp By Migration off Hadoop

Migration Planning: Enablement and Evaluation
● Workshops, technical deep dives
● Training
● Proof of technology / MVP
  ○ Validate assumptions and designs

Page 15: Clean Your Data Swamp By Migration off Hadoop

Migration Planning: Migration Execution
● Environment deployment
● Iterate over use cases
  ○ Data migration
  ○ Workload migration
  ○ Dual production deployment (old and new)
  ○ Validation
  ○ Cut-over and decommission of Hadoop

Page 16: Clean Your Data Swamp By Migration off Hadoop

Top Migration Topics

Page 17: Clean Your Data Swamp By Migration off Hadoop

Key Areas of Migration
1. Administration
2. Data Migration
3. Data Processing
4. Security & Governance
5. SQL and BI Layer

Page 18: Clean Your Data Swamp By Migration off Hadoop

Administration

Page 19: Clean Your Data Swamp By Migration off Hadoop

Hadoop Ecosystem to Databricks Concepts (Hadoop)

[Diagram: a typical Hadoop cluster, Node 1 through Node N. Each node combines local HDFS disks with YARN-managed services (Impala, HBase, MapReduce mappers, Spark workers and driver) sharing a fixed pool of cores (e.g. 2x12 = 24 cores of compute per node). Shared services include the Hive Metastore, Hive Server, Impala (load balancer), HBase APIs, JDBC/ODBC endpoints, and Sentry (table metadata + HDFS ACLs) or Ranger (policy-based access control).]

Node makeup
▪ Local disks
▪ Cores/memory carved up to services
▪ Submitted jobs compete for resources
▪ Services constrained to accommodate resources

Metadata and Security
▪ Sentry table metadata permissions combined with syncing HDFS ACLs, OR
▪ Apache Ranger, policy-based access control

Endpoints
▪ Direct access to HDFS / copied dataset
▪ Hive (on MR or Spark) accepts incoming connections
▪ Impala for interactive queries
▪ HBase APIs as required

Page 20: Clean Your Data Swamp By Migration off Hadoop

Hadoop Ecosystem to Databricks Concepts (Hadoop → Databricks)

[Diagram: the Hadoop cluster above mapped to Databricks. A managed Hive Metastore and a SQL endpoint (JDBC/ODBC, high-concurrency cluster for SQL Analytics) replace Hive Server and Impala; ephemeral clusters for all-purpose or jobs workloads (Spark driver and workers with Delta Engine) run Spark ETL (batch/streaming), SQL Analytics, and the ML Runtime; Table ACLs and object storage ACLs replace Sentry/Ranger table metadata and HDFS ACLs; cloud object storage replaces HDFS; CosmosDB/DynamoDB/Keyspaces stand in for HBase.]

Page 21: Clean Your Data Swamp By Migration off Hadoop

Hadoop Ecosystem to Databricks Concepts (Databricks)

[Diagram: the Databricks side. A managed Hive Metastore, a SQL endpoint (JDBC/ODBC) served by a high-concurrency cluster for SQL Analytics, and ephemeral or long-running clusters for all-purpose or jobs workloads (Spark driver and workers with Delta Engine) running Spark ETL (batch/streaming), SQL Analytics, and the ML Runtime, governed by Table ACLs over cloud object storage.]

Node makeup
▪ Each node (VM) maps to a single Spark driver/worker
▪ A cluster of nodes is completely isolated from other jobs/compute
▪ De-coupled compute and storage

Metadata and Security
▪ Managed Hive metastore (other options available)
▪ Table ACLs (Databricks) and object storage permissions

Endpoints
▪ SQL endpoint (JDBC/ODBC) for both advanced analytics and simple SQL analytics
▪ Code access to data - notebooks
▪ HBase → maps to Azure CosmosDB, AWS DynamoDB/Keyspaces (non-Databricks solution)

Page 22: Clean Your Data Swamp By Migration off Hadoop

Demo - Administration

Page 23: Clean Your Data Swamp By Migration off Hadoop

Data Migration

Page 24: Clean Your Data Swamp By Migration off Hadoop

Data Migration

- On-premises block storage.
- Fixed disk capacity.
- Health checks to validate data integrity.
- As data volumes grow, must add more nodes to the cluster and rebalance data.

MIGRATE

- Fully managed cloud object storage.
- Unlimited capacity.
- No maintenance, no health checks, no rebalancing.
- 99.99% availability, 99.9999999% durability.
- Use native cloud services to migrate data.
- Leverage partner solutions.

Page 25: Clean Your Data Swamp By Migration off Hadoop

Data Migration: Build a Data Lake in cloud storage with Delta Lake

● Open source and uses the Parquet file format.
● Performance: data indexing → faster queries.
● Reliability: ACID transactions → guaranteed data integrity.
● Scalability: handles petabyte-scale tables with billions of partitions and files with ease.
● Enhanced Spark SQL: UPDATE, MERGE, and DELETE commands (see the sketch below).
● Unify batch and stream processing → no more Lambda architecture.
● Schema enforcement: specify schema on write.
● Schema evolution: automatically change schemas on the fly.
● Audit history: full audit trail of changes.
● Time travel: restore data from past versions.
● 100% compatible with the Apache Spark API.
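A minimal PySpark sketch of two of these capabilities, MERGE and time travel, on a Delta table. The paths, update feed, and join key are hypothetical placeholders, not values from the deck.

```python
# Sketch only: paths ("/mnt/lake/events", "/mnt/landing/events_updates") and the
# join key (event_id) are illustrative placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

updates = spark.read.format("json").load("/mnt/landing/events_updates/")

# Upsert the incoming records into the Delta table (UPDATE + INSERT in one MERGE).
target = DeltaTable.forPath(spark, "/mnt/lake/events")
(target.alias("t")
       .merge(updates.alias("u"), "t.event_id = u.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Time travel: read an earlier version of the table for audits or rollback checks.
previous = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/events")
```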

Page 26: Clean Your Data Swamp By Migration off Hadoop

Start with Dual ingestion

● Add a feed to cloud storage

● Enable new use cases with new data

● Introduces options for backup (a dual-write sketch follows below)
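As an illustration of dual ingestion, here is a hedged Structured Streaming sketch that lands one feed in both the existing HDFS landing zone and a new cloud-storage Delta path. The Kafka broker, topic, and all paths are assumed placeholders; the deck does not prescribe this specific pipeline.

```python
# Hypothetical dual-ingestion job: every micro-batch is written to the on-prem
# HDFS landing zone and to cloud object storage as Delta.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker.corp.internal:9092")  # placeholder broker
            .option("subscribe", "clickstream")                              # placeholder topic
            .load())

def write_both(batch_df, batch_id):
    # Keep feeding the existing on-prem lake during the transition...
    batch_df.write.mode("append").parquet("hdfs://namenode:8020/landing/clickstream")
    # ...and land the same data in cloud storage, enabling new use cases and backup.
    batch_df.write.format("delta").mode("append").save("s3://corp-lake/landing/clickstream")

(raw.writeStream
    .foreachBatch(write_both)
    .option("checkpointLocation", "s3://corp-lake/_checkpoints/clickstream")
    .start())
```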

Page 27: Clean Your Data Swamp By Migration off Hadoop

How to migrate data

● Leverage existing Data Delivery tools to point to cloud storage

● Introduce simplified flows to land data into cloud storage

Page 28: Clean Your Data Swamp By Migration off Hadoop

How to migrate data
● Push the data
  ○ DistCp
  ○ 3rd-party tooling
  ○ In-house frameworks
  ○ Cloud native: Snowmobile, Azure Data Box, Google Transfer Appliance
  ○ Typically easier to approve (security)
● Pull the data (see the sketch below)
  ○ Spark Streaming
  ○ Spark Batch
    ■ File ingest
    ■ JDBC
  ○ 3rd-party tooling
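A minimal sketch of the pull approach with a Spark batch file ingest: a Databricks job reads files straight from the on-prem HDFS namenode and writes them to Delta in cloud storage. It assumes network connectivity to the on-prem cluster (covered on the next slide) and uses placeholder hostnames and paths.

```python
# Pull-style batch migration sketch; hostname and paths are placeholders, and
# connectivity to the on-prem HDFS namenode is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the existing dataset directly from HDFS over the private link.
orders = spark.read.parquet("hdfs://namenode.corp.internal:8020/warehouse/sales/orders")

# Land it as a Delta table in cloud object storage.
(orders.write
       .format("delta")
       .mode("overwrite")
       .save("abfss://lake@corpstorage.dfs.core.windows.net/bronze/orders"))
```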

Page 29: Clean Your Data Swamp By Migration off Hadoop

How to migrate data - Pull approach
● Set up connectivity to on-premises
  ○ AWS Direct Connect
  ○ Azure ExpressRoute / VPN Gateway
  ○ This may be needed for some use cases
● Kerberized Hadoop environments
  ○ Databricks cluster initialization scripts
    ■ Kerberos client setup
    ■ krb5.conf, keytab
    ■ kinit()
● Shared external metastore
  ○ Databricks and Hadoop can share a metastore (see the configuration sketch below)
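For the shared external metastore item, this is a hedged sketch of the Spark configuration a Databricks cluster might carry to point at an existing Hive metastore database. The metastore version, JDBC URL, driver, and secret reference are assumptions to adapt, not values from the deck.

```python
# Illustrative cluster Spark config for attaching to an existing external Hive
# metastore. Every value here is a placeholder; the password should come from a
# Databricks secret scope rather than plain text.
shared_metastore_conf = {
    "spark.sql.hive.metastore.version": "2.3.9",
    "spark.sql.hive.metastore.jars": "builtin",
    "spark.hadoop.javax.jdo.option.ConnectionURL":
        "jdbc:mysql://metastore-db.corp.internal:3306/hive_metastore",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "hive",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "{{secrets/hive/metastore-password}}",
}
# Set these as the cluster's Spark configuration so Databricks jobs and notebooks
# resolve the same databases and tables that Hive on Hadoop already registers.
```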

Page 30: Clean Your Data Swamp By Migration off Hadoop

Demo - Databricks Pull

Page 31: Clean Your Data Swamp By Migration off Hadoop

Data Processing

Page 32: Clean Your Data Swamp By Migration off Hadoop

Technology Mapping

Page 33: Clean Your Data Swamp By Migration off Hadoop

Migrating Spark Jobs

● Spark versions

● RDD to DataFrames (see the sketch below)

● Changes to submission

● Hard-coded references to the Hadoop environment
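An illustrative before/after for the RDD-to-DataFrames and hard-coded-path items. The legacy logic, column position, and paths are invented for the example; real jobs will differ.

```python
# The legacy RDD job might look roughly like this on Hadoop:
#   rdd = sc.textFile("hdfs://namenode:8020/logs/app")
#   counts = rdd.map(lambda line: (line.split(",")[2], 1)).reduceByKey(lambda a, b: a + b)
#
# A DataFrame rewrite that also drops the hard-coded HDFS reference
# (paths and column positions are placeholders):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

logs = spark.read.csv("s3://corp-lake/bronze/app_logs")          # headerless CSV -> _c0, _c1, ...
counts = logs.groupBy(F.col("_c2").alias("status_code")).count()
```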

Page 34: Clean Your Data Swamp By Migration off Hadoop

Converting non-Spark workloads

● MapReduce

● Sqoop (see the sketch below)

● Flume

● NiFi considerations
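For example, a Sqoop import commonly maps to a Spark JDBC read. The connection string, table, partition bounds, and output path below are hypothetical, and the JDBC driver jar would need to be attached to the cluster.

```python
# Sqoop-to-Spark sketch. A legacy job such as
#   sqoop import --connect jdbc:oracle:thin:@db:1521/ORCL --table CUSTOMERS ...
# becomes a parallel JDBC read written out as Delta. All values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

customers = (spark.read.format("jdbc")
                  .option("url", "jdbc:oracle:thin:@db.corp.internal:1521/ORCL")
                  .option("dbtable", "CUSTOMERS")
                  .option("user", "etl_user")
                  .option("password", "***")                 # use a secret scope in practice
                  .option("partitionColumn", "CUSTOMER_ID")  # parallel reads, like Sqoop mappers
                  .option("lowerBound", 1)
                  .option("upperBound", 10000000)
                  .option("numPartitions", 8)
                  .load())

customers.write.format("delta").mode("overwrite").save("s3://corp-lake/bronze/customers")
```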

Page 35: Clean Your Data Swamp By Migration off Hadoop

Migrating HiveQL

● Hive queries have high compatibility

● Minor changes in DDL (see the example below)

● SerDes and UDFs
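To illustrate the kind of minor DDL change involved, here is a hedged example of a Parquet-backed Hive external table re-declared as a Delta table with Spark SQL. The schema, database, and location are placeholders.

```python
# Hive DDL as it might exist on Hadoop (comment only, placeholder schema):
#   CREATE EXTERNAL TABLE sales.orders (order_id BIGINT, amount DOUBLE, order_date STRING)
#   STORED AS PARQUET LOCATION '/warehouse/sales/orders';
#
# Spark SQL / Delta equivalent on Databricks:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS sales.orders (
    order_id   BIGINT,
    amount     DOUBLE,
    order_date STRING
  )
  USING DELTA
  PARTITIONED BY (order_date)
  LOCATION 's3://corp-lake/silver/sales/orders'
""")
```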

Page 36: Clean Your Data Swamp By Migration off Hadoop

Migration Workflow Orchestration

● Create Airflow, Azure Data Factory, or other equivalents

● The Databricks REST APIs allow integration with any scheduler (see the sketch below)
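A minimal sketch of that REST integration: any scheduler that can make an HTTP call can trigger a Databricks job through the Jobs API run-now endpoint. The workspace URL, token, and job id are placeholders.

```python
# Trigger an existing Databricks job from an external scheduler via the Jobs API.
# Host, token, and job id are placeholders; store the token in the scheduler's
# secret store rather than in code.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 123

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
resp.raise_for_status()
print("Triggered run:", resp.json()["run_id"])
```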

Page 37: Clean Your Data Swamp By Migration off Hadoop

Automated Tooling

● MLens

○ PySpark
○ HiveQL
○ Oozie to Airflow, Azure Data Factory (roadmap)

Page 38: Clean Your Data Swamp By Migration off Hadoop

Security and Governance

Page 39: Clean Your Data Swamp By Migration off Hadoop

Security and Governance

Authentication
- Single Sign-On (SSO) with a corporate directory that supports SAML 2.0.

Authorization
- Access Control Lists (ACLs) for Databricks RBAC.
- Table ACLs.
- Dynamic views for column/row permissions (see the sketch below).
- Leverage cloud-native security: IAM federation and AAD passthrough.
- Integration with Ranger and Immuta for more advanced RBAC and ABAC.

Metadata Management
- Integration with 3rd-party services (e.g. AWS Glue).
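As a hedged sketch of the dynamic-view item, the SQL below masks a column and filters rows based on group membership using Databricks' is_member() function. The table, view, group names, and region filter are invented for illustration.

```python
# Dynamic view sketch: column masking plus a row-level filter. All identifiers
# (secure.customers_v, silver.customers, pii_readers, admins) are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
  CREATE OR REPLACE VIEW secure.customers_v AS
  SELECT
    customer_id,
    CASE WHEN is_member('pii_readers') THEN email ELSE '***REDACTED***' END AS email,
    region
  FROM silver.customers
  WHERE is_member('admins') OR region = 'EMEA'   -- example row-level restriction
""")
```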

Page 40: Clean Your Data Swamp By Migration off Hadoop

Privacera

Page 41: Clean Your Data Swamp By Migration off Hadoop

Migrating Security Policies from Hadoop to Databricks

Enabling enterprises to responsibly use their data in the cloud. Powered by Apache Ranger.

Page 42: Clean Your Data Swamp By Migration off Hadoop

HADOOP ECOSYSTEM
● 100s and 1000s of tables in Apache Hive
● 100s of policies in Apache Ranger
● Variety of policies: resource-based, tag-based, masking, row-level filters, etc.
● Policies for users and groups from AD/LDAP

Page 43: Clean Your Data Swamp By Migration off Hadoop

PRIVACERA AND DATABRICKS

[Diagram: datasets, schemas, and policies carried over from the Hive Metastore on Hadoop to the metastore on Databricks.]

Page 44: Clean Your Data Swamp By Migration off Hadoop

SEAMLESS MIGRATION

INSTANTLY TRANSFER YEARS OF EFFORT

INSTANTLY IMPLEMENT THE SAME POLICIES IN DATABRICKS AS ON-PREM

Page 45: Clean Your Data Swamp By Migration off Hadoop

Privacera Value Add - Enhancing Databricks Authorization

● Richer, deeper, and more robust access control
● Row/column-level access control in SQL
● Dynamic and static data de-identification
● File-level access control for DataFrames, object-level access
● Read/write operations supported

Object Store (S3/ADLS)        Privacera + Databricks
S3 - Bucket Level             Y
S3 - Object Level             Y
ADLS                          Y

Spark SQL and R               Privacera + Databricks
Table                         Y
Column                        Y
Column Masking                Y
Row Level Filtering           Y
Tag Based Policies            Y
Attribute Based Policies      Y
Centralized Auditing          Y

Page 46: Clean Your Data Swamp By Migration off Hadoop

[Diagram: Privacera with Databricks. A Ranger plugin on the Spark driver of each Databricks SQL/Python cluster enforces policies from the Ranger Policy Manager running in Privacera Cloud. Privacera Portal, Discovery, Approval Workflow, and Anomaly Detection and Alerting integrate with AD/LDAP and 3rd-party catalogs, while the Privacera Audit Server streams audit events through Apache Kafka to Solr, Splunk, CloudWatch, or a SIEM. Business and admin users reach the data through Spark SQL and/or Spark read/write on Databricks clusters.]

Page 47: Clean Your Data Swamp By Migration off Hadoop
Page 48: Clean Your Data Swamp By Migration off Hadoop

SQL and BI

Page 49: Clean Your Data Swamp By Migration off Hadoop

What about the SQL community?

Hadoop
● HUE
  ○ Data browsing
  ○ SQL editor
  ○ Visualizations
● Interactive SQL
  ○ Impala
  ○ Hive LLAP

Databricks
● SQL Analytics workspace
  ○ Data browser
  ○ SQL editor
  ○ Visualizations
● Interactive SQL
  ○ Spark optimizations - Adaptive Query Execution
  ○ Advanced caching
  ○ Project Photon
  ○ Scaling cluster of clusters

Page 50: Clean Your Data Swamp By Migration off Hadoop

SQL & BI Layer: Optimized SQL and BI

Performance
- Fast queries with Delta Engine on Delta Lake.
- Support for high concurrency with auto-scaling clusters.
- Optimized JDBC/ODBC drivers.

Tuned
- Optimized and tuned for BI and SQL out of the box.

BI Integrations
- Compatible with any BI client and tool that supports Spark.

Page 51: Clean Your Data Swamp By Migration off Hadoop

Vision

Give SQL users a home in Databricks
Provide a SQL workbench, light dashboarding, and alerting capabilities.

Great BI experience on the data lake
Enable companies to effectively leverage the data lake from any BI tool without having to move the data around.

Easy to use & price-performant
Minimal setup & configuration. Data lake price performance.

Page 52: Clean Your Data Swamp By Migration off Hadoop

SQL-native user interface for analysts

▪ Familiar SQL editor
▪ Auto-complete
▪ Built-in visualizations
▪ Data browser

▪ Automatic alerts
▪ Trigger based upon values
▪ Email or Slack integration

▪ Dashboards
▪ Simply convert queries to dashboards
▪ Share with access control

Page 53: Clean Your Data Swamp By Migration off Hadoop

Built-in connectors for existing BI tools

Other BI & SQL clients that support

▪ Supports your favorite tool
▪ Connectors for top BI & SQL clients
▪ Simple connection setup
▪ Optimized performance

▪ OAuth & Single Sign-On
▪ Quick and easy authentication experience. No need to deal with access tokens.

▪ Power BI: available now
▪ Others coming soon

Page 54: Clean Your Data Swamp By Migration off Hadoop

Performance

Delta Metadata Performance

Improved read performance for cold queries on Delta tables. Provides interactive metadata performance regardless of # of Delta tables in a query or table sizes.

New ODBC / JDBC Drivers

Wire protocol re-engineered to provide lower latencies & higher data transfer speeds:

▪ Lower latency / less overhead (~¼ sec) with reduced round trips per request
▪ Higher transfer rate (up to 50%) using Apache Arrow
▪ Optimized metadata performance for ODBC/JDBC APIs (up to 10x for metadata retrieval operations)

Photon - Delta Engine [Preview]

New MPP engine built from scratch in C++. Vectorized to exploit data level parallelism and instruction-level parallelism. Optimized for modern structured and semi-structured workloads.

Page 55: Clean Your Data Swamp By Migration off Hadoop

Summary

Page 56: Clean Your Data Swamp By Migration off Hadoop

It all starts with a plan
● Databricks and our partner community can help you:
  ○ Assess
  ○ Plan
  ○ Validate
  ○ Execute

Page 57: Clean Your Data Swamp By Migration off Hadoop

Considerations for your migration to Databricks
● Administration
● Data Migration
● Data Processing
● Security & Governance
● SQL and BI Layer

Page 58: Clean Your Data Swamp By Migration off Hadoop

Next Steps

Page 59: Clean Your Data Swamp By Migration off Hadoop

Next Steps
● You will receive a follow-up email from our teams

● Let us help you with your Hadoop Migration Journey

Page 60: Clean Your Data Swamp By Migration off Hadoop

Follow up materials - Useful links

Page 61: Clean Your Data Swamp By Migration off Hadoop

Databricks Reference Architecture

Page 62: Clean Your Data Swamp By Migration off Hadoop

Databricks Azure Reference Architecture

Page 63: Clean Your Data Swamp By Migration off Hadoop

Databricks AWS Reference Architecture

Page 64: Clean Your Data Swamp By Migration off Hadoop

Demo