best practices in migrating from on-premises data

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Best practices in migrating from on-premises data warehouses to Amazon Redshift

A N T 3 2 3 - R

Matt Scaer

Principal DW Specialist Solutions Architect

Amazon Web Services

Thiyagarajan Arumugam

DW Specialist Solutions Architect

Amazon Web Services

Agenda

• Architecture and concepts

• Scaling options

• Migration techniques

• Migration best practices

• Additional resources

• Open Q&A

Related breakouts

ANT237-R - Zero to hero: Quick starts with Amazon Redshift

ANT335-R - Build data analytics stacks with Amazon Redshift, featuring Warner Bros.

ANT338-R - Best practices for using popular BI tools with Amazon Redshift

ANT404-R - Build a single query to analyze data across Amazon Redshift & Amazon S3

DAT363-R - Migrating your data warehouses to Amazon Redshift

Amazon Redshift benefits10s of thousands of customers use Amazon Redshift & process exabytes of data per day

3x faster than the most popular alternative

Up to 75% less than other cloud data warehouses & predictable costs

AWS Lake Formation catalog & security,exabyte querying (Redshift Spectrum & federated), AWS integrated (AWS DMS,

Amazon CloudWatch)

AWS grade security, (e.g., Amazon VPC, encryption with AWS KMS, AWS

CloudTrail), certifications such as SOC, PCI, DSS, ISO, FedRAMP, HIPAA

Easy to provision & manage, automated backups, AWS support, 99.9% SLAs

Virtually unlimited elastic linear scaling

Amazon

Redshift is the most popular

cloud data warehouse

Moving to Amazon Redshift enables you run a 24x7 data warehouse cost effectively

Amazon Redshift is the most cost effective

Fastest Most cost effective

up to 75%

Use RI to reduce price up to 75% with Reserved Instances (RIs)

You can start small for just $0.25 per hour and scale to $1000 per terabyte per year,

less than one-tenth the cost of other solutions

Your Amazon Redshift data warehouse and Amazon S3 data lake enable all your workloads

Amazon Redshift integrates with your data lake

Fastest Most cost effective Integrates with your data lake

Exabyte scale

Store and analyze relational and nonrelational data

Purpose-built analytics tools

Cost effective

• Store at 2.3 cents per GB-month in Amazon S3

• Query with Redshift Spectrum at $5/TB scanned

• DW with Amazon Redshift for $1000/TB/year

Give access to everyone

• Amazon QuickSight: $0.30 for 30 minutes of use

Amazon Redshift architectureMassively parallel, shared nothing columnar

architecture

Leader node

SQL endpoint

Stores metadata

Coordinates parallel SQL processing

Compute nodes

Local, columnar storage

Executes queries in parallel

Load, unload, backup, restore

Amazon Redshift Spectrum nodes

Execute queries directly against

Amazon Simple Storage Service (Amazon S3)

Compute

nodes

10 GigE (HPC)

JDBC/ODBC

SQL clients/

BI tools

Leader node

Am

azo

n R

ed

shif

t clu

ster

Amazon S3Exabyte-scale object storage

Efficient data loads

Streaming backup/restore

Concurrency scaling

Backup

Amazon Redshift automatically adds transient clusters,

in seconds, to serve sudden spike in concurrent requests with

consistently fast performance. No hydration required.

Caching Layer

How it works:

All queries go to the leader node, user only sees less wait for queries.

When queries in designated WLM queue begin queuing, Amazon Redshift automatically routes them to the new clusters, enabling Concurrency Scaling automatically.

Amazon Redshift automatically spins up a new cluster, processes waiting queries and automatically shuts down the Concurrency Scaling cluster.

1

2

3

For every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling. This means that Concurrency Scaling is free for > 97% of customers.

Concurrency Scaling is a powerful new feature of

Amazon Redshift. We were really impressed by

the performance of the feature, especially with

its ability to instantly add transient capacity with

nothing to manage on our end. As Redshift

administrators at Yelp, we think that

Concurrency Scaling will keep our many users

happy, even under peak load. We’re excited that

Concurrency Scaling provides the flexibility to

handle significant variance in our workloads over

the course of a day.

-Shahid Chohan,

Software Engineer, Yelp

Amazon Redshift Elastic Resize

Amazon Redshift Cluster

Co

mp

ute

no

des

Amazon Redshift Managed Amazon S3

JDBC/ODBC

1

2

3

Leader Node

Backup

How it works:

Amazon Redshift updates the snapshot on Amazon S3 with the most recent data.

New nodes are added (for scaling up) or removed (for scaling down) during this period.

The cluster is fully available for read and write queries. Queries that were being held are queued for execution automatically.

1

2

3

Scale your Amazon Redshift clusters up and down in minutes to get the optimal performance.

Amazon Redshift Spectrum

Amazon Redshift Spectrum

query engine

Query across Amazon Redshift and Amazon S3

Amazon Redshift data

Amazon S3data lake

Extend the data warehouse to exabytes of data in Amazon S3 data lake

No data loading required

Scale compute and storage separately

Directly query data stored in Amazon S3

Parquet, ORC, Avro, JSON, and CSV data formats

Spectrum Request Accelerator

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

MPP migration timeline

Power Off

Go Live

Stakeholder

Signature

Cutovers

Amazon S3

CDC Sync

Audit per Application

Current System

Future System

AWS Data Migrations: Broadest toolkit

AWS provides the broadest range of tools for easy, fast, and secure data movement to and from the AWS Cloud

AWS Direct

Connect

AWS

Snowball

AWS Database

Migration

Service

AWS Storage

Gateway

Amazon Kinesis

Data Firehose

E a s y , P o w e r f u l T o o l s f o r M i g r a t i n g D a t a

Petabyte-scale data transport

solution

Migration is a two-step process:Step 1: Convert or Copy your Schema

Source DB or DW

AWS SCT

Native Tool

Destination DB or DW

Step 2: Move your data

Source DB or DW Destination DB or DW

AWS DMS

AWS SCT

When to use AWS SCT?

Modernize

Modernize your Database tier

Modernize and Migrate your Data

Warehouse to Amazon Redshift

AWS SCT helps with converting tables, views, and code

• Sequences

• User-defined types

• Synonyms

• Packages

• Stored procedures

• Functions

• Triggers

• Schemas

• Tables

• Indexes

• Views

• Sort and distribution

keys

Database migration assessment

Connect AWS SCT to

source and target

databases

Run assessment report

Read executive summary

Follow detailed

instructions

“AWS Database Migration Service is

the most impressive migration service

we’ve seen.”

2. Relational Databases

1. Non-Relational Databases

3. Other sources

Amazon S3

A c c e l e r a t e M i g r a t i o n s t o A m a z o n R e d s h i f t

Amazon

Redshift

AWS Database

Migration Service

AWS Database Migration Service (AWS DMS)

Leverage AWS Ecosystem

Categories Amazon Redshift

Schema Conversion AWS Schema Conversion Tool for schema/stored proc conversion

Data Movement AWS Schema Conversion Tool Data Extractors

Near Real Time / Streaming Kinesis Data Firehose and/or Amazon Redshift SpectrumAmazon Redshift as a data source for machine learningAmazon Redshift as a destination for Amazon DynamoDB

For Data Lake Integration + column level security

AWS Lake Formation

Data Integration AWS Glue + Amazon EMRData Visualization Amazon QuickSightMonitoring and alerting Amazon CloudWatch Launching Resources AWS CloudFormationSecurity AWS Identity and Access Management (IAM)

Best practices: Legacy data warehouse migrations

• Amazon Redshift is a fully ACID and ANSI SQL Compliant, Columnar store

Data Warehouse

• No indices – instead, fast query performance is achieved through

parallelism and efficient data storage to minimize I/O

• Amazon Redshift is relational – organizes data as tables comprised of

columns and rows in named schemas (namespaces)

• Amazon Redshift has Schemas, Users, and Groups like most other

databases

• Amazon Redshift is designed to working with structured data, and support

semi-structured data in the form of Parquet, JSON or Avro

Best practices: Legacy data warehouse migrations

When moving from legacy row-based data warehouses

• Amazon Redshift is efficient with wide tables because of columnar storage and compression

• Denormalize tables where it makes sense (predicate columns in the fact table)

When moving from SMP legacy data warehouses

• Accelerated joins can be achieved (when frequently queried together) by Colocation of tables

• Leverage DIST STYLE ALL/AUTO or KEY

• Amazon Redshift is designed for big data (> 100GB to PB scale)

Leverage Amazon Redshift Stored Procedures for faster migrations

• PL/pgSQL stored procedures were added to make porting legacy procedures easier

Extend Data warehouse to Data lake using Amazon Redshift Spectrum

• Unlock the data lake in Amazon S3 for end users with industry-standard SQL, with limitless scaling

• Archive Historical Data into Amazon S3

AWSLabs on GitHub — Amazon Redshift

https://github.com/awslabs/amazon-redshift-utils

https://github.com/awslabs/amazon-redshift-monitoring

https://github.com/awslabs/amazon-redshift-udfs

Admin scripts

Collection of utilities for running diagnostics on your cluster

Admin views

Collection of utilities for managing your cluster, generating schema DDL, and so on

Analyze Vacuum utility

Utility that can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster

Column Encoding utility

Utility that will apply optimal column encoding to an established schema with data already loaded

https://github.com/awslabs/amazon-redshift-utils

https://github.com/awslabs/amazon-redshift-monitoring

https://github.com/awslabs/amazon-redshift-udfs

AWS Big Data Blog – Amazon Redshift

Amazon Redshift Engineering’s Advanced Table Design Playbook

https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/

Top 10 Performance Tuning Techniques for Amazon Redshift

https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

Twelve Best Practices for Amazon Redshift Spectrum

https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/

Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift

https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime

https://aws.amazon.com/blogs/big-data/how-to-migrate-from-ibm-netezza-to-amazon-redshift-with-no-downtime/

https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/

https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/

https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

https://aws.amazon.com/blogs/big-data/how-to-migrate-from-ibm-netezza-to-amazon-redshift-with-no-downtime/


Learn big data with AWS Training and Certification

Visit aws.amazon.com/training/paths-specialty/

New free digital course, Data Analytics Fundamentals, introduces

Amazon S3, Amazon Kinesis, Amazon EMR,

AWS Glue, and Amazon Redshift

Validate expertise with the AWS Certified Big Data - Specialty exam or the new AWS Certified Data Analytics - Specialty beta exam

Resources created by the experts at AWS to help you build and validate data analytics skills

Classroom offerings, including Big Data on AWS, feature AWS expert instructors and hands-on labs

Thank you!


best practices in migrating from on-premises data

Documents