loading and analyzing behavioral data in amazon redshift

47
Loading and Analyzing Behavioral Data in Amazon Redshift Presented by Segment, AWS & XO Group Inc. March 3, 2015

Upload: segmenthq

Post on 17-Jul-2015

81 views

Category:

Technology


0 download

TRANSCRIPT

Loading and Analyzing Behavioral Data in Amazon Redshift

Presented by Segment, AWS & XO Group Inc.

March 3, 2015

Jon Hawkins Dir. of Analytics & SEO

XO Group Inc.

Scott Ward Solutions Architect

AWS

Peter Reinhardt CEO

Segment

Today’s Speakers

Today’s agenda

•  Amazon Redshift Architecture

•  Segment SQL Quick Demo

•  XO Group’s Story

Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift is Easy to Use

•  Provision in minutes

•  Monitor query performance

•  Point and click resize

•  Built in security

•  Automatic backups

Amazon Redshift Architecture •  Leader Node

–  SQL endpoint –  Stores metadata –  Coordinates query execution

•  Compute Nodes –  Local, columnar storage –  Execute queries in parallel –  Load, backup, restore via Amazon S3 –  Parallel load from Amazon DynamoDB, Amazon

EMR, Amazon S3, HDFS/SSH

•  Two hardware platforms –  Optimized for data processing –  DW1: HDD; scale from 2TB to 1.6PB –  DW2: SSD; scale from 160GB to 256TB

10 GigE (HPC)

Ingestion Backup Restore

SQL Clients/BI Tools

128GB RAM

16TB disk

16 cores

Amazon S3

JDBC/ODBC

128GB RAM

16TB disk

16 cores Compute Node

128GB RAM

16TB disk

16 cores Compute Node

128GB RAM

16TB disk

16 cores Compute Node

Leader Node

•  Column storage

•  Data compression

•  Zone maps

•  Direct-attached storage •  With row storage you do unnecessary I/O

•  To get total amount, you have to read everything

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375

Amazon Redshift Dramatically Reduces I/O

•  With column storage, you only read the data you need

ID Age State Amount

123 20 CA 500

345 25 WA 250

678 40 FL 125

957 37 WA 375

•  Column storage

•  Data compression

•  Zone maps

•  Direct-attached storage

Amazon Redshift Dramatically Reduces I/O

analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw

•  Column storage

•  Data compression

•  Zone maps

•  Direct-attached storage •  COPY compresses automatically

•  You can analyze and override

•  More performance, less cost

Amazon Redshift Dramatically Reduces I/O

•  Column storage

•  Data compression

•  Zone maps

•  Direct-attached storage

10 | 13 | 14 | 26 |… … | 100 | 245 | 324

375 | 393 | 417…

… 512 | 549 | 623

637 | 712 | 809 …

… | 834 | 921 | 959

10

324

375

623

637

959

•  Track the minimum and maximum value for each block

•  Skip over blocks that don’t contain relevant data

Amazon Redshift Dramatically Reduces I/O

•  Column storage

•  Data compression

•  Zone maps

•  Direct-attached storage

128 GB RAM

16 cores

16 TB disk

DW.HS1.8XL:

•  > 2 GB/s scan rate

•  Optimized for data processing

•  High disk density

16 GB RAM 2 cores

2 TB disk

DW.HS1.XL:

Amazon Redshift Dramatically Reduces I/O

•  Query

•  Load

•  Backup/Restore

•  Resize

Amazon Redshift Parallelizes and Distributes Everything

Amazon S3/DynamoDB

128GB RAM

16TB disk

16 cores Compute Node

128GB RAM

16TB disk

16 cores Compute Node

128GB RAM

16TB disk

16 cores Compute Node

•  Query

•  Load

•  Backup/Restore

•  Resize

•  Parallel load from Amazon DynamoDB, Amazon EMR, Amazon S3, HDFS/SSH

•  Kinesis integration

•  Data automatically distributed and sorted according to DDL

•  Scales linearly with number of nodes

Amazon Redshift Parallelizes and Distributes Everything

Amazon S3

128GB RAM

16TB disk

16 cores Compute Node

128GB RAM

16TB disk

16 cores Compute Node

128GB RAM

16TB disk

16 cores Compute Node

•  Query

•  Load

•  Backup/Restore

•  Resize

•  Backups to Amazon S3 are automatic, continuous and incremental

•  Backup your cluster to a second region

•  Configurable system snapshot retention period; take user snapshots on-demand

•  Streaming restores enable you to resume querying faster

Amazon Redshift Parallelizes and Distributes Everything

SQL Clients/BI Tools

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Leader Node

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Compute Node

128GB RAM

48TB disk

16 cores Leader Node

•  Query

•  Load

•  Backup/Restore

•  Resize

•  Add/remove nodes or change node type while remaining online

•  Provision a new cluster and copy data in parallel from node to node

•  Only charged for source cluster until SQL endpoint has automatically been switched over via DNS

Amazon Redshift Parallelizes and Distributes Everything

•  SSL to secure data in transit

•  Encryption to secure data at rest –  AES-256; hardware accelerated

–  All blocks on disks and in Amazon S3 encrypted

–  HSM/CloudHSM

•  No direct access to compute nodes

•  Amazon VPC support

10 GigE (HPC)

Ingestion Backup Restore

SQL Clients/BI Tools

128GB RAM

16TB disk

16 cores

128GB RAM

16TB disk

16 cores

128GB RAM

16TB disk

16 cores

128GB RAM

16TB disk

16 cores

Amazon S3 / Amazon DynamoDB

Customer VPC

Internal Security Group

JDBC/ODBC

Leader Node

Compute Node

Compute Node

Compute Node

Amazon Redshift Has Security Built In

Segment SQL

A single hub to store, transform, and send your customer data.

All these tools run on three types of data.

IDENTIFY

Who are your users?

TRACK

What are they doing?

PAGE

Where are they?

analytics.track({ 'Added Product' , { name: 'Monopoly: 3rd Edition', price: 18.99, quantity: 3});

12345

Replace all your tags and tracking with a single API.

A single hub to store, transform, and send your customer data.

That same data can flow effortlessly (for you) into

a petabyte scale data warehouse.

Segment SQL

All of your data in a hosted, schematized Amazon Redshift instance.

Segment SQL

Query user, page, and event tables.

Segment SQL Partners

XO Group Inc.

XO Group Inc.

Our Analytics Journey

XO Group Inc. + Segment

Individual product teams wanted isolated access to their own analytics.

Segment + Mixpanel + Customer.io + Optimizely + Uservoice

XO Group Inc. + Segment

Still needed a solution to connect Segment data from multiple products and platforms into a single view.

Segment SQL + Mode Analytics

Version support.

Sharing functionality.

Membership analysis.

3 Case Studies

Membership Analysis

Existing database didn’t give any context into how a user became a member in the app.

Couldn’t analyze which features were driving membership upgrades, marketing attribution, or cross-product pollination.

SELECT userID, source,

reason FROM xogrp.signed_up

1234

Isolated Views

Membership Analysis

App A App B

Membership Analysis

A desire to save our content drives our membership.

Experiencing certain apps before downloading a second significantly improved LTV.

We made it easier to save and organize our content.

We optimized promotion strategy to advertise the stickiest apps first to current members.

Share Functionality

Facebook & Twitter drove the most inbound traffic.

So, we guessed they’d also be the most popular sharing options.

SELECT title, shareOption,

quantity FROM xogrp.shared_article

1234

Share Functionality

More considerate “Share” options

SMS a dress from desktop or mobile web browser.

One click email yourself the details of a venue.

Version Support

Couldn’t answer very simple question about which roll up browser versions, devices, and operating systems customers were using.

Caused a lot of debate over which versions to support, which test devices to buy.

Version Support

Chrome 40 + Chrome 40 + Chrome 40 = No, just no!

Version Support

Now we can easily see what browser versions and OS’s are most popular.

Product and engineering teams can make data-driven decisions on which versions to support.

Set up your tracking and analytics database to answer any question.

Save valuable engineering time with prebuilt software.

Listen to each team’s needs and create the analytics view they want.

Takeaways

Thank you!

Schedule a demo today.

segment.com/contact/sql