real-time data loading from mysql to hadoop with new tungsten replicator 3.0

41
©Continuent 2014 Real-Time Loading from MySQL to Hadoop Featuring Continuent Tungsten MC Brown, Director of Documentation

Upload: continuent

Post on 14-Dec-2014

930 views

Category:

Technology


1 download

DESCRIPTION

Hadoop is an increasingly popular means of analyzing transaction data from single MySQL or multiple MySQL servers. Up until now mechanisms for moving data between MySQL and Hadoop have been rather limited. The new Continuent Tungsten Replicator 3.0 provides enterprise-quality replication from MySQL to Hadoop. Tungsten Replicator 3.0 is 100% open source, released under a GPL V2 license, and available for download at https://code.google.com/p/tungsten-replicator/. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems while retaining control over how the data is viewed and accessed within Hadoop. We will discuss: - How Hadoop works and why it's useful for processing transaction data from MySQL - Setting up Continuent Tungsten replication from MySQL to Hadoop - Understanding how Hadoop makes data analytics efficient and easy to execute - Tuning replication to maximize performance.

TRANSCRIPT

Page 1: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Real-Time Loading from MySQL to Hadoop

Featuring Continuent Tungsten

MC Brown, Director of Documentation

Page 2: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���2

Introducing Continuent

Page 3: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Introducing Continuent

���3

• The leading provider of clustering and replication for open source DBMS

• Our Product: Continuent Tungsten

• Clustering - Commercial-grade HA, performance scaling and data management for MySQL

• Replication - Flexible, high-performance data movement

Page 4: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Quick Continuent Facts

• Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data

• Tungsten Replicator was application of the year at the 2011 MySQL User Conference

• Wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now

• MySQL to Hadoop deployments are now in progress with multiple customers

���4

Page 5: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Selected Continuent Customers

���5

23

Page 6: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���6

Five Minute Hadoop Introduction

Page 7: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

What Is Hadoop, Exactly?

���7

a.A distributed file system

b.A method of processing massive quantities of data in parallel

c.The Cutting family's stuffed elephant

d.All of the above

Page 8: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Hadoop Distributed File System

���8

Java Client

NameNode (directory)

DataNodes (replicated data)

Hive

Pig

hadoop command

Find file

Read block(s)

Page 9: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Typical MySQL to Hadoop Use Case

���9

Hive (Analytics)

Hadoop Cluster

Transaction Processing

Initial Load?

Latency?

App changes?

Materialized views?

Changes?

App load?

Page 10: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Traditional Hadoop Deployments

• Data Analytics

• Single databases

• Collective databases

• Databases and external information

• Non-structured data

• Long term datastores and archiving

���10

ClientBack

Office

Page 11: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

• Online Analytics

• Real-time queries and caching

• Fully heterogeneous deployments

Future Hadoop Deployments

���11

Client

BackOffice

Page 12: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Options for Loading Data

���12

CSV Files

Sqoop

Manual Loading Sqoop

Tungsten Replicator

Page 13: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Comparing Methods in Detail

���13

Manual via CSV

SqoopTungsten

Replicator

Process Manual/Scripted

Manual/Scripted

Fully automated

Incremental Loading

Possible with DDL changes

Requires DDL changes

Fully supported

Latency Full-load Intermittent Real-time

Extraction Requirements

Full table scan Full and partial table scans

Low-impact binlog scan

Page 14: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���14

Replicating MySQL Data to Hadoop using

Tungsten Replicator

Page 15: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

What is Tungsten Replicator?

���15

A real-time, high-performance,

open source database replication engine

!GPL V2 license - 100% open source

Download from https://code.google.com/p/tungsten-replicator/ Annual support subscription available from Continuent

“Golden Gate without the Price Tag”®

Page 16: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Tungsten Replicator Overview

���16

Master

(Transactions + Metadata)

Slave

THL

DBMS Logs

Replicator

(Transactions + Metadata)

THLReplicator

Extract transactions

from log

Apply

Page 17: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Tungsten Replicator 3.0 & Hadoop

���17

• Extract from MySQL or Oracle

• Base Hadoop plus commercial distributions: Cloudera, HortonWorks, Amazon EMR, IBM

• Provision using Sqoop or parallel extraction

• Automatic replication of incremental changes

• Transformation to preferred HDFS formats

• Schema generation for Hive

• Tools for generating materialized views

Page 18: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Basic MySQL to Hadoop Replication

���18

MySQL Tungsten Master Replicator

hadoop

Master-Side Filtering * pkey - Fill in pkey info * colnames - Fill in names * cdc - Add update type and schema/table info * source - Add source DBMS * replicate - Subset tables to be replicated

binlog_format=row

Tungsten Slave Replicator

hadoop

MySQL Binlog

CSV FilesCSV FilesCSV FilesCSV FilesCSV Files

Hadoop Cluster

Extract from MySQL binlog

Load raw CSV to HDFS (e.g., via LOAD DATA to

Hive)

Access via Hive

Page 19: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Hadoop Data Loading - Gory Details

���19

Replicator

hadoopTransactions from master

CSV FilesCSV FilesCSV Files

Staging TablesStaging TablesStaging “Tables”

Base TablesBase TablesMaterialized Views

Javascript load script

e.g. hadoop.js

Write data to CSV

(Run MapReduce)

(Generate Table

Definitions)

(Generate Table

Definitions)

Load using hadoop

command

Page 20: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���20

Demo #1 !

Replicating data into Hadoop

Page 21: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

JavaScript Batch Loader

• Simple, flexible batch loader

• prepare() - when we go online

• begin() - start of batch

• apply() - write the events

• commit() - commit the events

• release() - when we go offline

���21

Page 22: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���22

Viewing MySQL Data in Hadoop

Page 23: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Generating Staging Table Schema

���23

$ ddlscan -template ddl-mysql-hive-0.10-staging.vm \! -user tungsten -pass secret \! -url jdbc:mysql:thin://logos1:3306/sales -db sales!...!DROP TABLE IF EXISTS sales.stage_xxx_sales;!!CREATE EXTERNAL TABLE sales.stage_xxx_sales!(! tungsten_opcode STRING ,! tungsten_seqno INT ,! tungsten_row_id INT ,! id INT ,! salesman STRING ,! planet STRING ,! value DOUBLE)!ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'!LINES TERMINATED BY '\n'!STORED AS TEXTFILE LOCATION '/user/tungsten/staging/sales/sales';

Page 24: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Generating Base Table Schema

$ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \! -pass secret -url jdbc:mysql:thin://logos1:3306/sales -db sales!...!DROP TABLE IF EXISTS sales.sales;!!CREATE TABLE sales.sales!(! id INT,! salesman STRING,! planet STRING,! value DOUBLE )!;!

���24

Page 25: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Creating a Materialized View in Theory

���25

Log #1 Log #2 Log #N...

MAP Sort by key(s), transaction order

REDUCE Emit last row per key if not a delete

Page 26: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

MapReduce

���26

Acme,2013,4.75!Spitze,2013,25.00!Acme,2013,55.25!Excelsior,2013,1.00!Spitze,2013,5.00

Spitze,2014,60.00!Spitze,2014,9.50!Acme,2014,1.00!Acme,2014,4.00!Excelsior,2014,1.00!Excelsior,2014,9.00

Acme,(4.75,55.25)!Spitze,(25.00,5,00)!Excelsior,(1.00)

Spitze,(60.00,9.50)!Acme,(1.00,4.00)!Excelsior,(1.00,9.00)

MAP

MAP

REDUCEAcme,65.00!Excelsior,11.00!Spitze,99.50

SELECT COMPANY, VALUE FROM ... WHERE ... GROUP BY COMPANY

Page 27: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Creating a Materialized View in Hive

$ hive!...!hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;!hive> FROM ( ! SELECT sales.*! FROM sales.stage_xxx_sales sales! DISTRIBUTE BY id ! SORT BY id,tungsten_seqno,tungsten_row_id!) map1!INSERT OVERWRITE TABLE sales.sales! SELECT TRANSFORM(! tungsten_opcode,tungsten_seqno,tungsten_row_id,id, ! salesman,planet,value)! USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,salesman,planet,value'! AS id INT,salesman STRING,planet STRING,value DOUBLE;!

���27

MAP

REDUCE

Page 28: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Comparing MySQL and Hadoop Data

$ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib!...!$ /opt/continuent/tungsten/bristlecone/bin/dc \! -url1 jdbc:mysql:thin://logos1:3306/sales \! -user1 tungsten -password1 secret \! -url2 jdbc:hive2://localhost:10000 \! -user2 'tungsten' -password2 'secret' -schema sales \! -table sales -verbose -keys id \! -driver org.apache.hive.jdbc.HiveDriver!22:33:08,093 INFO DC - Data comparison utility!...!22:33:24,526 INFO Tables compare OK!

���28

Page 29: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Doing it all at once

$ git clone \! https://github.com/continuent/continuent-tools-hadoop.git!!$ cd continuent-tools-hadoop!!$ bin/load-reduce-check \! -U jdbc:mysql:thin://logos1:3306/sales \! -s sales --verbose

���29

Page 30: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���30

Demo #2 !

Constructing and Checking a Materialized View

Page 31: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���31

Scaling It Up!

Page 32: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

MySQL to Hadoop Fan-In Architecture

���32

Replicator

m1 (slave)

m2 (slave)

m3 (slave)

Replicator

m1 (master)

m2 (master)

m3 (master)

Replicator

Replicator

RBR

RBR

Slaves

Hadoop Cluster

(many nodes)

Masters

RBR

Page 33: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Integration with Provisioning

���33

MySQL

Tungsten Master

hadoop

binlog_format=row

Tungsten Slave

hadoopMySQL Binlog

CSV FilesCSV FilesCSV FilesCSV FilesCSV Files

Hadoop Cluster

Access via Hive

Sqoop/ETL

(Initial provisioning run)

Page 34: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

On-Demand Provisioning via Parallel Extract

���34

MySQL Tungsten Master Replicator

hadoop

Master-Side Filtering * pkey - Fill in pkey info * colnames - Fill in names * cdc - Add update type and schema/table info * source - Add source DBMS * replicate - Subset tables to be replicated (other filters as needed) binlog_format=row

Tungsten Slave Replicator

hadoop

MySQL Binlog

CSV FilesCSV FilesCSV FilesCSV FilesCSV Files

Hadoop Cluster

Extract from MySQL tables

Load raw CSV to HDFS (e.g., via LOAD DATA to

Hive)

Access via Hive

Page 35: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Tungsten Replicator Roadmap

���35

• Parallel CSV file loading

• Partition loaded data by commit time

• Data formats and tools to support additional Hadoop clients as well as HBase

• Replication out of Hadoop

• Integration with emerging real-time analytics based on HDFS (Impala, Spark/Shark, Stinger,...)

Page 36: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014 ���36

Getting Started with Continuent Tungsten

Page 37: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Where Is Everything?

���37

• Tungsten Replicator 3.0 builds are now available on code.google.com http://code.google.com/p/tungsten-replicator/

• Replicator 3.0 documentation is available on Continuent website http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html

• Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop

Contact Continuent for support

Page 38: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Commercial Terms

• Replicator features are open source (GPL V2)

• Investment Elements

• POC / Development (Walk Away Option)

• Production Deployment

• Annual Support Subscription

• Governing Principles

• Annual Subscription Required

• More Upfront Investment -> Less Annual Subscription

���38

Page 39: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

We Do Clustering Too!

���39

Tungsten clusters combine off-the-shelf open source MySQL servers into data services with: !

• 24x7 data access • Scaling of load on replicas • Simple management commands !...without app changes or data migration

Amazon US West

apache /php

GonzoPortal.com

Connector Connector

Page 40: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

In Conclusion: Tungsten Offers...

• Fully automated, real-time replication from MySQL into Hadoop

• Support for automatic transformation to HDFS data formats and creation of full materialized views

• Positions users to take advantage of evolving real-time features in Hadoop

���40

Page 41: Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0

©Continuent 2014

Continuent Web Page: http://www.continuent.com

!

Tungsten Replicator 3.0: http://code.google.com/p/tungsten-replicator

Our Blogs: http://scale-out-blog.blogspot.com http://mcslp.wordpress.com http://www.continuent.com/news/blogs

560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: [email protected]