Keynote: Getting Serious about MySQL and Hadoop at Continuent

©Continuent 2014

Getting Serious about MySQL and Hadoop at Continuent

Robert Hodges, CEO

Why should MySQL users care about Hadoop?

What is a Hadoop?

Hadoop Distributed File System (HDFS)

MapReduce

Spark

Hive

Storm

Pig

Shark

Mahout

HBase

Oozie

Avro

HCatalog

Scalding

Stinger

Impala

Sqoop

Ambari

Cassandra

Zookeeper

With this much funding it must be good

(Sources: ZDNet, jaxenter.com, forbes.com, 451 Group)

Hadoop analyzes any type of data

Server Logs

Social media feeds

Geolocation data

Clickstreams

Sensor readings

Business transactions

Analytic reports

Hadoop data loading is simple

mysql> select * into
    -> outfile '/tmp/sakila.rental.csv'
    -> fields terminated by ','
    -> lines terminated by '\n'
    -> from sakila.rental;
Query OK, 16044 rows affected (0.03 sec)

mysql> quit
Bye
$ hadoop fs -put /tmp/sakila.rental.csv

Hadoop exploits the falling cost of storing and processing data

[Chart: Disk Storage -- Average Cost Per Gigabyte, 1990 to 2014, axis from $0.01 to $10,000.00]

(Source: John McCallum, http://www.jcmit.com)

Hadoop is shifting from batch to real-time analytics

[Chart: Cycle time for different iterative algorithms (Page Rank, K-Means Clustering, Logistic Regression), Core Hadoop vs. Spark, axis from 0 to 160. Bar labels as extracted: 0.96, 4.1, 14, 110, 155, 80.]

(Source: Pat McDonough, http://spark-summit.org/2013)

Hadoop is becoming the way that users "see" data

What does it mean to integrate with Hadoop?

Three integration problems

1. Continuous, high-performance loading

2. Meaningful analytics on Hadoop

3. Optimized operation for large-scale deployment

Thesis: Snapshots

Data volumes? System load?

Latency? Change history?

Dump/load

MySQL does not do it that way...

Binlog

Replication

Antithesis: Real-time replication

Raw files? Overwrite/append?

Replication

Binlog

Synthesis: Snapshots + real-time replication

Replication

CSV Files

Buffered Transactions

Binlog

Dump/load

We can implement that!

MySQL

binlog_format=row

MySQL Binlog

Tungsten 3.0 Master

hadoop

Tungsten 3.0 Slave

hadoop

CSV Files

Apache Sqoop/ETL

Fast data filtering

Buffered CSV

Programmable load scripts

Parallel apply

Parallel table dumps

Low impact replication from the binlog

How do you like your data?

(Your data stored in MySQL)
+---------+--------------------+-------------+--------+
| film_id | title              | rental_rate | length |
+---------+--------------------+-------------+--------+
|     556 | MALTESE HOPE       |        4.99 |    127 |
|     557 | MANCHURIAN CURTAIN |        2.99 |    177 |
|     558 | MANNEQUIN WORST    |        2.99 |     71 |
|     559 | MARRIED GO         |        2.99 |    114 |
+---------+--------------------+-------------+--------+

Does it really look better like this?

(Your data stored in Hadoop)
556,MALTESE HOPE,4.99,127\n
557,MANCHURIAN CURTAIN,3.99,177\n
558,MANNEQUIN WORST,2.99,71\n
559,MARRIED GO,2.99,114\n

field separator

record separator

file partitioning

compression

type conversions

Or this?

(INSERT)
I,57,556,2014-03-27 21:04:24.000,556,MALTESE HOPE,4.99,127\n

(UPDATE)
D,57,557,2014-03-27 21:04:24.000,557,\N,\N,\N\n
I,57,558,2014-03-27 21:04:24.000,557,MANCHURIAN CURTAIN,2.99,177\n

(DELETE)
D,57,559,2014-03-27 21:04:24.000,558,\N,\N,\N\n
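
A minimal sketch, in Python, of how row changes could be rendered in the staging format shown on this slide. The column layout (opcode, transaction sequence number, row ID, commit timestamp, then the table columns) is read off the sample rows. The names to_change_rows and to_csv and their parameters are hypothetical, chosen for illustration; this is not Tungsten's own code.

```python
import csv
import io

NULL = r"\N"  # NULL marker as it appears in the sample rows


def to_change_rows(op, seqno, first_row_id, commit_ts, key, values=None, width=3):
    """Return the staging rows for one row change.

    op is "INSERT", "UPDATE" or "DELETE". An UPDATE becomes a delete row
    for the key followed by an insert row carrying the new values.
    width is the number of non-key columns to fill with NULL in a delete row.
    """
    rows = []
    row_id = first_row_id
    if op in ("UPDATE", "DELETE"):
        rows.append(["D", seqno, row_id, commit_ts, key] + [NULL] * width)
        row_id += 1
    if op in ("INSERT", "UPDATE"):
        rows.append(["I", seqno, row_id, commit_ts, key] + list(values))
    return rows


def to_csv(rows):
    """Serialize rows with ',' as field separator and '\\n' as record separator."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Calling to_change_rows("UPDATE", 57, 557, "2014-03-27 21:04:24.000", 557, ["MANCHURIAN CURTAIN", "2.99", "177"]) and passing the result to to_csv gives the two (UPDATE) lines from the slide: a D row for key 557, then an I row with the new values.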

One more thing to replicate...

Dump/load

Replication


Transactions

Binlog

Table metadata

A more civilized view of data

(Your data viewed through Hive)
556  MALTESE HOPE        4.99  127
557  MANCHURIAN CURTAIN  3.99  177
558  MANNEQUIN WORST     2.99  71
559  MARRIED GO          2.99  114

Are we done yet?

Transaction logs Snapshot

????

Introducing a useful MapReduce trick...

Transaction logs + Snapshot

MAP: UNION ALL

SHUFFLE: Sort by key(s), transaction order

REDUCE: Emit last row per key if not a delete

Materialized view including all updates
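
The map, shuffle, and reduce steps on this slide can be sketched in plain Python, without a Hadoop cluster. The tuple layout (opcode, seqno, row_id, key, values) and the function name materialize are assumptions made for illustration; the Continuent tools may implement this differently.

```python
from itertools import groupby
from operator import itemgetter


def materialize(snapshot_rows, change_rows):
    """Merge a snapshot with logged changes into a current view.

    snapshot_rows: iterable of (key, values) pairs from the dump/load.
    change_rows:   iterable of (opcode, seqno, row_id, key, values) tuples,
                   where opcode is "I" (insert) or "D" (delete).
    """
    # MAP (UNION ALL): give snapshot rows the same shape as change rows.
    # seqno -1 is an assumed sentinel so they sort before every logged change.
    tagged = [("I", -1, -1, key, values) for key, values in snapshot_rows]
    tagged.extend(change_rows)

    # SHUFFLE: sort by key, then by transaction order (seqno, row_id).
    tagged.sort(key=lambda r: (r[3], r[1], r[2]))

    # REDUCE: emit the last row per key unless that row is a delete.
    view = []
    for key, group in groupby(tagged, key=itemgetter(3)):
        last = list(group)[-1]
        if last[0] != "D":
            view.append((key, last[4]))
    return view
```

With a snapshot holding films 557, 558, and 559 and a log that inserts 556, updates 557 (a delete followed by an insert), and deletes 558, the view contains 556, the updated 557, and the untouched 559. Because the merge only reads files and picks the latest row per key, it can be rerun at any time without locking the source.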

...With some amazing properties

Apache Sqoop

Tungsten Replication

Buffered CSV Files

No replication failures due to consistency

Reconstruct consistent views at will

No locks

No transactions

No need to pause processing

Reprovision any table at will

Table metadata

We can implement that too!!

https://github.com/continuent/continuent-tools-hadoop

Continuent Hadoop Tools

Schema creation

Materialized view generation

Data comparison

Apache 2.0 licensing

Optimizing large scale deployments

Replicator

m1 (slave)

m2 (slave)

m3 (slave)

Replicator

m1 (master)

m2 (master)

m3 (master)

Replicator

Replicator

RBR

RBR

RBR

Tungsten 3.0 Roadmap for Hadoop

Q1 2014 Q2 2014

Features
• Parallel extractor
• Polished MapReduce tools
• Improved schema change handling
• Binary data conversion
• HortonWorks 2.0

Features
• Scripted load
• Better block commit
• Hive CSV format
• Hive DDL generation
• Partitioned files
• Auto-recovery
• Parallel batch apply
• Sqoop integration
• Cloudera 4.x/5.0

Users can prepare...

• Use Unicode/UTF8

• Standardize on UTC for time

• Enable row replication

• Cluster your data in a way that supports restarts

The MySQL community can prepare...

• Fast heterogeneous replication and loading

• Innovative projects to make relational data easy to consume on Hadoop

• Competing solutions that improve life for users

Conclusion

• Hadoop is for real and the MySQL community needs to adapt

• The challenge is to move data to Hadoop and make it easy to integrate into analytics

• MySQL can be *the* preferred RDBMS to use with Hadoop

Wed 2:20pm Ballroom B - Hadoop for MySQL People

Thurs 1pm Ballroom D - From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication

We’re Hiring!

http://www.continuent.com
