leveraging hadoop to augment mysql deployments

Upload: oleksiy-kovyrin

Post on 08-Apr-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    1/24

    01-1Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Leveraging Hadoop to AugmentMySQL Deployments

    +

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    2/24

    01-2Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    About me

    Sarah Sproehnle

    Former MySQL instructor

    Joined Cloudera in March 2010

    [email protected]

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    3/24

    01-3Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    What is Hadoop?

    An open-source framework for storing and processing data on acluster of computers

    Built-in high availability

    Scales linearly (proven to scale to 1000s of nodes)

    Designed for batch processing

    Optimized for streaming reads

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    4/24

    01-4Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Why Hadoop?

    Lots of data (TB+)

    Need to scan, process or transform alldata

    Complex and/or unstructured data

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    5/24

    01-5Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Use cases for Hadoop

    Recommendation engine

    Netflix recommends movies

    last.fm recommends music

    Ad targeting, log processing, search optimization

    ContextWeb, eBay and Orbitz Machine learning and classification

    Yahoo! Mail's spam detection

    Graph analysis

    Facebook and LinkedIn suggest connections eHarmony matches people

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    6/24

    01-6Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop vs. MySQL

    MySQL HadoopData capacity TB+ (may require sharding) PB+

    Data per query GB? PB+

    Read/write Random read/write Sequential scans, Append-

    onlyQuery language SQL Java MapReduce, scripting

    languages, HiveQL

    Transactions Yes No

    Indexes Yes No

    Latency Sub-second (hopefully) Minutes to hours

    Data structure Structured Structured or un-structured

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    7/24

    01-7Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    How does Hadoop work?

    Spreads your data onto tens, hundreds or thousands ofmachines using the Hadoop Distributed File System (HDFS)

    Built-in redundancy (replication) for fault-tolerance

    Machines will fail!

    HDD MTBF 1000 days, 1000 disks = 1 failure every day

    Ability to read and process with MapReduce

    Processing is sent to the data

    Many "map" tasks each work on a slice of the data

    Failed tasks are automatically restarted on another node

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    8/24

    01-8Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example - word count

    map(key, value)

    foreach (word in value)

    output (word, 1)

    Key and value represent a row of data :

    key is the byte offset, value is a line

    Intermediate output:

    the, 1

    cat, 1

    in, 1

    the, 1

    hat, 1

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    9/24

    01-9Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Reduce

    reduce(key, list)

    sum the list

    output(key, sum)

    Hadoop aggregates the keys and callsreduce for each unique key:

    the, (1,1,1,1,1,11)

    cat, (1,1,1)

    in, (1,1,1,1,1,1) ...

    Final result:

    the, 45823

    cat, 1204in, 2693

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    10/24

    01-10Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Why MapReduce?

    By constraining computation to map and reduce phases, thetasks can be split and run in parallel

    Scales horizontally

    Programmer is isolated from individual failed tasks

    Tasks are restarted on another node

    However, many computational tasks will require a series ofmap/reduce phases

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    11/24

    01-11Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The problem with MapReduce

    The developer has to worry about a lot of things besides theanalysis/processing logic (Job setup, InputFormat, customkey/value classes)

    Typically written in Java

    The data is schema-less Even simple things may require several MapReduce passes

    Would be more convenient to use constructs such as "filter","join", "aggregate"

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    12/24

    01-12Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Facebook's story

    Facebook collects TBs of data each day, coming from variousplaces (MySQL, web logs, etc)

    Uses for Hadoop:

    Log processing

    Text mining

    Document indexing

    BI/analytics

    Desired:

    Command-line interface

    Easier development environment that supported ad hoc queries A "schema" for the data

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    13/24

    01-13Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    So Hive was born

    Since many at Facebook were familiar with SQL, Hive wascreated to use a similar interface

    Hive provides:

    HiveQL, an SQL-like language for formulating your queries

    A metastore which stores schema information (typically a MySQLserver)

    Hive translates HiveQL to MapReduce code

    Today Facebook runs over 7500 Hive queries/day and scansmore than 100TB

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    14/24

    01-14Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    HiveQL vs. SQL

    RDBMS Hive

    Language SQL-92 standard (maybe) Subset of SQL-92 plus Hive-

    specific extension

    Update Capabilities INSERT, UPDATE, and DELETE INSERT but not UPDATE or

    DELETE

    Transactions Yes No

    Latency Sub-second Minutes or more

    Indexes Any number of indexes, very

    important for performance

    No indexes, data is always

    scanned (in parallel)

    Data size TBs PBs

    Data per query GBs? PBs

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    15/24

    01-15Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Getting started is easy!

    1. Install Hadoop (on a single machine or a cluster) Or download Cloudera's free VM!

    2. Create Hive tables CREATE TABLE t (col1 INT, col2 STRING)

    3. Load data into Hive

    LOAD DATA INPATH '/path-in-hdfs' INTO TABLE t;

    LOAD DATA LOCAL INPATH '/local-path' INTO TABLE t;

    4. SELECT data

    SELECT sum(col1) AS total, col2FROM tWHERE col2 LIKE 'foo%'GROUP BY col2ORDER BY total;

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    16/24

    01-16Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Under the hood

    Hive converts HiveQL to a series of MapReduce jobs andsubmits code to the Hadoop cluster

    WHERE => map

    GROUP/ORDER BY => reduce

    JOIN => map or reduce depending on optimizer

    EXPLAIN shows the MapReduce plans

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    17/24

    01-17Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example

    EXPLAINSELECT * FROM purchasesWHERE cost > 40ORDER BY order_date DESC;

    Single MapReduce required: WHERE clause translates to a map

    Map outputs order_date as key

    Single reducer collects sorted rows

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    18/24

    01-18Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Extra features

    Partitioning

    UDF/UDAs

    Support for "sampling"

    JDBC and ODBC interfaces

    BI vendor support coming soon!

    Integration with HBase

    For an introduction to HBase, see Tom Hanlon's talk @ 4:25

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    19/24

    01-19Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    It gets even easier!

    Sqoop = SQL-to-Hadoop

    Open source product from Cloudera

    Parallel import of data from many databases to Hadoop

    Parallel export of data from Hadoop to databases

    Integrates with Hive

    Hadoopmysql

    sqoop!

    ...Via auto-generateddatatype definitions

    Custom

    MapReduceprograms

    reinterpret data

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    20/24

    01-20Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    "Sqooping" your tables into Hadoop

    $ sqoop import --connect jdbc:mysql://foo.com/db--table employees--hive-import--fields-terminated-by '\t'--lines-terminated-by '\n'

    Features supported:

    Parallelized import and export

    Choose rows (--where) or columns to import

    Optimized for MySQL (uses parallel mysqldump commands) LOBs can be inline or a separate file

    Incremental loads (with TIMESTAMP or AUTO_INCREMENT col)

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    21/24

    01-21Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example data pipeline

    1. Use MySQL for real-time read/write data access2. Cron job occasionally "sqoops" data into Hive

    3. Use HiveQL to transform data, run batch analysis, join withother data (e.g., Apache logs), etc

    4. Export the transformed results to OLAP or OLTP environment

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    22/24

    01-22Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Summary

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    23/24

    01-23Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Demo!

  • 8/7/2019 Leveraging Hadoop to Augment MySQL Deployments

    24/24