

SQL to Hadoop and back again, Part 2: Leveraging HBase and Hive

Martin C. Brown ([email protected])

    Director of Documentation

    08 October 2013

    "Big data" is a term that has been used regularly now for almost a decade, and it along

    with technologies like NoSQL are seen as the replacements for the long-successful

    RDBMS solutions that use SQL. Today, DB2, Oracle, Microsoft SQL Server MySQL, and

    PostgreSQL dominate the SQLspace and still make up a considerable proportion of the overall

    market. Here in Part 2, we will concentrate on how to use HBase and Hive for exchanging data

    with your SQL data stores. From the outside, the two systems seem to be largely similar, but

    the systems have very different goals and aims. Let's start by looking at how the two systems

    differ and how we can take advantage of that in our big data requirements.

    View more content in this series

    Hive and HBase

    Integrating Hadoop and SQL with InfoSphere

    BigInsights

    InfoSphere BigInsights makes integrating between Hadoop and SQL databases much

    simpler, since it provides the necessary tools and mechanics to export and import data

    among databases. Using InfoSphere BigInsights, you can define database sources, views,

    queries, and other selection criteria, then automatically convert that into a variety of formats

before importing that collection directly into Hadoop (see Resources for more information).

    For example, you can create a query that extracts the data and populates a JSON array with

the record data. Once exported, a job can be created to process and crunch the data before displaying it, or to import the processed data and export it back to DB2.

    Download InfoSphere BigInsights Quick Start Edition, a free, downloadable non-production

    version of BigInsights.

    Hive is a data warehouse solution that has a thin SQL-like querying language called HiveQL. This

    language is used for querying data, and it saves you from writing native MapReduce processing

to get your data out. Since you already know SQL, Hive is a good solution because it enables you to take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation

    of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide


    storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive

    is not a database but a data warehouse with convenient SQL querying built on top of it. Despite

the convenient interface, query times, particularly on very large datasets, are long enough that jobs are submitted and the results collected when they become available; the information is not available interactively.

    HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and

    read data randomly, just like any other database. But it's not SQL. HBase enables you to make use

    of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop

    architecture. Processing and querying data is more complex with HBase, but you can combine the

    HBase structure with Hive to get an SQL-like interface. HBase can be really practical as part of a

    solution that adds the data, processes it, summarizes it through MapReduce, and stores the output

    for use in future processing.

    In short, think of Hive as an append-only SQL database and HBase as a more typical read-write

    NoSQL data store.

    Hive is useful for SQL integration if you want to store long-term data to be processed and

summarized and loaded back. Hive's major limitation is query speed. When dealing with billions of rows, no live query will be fast enough to drive an interactive interface to the data.

    For example, with data logging, the quantities of data can be huge, but what you often need is

    quick, flexible querying on either summarized or extreme data (i.e., faults and failures).

HBase is useful when you want to store large volumes of flexible data and query that information, but work with only smaller datasets at a time. Hence, you might export data that simultaneously:

    1. Needs to be kept "whole," such as sales or financial data

    2. May change over time

    3. Also needs to be queried

    HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or

    aggregate data to be queried.

    Making use of Hive

    The primary reason to use Hive over a typical SQL database infrastructure is simply the size of the

    data and the length of time required to perform the query. Rather than dumping information into

    Hadoop, writing your own MapReduce query, and getting the information back out, with Hive you

    can (normally) write the same SQL statement, but on a much larger dataset.

    Hive accomplishes this task by translating the SQL statement into a more typical Hadoop

MapReduce job that assembles the data into a tabular format. This is where the limitation comes in: Hive is not a real-time or live-querying solution. Once you submit the job, it can take a long

    time to get a response.
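If you want to see that translation for yourself, Hive's EXPLAIN statement prints the plan it would generate, including the MapReduce stages, without actually running the job. A minimal sketch (using the chicago_bus table we create later in this article):

hive> explain select region, avg(speed) from chicago_bus group by region;

The output lists the map and reduce stages Hive will submit, which helps explain why even a simple aggregate can take minutes rather than seconds.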


    A typical use of Hive is to combine the real-time accessibility of data in a local SQL table, export

    that information into Hive long-term, and reimport the processed version that summarizes the data

    so it can be used in a live query environment, as seen in Figure 1.

    Figure 1. Typical use of Hive

We can use the same SQL statement to retrieve the data in both situations, a convenience that helps to harmonize and streamline your applications.

    Getting data into Hive from SQL

The simplest way to get data in and out is by writing a custom application that extracts the required data from your existing SQL table and inserts it into Hive; that is, perform a SELECT query and use INSERT to place those values directly into Hive.

    Alternatively, depending on your application type, you might consider inserting data directly into

    both your standard SQL and Hive stores. This way, you can check your standard SQL for recent

    queries and process Hive data on an hourly/daily/weekly schedule as required to produce the

    statistical data you require longer-term.

    Remember, with Hive, in the majority of cases we can:

    Use the tabular data directly


    Export without conversion

    Export without reformatting or restructuring

    More typically, you will be dumping entire tables or entire datasets by hour or day into Hive through

an intermediary file. One benefit of this approach is that we can easily introduce the file by running a local import or by copying the data directly into HDFS.

    Let's look at this with an example, using the City of Chicago Traffic Tracker that studies bus data,

    giving a historical view of the speed of buses in different regions of Chicago at different times. A

    sample of the data is shown in Listing 1.

Listing 1. Sample of bus data from City of Chicago Traffic Tracker
mysql> select * from chicago limit 10;

    +---------------------+--------+----------+----------+-------+

    | timelogged | region | buscount | logreads | speed |

    +---------------------+--------+----------+----------+-------+

    | 2013-03-26 09:50:00 | 2 | 40 | 600 | 26.59 |

    | 2013-03-26 09:50:00 | 3 | 76 | 1023 | 21.82 |

| 2013-03-26 09:50:00 | 4 | 46 | 594 | 23.18 |
| 2013-03-26 09:50:00 | 5 | 56 | 1016 | 22.33 |

    | 2013-03-26 09:50:00 | 6 | 58 | 859 | 21.14 |

    | 2013-03-26 09:50:00 | 7 | 59 | 873 | 19.23 |

    | 2013-03-26 09:50:00 | 8 | 89 | 1356 | 21.14 |

    | 2013-03-26 09:50:00 | 9 | 28 | 404 | 21.82 |

    | 2013-03-26 09:50:00 | 10 | 114 | 1544 | 20.45 |

    | 2013-03-26 09:50:00 | 11 | 91 | 1310 | 22.5 |

    +---------------------+--------+----------+----------+-------+

    10 rows in set

    The sample dataset is just over 2 million records, and it has been loaded from a CSV export. To

    get that information into Hive, we can export it to a CSV file.

Listing 2. Exporting data to a CSV file
mysql> select * into outfile 'chicago.csv' fields terminated by ','

    lines terminated by '\n' from chicago;

    Then copy the data into HDFS: $ hdfs dfs -copyFromLocal chicago.csv. Now open Hive and

    create a suitable table.

Listing 3. Opening Hive and creating a suitable table
hive> create table chicago_bus (timelog timestamp, region int,

    buscount int, logreads int, speed float) row format delimited

    fields terminated by ',' lines terminated by "\n";

    The first thing to notice with Hive is that compared to most standard SQL environments, we have

    a somewhat limited set of data types that can be used. Although core types like integers, strings,

    and floats are there, date types are limited. That said, Hive supports reading complex types, such

    as hashmaps. The core types supported by Hive:

    Integers (1-8 bytes)

    Boolean

    Float/Double


String: any sequence of characters, and therefore good for CHAR, VARCHAR, SET, ENUM, TEXT, and BLOB types if the BLOB is storing text

Timestamp: either an epoch value or a YYYY-MM-DD hh:mm:ss.fffffffff-formatted string

Binary: for BLOBs that are not text
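As a small illustrative sketch of how that mapping plays out (the sensors table here is hypothetical, not part of the Chicago dataset), a MySQL definition and a Hive equivalent might look like this, with ENUM and TEXT both collapsing to STRING:

mysql> create table sensors (id int, name varchar(64), status enum('ok','fail'), notes text, logged timestamp);

hive> create table sensors (id int, name string, status string, notes string, logged timestamp);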

    The second observation is that we are defining the table structure. There are binary structures

    within Hive, but using CSV natively is convenient for our purposes since we've exported the data

    to a CSV file. When loading the data, the CSV format specification here will be used to identify the

    fields in the data.

The above example creates a standard table (that is, a table managed within Hive's data store in HDFS). You

    can also create an external table that uses the copied file directly. However, in an SQL-to-Hive

    environment, we want to make use of one big table into which we can append new data.
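For reference, a minimal sketch of that external-table alternative (the HDFS path here is an assumption; it should point at the directory holding the copied CSV) would be:

hive> create external table chicago_bus_ext (timelog timestamp, region int,
      buscount int, logreads int, speed float) row format delimited
      fields terminated by ',' lines terminated by '\n'
      location '/user/hive/external/chicago';

The data stays where it is in HDFS, and dropping the table removes only the definition, not the files.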

    Now the data can be loaded into the table: hive> load data inpath 'chicago.csv' into table

    chicago_bus;.

This code adds the contents of the CSV file to the existing table. In this case, the table is empty, but you can see how easy it would be to import additional data, as the sketch below shows.
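As a variation, if the CSV file is still sitting on the local file system rather than already copied into HDFS, Hive can pick it up directly with the LOCAL keyword, which removes the separate hdfs dfs -copyFromLocal step:

hive> load data local inpath 'chicago.csv' into table chicago_bus;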

    Processing and querying Hive data

    Once the data is loaded, you can execute Hive queries from the Hive shell just as you would in any

    other SQL environment. For example, Listing 4 shows the same query to get the first 10 rows.

Listing 4. Query to get the first 10 rows
hive> select * from chicago_bus limit 10;

    OK

2013-03-26 09:50:00 1 31 461 22.5
2013-03-26 09:50:00 2 40 600 26.59

    2013-03-26 09:50:00 3 76 1023 21.82

    2013-03-26 09:50:00 4 46 594 23.18

    2013-03-26 09:50:00 5 56 1016 22.33

    2013-03-26 09:50:00 6 58 859 21.14

    2013-03-26 09:50:00 7 59 873 19.23

    2013-03-26 09:50:00 8 89 1356 21.14

    2013-03-26 09:50:00 9 28 404 21.82

    2013-03-26 09:50:00 10 114 1544 20.45

    Time taken: 0.205 seconds

    The benefit comes when we perform an aggregate query. For example, let's obtain an average bus

    speed for each region by day.

Listing 5. Obtaining average bus speed for each region by day
hive> select to_date(timelog),region,avg(speed) from chicago_bus

    group by to_date(timelog),region;

    Total MapReduce jobs = 1

    Launching Job 1 out of 1

    Number of reduce tasks not specified. Estimated from input data size: 1

    In order to change the average load for a reducer (in bytes):

    set hive.exec.reducers.bytes.per.reducer=

    In order to limit the maximum number of reducers:

    set hive.exec.reducers.max=

    In order to set a constant number of reducers:


    set mapred.reduce.tasks=

Starting Job = job_201308151101_0009, Tracking URL =
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201308151101_0009

    Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201308151101_0009

    Hadoop job information for Stage-1: number of mappers: 1;

    number of reducers: 1

2013-08-18 11:44:04,943 Stage-1 map = 0%, reduce = 0%
2013-08-18 11:44:24,041 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:25,053 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:26,093 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:27,102 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:28,115 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:29,136 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec
2013-08-18 11:44:30,147 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec
2013-08-18 11:44:31,164 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec
2013-08-18 11:44:32,179 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec

    MapReduce Total cumulative CPU time: 14 seconds 940 msec

    Ended Job = job_201308151101_0009

    MapReduce Jobs Launched:

Job 0: Map: 1 Reduce: 1 Cumulative CPU: 14.94 sec HDFS Read: 77778973 HDFS Write: 690111 SUCCESS

    Total MapReduce CPU Time Spent: 14 seconds 940 msec

    OK

    2011-03-12 1 25.597916682561237

    2011-03-12 2 27.758749961853027

    2011-03-12 3 26.450833241144817

    2011-03-12 4 24.638333241144817

    2011-03-12 5 25.868695715199348

    2011-03-12 6 24.921666463216145

    2011-03-12 7 24.383749961853027

    2011-03-12 8 23.729583422342937

    2011-03-12 9 26.06217384338379

    2011-03-12 10 24.53833333651225

    ...

    As you can see from the output, the query is fundamentally the same as with MySQL (we are

    grouping by an alternative value), but Hive converts this into a MapReduce job, then calculates the

    summary values.

    The reality with data of this style is that the likelihood of requiring the speed in region 1, for

    example, at 9:50 last Thursday is quite low. But knowing the average speed per day for each

region might help predict the timing of traffic or buses in the future. The summary data, now only a few thousand rows, can be loaded into an SQL store and queried and analyzed efficiently, allowing the data to be sliced and diced accordingly.

    To output that information back to a file, a number of options are available, but you can simply

    export back to a local file using the statement in Listing 6.


Listing 6. Exporting back to a local file
hive> INSERT OVERWRITE LOCAL DIRECTORY 'chicagoout' SELECT

    to_date(timelog),region,avg(speed) from chicago_bus

    group by to_date

    (timelog),region;

    This code creates a directory (chicagoout), into which the output is written as a series of text files.

    These can be loaded back into MySQL, but by default, the fields are separated by Ctrl+A. The

    output can be simplified to a CSV file again by creating a table beforehand, which uses the CSV

    formatting.

Listing 7. Simplifying to a CSV file
hive> CREATE TABLE chicago_region_speed (logdate timestamp, region int,

    speed float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    Now rerun the job and insert the information into the table.

Listing 8. Rerunning the job
hive> INSERT OVERWRITE TABLE chicago_region_speed SELECT to_date(timelog) as
timelog, region, avg(speed) as speed from chicago_bus group by to_date(timelog), region;

You can now find the files that make up the table in your Hive data warehouse directory, so you can copy them out for loading into your standard SQL store, for example, using LOAD DATA INFILE.
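The exact warehouse location varies by installation; as a sketch assuming the default /user/hive/warehouse path and the usual 000000_0 output file name, copying the data back out of HDFS might look like this:

$ hdfs dfs -copyToLocal /user/hive/warehouse/chicago_region_speed/000000_0 chicago_region_speed.csv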

Listing 9. Loading files into your standard SQL store
mysql> load data local infile 'chicago_region_speed.csv' into table
chicago_region_speed fields terminated by ',' lines terminated by '\n';

    This process sounds clunky, but it can be automated, and because we have files from each stage,

    it is easy to re-execute or reformat the information if required.

    Using views

    If you are using Hive in the manner suggested earlier, and regularly processing and summarizing

    data daily, weekly or monthly, it might be simpler to create a view.

Views within Hive are logical; that is, the output of the view is re-created each time a query is executed. Although using views is more expensive, for a data exchange environment, views

    hugely simplify the process by simplifying the query structure and allowing consistent output as the

    underlying source tables expand with new data. For example, to create a view from our original

    speed/region summary, use Listing 10.

Listing 10. Creating a view from our original speed/region summary
hive> create view chicago_region_speed_view as select to_date(timelog),

    region,avg(speed) from chicago_bus group by to_date(timelog),region;

    Now we can perform Hive queries on the view including new selections.


    Listing 11. Performing Hive queries on the view

    hive> select * from chicago_region_speed_view where speed < 15;

    ...

    OK

    2012-01-15 28 13.449097183015612

    2012-01-29 28 13.56520840856764

    2012-02-05 28 14.568958282470703

    2012-02-19 28 12.512847211625841

    2012-03-04 28 12.886666708522373

    2012-04-01 28 13.932638910081652

    Not looking good for traffic in region 28, is it?

    Data life-cycle management

If you decide to use Hive as a live component of your querying mechanism (exporting data from SQL into Hive, and back out again so it can be used regularly), give careful thought to how you manage the files to ensure accuracy. I tend to use the following basic sequence (a sketch of the table rotation follows the list):

    1. Insert new data into a table datalog, for example.

    2. When datalog is full (i.e., an hour, day, week, or month of information is complete), the table is

    renamed (to datalog_archive, for example), and a new table (same structure) is created.

    3. Data is exported from datalog_archive and appended into the Hive table for the data.
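A minimal sketch of step 2 in MySQL, using the example table names above (creating the new table first and then swapping both names in a single RENAME keeps the rotation atomic):

mysql> create table datalog_new like datalog;
mysql> rename table datalog to datalog_archive, datalog_new to datalog;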

    Depending on how the data is used and processed, analysis occurs by accessing the live data or

    by running the exact same SQL query statement on Hive. If the data is needed quickly, a view or

    query is executed that imports the corresponding data back into an SQL table in a summarized

    format. For example, for systems logging data (RAM, disk, and other usages) of large clusters, the

    data is stored in SQL for a day. This approach allows for live monitoring and makes it possible to

    spot urgent trends.

    Data is written out each day to Hive where the log content is analyzed by a series of views that

    collect extreme values (for example, disk space less than 5 percent), as well as average disk

    usage. While reviewing the recent data, it's easy to examine and correlate problems (extreme disk

    usage and increased CPU time, for example), and execute the same query on the Hive long-term

    store to get the detailed picture.

    Using HBase

    Whereas Hive is useful for huge datasets where live queries are not required, HBase allows us to

    perform live queries on data, but it works differently. The primary difference is that HBase is not a

    tabular data store, so importing tabular data from an SQL store is more complex.

That said, HBase's internal structure is more flexible. Data sources of multiple

    different data structures can be merged together within HBase. For example, with log data, you

    can store multiple sensor data into a single table within HBase, a situation that would require

    multiple tables in an SQL store.
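As an illustration (the table and column family names here are hypothetical), a single HBase table can hold several families of sensor data that would otherwise be separate SQL tables:

hbase(main):001:0> create 'sensorlog', 'temperature', 'diskusage', 'network'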


Getting data into HBase from SQL

    Unlike Hive, which supports a native tabular layout for the source data, HBase stores key-value

pairs. This key-value model complicates the process of exporting data and using it directly, because each record must first be given an identifier and then formatted so that HBase can understand it.

    Each item (or item identifier) requires a unique key. The unique ID is important because it is the

    only way to get individual data back again; the unique ID locates the record within the HBase table.

Remember that HBase is about key-value pairs, and the unique ID (or key) is the identifier for the stored record data.

    For some data types, such as the log data in our Hive examples, the unique key is meaningless

    because we are unlikely to want to view just one record. For other types, the data may already

    have a suitable unique ID within the table you want to use.

This dilemma can be solved by pre-processing our output, for example, by inserting a UUID() into our output.

    Listing 12. Pre-processing the output

    mysql> select uuid(),timelogged,region,buscount,logreads,speed into

    outfile 'chicago.tsv' fields terminated by '\t' lines terminated by '\n'

    from chicago;

    This code creates a new UUID for each row of output. The UUID can be used to identify each

    record even though for this type of data, that identification is not individually useful.

    A secondary consideration within the export process is that HBase does not support joins. If you

    want to use HBase to write complex queries on your SQL data, you need to run a query within

    your SQL store that outputs an already-joined or aggregate record.
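As a small sketch of that approach (the regions lookup table here is hypothetical), the join is performed on the SQL side so that the exported file already contains the flattened, joined record:

mysql> select uuid(),c.timelogged,r.regionname,c.speed into outfile 'chicago_joined.tsv'
fields terminated by '\t' lines terminated by '\n'
from chicago c join regions r on c.region = r.regionid;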

Within HBase, tables are organized according to column families, and these can be used to group multiple individual columns together, or you can use the column families as the columns themselves. The

    translation is from the table to the document structure, as shown in Figure 2.


    Figure 2. Translation from table to the document structure

To import the data, you first have to create the table. HBase includes a basic shell for accepting commands. We can open it and create a table with a column family for each field.

Listing 13. Creating the table in the HBase shell
$ hbase shell

    13/08/18 15:54:46 WARN conf.Configuration: hadoop.native.lib is

    deprecated. Instead, use io.native.lib.available

    HBase Shell; enter 'help' for list of supported commands.

    Type "exit" to leave the HBase Shell

    Version 0.94.6-cdh4.3.0, rUnknown, Mon May 27 20:22:05 PDT 2013

    hbase(main):001:0> create 'chicago', 'logtime', 'region', 'buscount',

    'readnumber' , 'speed'

    Copy the tab-separated file created earlier into HDFS: $ hdfs dfs -copyFromLocal chicago.tsv.

    Now we can run importtsv, a tool inside the HBase JAR that imports values from a tab-delimited

    file.

Listing 14. Running importtsv
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed \
  chicago chicago.tsv

The command needs to be broken down to make it understandable:

hadoop jar /usr/lib/hbase/hbase.jar importtsv

This runs the importtsv tool, which is included as part of the HBase JAR.

-Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed

This defines the columns that will be imported and how they will be identified. The fields are defined as a list; at least one of them must be the identifier for each row (our UUID), specified by HBASE_ROW_KEY, and the others name the column families (created above) used for each input column.


    chicago The table name. It must have been created before this tool is executed.

    chicago.tsv The name of the file in HDFS to be imported.

The output from this command (see Listing 15) is rather immense because the import process is involved. The data cannot be loaded directly; instead, it is parsed by a MapReduce job that extracts the fields and then inserts them into the HBase table.

Listing 15. Output from importtsv command
13/08/18 16:08:37 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh4.3.0--1, built on 05/28/2013 02:01 GMT

    ...

    13/08/18 16:08:48 INFO mapred.JobClient: Running job:

    job_201308151105_0008

    13/08/18 16:08:49 INFO mapred.JobClient: map 0% reduce 0%

    13/08/18 16:09:07 INFO mapred.JobClient: map 1% reduce 0%

    13/08/18 16:09:10 INFO mapred.JobClient: map 4% reduce 0%

    13/08/18 16:09:13 INFO mapred.JobClient: map 9% reduce 0%

    13/08/18 16:09:16 INFO mapred.JobClient: map 18% reduce 0%

    13/08/18 16:09:19 INFO mapred.JobClient: map 26% reduce 0%

    13/08/18 16:09:22 INFO mapred.JobClient: map 28% reduce 0%

    13/08/18 16:09:25 INFO mapred.JobClient: map 35% reduce 0%

    13/08/18 16:09:29 INFO mapred.JobClient: map 40% reduce 0%

    13/08/18 16:09:32 INFO mapred.JobClient: map 46% reduce 0%

    13/08/18 16:09:36 INFO mapred.JobClient: map 47% reduce 0%

    13/08/18 16:09:38 INFO mapred.JobClient: map 52% reduce 0%

    13/08/18 16:09:42 INFO mapred.JobClient: map 55% reduce 0%

    13/08/18 16:09:43 INFO mapred.JobClient: map 57% reduce 0%

    13/08/18 16:09:45 INFO mapred.JobClient: map 58% reduce 0%

    13/08/18 16:09:48 INFO mapred.JobClient: map 60% reduce 0%

    13/08/18 16:09:51 INFO mapred.JobClient: map 62% reduce 0%

    13/08/18 16:09:54 INFO mapred.JobClient: map 64% reduce 0%

    13/08/18 16:09:57 INFO mapred.JobClient: map 65% reduce 0%

    13/08/18 16:10:01 INFO mapred.JobClient: map 67% reduce 0%

    13/08/18 16:10:04 INFO mapred.JobClient: map 69% reduce 0%

13/08/18 16:10:07 INFO mapred.JobClient: map 71% reduce 0%
13/08/18 16:10:10 INFO mapred.JobClient: map 72% reduce 0%

    13/08/18 16:10:13 INFO mapred.JobClient: map 74% reduce 0%

    13/08/18 16:10:16 INFO mapred.JobClient: map 76% reduce 0%

    13/08/18 16:10:19 INFO mapred.JobClient: map 78% reduce 0%

    13/08/18 16:10:23 INFO mapred.JobClient: map 80% reduce 0%

    13/08/18 16:10:26 INFO mapred.JobClient: map 81% reduce 0%

    13/08/18 16:10:29 INFO mapred.JobClient: map 83% reduce 0%

    13/08/18 16:10:32 INFO mapred.JobClient: map 84% reduce 0%

    13/08/18 16:10:35 INFO mapred.JobClient: map 86% reduce 0%

    13/08/18 16:10:38 INFO mapred.JobClient: map 88% reduce 0%

    13/08/18 16:10:41 INFO mapred.JobClient: map 90% reduce 0%

    13/08/18 16:10:44 INFO mapred.JobClient: map 92% reduce 0%

    13/08/18 16:10:47 INFO mapred.JobClient: map 95% reduce 0%

    13/08/18 16:10:50 INFO mapred.JobClient: map 98% reduce 0%

13/08/18 16:10:53 INFO mapred.JobClient: map 100% reduce 0%
13/08/18 16:11:02 INFO mapred.JobClient: Job complete:

    job_201308151105_0008

    13/08/18 16:11:02 INFO mapred.JobClient: Counters: 25

    13/08/18 16:11:02 INFO mapred.JobClient: File System Counters

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of bytes read=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of bytes written=416878

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of read operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of large read operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:


    FILE: Number of write operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of bytes read=156425993

    13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of bytes written=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of read operations=4

    13/08/18 16:11:02 INFO mapred.JobClient:

HDFS: Number of large read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of write operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    Job Counters

    13/08/18 16:11:02 INFO mapred.JobClient:

    Launched map tasks=2

    13/08/18 16:11:02 INFO mapred.JobClient:

    Data-local map tasks=2

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total time spent by all maps in occupied slots (ms)=178767

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total time spent by all reduces in occupied slots (ms)=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total time spent by all maps waiting after reserving slots (ms)=0

    13/08/18 16:11:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    Map-Reduce Framework

    13/08/18 16:11:02 INFO mapred.JobClient:

    Map input records=2168224

    13/08/18 16:11:02 INFO mapred.JobClient:

    Map output records=2168224

    13/08/18 16:11:02 INFO mapred.JobClient:

    Input split bytes=248

    13/08/18 16:11:02 INFO mapred.JobClient:

    Spilled Records=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    CPU time spent (ms)=14140

    13/08/18 16:11:02 INFO mapred.JobClient:

    Physical memory (bytes) snapshot=274292736

    13/08/18 16:11:02 INFO mapred.JobClient:

    Virtual memory (bytes) snapshot=1394532352

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total committed heap usage (bytes)=125566976

    13/08/18 16:11:02 INFO mapred.JobClient:

    ImportTsv

    13/08/18 16:11:02 INFO mapred.JobClient:

    Bad Lines=0

    If you get a bad-lines output that shows a high number of errors, particularly if the number equals

    the number of rows you are importing, the problem is probably the format of the source file or the

    fact that the number of columns in the source file does not match the number of columns defined

    in the import specification.
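A quick sanity check after the import is to compare the HBase row count against the number of rows you exported (counting scans the whole table, so it can take a while on a large dataset):

hbase(main):002:0> count 'chicago'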

    Once the data has been imported, we can use the shell to get one record to check that the import

    has worked.


Listing 16. Checking to see if the import has worked
hbase(main):003:0> scan 'chicago', {'LIMIT' => 1}

    ROW COLUMN+CELL

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=buscount:,

    timestamp=1376949811719, value=40

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=logtime:,

    timestamp=1376949811719,

value=2013-03-26 09:50:00
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=readnumber:,

    timestamp=1376949811719, value=600

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=region:,

    timestamp=1376949811719, value=2

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=speed:,

    timestamp=1376949811719, value=26.59

    1 row(s) in 0.1830 seconds

    You can see the basic structure of the data as it exists within the HBase table. The unique ID

    identifies each record, then individual key-value pairs contain the detail (i.e., the columns from the

    original SQL table).

Alternative SQL or Hive to HBase
An alternative model to the raw-data export (less common with HBase because of the record structure) is to use HBase to store summary values and parsed/composed queries.

Because data in HBase is stored in a readily and quickly accessible format (access the key and get the data), it is a good place to hold chunks of data that have been computed by other jobs and need to be retrieved quickly. For example, the summary data we generated using Hive earlier in this example could have been written into HBase to provide statistical data on the fly for a website.

Using HBase data from Hive
Now that we have the data in HBase, we can start querying and reporting on the information. The

primary advantage of HBase is its powerful querying facilities based on MapReduce within

    Hadoop. Since the data is stored internally as simple key-value combinations, it is easy to process

    through MapReduce.

Writing raw MapReduce is no solution for someone from the SQL world, but we can take advantage of the flexible nature of Hive's processing model to crunch HBase data using the HiveQL interface. You may

    remember earlier I described how Hive supports processing of mapped data types; this is what

    HBase data is: mapped key-value pairs.

    To use HBase data, we need to create a table within Hive that points to the HBase table and maps

    the key-value pairs in HBase to the column style of Hive.

    Listing 17. Creating a table within Hive that points to the HBase table and

maps the key-value pairs in HBase to the column style of Hive
hive> create external table hbase_chicago (key string, logtime timestamp,

    region int, buscount int, readnumber int, speed float) STORED BY

    'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES

    ('hbase.columns.mapping'=':key, logtime:val, region:val, buscount:val,

    readnumber:val, speed:val') TBLPROPERTIES('hbase.table.name'='chicago');


The first part of this code creates a table definition identical to the one we used natively in Hive, except that we have added the row-key UUID as the first column.

The STORED BY block defines the storage format.

The SERDEPROPERTIES block is the mapping between the document structure and the columns. The colon separates the key name from the corresponding value, and the entries define how the data should be mapped to the columns, in sequence, from the table definition.

The TBLPROPERTIES block defines the name of the HBase table where the data lies.

Once the table has been created, it can be queried through Hive using native SQL, just as we saw earlier.
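For instance, a minimal sketch of such a query, reusing the per-region average from earlier but now running against the HBase-backed table, would be:

hive> select region, avg(speed) from hbase_chicago group by region;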

    Why use this method instead of a native import? The primary reason is the ease with which it can

    be queried (although no longer live), but also because the underlying HBase data can be updated,

    rather than just appended to. In an SQL-to-Hadoop architecture this advantage means we can

    take regular dumps of changing data from SQL and update the content.
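As a sketch of such an update (the row key here is the UUID from the earlier scan output; any existing key would do), overwriting a single value is a simple put in the HBase shell, and the new value is what the Hive mapping sees on the next query:

hbase(main):004:0> put 'chicago', 'e49e1c1c-0d93-11e3-a5e5-81b71544159b', 'speed:', '23.5'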

    Reminder: HBase or Hive

    Given the information here, it's worth reminding ourselves of the benefits of the two systems.

Table 1. Benefits of the two systems

Feature                  Hive         HBase
SQL Support              Yes          No
Tabular Data Structure   Yes          Semi
Append Only              Yes          No
Allows Updates           No           Yes
Live Queries             Not really   Yes, within limits
Joins                    Yes          No

    Which one you use will depend entirely on your use case and the data you have available, and

    how you want to query it. Hive is great for massive processing of ever-increasing data. HBase is

    useful for querying data that may change over time and need to be updated.

    Conclusion

    The primary reason for moving data between SQL stores and Hadoop is usually to take advantage

    of the massive storage and processing capabilities to process quantities of data larger than you

    could hope to cope with in SQL alone. How you exchange and process that information from your

    SQL store into Hadoop is, therefore, important. Large quantities of long-term data that need to be

    queried more interactively can take advantage of the append-only and SQL nature of Hive. For

    data that needs to be updated and processed, it might make more sense to use HBase. HBase

    also makes an ideal output target from Hive because it's so easy to access summary data directly

    by using the native key-value store.


    When processing, you also need to consider how to get the data back in. With Hive, the process is

    easy because we can run SQL and get a table that can easily be imported back to our SQL store

for straightforward or live query processing. In this article, I've covered a wide range of use cases and examples of how data can be exchanged more easily between SQL, with its tabular interfaces, and the tabular and non-tabular storage available within Hadoop.


    Resources

    Learn

    Check out the Chicago Traffic Tracker.

    Read "What's the big deal about Big SQL?" to learn about IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with

    an easy on-ramp for querying data managed by Hadoop.

    Dig deeper in the "Developing Big SQL queries to analyze big data" tutorial in the InfoSphere

    BigInsights tutorial collection (PDF).

    "Analyzing social media and structured data with InfoSphere BigInsights" teaches the basics

    of using BigSheets to analyze social media and structured data collected through sample

    applications provided with BigInsights.

    Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights

    architecture and underlying technologies.

Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights.

    Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big

    Data -- Analyzing Social Media for Watson.

    Read "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more

    about the InfoSphere BigInsights web console.

Visit the BigInsights Technical Enablement wiki for links to technical materials, demos,

    training courses, news items, and more.

Learn about the IBM Watson research project.

Check out Big Data University for free courses on Hadoop and big data.

    Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and

Streaming Data for details on two of IBM's key big data technologies.

    Learn more about Apache Hadoop.

    Check out HadoopDB.

    Read "Using MapReduce and load balancing on the cloud" to learn how to implement the

    Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing

    to improve the performance of both a single- and multiple-node system.

    For information on installing Hadoop using CDH4, see CDH4 Installation - Cloudera Support.

Big Data Glossary, by Pete Warden, O'Reilly Media, ISBN 1449314597, 2011; and Hadoop:
The Definitive Guide, by Tom White, O'Reilly Media, ISBN 1449389732, 2010, offer more
information.

"HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical

    Workloads" explores the feasibility of building a hybrid system that takes the best features

    from both technologies.

    Learn more by reading "MapReduce and parallel DBMSes: friends or foes?"

    "A Survey of Large ScaleData Management Approaches in Cloud Environments" gives a

    comprehensive survey of numerous approaches and mechanisms of deploying data-intensive

    applications in the cloud, which are gaining a lot of momentum in both research and industrial

    communities.


    Learn more about big data in the developerWorks big data content area. Find technical

    documentation, how-to articles, education, downloads, product information, and more.

    Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based

    offering that extends the value of open source Hadoop with features like Big SQL, text

    analytics, and BigSheets.

Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application,

    develop Big SQL queries to analyze big data, and create an extractor to derive insights from

    text documents with InfoSphere BigInsights.

    Find resources to help you get started with InfoSphere Streams, IBM's high-performance

    computing platform that enables user-developed applications to rapidly ingest, analyze, and

    correlate information as it arrives from thousands of real-time sources.

    Stay current with developerWorks technical events and webcasts.

    Follow developerWorks on Twitter.

    Get products and technologies

    Get Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS from Apache.org.

    Download InfoSphere BigInsights Quick Start Edition, available as a native software

    installation or as a VMware image.

Download InfoSphere Streams, available as a native software installation or as a VMware
image.

Use InfoSphere Streams on IBM SmartCloud Enterprise.

    Build your next development project with IBM trial software, available for download directly

    from developerWorks.

    Discuss

    Ask questions and get answers in the InfoSphere BigInsights forum.

    Ask questions and get answers in the InfoSphere Streams forum.

Check out the developerWorks blogs and get involved in the developerWorks community.

IBM big data and analytics on Facebook.


    About the author

    Martin C. Brown

A professional writer for over 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development

    languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C,

    C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac

    OS and more. He currently works as the director of documentation for Continuent.

    Copyright IBM Corporation 2013

    (www.ibm.com/legal/copytrade.shtml)

    Trademarks

    (www.ibm.com/developerworks/ibm/trademarks/)
