

SQL to Hadoop and back again, Part 2: Leveraging HBase and Hive

Martin C. Brown ([email protected])

    Director of Documentation

    08 October 2013

    "Big data" is a term that has been used regularly now for almost a decade, and it along

    with technologies like NoSQL are seen as the replacements for the long-successful

    RDBMS solutions that use SQL. Today, DB2, Oracle, Microsoft SQL Server MySQL, and

    PostgreSQL dominate the SQLspace and still make up a considerable proportion of the overall

    market. Here in Part 2, we will concentrate on how to use HBase and Hive for exchanging data

    with your SQL data stores. From the outside, the two systems seem to be largely similar, but

    the systems have very different goals and aims. Let's start by looking at how the two systems

    differ and how we can take advantage of that in our big data requirements.

    View more content in this series

    Hive and HBase

    Integrating Hadoop and SQL with InfoSphere

    BigInsights

    InfoSphere BigInsights makes integrating between Hadoop and SQL databases much

    simpler, since it provides the necessary tools and mechanics to export and import data

    among databases. Using InfoSphere BigInsights, you can define database sources, views,

    queries, and other selection criteria, then automatically convert that into a variety of formats

before importing that collection directly into Hadoop (see Resources for more information).

    For example, you can create a query that extracts the data and populates a JSON array with

the record data. Once exported, a job can be created to process and crunch the data before displaying it, or to import the processed data and export it back to DB2.

    Download InfoSphere BigInsights Quick Start Edition, a free, downloadable non-production

    version of BigInsights.

    Hive is a data warehouse solution that has a thin SQL-like querying language called HiveQL. This

    language is used for querying data, and it saves you from writing native MapReduce processing

to get your data out. Since you already know SQL, Hive is a good solution because it enables you to take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation

    of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide


    storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive

    is not a database but a data warehouse with convenient SQL querying built on top of it. Despite

the convenient interface, query times, particularly on very large datasets, are long enough that jobs are submitted and the results collected when they become available; the information is not available interactively.

    HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and

    read data randomly, just like any other database. But it's not SQL. HBase enables you to make use

    of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop

    architecture. Processing and querying data is more complex with HBase, but you can combine the

    HBase structure with Hive to get an SQL-like interface. HBase can be really practical as part of a

    solution that adds the data, processes it, summarizes it through MapReduce, and stores the output

    for use in future processing.

    In short, think of Hive as an append-only SQL database and HBase as a more typical read-write

    NoSQL data store.

    Hive is useful for SQL integration if you want to store long-term data to be processed and

summarized and loaded back. Hive's major limitation is query speed. When dealing with billions of rows, no live query will be fast enough to drive an interactive interface to the data.

    For example, with data logging, the quantities of data can be huge, but what you often need is

    quick, flexible querying on either summarized or extreme data (i.e., faults and failures).

HBase is useful when you want to store large volumes of flexible data and query that information, but work with only smaller datasets at a time. Hence, you might export data that simultaneously:

    1. Needs to be kept "whole," such as sales or financial data

    2. May change over time

    3. Also needs to be queried

    HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or

    aggregate data to be queried.

    Making use of Hive

    The primary reason to use Hive over a typical SQL database infrastructure is simply the size of the

    data and the length of time required to perform the query. Rather than dumping information into

    Hadoop, writing your own MapReduce query, and getting the information back out, with Hive you

    can (normally) write the same SQL statement, but on a much larger dataset.

    Hive accomplishes this task by translating the SQL statement into a more typical Hadoop

MapReduce job that assembles the data into a tabular format. This is where the limitation comes in: Hive is not a real-time or live-querying solution. Once you submit the job, it can take a long

    time to get a response.
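If you want to see that translation for yourself, Hive's EXPLAIN statement prints the plan it would generate, including the MapReduce stages, without actually running the job. A minimal sketch (using the chicago_bus table we create later in this article):

hive> explain select region, avg(speed) from chicago_bus group by region;

The output lists the map and reduce stages Hive will submit, which helps explain why even a simple aggregate can take minutes rather than seconds.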


    A typical use of Hive is to combine the real-time accessibility of data in a local SQL table, export

    that information into Hive long-term, and reimport the processed version that summarizes the data

    so it can be used in a live query environment, as seen in Figure 1.

    Figure 1. Typical use of Hive

We can use the same SQL statement to retrieve the data in both situations, a convenience that helps to harmonize and streamline your applications.

    Getting data into Hive from SQL

The simplest way to get data in and out is by writing a custom application that extracts the required data from your existing SQL table and inserts it into Hive; that is, perform a SELECT query and use INSERT to place those values directly into Hive.

    Alternatively, depending on your application type, you might consider inserting data directly into

    both your standard SQL and Hive stores. This way, you can check your standard SQL for recent

    queries and process Hive data on an hourly/daily/weekly schedule as required to produce the

    statistical data you require longer-term.

    Remember, with Hive, in the majority of cases we can:

    Use the tabular data directly


    Export without conversion

    Export without reformatting or restructuring

    More typically, you will be dumping entire tables or entire datasets by hour or day into Hive through

an intermediary file. One benefit of this approach is that we can easily introduce the file by running a local import or by copying the data directly into HDFS.

    Let's look at this with an example, using the City of Chicago Traffic Tracker that studies bus data,

    giving a historical view of the speed of buses in different regions of Chicago at different times. A

    sample of the data is shown in Listing 1.

Listing 1. Sample of bus data from City of Chicago Traffic Tracker
mysql> select * from chicago limit 10;

    +---------------------+--------+----------+----------+-------+

    | timelogged | region | buscount | logreads | speed |

    +---------------------+--------+----------+----------+-------+

    | 2013-03-26 09:50:00 | 2 | 40 | 600 | 26.59 |

    | 2013-03-26 09:50:00 | 3 | 76 | 1023 | 21.82 |

| 2013-03-26 09:50:00 | 4 | 46 | 594 | 23.18 |
| 2013-03-26 09:50:00 | 5 | 56 | 1016 | 22.33 |

    | 2013-03-26 09:50:00 | 6 | 58 | 859 | 21.14 |

    | 2013-03-26 09:50:00 | 7 | 59 | 873 | 19.23 |

    | 2013-03-26 09:50:00 | 8 | 89 | 1356 | 21.14 |

    | 2013-03-26 09:50:00 | 9 | 28 | 404 | 21.82 |

    | 2013-03-26 09:50:00 | 10 | 114 | 1544 | 20.45 |

    | 2013-03-26 09:50:00 | 11 | 91 | 1310 | 22.5 |

    +---------------------+--------+----------+----------+-------+

    10 rows in set

    The sample dataset is just over 2 million records, and it has been loaded from a CSV export. To

    get that information into Hive, we can export it to a CSV file.

Listing 2. Exporting data to a CSV file
mysql> select * into outfile 'chicago.csv' fields terminated by ','

    lines terminated by '\n' from chicago;

    Then copy the data into HDFS: $ hdfs dfs -copyFromLocal chicago.csv. Now open Hive and

    create a suitable table.

Listing 3. Opening Hive and creating a suitable table
hive> create table chicago_bus (timelog timestamp, region int,

    buscount int, logreads int, speed float) row format delimited

    fields terminated by ',' lines terminated by "\n";

    The first thing to notice with Hive is that compared to most standard SQL environments, we have

    a somewhat limited set of data types that can be used. Although core types like integers, strings,

    and floats are there, date types are limited. That said, Hive supports reading complex types, such

    as hashmaps. The core types supported by Hive:

    Integers (1-8 bytes)

    Boolean

    Float/Double


String: any sequence of characters, and therefore good for CHAR, VARCHAR, SET, ENUM, TEXT, and BLOB types if the BLOB is storing text

Timestamp: either an epoch value or a YYYY-MM-DD hh:mm:ss.fffffffff-formatted string

Binary: for BLOBs that are not text
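As a small illustrative sketch of how that mapping plays out (the sensors table here is hypothetical, not part of the Chicago dataset), a MySQL definition and a Hive equivalent might look like this, with ENUM and TEXT both collapsing to STRING:

mysql> create table sensors (id int, name varchar(64), status enum('ok','fail'), notes text, logged timestamp);

hive> create table sensors (id int, name string, status string, notes string, logged timestamp);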

    The second observation is that we are defining the table structure. There are binary structures

    within Hive, but using CSV natively is convenient for our purposes since we've exported the data

    to a CSV file. When loading the data, the CSV format specification here will be used to identify the

    fields in the data.

The above example creates a standard table (that is, a table managed within Hive's data store in HDFS). You

    can also create an external table that uses the copied file directly. However, in an SQL-to-Hive

    environment, we want to make use of one big table into which we can append new data.
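For reference, a minimal sketch of that external-table alternative (the HDFS path here is an assumption; it should point at the directory holding the copied CSV) would be:

hive> create external table chicago_bus_ext (timelog timestamp, region int,
      buscount int, logreads int, speed float) row format delimited
      fields terminated by ',' lines terminated by '\n'
      location '/user/hive/external/chicago';

The data stays where it is in HDFS, and dropping the table removes only the definition, not the files.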

    Now the data can be loaded into the table: hive> load data inpath 'chicago.csv' into table

    chicago_bus;.

This code adds the contents of the CSV file to the existing table. In this case, the table is empty, but you can see how easy it would be to import additional data, as the sketch below shows.
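As a variation, if the CSV file is still sitting on the local file system rather than already copied into HDFS, Hive can pick it up directly with the LOCAL keyword, which removes the separate hdfs dfs -copyFromLocal step:

hive> load data local inpath 'chicago.csv' into table chicago_bus;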

    Processing and querying Hive data

    Once the data is loaded, you can execute Hive queries from the Hive shell just as you would in any

    other SQL environment. For example, Listing 4 shows the same query to get the first 10 rows.

Listing 4. Query to get the first 10 rows
hive> select * from chicago_bus limit 10;

    OK

2013-03-26 09:50:00 1 31 461 22.5
2013-03-26 09:50:00 2 40 600 26.59

    2013-03-26 09:50:00 3 76 1023 21.82

    2013-03-26 09:50:00 4 46 594 23.18

    2013-03-26 09:50:00 5 56 1016 22.33

    2013-03-26 09:50:00 6 58 859 21.14

    2013-03-26 09:50:00 7 59 873 19.23

    2013-03-26 09:50:00 8 89 1356 21.14

    2013-03-26 09:50:00 9 28 404 21.82

    2013-03-26 09:50:00 10 114 1544 20.45

    Time taken: 0.205 seconds

    The benefit comes when we perform an aggregate query. For example, let's obtain an average bus

    speed for each region by day.

Listing 5. Obtaining average bus speed for each region by day
hive> select to_date(timelog),region,avg(speed) from chicago_bus

    group by to_date(timelog),region;

    Total MapReduce jobs = 1

    Launching Job 1 out of 1

    Number of reduce tasks not specified. Estimated from input data size: 1

    In order to change the average load for a reducer (in bytes):

    set hive.exec.reducers.bytes.per.reducer=

    In order to limit the maximum number of reducers:

    set hive.exec.reducers.max=

    In order to set a constant number of reducers:


    set mapred.reduce.tasks=

Starting Job = job_201308151101_0009, Tracking URL =
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201308151101_0009

    Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201308151101_0009

    Hadoop job information for Stage-1: number of mappers: 1;

    number of reducers: 1

2013-08-18 11:44:04,943 Stage-1 map = 0%, reduce = 0%
2013-08-18 11:44:24,041 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:25,053 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:26,093 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:27,102 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:28,115 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:29,136 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec
2013-08-18 11:44:30,147 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec
2013-08-18 11:44:31,164 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec
2013-08-18 11:44:32,179 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.94 sec

    MapReduce Total cumulative CPU time: 14 seconds 940 msec

    Ended Job = job_201308151101_0009

    MapReduce Jobs Launched:

Job 0: Map: 1 Reduce: 1 Cumulative CPU: 14.94 sec HDFS Read: 77778973 HDFS Write: 690111 SUCCESS

    Total MapReduce CPU Time Spent: 14 seconds 940 msec

    OK

    2011-03-12 1 25.597916682561237

    2011-03-12 2 27.758749961853027

    2011-03-12 3 26.450833241144817

    2011-03-12 4 24.638333241144817

    2011-03-12 5 25.868695715199348

    2011-03-12 6 24.921666463216145

    2011-03-12 7 24.383749961853027

    2011-03-12 8 23.729583422342937

    2011-03-12 9 26.06217384338379

    2011-03-12 10 24.53833333651225

    ...

    As you can see from the output, the query is fundamentally the same as with MySQL (we are

    grouping by an alternative value), but Hive converts this into a MapReduce job, then calculates the

    summary values.

    The reality with data of this style is that the likelihood of requiring the speed in region 1, for

    example, at 9:50 last Thursday is quite low. But knowing the average speed per day for each

region might help predict the timing of traffic or buses in the future. The summary data, now only a few thousand rows, can be loaded into an SQL store and queried and analyzed efficiently, allowing the data to be sliced and diced accordingly.

    To output that information back to a file, a number of options are available, but you can simply

    export back to a local file using the statement in Listing 6.


Listing 6. Exporting back to a local file
hive> INSERT OVERWRITE LOCAL DIRECTORY 'chicagoout' SELECT

    to_date(timelog),region,avg(speed) from chicago_bus

    group by to_date

    (timelog),region;

    This code creates a directory (chicagoout), into which the output is written as a series of text files.

    These can be loaded back into MySQL, but by default, the fields are separated by Ctrl+A. The

    output can be simplified to a CSV file again by creating a table beforehand, which uses the CSV

    formatting.

Listing 7. Simplifying to a CSV file
hive> CREATE TABLE chicago_region_speed (logdate timestamp, region int,

    speed float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    Now rerun the job and insert the information into the table.

Listing 8. Rerunning the job
hive> INSERT OVERWRITE TABLE chicago_region_speed SELECT to_date(timelog) as
timelog, region, avg(speed) as speed from chicago_bus group by to_date(timelog), region;

You can now find the files that make up the table in your Hive data warehouse directory, so you can copy them out for loading into your standard SQL store, for example, using LOAD DATA INFILE.
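The exact warehouse location varies by installation; as a sketch assuming the default /user/hive/warehouse path and the usual 000000_0 output file name, copying the data back out of HDFS might look like this:

$ hdfs dfs -copyToLocal /user/hive/warehouse/chicago_region_speed/000000_0 chicago_region_speed.csv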

Listing 9. Loading files into your standard SQL store
mysql> load data local infile 'chicago_region_speed.csv' into table
chicago_region_speed fields terminated by ',' lines terminated by '\n';

    This process sounds clunky, but it can be automated, and because we have files from each stage,

    it is easy to re-execute or reformat the information if required.

    Using views

    If you are using Hive in the manner suggested earlier, and regularly processing and summarizing

    data daily, weekly or monthly, it might be simpler to create a view.

Views within Hive are logical; that is, the output of the view is re-created each time a query is executed. Although using views is more expensive, for a data exchange environment, views

    hugely simplify the process by simplifying the query structure and allowing consistent output as the

    underlying source tables expand with new data. For example, to create a view from our original

    speed/region summary, use Listing 10.

Listing 10. Creating a view from our original speed/region summary
hive> create view chicago_region_speed_view as select to_date(timelog),

    region,avg(speed) from chicago_bus group by to_date(timelog),region;

    Now we can perform Hive queries on the view including new selections.


    Listing 11. Performing Hive queries on the view

    hive> select * from chicago_region_speed_view where speed < 15;

    ...

    OK

    2012-01-15 28 13.449097183015612

    2012-01-29 28 13.56520840856764

    2012-02-05 28 14.568958282470703

    2012-02-19 28 12.512847211625841

    2012-03-04 28 12.886666708522373

    2012-04-01 28 13.932638910081652

    Not looking good for traffic in region 28, is it?

    Data life-cycle management

If you decide to use Hive as a live component of your querying mechanism (exporting data from SQL into Hive, and back out again so it can be used regularly), give careful thought to how you manage the files to ensure accuracy. I tend to use the following basic sequence (a sketch of the table rotation follows the list):

    1. Insert new data into a table datalog, for example.

    2. When datalog is full (i.e., an hour, day, week, or month of information is complete), the table is

    renamed (to datalog_archive, for example), and a new table (same structure) is created.

    3. Data is exported from datalog_archive and appended into the Hive table for the data.
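A minimal sketch of step 2 in MySQL, using the example table names above (creating the new table first and then swapping both names in a single RENAME keeps the rotation atomic):

mysql> create table datalog_new like datalog;
mysql> rename table datalog to datalog_archive, datalog_new to datalog;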

    Depending on how the data is used and processed, analysis occurs by accessing the live data or

    by running the exact same SQL query statement on Hive. If the data is needed quickly, a view or

    query is executed that imports the corresponding data back into an SQL table in a summarized

    format. For example, for systems logging data (RAM, disk, and other usages) of large clusters, the

    data is stored in SQL for a day. This approach allows for live monitoring and makes it possible to

    spot urgent trends.

    Data is written out each day to Hive where the log content is analyzed by a series of views that

    collect extreme values (for example, disk space less than 5 percent), as well as average disk

    usage. While reviewing the recent data, it's easy to examine and correlate problems (extreme disk

    usage and increased CPU time, for example), and execute the same query on the Hive long-term

    store to get the detailed picture.

    Using HBase

    Whereas Hive is useful for huge datasets where live queries are not required, HBase allows us to

    perform live queries on data, but it works differently. The primary difference is that HBase is not a

    tabular data store, so importing tabular data from an SQL store is more complex.

That said, HBase's internal structure is more flexible. Data sources of multiple

    different data structures can be merged together within HBase. For example, with log data, you

    can store multiple sensor data into a single table within HBase, a situation that would require

    multiple tables in an SQL store.
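As an illustration (the table and column family names here are hypothetical), a single HBase table can hold several families of sensor data that would otherwise be separate SQL tables:

hbase(main):001:0> create 'sensorlog', 'temperature', 'diskusage', 'network'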


Getting data into HBase from SQL

    Unlike Hive, which supports a native tabular layout for the source data, HBase stores key-value

pairs. This key-value model complicates the process of exporting data and using it directly, because each record must first be given an identifier and then formatted so that HBase can understand it.

    Each item (or item identifier) requires a unique key. The unique ID is important because it is the

    only way to get individual data back again; the unique ID locates the record within the HBase table.

Remember that HBase is about key-value pairs, and the unique ID (or key) is the identifier for the stored record data.

    For some data types, such as the log data in our Hive examples, the unique key is meaningless

    because we are unlikely to want to view just one record. For other types, the data may already

    have a suitable unique ID within the table you want to use.

This dilemma can be solved by pre-processing our output, for example, by inserting a UUID() into our output.

    Listing 12. Pre-processing the output

    mysql> select uuid(),timelogged,region,buscount,logreads,speed into

    outfile 'chicago.tsv' fields terminated by '\t' lines terminated by '\n'

    from chicago;

    This code creates a new UUID for each row of output. The UUID can be used to identify each

    record even though for this type of data, that identification is not individually useful.

    A secondary consideration within the export process is that HBase does not support joins. If you

    want to use HBase to write complex queries on your SQL data, you need to run a query within

    your SQL store that outputs an already-joined or aggregate record.
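As a small sketch of that approach (the regions lookup table here is hypothetical), the join is performed on the SQL side so that the exported file already contains the flattened, joined record:

mysql> select uuid(),c.timelogged,r.regionname,c.speed into outfile 'chicago_joined.tsv'
fields terminated by '\t' lines terminated by '\n'
from chicago c join regions r on c.region = r.regionid;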

Within HBase, tables are organized according to column families, and these can be used to group multiple individual columns together, or you can use the column families as the columns themselves. The

    translation is from the table to the document structure, as shown in Figure 2.


    Figure 2. Translation from table to the document structure

To import the data, you first have to create the table. HBase includes a basic shell for accepting commands. We can open it and create a table with a column family for each field.

Listing 13. Creating the table in the HBase shell
$ hbase shell

    13/08/18 15:54:46 WARN conf.Configuration: hadoop.native.lib is

    deprecated. Instead, use io.native.lib.available

    HBase Shell; enter 'help' for list of supported commands.

    Type "exit" to leave the HBase Shell

    Version 0.94.6-cdh4.3.0, rUnknown, Mon May 27 20:22:05 PDT 2013

    hbase(main):001:0> create 'chicago', 'logtime', 'region', 'buscount',

    'readnumber' , 'speed'

    Copy the tab-separated file created earlier into HDFS: $ hdfs dfs -copyFromLocal chicago.tsv.

    Now we can run importtsv, a tool inside the HBase JAR that imports values from a tab-delimited

    file.

Listing 14. Running importtsv
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed \
  chicago chicago.tsv

The command needs to be broken down to make it understandable:

hadoop jar /usr/lib/hbase/hbase.jar importtsv

This runs the importtsv tool, which is included as part of the HBase JAR.

-Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed

This defines the columns that will be imported and how they will be identified. The fields are defined as a list; at least one of them must be the identifier for each row (our UUID), specified by HBASE_ROW_KEY, and the others name the column families (created above) used for each input column.


    chicago The table name. It must have been created before this tool is executed.

    chicago.tsv The name of the file in HDFS to be imported.

The output from this command (see Listing 15) is rather immense because the import process is involved. The data cannot be loaded directly; instead, it is parsed by a MapReduce job that extracts the fields and then inserts them into the HBase table.

Listing 15. Output from importtsv command
13/08/18 16:08:37 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh4.3.0--1, built on 05/28/2013 02:01 GMT

    ...

    13/08/18 16:08:48 INFO mapred.JobClient: Running job:

    job_201308151105_0008

    13/08/18 16:08:49 INFO mapred.JobClient: map 0% reduce 0%

    13/08/18 16:09:07 INFO mapred.JobClient: map 1% reduce 0%

    13/08/18 16:09:10 INFO mapred.JobClient: map 4% reduce 0%

    13/08/18 16:09:13 INFO mapred.JobClient: map 9% reduce 0%

    13/08/18 16:09:16 INFO mapred.JobClient: map 18% reduce 0%

    13/08/18 16:09:19 INFO mapred.JobClient: map 26% reduce 0%

    13/08/18 16:09:22 INFO mapred.JobClient: map 28% reduce 0%

    13/08/18 16:09:25 INFO mapred.JobClient: map 35% reduce 0%

    13/08/18 16:09:29 INFO mapred.JobClient: map 40% reduce 0%

    13/08/18 16:09:32 INFO mapred.JobClient: map 46% reduce 0%

    13/08/18 16:09:36 INFO mapred.JobClient: map 47% reduce 0%

    13/08/18 16:09:38 INFO mapred.JobClient: map 52% reduce 0%

    13/08/18 16:09:42 INFO mapred.JobClient: map 55% reduce 0%

    13/08/18 16:09:43 INFO mapred.JobClient: map 57% reduce 0%

    13/08/18 16:09:45 INFO mapred.JobClient: map 58% reduce 0%

    13/08/18 16:09:48 INFO mapred.JobClient: map 60% reduce 0%

    13/08/18 16:09:51 INFO mapred.JobClient: map 62% reduce 0%

    13/08/18 16:09:54 INFO mapred.JobClient: map 64% reduce 0%

    13/08/18 16:09:57 INFO mapred.JobClient: map 65% reduce 0%

    13/08/18 16:10:01 INFO mapred.JobClient: map 67% reduce 0%

    13/08/18 16:10:04 INFO mapred.JobClient: map 69% reduce 0%

13/08/18 16:10:07 INFO mapred.JobClient: map 71% reduce 0%
13/08/18 16:10:10 INFO mapred.JobClient: map 72% reduce 0%

    13/08/18 16:10:13 INFO mapred.JobClient: map 74% reduce 0%

    13/08/18 16:10:16 INFO mapred.JobClient: map 76% reduce 0%

    13/08/18 16:10:19 INFO mapred.JobClient: map 78% reduce 0%

    13/08/18 16:10:23 INFO mapred.JobClient: map 80% reduce 0%

    13/08/18 16:10:26 INFO mapred.JobClient: map 81% reduce 0%

    13/08/18 16:10:29 INFO mapred.JobClient: map 83% reduce 0%

    13/08/18 16:10:32 INFO mapred.JobClient: map 84% reduce 0%

    13/08/18 16:10:35 INFO mapred.JobClient: map 86% reduce 0%

    13/08/18 16:10:38 INFO mapred.JobClient: map 88% reduce 0%

    13/08/18 16:10:41 INFO mapred.JobClient: map 90% reduce 0%

    13/08/18 16:10:44 INFO mapred.JobClient: map 92% reduce 0%

    13/08/18 16:10:47 INFO mapred.JobClient: map 95% reduce 0%

    13/08/18 16:10:50 INFO mapred.JobClient: map 98% reduce 0%

13/08/18 16:10:53 INFO mapred.JobClient: map 100% reduce 0%
13/08/18 16:11:02 INFO mapred.JobClient: Job complete:

    job_201308151105_0008

    13/08/18 16:11:02 INFO mapred.JobClient: Counters: 25

    13/08/18 16:11:02 INFO mapred.JobClient: File System Counters

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of bytes read=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of bytes written=416878

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of read operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    FILE: Number of large read operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:


    FILE: Number of write operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of bytes read=156425993

    13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of bytes written=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of read operations=4

    13/08/18 16:11:02 INFO mapred.JobClient:

HDFS: Number of large read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:

    HDFS: Number of write operations=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    Job Counters

    13/08/18 16:11:02 INFO mapred.JobClient:

    Launched map tasks=2

    13/08/18 16:11:02 INFO mapred.JobClient:

    Data-local map tasks=2

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total time spent by all maps in occupied slots (ms)=178767

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total time spent by all reduces in occupied slots (ms)=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total time spent by all maps waiting after reserving slots (ms)=0

    13/08/18 16:11:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    Map-Reduce Framework

    13/08/18 16:11:02 INFO mapred.JobClient:

    Map input records=2168224

    13/08/18 16:11:02 INFO mapred.JobClient:

    Map output records=2168224

    13/08/18 16:11:02 INFO mapred.JobClient:

    Input split bytes=248

    13/08/18 16:11:02 INFO mapred.JobClient:

    Spilled Records=0

    13/08/18 16:11:02 INFO mapred.JobClient:

    CPU time spent (ms)=14140

    13/08/18 16:11:02 INFO mapred.JobClient:

    Physical memory (bytes) snapshot=274292736

    13/08/18 16:11:02 INFO mapred.JobClient:

    Virtual memory (bytes) snapshot=1394532352

    13/08/18 16:11:02 INFO mapred.JobClient:

    Total committed heap usage (bytes)=125566976

    13/08/18 16:11:02 INFO mapred.JobClient:

    ImportTsv

    13/08/18 16:11:02 INFO mapred.JobClient:

    Bad Lines=0

    If you get a bad-lines output that shows a high number of errors, particularly if the number equals

    the number of rows you are importing, the problem is probably the format of the source file or the

    fact that the number of columns in the source file does not match the number of columns defined

    in the import specification.
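A quick sanity check after the import is to compare the HBase row count against the number of rows you exported (counting scans the whole table, so it can take a while on a large dataset):

hbase(main):002:0> count 'chicago'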

    Once the data has been imported, we can use the shell to get one record to check that the import

    has worked.


Listing 16. Checking to see if the import has worked
hbase(main):003:0> scan 'chicago', {'LIMIT' => 1}

    ROW COLUMN+CELL

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=buscount:,

    timestamp=1376949811719, value=40

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=logtime:,

    timestamp=1376949811719,

value=2013-03-26 09:50:00
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=readnumber:,

    timestamp=1376949811719, value=600

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=region:,

    timestamp=1376949811719, value=2

    e49e1c1c-0d93-11e3-a5e5-81b71544159b column=speed:,

    timestamp=1376949811719, value=26.59

    1 row(s) in 0.1830 seconds

    You can see the basic structure of the data as it exists within the HBase table. The unique ID

    identifies each record, then individual key-value pairs contain the detail (i.e., the columns from the

    original SQL table).

Alternative SQL or Hive to HBase
An alternative model to the raw-data export (less common with HBase because of the record structure) is to use HBase to store summary values and parsed/composed queries.

Because data in HBase is stored in a readily and quickly accessible format (access the key and get the data), it is a good place to hold chunks of data that have been computed by other jobs and need to be retrieved quickly. For example, the summary data we generated using Hive earlier in this example could have been written into HBase to provide statistical data on the fly for a website.

Using HBase data from Hive
Now that we have the data in HBase, we can start querying and reporting on the information. The

primary advantage of HBase is its powerful querying facilities based on MapReduce within

    Hadoop. Since the data is stored internally as simple key-value combinations, it is easy to process

    through MapReduce.

Writing raw MapReduce is no solution for someone from the SQL world, but we can take advantage of the flexible nature of Hive's processing model to crunch HBase data using the HiveQL interface. You may

    remember earlier I described how Hive supports processing of mapped data types; this is what

    HBase data is: mapped key-value pairs.

    To use HBase data, we need to create a table within Hive that points to the HBase table and maps

    the key-value pairs in HBase to the column style of Hive.

    Listing 17. Creating a table within Hive that points to the HBase table and

maps the key-value pairs in HBase to the column style of Hive
hive> create external table hbase_chicago (key string, logtime timestamp,

    region int, buscount int, readnumber int, speed float) STORED BY

    'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES

    ('hbase.columns.mapping'=':key, logtime:val, region:val, buscount:val,

    readnumber:val, speed:val') TBLPROPERTIES('hbase.table.name'='chicago');


The first part of this code creates a table definition identical to the one we used natively in Hive, except that we have added the row-key UUID as the first column.

The STORED BY block defines the storage format.

The SERDEPROPERTIES block is the mapping between the document structure and the columns. The colon separates the key name from the corresponding value, and the entries define how the data should be mapped to the columns, in sequence, from the table definition.

The TBLPROPERTIES block defines the name of the HBase table where the data lies.

Once the table has been created, it can be queried through Hive using native SQL, just as we saw earlier.
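For instance, a minimal sketch of such a query, reusing the per-region average from earlier but now running against the HBase-backed table, would be:

hive> select region, avg(speed) from hbase_chicago group by region;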

    Why use this method instead of a native import? The primary reason is the ease with which it can

    be queried (although no longer live), but also because the underlying HBase data can be updated,

    rather than just appended to. In an SQL-to-Hadoop architecture this advantage means we can

    take regular dumps of changing data from SQL and update the content.
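As a sketch of such an update (the row key here is the UUID from the earlier scan output; any existing key would do), overwriting a single value is a simple put in the HBase shell, and the new value is what the Hive mapping sees on the next query:

hbase(main):004:0> put 'chicago', 'e49e1c1c-0d93-11e3-a5e5-81b71544159b', 'speed:', '23.5'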

    Reminder: HBase or Hive

    Given the information here, it's worth reminding ourselves of the benefits of the two systems.

Table 1. Benefits of the two systems

Feature                  Hive         HBase
SQL Support              Yes          No
Tabular Data Structure   Yes          Semi
Append Only              Yes          No
Allows Updates           No           Yes
Live Queries             Not really   Yes, within limits
Joins                    Yes          No

    Which one you use will depend entirely on your use case and the data you have available, and

    how you want to query it. Hive is great for massive processing of ever-increasing data. HBase is

    useful for querying data that may change over time and need to be updated.

    Conclusion

    The primary reason for moving data between SQL stores and Hadoop is usually to take advantage

    of the massive storage and processing capabilities to process quantities of data larger than you

    could hope to cope with in SQL alone. How you exchange and process that information from your

    SQL store into Hadoop is, therefore, important. Large quantities of long-term data that need to be

    queried more interactively can take advantage of the append-only and SQL nature of Hive. For

    data that needs to be updated and processed, it might make more sense to use HBase. HBase

    also makes an ideal output target from Hive because it's so easy to access summary data directly

    by using the native key-value store.


    When processing, you also need to consider how to get the data back in. With Hive, the process is

    easy because we can run SQL and get a table that can easily be imported back to our SQL store

for straightforward or live query processing. In this article, I've covered a wide range of use cases and examples of how data can be exchanged more easily between SQL, with its tabular interfaces, and the tabular and non-tabular storage available within Hadoop.


    Resources

    Learn

    Check out the Chicago Traffic Tracker.

    Read "What's the big deal about Big SQL?" to learn about IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with

    an easy on-ramp for querying data managed by Hadoop.

    Dig deeper in the "Developing Big SQL queries to analyze big data" tutorial in the InfoSphere

    BigInsights tutorial collection (PDF).

    "Analyzing social media and structured data with InfoSphere BigInsights" teaches the basics

    of using BigSheets to analyze social media and structured data collected through sample

    applications provided with BigInsights.

    Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights

    architecture and underlying technologies.

Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights.

    Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big

    Data -- Analyzing Social Media for Watson.

    Read "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more

    about the InfoSphere BigInsights web console.

Visit the BigInsights Technical Enablement wiki for links to technical materials, demos,

    training courses, news items, and more.

Learn about the IBM Watson research project.

Check out Big Data University for free courses on Hadoop and big data.

    Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and

Streaming Data for details on two of IBM's key big data technologies.

    Learn more about Apache Hadoop.

    Check out HadoopDB.

    Read "Using MapReduce and load balancing on the cloud" to learn how to implement the

    Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing

    to improve the performance of both a single- and multiple-node system.

    For information on installing Hadoop using CDH4, see CDH4 Installation - Cloudera Support.

Big Data Glossary, by Pete Warden, O'Reilly Media, ISBN 1449314597, 2011; and Hadoop:
The Definitive Guide, by Tom White, O'Reilly Media, ISBN 1449389732, 2010, offer more
information.

"HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical

    Workloads" explores the feasibility of building a hybrid system that takes the best features

    from both technologies.

    Learn more by reading "MapReduce and parallel DBMSes: friends or foes?"

    "A Survey of Large ScaleData Management Approaches in Cloud Environments" gives a

    comprehensive survey of numerous approaches and mechanisms of deploying data-intensive

    applications in the cloud, which are gaining a lot of momentum in both research and industrial

    communities.


    Learn more about big data in the developerWorks big data content area. Find technical

    documentation, how-to articles, education, downloads, product information, and more.

    Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based

    offering that extends the value of open source Hadoop with features like Big SQL, text

    analytics, and BigSheets.

Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application,

    develop Big SQL queries to analyze big data, and create an extractor to derive insights from

    text documents with InfoSphere BigInsights.

    Find resources to help you get started with InfoSphere Streams, IBM's high-performance

    computing platform that enables user-developed applications to rapidly ingest, analyze, and

    correlate information as it arrives from thousands of real-time sources.

    Stay current with developerWorks technical events and webcasts.

    Follow developerWorks on Twitter.

    Get products and technologies

    Get Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS from Apache.org.

    Download InfoSphere BigInsights Quick Start Edition, available as a native software

    installation or as a VMware image.

Download InfoSphere Streams, available as a native software installation or as a VMware
image.

Use InfoSphere Streams on IBM SmartCloud Enterprise.

    Build your next development project with IBM trial software, available for download directly

    from developerWorks.

    Discuss

    Ask questions and get answers in the InfoSphere BigInsights forum.

    Ask questions and get answers in the InfoSphere Streams forum.

Check out the developerWorks blogs and get involved in the developerWorks community.

IBM big data and analytics on Facebook.


    About the author

    Martin C. Brown

A professional writer for over 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development

    languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C,

    C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac

    OS and more. He currently works as the director of documentation for Continuent.

    Copyright IBM Corporation 2013

    (www.ibm.com/legal/copytrade.shtml)

    Trademarks

    (www.ibm.com/developerworks/ibm/trademarks/)
