Copyright IBM Corporation 2013 Trademarks
SQL to Hadoop and back again, Part 2: Leveraging HBase and
Hive
Martin C. Brown ([email protected])
Director of Documentation
08 October 2013
"Big data" is a term that has been used regularly now for almost a decade, and it along
with technologies like NoSQL are seen as the replacements for the long-successful
RDBMS solutions that use SQL. Today, DB2, Oracle, Microsoft SQL Server MySQL, and
PostgreSQL dominate the SQLspace and still make up a considerable proportion of the overall
market. Here in Part 2, we will concentrate on how to use HBase and Hive for exchanging data
with your SQL data stores. From the outside, the two systems seem to be largely similar, but
the systems have very different goals and aims. Let's start by looking at how the two systems
differ and how we can take advantage of that in our big data requirements.
Hive and HBase
Integrating Hadoop and SQL with InfoSphere
BigInsights
InfoSphere BigInsights makes integrating between Hadoop and SQL databases much
simpler, since it provides the necessary tools and mechanics to export and import data
among databases. Using InfoSphere BigInsights, you can define database sources, views,
queries, and other selection criteria, then automatically convert that into a variety of formats
before importing that collection directly into Hadoop (see Resources for more information).
For example, you can create a query that extracts the data and populates a JSON array with
the record data. Once exported, a job can be created to process and crunch the data before
displaying it or importing the processed data and exporting the data back to DB2.
Download InfoSphere BigInsights Quick Start Edition, a free, downloadable non-production
version of BigInsights.
Hive is a data warehouse solution that has a thin SQL-like querying language called HiveQL. This
language is used for querying data, and it saves you from writing native MapReduce processing
to get your data out. Since you already know SQL, Hive is a good solution since it enables you to
take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation
of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide
storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive
is not a database but a data warehouse with convenient SQL querying built on top of it. Despite
the convenient interface, on very large datasets the time required to process a query is so long
that jobs are submitted and the results accessed when they become available. This means that
the information is not interactively available.
HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and
read data randomly, just like any other database. But it's not SQL. HBase enables you to make use
of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop
architecture. Processing and querying data is more complex with HBase, but you can combine the
HBase structure with Hive to get an SQL-like interface. HBase can be really practical as part of a
solution that adds the data, processes it, summarizes it through MapReduce, and stores the output
for use in future processing.
In short, think of Hive as an append-only SQL database and HBase as a more typical read-write
NoSQL data store.
Hive is useful for SQL integration if you want to store long-term data to be processed and
summarized and loaded back. Hive's major limitation is query speed. When dealing with billions of
rows, there is no live querying of the data that would be fast enough for any interactive interface to
the data.
For example, with data logging, the quantities of data can be huge, but what you often need is
quick, flexible querying on either summarized or extreme data (i.e., faults and failures).
HBase is useful when you want to store large volumes of flexible data and query that
information, but might want only smaller datasets to work with. Hence, you might export data that simultaneously:
1. Needs to be kept "whole," such as sales or financial data
2. May change over time
3. Also needs to be queried
HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or
aggregate data to be queried.
Making use of Hive
The primary reason to use Hive over a typical SQL database infrastructure is simply the size of the
data and the length of time required to perform the query. Rather than dumping information into
Hadoop, writing your own MapReduce query, and getting the information back out, with Hive you
can (normally) write the same SQL statement, but on a much larger dataset.
Hive accomplishes this task by translating the SQL statement into a more typical Hadoop
MapReduce job that assembles the data into a tabular format. This is where the limitation comes
in: Hive is not a real-time or live-querying solution. Once you submit the job, it can take a long
time to get a response.
A typical use of Hive is to combine the real-time accessibility of data in a local SQL table, export
that information into Hive long-term, and reimport the processed version that summarizes the data
so it can be used in a live query environment, as seen in Figure 1.
Figure 1. Typical use of Hive
We can use the same SQL statement to query the data in both situations, a convenience
that helps to harmonize and streamline your applications.
Getting data into Hive from SQL
The simplest way to get data in and out is by writing a custom application that extracts the
required data from your existing SQL table and inserts that data into Hive; that is, perform a
SELECT query and use INSERT to place those values directly into Hive.
Alternatively, depending on your application type, you might consider inserting data directly into
both your standard SQL and Hive stores. This way, you can check your standard SQL for recent
queries and process Hive data on an hourly/daily/weekly schedule as required to produce the
statistical data you require longer-term.
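That dual-write pattern can be sketched in outline. This is an illustrative Python sketch, not code from the article; the class and method names are invented:

```python
from datetime import datetime

class DualWriter:
    """Write each record to a live store (for recent queries) and to a
    batch buffer destined for Hive (for long-term summaries)."""

    def __init__(self):
        self.live = []        # stands in for the standard SQL table
        self.hive_batch = []  # rows staged for the next Hive load

    def insert(self, region, buscount, logreads, speed):
        row = (datetime.now().isoformat(sep=" ", timespec="seconds"),
               region, buscount, logreads, speed)
        self.live.append(row)        # immediately queryable
        self.hive_batch.append(row)  # exported on the hourly/daily schedule

    def flush_batch(self):
        """Return the staged rows as CSV lines for a Hive load, then clear."""
        lines = [",".join(map(str, r)) for r in self.hive_batch]
        self.hive_batch = []
        return lines

w = DualWriter()
w.insert(2, 40, 600, 26.59)
csv_lines = w.flush_batch()
```

The point of the split is that the live store stays small while the staged batch accumulates everything for long-term processing.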
Remember, with Hive, in the majority of cases we can:
Use the tabular data directly
Export without conversion
Export without reformatting or restructuring
More typically, you will be dumping entire tables or entire datasets by hour or day into Hive through
an intermediary file. One benefit of this approach is that we can easily introduce the file by running a
local import or by copying the data directly into HDFS.
Let's look at this with an example, using the City of Chicago Traffic Tracker that studies bus data,
giving a historical view of the speed of buses in different regions of Chicago at different times. A
sample of the data is shown in Listing 1.
Listing 1. Sample of bus data from City of Chicago Traffic Tracker
mysql> select * from chicago limit 10;
+---------------------+--------+----------+----------+-------+
| timelogged | region | buscount | logreads | speed |
+---------------------+--------+----------+----------+-------+
| 2013-03-26 09:50:00 | 2 | 40 | 600 | 26.59 |
| 2013-03-26 09:50:00 | 3 | 76 | 1023 | 21.82 |
| 2013-03-26 09:50:00 | 4 | 46 | 594 | 23.18 |
| 2013-03-26 09:50:00 | 5 | 56 | 1016 | 22.33 |
| 2013-03-26 09:50:00 | 6 | 58 | 859 | 21.14 |
| 2013-03-26 09:50:00 | 7 | 59 | 873 | 19.23 |
| 2013-03-26 09:50:00 | 8 | 89 | 1356 | 21.14 |
| 2013-03-26 09:50:00 | 9 | 28 | 404 | 21.82 |
| 2013-03-26 09:50:00 | 10 | 114 | 1544 | 20.45 |
| 2013-03-26 09:50:00 | 11 | 91 | 1310 | 22.5 |
+---------------------+--------+----------+----------+-------+
10 rows in set
The sample dataset is just over 2 million records, and it has been loaded from a CSV export. To
get that information into Hive, we can export it to a CSV file.
Listing 2. Exporting data to a CSV file
mysql> select * into outfile 'chicago.csv' fields terminated by ','
lines terminated by '\n' from chicago;
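SELECT ... INTO OUTFILE writes the file on the database server, which is not always permitted. The same CSV can be produced client-side; here is a minimal sketch, using Python's sqlite3 module purely as a stand-in for a MySQL connection:

```python
import csv
import sqlite3

# Stand-in database; with MySQL you would use a driver connection instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chicago "
             "(timelogged TEXT, region INT, buscount INT, logreads INT, speed REAL)")
conn.execute("INSERT INTO chicago VALUES ('2013-03-26 09:50:00', 2, 40, 600, 26.59)")

# Stream the rows out as comma-separated records, matching the
# FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' export above.
with open("chicago.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in conn.execute("SELECT * FROM chicago"):
        writer.writerow(row)
```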
Then copy the data into HDFS: $ hdfs dfs -copyFromLocal chicago.csv. Now open Hive and
create a suitable table.
Listing 3. Opening Hive and creating a suitable table
hive> create table chicago_bus (timelog timestamp, region int,
buscount int, logreads int, speed float) row format delimited
fields terminated by ',' lines terminated by "\n";
The first thing to notice with Hive is that compared to most standard SQL environments, we have
a somewhat limited set of data types that can be used. Although core types like integers, strings,
and floats are there, date types are limited. That said, Hive supports reading complex types, such
as hashmaps. The core types supported by Hive:
Integers (1-8 bytes)
Boolean
Float/Double
String: any sequence of characters, and therefore good for CHAR, VARCHAR, SET, ENUM,
TEXT, and BLOB types if the BLOB is storing text
Timestamp: either an epoch value or a YYYY-MM-DD hh:mm:ss.fffffffff-formatted string
Binary: for BLOBs that are not TEXT
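Because the Hive timestamp accepts either an epoch value or a YYYY-MM-DD hh:mm:ss string, source columns stored as epoch seconds may need converting during export. A small illustrative helper (an assumption for this sketch, not part of the article's example):

```python
from datetime import datetime, timezone

def epoch_to_hive_timestamp(epoch_seconds):
    """Render epoch seconds in the YYYY-MM-DD hh:mm:ss form Hive parses."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

print(epoch_to_hive_timestamp(1364291400))  # 2013-03-26 09:50:00
```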
The second observation is that we are defining the table structure. There are binary structures
within Hive, but using CSV natively is convenient for our purposes since we've exported the data
to a CSV file. When loading the data, the CSV format specification here will be used to identify the
fields in the data.
The above example creates a standard table (the table within Hive's data store within HDFS). You
can also create an external table that uses the copied file directly. However, in an SQL-to-Hive
environment, we want to make use of one big table into which we can append new data.
Now the data can be loaded into the table: hive> load data inpath 'chicago.csv' into table
chicago_bus;.
This code adds the contents of the CSV file to the existing table. In this case, the table is empty,
but you can see how easy it would be to import additional data.
Processing and querying Hive data
Once the data is loaded, you can execute Hive queries from the Hive shell just as you would in any
other SQL environment. For example, Listing 4 shows the same query to get the first 10 rows.
Listing 4. Query to get the first 10 rows
hive> select * from chicago_bus limit 10;
OK
2013-03-26 09:50:00 1 31 461 22.5
2013-03-26 09:50:00 2 40 600 26.59
2013-03-26 09:50:00 3 76 1023 21.82
2013-03-26 09:50:00 4 46 594 23.18
2013-03-26 09:50:00 5 56 1016 22.33
2013-03-26 09:50:00 6 58 859 21.14
2013-03-26 09:50:00 7 59 873 19.23
2013-03-26 09:50:00 8 89 1356 21.14
2013-03-26 09:50:00 9 28 404 21.82
2013-03-26 09:50:00 10 114 1544 20.45
Time taken: 0.205 seconds
The benefit comes when we perform an aggregate query. For example, let's obtain an average bus
speed for each region by day.
Listing 5. Obtaining average bus speed for each region by day
hive> select to_date(timelog),region,avg(speed) from chicago_bus
group by to_date(timelog),region;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201308151101_0009, Tracking URL =
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201308151101_0009
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201308151101_0009
Hadoop job information for Stage-1: number of mappers: 1;
number of reducers: 1
2013-08-18 11:44:04,943 Stage-1 map = 0%, reduce = 0%
2013-08-18 11:44:24,041 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.83 sec
2013-08-18 11:44:25,053 Stage-1 map = 100%, reduce = 0%,
Cumulative CPU 12.83 sec
2013-08-18 11:44:26,093 Stage-1 map = 100%, reduce = 0%,
Cumulative CPU 12.83 sec
2013-08-18 11:44:27,102 Stage-1 map = 100%, reduce = 0%,
Cumulative CPU 12.83 sec
2013-08-18 11:44:28,115 Stage-1 map = 100%, reduce = 0%,
Cumulative CPU 12.83 sec
2013-08-18 11:44:29,136 Stage-1 map = 100%, reduce = 100%,
Cumulative CPU 14.94 sec
2013-08-18 11:44:30,147 Stage-1 map = 100%, reduce = 100%,
Cumulative CPU 14.94 sec
2013-08-18 11:44:31,164 Stage-1 map = 100%, reduce = 100%,
Cumulative CPU 14.94 sec
2013-08-18 11:44:32,179 Stage-1 map = 100%, reduce = 100%,
Cumulative CPU 14.94 sec
MapReduce Total cumulative CPU time: 14 seconds 940 msec
Ended Job = job_201308151101_0009
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 14.94 sec HDFS Read:
77778973 HDFS Write: 690111 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 940 msec
OK
2011-03-12 1 25.597916682561237
2011-03-12 2 27.758749961853027
2011-03-12 3 26.450833241144817
2011-03-12 4 24.638333241144817
2011-03-12 5 25.868695715199348
2011-03-12 6 24.921666463216145
2011-03-12 7 24.383749961853027
2011-03-12 8 23.729583422342937
2011-03-12 9 26.06217384338379
2011-03-12 10 24.53833333651225
...
As you can see from the output, the query is fundamentally the same as with MySQL (we are
grouping by an alternative value), but Hive converts this into a MapReduce job, then calculates the
summary values.
The reality with data of this style is that the likelihood of requiring the speed in region 1, for
example, at 9:50 last Thursday is quite low. But knowing the average speed per day for each
region might help predict the timing of traffic or buses in the future. The summary data can be
queried and analyzed efficiently with a few thousand rows in an SQL store to allow the data to be
sliced and diced accordingly.
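For spot-checking the summarized output, the grouping that the HiveQL statement performs can be reproduced in a few lines of ordinary code; a sketch over a handful of sample rows (the sample values beyond Listing 1 are invented):

```python
from collections import defaultdict

rows = [  # (timelog, region, buscount, logreads, speed), as in Listing 1
    ("2013-03-26 09:50:00", 2, 40, 600, 26.59),
    ("2013-03-26 10:00:00", 2, 38, 571, 24.41),
    ("2013-03-26 09:50:00", 3, 76, 1023, 21.82),
]

# Group by (to_date(timelog), region), then average speed -- the same
# shape as: select to_date(timelog), region, avg(speed) ... group by ...
groups = defaultdict(list)
for timelog, region, _, _, speed in rows:
    groups[(timelog[:10], region)].append(speed)

averages = {key: sum(v) / len(v) for key, v in groups.items()}
```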
To output that information back to a file, a number of options are available, but you can simply
export back to a local file using the statement in Listing 6.
Listing 6. Exporting back to a local file
hive> INSERT OVERWRITE LOCAL DIRECTORY 'chicagoout' SELECT
to_date(timelog),region,avg(speed) from chicago_bus
group by to_date
(timelog),region;
This code creates a directory (chicagoout), into which the output is written as a series of text files.
These can be loaded back into MySQL, but by default, the fields are separated by Ctrl+A. The
output can be simplified to a CSV file again by creating a table beforehand, which uses the CSV
formatting.
Listing 7. Simplifying to a CSV file
hive> CREATE TABLE chicago_region_speed (logdate timestamp, region int,
speed float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Now rerun the job and insert the information into the table.
Listing 8. Rerunning the job
hive> INSERT OVERWRITE TABLE chicago_region_speed SELECT to_date(timelog) as
timelog,region,avg(speed) as speed from chicago_bus group by to_date
(timelog),region;
You can now find the files that make up the table in your Hive data warehouse directory, so you can
copy them out for loading into your standard SQL store, for example, using LOAD DATA INFILE.
Listing 9. Loading files into your standard SQL store
mysql> load data local infile 'chicago_region_speed.csv' into table
chicago fields terminated by ',' lines terminated by '\n';
This process sounds clunky, but it can be automated, and because we have files from each stage,
it is easy to re-execute or reformat the information if required.
Using views
If you are using Hive in the manner suggested earlier, and regularly processing and summarizing
data daily, weekly, or monthly, it might be simpler to create a view.
Views within Hive are logical; that is, the output of the view is re-created each time a query
is executed. Although using views is more expensive, in a data exchange environment they
hugely simplify the process by simplifying the query structure and allowing consistent output as the
underlying source tables expand with new data. For example, to create a view from our original
speed/region summary, use Listing 10.
Listing 10. Creating a view from our original speed/region summary
hive> create view chicago_region_speed_view as select to_date(timelog),
region,avg(speed) from chicago_bus group by to_date(timelog),region;
Now we can perform Hive queries on the view, including new selections.
Listing 11. Performing Hive queries on the view
hive> select * from chicago_region_speed_view where speed < 15;
...
OK
2012-01-15 28 13.449097183015612
2012-01-29 28 13.56520840856764
2012-02-05 28 14.568958282470703
2012-02-19 28 12.512847211625841
2012-03-04 28 12.886666708522373
2012-04-01 28 13.932638910081652
Not looking good for traffic in region 28, is it?
Data life-cycle management
If you decide to use Hive as a live component of your querying mechanism (exporting data from
SQL into Hive and back out again so it can be used regularly), give careful thought to how
you manage the files to ensure accuracy. I tend to use the following basic sequence:
1. Insert new data into a table (datalog, for example).
2. When datalog is full (i.e., an hour, day, week, or month of information is complete), the table is
renamed (to datalog_archive, for example), and a new table (same structure) is created.
3. Data is exported from datalog_archive and appended into the Hive table for the data.
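The rotation in steps 1-3 can be sketched against any SQL store; here sqlite3 stands in, using the table names from the text (the column layout is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datalog (logged TEXT, metric TEXT, value REAL)")
conn.execute("INSERT INTO datalog VALUES ('2013-08-18 11:00:00', 'disk', 94.2)")

# Step 2: the period is complete -- rename the full table and recreate
# an empty one with the same structure for new inserts.
conn.execute("ALTER TABLE datalog RENAME TO datalog_archive")
conn.execute("CREATE TABLE datalog (logged TEXT, metric TEXT, value REAL)")

# Step 3: read datalog_archive back out, ready to be appended
# into the Hive table.
archived = conn.execute("SELECT * FROM datalog_archive").fetchall()
```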
Depending on how the data is used and processed, analysis occurs by accessing the live data or
by running the exact same SQL query statement on Hive. If the data is needed quickly, a view or
query is executed that imports the corresponding data back into an SQL table in a summarized
format. For example, for systems logging data (RAM, disk, and other usages) of large clusters, the
data is stored in SQL for a day. This approach allows for live monitoring and makes it possible to
spot urgent trends.
Data is written out each day to Hive where the log content is analyzed by a series of views that
collect extreme values (for example, disk space less than 5 percent), as well as average disk
usage. While reviewing the recent data, it's easy to examine and correlate problems (extreme disk
usage and increased CPU time, for example), and execute the same query on the Hive long-term
store to get the detailed picture.
Using HBase
Whereas Hive is useful for huge datasets where live queries are not required, HBase allows us to
perform live queries on data, but it works differently. The primary difference is that HBase is not a
tabular data store, so importing tabular data from an SQL store is more complex.
That said, the internal structure of HBase is more flexible. Data sources with multiple
different data structures can be merged within HBase. For example, with log data, you
can store multiple kinds of sensor data in a single table within HBase, a situation that would
require multiple tables in an SQL store.
Getting data in HBase from SQL
Unlike Hive, which supports a native tabular layout for the source data, HBase stores key-value
pairs. This key-value system complicates the process of exporting data and using it directly
because it first needs to be identified and then formatted accordingly to be understood within
HBase.
Each item (or item identifier) requires a unique key. The unique ID is important because it is the
only way to get individual data back again; the unique ID locates the record within the HBase table.
Remember that HBase is about key-value pairs, and the unique ID (or key) is the identifier to the
stored record data.
For some data types, such as the log data in our Hive examples, the unique key is meaningless
because we are unlikely to want to view just one record. For other types, the data may already
have a suitable unique ID within the table you want to use.
This dilemma can be solved by pre-processing our output, for example, by inserting a UUID() into our output.
Listing 12. Pre-processing the output
mysql> select uuid(),timelogged,region,buscount,logreads,speed into
outfile 'chicago.tsv' fields terminated by '\t' lines terminated by '\n'
from chicago;
This code creates a new UUID for each row of output. The UUID can be used to identify each
record even though for this type of data, that identification is not individually useful.
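If your SQL dialect lacks a uuid() function, the row key can be added while writing the export file instead; a hedged sketch of that pre-processing step:

```python
import uuid

rows = [  # rows as selected from the chicago table
    ("2013-03-26 09:50:00", 2, 40, 600, 26.59),
    ("2013-03-26 09:50:00", 3, 76, 1023, 21.82),
]

# Prefix each record with a fresh UUID -- the HBase row key -- and emit
# tab-separated lines, the layout importtsv expects.
tsv_lines = ["\t".join([str(uuid.uuid4())] + [str(v) for v in row])
             for row in rows]
```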
A secondary consideration within the export process is that HBase does not support joins. If you
want to use HBase to write complex queries on your SQL data, you need to run a query within
your SQL store that outputs an already-joined or aggregate record.
Within HBase, tables are organized according to column families, which can be used to bind
multiple groups of individual columns together, or you can use the column families as actual
columns. The translation is from the table to the document structure, as shown in Figure 2.
Figure 2. Translation from table to the document structure
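The translation shown in Figure 2 amounts to turning each tabular row into a row key plus a map of column names to values; a sketch of the resulting shape (the helper function is illustrative, not an HBase API):

```python
import uuid

def row_to_hbase(row, columns):
    """Turn one tabular row into (row_key, {column: value}) -- the
    key-value shape HBase stores under a column family."""
    return str(uuid.uuid4()), dict(zip(columns, row))

columns = ("logtime", "region", "buscount", "readnumber", "speed")
key, cells = row_to_hbase(("2013-03-26 09:50:00", 2, 40, 600, 26.59), columns)
```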
To import the data, you have to first create the table. HBase includes a basic shell for accepting commands. We can open it and create a table and a table group called cf.
Listing 13. Creating a table group called cf
$ hbase shell
13/08/18 15:54:46 WARN conf.Configuration: hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.94.6-cdh4.3.0, rUnknown, Mon May 27 20:22:05 PDT 2013
hbase(main):001:0> create 'chicago', 'logtime', 'region', 'buscount',
'readnumber' , 'speed'
Copy the tab-separated file created earlier into HDFS: $ hdfs dfs -copyFromLocal chicago.tsv.
Now we can run importtsv, a tool inside the HBase JAR that imports values from a tab-delimited
file.
Listing 14. Running importtsv
$ hadoop jar /usr/lib/hbase/hbase.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,
speed chicago chicago.tsv
The command needs to be split up to make it understandable:

hadoop jar /usr/lib/hbase/hbase.jar importtsv
This runs the importtsv tool, which is included as part of the HBase JAR.

-Dimporttsv.columns=HBASE_ROW_KEY,logtime,region,buscount,readnumber,speed
This defines the columns that will be imported and how they will be identified. The fields are
defined as a list; at least one of them must be the identifier (UUID) for each row, specified by
HBASE_ROW_KEY, and the others define the field names (within the column family, cf) used for each
input column.
chicago
The table name. It must have been created before this tool is executed.

chicago.tsv
The name of the file in HDFS to be imported.
The output from this command (see Listing 15) is rather immense, because the import process is
complicated: the data cannot be loaded directly. Instead, it gets parsed by a MapReduce process
that extracts and then inserts the data into an HBase table.
Listing 15. Output from importtsv command
13/08/18 16:08:37 INFO zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.4.5-cdh4.3.0--1, built on
05/28/2013 02:01 GMT
...
13/08/18 16:08:48 INFO mapred.JobClient: Running job:
job_201308151105_0008
13/08/18 16:08:49 INFO mapred.JobClient: map 0% reduce 0%
13/08/18 16:09:07 INFO mapred.JobClient: map 1% reduce 0%
13/08/18 16:09:10 INFO mapred.JobClient: map 4% reduce 0%
13/08/18 16:09:13 INFO mapred.JobClient: map 9% reduce 0%
13/08/18 16:09:16 INFO mapred.JobClient: map 18% reduce 0%
13/08/18 16:09:19 INFO mapred.JobClient: map 26% reduce 0%
13/08/18 16:09:22 INFO mapred.JobClient: map 28% reduce 0%
13/08/18 16:09:25 INFO mapred.JobClient: map 35% reduce 0%
13/08/18 16:09:29 INFO mapred.JobClient: map 40% reduce 0%
13/08/18 16:09:32 INFO mapred.JobClient: map 46% reduce 0%
13/08/18 16:09:36 INFO mapred.JobClient: map 47% reduce 0%
13/08/18 16:09:38 INFO mapred.JobClient: map 52% reduce 0%
13/08/18 16:09:42 INFO mapred.JobClient: map 55% reduce 0%
13/08/18 16:09:43 INFO mapred.JobClient: map 57% reduce 0%
13/08/18 16:09:45 INFO mapred.JobClient: map 58% reduce 0%
13/08/18 16:09:48 INFO mapred.JobClient: map 60% reduce 0%
13/08/18 16:09:51 INFO mapred.JobClient: map 62% reduce 0%
13/08/18 16:09:54 INFO mapred.JobClient: map 64% reduce 0%
13/08/18 16:09:57 INFO mapred.JobClient: map 65% reduce 0%
13/08/18 16:10:01 INFO mapred.JobClient: map 67% reduce 0%
13/08/18 16:10:04 INFO mapred.JobClient: map 69% reduce 0%
13/08/18 16:10:07 INFO mapred.JobClient: map 71% reduce 0%
13/08/18 16:10:10 INFO mapred.JobClient: map 72% reduce 0%
13/08/18 16:10:13 INFO mapred.JobClient: map 74% reduce 0%
13/08/18 16:10:16 INFO mapred.JobClient: map 76% reduce 0%
13/08/18 16:10:19 INFO mapred.JobClient: map 78% reduce 0%
13/08/18 16:10:23 INFO mapred.JobClient: map 80% reduce 0%
13/08/18 16:10:26 INFO mapred.JobClient: map 81% reduce 0%
13/08/18 16:10:29 INFO mapred.JobClient: map 83% reduce 0%
13/08/18 16:10:32 INFO mapred.JobClient: map 84% reduce 0%
13/08/18 16:10:35 INFO mapred.JobClient: map 86% reduce 0%
13/08/18 16:10:38 INFO mapred.JobClient: map 88% reduce 0%
13/08/18 16:10:41 INFO mapred.JobClient: map 90% reduce 0%
13/08/18 16:10:44 INFO mapred.JobClient: map 92% reduce 0%
13/08/18 16:10:47 INFO mapred.JobClient: map 95% reduce 0%
13/08/18 16:10:50 INFO mapred.JobClient: map 98% reduce 0%
13/08/18 16:10:53 INFO mapred.JobClient: map 100% reduce 0%
13/08/18 16:11:02 INFO mapred.JobClient: Job complete:
job_201308151105_0008
13/08/18 16:11:02 INFO mapred.JobClient: Counters: 25
13/08/18 16:11:02 INFO mapred.JobClient: File System Counters
13/08/18 16:11:02 INFO mapred.JobClient:
FILE: Number of bytes read=0
13/08/18 16:11:02 INFO mapred.JobClient:
FILE: Number of bytes written=416878
13/08/18 16:11:02 INFO mapred.JobClient:
FILE: Number of read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:
FILE: Number of large read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:
FILE: Number of write operations=0
13/08/18 16:11:02 INFO mapred.JobClient:
HDFS: Number of bytes read=156425993
13/08/18 16:11:02 INFO mapred.JobClient:
HDFS: Number of bytes written=0
13/08/18 16:11:02 INFO mapred.JobClient:
HDFS: Number of read operations=4
13/08/18 16:11:02 INFO mapred.JobClient:
HDFS: Number of large read operations=0
13/08/18 16:11:02 INFO mapred.JobClient:
HDFS: Number of write operations=0
13/08/18 16:11:02 INFO mapred.JobClient:
Job Counters
13/08/18 16:11:02 INFO mapred.JobClient:
Launched map tasks=2
13/08/18 16:11:02 INFO mapred.JobClient:
Data-local map tasks=2
13/08/18 16:11:02 INFO mapred.JobClient:
Total time spent by all maps in occupied slots (ms)=178767
13/08/18 16:11:02 INFO mapred.JobClient:
Total time spent by all reduces in occupied slots (ms)=0
13/08/18 16:11:02 INFO mapred.JobClient:
Total time spent by all maps waiting after reserving slots (ms)=0
13/08/18 16:11:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/08/18 16:11:02 INFO mapred.JobClient:
Map-Reduce Framework
13/08/18 16:11:02 INFO mapred.JobClient:
Map input records=2168224
13/08/18 16:11:02 INFO mapred.JobClient:
Map output records=2168224
13/08/18 16:11:02 INFO mapred.JobClient:
Input split bytes=248
13/08/18 16:11:02 INFO mapred.JobClient:
Spilled Records=0
13/08/18 16:11:02 INFO mapred.JobClient:
CPU time spent (ms)=14140
13/08/18 16:11:02 INFO mapred.JobClient:
Physical memory (bytes) snapshot=274292736
13/08/18 16:11:02 INFO mapred.JobClient:
Virtual memory (bytes) snapshot=1394532352
13/08/18 16:11:02 INFO mapred.JobClient:
Total committed heap usage (bytes)=125566976
13/08/18 16:11:02 INFO mapred.JobClient:
ImportTsv
13/08/18 16:11:02 INFO mapred.JobClient:
Bad Lines=0
If you get a bad-lines output that shows a high number of errors, particularly if the number equals
the number of rows you are importing, the problem is probably the format of the source file or the
fact that the number of columns in the source file does not match the number of columns defined
in the import specification.
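A quick pre-flight check of the TSV can catch such mismatches before the MapReduce job runs; a minimal sketch, assuming the six-column specification used above (the function name is invented):

```python
EXPECTED_COLUMNS = 6  # HBASE_ROW_KEY plus the five data fields

def find_bad_lines(lines, expected=EXPECTED_COLUMNS):
    """Return the 1-based numbers of lines whose field count is wrong --
    the rows importtsv would report under Bad Lines."""
    return [n for n, line in enumerate(lines, start=1)
            if len(line.rstrip("\n").split("\t")) != expected]

sample = [
    "e49e1c1c\t2013-03-26 09:50:00\t2\t40\t600\t26.59\n",
    "broken,line,with,commas,not,tabs\n",  # wrong delimiter: one field
]
bad = find_bad_lines(sample)
```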
Once the data has been imported, we can use the shell to get one record to check that the import
has worked.
Listing 16. Checking to see if the import has worked
hbase(main):003:0> scan 'chicago', {'LIMIT' => 1}
ROW COLUMN+CELL
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=buscount:,
timestamp=1376949811719, value=40
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=logtime:,
timestamp=1376949811719,
value=2013-03-26 09:50:00
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=readnumber:,
timestamp=1376949811719, value=600
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=region:,
timestamp=1376949811719, value=2
e49e1c1c-0d93-11e3-a5e5-81b71544159b column=speed:,
timestamp=1376949811719, value=26.59
1 row(s) in 0.1830 seconds
You can see the basic structure of the data as it exists within the HBase table. The unique ID
identifies each record, then individual key-value pairs contain the detail (i.e., the columns from the
original SQL table).
Alternative SQL or Hive to HBase
An alternative model to the raw-data export (less common with HBase because of the record
structure) is to use HBase to store summary values and parsed/composed queries.
Because data in HBase is stored in a readily and quickly accessible format (access the
key and get the data), it can hold chunks of data that have been computed by other jobs and
stored into HBase for quick retrieval. For example, the summary data we generated using Hive
earlier in this example could have been written into HBase and accessed quickly to provide
statistical data on the fly for a website.
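That summary-store pattern comes down to choosing a composite row key so each precomputed value is a single get away; a sketch with a plain dict standing in for the HBase table (the key format is an assumption):

```python
# Precomputed day/region averages, as produced by the Hive job, keyed
# the way an HBase summary table might be: "<date>:<region>".
summary_store = {}

def put_summary(day, region, avg_speed):
    summary_store[f"{day}:{region}"] = avg_speed

def get_summary(day, region):
    # One key lookup -- the fast path a website would hit.
    return summary_store.get(f"{day}:{region}")

put_summary("2011-03-12", 1, 25.59)
```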
Using HBase data from Hive
Now that we have the data in HBase, we can start querying and reporting on the information. The
primary advantage of HBase is its powerful querying facilities based on the MapReduce within
Hadoop. Since the data is stored internally as simple key-value combinations, it is easy to process
through MapReduce.
MapReduce is not a natural solution for someone from the SQL world, but we can take advantage
of the flexible nature of Hive's processing model to crunch HBase data using the HiveQL interface.
You may remember that earlier I described how Hive supports processing of mapped data types;
this is exactly what HBase data is: mapped key-value pairs.
To use HBase data, we need to create a table within Hive that points to the HBase table and maps
the key-value pairs in HBase to the column style of Hive.
Listing 17. Creating a table within Hive that points to the HBase table and maps the key-value pairs in HBase to the column style of Hive

hive> CREATE EXTERNAL TABLE hbase_chicago (key string, logtime timestamp,
      region int, buscount int, readnumber int, speed float)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' =
      ':key,logtime:val,region:val,buscount:val,readnumber:val,speed:val')
      TBLPROPERTIES ('hbase.table.name' = 'chicago');
The first part of this code creates a table definition identical to the one we used natively in Hive, except that we have added the row-key UUID as the first column.
The STORED BY block defines the storage format.
The SERDEPROPERTIES block defines the mapping between the HBase structure and the Hive columns. In each entry, the colon separates the HBase column family from the qualifier, and the entries pair up with the columns, in sequence, from the table definition.
The TBLPROPERTIES block defines the name of the HBase table where the data lies.
Once the table has been created, it can be queried through Hive using native SQL, just as we saw earlier.
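The positional pairing performed by the hbase.columns.mapping property can be sketched in Python (plain strings and dicts; the real mapping is performed inside Hive by the HBaseStorageHandler):

```python
# Sketch: how an hbase.columns.mapping string pairs HBase column specifiers
# with Hive columns by position. This is an illustration of the idea, not
# the handler's actual implementation.
mapping = ":key, logtime:val, region:val, buscount:val, readnumber:val, speed:val"
hive_columns = ["key", "logtime", "region", "buscount", "readnumber", "speed"]

# Pair each Hive column, in order, with its HBase family:qualifier specifier.
specs = [m.strip() for m in mapping.split(",")]
pairs = dict(zip(hive_columns, specs))
```

The first specifier, :key, is special: it binds the HBase row key itself to the first Hive column rather than a column family.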
Why use this method instead of a native import? The primary reason is the ease with which it can
be queried (although no longer live), but also because the underlying HBase data can be updated,
rather than just appended to. In an SQL-to-Hadoop architecture this advantage means we can
take regular dumps of changing data from SQL and update the content.
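The update behavior can be sketched with a dict standing in for the HBase table: a put against an existing row key overwrites the stored cell rather than appending a new row, which is what lets regular SQL dumps refresh changed data in place.

```python
# Sketch: HBase put semantics as a dict update. Re-importing a changed row
# replaces the stored value instead of appending a duplicate record.
# Row keys and values here are illustrative.
table = {"e49e1c1c": {"speed:": "26.59"}}

def put(table, row_key, column, value):
    """Insert or overwrite a single cell (an upsert), like HBase's put."""
    table.setdefault(row_key, {})[column] = value

put(table, "e49e1c1c", "speed:", "30.10")   # refreshed value from a new SQL dump
put(table, "ffa01234", "speed:", "12.00")   # a genuinely new row

assert len(table) == 2  # still one row per key; no duplicates
```

Contrast this with the append-only Hive model, where re-loading the same dump would simply add more rows.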
Reminder: HBase or Hive
Given the information here, it's worth reminding ourselves of the benefits of the two systems.
Table 1. Benefits of the two systems

Feature                   Hive         HBase
SQL Support               Yes          No
Tabular Data Structure    Yes          Semi
Append Only               Yes          No
Allows Updates            No           Yes
Live Queries              Not really   Yes, within limits
Joins                     Yes          No
Which one you use will depend entirely on your use case, the data you have available, and how you want to query it. Hive is great for massive processing of ever-increasing data. HBase is useful for querying data that may change over time and needs to be updated.
Conclusion
The primary reason for moving data between SQL stores and Hadoop is usually to take advantage of Hadoop's massive storage and processing capabilities to handle quantities of data larger than you could hope to cope with in SQL alone. How you exchange and process that information from your SQL store into Hadoop is, therefore, important. Large quantities of long-term data that need to be queried more interactively can take advantage of the append-only, SQL-like nature of Hive. For data that needs to be updated and processed, it might make more sense to use HBase. HBase also makes an ideal output target from Hive because it is so easy to access summary data directly by using the native key-value store.
When processing, you also need to consider how to get the data back in. With Hive, the process is easy because we can run SQL and get a table that can easily be imported back into our SQL store for straightforward or live query processing. In this article, I've covered a wide range of use cases and examples of how data can be exchanged more easily between SQL and Hadoop by using tabular interfaces over the non-tabular storage underneath.
Resources
Learn
Check out the Chicago Traffic Tracker.
Read "What's the big deal about Big SQL?" to learn about IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop.
Dig deeper in the "Developing Big SQL queries to analyze big data" tutorial in the InfoSphere
BigInsights tutorial collection (PDF).
"Analyzing social media and structured data with InfoSphere BigInsights" teaches the basics
of using BigSheets to analyze social media and structured data collected through sample
applications provided with BigInsights.
Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights
architecture and underlying technologies.
Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's big data platform and InfoSphere BigInsights.
Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big
Data -- Analyzing Social Media for Watson.
Read "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more
about the InfoSphere BigInsights web console.
Visit the BigInsights Technical Enablement wiki for links to technical materials, demos, training courses, news items, and more.
Learn about the IBM Watson research project.
Check out Big Data University for free courses on Hadoop and big data.
Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies.
Learn more about Apache Hadoop.
Check out HadoopDB.
Read "Using MapReduce and load balancing on the cloud" to learn how to implement the
Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing
to improve the performance of both a single- and multiple-node system.
For information on installing Hadoop using CDH4, see CDH4 Installation - Cloudera Support.
Big Data Glossary, by Pete Warden, O'Reilly Media, ISBN 1449314597, 2011; and Hadoop: The Definitive Guide, by Tom White, O'Reilly Media, ISBN 1449389732, 2010, offer more information.
"HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads" explores the feasibility of building a hybrid system that takes the best features from both technologies.
Learn more by reading "MapReduce and parallel DBMSes: friends or foes?"
"A Survey of Large Scale Data Management Approaches in Cloud Environments" gives a
comprehensive survey of numerous approaches and mechanisms of deploying data-intensive
applications in the cloud, which are gaining a lot of momentum in both research and industrial
communities.
Learn more about big data in the developerWorks big data content area. Find technical
documentation, how-to articles, education, downloads, product information, and more.
Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based
offering that extends the value of open source Hadoop with features like Big SQL, text
analytics, and BigSheets.
Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application,
develop Big SQL queries to analyze big data, and create an extractor to derive insights from
text documents with InfoSphere BigInsights.
Find resources to help you get started with InfoSphere Streams, IBM's high-performance
computing platform that enables user-developed applications to rapidly ingest, analyze, and
correlate information as it arrives from thousands of real-time sources.
Stay current with developerWorks technical events and webcasts.
Follow developerWorks on Twitter.
Get products and technologies
Get Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS from Apache.org.
Download InfoSphere BigInsights Quick Start Edition, available as a native software
installation or as a VMware image.
Download InfoSphere Streams, available as a native software installation or as a VMware image.
Use InfoSphere Streams on IBM SmartCloud Enterprise.
Build your next development project with IBM trial software, available for download directly
from developerWorks.
Discuss
Ask questions and get answers in the InfoSphere BigInsights forum.
Ask questions and get answers in the InfoSphere Streams forum.
Check out the developerWorks blogs and get involved in the developerWorks community.
Follow IBM big data and analytics on Facebook.
8/13/2019 Bd Sqltohadoop2 PDF
18/18
developerWorks ibm.com/developerWorks/
SQL to Hadoop and back again, Part 2: Leveraging HBase andHive Page 18 of 18
About the author
Martin C. Brown
A professional writer for over 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development
languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C,
C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac
OS and more. He currently works as the director of documentation for Continuent.
Copyright IBM Corporation 2013
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)