Apache Hive micro guide - ConfusedCoders

Apache Hive Walkthrough

Yash Sharma - ConfusedCoders

A short Hive walkthrough: an absolute-beginner introduction to Apache Hive.

Table of Contents

Introducing Apache Hive
  Starting Notes
  What is Apache Hive?
  Motivation for Hive
  Some History
Deep Dive into Hive
  Hive Architecture
  Hive Data Model
  Hive Query Language (HiveQL)
  Hive Web Interface
  Hive Authorizations
Getting Hands dirty – Hive Hands On
  Hive Installation
  Sample Hive Queries
  Hive SerDe
  Hive Partitions
  Hive JDBC
  Changing Hive default Data store
  Hive User Defined Functions (UDF)


Introducing Apache Hive

Starting Notes

This is a Hive kick-start guide: it covers basic Hive concepts and shows how Hive works on top of Hadoop. The tutorial expects you to be familiar with Hadoop basics and assumes a preliminary understanding of Map-Reduce programming.

What is Apache Hive?

Apache Hive is an open-source data warehouse system for Hadoop. Hive is an abstraction over Apache Hadoop that provides users an SQL-like query interface: the user sees only table-like data and fires Hive queries against it, while Hive internally creates, plans and executes Map-Reduce jobs and returns the desired results. Hive is suitable for both structured and semi-structured data.
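For example, an aggregation like the sketch below (using the ratings table that we will create later in this guide) is silently compiled into one or more Map-Reduce jobs:

SELECT movie_id, COUNT(*) AS num_ratings
FROM ratings
GROUP BY movie_id;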

Motivation for Hive

The prime motivation for Hive was to let users quickly build business solutions in a familiar language rather than having to think in Map-Reduce for every problem. Hive uses its own query language, HiveQL, which is very similar to traditional SQL.

Some History

Hive, written in Java, was initially developed at Facebook, is available under the Apache 2.0 License, and now sees development from companies like Netflix. At the time of writing, the current version of Hive is 0.9.0, which is compatible with Hadoop 0.20.x and 1.x.


Deep Dive into Hive

Hive Architecture

[Figure: Hive architecture and its components]

Hive works on top of Hadoop and needs Apache Hadoop to be running on your box. Let's take a quick look at the components:

• Hadoop – Hive needs Hadoop as the base framework to operate.

• Driver – Hive has its own driver to communicate with the Hadoop world and manage query sessions.

• Command Line Interface (CLI) – the Hive CLI is the console for firing Hive queries; we will use it for operating on our data.

• Web Interface – Hive also provides a web interface to monitor and administer Hive jobs.

• Metastore – the Metastore is Hive's system catalog: it stores all the structure information of the various tables and partitions in Hive.

• Thrift Server – Hive ships with a Thrift server through which Hive can be exposed as a service, which can then be connected to via JDBC/ODBC, etc.


Apart from the components above, a few more actors are vital to query execution:

• The Hive Compiler is responsible for the semantic analysis of the input query and for creating an execution plan. The execution plan is a DAG of stages (you can inspect it with EXPLAIN, as sketched after this list).

• The Hive Execution Engine executes the plan created by the Hive Compiler.

• Hive also has an Optimizer that optimizes the execution plan.
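As a quick sketch, the EXPLAIN statement prints the plan Hive generates for a query (the DAG of stages) without running the job; the ratings table here is the one created later in this guide:

EXPLAIN
SELECT movie_id, COUNT(*) FROM ratings GROUP BY movie_id;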

Hive Data Model

Data in Hive is organized into three data models:

• Tables – analogous to tables in a relational database.

• Partitions – every table can have one or more partition keys, and the data is stored in files based on the partition key. Without a partition, all the values are fed to the MR job, whereas with a partition key only a small subset of the data is passed to the MR jobs. Hive creates a separate directory for each partition to hold its data.

• Buckets – the data in each partition may again be divided into buckets based on the hash of a column. Each bucket is stored as a file in the partition directory (see the combined sketch after this list).
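As a sketch combining these concepts (the table and column names are hypothetical), a table partitioned by date and bucketed on user_id could be declared as:

CREATE TABLE logs (
  user_id int,
  action string
)
PARTITIONED BY (dt string)
CLUSTERED BY (user_id) INTO 32 BUCKETS;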

Hive Query Language (HiveQL)

HiveQL is very similar to SQL, and users can rely on familiar SQL syntax for loading and querying tables. Hive queries are checked by the compiler for correctness, an execution plan is created from them, and the Hive Executor then runs the plan. The Hive Language Manual can be found here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive Web Interface

The Hive web interface is an alternative to the command line interface; we can use it for administering and monitoring Hive jobs. By default, the Hive web interface is available at:

http://localhost:9999/hwi

Note: The web interface service must be running in order to view it.
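In Hive 0.x the web interface ships as the hwi service; assuming a standard installation, it can be started with:

$> hive --service hwi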


Hive Authorizations

The Hive authorization system consists of Users, Groups and Roles. A role is a name given to a set of grants, can be reused, and may be assigned to users, to groups, and to other roles. Hive roles must be created manually before being used. Users and groups need not be created manually: the Metastore determines the username of the connecting user and the groups associated with her.
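As a hedged sketch of the authorization statements (the role and user names here are hypothetical):

CREATE ROLE analyst;
GRANT SELECT ON TABLE movies TO ROLE analyst;
GRANT ROLE analyst TO USER yash;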

More on users, Groups and Roles can be found here:

https://cwiki.apache.org/Hive/languagemanual-auth.html


Getting Hands dirty – Hive Hands On

Hive Installation

This is a short, crisp guide to installing Hive on top of your Hadoop setup. Apache Hadoop is a prerequisite for Apache Hive and must be installed on your box before we can proceed.

1. Download Hive: Download a Hive release from the Apache Hive website.

2. Extract Hive: Extract your Hive archive to any location of your choice.

3. Export Environment Variables & Path:

Hive needs three environment variables set:

• JAVA_HOME: the path where your java is present.

• HADOOP_HOME: path where your Hadoop is present.

• HIVE_HOME: path where you’ve just extracted your hive.

Set these environment variables accordingly to continue. You can export environment variables with the shell command:

$> export HADOOP_HOME=/home/ubuntu/hadoop/
$> export HIVE_HOME=/home/ubuntu/hive/
$> export PATH=$PATH:$HADOOP_HOME/bin
$> export PATH=$PATH:$HIVE_HOME/bin

4. Create Warehouse directory for Hive:

Hive stores all its data under the directory /user/hive/warehouse/, so let's create that path.

$> sudo mkdir -p /user/hive/warehouse

If your Hadoop runs in (pseudo-)distributed mode, create the directory on HDFS instead:

$> hadoop fs -mkdir /user/hive/warehouse

5. Start Hadoop:

Hive needs Hadoop running for its operations, so let's start Hadoop with the start-all script. Since you have added HADOOP_HOME/bin to PATH, you should be able to call start-all.sh directly; otherwise, go to the Hadoop bin directory and issue start-all.sh from there.

$> start-all.sh

6. Start Hive:

Finally, start Hive by issuing the command 'hive'.

$> hive

hive> show databases;


You can also start the Hive server with the command:

$> hive --service hiveserver

Sample Hive Queries

Below are sample Hive queries to create tables and import data into them. The queries assume the data files are present on your local system, in the following format:

File: movies.dat

movie_id:movie_name:tags

File: ratings.dat

user_id:movie_id:rating:timestamp

Download Movie Lens Dataset Here: http://www.grouplens.org/node/73

CREATE TABLES:
--------------

CREATE TABLE movies (movie_id int, movie_name string, tags string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

CREATE TABLE ratings (user_id string, movie_id string, rating float, tmstmp string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

LOAD DATA FROM LOCAL PATH (Not HDFS):
-------------------------------------

LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/movies.dat'
OVERWRITE INTO TABLE movies;

LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/ratings.dat'
OVERWRITE INTO TABLE ratings;

VERIFY:
-------

DESCRIBE movies;

DESCRIBE ratings;


OR,

SELECT * FROM movies LIMIT 10;
SELECT * FROM ratings LIMIT 10;
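Once both tables are loaded, regular SQL-style analytics work as expected. As a hedged sketch, this query computes the average rating per movie (Hive reconciles the int/string movie_id key types implicitly, though declaring both columns as the same type is cleaner):

SELECT m.movie_name, AVG(r.rating) AS avg_rating
FROM movies m
JOIN ratings r ON (m.movie_id = r.movie_id)
GROUP BY m.movie_name;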

Hive SerDe

The queries above work fine with single-character delimiters, but many a time we face situations with complex or multi-character delimiters. In these scenarios we need a Hive SerDe (serializer/deserializer) to get our data into Hive tables.

Here is a sample SerDe usage for a USER file which has '::' as the field delimiter. The data in the USER file has the format:

id::gender::age::occupation::zipcode

When using the RegexSerDe we specify a regular expression whose capture groups divide each line of data into fields.

Query Using SerDe:
------------------

-- The contrib RegexSerDe only accepts string columns, so all fields
-- are declared as strings here. If the class is not found, add the
-- hive-contrib jar to the session first: ADD JAR <path-to-hive-contrib-jar>;
CREATE TABLE USER (id string, gender string, age string, occupation string, zipcode string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)"
);

Hive Partitions

Hive partitions can be created on any particular field/column. Partitions make your queries faster: Hive keeps the data for each partition in a separate directory, so queries that filter on the partition key read only the matching directories instead of the entire data set. We can create a partitioned table in Hive with the query below; here we choose the date column as the partition key. Note that the partition column is declared only in the PARTITIONED BY clause, not in the regular column list:

CREATE TABLE table_name (
  id int,
  name string
)
PARTITIONED BY (date string);
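As a hedged sketch (the path and date value are hypothetical), data is then loaded into a specific partition like this:

LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/part1.dat'
OVERWRITE INTO TABLE table_name PARTITION (date = '2013-05-01');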

Hive JDBC

Once the Hive Thrift server is started, Hive is exposed as a service and we can connect to it via JDBC. Connecting to Hive via JDBC is similar to connecting to any relational DB like MySQL. Below is sample code demonstrating a few common Hive queries:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJDBC {

    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        // Load the Hive JDBC driver.
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }

        // Connect to the Hive Thrift server (default port 10000).
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // (Re)create a test table.
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (city string, temperature int)");

        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }

        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: the filepath has to be local to the Hive server, and the file
        // should be ctrl-A separated with two fields per line
        String filepath = "/home/ubuntu/yash_workspace/data";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getString(1)) + "\t" + res.getInt(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }

        stmt.close();
        con.close();
    }
}


The code above creates a table, loads data into it, fires two SELECT queries, and prints the output to the console. More complex queries are left for the reader to explore.

Changing Hive default Data store

Hive by default uses an embedded Derby database as its data store, and many a time we may need to change the default data store to some other database. A couple of configuration changes are needed to do this. Here we will change the default data store to a MySQL DB.

Below are the steps for using MySQL Database as Hive’s database:

• Download the MySQL JDBC driver and copy the jar file into Hive's lib folder.

• Create a 'metastore_db' database in MySQL.

• Create a user 'hiveuser' in MySQL.

• Grant all permissions to 'hiveuser':
  GRANT ALL ON *.* TO 'hiveuser'@'localhost' IDENTIFIED BY 'your_password';

• Add the following configuration tags to hive-site.xml :

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExists=true</value>
  <description>The JDBC connection string for the metastore_db you just created</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for JDBC</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
  <description>DB username we just created in MySQL</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>your_password</value>
  <description>Password for your user – hiveuser</description>
</property>

Hive User Defined Functions (UDF)

Hive allows users to create their own user defined functions and use them in their Hive queries. You need to extend Hive's UDF class to create your own UDF. Below is a small UDF for auto-increment functionality; it can also be used to get a ROWNUM-like value for a table in Hive. (Note that the counter lives inside each task, so the numbering is only globally sequential when the query runs as a single task.)

package com.confusedcoders.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class AutoIncrUdf extends UDF {

    // Counter state lives in the UDF instance; each task keeps its own count.
    int lastValue;

    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}

USAGE:

add jar /home/ubuntu/Desktop/HiveUdf.jar;
create temporary function incr as 'com.confusedcoders.hive.udf.AutoIncrUdf';

SELECT id, incr() AS rownum FROM USER LIMIT 10;