what is hive? - uk data service · storing your datasets in hadoop • hdfs or the hadoop...

What is Hive?

The webinar will begin at 3pm

• You now have a menu in the top right corner of your screen.

• The red button with a white arrow allows you to expand and contract the webinar menu, in which you can write questions/comments.

• We won’t have time to answer questions while we are presenting, but will answer them at the end

• You will be on mute throughout – we can’t hear you.

What is Hive?

Webinar

22 March 2016

Peter SmythUK Data Service

Can you hear us?

Can you hear us?

• If Not:

• Check your volume, and that your speaker/headset is

plugged in.

• Your invitation also included a phone number, you can

call that to listen in.

• UK +44 (0) 20 3713 5012

• US +1 (646) 307-1716

• We are recording this webinar, so you can always

listen to it later.

Overview of this webinar

• A definition of big data

• Problems of working with big data

• Introduction to Hive

• Examples using Hive to examine and

reduce the size of the data

• Using the smaller datasets in desktop

application

What is big data?

• For the purpose of this Webinar we are going to define Big data as follows;

“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” (Wikipedia)

• Or to put it another way, too big to fit into your favourite desktop application.

What happened to make the data big?

Traditional data sources such as:

• Surveys are relatively small and well structured and the

data has been specifically collected for a specific

purpose.

• Census data is usually broken down or aggregated

before use, again with each ‘area’ designed and

collected to provide specific types of information.


New data sources appeared

Social Media

Sensor Data

Transaction Data


• Not collected primarily for researchers

• Created by others for their own purposes

• Can be very verbose

• Take it (all) or leave it

Growing and shrinking data

Tweets

Smart meter data

Sent Tweet

All Smart meter data

All tweets from user

All tweets from User & Friends

Data from Tweet

Smart meter by day

Smart meter by Month

By Month and Geography

1Kb 1Mb 1Gb 10+ Gb

Desktop Application Big Data Environment

Hadoop

• If you need to start your analysis with a big dataset, then

you need a big data processing system

• We will be using Hadoop

The two things we need Hadoop to do

• Store our big datasets

• Process our big datasets

Storing your datasets in Hadoop

• HDFS or the Hadoop Distributed Files System is used to

store big datasets in Hadoop

• Within HDFS they are stored in such a way as to allow

fast and efficient processing.

• But, to the end user, it is just a file system like File

Explorer in Windows - You have directories and files

within directories

Processing your datasets in Hadoop

• There are many different possibilities

• Today we will look at Hive

What is Hive?

This is the definition from Apache.org:

The Apache Hive ™ data warehouse software facilitates

querying and managing large datasets residing in

distributed storage. Hive provides a mechanism to project

structure onto this data and query the data using a SQL-

like language called HiveQL

What we need to know about Hive

• It makes our files look like tables

• We can use simple bits of code to extract data items

from the tables

Background to the Hive Demonstration

The EnvironmentWe will be using a Hortonworks Sandbox to provide the Hadoop/Hive environment. Details of how to install and access the Sandbox are provided on the UKDS Website.

Loading the Data into Hadoop/HDFSWe will assume that this has already been done. Details of how this can be done are available on the UKDS Website

The DataWe will be using the Smartmeter data which can be downloaded from UKDS (SN7591). We will be using the Geography dataset and the Gas Readings dataset

The Hive Demonstration

We will use two files• A small file (< 1Mb) containing details relating to

households

• A big file (~6.8GB) containing half hourly meter readings over a period of about 3 years for the same households

geography

gas readings

The Geography file

We will only use a few fields

anonid

acorn_category acorn_group acorn_type

nuts1 nuts4

This column represents the household id

These columns represent Acorn

demographic classifications

These columns represent

geographic regions

The Acorn classifications

The Nuts1 values

UKC. North East EnglandUKD. North West EnglandUKE. Yorkshire and the HumberUKF. East MidlandsUKG. West MidlandsUKH. East of EnglandUKI. Greater LondonUKJ. South East EnglandUKK. South West England

The Readings fileThere are only four columns

anon_id

advancedatetime

hh

This column represents the household id

The data and time of the reading

The half hour period within the

day

gaskwh The gas used in the half hour period

Data in Hadoop

Using Hive:

1. Create tables from the datasets

2. Explore missing values and create bar charts from the geography table

3. Aggregate gas meter readings to monthly – to make the data smaller

4. Create a summary table from the gas readings

5. Join monthly gas readings file and geography file

Data in R

Create simple graphs of the data from the joined tables

Data in Excel

Draw Scatterplot chart and investigate outliers

Use data in Excel

via ODBC

Use data in R via ODBC

ODBC = Open Data Base Connector – free to download and use – allows you to access data stored elsewhere on different tools e.g. R, Stata or Excel

Monthly gas readings joined with geography

Summary data from the allgas table

The Demonstration tools

• For the main part of the demonstration we will be using

Hive

• We will access Hive via the Zeppelin product

• Both are included in the Hortonworks Sandbox

• For the last two parts of the demonstration we will

access our created tables from R and Excel using ODBC

to transfer the table contents from Hive

Demo

• Accessing the Sandbox

• Accessing Zeppelin

• Writing Hive queries in Zeppelin

Zeppelin notebook overview

Zeppelin Graph output

Joining the tables

geography gas_months_09

• These are the two tables we wish to join• There must be a column which is common to both• In this case, it is the ‘anonID’ and the ‘anon_id’

columns

Joining the tables – an explanation


• Although the names are different, we know that they represent the same information

• In the geography table each value in the anonID column occurs only once. i.e there is a single row of data describing each household.

Joining the tables – an explanation


• In the gas_months_09 table each value in the anon_id column occurs (typically) once for each month of 2009.

The Join Query

create table gas_09_geog as select a.anon_id,

a.reading_month, a.monthlykwh, a.count_readings,b.fueltypes, b.acorn_category, b.acorn_group, b.acorn_type, b.nuts4, b.lacode, b.nuts1

from geography as b join gas_months_09 as a on b.anonid = a.anon_id;

The gas_09_geog table

Using the Hive tables on the desktop

• Using R and Excel to download the results

(gas_09_geog and gasdays_summary ) tables

• Using the R to perform simple graphing of the joined

data

• Using Excel to draw a scatter plot to expose potential

outliers

[Demo]

Summary

• There may be no choice in using Big Data especially if

you do not control the source

• Hadoop and Big Data Tools like Hive are just that – Tools

• In this example session we have used Hive to cut the

data down to size for your preferred desktop application.

• We have chosen to use R as our desktop application,

but SPSS or Stata could also have been chosen.

More Information

• The Apache.org website provides:

o details on Hive http://hive.apache.org/

o details on Zeppelin https://zeppelin.incubator.apache.org/

• Detailed documentation on the Hive commands and

Syntax is available at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

• A basic tutorial is at: https://cwiki.apache.org/confluence/display/Hive/Tutorial

A book worth reading

• Although, probably has a lot more information than you need to get started

Questions

Peter Smyth

[email protected]

ukdataservice.ac.uk/help/

Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE

Follow us on Twitter https://twitter.com/UKDataServiceor Facebook https://www.facebook.com/UKDataService

what is hive? - uk data service · storing your datasets in hadoop • hdfs or the hadoop...

Documents