what is hive? - uk data service · storing your datasets in hadoop • hdfs or the hadoop...

37
What is Hive? The webinar will begin at 3pm You now have a menu in the top right corner of your screen. The red button with a white arrow allows you to expand and contract the webinar menu, in which you can write questions/comments. We won’t have time to answer questions while we are presenting, but will answer them at the end You will be on mute throughout – we can’t hear you.

Upload: others

Post on 23-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What is Hive?

The webinar will begin at 3pm

• You now have a menu in the top right corner of your screen.

• The red button with a white arrow allows you to expand and contract the webinar menu, in which you can write questions/comments.

• We won’t have time to answer questions while we are presenting, but will answer them at the end

• You will be on mute throughout – we can’t hear you.

Page 2: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What is Hive?

Webinar

22 March 2016

Peter SmythUK Data Service

Page 3: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Can you hear us?

Page 4: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Can you hear us?

• If Not:

• Check your volume, and that your speaker/headset is

plugged in.

• Your invitation also included a phone number, you can

call that to listen in.

• UK +44 (0) 20 3713 5012

• US +1 (646) 307-1716

• We are recording this webinar, so you can always

listen to it later.

Page 5: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Overview of this webinar

• A definition of big data

• Problems of working with big data

• Introduction to Hive

• Examples using Hive to examine and

reduce the size of the data

• Using the smaller datasets in desktop

application

Page 6: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What is big data?

• For the purpose of this Webinar we are going to define Big data as follows;

“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” (Wikipedia)

• Or to put it another way, too big to fit into your favourite desktop application.

Page 7: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What happened to make the data big?

Traditional data sources such as:

• Surveys are relatively small and well structured and the

data has been specifically collected for a specific

purpose.

• Census data is usually broken down or aggregated

before use, again with each ‘area’ designed and

collected to provide specific types of information.

Page 8: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What happened to make the data big?

New data sources appeared

Social Media

Sensor Data

Transaction Data

Page 9: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What happened to make the data big?

• Not collected primarily for researchers

• Created by others for their own purposes

• Can be very verbose

• Take it (all) or leave it

Page 10: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Growing and shrinking data

Tweets

Smart meter data

Sent Tweet

All Smart meter data

All tweets from user

All tweets from User & Friends

Data from Tweet

Smart meter by day

Smart meter by Month

By Month and Geography

1Kb 1Mb 1Gb 10+ Gb

Desktop Application Big Data Environment

Page 11: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Hadoop

• If you need to start your analysis with a big dataset, then

you need a big data processing system

• We will be using Hadoop

Page 12: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The two things we need Hadoop to do

• Store our big datasets

• Process our big datasets

Page 13: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Storing your datasets in Hadoop

• HDFS or the Hadoop Distributed Files System is used to

store big datasets in Hadoop

• Within HDFS they are stored in such a way as to allow

fast and efficient processing.

• But, to the end user, it is just a file system like File

Explorer in Windows - You have directories and files

within directories

Page 14: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Processing your datasets in Hadoop

• There are many different possibilities

• Today we will look at Hive

Page 15: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What is Hive?

This is the definition from Apache.org:

The Apache Hive ™ data warehouse software facilitates

querying and managing large datasets residing in

distributed storage. Hive provides a mechanism to project

structure onto this data and query the data using a SQL-

like language called HiveQL

Page 16: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

What we need to know about Hive

• It makes our files look like tables

• We can use simple bits of code to extract data items

from the tables

Page 17: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Background to the Hive Demonstration

The EnvironmentWe will be using a Hortonworks Sandbox to provide the Hadoop/Hive environment. Details of how to install and access the Sandbox are provided on the UKDS Website.

Loading the Data into Hadoop/HDFSWe will assume that this has already been done. Details of how this can be done are available on the UKDS Website

The DataWe will be using the Smartmeter data which can be downloaded from UKDS (SN7591). We will be using the Geography dataset and the Gas Readings dataset

Page 18: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Hive Demonstration

We will use two files• A small file (< 1Mb) containing details relating to

households

• A big file (~6.8GB) containing half hourly meter readings over a period of about 3 years for the same households

geography

gas readings

Page 19: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Geography file

We will only use a few fields

anonid

acorn_category acorn_group acorn_type

nuts1 nuts4

This column represents the household id

These columns represent Acorn

demographic classifications

These columns represent

geographic regions

Page 20: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Acorn classifications

Page 21: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Nuts1 values

UKC. North East EnglandUKD. North West EnglandUKE. Yorkshire and the HumberUKF. East MidlandsUKG. West MidlandsUKH. East of EnglandUKI. Greater LondonUKJ. South East EnglandUKK. South West England

Page 22: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Readings fileThere are only four columns

anon_id

advancedatetime

hh

This column represents the household id

The data and time of the reading

The half hour period within the

day

gaskwh The gas used in the half hour period

Page 23: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Data in Hadoop

Using Hive:

1. Create tables from the datasets

2. Explore missing values and create bar charts from the geography table

3. Aggregate gas meter readings to monthly – to make the data smaller

4. Create a summary table from the gas readings

5. Join monthly gas readings file and geography file

Data in R

Create simple graphs of the data from the joined tables

Data in Excel

Draw Scatterplot chart and investigate outliers

Use data in Excel

via ODBC

Use data in R via ODBC

ODBC = Open Data Base Connector – free to download and use – allows you to access data stored elsewhere on different tools e.g. R, Stata or Excel

Monthly gas readings joined with geography

Summary data from the allgas table

Page 24: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Demonstration tools

• For the main part of the demonstration we will be using

Hive

• We will access Hive via the Zeppelin product

• Both are included in the Hortonworks Sandbox

• For the last two parts of the demonstration we will

access our created tables from R and Excel using ODBC

to transfer the table contents from Hive

Page 25: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Demo

• Accessing the Sandbox

• Accessing Zeppelin

• Writing Hive queries in Zeppelin

Page 26: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Zeppelin notebook overview

Page 27: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Zeppelin Graph output

Page 28: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Joining the tables

geography gas_months_09

• These are the two tables we wish to join• There must be a column which is common to both• In this case, it is the ‘anonID’ and the ‘anon_id’

columns

Page 29: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Joining the tables – an explanation

geography gas_months_09

• Although the names are different, we know that they represent the same information

• In the geography table each value in the anonID column occurs only once. i.e there is a single row of data describing each household.

Page 30: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Joining the tables – an explanation

geography gas_months_09

• In the gas_months_09 table each value in the anon_id column occurs (typically) once for each month of 2009.

Page 31: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The Join Query

create table gas_09_geog as select a.anon_id,

a.reading_month, a.monthlykwh, a.count_readings,b.fueltypes, b.acorn_category, b.acorn_group, b.acorn_type, b.nuts4, b.lacode, b.nuts1

from geography as b join gas_months_09 as a on b.anonid = a.anon_id;

Page 32: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

The gas_09_geog table

Page 33: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Using the Hive tables on the desktop

• Using R and Excel to download the results

(gas_09_geog and gasdays_summary ) tables

• Using the R to perform simple graphing of the joined

data

• Using Excel to draw a scatter plot to expose potential

outliers

[Demo]

Page 34: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Summary

• There may be no choice in using Big Data especially if

you do not control the source

• Hadoop and Big Data Tools like Hive are just that – Tools

• In this example session we have used Hive to cut the

data down to size for your preferred desktop application.

• We have chosen to use R as our desktop application,

but SPSS or Stata could also have been chosen.

Page 35: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

More Information

• The Apache.org website provides:

o details on Hive http://hive.apache.org/

o details on Zeppelin https://zeppelin.incubator.apache.org/

• Detailed documentation on the Hive commands and

Syntax is available at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

• A basic tutorial is at: https://cwiki.apache.org/confluence/display/Hive/Tutorial

Page 36: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

A book worth reading

• Although, probably has a lot more information than you need to get started

Page 37: What is Hive? - UK Data Service · Storing your datasets in Hadoop • HDFS or the Hadoop Distributed Files System is used to store big datasets in Hadoop • Within HDFS they are

Questions

Peter Smyth

[email protected]

ukdataservice.ac.uk/help/

Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE

Follow us on Twitter https://twitter.com/UKDataServiceor Facebook https://www.facebook.com/UKDataService