what is hive? - uk data service · storing your datasets in hadoop • hdfs or the hadoop...
TRANSCRIPT
What is Hive?
The webinar will begin at 3pm
• You now have a menu in the top right corner of your screen.
• The red button with a white arrow allows you to expand and contract the webinar menu, in which you can write questions/comments.
• We won’t have time to answer questions while we are presenting, but will answer them at the end
• You will be on mute throughout – we can’t hear you.
What is Hive?
Webinar
22 March 2016
Peter SmythUK Data Service
Can you hear us?
Can you hear us?
• If Not:
• Check your volume, and that your speaker/headset is
plugged in.
• Your invitation also included a phone number, you can
call that to listen in.
• UK +44 (0) 20 3713 5012
• US +1 (646) 307-1716
• We are recording this webinar, so you can always
listen to it later.
Overview of this webinar
• A definition of big data
• Problems of working with big data
• Introduction to Hive
• Examples using Hive to examine and
reduce the size of the data
• Using the smaller datasets in desktop
application
What is big data?
• For the purpose of this Webinar we are going to define Big data as follows;
“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” (Wikipedia)
• Or to put it another way, too big to fit into your favourite desktop application.
What happened to make the data big?
Traditional data sources such as:
• Surveys are relatively small and well structured and the
data has been specifically collected for a specific
purpose.
• Census data is usually broken down or aggregated
before use, again with each ‘area’ designed and
collected to provide specific types of information.
What happened to make the data big?
New data sources appeared
Social Media
Sensor Data
Transaction Data
What happened to make the data big?
• Not collected primarily for researchers
• Created by others for their own purposes
• Can be very verbose
• Take it (all) or leave it
Growing and shrinking data
Tweets
Smart meter data
Sent Tweet
All Smart meter data
All tweets from user
All tweets from User & Friends
Data from Tweet
Smart meter by day
Smart meter by Month
By Month and Geography
1Kb 1Mb 1Gb 10+ Gb
Desktop Application Big Data Environment
Hadoop
• If you need to start your analysis with a big dataset, then
you need a big data processing system
• We will be using Hadoop
The two things we need Hadoop to do
• Store our big datasets
• Process our big datasets
Storing your datasets in Hadoop
• HDFS or the Hadoop Distributed Files System is used to
store big datasets in Hadoop
• Within HDFS they are stored in such a way as to allow
fast and efficient processing.
• But, to the end user, it is just a file system like File
Explorer in Windows - You have directories and files
within directories
Processing your datasets in Hadoop
• There are many different possibilities
• Today we will look at Hive
What is Hive?
This is the definition from Apache.org:
The Apache Hive ™ data warehouse software facilitates
querying and managing large datasets residing in
distributed storage. Hive provides a mechanism to project
structure onto this data and query the data using a SQL-
like language called HiveQL
What we need to know about Hive
• It makes our files look like tables
• We can use simple bits of code to extract data items
from the tables
Background to the Hive Demonstration
The EnvironmentWe will be using a Hortonworks Sandbox to provide the Hadoop/Hive environment. Details of how to install and access the Sandbox are provided on the UKDS Website.
Loading the Data into Hadoop/HDFSWe will assume that this has already been done. Details of how this can be done are available on the UKDS Website
The DataWe will be using the Smartmeter data which can be downloaded from UKDS (SN7591). We will be using the Geography dataset and the Gas Readings dataset
The Hive Demonstration
We will use two files• A small file (< 1Mb) containing details relating to
households
• A big file (~6.8GB) containing half hourly meter readings over a period of about 3 years for the same households
geography
gas readings
The Geography file
We will only use a few fields
anonid
acorn_category acorn_group acorn_type
nuts1 nuts4
This column represents the household id
These columns represent Acorn
demographic classifications
These columns represent
geographic regions
The Acorn classifications
The Nuts1 values
UKC. North East EnglandUKD. North West EnglandUKE. Yorkshire and the HumberUKF. East MidlandsUKG. West MidlandsUKH. East of EnglandUKI. Greater LondonUKJ. South East EnglandUKK. South West England
The Readings fileThere are only four columns
anon_id
advancedatetime
hh
This column represents the household id
The data and time of the reading
The half hour period within the
day
gaskwh The gas used in the half hour period
Data in Hadoop
Using Hive:
1. Create tables from the datasets
2. Explore missing values and create bar charts from the geography table
3. Aggregate gas meter readings to monthly – to make the data smaller
4. Create a summary table from the gas readings
5. Join monthly gas readings file and geography file
Data in R
Create simple graphs of the data from the joined tables
Data in Excel
Draw Scatterplot chart and investigate outliers
Use data in Excel
via ODBC
Use data in R via ODBC
ODBC = Open Data Base Connector – free to download and use – allows you to access data stored elsewhere on different tools e.g. R, Stata or Excel
Monthly gas readings joined with geography
Summary data from the allgas table
The Demonstration tools
• For the main part of the demonstration we will be using
Hive
• We will access Hive via the Zeppelin product
• Both are included in the Hortonworks Sandbox
• For the last two parts of the demonstration we will
access our created tables from R and Excel using ODBC
to transfer the table contents from Hive
Demo
• Accessing the Sandbox
• Accessing Zeppelin
• Writing Hive queries in Zeppelin
Zeppelin notebook overview
Zeppelin Graph output
Joining the tables
geography gas_months_09
• These are the two tables we wish to join• There must be a column which is common to both• In this case, it is the ‘anonID’ and the ‘anon_id’
columns
Joining the tables – an explanation
geography gas_months_09
• Although the names are different, we know that they represent the same information
• In the geography table each value in the anonID column occurs only once. i.e there is a single row of data describing each household.
Joining the tables – an explanation
geography gas_months_09
• In the gas_months_09 table each value in the anon_id column occurs (typically) once for each month of 2009.
The Join Query
create table gas_09_geog as select a.anon_id,
a.reading_month, a.monthlykwh, a.count_readings,b.fueltypes, b.acorn_category, b.acorn_group, b.acorn_type, b.nuts4, b.lacode, b.nuts1
from geography as b join gas_months_09 as a on b.anonid = a.anon_id;
The gas_09_geog table
Using the Hive tables on the desktop
• Using R and Excel to download the results
(gas_09_geog and gasdays_summary ) tables
• Using the R to perform simple graphing of the joined
data
• Using Excel to draw a scatter plot to expose potential
outliers
[Demo]
Summary
• There may be no choice in using Big Data especially if
you do not control the source
• Hadoop and Big Data Tools like Hive are just that – Tools
• In this example session we have used Hive to cut the
data down to size for your preferred desktop application.
• We have chosen to use R as our desktop application,
but SPSS or Stata could also have been chosen.
More Information
• The Apache.org website provides:
o details on Hive http://hive.apache.org/
o details on Zeppelin https://zeppelin.incubator.apache.org/
• Detailed documentation on the Hive commands and
Syntax is available at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• A basic tutorial is at: https://cwiki.apache.org/confluence/display/Hive/Tutorial
A book worth reading
• Although, probably has a lot more information than you need to get started
Questions
Peter Smyth
ukdataservice.ac.uk/help/
Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKDATASERVICE
Follow us on Twitter https://twitter.com/UKDataServiceor Facebook https://www.facebook.com/UKDataService