Hive workshop slides - UK Data Service · 2016-06-27


Page 1: Hive workshop slides - UK Data Service · 2016-06-27 · Overview of this workshop 1. High level explanation of the Hadoop ecosystem 2. Overview of HDFS and Hadoop processing 3. Develop

Hive Workshop

Peter Smyth

24th June 2016

An introduction to Hadoop, HDFS, Hive and HQL

Workshop

Page 2:

Admin

• Fire Alarm Test expected at 11:00am
• Real Fire – out the way you came in (upstairs and out the main entrance)
• Toilets – on this floor
• Coffee breaks in the morning and afternoon
• Lunch (sandwiches) about 1:00pm

Page 3:

Program session times

09.30 Registration and coffee

10.00 Introductions to Hadoop, Hive, the software and each other

11.15 Coffee break

11.30 Hive 1: Hive Queries (Lessons 1 & 2)

13.00 Lunch - sandwiches

13.30 Hive 2: Creating tables, table types and table storage (Lesson 3)

14.30 Coffee break

14.45 Hive 3: Accessing Hive using the command line and ODBC (Lessons 4 & 5)

16.00 Close

Page 4:

Overview of this workshop

1. High level explanation of the Hadoop ecosystem

2. Overview of HDFS and Hadoop processing

3. Develop Hive queries to slice, dice, sample and join big datasets

4. Create and load data into tables

5. Demonstrate the use of Hive queries from external systems using ODBC

Page 5:

User experience & aims

• Experience of big data?
• Expectations?
• Next steps?

Page 6:

What is big data?

• Perhaps the most relevant definition is also the most obvious:

“Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” (Wikipedia)

• Or to put it another way, too big to fit into your favourite desktop application.

Page 7:

Growing and shrinking data

[Diagram: two data scales, tweets and smart meter data. A single sent tweet or one day of smart meter readings is around 1Kb and fits a desktop application; all tweets from a user and friends, or smart meter data by month and geography, grows through 1Mb and 1Gb; at 10+ Gb (all tweets, all smart meter data) a big data environment is needed.]

Page 8:

Hadoop

• Created by Doug Cutting and Mike Cafarella
• Based on the 2003-4 Google papers on MapReduce and GFS (Google File System)
• MapReduce had been around for decades before that, first appearing in the programming language Lisp
• The name Hadoop comes from the name of a cuddly toy elephant owned by Doug Cutting's son

Page 9:

Hadoop Infrastructure

• A Hadoop cluster can be formed by thousands of nodes – 4 would be a minimum
• A node is an individual computer, many times more powerful than the average desktop
• The strength of the Hadoop system is not its raw power, but its ability to break a processing task down and run all of the parts in parallel
• It is built to be resilient, i.e. to cope with server failures

Page 10:

A picture of the Hadoop Eco-system

(http://hortonworks.com/)

Page 11:

A (very) minimal picture of Hadoop

[Diagram: Hadoop, split into its two core components – HDFS and MapReduce]

Page 12:

Hadoop components

• HDFS (Hadoop Distributed File System)
  • To the end user, just a file system
  • Internally, files are segmented into blocks of 128MB and randomly distributed across the available datanodes
  • A datanode is a server in the Hadoop cluster where actual processing takes place – i.e. where your programs are run
  • A namenode (another server in the cluster) keeps track of the random distribution of the blocks

• Map Reduce
  • The execution engine
  • Enhanced by Tez
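Hive queries are compiled down to this execution layer, and in recent Hive versions the engine can be switched from within the Hive shell. A hedged sketch (the property name is a standard Hive setting, but `some_table` is a placeholder):

```sql
-- Switch the engine Hive compiles queries to.
-- Availability of Tez depends on the Sandbox version.
SET hive.execution.engine=mr;   -- classic MapReduce
SET hive.execution.engine=tez;  -- Tez, usually noticeably faster

-- The chosen engine is used for any subsequent query, e.g.:
SELECT COUNT(*) FROM some_table;
```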

Page 13:

Data + Program = Processing

[Diagram: in traditional client-server computing the data is brought to the program (e.g. MS Word); in Hadoop the program is sent to where the big data already sits.]

Page 14:

Sandbox v Cluster

• The Sandbox environment we will be using is not a cluster – it is one virtual machine running on one physical machine
• The Sandbox thinks it is a cluster; it is configured that way
• In reality, everything that a cluster does in parallel, the Sandbox does in series
• This makes it very slow
• But it can still process big datasets
• The code for the queries you write will be just the same

Page 15:

Sandbox limitations

• Speed of processing – already mentioned
• Capacity – a Hortonworks Sandbox has a size limit of 50Gb
• In practice this means a limit of 25-30Gb of storage space for your datasets in HDFS
• It is possible to expand the size of the VM
• Any space used by the VM as it expands must be available on the hosting physical machine

Page 16:

What is Hive?

This is the definition from Apache.org:

“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.”
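To give a first taste of that SQL-likeness, here is a sketch of a simple HiveQL query. The table and column names anticipate the smartmeter data used later but are illustrative at this point:

```sql
-- HiveQL reads almost exactly like ANSI SQL.
-- Table/column names are illustrative only.
SELECT   anon_id,
         AVG(eleckwh) AS avg_kwh
FROM     edrp_elec
GROUP BY anon_id
LIMIT    10;
```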

Page 17:

Using Hive – What we will cover

• Select statements (i.e. queries)
• Table creation and loading data
• Sampling tables, joining tables
• Table storage and partitioned tables
• Various ways of accessing Hive
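As a preview, each of these topics maps onto a small family of HiveQL statements. The sketches below use placeholder names and paths, not the workshop's actual files:

```sql
-- Table creation
CREATE TABLE demo (id INT, reading DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Loading data from a file already in HDFS
LOAD DATA INPATH '/user/demo/demo.csv' INTO TABLE demo;

-- Sampling: roughly 1 percent of the table's blocks
SELECT * FROM demo TABLESAMPLE (1 PERCENT) s;

-- Joining two tables
SELECT a.id, b.label
FROM   demo a JOIN labels b ON (a.id = b.id);
```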

Page 18:

Finding Information on the Internet

• Relating to Apache Hadoop and Hive

Page 19:

Apache.org

Page 20:

Specific Projects

• Hive has its own entry in the project list
• HDFS is included as part of the Hadoop project
• All of the top pages of the projects have links to the official documentation
• The documentation can be a bit thin on examples and explanations
• Versions change regularly

Page 21:

Hive wiki page(s)

Page 22:

Getting Help

• Apache Wikis
• Hortonworks or other Hadoop providers
• Stackoverflow
• Just google it!

Page 23:

The Environment

• We are using laptops with:
  • Windows 7
  • Intel i5-6200U processors
  • 16Gb of RAM

The amount of RAM is key, as the Hortonworks Sandbox requires 10Gb to run reasonably well.

Page 24:

Installed on the Laptop (all Freeware)

• VMWare Player - for running the virtualisation environment which contains the Sandbox VM

• The Hortonworks Hadoop Sandbox v2.3.2 (just a set of files used by VMWare to build the Sandbox VM)

• PuTTY – A program which allows remote access to (typically Linux-based) systems – we need this to access the command line of the VM

• Filezilla – A program which allows the transfer of files between two different systems – we can use this to transfer our data files from the Windows system to the Linux VM system

Page 25:

Installed on the Laptop (all Freeware)

• Web browser - used to connect to the Web servers exposed by Hadoop and the Hortonworks Sandbox – we need this to communicate with Hive via Ambari or Hue

• Toad for Hadoop - used to connect directly to the Hive and HDFS environments on the Sandbox

Page 26:

The Sandbox environment

• The Sandbox is a complete Linux operating system containing a complete Hadoop environment, all running in a single VM (Virtual Machine)

• The VM is in turn hosted on a Windows machine using VMware virtualisation software

• Why a Virtual Machine? – a separate, isolated environment

Page 27:

The Hadoop Environment

• Available automatically after the Linux environment has started

• The Hadoop environment includes:
  • The software component known as Hive
  • A file system called HDFS

• Within HDFS:
  • The data files we will be writing Hive queries against will be stored
  • Hive also uses HDFS to store tables
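The point that Hive stores its own tables in HDFS can be made concrete with a sketch. The warehouse path is the Hortonworks default, and the table names here are placeholders:

```sql
-- A managed table's data lives under Hive's warehouse directory
-- (by default /apps/hive/warehouse on a Hortonworks Sandbox):
CREATE TABLE managed_demo (id INT);

-- An EXTERNAL table instead points at an existing HDFS directory;
-- dropping it removes only the table definition, not the files:
CREATE EXTERNAL TABLE external_demo (id INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/external_demo';
```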

Page 28:

Accessing Hive

• We can use a variety of applications to access the Hive environment
  • Some are better than others
  • They don’t all provide the same functionality
  • For some tasks there may be no alternatives

• All of the ones we will be using are either part of the Hadoop environment or are freely available for download

Page 29:

Hive accessing applications

Page 30:

VMWare Player

• Used to create the VM Sandbox

You only need to know the IP address from here

Page 31:

PuTTY

• Used to access Hive and HDFS from the command line

Page 32:

Filezilla

• Used to transfer files between the Windows and the Linux VM

Page 33:

Web Browser

• Any web browser will do
  • Older versions of IE don’t always render the layouts correctly

• There are two web interfaces available:
  • Ambari – a Hortonworks-developed product which provides Hadoop management as well as access to Hive and HDFS
  • Hue – an open source development by Cloudera; it has been removed from the most recent version of the Hortonworks Sandbox

• Both provide similar functionality for Hive and HDFS:
  • Load files
  • Create tables
  • Run queries

Page 34:

Ambari front screen

Page 35:

Ambari - Hive interface

Page 36:

Hue – Front screen

Page 37:

Hue – Hive interface

Page 38:

Toad for Hadoop

• A Windows application, in beta, currently free.

Page 39:

ODBC

• The Hive ODBC driver is transparent once installed but allows access to Hive from a variety of programming and application environments.

Page 40:

The data to be used

We will be using the Smartmeter data, which can be downloaded from the UK Data Service (SN 7591). There are four files.

The edrp_metadata.xlsx file is not loaded into Hive; it will be used to compare results from some of our queries.

File                      Size      Num. Records
edrp_elec.csv             12.07GB   413836038
edrp_gas.csv              6.83GB    246482700
edrp_geography_data.csv   1.3MB     14617
edrp_metadata.xlsx        –         –

Page 41:

Edrp_elec, edrp_gas file layout

ANON_ID,ADVANCEDATETIME,HH,ELECKWH
5110,15FEB08:12:30:00,25,0.6116
17,15FEB08:12:30:00,25,0.2548
69,15FEB08:12:30:00,25,0.3901
5,15FEB08:12:30:00,25,0.4116
28,15FEB08:12:30:00,25,0.85

In the gas file, the last column is GASKWH
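A plausible Hive table definition for this layout (the column types are assumptions based on the sample values; ADVANCEDATETIME is kept as STRING because of its non-standard 15FEB08:12:30:00 format):

```sql
CREATE TABLE edrp_elec (
  anon_id         INT,
  advancedatetime STRING,
  hh              INT,
  eleckwh         DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- skip the header row when the raw CSV is loaded
TBLPROPERTIES ('skip.header.line.count'='1');
```

For the gas file the same definition applies with the last column renamed to `gaskwh`.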

Page 42:

Edrp_geography_data file layout

anonID,eProfileClass,fuelTypes,ACORN_Category,ACORN_Group,ACORN_Type,ACORN_Code,ACORN_Description,NUTS4,LACode,NUTS1,gspGroup,LDZ,Elec_Tout,Gas_Tout
1,2,Dual,1,C,10,"1 ,C ,10",Well-off working families with mortgages,--,--,UKG,_B,WM,0,0
2,1,Dual,4,M,43,"4 ,M ,43","Older people, rented terraces",UKL1605,00PL,UKL,_K,WS,1,1
3,1,ElecOnly,3,I,32,"3 ,I ,32",Retired home owners,UKJ4210,29UN,UKJ,_J,SE,0,0
4,1,Dual,3,H,31,"3 ,H ,31",Home owning Asian family areas,--,--,UKI,--,--,0,0
5,1,ElecOnly,4,M,43,"4 ,M ,43","Older people, rented terraces",UKM3800,00RF,UKM,_N,SC,1,0

Page 43:

edrp_metadata.xlsx contents

• The highlighted columns are ones we will try to verify with queries

Page 44:

Objectives of the analysis

• To compare the data contained in the edrp_metadata.xlsx file with our query results

• Verify the hypothesis that: electricity-only users use proportionally more electricity in winter than dual-fuel users
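A first step toward that hypothesis can be sketched as a join and aggregation: average electricity use per fuel-type group, with the geography file supplying each meter's fuelTypes value. Winter filtering on ADVANCEDATETIME would still be needed, e.g. by parsing the 15FEB08-style timestamps:

```sql
SELECT   g.fueltypes,
         AVG(e.eleckwh) AS avg_kwh
FROM     edrp_elec e
JOIN     edrp_geography_data g ON (e.anon_id = g.anonid)
GROUP BY g.fueltypes;
```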

Page 45:

Acorn codes

Acorn segments postcodes and neighbourhoods into 6 Categories, 18 Groups and 62 types, three of which are not private households. By analysing significant social factors and population behaviour, it provides precise information and in-depth understanding of the different types of people.

http://acorn.caci.co.uk/

Page 46:

NUTS1 Codes