![Page 1: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/1.jpg)
CS 626 Large Scale Data Science
Jun ZhangMarch 5, 2020
Originally prepared by Licong Cui
Lecture 11 – Apache Hive
1
![Page 2: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/2.jpg)
Review: Hadoop Ecosystem
2
![Page 3: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/3.jpg)
Review: Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Pig Engine
Pig Latin3
![Page 4: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/4.jpg)
Review: WordCount Using Pig
lines = LOAD ‘cs626/words.txt' AS (line:chararray);words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;grouped = GROUP words BY word;wordcount = FOREACH grouped GENERATE group, COUNT(words);DUMP wordcount;
4
![Page 5: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/5.jpg)
Pig vs MapReduce vs Hive
1https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163
1
5
![Page 6: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/6.jpg)
Outline
What is Hive?Data TypesData ModelsHive ArchitectureHive vs Traditional DatabaseHive vs Pig
6
![Page 7: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/7.jpg)
Hive
Facebook Data warehousing infrastructure based on
Hadoop Designed to Enable easy data summarization Ad-hoc querying Analysis of large volumes of data
HiveQL - Hive’s query language7
![Page 8: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/8.jpg)
Hive
Organize data into tables
Metastore
Metadata (table schemas)
8
![Page 9: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/9.jpg)
Run Hive
Interactive mode Hive shell: hive
Non-interactive mode Local: hive –f script.q
Hive web interface
JDBC (Java Database Connectivity)
9
![Page 10: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/10.jpg)
HiveQL Data Types
Primitive data types
Complex data types
10
![Page 11: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/11.jpg)
Primitive Data Types
Boolean
Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String: STRING, VARCHAR
Timestamp: TIMESTAMP, DATE
11
![Page 12: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/12.jpg)
Primitive Data Types (cont.)
12
![Page 13: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/13.jpg)
Primitive Data Types (cont.)
13
![Page 14: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/14.jpg)
Complex Data Types
14
![Page 15: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/15.jpg)
Example Complex Data Types
15
![Page 16: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/16.jpg)
Example Complex Data Types (cont.)
16
![Page 17: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/17.jpg)
HiveQL Example
17
![Page 18: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/18.jpg)
HiveQL Example (cont.)
18
![Page 19: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/19.jpg)
HiveQL Example (cont.)
19
![Page 20: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/20.jpg)
Data Model
Tables
Partitions
Buckets
20
![Page 21: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/21.jpg)
Tables
Analogous to relational tables Hive moves data to its warehouse directory Each table has corresponding directory in HDFS Hive does not check that the files in the table
directory conform to the schema at the loading time Example:
hdfs://user/tom/data.txt ==> hdfs://user/hive/warehouse/managed_table
21
![Page 22: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/22.jpg)
External Tables
Does not move data to Hive’s warehouse directory
Point to existing data directories in HDFS
Data is assumed to be in Hive-compatible format
Dropping external table drops only the metadata
Example:
22
![Page 23: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/23.jpg)
Partitions
Hive organize tables into partitions
Partitions determine distribution of data within subdirectories
Example:
23
![Page 24: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/24.jpg)
Partitions (cont.)
Example:
SELECT statements:
24
![Page 25: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/25.jpg)
Buckets
Data in each partition divided into buckets
Each bucket is stored as a file in partition directory
Hash function: H(column) mod num_buckets = bucket_number
Example:
25
![Page 26: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/26.jpg)
Hive Architecture
26
![Page 27: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/27.jpg)
Hive Architecture
27
![Page 28: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/28.jpg)
Thrift Server
Framework for cross language services
Server written in Java
Support for clients written in different languages JDBC(java), ODBC(c++), php, perl, python scripts
28
![Page 29: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/29.jpg)
Metastore
System catalog which contains metadata about the Hive tables
Stored in RDBMS/local file system HDFS too slow (not optimized for random access) Derby, MySQL
Objects of Metastore Database - Namespace of tables Table - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage
29
![Page 30: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/30.jpg)
Driver
Driver Maintains the lifecycle of HiveQL statement
Query Compiler Compiles HiveQL in a DAG of map reduce tasks
Executor Executes the tasks plan generated by the compiler in
proper dependency order Interacts with the underlying Hadoop instance
30
![Page 31: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/31.jpg)
Hive vs Traditional Database
Schema on Read Versus Schema on Write Traditional database: schema on write table’s schema is enforced at data load time
Hive: schema on read does not verify the data when it is loaded, but rather when
a query is issued
31
![Page 32: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/32.jpg)
Hive vs Traditional Database (cont.)
Updates, Transactions, and Indexes Mainstays of traditional databases
HDFS does not provide in-place file updates Changes resulting from inserts, updates, and deletes are
stored in small delta files
Delta files are periodically merged into the base table files by MapReduce jobs
32
![Page 33: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/33.jpg)
Hive vs. RDBMS
33
![Page 34: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/34.jpg)
SQL vs HiveQL
34
Cheat sheet at class webpage
![Page 35: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/35.jpg)
SQL vs HiveQL (cont.)
35
![Page 36: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/36.jpg)
Pig vs HivePig Hive
Procedural data flow language
Declarative SQLishLanguage
36
![Page 37: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/37.jpg)
Pig vs Hive (cont.)
Pig Mainly for data transformations and processing
Unstructured and structured data
Hive Mainly for data warehousing and querying data
Structured data
Lower learning curve than Pig or MapReduce
HiveQL is much closer to SQL than Pig37
![Page 38: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/38.jpg)
Hive, Pig, and Hadoop Benchmark
Version: Hadoop – 0.18x, Pig:786346, Hive:786346
38
![Page 39: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/39.jpg)
Bigger Picture
Store large amounts of data to HDFS
Process raw data using Pig
Build schema using Hive
Querying data using Hive
39
![Page 40: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfTable - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage 29 Driver Driver](https://reader033.vdocuments.us/reader033/viewer/2022060100/60af840598257d054b1df932/html5/thumbnails/40.jpg)
References
Hadoop: The Definitive Guide (By Tom White) https://courses.engr.illinois.edu/cs525/sp201
5/ http://www.slideshare.net/chirag064/hive-
warehousing-over-hadoop https://acadgild.com/blog/working-with-
hive-complex-data-types/
40