![Page 1: Apache Hive - Carnegie Mellon School of Computer Sciencepavlo/courses/fall2013/static/slides/hive.pdf · •HDFS for storage ... •Key Building Principles: ... Hive can use tables](https://reader031.vdocuments.us/reader031/viewer/2022030409/5a8f1d9d7f8b9adb648da323/html5/thumbnails/1.jpg)
Apache HIVE
Data Warehousing & Analytics on Hadoop
Hefu Chai
What is HIVE?
• A system for managing and querying structured data built on top of Hadoop
  • Uses Map-Reduce for execution
  • HDFS for storage
  • Extensible to other Data Repositories
• Key Building Principles:
  • SQL on structured data as a familiar data warehousing tool
  • Extensibility (Pluggable map/reduce scripts in the language of your choice, Rich and User Defined data types, User Defined Functions)
  • Interoperability (Extensible framework to support different file and data formats)
What HIVE Is Not
• Not designed for OLTP
• Does not offer real-time queries
![Page 4: HIVE Architecture (diagram)](https://reader031.vdocuments.us/reader031/viewer/2022030409/5a8f1d9d7f8b9adb648da323/html5/thumbnails/4.jpg)
HIVE Architecture
Hive/Hadoop Usage @ Facebook
• Types of Applications:
  • Summarization
    • Eg: Daily/Weekly aggregations of impression/click counts
    • Complex measures of user engagement
  • Ad hoc Analysis
    • Eg: how many group admins broken down by state/country
  • Data Mining (Assembling training data)
    • Eg: User Engagement as a function of user attributes
  • Spam Detection
    • Anomalous patterns for Site Integrity
    • Application API usage patterns
  • Ad Optimization
  • Too many to count ..
Hive Query Language
• Basic SQL
  • CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
  • SHOW TABLES '.*s';
  • DESCRIBE sample;
  • ALTER TABLE sample ADD COLUMNS (new_col INT);
  • DROP TABLE sample;
• Extensibility
  • Pluggable Map-reduce scripts
  • Pluggable User Defined Functions
  • Pluggable User Defined Types
  • Pluggable SerDes to read different kinds of Data Formats
Hive QL – Join
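• SQL:
  INSERT INTO TABLE pv_users
  SELECT pv.pageid, u.age
  FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:

| pageid | userid | time    |
|--------|--------|---------|
| 1      | 111    | 9:08:01 |
| 2      | 111    | 9:08:13 |
| 1      | 222    | 9:08:14 |

X user:

| userid | age | gender |
|--------|-----|--------|
| 111    | 25  | female |
| 222    | 32  | male   |

= pv_users:

| pageid | age |
|--------|-----|
| 1      | 25  |
| 2      | 25  |
| 1      | 32  |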
Hive QL – Join in Map Reduce
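Map: each input row is emitted keyed on userid. The value is <table tag, column>: tag 1 marks a page_view row and carries pageid, tag 2 marks a user row and carries age.

page_view (input):

| pageid | userid | time    |
|--------|--------|---------|
| 1      | 111    | 9:08:01 |
| 2      | 111    | 9:08:13 |
| 1      | 222    | 9:08:14 |

Map output for page_view:

| key | value |
|-----|-------|
| 111 | <1,1> |
| 111 | <1,2> |
| 222 | <1,1> |

user (input):

| userid | age | gender |
|--------|-----|--------|
| 111    | 25  | female |
| 222    | 32  | male   |

Map output for user:

| key | value  |
|-----|--------|
| 111 | <2,25> |
| 222 | <2,32> |

Shuffle / Sort: pairs with the same key are grouped together.

Key 111:

| key | value  |
|-----|--------|
| 111 | <1,1>  |
| 111 | <1,2>  |
| 111 | <2,25> |

Key 222:

| key | value  |
|-----|--------|
| 222 | <1,1>  |
| 222 | <2,32> |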
Hive QL – Join in Map Reduce
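Reduce input for key 111:

| key | value  |
|-----|--------|
| 111 | <1,1>  |
| 111 | <1,2>  |
| 111 | <2,25> |

Reduce output for key 111:

| pageid | age |
|--------|-----|
| 1      | 25  |
| 2      | 25  |

Reduce input for key 222:

| key | value  |
|-----|--------|
| 222 | <1,1>  |
| 222 | <2,32> |

Reduce output for key 222:

| pageid | age |
|--------|-----|
| 1      | 32  |

The reduce outputs together form pv_users.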
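
The map, shuffle/sort, and reduce steps on these two slides can be traced in a few lines of plain Python. This is only an in-memory sketch using the sample rows from the slides, not how Hive actually implements its join operator. The function names (`map_phase`, `shuffle_sort`, `reduce_phase`) and tag constants are made up for illustration.

```python
from collections import defaultdict

# Sample rows copied from the slides.
page_view = [  # (pageid, userid, time)
    (1, 111, "9:08:01"),
    (2, 111, "9:08:13"),
    (1, 222, "9:08:14"),
]
user = [  # (userid, age, gender)
    (111, 25, "female"),
    (222, 32, "male"),
]

# Tags that record which table a value came from (names are illustrative).
PAGE_VIEW_TAG, USER_TAG = 1, 2


def map_phase(page_view_rows, user_rows):
    """Emit (userid, (table tag, projected column)) for every input row."""
    for pageid, userid, _time in page_view_rows:
        yield userid, (PAGE_VIEW_TAG, pageid)
    for userid, age, _gender in user_rows:
        yield userid, (USER_TAG, age)


def shuffle_sort(pairs):
    """Group values by key, with keys and values in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_phase(groups):
    """For each key, pair every page_view value with every user value."""
    for _userid, values in groups.items():
        pageids = [v for tag, v in values if tag == PAGE_VIEW_TAG]
        ages = [v for tag, v in values if tag == USER_TAG]
        for pageid in pageids:
            for age in ages:
                yield pageid, age


groups = shuffle_sort(map_phase(page_view, user))
pv_users = list(reduce_phase(groups))
print(groups)
print(pv_users)
```

Tagging each value with its source table is what lets a single reducer tell the two sides of the join apart after the shuffle has mixed them under one key.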
Integration with HBase
• Reasons to use Hive on HBase:
  • A lot of data sits in HBase because it is used in a real-time environment, but is never used for analysis
  • Gives people who don’t code (business analysts) access to HBase data that is usually only queried through MapReduce
• Reasons not to do it:
  • Running SQL queries on HBase to answer live user requests (each query is still a MR job)
![Page 11: Integration with HBase (diagram)](https://reader031.vdocuments.us/reader031/viewer/2022030409/5a8f1d9d7f8b9adb648da323/html5/thumbnails/11.jpg)
Integration with HBase
Integration with HBase
Hive can use tables that already exist in HBase or manage its own, but they all still reside in the same HBase instance.
[Diagram: Hive table definitions next to HBase. One definition points to an existing table; another manages its table from Hive.]
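Integration with HBase
When using an already existing HBase table, the Hive table is defined as EXTERNAL.
Columns are mapped however you want, changing names and giving types.

| Hive table definition        | HBase table |
|------------------------------|-------------|
| name STRING                  | d:fullname  |
| age INT                      | d:age       |
| (not mapped)                 | d:address   |
| siblings MAP<string, string> | f:          |

Table names shown in the diagram: persons, people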
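
As a rough illustration of this mapping, the Python sketch below projects one HBase row onto the Hive columns from the slide. It is not Hive code: in Hive the mapping is declared in the table definition through the HBase storage handler's `hbase.columns.mapping` property. The names `COLUMN_MAPPING` and `to_hive_row` and the sample row are hypothetical, and the sketch assumes cell values are stored as UTF-8 strings.

```python
# Hypothetical mapping read off the slide: Hive column -> (Hive type, HBase column).
# An HBase column with an empty qualifier ("f:") stands for the whole column family.
COLUMN_MAPPING = {
    "name": ("STRING", "d:fullname"),
    "age": ("INT", "d:age"),
    "siblings": ("MAP<string,string>", "f:"),
}


def to_hive_row(hbase_row, mapping=COLUMN_MAPPING):
    """Project one HBase row (dict of 'family:qualifier' -> bytes) to Hive columns."""
    hive_row = {}
    for hive_col, (hive_type, hbase_col) in mapping.items():
        family, qualifier = hbase_col.split(":", 1)
        if qualifier == "":
            # Whole family: each qualifier becomes a key of the Hive MAP.
            prefix = family + ":"
            hive_row[hive_col] = {
                col[len(prefix):]: val.decode("utf-8")
                for col, val in hbase_row.items()
                if col.startswith(prefix)
            }
            continue
        raw = hbase_row.get(hbase_col)
        if raw is None:
            hive_row[hive_col] = None  # missing cell reads as NULL
        elif hive_type == "INT":
            hive_row[hive_col] = int(raw.decode("utf-8"))
        else:
            hive_row[hive_col] = raw.decode("utf-8")
    return hive_row


# Made-up sample row. d:address exists in HBase but is not mapped, so it is ignored.
sample = {
    "d:fullname": b"Alice Example",
    "d:age": b"25",
    "d:address": b"1 Example St",
    "f:bob": b"brother",
}
hive_row = to_hive_row(sample)
print(hive_row)
```

Mapping a whole column family to a MAP column is what lets a variable set of qualifiers, such as one per sibling, show up in a fixed Hive schema.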
Reference
• https://cwiki.apache.org/confluence/display/Hive/Home
• Hive Facebook
• StumbleUpon
Thanks