Transcript
Page 1: TriHUG November HCatalog Talk by Alan Gates

HCatalog: Table Management for Hadoop

Alan F. Gates

Page 2: TriHUG November HCatalog Talk by Alan Gates

Motivation: Data Sharing is Hard

Photo Credit: totalAldo via Flickr

This is programmer Bob, he uses Pig to crunch data.

This is analyst Joe, he uses Hive to build reports and answer ad-hoc queries.

Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command.

Ok

Joe, I need today’s data

Dude, we need HCatalog

Page 3: TriHUG November HCatalog Talk by Alan Gates

More Motivation: Each tool requires its own Translator

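Without HCatalog, each tool needs its own translator for each format:

  Format          Pig                  Hive            MapReduce
  RCFile          HiveColumnarLoader   ColumnarSerDe   RCFileInputFormat
  Custom format   CustomLoader         CustomSerDe     CustomInputFormat

With HCatalog:

  Tool interfaces:  HCatLoader (Pig), HCatSerDe (Hive), HCatInputFormat (MapReduce)
  Storage drivers:  RCFileStorageDriver (RCFile), CustomStorageDriver (custom format)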
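
The idea on this slide, one storage driver per format that every tool shares, can be sketched roughly as below. This is a toy model in Python, not HCatalog code: the `DRIVERS` registry, the `CATALOG` dictionary, the `load` function, and the JSON/CSV formats are all assumptions made for illustration.

```python
import csv
import io
import json


def read_json_lines(text):
    """Hypothetical 'storage driver' for newline-delimited JSON."""
    return [json.loads(line) for line in text.splitlines() if line]


def read_csv(text):
    """Hypothetical 'storage driver' for CSV with a header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Hypothetical registry: storage format name -> reader function.
DRIVERS = {"json": read_json_lines, "csv": read_csv}

# Hypothetical catalog: each table records its own storage format,
# so callers never need to know it.
CATALOG = {
    "rawevents": {"format": "json",
                  "data": '{"user": "alice"}\n{"user": "bob"}'},
    "clicks": {"format": "csv", "data": "user\ncarol\n"},
}


def load(table):
    """Look up the table's format in the catalog and pick the matching driver."""
    entry = CATALOG[table]
    return DRIVERS[entry["format"]](entry["data"])


print(load("rawevents"))
print(load("clicks"))
```

Any caller uses `load(table)` the same way regardless of format, which is the role the HCat* interfaces play for Pig, Hive, and MapReduce in the slide.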

Page 4: TriHUG November HCatalog Talk by Alan Gates

End User Example

    raw = load '/rawevents/20100819/data' using MyLoader()
        as (ts:long, user:chararray, url:chararray);
    botless = filter raw by NotABot(user);
    …
    store output into '/processedevents/20100819/data';

Processedevents consumers must be manually informed by producer that data is available, or poll on HDFS (= bad for the NameNode)

    raw = load 'rawevents' using HCatLoader();
    botless = filter raw by date == '20100819' and NotABot(user);
    …
    store output into 'processedevents' using HCatStorage('date=20100819');

Processedevents consumers will be notified by HCatalog that data is available and can then start their jobs
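
The difference between polling and being notified can be sketched with a small in-memory model. This is a hedged illustration only: `ToyCatalog`, `subscribe`, and `add_partition` are hypothetical names, not HCatalog's API, and a real deployment would deliver the message through a messaging system rather than a direct callback.

```python
from collections import defaultdict


class ToyCatalog:
    """Hypothetical in-memory stand-in for a table catalog (not the HCatalog API)."""

    def __init__(self):
        self._partitions = defaultdict(set)   # table name -> partition specs
        self._listeners = defaultdict(list)   # table name -> consumer callbacks

    def subscribe(self, table, callback):
        """Register a consumer callback that fires when a partition is added."""
        self._listeners[table].append(callback)

    def add_partition(self, table, spec):
        """Record a new partition and notify every subscriber of that table."""
        self._partitions[table].add(spec)
        for callback in self._listeners[table]:
            callback(table, spec)


started = []  # jobs the consumer has kicked off
catalog = ToyCatalog()
catalog.subscribe("processedevents",
                  lambda table, spec: started.append((table, spec)))

catalog.add_partition("rawevents", "date=20100819")        # nobody subscribed
catalog.add_partition("processedevents", "date=20100819")  # consumer is notified
print(started)
```

The consumer never lists directories; it only reacts when the producer registers the partition, which is what spares the NameNode.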

Page 5: TriHUG November HCatalog Talk by Alan Gates

Command Line for DDL

• Uses Hive SQL

• Create, drop, alter table

• CREATE TABLE employee (
      emp_id INT,
      emp_name STRING,
      emp_start_date STRING,
      emp_gender STRING)
  PARTITIONED BY (
      emp_country STRING,
      emp_state STRING)
  STORED AS RCFILE
  tblproperties(
      'hcat.isd'='RCFileInputDriver',
      'hcat.osd'='RCFileOutputDriver');

Page 6: TriHUG November HCatalog Talk by Alan Gates

Manages Data Format and Schema Changes

• Allows columns to be appended to tables in new partitions
  − no need to change existing data
  − fields not present in old data will be read as null
  − must do ‘alter table add column’ first

• Allows storage format changes
  − no need to change existing data, HCatalog will handle reading each partition in the appropriate format
  − all new partitions will be written in current format
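
The "fields not present in old data will be read as null" behavior can be sketched as a projection of each partition's own schema onto the current table schema. This is a minimal model under stated assumptions: `read_partition`, the column names, and the sample rows are all made up for illustration and do not reflect how HCatalog implements it.

```python
def read_partition(rows, partition_columns, table_columns):
    """Project rows stored under an older schema onto the current table schema.

    Columns the partition never stored come back as None (null).
    """
    out = []
    for row in rows:
        record = dict(zip(partition_columns, row))
        out.append(tuple(record.get(col) for col in table_columns))
    return out


# Hypothetical table: 'referrer' was appended after the first partition was written.
table_columns = ["ts", "user", "url", "referrer"]

old_partition = {"columns": ["ts", "user", "url"],
                 "rows": [(1, "alice", "/a")]}
new_partition = {"columns": ["ts", "user", "url", "referrer"],
                 "rows": [(2, "bob", "/b", "/a")]}

result = []
for part in (old_partition, new_partition):
    result.extend(read_partition(part["rows"], part["columns"], table_columns))
print(result)
```

The old partition's data is never rewritten; only the read path changes, which is why the column has to be added to the table definition first.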

Page 7: TriHUG November HCatalog Talk by Alan Gates

Security

• Uses underlying storage permissions to determine authorization
  − Currently only works with HDFS based storage
  − If user can read from the HDFS directory, then he can read the table
  − If user can write to the HDFS directory, then he can write to the table
  − If the user can write to the database level directory, he can create and drop tables
  − Allows users to define which group to create table as so table access can be controlled by Unix group

• Authentication is done via Kerberos
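
The authorization rule above (table access follows directory permissions) can be sketched as a POSIX-style owner/group/other check. This is a simplified illustration, not HCatalog or HDFS code: `can_access`, the users, the group name, and the mode are hypothetical, and the sketch ignores superusers and any other refinements of a real file system.

```python
import stat


def can_access(mode, owner, group, user, user_groups, want):
    """Check read ("r") or write ("w") permission for `user` on a directory.

    Owner bits apply if the user owns the directory, otherwise group bits
    if the user belongs to the directory's group, otherwise the 'other' bits.
    """
    bits = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
    }[want]
    if user == owner:
        return bool(mode & bits[0])
    if group in user_groups:
        return bool(mode & bits[1])
    return bool(mode & bits[2])


# Hypothetical table directory: owned by bob, group 'analysts', mode rwxr-x---.
table_dir = {"mode": 0o750, "owner": "bob", "group": "analysts"}


def check(user, groups, want):
    return can_access(table_dir["mode"], table_dir["owner"],
                      table_dir["group"], user, groups, want)


print(check("bob", {"dev"}, "w"))        # owner may write the table
print(check("joe", {"analysts"}, "r"))   # group member may read the table
print(check("joe", {"analysts"}, "w"))   # but may not write it
print(check("eve", {"guests"}, "r"))     # everyone else is denied
```

Creating the table under a chosen group, as the last sub-bullet describes, amounts to picking the `group` value that the middle branch tests against.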

Page 8: TriHUG November HCatalog Talk by Alan Gates

Metadata Architecture

[Architecture diagram. Components: CLI; HCatLoader, HCatStorage; HCatInputFormat, HCatOutputFormat; Hive metadata interface; Thrift server; RDBMS; Notification; HTTP. Legend: Hive, Current HCatalog, Future HCatalog.]

Page 9: TriHUG November HCatalog Talk by Alan Gates

Storage Architecture

[Architecture diagram. Components: HCatLoader, HCatStorage; HCatInputFormat, HCatOutputFormat; InputStorageDriver, OutputStorageDriver; HDFS; HBase.]

Page 10: TriHUG November HCatalog Talk by Alan Gates

Project Status

• HCatalog was accepted to the Apache Incubator last March

• 0.2 released in October, includes:
  − Read/write from Pig
  − Read/write from MapReduce
  − Read/write from Hive
  − StorageDrivers for RCFile
  − Notification via JMS when data is available
  − Store to multiple partitions simultaneously
  − Import/Export tools

Page 11: TriHUG November HCatalog Talk by Alan Gates

HCatalog 0.3

• Plan to release mid-December

• Adds a Binary type (to Hive and HCatalog)

• Storage drivers for JSON and text

• Improved integration with Hive for custom storage formats

• Web services interface

Page 12: TriHUG November HCatalog Talk by Alan Gates

Future Plans

• Support for HBase and other data sources for storage

• RCFile compression improvements

• High Availability for Thrift server

• Data management interfaces for archivers, cleaners, etc.

• Additional metadata storage:
  − statistics
  − lineage/provenance
  − user tags

Page 13: TriHUG November HCatalog Talk by Alan Gates

Get Involved

• incubator.apache.org/hcatalog

• Join the mailing lists
  − User list: [email protected]
  − Dev list: [email protected]

Page 14: TriHUG November HCatalog Talk by Alan Gates

Questions?

