hcatalog hadoop summit 2011

HCatalog Table Management for Hadoop Alan F. Gates

Upload: hortonworks

Post on 11-Jan-2015


DESCRIPTION

Alan Gates' HCatalog talk

TRANSCRIPT

Page 1: HCatalog Hadoop Summit 2011

HCatalog: Table Management for Hadoop

Alan F. Gates

Page 2: HCatalog Hadoop Summit 2011

Who Am I?

• Committer and mentor for Apache HCatalog

• Committer and PMC member for Apache Pig

• Co-founder Hortonworks

• Twitter @alanfgates

Photo credit: Charles Dawley

Page 3: HCatalog Hadoop Summit 2011

Motivation

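[Diagram: without HCatalog, each tool (Pig, Hive, MapReduce) needs its own adapter for every storage format (RCFile, custom format):

• Pig: HiveColumnarLoader, CustomLoader

• MapReduce: RCFileInputFormat, CustomInputFormat

• Hive: ColumnarSerDe, CustomSerDe

With HCatalog, each tool uses one interface (HCatLoader, HCatInputFormat, HCatSerDe) and each format needs one storage driver (RCFileStorageDriver, CustomStorageDriver).]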

Page 4: HCatalog Hadoop Summit 2011

More Motivation

• Data Factory (Pig/MapReduce): Pipelines, Iterative Processing, Research

• Data Warehouse (Hive): BI Tools, Analysis

Page 5: HCatalog Hadoop Summit 2011

End User Example

raw = load '/rawevents/20100819/data' using MyLoader() as (ts:long, user:chararray, url:chararray);
botless = filter raw by NotABot(user);
…
store output into '/processedevents/20100819/data';

Consumers of processedevents must be told manually by the producer that data is available, or they must poll HDFS (which is bad for the NameNode).

raw = load 'rawevents' using HCatLoader();
botless = filter raw by date == '20100819' and NotABot(user);
…
store output into 'processedevents' using HCatStorage('date=20100819');

Consumers of processedevents will be notified by HCatalog that data is available and can then start their jobs.
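The difference between polling and notification can be illustrated with a small publish/subscribe sketch. This is a toy in-memory model in Python, not the HCatalog API: the `ToyCatalog` class, its methods, and `start_downstream_job` are hypothetical names invented for illustration. The producer registers a new partition once, and every subscribed consumer is called back, so no consumer has to list HDFS directories repeatedly.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical callback signature: (table name, partition spec) -> None
Listener = Callable[[str, str], None]


class ToyCatalog:
    """In-memory stand-in for a table catalog (an assumption, not HCatalog itself)."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[str]] = defaultdict(list)
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, table: str, callback: Listener) -> None:
        """Register a consumer that wants to hear about new partitions."""
        self._listeners[table].append(callback)

    def add_partition(self, table: str, partition: str) -> None:
        """Record the partition, then push the event to every subscriber."""
        self._partitions[table].append(partition)
        for callback in self._listeners[table]:
            callback(table, partition)

    def partitions(self, table: str) -> List[str]:
        """Return a copy of the partitions known for a table."""
        return list(self._partitions[table])


started = []  # (table, partition) pairs for which a downstream job was started


def start_downstream_job(table: str, partition: str) -> None:
    # A real consumer would submit a Pig or MapReduce job here.
    started.append((table, partition))


catalog = ToyCatalog()
catalog.subscribe("processedevents", start_downstream_job)

# The producer finishes its store and publishes the partition exactly once.
catalog.add_partition("processedevents", "date=20100819")
print(started)
```

In this model the consumer does no work until the producer publishes, which is the property the slide contrasts with polling HDFS.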

Page 6: HCatalog Hadoop Summit 2011

Metadata Architecture

[Diagram: components shown are HCatLoader and HCatStorage, HCatInputFormat and HCatOutputFormat, the CLI, Notification, the Hive metadata interface, the Thrift server, and an RDBMS. A legend marks each component as Hive, current HCatalog, or future HCatalog.]

Page 7: HCatalog Hadoop Summit 2011

Storage Architecture

[Diagram: components shown are HCatLoader and HCatStorage, HCatInputFormat and HCatOutputFormat, InputStorageDriver and OutputStorageDriver, and the underlying stores HDFS and HBase.]
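The storage-driver idea can be sketched as follows: the table's metadata records which driver understands its on-disk format, and readers always receive records in one common shape. This is a conceptual Python sketch under stated assumptions, not HCatalog's actual Java interfaces. `ToyInputStorageDriver`, the two driver classes, the `TABLES` dictionary, the `legacyevents` table, and `load` are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Record = Dict[str, str]  # common record shape handed to every reader


class ToyInputStorageDriver(ABC):
    """Hypothetical driver interface: turns format-specific bytes into Records."""

    @abstractmethod
    def read(self, raw: str, columns: List[str]) -> Iterator[Record]:
        ...


class TabTextDriver(ToyInputStorageDriver):
    """Reads tab-separated lines whose fields follow the table's column order."""

    def read(self, raw: str, columns: List[str]) -> Iterator[Record]:
        for line in raw.splitlines():
            yield dict(zip(columns, line.split("\t")))


class KeyValueDriver(ToyInputStorageDriver):
    """Reads lines of the form 'k=v;k=v', a made-up custom format."""

    def read(self, raw: str, columns: List[str]) -> Iterator[Record]:
        for line in raw.splitlines():
            fields = dict(pair.split("=", 1) for pair in line.split(";"))
            yield {column: fields[column] for column in columns}


# Toy table metadata: the schema and the driver live with the table, not the reader.
TABLES = {
    "rawevents": {"columns": ["ts", "user", "url"], "driver": TabTextDriver()},
    "legacyevents": {"columns": ["ts", "user", "url"], "driver": KeyValueDriver()},
}


def load(table: str, raw: str) -> List[Record]:
    """Look up the table's driver and schema, then return common-shape records."""
    meta = TABLES[table]
    return list(meta["driver"].read(raw, meta["columns"]))


print(load("rawevents", "1\talice\t/home"))
print(load("legacyevents", "url=/home;ts=1;user=alice"))
```

Both calls return the same record even though the inputs are stored differently, so a reader does not need to know which format backs a table.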

Page 8: HCatalog Hadoop Summit 2011

Project Status

• HCatalog was accepted into the Apache Incubator in March 2011

• 0.1 released this month, includes
  − Read/write from Pig
  − Read/write from MapReduce
  − Read/write from Hive
  − Works only with secure Hadoop
  − StorageDrivers for RCFile and Text

Page 9: HCatalog Hadoop Summit 2011

Future Plans

• 0.2, planned for release in July
  − Notification via JMS when data is available
  − Store to multiple partitions simultaneously
  − Import/Export tools

• Later this year
  − Storing in HBase
  − Integration with Hadoop streaming
  − Bytearray/blob type
  − RCFile compression improvements
  − High Availability for Thrift server

• Eventually
  − Data management interfaces for archivers, cleaners, etc.
  − Statistics storage

Page 10: HCatalog Hadoop Summit 2011

Get Involved

• incubator.apache.org/hcatalog

• Join the mailing lists
  − User list: [email protected]
  − Dev list: [email protected]

Page 11: HCatalog Hadoop Summit 2011

Questions?