hcatalog hadoop summit 2011
DESCRIPTION
Alan Gates' HCatalog talkTRANSCRIPT
HCatalogTable Management for HadoopAlan F. Gates
Who Am I?
• Committer and mentor for Apache HCatalog
• Committer and PMC member for Apache Pig
• Co-founder Hortonworks
• Twitter @alanfgates
Photo credit: Charles Dawley
Motivation
Pig Hive Map Reduce
RCFile Custom Format
CustomLoader
HiveColumnarLoader
RCFileInputFormat
CustomInputFormat
ColumnarSerDe
CustomSerDe
HCatalog
HCatLoader HCatSerDe HCatInputFormat
RCFileStorageDriver
CustomStorageDriver
More Motivation
Data FactoryPig/MapReduce
PipelinesIterative ProcessingResearch
Data WarehouseHive
BI ToolsAnalysis
End User Example
raw = load ‘/rawevents/20100819/data’ using MyLoader() as (ts:long, user:chararray, url:chararray);botless = filter raw by NotABot(user);…store output into ‘/processedevents/20100819/data’;
Processedevents consumers must be manually informed by producer that data is available, or poll on HDFS (= bad for the NameNode)
raw = load ‘rawevents’ using HCatLoader();botless = filter raw by date = ‘20100819’ and NotABot(user);…store output into ‘processedevents’ using HCatStorage(“date=20100819”);
Processedevents consumers will be notified by HCatalog data is available and can then start their jobs
Metadata Architecture
Thrift server
Hive metadata interface
HCatInputFormat HCatOutputFormat
HCatLoader HCatStorage
RDBMS
CLI
= Hive
= Current HCatalog
Notification
= Future HCatalog
Storage Architecture
HCatInputFormat HCatOutputFormat
HCatLoader HCatStorage
InputStorageDriver
HDFS
OutputStorageDriver
HBase
Project Status
• HCatalog was accepted to the Apache Incubator last March
• 0.1 released this month, includes−Read/write from Pig−Read/write from MapReduce−Read/write from Hive−Works only with secure Hadoop−StorageDrivers for RCFile and Text
Future Plans
• 0.2, plan to release in July−Notification via JMS when data is available−Store to multiple partitions simultaneously−Import/Export tools
• Later this year−Storing in HBase−Integration with Hadoop streaming−Bytearray/blob type−RCFile compression improvements−High Availability for Thrift server
• Eventually−Data management interfaces for archivers, cleaners, etc.−Statistics storage
Get Involved
• incubator.apache.org/hcatalog
• Join the mailing lists −User list: [email protected] −Dev list: [email protected]
Questions?