TriHUG November HCatalog Talk by Alan Gates
Post on 08-May-2015
1. HCatalog
Table Management for Hadoop
Alan F. Gates

2. Motivation: Data Sharing is Hard
This is programmer Bob, he uses Pig to crunch data.
This is analyst Joe, he uses Hive to build reports and answer ad-hoc queries.
"Joe, I need today's data"
"Ok"
"Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive, I can never remember all the parameters I have to pass that alter table command."
"Dude, we need HCatalog"
Photo Credit: totalAldo via Flickr

3. More Motivation: Each tool requires its own Translator
[Diagram. Components shown: Pig, Hive, and Map Reduce, each with its own translators for RCFile and custom formats (ColumnarLoader and CustomLoader, ColumnarSerDe and CustomSerDe, RCFile InputFormat and Custom InputFormat); and, with HCatalog, HCatLoader, HCatSerDe, and HCatInputFormat on top of HCatalog, which uses an RCFile StorageDriver and a Custom StorageDriver.]

4. End User Example
    raw = load '/rawevents/20100819/data' using MyLoader()
          as (ts:long, user:chararray, url:chararray);
    botless = filter raw by NotABot(user);
    store botless into '/processedevents/20100819/data';
Processedevents consumers must be manually informed by the producer that data is available, or poll on HDFS (= bad for the NameNode).
    raw = load 'rawevents' using HCatLoader();
    botless = filter raw by date == '20100819' and NotABot(user);
    store botless into 'processedevents'
          using HCatStorage('date=20100819');
Processedevents consumers will be notified by HCatalog that data is available and can then start their jobs.

5. Command Line for DDL
- Uses Hive SQL
- Create, drop, alter table
    CREATE TABLE employee (
        emp_id INT,
        emp_name STRING,
        emp_start_date STRING,
        emp_gender STRING)
    PARTITIONED BY (
        emp_country STRING,
        emp_state STRING)
    STORED AS RCFILE
    tblproperties(
        'hcat.isd'='RCFileInputDriver',
        'hcat.osd'='RCFileOutputDriver');

6. Manages Data Format and Schema Changes
- Allows columns to be appended to tables in new partitions
  - no need to change existing data
  - fields not present in old data will be read as null
  - must do alter table add column first
- Allows storage format changes
  - no need to change existing data, HCatalog will handle reading each partition in the appropriate format
  - all new partitions will be written in current format

7. Security
- Uses underlying storage permissions to determine authorization
  - Currently only works with HDFS based storage
  - If user can read from the HDFS directory, then he can read the table
  - If user can write to the HDFS directory, then he can write to the table
  - If the user can write to the database level directory, he can create and drop tables
  - Allows users to define which group to create table as so table access can be controlled by Unix group
- Authentication done via Kerberos

8. Metadata Architecture
[Diagram. Components shown: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, CLI, HTTP, Notification, Hive metadata interface, Thrift server, RDBMS. Legend: Current HCatalog, Hive, Future HCatalog.]

9. Storage Architecture
[Diagram. Components shown: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, Input StorageDriver, Output StorageDriver, HDFS, HBase.]

10. Project Status
- HCatalog was accepted to the Apache Incubator last March
- 0.2 released in October, includes:
  - Read/write from Pig
  - Read/write from MapReduce
  - Read/write from Hive
  - StorageDrivers for RCFile
  - Notification via JMS when data is available
  - Store to multiple partitions simultaneously
  - Import/Export tools

11. HCatalog 0.3
- Plan to release mid-December
- Adds a Binary type (to Hive and HCatalog)
- Storage drivers for JSON and text
- Improved integration with Hive for custom storage formats
- Web services interface

12. Future Plans
- Support for HBase and other data sources for storage
- RCFile compression improvements
- High Availability for Thrift server
- Data management interfaces for archivers, cleaners, etc.
- Additional metadata storage:
  - statistics
  - lineage/provenance
  - user tags

13. Get Involved
- incubator.apache.org/hcatalog
- Join the mailing lists
  - User list: firstname.lastname@example.org
  - Dev list: email@example.com

14. Questions?
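
Illustrative sketch (not from the talk): the "Manages Data Format and Schema Changes" slide says that columns can be appended in new partitions and that fields not present in old data are read as null. The Python below is a minimal model of that read-side behavior only. The function read_partition, the referrer column, and the sample rows are hypothetical names made up for illustration; none of them are HCatalog APIs, and the real system does this inside its storage drivers.

```python
# Illustrative model only: every name here is hypothetical, not an HCatalog API.
from typing import Any, Dict, List, Optional


def read_partition(
    rows: List[Dict[str, Any]], table_columns: List[str]
) -> List[Dict[str, Optional[Any]]]:
    """Project stored rows onto the current table schema.

    Columns appended after a partition was written are absent from its
    stored rows, so they are returned as None (the "read as null" rule).
    """
    return [{col: row.get(col) for col in table_columns} for row in rows]


# Partition written before the (hypothetical) 'referrer' column was appended.
old_partition = [{"ts": 1282176000, "user": "alice", "url": "/home"}]

# Partition written after the column was added to the table.
new_partition = [
    {"ts": 1282262400, "user": "bob", "url": "/cart", "referrer": "/home"}
]

# Current table schema, after the column has been appended.
table_columns = ["ts", "user", "url", "referrer"]

old_rows = read_partition(old_partition, table_columns)
new_rows = read_partition(new_partition, table_columns)
```

The old partition's data is never rewritten; only the projection at read time changes, which matches the "no need to change existing data" point on the slide.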