Future of HCatalog - Hadoop Summit 2012
Future of HCatalog
Page 1
Alan F. Gates
@alanfgates
© Hortonworks Inc. 2012
Who Am I?
Page 2
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC member
• Member of the Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly
Hadoop Ecosystem
Page 3
[Diagram: Hive talks to the Metastore through its Metastore Client and reads/writes HDFS through InputFormat/OutputFormat and a SerDe; MapReduce accesses HDFS directly through InputFormat/OutputFormat; Pig through its Load/Store functions. Only Hive can see the metadata.]
Opening up Metadata to MR & Pig
Page 4
[Diagram: same picture, but MapReduce now reads and writes through HCatInputFormat/HCatOutputFormat and Pig through HCatLoader/HCatStorer, giving both access to the Metastore alongside Hive.]
Templeton - REST API
Page 5
Get a list of all tables in the default database:
GET http://…/v1/ddl/database/default/table

{ "tables": ["counted", "processed"], "database": "default" }
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
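As a rough sketch of what a client of this API looks like, the snippet below lists tables with nothing but Python's standard library. The host name and port are hypothetical (50111 is commonly the Templeton port); the `v1/ddl` path and response shape follow the slide's example.

```python
import json
from urllib.request import urlopen

# Hypothetical Templeton endpoint; adjust host/port for your cluster.
TEMPLETON = "http://templeton-host:50111/templeton"

def list_tables(database="default"):
    """GET the table list for a database from the Templeton REST API."""
    url = f"{TEMPLETON}/v1/ddl/database/{database}/table"
    with urlopen(url) as resp:
        return json.load(resp)["tables"]

def parse_table_response(body):
    """The response body has the shape shown on the slide."""
    doc = json.loads(body)
    return doc["database"], doc["tables"]
```

Because the API is plain HTTP plus JSON, any language with an HTTP client can drive it; no Hadoop client libraries are needed.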
Templeton - REST API
Page 6
Create new table “rawevents”:

PUT http://…/v1/ddl/database/default/table/rawevents

{
  "columns": [
    { "name": "url", "type": "string" },
    { "name": "user", "type": "string" }
  ],
  "partitionedBy": [ { "name": "ds", "type": "string" } ]
}

Response:

{ "table": "rawevents", "database": "default" }
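A minimal sketch of issuing that PUT from Python, assuming a hypothetical Templeton host; the payload mirrors the slide's "rawevents" example exactly.

```python
import json
from urllib.request import Request, urlopen

TEMPLETON = "http://templeton-host:50111/templeton"  # hypothetical host

def create_table(database, table, columns, partitioned_by=None):
    """PUT a table description to Templeton's DDL resource."""
    body = {"columns": columns}
    if partitioned_by:
        body["partitionedBy"] = partitioned_by
    req = Request(
        f"{TEMPLETON}/v1/ddl/database/{database}/table/{table}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urlopen(req) as resp:
        return json.load(resp)  # e.g. {"table": "rawevents", "database": "default"}

# Payload for the slide's "rawevents" table:
rawevents_body = {
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}
```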
Templeton - REST API
Page 7
• Included in HDP
• Not yet checked in, but you can find the code on Apache’s JIRA as HCATALOG-182
Describe table “rawevents”:

GET http://…/v1/ddl/database/default/table/rawevents

{
  "columns": [
    { "name": "url", "type": "string" },
    { "name": "user", "type": "string" }
  ],
  "database": "default",
  "table": "rawevents"
}
Reading and Writing Data in Parallel
Page 8
• Use case: users want
  – to read and write records in parallel between Hadoop and their parallel system
  – driven by their system
  – in a language-independent way
  – without needing to understand Hadoop’s file formats
• Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
• What exists today
  – webhdfs
    – Language independent
    – Can move data in parallel
    – Driven from the user side
    – Moves only bytes, no understanding of file format
  – Sqoop
    – Can move data in parallel
    – Understands data format
    – Driven from the Hadoop side
    – Requires a connector or JDBC
HCatReader and HCatWriter
Page 9
[Diagram: the master asks HCatalog for a reader via getHCatReader and receives an HCatReader that yields input splits; each slave uses its split to read from HDFS and gets back an Iterator<HCatRecord>.]

Right now this is all in Java; it needs to be REST.
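The flow above can be mocked up in a few lines. This is an illustrative sketch only, not the real HCatalog API (which is Java: HCatReader, Iterator<HCatRecord>); the class and function names here are invented to show the master/slave split pattern.

```python
class HCatReaderSketch:
    """Master-side handle: knows how to split a table into parallel chunks.
    (Invented name; stands in for the Java HCatReader.)"""
    def __init__(self, records, n_splits):
        self._records = records
        self._n = n_splits

    def get_splits(self):
        # The master obtains input splits and ships one to each slave.
        return [self._records[i::self._n] for i in range(self._n)]

def slave_read(split):
    """Each slave turns its split into an iterator of records."""
    return iter(split)

# Master side: a toy table of (name, zip) records, 3 parallel readers.
reader = HCatReaderSketch([("alice", "93201"), ("bob", "76331"), ("cindy", None)], 3)
splits = reader.get_splits()

# Slave side: each slave consumes its own iterator independently.
all_records = [rec for split in splits for rec in slave_read(split)]
```

The key property is that only the master talks to HCatalog; the slaves need nothing but their split, which is what makes a REST handoff to a non-Java parallel system plausible.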
Storing Semi-/Unstructured Data
Page 10
Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  select name, zip
  from users;

File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;
Storing Semi-/Unstructured Data
Page 11
Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  select name, zip
  from users;

File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  A = load 'Users';
  B = foreach A generate name, zip;

  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;
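The point of the file-side example is that records need not all carry the same fields. A small sketch of projecting (name, zip) out of such JSON lines, with absent fields coming back as nulls, much as the Pig `foreach ... generate` would:

```python
import json

# The slide's "File Users": JSON records where fields may be missing.
lines = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]

# Project (name, zip); missing keys become None rather than errors.
users = [(rec.get("name"), rec.get("zip"))
         for rec in (json.loads(line) for line in lines)]
```

With schema stored in HCatalog, a load statement without an `as` clause could pick up these field names and types from metadata instead of requiring them inline.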
Hive ODBC/JDBC Today
Page 13
[Diagram: JDBC and ODBC clients connect to the Hive Server, which runs queries on Hadoop.]

Issues:
• JDBC client: has to have Hive code on the client
• Hive Server: not concurrent, not secure, not scalable
• ODBC client: open source version not easy to use
ODBC/JDBC Proposal
Page 14
[Diagram: JDBC and ODBC clients connect to a REST server that submits work to Hadoop.]

Provide robust open source implementations:
• Spawns the job inside the cluster
• Runs the job as the submitting user
• Works with security
• Scaling web services is well understood
Questions
Page 15