Future of HCatalog - Hadoop Summit 2012
Future of HCatalog
Page 1
Alan F. Gates
@alanfgates
© Hortonworks Inc. 2012
Who Am I?
Page 2
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC member
• Member of the Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly
Hadoop Ecosystem
Page 3
[Diagram: Hive talks to the Metastore through its Metastore Client and reads/writes HDFS through InputFormat/OutputFormat and a SerDe; MapReduce accesses HDFS directly through InputFormat/OutputFormat; Pig through its Load/Store functions. Only Hive can see the metadata.]
Opening up Metadata to MR & Pig
Page 4
[Diagram: same picture, but MapReduce now reads and writes through HCatInputFormat/HCatOutputFormat and Pig through HCatLoader/HCatStorer, giving both access to the Metastore alongside Hive.]
Templeton - REST API
Page 5
Get a list of all tables in the default database:
GET http://…/v1/ddl/database/default/table

{ "tables": ["counted", "processed"], "database": "default" }
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
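As a rough sketch of what a client of this API looks like, the snippet below lists tables with nothing but Python's standard library. The host name and port are hypothetical (50111 is commonly the Templeton port); the `v1/ddl` path and response shape follow the slide's example.

```python
import json
from urllib.request import urlopen

# Hypothetical Templeton endpoint; adjust host/port for your cluster.
TEMPLETON = "http://templeton-host:50111/templeton"

def list_tables(database="default"):
    """GET the table list for a database from the Templeton REST API."""
    url = f"{TEMPLETON}/v1/ddl/database/{database}/table"
    with urlopen(url) as resp:
        return json.load(resp)["tables"]

def parse_table_response(body):
    """The response body has the shape shown on the slide."""
    doc = json.loads(body)
    return doc["database"], doc["tables"]
```

Because the API is plain HTTP plus JSON, any language with an HTTP client can drive it; no Hadoop client libraries are needed.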
Templeton - REST API
Page 6
Create new table “rawevents”:

PUT http://…/v1/ddl/database/default/table/rawevents

{
  "columns": [
    { "name": "url", "type": "string" },
    { "name": "user", "type": "string" }
  ],
  "partitionedBy": [ { "name": "ds", "type": "string" } ]
}

Response:

{ "table": "rawevents", "database": "default" }
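A minimal sketch of issuing that PUT from Python, assuming a hypothetical Templeton host; the payload mirrors the slide's "rawevents" example exactly.

```python
import json
from urllib.request import Request, urlopen

TEMPLETON = "http://templeton-host:50111/templeton"  # hypothetical host

def create_table(database, table, columns, partitioned_by=None):
    """PUT a table description to Templeton's DDL resource."""
    body = {"columns": columns}
    if partitioned_by:
        body["partitionedBy"] = partitioned_by
    req = Request(
        f"{TEMPLETON}/v1/ddl/database/{database}/table/{table}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urlopen(req) as resp:
        return json.load(resp)  # e.g. {"table": "rawevents", "database": "default"}

# Payload for the slide's "rawevents" table:
rawevents_body = {
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}
```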
Templeton - REST API
Page 7
• Included in HDP
• Not yet checked in, but you can find the code on Apache’s JIRA as HCATALOG-182
Describe table “rawevents”:

GET http://…/v1/ddl/database/default/table/rawevents

{
  "columns": [
    { "name": "url", "type": "string" },
    { "name": "user", "type": "string" }
  ],
  "database": "default",
  "table": "rawevents"
}
Reading and Writing Data in Parallel
Page 8
• Use case: users want
  – to read and write records in parallel between Hadoop and their parallel system
  – driven by their system
  – in a language-independent way
  – without needing to understand Hadoop’s file formats
• Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
• What exists today
  – webhdfs
    – Language independent
    – Can move data in parallel
    – Driven from the user side
    – Moves only bytes, no understanding of file format
  – Sqoop
    – Can move data in parallel
    – Understands data format
    – Driven from the Hadoop side
    – Requires a connector or JDBC
HCatReader and HCatWriter
Page 9
[Diagram: the master asks HCatalog for a reader via getHCatReader and receives an HCatReader that yields input splits; each slave uses its split to read from HDFS and gets back an Iterator<HCatRecord>.]

Right now this is all in Java; it needs to be REST.
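The flow above can be mocked up in a few lines. This is an illustrative sketch only, not the real HCatalog API (which is Java: HCatReader, Iterator<HCatRecord>); the class and function names here are invented to show the master/slave split pattern.

```python
class HCatReaderSketch:
    """Master-side handle: knows how to split a table into parallel chunks.
    (Invented name; stands in for the Java HCatReader.)"""
    def __init__(self, records, n_splits):
        self._records = records
        self._n = n_splits

    def get_splits(self):
        # The master obtains input splits and ships one to each slave.
        return [self._records[i::self._n] for i in range(self._n)]

def slave_read(split):
    """Each slave turns its split into an iterator of records."""
    return iter(split)

# Master side: a toy table of (name, zip) records, 3 parallel readers.
reader = HCatReaderSketch([("alice", "93201"), ("bob", "76331"), ("cindy", None)], 3)
splits = reader.get_splits()

# Slave side: each slave consumes its own iterator independently.
all_records = [rec for split in splits for rec in slave_read(split)]
```

The key property is that only the master talks to HCatalog; the slaves need nothing but their split, which is what makes a REST handoff to a non-Java parallel system plausible.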
Storing Semi-/Unstructured Data
Page 10
Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  select name, zip
  from users;

File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;
Storing Semi-/Unstructured Data
Page 11
Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  select name, zip
  from users;

File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  A = load 'Users';
  B = foreach A generate name, zip;

  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;
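The point of the file-side example is that records need not all carry the same fields. A small sketch of projecting (name, zip) out of such JSON lines, with absent fields coming back as nulls, much as the Pig `foreach ... generate` would:

```python
import json

# The slide's "File Users": JSON records where fields may be missing.
lines = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]

# Project (name, zip); missing keys become None rather than errors.
users = [(rec.get("name"), rec.get("zip"))
         for rec in (json.loads(line) for line in lines)]
```

With schema stored in HCatalog, a load statement without an `as` clause could pick up these field names and types from metadata instead of requiring them inline.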
Hive ODBC/JDBC Today
Page 13
[Diagram: JDBC and ODBC clients connect to the Hive Server, which runs queries on Hadoop.]

Issues:
• JDBC client: has to have Hive code on the client
• Hive Server: not concurrent, not secure, not scalable
• ODBC client: open source version not easy to use
ODBC/JDBC Proposal
Page 14
[Diagram: JDBC and ODBC clients connect to a REST server that submits work to Hadoop.]

Provide robust open source implementations:
• Spawns the job inside the cluster
• Runs the job as the submitting user
• Works with security
• Scaling web services is well understood
Questions
Page 15