future of hcatalog - hadoop summit 2012

15
Future of HCatalog Page 1 Alan F. Gates @alanfgates

Upload: hortonworks

Post on 26-Jan-2015

106 views

Category:

Technology


2 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Future of HCatalog - Hadoop Summit 2012

Future of HCatalog

Page 1

Alan F. Gates

@alanfgates

Page 2: Future of HCatalog - Hadoop Summit 2012

© Hortonworks Inc. 2012

Who Am I?

Page 2

• HCatalog committer and mentor• Co-founder of Hortonworks• Lead for Pig, Hive, and HCatalog at Hortonworks• Pig committer and PMC Member• Member of Apache Software Foundation and Incubator PMC

• Author of Programming Pig from O’Reilly

Page 3: Future of HCatalog - Hadoop Summit 2012

© Hortonworks 2012

Hadoop Ecosystem

Page 3

MetastoreHDFS

Hive

Metastore ClientInputFormat/ OuputFormat

SerDe

InputFormat/ OuputFormat

MapReduce Pig

Load/Store

Page 4: Future of HCatalog - Hadoop Summit 2012

© Hortonworks 2012

Opening up Metadata to MR & Pig

Page 4

MetastoreHDFS

Hive

Metastore ClientInputFormat/ OuputFormat

SerDe

HCaInputFormat/ HCatOuputFormat

MapReduce Pig

HCatLoader/ HCatStorer

Page 5: Future of HCatalog - Hadoop Summit 2012

© Hortonworks 2012

Templeton - REST API

Page 5

Hadoop/HCatalog

Get a list of all tables in the default database:

GEThttp://…/v1/ddl/database/default/table

{ "tables": ["counted","processed",], "database": "default"}

• REST endpoints: databases, tables, partitions, columns, table properties• PUT to create/update, GET to list or describe, DELETE to drop

Page 6: Future of HCatalog - Hadoop Summit 2012

© Hortonworks 2012

Templeton - REST API

Page 6

Hadoop/HCatalog

• REST endpoints: databases, tables, partitions, columns, table properties• PUT to create/update, GET to list or describe, DELETE to drop

Create new table “rawevents”

PUT{"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string"}], "partitionedBy": [{ "name": "ds", "type": "string" }]}

http://…/v1/ddl/database/default/table/rawevents

{ "table": "rawevents", "database": "default”}

Page 7: Future of HCatalog - Hadoop Summit 2012

© Hortonworks 2012

Templeton - REST API

Page 7

Hadoop/HCatalog

• REST endpoints: databases, tables, partitions, columns, table properties• PUT to create/update, GET to list or describe, DELETE to drop

• Included in HDP• Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182

Describe table “rawevents”

GEThttp://…/v1/ddl/database/default/table/rawevents

{ "columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}], "database": "default", "table": "rawevents"}

Page 8: Future of HCatalog - Hadoop Summit 2012

© 2012 Hortonworks

Reading and Writing Data in Parallel

Page 8

• Use Case: Users want– to read and write records in parallel between Hadoop and their parallel system– driven by their system– in a language independent way– without needing to understand Hadoop’s file formats

• Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs

• What exists today– webhdfs

– Language independent– Can move data in parallel– Driven from the user side– Moves only bytes, no understanding of file format

– Sqoop– Can move data in parallel– Understands data format– Driven from Hadoop side– Requires connector or JDBC

Page 9: Future of HCatalog - Hadoop Summit 2012

© 2012 Hortonworks

HCatReader and HCatWriter

Page 9

Master

Slave

Slave

Slave

HCatalog

HDFS

getHCatReader

HCatReader

Input

Splits

read

read

read

Iterator<HCatRecord>

Iterator<HCatRecord>

Iterator<HCatRecord>

Right now all in Java, needs to be REST

Page 10: Future of HCatalog - Hadoop Summit 2012

© Hortonworks Inc. 2012

Storing Semi-/Unstructured Data

Page 10

Name Zip

Alice 93201

Bob 76331

select name, zipfrom users;

{"name":"alice","zip":"93201"}{"name":"bob”,"zip":"76331"}{"name":"cindy"}{"zip":"87890"}

Table Users File Users

A = load ‘Users’ as (name:chararray, zip:chararray);B = foreach A generate name, zip;

Page 11: Future of HCatalog - Hadoop Summit 2012

© Hortonworks Inc. 2012

Storing Semi-/Unstructured Data

Page 11

Name Zip

Alice 93201

Bob 76331

select name, zipfrom users;

{"name":"alice","zip":"93201"}{"name":"bob”,"zip":"76331"}{"name":"cindy"}{"zip":"87890"}

Table Users File Users

A = load ‘Users’B = foreach A generate name, zip;

A = load ‘Users’ as (name:chararray, zip:chararray);B = foreach A generate name, zip;

Page 12: Future of HCatalog - Hadoop Summit 2012

© Hortonworks Inc. 2012

Storing Semi-/Unstructured Data

Page 12

Name Zip

Alice 93201

Bob 76331

select name, zipfrom users;

{"name":"alice","zip":"93201"}{"name":"bob”,"zip":"76331"}{"name":"cindy"}{"zip":"87890"}

Table Users File Users

A = load ‘Users’B = foreach A generate name, zip;

A = load ‘Users’ as (name:chararray, zip:chararray);B = foreach A generate name, zip;

Page 13: Future of HCatalog - Hadoop Summit 2012

© 2012 Hortonworks

Hive ODBC/JDBC Today

Page 13

Hive Server

Hadoop

JDBC Client

ODBC Client

Issue: Have to have Hive

code on the client

Issues: • Not concurrent• Not secure• Not scalable

Issue: Open source version

not easy to use

Page 14: Future of HCatalog - Hadoop Summit 2012

© 2012 Hortonworks

ODBC/JDBC Proposal

Page 14

REST Server

Hadoop

JDBC Client

ODBC Client

Provide robust open sourceimplementations

• Spawns job inside cluster• Runs job as submitting user• Works with security• Scaling web services well understood

Page 15: Future of HCatalog - Hadoop Summit 2012

© 2012 Hortonworks

Questions

Page 15