an introduction to apache hcatalog

Download An introduction to Apache HCatalog

If you can't read please download the document

Upload: semtech-solutions-ltd

Post on 16-Apr-2017

4.152 views

Category:

Technology


0 download

TRANSCRIPT

Apache HCatalog

What is it ?

How does it work ?

Interfaces

Architecture

Example

[email protected]

HCatalog What is it ?

A Hive metastore interface set

Shared schema and data types for Hadoop tools

Rest interface for external data access

Assists inter operability between

Pig, Hive and Map Reduce

Table abstraction of data storage

Will provide data availability notifications

[email protected]

HCatalog How does it work ?

Pig

HCatLoader + HCatStorer interface

Map Reduce

HCatInputFormat + HCatOutputFormat interface

Hive

No interface necessary

Direct access to meta data

Notifications when data available

[email protected]

HCatalog Interfaces

Interface via

Pig

Map Reduce

Hive

Streaming

Access data via

Orc file

RC file

Text file

Sequence file

Custom format

[email protected]

HCatalog Interfaces

[email protected]

HCatalog Architecture

[email protected]

HCatalog Example

A data flow example from hive.apache.org

First Joe in data acquisition uses distcp to get data onto the grid.

hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data

hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second Sally in data processing uses Pig to cleanse and prepare the data.

Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, );

B = filter A by bot_finder(zeta) = 0;

store Z into 'data/processedevents/20100819/data';

With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started.

A = load 'rawevents' using HCatLoader();

B = filter A by date = '20100819' and by bot_finder(zeta) = 0;

store Z into 'processedevents' using HcatStorer("date=20100819");

Note that the pig job refers to the data by name rawevents rather than a location

Now access the data via Hive QL

select advertiser_id, count(clicks) from processedevents

where date = 20100819 group by advertiser_id;

[email protected]

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

[email protected]

We offer IT project consultancy

We are happy to hear about your problems

You can just pay for those hours that you need

To solve your problems