An Introduction to Apache HCatalog

Post on 16-Apr-2017

Apache HCatalog

What is it?

How does it work?

Interfaces

Architecture

Example

www.semtech-solutions.co.nz info@semtech-solutions.co.nz

HCatalog What is it?

A set of interfaces to the Hive metastore

Shared schema and data types for Hadoop tools

REST interface for external data access

Assists interoperability between Pig, Hive and MapReduce

Table abstraction of data storage

Will provide data availability notifications
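
The REST interface mentioned above is provided by WebHCat (formerly Templeton). As a rough sketch, the snippet below only builds the URL for describing a table and does not contact a server. The host name, the user `joe`, and the helper `webhcat_table_url` are illustrative assumptions; port 50111 is the usual WebHCat default, and a real deployment may differ.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build a WebHCat (Templeton) URL that describes one table.

    Hypothetical helper for illustration: it only formats the URL.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    # WebHCat identifies the caller through the user.name query parameter.
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


url = webhcat_table_url("localhost", "default", "rawevents", "joe")
print(url)
```

An HTTP GET on a URL of this shape is expected to return the table description as JSON, which is how tools outside the Hadoop cluster can read the shared schema.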

HCatalog How does it work?

Pig

HCatLoader + HCatStorer interface

MapReduce

HCatInputFormat + HCatOutputFormat interface

Hive

No interface necessary

Direct access to metadata

Notifications when data available
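
The notification idea can be pictured with a toy, in-process publish/subscribe model. This is not HCatalog's API: HCatalog publishes such events through a message broker, while the class `PartitionNotifier` and its methods below are hypothetical names used only to show the flow from "partition added" to "downstream job started".

```python
class PartitionNotifier:
    """Toy stand-in for data-availability notifications (not HCatalog's API)."""

    def __init__(self):
        self._listeners = {}  # table name -> list of callbacks

    def subscribe(self, table, callback):
        """Register a callback to run when a partition is added to `table`."""
        self._listeners.setdefault(table, []).append(callback)

    def partition_added(self, table, partition):
        """Tell every subscriber of `table` that `partition` is now available."""
        for callback in self._listeners.get(table, []):
            callback(table, partition)


started = []  # records which downstream jobs would have been triggered
notifier = PartitionNotifier()
notifier.subscribe("rawevents", lambda table, part: started.append((table, part)))

# The producer registers a new partition; the consumer is told immediately
# instead of polling the file system.
notifier.partition_added("rawevents", "ds=20100819")
print(started)
```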

HCatalog Interfaces

Interface via

Pig

MapReduce

Hive

Streaming

Access data via

ORC file

RCFile

Text file

SequenceFile

Custom format

HCatalog Architecture

HCatalog Example

A data flow example from hive.apache.org

First, Joe in data acquisition uses distcp to get data onto the grid.

hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data

hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second, Sally in data processing uses Pig to cleanse and prepare the data.

Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);

B = filter A by bot_finder(zeta) == 0;

...

store Z into 'data/processedevents/20100819/data';

With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started.

A = load 'rawevents' using HCatLoader();

B = filter A by date == '20100819' and bot_finder(zeta) == 0;

...

store Z into 'processedevents' using HCatStorer('date=20100819');

Note that the Pig job refers to the data by table name (rawevents) rather than by location.
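
To make the table abstraction concrete, here is a minimal sketch of a catalog that maps a table name and partition value to a storage location. It is a toy model, not HCatalog code: `ToyCatalog`, `add_partition` and `locate` are hypothetical names, and the real metastore also tracks schema and storage format.

```python
class ToyCatalog:
    """Toy model of a metastore: table name + partition -> storage location."""

    def __init__(self):
        self._partitions = {}  # table name -> {partition value: location}

    def add_partition(self, table, value, location):
        """Record where one partition of `table` is stored."""
        self._partitions.setdefault(table, {})[value] = location

    def locate(self, table, value):
        """Look up the storage location of a partition by name, not by path."""
        return self._partitions[table][value]


catalog = ToyCatalog()

# Joe registers the new partition once, as the hcat command above does.
catalog.add_partition(
    "rawevents", "20100819", "hdfs://data/rawevents/20100819/data"
)

# Sally only needs the table name and the partition value.
location = catalog.locate("rawevents", "20100819")
print(location)
```

Because consumers never hard-code the path, the data can be moved or stored in a different file format without changing their scripts, as long as the catalog entry is updated.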

Now access the data via HiveQL

select advertiser_id, count(clicks) from processedevents

where date = '20100819' group by advertiser_id;

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

info@semtech-solutions.co.nz

We offer IT project consultancy

We are happy to hear about your problems

You pay only for the hours you need to solve your problems
