An Introduction to Apache HCatalog

Apache HCatalog

What is it?

How does it work?

Interfaces

Architecture

Example

www.semtech-solutions.co.nz

info@semtech-solutions.co.nz

HCatalog What is it?

A Hive metastore interface set

Shared schema and data types for Hadoop tools

REST interface for external data access

Assists interoperability between Pig, Hive and Map Reduce

Table abstraction of data storage

Will provide data availability notifications
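As a rough illustration of the REST interface, the sketch below only builds the URL an external client might request to describe a table. It assumes the WebHCat (Templeton) service on its default port 50111; the host name and user are made-up values, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build a WebHCat URL that describes one table (assumed endpoint layout)."""
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    # WebHCat identifies the caller through the user.name query parameter.
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host and user, for illustration only.
print(webhcat_table_url("hcat.example.com", "default", "rawevents", "sally"))
```

A GET on such a URL would normally return the table description as JSON, so a tool outside Hadoop needs no Hive client libraries.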

HCatalog How does it work?

Pig

HCatLoader + HCatStorer interface

Map Reduce

HCatInputFormat + HCatOutputFormat interface

Hive

No interface necessary

Direct access to metadata

Notifications when data is available
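The notification idea can be sketched as a small consumer-side check. This is only an illustration: the message is modelled as a JSON string, the field names (eventType, db, table, partitions) are assumptions rather than a confirmed HCatalog message layout, and no JMS client is involved.

```python
import json


def partitions_announced(message_body, table):
    """Return the partition specs announced for `table`, or an empty list."""
    event = json.loads(message_body)
    # Ignore anything that is not an add-partition event for our table.
    if event.get("eventType") != "ADD_PARTITION" or event.get("table") != table:
        return []
    return event.get("partitions", [])


# Hypothetical message body, as a consumer might receive it from the broker.
sample = json.dumps({
    "eventType": "ADD_PARTITION",
    "db": "default",
    "table": "rawevents",
    "partitions": [{"ds": "20100819"}],
})

for spec in partitions_announced(sample, "rawevents"):
    print("start processing partition", spec)
```

The point of the design is that a downstream job reacts to a message instead of polling the file system.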

HCatalog Interfaces

Interface via

Map Reduce

Streaming

Access data via

ORC file

RC file

Text file

Sequence file

Custom format

HCatalog Architecture

HCatalog Example

A data flow example from hive.apache.org

First, Joe in data acquisition uses distcp to copy data onto the grid, then registers it as a new partition with the hcat command.

hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data

hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

Second, Sally in data processing uses Pig to cleanse and prepare the data.

Without HCatalog, Joe must tell Sally manually when data is available, or Sally must poll HDFS.

A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);

B = filter A by bot_finder(zeta) == 0;

...

store Z into 'data/processedevents/20100819/data';

With HCatalog, a JMS message announces that the data is available, and the Pig job can then be started.

A = load 'rawevents' using HCatLoader();

B = filter A by date == '20100819' and bot_finder(zeta) == 0;

...

store Z into 'processedevents' using HCatStorer('date=20100819');

Note that the Pig job refers to the data by the table name rawevents rather than by a location.
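To make the name-versus-location point concrete, here is a toy model. ToyCatalog is a hypothetical in-memory stand-in, not an HCatalog API, and the archive path is invented. It shows that a consumer asking for a table and partition by name is unaffected when the producer changes where the files live.

```python
class ToyCatalog:
    """Hypothetical stand-in for a metastore: table -> partition -> location."""

    def __init__(self):
        self._locations = {}

    def add_partition(self, table, partition, location):
        # Registering the same partition again simply updates its location.
        self._locations.setdefault(table, {})[partition] = location

    def locate(self, table, partition):
        return self._locations[table][partition]


catalog = ToyCatalog()
catalog.add_partition("rawevents", "20100819", "hdfs://data/rawevents/20100819/data")
print(catalog.locate("rawevents", "20100819"))

# The producer moves the files; consumers still ask for the same name.
catalog.add_partition("rawevents", "20100819", "hdfs://archive/rawevents/20100819/data")
print(catalog.locate("rawevents", "20100819"))
```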

Now access the data via HiveQL

select advertiser_id, count(clicks) from processedevents

where date = 20100819 group by advertiser_id;

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

info@semtech-solutions.co.nz

We offer IT project consultancy

We are happy to hear about your problems

You pay only for the hours you need to solve your problems
