An Introduction to Apache HCatalog
Apache HCatalog
What is it ?
How does it work ?
Interfaces
Architecture
Example
HCatalog What is it ?
A Hive metastore interface set
Shared schema and data types for Hadoop tools
REST interface for external data access
Assists interoperability between
Pig, Hive and Map Reduce
Table abstraction of data storage
Will provide data availability notifications
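The REST interface mentioned above is the WebHCat (formerly Templeton) server. As a rough sketch of what external access looks like, the Python below only builds the URL for a "describe table" request; it does not contact a server. The host name and user are hypothetical, and the port 50111 is assumed to be the WebHCat default for your installation.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes one table (GET request)."""
    # WebHCat resources live under /templeton/v1; table metadata is a DDL resource.
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    # The calling user is passed as the user.name query parameter.
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host and user, for illustration only.
url = webhcat_table_url("hcat.example.com", "default", "rawevents", "sally")
print(url)
```

Fetching that URL with any HTTP client would be expected to return the table description as JSON, assuming the server is running and the table exists.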
HCatalog How does it work ?
Pig
HCatLoader + HCatStorer interface
Map Reduce
HCatInputFormat + HCatOutputFormat interface
Hive
No interface necessary
Direct access to metadata
Notifications when data available
HCatalog Interfaces
Interface via
Pig
Map Reduce
Hive
Streaming
Access data via
ORC file
RC file
Text file
Sequence file
Custom format
HCatalog Architecture
HCatalog Example
A data flow example from hive.apache.org
First, Joe in data acquisition uses distcp to get data onto the grid.
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
Second, Sally in data processing uses Pig to cleanse and prepare the data.
Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);
B = filter A by bot_finder(zeta) == 0;
...
store Z into 'data/processedevents/20100819/data';
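The "poll on HDFS" option can be pictured with a small sketch. The Python below is only an illustration: it polls the local filesystem with os.path.exists as a stand-in for HDFS, and the function name and parameters are hypothetical rather than part of any Hadoop tool.

```python
import os
import time


def wait_for_path(path, interval_seconds=1.0, max_attempts=10):
    """Poll until `path` exists; return True if it appeared, False on giving up."""
    for attempt in range(max_attempts):
        if os.path.exists(path):
            return True
        # Sleep between checks, but not after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    return False
```

A job scheduler would call something like wait_for_path('/data/rawevents/20100819/data') before launching the Pig script. The cost is wasted checks and a delay of up to one polling interval, which is what the notification approach avoids.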
With HCatalog, a JMS message is sent when the data is available. The Pig job can then be started.
A = load 'rawevents' using HCatLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
...
store Z into 'processedevents' using HCatStorer("date=20100819");
Note that the Pig job refers to the data by the table name rawevents rather than by a location.
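The two ideas in this step, looking data up by table name and being told when a partition arrives, can be modelled in a few lines. The Python below is a toy in-memory model, not HCatalog's API: the ToyCatalog class, its methods, and the callback are all hypothetical, and a plain function call stands in for the JMS message.

```python
class ToyCatalog:
    """In-memory stand-in for a table catalog; not HCatalog's real API."""

    def __init__(self):
        self._partitions = {}  # (table, partition value) -> storage location
        self._listeners = []   # callbacks invoked when a partition is added

    def subscribe(self, callback):
        """Register a callback to run whenever a partition is added."""
        self._listeners.append(callback)

    def add_partition(self, table, value, location):
        """Record where a partition lives, then notify every subscriber."""
        self._partitions[(table, value)] = location
        for callback in self._listeners:
            callback(table, value)

    def location_of(self, table, value):
        """Resolve a table name and partition value to a storage location."""
        return self._partitions[(table, value)]


catalog = ToyCatalog()
started = []


def start_cleansing_job(table, value):
    # Sally's job asks the catalog for the location instead of hard-coding it.
    started.append((table, value, catalog.location_of(table, value)))


catalog.subscribe(start_cleansing_job)

# Joe registers the new partition; the subscriber runs immediately.
catalog.add_partition("rawevents", "20100819", "hdfs://data/rawevents/20100819/data")
print(started)
```

If Joe later moves the data, only the catalog entry changes; the consuming job keeps referring to rawevents.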
Now access the data via Hive QL
select advertiser_id, count(clicks) from processedevents
where date = '20100819' group by advertiser_id;
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You can just pay for those hours that you need
To solve your problems