An Introduction to Cloudera Impala

Impala

What is it?

How does it work?

Performance

Formats

Architecture

[email protected]

Impala What is it?

Ad hoc, real-time queries for Hadoop

Open source

Developed by Cloudera

Based on Google's 2010 Dremel paper

Direct data access via Impala engine

A future Parquet update for Hadoop will

Add columnar binary storage to Hadoop

Improve Impala performance
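
To make "columnar storage" concrete, here is a minimal sketch in plain Python of the difference between a row layout and a column layout. It is a toy illustration only: the sample records and the `to_columns` helper are hypothetical, and real Parquet files add encoding, compression, and metadata that are not modelled here.

```python
# Toy illustration only: this is NOT the Parquet format, just the row/column idea.
rows = [
    {"id": 1, "city": "Wellington", "sales": 120.0},
    {"id": 2, "city": "Auckland", "sales": 340.5},
    {"id": 3, "city": "Wellington", "sales": 75.25},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only has to touch that column's values,
# instead of scanning every field of every row.
total_sales = sum(columns["sales"])
print(total_sales)
```

The point of the sketch is that an aggregate over a single field reads one contiguous list, which is the kind of access pattern a columnar format is meant to speed up.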

Impala How does it work?

Direct data access

Query planning / coordination on data nodes

Node based query engine

Low latency

Performance improvement

Query data on HDFS or HBase

Uses the same HiveQL syntax (SQL-like)

Has the Hue GUI

Allows table joins and aggregation
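
As a rough sketch of what a join plus aggregation looks like in SQL-like syntax, the example below runs such a statement against an in-memory SQLite database. SQLite is only a local stand-in so that the example runs without a cluster; it is not Impala, and the `customers` and `orders` tables and their columns are hypothetical. Whether a given Impala release accepts the statement unchanged should be checked against its documentation.

```python
import sqlite3

# sqlite3 is a local stand-in here, NOT Impala; tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 7.5);
    """
)

# Join the two tables, then aggregate per region.
query = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
"""

# Sort in Python so the result order does not depend on the engine.
results = sorted(conn.execute(query).fetchall())
conn.close()

for region, order_count, total_amount in results:
    print(region, order_count, total_amount)
```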

Impala Performance

Impala delivers performance gains

I/O-bound queries (limited by hardware)

At least 3 times faster

Complex queries with multiple MapReduce stages

At least 7 times faster

Cached queries

At least 20 times faster
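
The minimum factors above can be turned into a quick back-of-the-envelope estimate. The sketch below assumes a hypothetical baseline runtime of 420 seconds and simply divides it by each quoted minimum factor; the query-class names and the `max_expected_runtime` helper are made up for illustration, and the result is an upper bound on runtime only if the quoted minimum gain actually holds for the workload.

```python
# Minimum speedup factors quoted on the slide, keyed by illustrative names.
MIN_SPEEDUP = {
    "io_bound": 3,
    "multi_stage_mapreduce": 7,
    "cached": 20,
}


def max_expected_runtime(baseline_seconds, query_class):
    """Upper bound on runtime if the minimum quoted speedup holds."""
    return baseline_seconds / MIN_SPEEDUP[query_class]


# Hypothetical baseline: a query that currently takes 420 seconds.
baseline = 420.0
estimates = {name: max_expected_runtime(baseline, name) for name in MIN_SPEEDUP}

for name, seconds in estimates.items():
    print(f"{name}: at most {seconds:.0f} s")
```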

Impala Formats

Supported formats

Text and sequence files, which can be compressed with

Snappy

GZIP

BZIP2

Future support for

Avro

RCFile

LZO text file

Parquet
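
For readers who want to produce small compressed text files to experiment with, here is a minimal sketch using only the Python standard library. It writes the same comma-delimited rows once with gzip and once with bzip2. The sample rows, file names, and the `write_delimited` helper are hypothetical; Snappy is left out because it needs a third-party package, and the sketch makes no claim about which codecs a particular Impala release can read.

```python
import bz2
import csv
import gzip
import os
import tempfile

# Hypothetical sample data, kept as strings so the round trip compares cleanly.
rows = [("1", "north", "10.0"), ("2", "south", "7.5")]


def write_delimited(opener, path):
    """Write rows as comma-delimited text through a compressing opener."""
    with opener(path, "wt", newline="") as handle:
        csv.writer(handle).writerows(rows)


workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "sample.csv.gz")
bz2_path = os.path.join(workdir, "sample.csv.bz2")

write_delimited(gzip.open, gz_path)
write_delimited(bz2.open, bz2_path)

# Read both files back to confirm the data survives compression.
with gzip.open(gz_path, "rt", newline="") as handle:
    gz_round_trip = [tuple(record) for record in csv.reader(handle)]
with bz2.open(bz2_path, "rt", newline="") as handle:
    bz2_round_trip = [tuple(record) for record in csv.reader(handle)]
```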

Impala Architecture

Impala Requirements

What does Impala need to run?

CentOS 6.2

or RHEL (Red Hat Enterprise Linux)

CDH 4.1 (Cloudera Hadoop Distribution)

Cloudera Manager (recommended)

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

[email protected]

We offer IT project consultancy

We are happy to hear about your problems

You pay only for the hours you need

To solve your problems
