An Introduction to Cloudera Impala
Impala
What is it?
How does it work?
Performance
Formats
Architecture
Impala What is it?
Ad hoc, real-time queries for Hadoop
Open source
Developed by Cloudera
Based on Google's 2010 Dremel paper
Direct data access via Impala engine
A future Parquet update for Hadoop will
Add columnar binary storage to Hadoop
Improve Impala performance
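The link between columnar storage and query performance can be shown with a toy example. The sketch below is not Parquet and does not use any Impala API. It only contrasts a row-oriented layout with a column-oriented one in plain Python, using a hypothetical three-row table, to show that a query touching one column needs to read only that column in the columnar layout.

```python
# Toy illustration of row-oriented versus column-oriented layout.
# The table contents and column names are hypothetical.
rows = [
    {"user_id": 1, "country": "NZ", "bytes_sent": 120},
    {"user_id": 2, "country": "AU", "bytes_sent": 300},
    {"user_id": 3, "country": "NZ", "bytes_sent": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited just to read one field.
total_from_rows = sum(row["bytes_sent"] for row in rows)

# Column layout: only the single column the query needs is read.
total_from_columns = sum(columns["bytes_sent"])

print(total_from_rows, total_from_columns)
```

Both totals are the same. The difference is how much data has to be touched to compute them, which is where a columnar format would be expected to help scan-heavy queries.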
Impala How does it work?
Direct data access
Query planning / coordination on data nodes
Node based query engine
Low latency
Performance improvement
Query data on HDFS or HBase
Uses the same HiveQL syntax (SQL-like)
Has the Hue GUI
Allows table joins and aggregation
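As a rough illustration of the kind of SQL-like join and aggregation query described above, the sketch below runs one against an in-memory SQLite database. SQLite is only a stand-in so the example runs without a cluster. It is not Impala, and the `customers` and `orders` tables and their columns are hypothetical. The query uses only a plain `JOIN`, `GROUP BY`, `COUNT` and `SUM`, constructs that HiveQL also provides.

```python
import sqlite3

# SQLite stands in for Impala here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Join the two tables, then aggregate order count and total per region.
query = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
"""

# Sort in Python so the result order does not depend on the engine.
results = sorted(conn.execute(query).fetchall())
conn.close()

print(results)
```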
Impala Performance
Impala delivers performance gains
I/O-bound queries (limited by hardware)
At least 3 times faster
Complex queries needing multiple MapReduce stages
At least 7 times faster
Cached queries
At least 20 times faster
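To make the quoted minimum factors concrete, the sketch below turns them into an upper bound on runtime for a given baseline. The 210-second baseline is a hypothetical figure chosen for easy arithmetic, not a measurement, and the function and dictionary names are illustrative only.

```python
# Minimum speed-up factors as quoted on the slide, keyed by query type.
MIN_SPEEDUP = {
    "io_bound": 3,
    "multi_stage_mapreduce": 7,
    "cached": 20,
}


def max_expected_runtime(baseline_seconds, query_type):
    """Upper bound on runtime if the quoted minimum speed-up holds."""
    return baseline_seconds / MIN_SPEEDUP[query_type]


# Hypothetical baseline runtime in seconds (an assumption, not measured).
baseline = 210

for query_type in MIN_SPEEDUP:
    print(query_type, max_expected_runtime(baseline, query_type))
```

Under these assumptions a 210-second query would take at most 70 seconds if I/O bound, 30 seconds if it previously needed multiple MapReduce stages, and 10.5 seconds if cached.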
Impala Formats
Supported formats
Text files and SequenceFiles, which can be compressed with
Snappy
GZIP
BZIP
Future support for
Avro
RCFile
LZO text file
Parquet
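For the compressed text files listed under the supported formats, the sketch below writes the same delimited rows with GZIP and BZIP2 compression using only the Python standard library, then reads them back. Snappy is left out because it needs a third-party package. The row contents and file names are hypothetical, and the example only demonstrates the compression codecs themselves. It does not load anything into HDFS or check what a given Impala release can query.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical comma-delimited rows; real table data would live on HDFS.
lines = ["1,north,10.0", "2,south,7.5"]
payload = ("\n".join(lines) + "\n").encode("utf-8")

out_dir = tempfile.mkdtemp()
gzip_path = os.path.join(out_dir, "rows.csv.gz")
bzip2_path = os.path.join(out_dir, "rows.csv.bz2")

# Write the same text data with each codec.
with gzip.open(gzip_path, "wb") as f:
    f.write(payload)
with bz2.open(bzip2_path, "wb") as f:
    f.write(payload)

# Read both files back to confirm the round trip is lossless.
with gzip.open(gzip_path, "rb") as f:
    gzip_roundtrip = f.read()
with bz2.open(bzip2_path, "rb") as f:
    bzip2_roundtrip = f.read()

print(gzip_roundtrip == payload, bzip2_roundtrip == payload)
```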
Impala Architecture
Impala Requirements
What does Impala need to run?
CentOS 6.2
or RHEL (Red Hat Enterprise Linux)
CDH 4.1 (Cloudera Hadoop Distribution)
Cloudera Manager (recommended)
Contact Us
Feel free to contact us at
www.semtech-solutions.co.nz
We offer IT project consultancy
We are happy to hear about your problems
You pay only for the hours you need
to solve your problems