bdm8 - near-realtime big data analytics using impala

1 / 18

Near-realtime Big Data Analytics using Impala

David Lauzon, Big Data Montreal #8, January 10th 2013

2 / 18

Plan

• What is Impala?
• Why Google built Dremel?
• Use cases for Impala
• Use cases for Map-Reduce
• Cloudera Customer Survey
• Impala Features
• Impala Performance Expectations and Benchmarks
• Impala Components
• Impala Architecture
• Impala Development Roadmap
• Where to learn more and get started

3 / 18

Disclaimer

• To keep the description of Dremel and Impala as accurate as possible, most of the content in this presentation has been gathered from the authors of the respective technologies. References are found at the end of the presentation.

• I am not affiliated with or sponsored by Cloudera or Google.

4 / 18

What is Impala?

“An impala is an athletic, graceful African antelope, famous for its speed and its jumping agility” - Wikipedia

5 / 18

Seriously, what is Impala?

• “Impala enables real-time, interactive, analytical queries of the data stored in HBase or HDFS” – Cloudera

• Inspired by the Google Dremel paper (2010)
  – BigQuery is a service that implements Dremel
    • It is proprietary, not free, and requires you to upload your data to Google servers

6 / 18

Why Google built Dremel?

• Problems with Data Warehouse Solutions for OLAP/BI:
  – Relational OLAP (ROLAP):
    • Need to build indices for every possible query (for performance reasons)
    • Index size could take up the whole RAM
  – Multi-dimensional OLAP (MOLAP):
    • Requires extensive time and money to design and build the data cubes
  – Ad-hoc queries (specific, non-optimised queries):
    • When you don’t know what you’ll need, or need to work in iterations, i.e. quite often!
• Solution:
  – Increase full-scan speed without requiring indexing or pre-aggregated values

7 / 18

Come on, give me some use cases!

• Finding particular records with specified conditions:
  – “Find all the locations where account ‘ABC’ was accessed from.”
• Quick aggregation of statistics with dynamically changing conditions:
  – “Can you give me yesterday’s number of impressions for Google AdWords display ads, but only in the Tokyo region?”
• Trial-and-error data analysis:
  – “And between 11am and 1pm?”
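The questions above map naturally onto successive ad-hoc SQL queries. The sketch below uses Python's built-in sqlite3 module purely as a local stand-in for a SQL engine such as Impala. The `impressions` table, its columns, the sample rows, and the `count_impressions` helper are all hypothetical. No index is created, so each query is answered by a full scan.

```python
import sqlite3

# In-memory SQLite is only a stand-in for a SQL-on-Hadoop engine.
# Table name, columns, and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (day TEXT, hour INTEGER, region TEXT, ad_type TEXT)"
)
rows = [
    ("2013-01-09", 10, "Tokyo", "display"),
    ("2013-01-09", 11, "Tokyo", "display"),
    ("2013-01-09", 12, "Tokyo", "display"),
    ("2013-01-09", 12, "Osaka", "display"),
    ("2013-01-09", 12, "Tokyo", "search"),
    ("2013-01-08", 11, "Tokyo", "display"),
]
conn.executemany("INSERT INTO impressions VALUES (?, ?, ?, ?)", rows)


def count_impressions(extra_condition="", params=()):
    """Count yesterday's display impressions in Tokyo, optionally narrowed further."""
    sql = (
        "SELECT COUNT(*) FROM impressions "
        "WHERE day = ? AND region = ? AND ad_type = ?" + extra_condition
    )
    base_params = ("2013-01-09", "Tokyo", "display")
    return conn.execute(sql, base_params + tuple(params)).fetchone()[0]


# First question: yesterday's display impressions, Tokyo only.
tokyo_total = count_impressions()

# Follow-up question: "And between 11am and 1pm?" (hours 11 and 12).
tokyo_lunch = count_impressions(" AND hour >= ? AND hour < ?", (11, 13))

print(tokyo_total, tokyo_lunch)
```

Each follow-up question only adds a condition to the previous query, which is the trial-and-error pattern that benefits most from fast full scans.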

8 / 18

Use cases for which you should stick with Map-Reduce based applications

• Very long-running, batch-oriented tasks such as ETL:
  – e.g. exporting large amounts of data after processing
• Complex event processing:
  – e.g. stream processing

• “Complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms” - Google

9 / 18

Integration with Hadoop

• Cloudera Customer Survey (Aug. 2012)
  – 80% need faster queries on Hadoop data
  – 65% query Hadoop using Hive
  – 70% move data from Hadoop to an RDBMS for interactive SQL
  – 60% see value today in consolidating to a single platform

10 / 18

Impala Features

• Shared with Hive:
  – Hive MetaStore
  – Hive SQL (most common SQL-92 features)
  – ODBC Driver
  – User Interface (Hue Beeswax)
• Specific to Impala:
  – No MapReduce; uses in-memory transfers instead
  – Host and disk awareness (data locality)
  – Table data caching in RAM
  – No virtual columns or locking

11 / 18

Impala Performance Expectations

• Performance improvements over Hive
  – 3 - 4X for purely I/O bound queries
  – 7 - 45X for queries with at least one join
  – 20 - 90X when data is available in the cache

12 / 18

External Benchmarks

Workload                                                    Impala Query Time   Hive Query Time   MySQL Query Time
----------------------------------------------------------  -----------------   ---------------   ----------------
5.2 GB HAProxy log – top IPs by request count               3.1s                65.4s             146.0s
5.2 GB HAProxy log – top IPs by total request time          3.3s                65.2s             164.0s
800 MB parsed Rails log – slowest accounts                  1.0s                33.2s             48.1s
800 MB parsed Rails log – highest database time paths       1.1s                33.7s             49.6s
8 GB pageview table – daily pageviews and unique visitors   22.4s               92.2s             180.0s

• Searching log files at 37signals (creators of the Ruby on Rails web framework)

http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence
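To compare these measurements with the expected improvements over Hive, the speedup ratios can be computed directly from the benchmark figures above. This is a small arithmetic sketch: the workload labels are shortened and the `speedups` helper is introduced here for illustration.

```python
# Query times in seconds (Impala, Hive, MySQL), copied from the benchmark figures above.
benchmarks = {
    "HAProxy log, top IPs by request count": (3.1, 65.4, 146.0),
    "HAProxy log, top IPs by total request time": (3.3, 65.2, 164.0),
    "Rails log, slowest accounts": (1.0, 33.2, 48.1),
    "Rails log, highest database time paths": (1.1, 33.7, 49.6),
    "Pageview table, daily pageviews and unique visitors": (22.4, 92.2, 180.0),
}


def speedups(impala_s, hive_s, mysql_s):
    """Return how many times faster Impala was than Hive and than MySQL."""
    return hive_s / impala_s, mysql_s / impala_s


results = {name: speedups(*times) for name, times in benchmarks.items()}

for name, (vs_hive, vs_mysql) in results.items():
    print(f"{name}: {vs_hive:.1f}x vs Hive, {vs_mysql:.1f}x vs MySQL")
```

With these numbers, the speedup over Hive ranges from roughly 4x on the pageview workload to roughly 33x on the Rails log.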

13 / 18

Impala Components

• Impala State Store: 1 per cluster
  – Coordinates information (location and status) about all the running impalad instances
• Impala Daemon: 1 per DataNode
  – Coordinates and executes queries
  – Distributes query fragments to other Impala Daemons
• Impala Shell: 1 per node
  – Provides a command-line interface for interacting with Impala
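The division of roles between the state store and the daemons can be pictured with a toy scatter-gather model. This is only an illustrative sketch: the `StateStore` and `Daemon` classes, their methods, and the sample data are assumptions made here, not Impala's real classes or wire protocol.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model only: names and data layout are illustrative assumptions,
# not Impala's real classes or protocol.


class StateStore:
    """Tracks which daemons are currently running (one per cluster)."""

    def __init__(self):
        self.daemons = {}

    def register(self, daemon):
        self.daemons[daemon.host] = daemon

    def live_daemons(self):
        return list(self.daemons.values())


class Daemon:
    """Holds a local slice of the data and can act as query coordinator."""

    def __init__(self, host, local_rows, state_store):
        self.host = host
        self.local_rows = local_rows
        self.state_store = state_store
        state_store.register(self)

    def run_fragment(self, predicate):
        # Each daemon scans only its own local rows.
        return sum(1 for row in self.local_rows if predicate(row))

    def coordinate_count(self, predicate):
        # The coordinating daemon fans the fragment out to every live daemon
        # (itself included) and merges the partial counts.
        peers = self.state_store.live_daemons()
        with ThreadPoolExecutor(max_workers=len(peers)) as pool:
            partials = list(pool.map(lambda d: d.run_fragment(predicate), peers))
        return sum(partials)


store = StateStore()
d1 = Daemon("node1", [3, 8, 15], store)
d2 = Daemon("node2", [4, 16, 23], store)
d3 = Daemon("node3", [42], store)

# Any daemon can receive the query and act as its coordinator.
total = d2.coordinate_count(lambda value: value > 10)
print(total)
```

In this model no MapReduce job is launched: the coordinator sends work straight to its peers and merges their in-memory results.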

14 / 18

Impala Architecture

15 / 18

Roadmap : 0.3 Beta Version

• Operating System:
  – Only RHEL/CentOS 6.2 is supported
• File formats:
  – Text files, SequenceFiles, HBase tables
• Compression:
  – Snappy, Gzip, BZip
• No UDFs or user extensibility *
• Largest table in joins must be specified first *
• Right side of join must fit in RAM *
• No support for complex nested structures *:
  – e.g. maps, structs and arrays

* Post 1.0 G.A. Version Top Asks
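The slide does not say how Impala executes joins. As a hedged illustration, the generic hash join below shows how both join restrictions can arise: the right-hand table is loaded entirely into an in-memory hash table, while the left-hand table is streamed row by row. The `hash_join` function and the sample tables are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two iterables of dicts; the right side is fully loaded into memory."""
    # Build phase: the whole right-hand table goes into an in-memory hash table.
    build = defaultdict(list)
    for row in right_rows:
        build[row[right_key]].append(row)

    # Probe phase: the left-hand table is streamed one row at a time,
    # so it can be much larger than available memory.
    for row in left_rows:
        for match in build.get(row[left_key], []):
            yield {**row, **match}


# Hypothetical sample data: a large fact table and a small lookup table.
pageviews = [
    {"account_id": 1, "path": "/home"},
    {"account_id": 2, "path": "/billing"},
    {"account_id": 1, "path": "/settings"},
    {"account_id": 3, "path": "/home"},
]
accounts = [
    {"account_id": 1, "name": "ABC"},
    {"account_id": 2, "name": "XYZ"},
]

joined = list(hash_join(pageviews, accounts, "account_id", "account_id"))
print(joined)
```

Under this scheme, putting the largest table first (on the left) keeps the in-memory side as small as possible.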

16 / 18

Roadmap : 1.0 General Availability(Q1 2013)

• File Formats:
  – RCFile, Avro, LZO
  – Trevni: new columnar file format by Doug Cutting
• More OS Support:
  – Same as those supported by CDH4
• Performance:
  – Faster, bigger, and more memory-efficient joins and aggregations
  – Straggler handling:
    • more work to faster machines, and less to slower machines
• DDL: enables users to create tables from Impala
• JDBC Driver (shared with Hive)
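The idea behind straggler handling, giving more work to faster machines and less to slower ones, can be simulated with a simple pull-based scheduler. The slide does not describe Impala's actual mechanism, so this is only one possible sketch: the `assign_dynamically` function, the worker names, and the speeds are assumptions.

```python
import heapq


def assign_dynamically(num_units, speeds):
    """Simulate workers pulling equal-sized work units as soon as they are free.

    speeds maps a worker name to units processed per second.
    Returns (units done per worker, time at which the last unit finishes).
    """
    done = {name: 0 for name in speeds}
    # Heap of (time the worker becomes free, worker name).
    free_at = [(0.0, name) for name in sorted(speeds)]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_units):
        now, name = heapq.heappop(free_at)
        end = now + 1.0 / speeds[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))
    return done, finish


# Hypothetical cluster: one machine is four times slower than the others.
speeds = {"fast1": 4.0, "fast2": 4.0, "slow": 1.0}
done, finish = assign_dynamically(90, speeds)

# Baseline: an equal static split is bounded by the slowest machine.
static_finish = (90 / len(speeds)) / min(speeds.values())
print(done, finish, static_finish)
```

In this simulation the slow machine ends up with 10 of the 90 units and the job finishes in 10 seconds, versus 30 seconds when every machine receives an equal share.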

17 / 18

Where to learn more and get started

• Impala Documentation
  https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation

• Cloudera’s Impala Demo VM
  https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Impala+Demo+VM

• Cloudera Blog
  http://blog.cloudera.com/blog/category/impala/

• Impala-user Google Group
  https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/impala-user

• (Unofficial) presentation at Apache Asia Road Show
  http://sizeofvoid.net/wp-content/uploads/ImpalaIntroduction.pdf

• Official announcement of Impala at Strata Conference NY 2012
  http://www.slideshare.net/cloudera/2012-1025-hadoop-world-impala-16x9

• Dremel: Interactive Analysis of Web-Scale Datasets
  http://research.google.com/pubs/pub36632.html

• BigQuery Technical White Paper
  https://cloud.google.com/files/BigQueryTechnicalWP.pdf

18 / 18

Conclusion

• Use Impala when you need to quickly find or compute a small amount of data from a large data source
• Impala does not replace batch-oriented jobs
• The Impala beta and its documentation are quite good for a beta
  – If you can’t wait for Impala v1.0, try BigQuery
