bdm8 - near-realtime big data analytics using impala

1 / 18 Near-realtime Big Data Analytics using Impala David Lauzon Big Data Montreal #8 January 10th 2013

Upload: david-lauzon

Post on 06-May-2015

DESCRIPTION

A quick overview of the information I've gathered on Cloudera Impala. It describes use cases for Impala and cases where Impala should not be used. Presented at Big Data Montreal #8 at RPM Startup Center.

TRANSCRIPT

Page 1: BDM8 - Near-realtime Big Data Analytics using Impala

Near-realtime Big Data Analytics using Impala

David Lauzon
Big Data Montreal #8
January 10th 2013

Page 2: BDM8 - Near-realtime Big Data Analytics using Impala

Plan

• What is Impala?
• Why Google built Dremel?
• Use cases for Impala
• Use cases for Map-Reduce
• Cloudera Customer Survey
• Impala Features
• Impala Performance Expectations and Benchmarks
• Impala Components
• Impala Architecture
• Impala Development Roadmap
• Where to learn more and get started

Page 3: BDM8 - Near-realtime Big Data Analytics using Impala

Disclaimer

• In order to preserve the best accuracy in the description of Dremel and Impala, most of the contents in this presentation have been gathered from the authors of the respective technologies. References are found at the end of the presentation.

• I am not affiliated with or sponsored by Cloudera or Google.

Page 4: BDM8 - Near-realtime Big Data Analytics using Impala

What is Impala?

“An impala is an athletic, graceful African antelope, famous for its speed and its jumping agility” - Wikipedia

Page 5: BDM8 - Near-realtime Big Data Analytics using Impala

Seriously, what is Impala?

• “Impala enables real-time, interactive, analytical queries of the data stored in HBase or HDFS” – Cloudera

• Inspired by the Google Dremel paper (2010)
  – BigQuery is a hosted service that implements Dremel
    • It’s proprietary, not free, and requires you to upload your data to Google servers

Page 6: BDM8 - Near-realtime Big Data Analytics using Impala

Why Google built Dremel?

• Problems with Data Warehouse Solutions for OLAP/BI:
  – Relational OLAP (ROLAP):
    • Need to build indices for every possible query (for performance reasons)
    • The indices could take up the whole RAM
  – Multi-dimensional OLAP (MOLAP):
    • Requires extensive time and money to design and build the data cubes
  – Ad-hoc queries (specific, non-optimised queries):
    • When you don’t know what you’ll need, or need to work in iterations, which happens quite often!

• Solution:
  – Increase full-scan speed without requiring indexing or pre-aggregated values

Page 7: BDM8 - Near-realtime Big Data Analytics using Impala

Come on, give me some use cases!

• Finding particular records with specified conditions:
  – “Find all the locations where account ‘ABC’ was accessed from”
• Quick aggregation of statistics with dynamically-changing conditions:
  – “Can you give me yesterday’s number of impressions for Google AdWords display ads – but only in the Tokyo region?”
• Trial-and-error data analysis:
  – “And between 11am and 1pm?”
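The three kinds of question above map naturally onto plain SQL. Here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine (it is not Impala); the table names, column names and sample rows are all hypothetical and only there to make the queries concrete.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for an Impala connection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE access_log (account TEXT, location TEXT);
    CREATE TABLE impressions (day TEXT, hour INTEGER, region TEXT,
                              ad_type TEXT, n INTEGER);
""")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("ABC", "Montreal"), ("ABC", "Tokyo"), ("XYZ", "Paris"), ("ABC", "Montreal")],
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?, ?, ?)",
    [
        ("2013-01-09", 10, "Tokyo", "display", 100),
        ("2013-01-09", 11, "Tokyo", "display", 40),
        ("2013-01-09", 12, "Tokyo", "display", 60),
        ("2013-01-09", 12, "Osaka", "display", 500),
        ("2013-01-09", 12, "Tokyo", "search", 70),
    ],
)

# 1. Finding particular records with specified conditions.
locations = [row[0] for row in conn.execute(
    "SELECT DISTINCT location FROM access_log WHERE account = ? ORDER BY location",
    ("ABC",),
)]

# 2. Quick aggregation with a condition chosen at query time.
base_query = (
    "SELECT SUM(n) FROM impressions "
    "WHERE day = ? AND region = ? AND ad_type = ?"
)
daily_total = conn.execute(
    base_query, ("2013-01-09", "Tokyo", "display")
).fetchone()[0]

# 3. Trial and error: same question, narrowed to 11am-1pm.
midday_total = conn.execute(
    base_query + " AND hour >= ? AND hour < ?",
    ("2013-01-09", "Tokyo", "display", 11, 13),
).fetchone()[0]

print(locations, daily_total, midday_total)
```

None of these queries relies on an index or a pre-built cube: each one is a full scan with a filter, which is exactly the workload that a fast full-scan engine targets.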

Page 8: BDM8 - Near-realtime Big Data Analytics using Impala

Use cases for which you should stick with Map-Reduce based applications

• Very long-running, batch-oriented tasks such as ETL:
  – e.g. exporting large amounts of data after processing
• Complex event processing:
  – e.g. stream processing
• “Complex data mining on Big Data which requires multiple iterations and paths of data processing with programmed algorithms” - Google

Page 9: BDM8 - Near-realtime Big Data Analytics using Impala

Integration with Hadoop

• Cloudera Customer Survey (Aug. 2012)
  – 80% need faster queries on Hadoop data
  – 65% query Hadoop using Hive
  – 70% move data from Hadoop to an RDBMS for interactive SQL
  – 60% see value today in consolidating to a single platform

Page 10: BDM8 - Near-realtime Big Data Analytics using Impala

Impala Features

• Shared with Hive:
  – Hive MetaStore
  – Hive SQL (most common SQL-92 features)
  – ODBC Driver
  – User Interface (Hue Beeswax)

• Specific to Impala:
  – No Map-Reduce; in-memory transfers instead
  – Host and disk awareness (data locality)
  – Table data caching in RAM
  – No virtual columns or locking

Page 11: BDM8 - Near-realtime Big Data Analytics using Impala

Impala Performance Expectations

• Performance improvements over Hive
  – 3 - 4X for purely I/O bound queries
  – 7 - 45X for queries with at least one join
  – 20 - 90X when data is available in the cache

Page 12: BDM8 - Near-realtime Big Data Analytics using Impala

External Benchmarks

| Workload | Impala Query Time | Hive Query Time | MySQL Query Time |
|---|---|---|---|
| 5.2 GB HAproxy log – top IPs by request count | 3.1s | 65.4s | 146.0s |
| 5.2 GB HAproxy log – top IPs by total request time | 3.3s | 65.2s | 164.0s |
| 800 MB parsed rails log – slowest accounts | 1.0s | 33.2s | 48.1s |
| 800 MB parsed rails log – highest database time paths | 1.1s | 33.7s | 49.6s |
| 8 GB pageview table – daily pageviews and unique visitors | 22.4s | 92.2s | 180.0s |

• Source: searching log files at 37signals (creators of the Ruby on Rails web framework)

http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence
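To read the benchmark as speedup factors rather than raw times, divide each baseline time by the Impala time. The short sketch below does that arithmetic on the figures from the table; the variable and function names are my own.

```python
# (workload, impala_seconds, hive_seconds, mysql_seconds), copied from the table.
benchmarks = [
    ("HAproxy log: top IPs by request count", 3.1, 65.4, 146.0),
    ("HAproxy log: top IPs by total request time", 3.3, 65.2, 164.0),
    ("Rails log: slowest accounts", 1.0, 33.2, 48.1),
    ("Rails log: highest database time paths", 1.1, 33.7, 49.6),
    ("Pageview table: daily pageviews and unique visitors", 22.4, 92.2, 180.0),
]


def speedup(baseline_seconds, impala_seconds):
    """How many times faster Impala was than the baseline engine."""
    return baseline_seconds / impala_seconds


# One (workload, speedup_vs_hive, speedup_vs_mysql) tuple per benchmark row.
speedups = [
    (name, round(speedup(hive, impala), 1), round(speedup(mysql, impala), 1))
    for name, impala, hive, mysql in benchmarks
]

for name, vs_hive, vs_mysql in speedups:
    print(f"{name}: {vs_hive}x vs Hive, {vs_mysql}x vs MySQL")
```

On these five workloads Impala comes out roughly 4x to 33x faster than Hive and roughly 8x to 50x faster than MySQL, with the smallest gain on the largest (8 GB) table.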

Page 13: BDM8 - Near-realtime Big Data Analytics using Impala

Impala Components

• Impala State Store: 1 per cluster
  – Coordinates information (location and status) about all the running impalad instances
• Impala Daemon: 1 per DataNode
  – Coordinates and executes queries
  – Distributes query fragments to other Impala Daemons
• Impala Shell: 1 per node
  – Provides a command-line interface for interacting with Impala
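The division of labour between daemons can be pictured as a scatter-gather: the daemon that receives the query hands one fragment to each daemon, each daemon works on the rows stored on its own node, and the coordinator merges the partial results. The toy simulation below is only an illustration of that idea, not Impala's actual implementation; threads stand in for daemons, the State Store is not modelled, and every name in it is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log rows, grouped by the daemon whose node stores them.
rows_by_daemon = {
    "daemon-1": [{"ip": "10.0.0.1"}, {"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}],
    "daemon-2": [{"ip": "10.0.0.1"}, {"ip": "10.0.0.3"}],
    "daemon-3": [{"ip": "10.0.0.2"}, {"ip": "10.0.0.1"}],
}


def execute_fragment(local_rows):
    """One daemon's share of the query: count requests per IP on local rows."""
    return Counter(row["ip"] for row in local_rows)


def coordinate(rows_by_daemon, top_n):
    """The coordinating daemon: scatter fragments, then merge partial counts."""
    with ThreadPoolExecutor(max_workers=len(rows_by_daemon)) as pool:
        partial_counts = list(pool.map(execute_fragment, rows_by_daemon.values()))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total.most_common(top_n)


top_ips = coordinate(rows_by_daemon, top_n=2)
print(top_ips)
```

Only the small per-IP counts travel back to the coordinator, not the raw rows, which is the point of running a daemon next to the data on every DataNode.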

Page 14: BDM8 - Near-realtime Big Data Analytics using Impala

Impala Architecture

Page 15: BDM8 - Near-realtime Big Data Analytics using Impala

Roadmap : 0.3 Beta Version

• Operating System:
  – Only RHEL/CentOS 6.2 is supported
• File formats:
  – Text files, SequenceFiles, HBase tables
• Compression:
  – Snappy, Gzip, BZip
• No UDFs or user extensibility *
• Largest table in joins must be specified first *
• Right-hand side of a join must fit in RAM *
• No support for complex nested structures *:
  – e.g. maps, structs and arrays

* Top asks for the post-1.0 G.A. version
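The two join restrictions make sense together if the join is executed as an in-memory hash join. The slides do not name the algorithm, so treat that as an assumption: the right-hand table is loaded entirely into a hash table, and the left-hand table is streamed past it row by row. A minimal sketch, with hypothetical table and column names:

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Join two row streams on `key`, holding only the right side in memory."""
    # Build phase: the whole right-hand table goes into a hash table,
    # so it has to fit in RAM.
    build_table = defaultdict(list)
    for right in right_rows:
        build_table[right[key]].append(right)

    # Probe phase: the left-hand table is streamed one row at a time,
    # so its size is not limited by memory.
    for left in left_rows:
        for right in build_table.get(left[key], ()):
            yield {**left, **right}


# Large table first (left, streamed), small table second (right, in memory).
pageviews = (
    {"account_id": i % 2 + 1, "path": f"/page/{i}"} for i in range(4)
)
accounts = [
    {"account_id": 1, "name": "ABC"},
    {"account_id": 2, "name": "XYZ"},
]

joined = list(hash_join(pageviews, accounts, key="account_id"))
print(joined)
```

Under this reading, putting the largest table first keeps it on the streamed side, and the memory limit only applies to the smaller table on the right.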

Page 16: BDM8 - Near-realtime Big Data Analytics using Impala

Roadmap : 1.0 General Availability (Q1 2013)

• File Formats:
  – RCFile, Avro, LZO
  – Trevni: new columnar file format by Doug Cutting
• More OS Support:
  – Same as those supported by CDH4
• Performance:
  – Faster, bigger, and more memory-efficient joins and aggregations
  – Straggler handling:
    • more work to faster machines, and less to slower machines
• DDL: enables users to create tables from Impala
• JDBC Driver (shared with Hive)
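Straggler handling can be pictured with a small simulation. The roadmap does not say how Impala will implement it, so the pull-based scheme below is an assumption: each machine takes the next work unit as soon as it becomes free, so faster machines naturally end up with more units. The machine names and speeds are hypothetical.

```python
import heapq


def schedule(num_units, seconds_per_unit):
    """Simulate machines pulling work units from a shared queue.

    seconds_per_unit maps a machine name to the time it needs for one unit.
    Returns (units completed per machine, total elapsed seconds).
    """
    # Min-heap of (time at which the machine is next free, machine name).
    free_at = [(0.0, name) for name in sorted(seconds_per_unit)]
    heapq.heapify(free_at)
    done = {name: 0 for name in seconds_per_unit}
    finish_time = 0.0
    for _ in range(num_units):
        now, name = heapq.heappop(free_at)    # earliest available machine
        end = now + seconds_per_unit[name]    # it processes one unit
        done[name] += 1
        finish_time = max(finish_time, end)
        heapq.heappush(free_at, (end, name))
    return done, finish_time


# One machine is three times slower than the other.
units_done, elapsed = schedule(8, {"fast": 1.0, "slow": 3.0})
print(units_done, elapsed)
```

With an even 4/4 split the slow machine would need 12 seconds for its share; in the simulation the fast machine takes 6 of the 8 units and the whole job finishes in 6 seconds.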

Page 17: BDM8 - Near-realtime Big Data Analytics using Impala

Where to learn more and get started

• Impala Documentation
  https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation

• Cloudera’s Impala Demo VM
  https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Impala+Demo+VM

• Cloudera Blog
  http://blog.cloudera.com/blog/category/impala/

• Impala-user Google Group
  https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/impala-user

• (Unofficial) presentation at Apache Asia Road Show
  http://sizeofvoid.net/wp-content/uploads/ImpalaIntroduction.pdf

• Official announcement of Impala at Strata Conference NY 2012
  http://www.slideshare.net/cloudera/2012-1025-hadoop-world-impala-16x9

• Dremel: Interactive Analysis of Web-Scale Datasets
  http://research.google.com/pubs/pub36632.html

• BigQuery Technical White Paper
  https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Page 18: BDM8 - Near-realtime Big Data Analytics using Impala

Conclusion

• Use Impala when you need to quickly find or compute a small result from a large data source
• Impala does not replace batch-oriented jobs
• The Impala beta and its documentation are quite good for a beta
  – If you can’t wait for Impala v1.0, try BigQuery