sharpest tool in the - vertica · 2018-09-12 · spark hive pig hdfs mapreduce hcatalog hbase....
TRANSCRIPT
Sharpest Tool in the Hadoop Toolbox
Vertica SQL on Hadoop
Bob Hansen
Deepak Majeti
James Clampffer
#SeizeTheData
We are making Vertica the fastest structured data processor for Hadoop.
Kafka
Spark Hive Pig
HDFS
MapReduce
HCatalog
HBase
Accessing your existing data is easy to do
CREATE HCATALOG SCHEMA hive WITH
    HOSTNAME='hcat.mycorp.com'
    HCATALOG_SCHEMA='tweets';
SELECT
    keyword,
    EXTRACT(month FROM created_at),
    AVG(score)
FROM hive.tweets.tweet_keywords
WHERE created_at >= now() - '3 months'::interval
GROUP BY 1, 2
ORDER BY 2;
Hadoop has two popular formats for columnar data: Parquet and ORC
Column-oriented file formats used by popular Hadoop data-processing tools such as Hive, Spark, Drill, and Impala.
Columnar formats efficiently pack data into Hadoop blocks
File broken into blocks (rowgroups/stripes)
Typical size up to 256 MB (size of an HDFS block)
Structured: Metadata contains information about the file including DDL, statistics, etc.
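The layout described on this slide can be sketched in a few lines of Python. This is a toy model, not the Parquet or ORC on-disk format: the class names (`ColumnChunk`, `RowGroup`) and the helper functions are hypothetical, and real files hold far more metadata than a min/max pair.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column within one row group, plus min/max statistics."""
    values: list
    minimum: object = None
    maximum: object = None

    def __post_init__(self):
        # Statistics are computed once at write time and kept as metadata.
        self.minimum = min(self.values)
        self.maximum = max(self.values)


@dataclass
class RowGroup:
    """A horizontal slice of the table; each column is stored contiguously."""
    columns: dict  # column name -> ColumnChunk


def write_row_groups(rows, column_names, rows_per_group):
    """Split row-oriented input into row groups made of column chunks."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        columns = {
            name: ColumnChunk([row[i] for row in batch])
            for i, name in enumerate(column_names)
        }
        groups.append(RowGroup(columns))
    return groups


def read_column(groups, name):
    """Read a single column by touching only that column's chunks."""
    for group in groups:
        yield from group.columns[name].values


# Made-up rows, for illustration only.
rows = [("spark", 1, 0.5), ("hive", 2, 0.9), ("pig", 3, 0.1), ("hdfs", 4, 0.7)]
groups = write_row_groups(rows, ["keyword", "month", "score"], rows_per_group=2)
scores = list(read_column(groups, "score"))
```

Because each column's values sit together and carry their own statistics, a reader can fetch one column, or skip a whole row group, without decoding the rest of the file.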
VSQLoH allows you to use your Hadoop data fast
Big Data SQL Performance Tournament
Cloudera Hortonworks
Parquet: libhdfs++
Parquet: webhdfs
ORC: libhdfs++
ORC: webhdfs
Big Data SQL Performance Tournament: Impala vs Spark
Impala is 4x-60x faster
Impala succeeded in 60 queries that Spark failed
Both Impala and Spark failed 18 queries
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Spark and Impala, scale 0x to 70x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for Impala; numbers less than 1 are better for Spark.]
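The tournament charts plot one relative-performance number per TPC-DS query. The slides do not give the formula; a plausible reading, assumed here, is the ratio of the two run times, with queries that failed on either system left out. The function name and the timings below are made up for illustration.

```python
def relative_performance(baseline_seconds, contender_seconds):
    """Return (query, ratio) pairs; a ratio above 1 favors the contender."""
    ratios = {}
    for query, baseline in baseline_seconds.items():
        contender = contender_seconds.get(query)
        # A query that failed on either system (None) has no ratio.
        if baseline is None or contender is None:
            continue
        ratios[query] = baseline / contender
    # Sort by ratio, matching the "sorted by relative run-time" x-axis.
    return sorted(ratios.items(), key=lambda item: item[1])


# Made-up timings in seconds; None marks a failed query.
spark = {"q1": 120.0, "q2": 30.0, "q3": None, "q4": 45.0}
impala = {"q1": 2.0, "q2": 7.5, "q3": 4.0, "q4": None}
ranked = relative_performance(spark, impala)
```

Under this assumption, a claim such as "4x-60x faster" corresponds to the smallest and largest ratios among the queries both systems completed.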
Big Data SQL Performance Tournament: HAWQ vs Tez
HAWQ is 2x-60x faster
Tez is up to 3x faster
HAWQ succeeded in 22 queries that Tez failed
Tez succeeded in 16 queries that HAWQ failed
Both Tez and HAWQ failed 18 queries
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Tez and HAWQ, scale 0x to 70x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for HAWQ; numbers less than 1 are better for Tez.]
Big Data SQL Performance Tournament: Parquet: libhdfs++ vs Parquet: webhdfs
Comparable
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Vertica/ORC with libhdfs++ and webhdfs, scale 0x to 6x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for libhdfs++; numbers less than 1 are better for webhdfs.]
Big Data SQL Performance Tournament, bracket so far: Parquet: libhdfs++
Big Data SQL Performance Tournament: ORC: libhdfs++ vs ORC: webhdfs
Comparable
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Vertica/ORC with libhdfs++ and webhdfs, scale 0x to 7x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for libhdfs++; numbers less than 1 are better for webhdfs.]
Big Data SQL Performance Tournament, bracket so far: Parquet: libhdfs++ and ORC: libhdfs++
Big Data SQL Performance Tournament: Impala vs Vertica/Parquet
Vertica is 2x-30x faster
Similar
Vertica succeeded in 19 queries that Impala failed
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Impala and Vertica/Parquet, scale 0x to 35x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for Vertica; numbers less than 1 are better for Impala.]
Big Data SQL Performance Tournament, bracket so far: Parquet: libhdfs++ and ORC: libhdfs++, then Parquet
Big Data SQL Performance Tournament: HAWQ vs Vertica/ORC
Vertica is 2x-73x faster
HAWQ is up to 4x faster
Vertica succeeded in 34 queries that HAWQ failed
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of HAWQ and Vertica/ORC, scale 0x to 80x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for Vertica; numbers less than 1 are better for HAWQ.]
Big Data SQL Performance Tournament, bracket so far: Parquet: libhdfs++ and ORC: libhdfs++, then Parquet and ORC
Big Data SQL Performance Tournament: Parquet vs ORC
Libhdfs++ is 1.2x-2.5x faster
Comparable
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Vertica/Parquet and Vertica/ORC, scale 0.0x to 3.0x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for Parquet; numbers less than 1 are better for ORC.]
Big Data SQL Performance Tournament, bracket so far: Parquet: libhdfs++ and ORC: libhdfs++, then Parquet and ORC, then ROS
Big Data SQL Performance Tournament: ROS vs VSQLoH
ROS is 2x-11x faster
Measured under TPC Benchmark™ DS standards
[Chart: relative performance of Vertica/ROS and Vertica/Parquet, scale 0x to 12x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for ROS; numbers less than 1 are better for Parquet.]
VSQLoH is fast because of our open source investments
Vertica developed libParquet and libOrc for speed and stability
Most systems use Java SerDes and Java vectorized readers. These do not couple well with C++-based systems because of the lack of control over resources and the lack of tight integration.
libOrc (https://orc.apache.org)
• Development started early 2015
• HPE + Hortonworks collaboration
libParquet (https://parquet.apache.org)
• Development started early 2016
• HPE + Cloudera collaboration
Optimizations
Column selection
Partition pruning
Predicate pushdown
Read only the data you need
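A rough sketch of how partition pruning and predicate pushdown cut the amount of data read. The metadata layout, the field names, and the `scan` function are hypothetical; the actual ORC and Parquet readers work on their own file metadata rather than Python dictionaries.

```python
# Hypothetical metadata: one entry per row group, with the partition value
# it belongs to and min/max statistics for the "score" column.
row_groups = [
    {"partition": 201607, "score_min": 0.1, "score_max": 0.4, "scores": [0.1, 0.3, 0.4]},
    {"partition": 201608, "score_min": 0.2, "score_max": 0.5, "scores": [0.2, 0.5]},
    {"partition": 201608, "score_min": 0.6, "score_max": 0.9, "scores": [0.6, 0.9, 0.7]},
]


def scan(row_groups, partition, min_score):
    """Return scores above min_score in one partition, and row groups read."""
    groups_read = 0
    result = []
    for group in row_groups:
        # Partition pruning: skip data whose partition value cannot match.
        if group["partition"] != partition:
            continue
        # Predicate pushdown: skip row groups whose statistics prove that
        # no value can satisfy "score > min_score".
        if group["score_max"] <= min_score:
            continue
        groups_read += 1
        # Only the surviving row groups are decoded and filtered row by row.
        result.extend(s for s in group["scores"] if s > min_score)
    return result, groups_read


matches, groups_read = scan(row_groups, partition=201608, min_score=0.55)
```

In this made-up example only one of the three row groups is decoded: the first is pruned by its partition value and the second by its maximum score.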
How much do we gain?
Resources: https://github.com/apache/orc/pull/43/files
Fast is no good if it doesn’t work reliably
VSQLoH will run your SQL queries out of the box
[Chart: Successful unaltered TPC-DS queries, scale 0 to 90, running unmodified TPC-DS benchmark queries. Bar values as shown: 56, 23, 18, 98, 64.]
VSQLoH’s fine-grained resource management ensures that queries will complete without running out of memory
[Chart: Concurrent queries before error, scale 0 to 60, running concurrent select TPC-DS queries.]
WebHDFS was not a good fit for Vertica’s use case
WebHDFS was intended to be easy for people to use, not for high performance.
Meant to be accessed from curl or a web browser:
curl webhdfs://host:port/webhdfs/v1/my_file_path
or http://host:port/webhdfs/v1/my_file_path
[Diagram labels: Vertica; HDFS Server; Web Server; HDFS Client; WebHDFS; WebHDFS interface; libhdfs++ interface; HDFS Client (JVM); libhdfs interface]
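For reference, a WebHDFS read is an ordinary HTTP request against a URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=OPEN`, optionally with `offset` and `length` parameters. The helper below only builds such a URL; the host name, port, and file path are placeholders, and the function name is hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op="OPEN", **params):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    # The operation and its parameters travel in the query string.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Placeholder host, port, and path, for illustration only.
url = webhdfs_url("namenode.example.com", 50070, "/user/customers/part-0",
                  offset=0, length=1024)
```

Every byte range fetched this way goes through an HTTP server in front of the HDFS client, which is convenient from curl or a browser but adds overhead for an engine issuing many reads.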
Libhdfs++ is developed from scratch with a focus on performance
• Implemented in C++ with minimal dependencies.
• Supports Linux and OSX.
• All interfaces are non-blocking (unless you want them to be).
• Minimal memory footprint; all memory is explicitly freed as soon as possible.
[Charts: "Find of 1 directory", Time (sec), Java vs C++; "Find across 1M directories", Time (sec) and Memory (MB), Java vs C++. Data labels: 2.4 seconds, 0.012 seconds.]
HDFS JIRAs
• HDFS-7280
• HDFS-7279
• HDFS-7270
• HDFS-7945
Libraries were implemented with bindings to other languages in mind
libhdfs++
liborc
libparquet
• Pure C wrapper APIs allow functionality to be accessed from nearly any other language.
• Write prototypes and tools in scripting languages, and move to native implementations if required.
• More language bindings lead to more adoption and more contributors.
• Development here: http://issues.apache.org/jira/browse/HDFS-8707
Developed within the Apache community
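To illustrate why a pure C interface makes bindings cheap, the sketch below calls a C function from Python with `ctypes`. It uses `strlen` from the C standard library as a stand-in, since the C wrapper functions of libhdfs++, liborc, and libparquet are not listed on the slide; it assumes a POSIX system where the C library can be loaded this way.

```python
import ctypes
import ctypes.util

# Load the C standard library as a stand-in for any library with a C API.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: bytes) -> int:
    """Call the native strlen through its C interface."""
    return libc.strlen(text)


length = c_strlen(b"libhdfs++")
```

The same pattern (load the shared library, declare argument and return types, call) applies to any library that exposes plain C functions, which is what allows prototypes in scripting languages before a native implementation is written.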
In upcoming releases, it will be faster
Caching Hadoop data in Vertica’s ROS format will supercharge your queries
CREATE CACHE VIEW fast_tweets
FROM hive.tweets.tweet_keywords
WHERE created_year_month BETWEEN 201509 AND 201608;
SELECT keyword, EXTRACT(month FROM created_at), AVG(score)
FROM fast_tweets
WHERE created_year_month = 201608
GROUP BY 1, 2
ORDER BY 2;
ROS has 10 years of R&D to make it the fastest format around
[Diagram labels: ROS; VSQLoH; HDFS; ROS; SELECT…]
ROS is 2x-11x faster
[Chart: relative performance of Vertica/ROS and Vertica/Parquet, scale 0x to 12x. X-axis: TPC-DS query, sorted by relative run-time. Numbers greater than 1 are better for ROS; numbers less than 1 are better for Parquet.]
Complex types allow richer, more semantically clean data
Complex types enable expression of SQL queries in a natural and intuitive way
SELECT customer, orders.total_cost
FROM customers
WHERE orders.total_sales > 4000 AND orders.products.id = 'B2';
Writing data to HDFS will make Vertica a central part of your workflow
Support writing data in Vertica to ORC and Parquet formats.
SELECT * FROM customers AS COPY TO 'hdfs:///user/customers' PARQUET;
[Diagram labels: HDFS; ORC / Parquet]
Who wants some fast?
Vertica SQL on Hadoop Summary
• High-performance Vertica engine
  • Beats HAWQ, Impala, Spark, Tez
• All TPC-DS queries run out of the box
• Supports major HDFS file formats
  • ORC, Parquet
• Native readers enable tighter integration
  • Partition pruning, predicate pushdown, column selection
• Libhdfs++ enables efficient communication with HDFS
• Roadmap for more features and further performance improvements