SQL in Hadoop – The Real Deal. Emma McGrattan, SVP Engineering @ Actian
DESCRIPTION
Can you use the full SQL language in Hadoop without limitations? Can you run SQL on the freshest Hadoop data without moving it into another database every time you want to run a query? The answer is a resounding yes. Unlike many solutions that promise SQL access in Hadoop but don’t deliver, this session showcases a solution that gives users enterprise-ready, scalable SQL access to data in Hadoop. We’ll also talk about adding trickle-update support on HDFS, a file system designed for data to be written once and read ever after. With trickle-update support on Hadoop, traditional OLTP workloads can run natively on Hadoop without having to pull the data into a traditional RDBMS.
TRANSCRIPT
Confidential © 2014 Actian Corporation
Actian SQL in Hadoop
Emma K McGrattan, Actian Corp.
October 7th 2014
Actian is Delivering Transformational Value
$140M Revenues + Profitable
10,000+ Customers
Global Presence: 8 worldwide offices, 7x24 multinational support model
“Fast becoming a big data powerhouse to challenge the market.” Forrester
“Actian is now very powerfully positioned in the big data and analytics markets.” Bloor
Actian Analytics Platform – Vector: Built for Speed, Fast Time to Value
[Chart: Time / Cycles to Process vs. Data Processed. CHIP: 2-20 cycles, 10GB. RAM: 150-250 cycles, 2-3GB. DISK: millions of cycles, 40-400MB.]
Vector Processing: Single Instruction Multiple Data
2nd Gen Column Store: Limit I/O; efficient real time updates
Smarter Compression: Maximize throughput; vectorized decompression
Exploiting Chip Cache: Process data on chip – not in RAM
Multi-core Parallelism: Maximize system resource utilization
Storage Indexes: Quickly identify candidate data blocks; minimize IO
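The “Storage Indexes” item (quickly identify candidate data blocks, minimize IO) can be illustrated with a minimal Python sketch. It assumes a storage index is simply a per-block minimum and maximum value, which is a common design for this kind of index but is an assumption here and not a description of Vector's actual storage format. The names `Block` and `scan_greater_than` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A block of column values plus an assumed min/max storage index."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # The index is computed once, when the block is written.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return the values greater than threshold and the number of blocks read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # If even the largest value fails the predicate, skip the block
        # without touching its data.
        if block.max_value <= threshold:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


blocks = [Block([1, 5, 9]), Block([12, 15, 18]), Block([20, 25, 30])]
matches, blocks_read = scan_greater_than(blocks, 16)
print(matches, blocks_read)
```

In this toy data set the first block is skipped entirely because its maximum is below the threshold, so only two of the three blocks are read.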
TPC-H 1TB – Faster, Less Hardware
[Chart: Fastest TPC-H QphH@1TB Benchmark (non-clustered). Axes: QphH (0 to 400,000) and Hardware Cost (excluding discounts). Source: www.tpc.org]
QphH by system:
Actian Vector 445,529
Actian Vector 436,788
SQL Server 219,888
Oracle 209,534
Oracle 201,487
SQL Server 173,962
Sybase IQ 164,747
Oracle 140,181
SQL Server 134,117
Result dates shown on the chart: June ‘12, May ‘11, Aug ‘11, June ‘11, Sept ‘11, Apr ‘11, Dec ‘10, Apr ‘10, Dec ‘11
Hardware costs shown on the chart (excluding discounts): $57,146, $1,229,968, $460,869, $2,402,706, $753,392, $278,527, $85,621, $1,249,967, $258,880
SQL in Hadoop Architecture
[Diagram: a master node and worker nodes 1..n, each running X100 on top of HDFS, connected by MPI inter-node communication.]
Master node (name node):
Client application → SQL query → SQL processing: SQL parser → (parsed tree) → Optimizer → (query plan) → Cross compiler → (X100 algebra) → X100
X100: Distributed rewriter → (annotated query tree) → Builder → (operator tree) → Execution engine → (data request / data) → Buffer manager → I/O → HDFS
Worker node [1..n] (data nodes):
X100: Rewriter → (annotated query tree) → Builder → (partial operator tree) → Execution engine → (data request / data) → Buffer manager → I/O → HDFS
MPI inter-node communication: annotated tree from master to workers, partial result set from workers to master, result from master to client.
HDFS namenode on the master node, HDFS datanode on the worker nodes.
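The master/worker flow in the architecture diagram (workers produce a partial result set, the master combines them into the result) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: plain function calls stand in for the MPI messages, in-memory lists stand in for HDFS-resident partitions, and `worker_partial_count` and `master_merge` are hypothetical names, not X100 components.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]


def worker_partial_count(rows: List[Row]) -> Counter:
    """Worker side: aggregate only the local partition (a partial result set)."""
    return Counter(row["store"] for row in rows)


def master_merge(partials: List[Counter]) -> Dict[str, int]:
    """Master side: combine the partial result sets into the final result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts key by key
    return dict(total)


# Three partitions, one per hypothetical worker node.
partitions: List[List[Row]] = [
    [{"store": "A"}, {"store": "B"}],
    [{"store": "A"}],
    [{"store": "B"}, {"store": "B"}],
]

# Equivalent of: SELECT store, COUNT(*) FROM sales GROUP BY store
partials = [worker_partial_count(p) for p in partitions]
result = master_merge(partials)
print(result)
```

The point of the split is that each worker touches only its local data, and only the small partial aggregates cross the network.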
Actian Analytics Platform – Hadoop Integration: Transforms Hadoop into a High Performance Analytics Platform
[Diagram: a Hadoop cluster (YARN, HDFS) with a NameNode and several DataNodes running X100 on HDFS. Standard SQL Interfaces and the Visual Data & Analytics Workflow sit above the cluster. Other labels: Read, Load, Actian Vector, Blend & Enrich, Data Science & Analytics.]
HDFS Data Block
• Original file format
• Standard block replication
High Performance, Industrialized SQL Database; High Performance, Data Science & Analytics
• Column-based blocks
• Compressed
• Partitioned
Replicated Data
• >=3 Replicated Copies of Data Blocks
• Leveraged to co-locate data with various join keys
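The idea of co-locating data by join key can be illustrated with a short Python sketch. It uses generic hash partitioning as a stand-in: the slide does not say how Actian places blocks or replicas, so the placement function `node_for`, the helpers `partition` and `local_join`, and the sample tables are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Tuple

Row = Dict[str, object]


def node_for(key: object, num_nodes: int) -> int:
    """Deterministically map a join key to a node (assumed placement rule)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def partition(rows: List[Row], key_name: str, num_nodes: int) -> List[List[Row]]:
    """Split a table across nodes by the hash of its join key."""
    nodes: List[List[Row]] = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_name], num_nodes)].append(row)
    return nodes


def local_join(customers: List[Row], orders: List[Row]) -> List[Tuple[object, object]]:
    """Join the two partitions held by one node, with no network traffic."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        (by_id[o["customer_id"]]["name"], o["amount"])
        for o in orders
        if o["customer_id"] in by_id
    ]


customers: List[Row] = [{"customer_id": i, "name": f"customer-{i}"} for i in range(6)]
orders: List[Row] = [{"customer_id": i % 6, "amount": 10 * i} for i in range(12)]

NUM_NODES = 3
customer_parts = partition(customers, "customer_id", NUM_NODES)
order_parts = partition(orders, "customer_id", NUM_NODES)

# Each node joins only its own partitions; the results are then concatenated.
joined = [
    pair
    for n in range(NUM_NODES)
    for pair in local_join(customer_parts[n], order_parts[n])
]
print(len(joined))
```

Because both tables are partitioned with the same function on the same key, every order finds its customer on the same node, so the join needs no data movement between nodes.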
History of the TPC-DS Comparison
TPC-DS Benchmark Components
[Diagram labels: Operational Systems (Store, Web, Catalog, Inventory, Promotions); Set of Files; ETL; Refresh Process; DSS Database (TPC-DS); User Queries (Ad-hoc, Reporting Queries); Reports.]
Actian Hadoop SQL Performance
[Chart: Speedup vs Impala for the “Impala Subset” of TPC-DS Queries at Scale Factor 3000 (3TB). Y-axis: Speedup Factor (0 to 35). X-axis: Q3, Q7, Q19, Q27, Q34, Q42, Q43, Q46, Q52, Q53, Q55, Q59, Q63, Q65, Q68, Q73, Q79, Q89, Q98. Series: Impala, Actian.]
16x avg. speedup
Background to the “Impala Subset” of the TPC-DS benchmark can be found here:
http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
Both executed on the same hardware and software environment:
5-node cluster with 64GB of RAM per node and 24x1TB hard disks.
Live Demo
AAP – Express Editions
Actian Analytics Platform - Express Editions
Extreme Performance Edition
Single node installations
The leading single server analytics database with a powerful yet simple analytics workflow builder
Free!
Community supported
Get it today:
http://bigdata.actian.com/express
Hadoop SQL Edition
Hadoop cluster installations
The fastest analytics database running natively in Hadoop with a powerful yet simple analytics workflow builder
Free!
Community supported
Coming very soon. Pre-register at:
http://bigdata.actian.com/sql-in-hadoop
Thank You!