135_pgcon 2009 - aster v6
TRANSCRIPT
-
7/28/2019 135_PGCon 2009 - Aster v6
1/18
Building PetaByte Warehouses withUnmodified PostgreSQL
Emmanuel CecchetMember of Research Staff
May 21st, 2009
Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A
2 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
2/18
Enterprise Data Warehouse Under Stress
EnterpriseData Warehouse
EnterpriseData Warehouse
PGCon 2009, Ottawa 2009 Aster Data Systems3
Offloading The Enterprise Data Warehouse
Enterprise DataWarehouse
Enterprise DataWarehouse
Frontline DataWarehouse
Frontline DataWarehouse
Archival DataWarehouse
Archival DataWarehouse
Petabytes detailed data
Low cost/TB Flexible compression On-line access Aging out to off-line
Petabytes source data
24 x7 Availability TB/Day Load capacity In-database transforms Rapid data access
PGCon 2009, Ottawa 2009 Aster Data Systems4
-
7/28/2019 135_PGCon 2009 - Aster v6
3/18
!
"
#
$
" %
& ! ' ( # '
'
"
! & ! ' )
0 "
1
"
!
2
&
%
3
"
( & !
2
)
% 4
0
3
!
"
& !
"
5
6
7 # ' # )
"
) ) )
" % %
&
8
) # '
"
9 # ) # ' #
$
"
) )
" % %
&
8
) # '
"
6
B
C
D
6
C
F
)
$
" %
# ) G
0 " H
"
( !
0
P
"
) )
$
"
Requirements for Frontline Data Warehouses
5 PGCon 2009, Ottawa 2009 Aster Data Systems
Who is Aster Data Systems?
Aster nCluster is a software-only RDBMS for large-scalefrontline data warehousing
High performance Always Parallel MPP architecture
High availability Always On on-line operations
High value analytics In-Database MapReduce
Low cost Petabytes on commodity HW
6 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
4/18
! "
# $ $ $ % & &
"
!
H "
!
'
( ) 0 1 2 3 4 5 3 0
6
7 ( 8 )9
'
( ) 0 1
@ A B
C
D
0 1
'
( ) 0 1 2 3 4 5 3 0
6
7 ( 8 )9
'
( ) 0 1
@ A B
C
D
0 1
E G
7 7 ) 1 H
I B P
7 Q
I
3 C 9
D
Q 1
E G
7 7 ) 1 H
I B P
7 Q
I
3 C 9
D
Q 1
MySpace Frontline Data Warehouse
R S
T U U T
V W X Y ` W
a b c d e
f g
S h
` i p q r
s t u
v
u x t x
y
t
R S
T U U T
V W X Y ` W
a b c d e
f g
S h
` i p q r
s t u
v
u x t x
y
t
"
!
" ! !
j k l m n
o
7 PGCon 2009, Ottawa 2009 Aster Data Systems
Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A
8 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
5/18
Petabyte Datawarehouse Design
PostgreSQL as a building blockDo not hack PostgreSQL to serve the distributed database
Build on top of mainline PostgreSQL
Use standard Postgres APIs
Service Oriented ArchitectureHierarchical query planning and optimization
Shared nothing replication treating Postgres and Linux as a
service
Compression at the OS level transparently to PostgreSQLIn-database Map-Reduce out-of-process
9 PGCon 2009, Ottawa 2009 Aster Data Systems
Aster nCluster Database
10 PGCon 2009, Ottawa 2009 Aster Data Systems
Queries/AnswersQueries/Answers
Queen Server Group
QueriesQueries
DataData
Worker Server Group
Loader/Exporter Server Group
Aster nCluster Database
Queen Node u
y y y
v
t u x
v
t u
y y
u
t
u u
s t t
u
v
y
x
y
Worker Node
! " # $ # % & # ' " # 0 " # 1
# 0 2 1
" # 3 0 1 #
"
5
" # 0 1 #
" $ $ 1
1 1 $
# 1
6
7
1 $ "
7
# 8 " #
9
#
Loader Node@
A 6 $ 1
1 A 1 B C D " # E 2
9 F
$ #
G
1 #
"
7
$ 1
1
" 1 0 0 # " 0 # 1
H I
P
#
Q
7
H I
$ $ 1
1 1 # " 8 " #
9
#
-
7/28/2019 135_PGCon 2009 - Aster v6
6/18
Query Processing: How It Works
A: Users send queries to the Queen (via ACT, ODBC/JDBC, BI tools, etc.)
B: The Queen parses the query, creates an optimal distributed plan, sends subqueriesto the Workers, and supervises the processing
C: Workers execute their subqueries (locally or via distributed communication)
D: The Queen aggregates Worker results and returns the final result to the user
g
7 8 0 8
S
g
V V
U
b
p
S
p
S
7 8 0 8
7 8 0 8
7 8 0 8.
2 !
D
Q
P
"
#
$
11 PGCon 2009, Ottawa 2009 Aster Data Systems
Table Compression Architecture
How It Works%
& ' & ( ) 0 1 2 3 4 5 ) ) 5
6 7 6
5 0 1 2 3 4 5 ) ) 5
6 8
5 1 @ 1 A ( 0 &
6
& ' &
8
& ) 5
B
' 1 4 5
6
C
6
( D D 5 4 5 C ' ' &
8
5 ) 3 & 0 5 )
6
5 3 5 C
6
( C A 1 C 0 1 2 3 4 5 ) ) ( 1 C 5 E 5
Architecture Benefits:F
( A G 0 1 2 3 4 5 ) ) ( 1 C 4 & ' ( 1 ) H
8
( A A 5 4
8
1 0
I
) ( P 5 ) Q ( 5
6
2 1 4 5 0 1 ) ' ) & E ( C A )
%
& ' &
8
& ) 5 ' 4 & C ) 3 & 4 5 C ' H 5 & ) 5 1 D D S ' S 4 5 T 1 ) ' A 4 5 ) S 3 A 4 &
6
5 )
U
1 C 0 S 4 4 5 C 0 Q H G ( A G V 3 5 4 D 1 4 2 & C 0 5 2 S ' ( V ' &
8
5 0 1 2 3 4 5 ) ) ( 1 C
T 5 4 D 1 4 2 & C 0 5
Y
5 @ V 4 1 @ ` a S 5 4 ( 5 )
6
1 C b ' C 5 5
6
D S ' &
8
5
6
5 0 1 2 3 4 5 ) ) ( 1 C
Logical Database (eg. Schema)
Physical Tablespacesc
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
d y
c
d e f g h i i
d y
c
d e f g h i i
d y
c
d e f g h i i
d y
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
d
c
d e f g h i i h
r
d
c
d e f g h i i h
r
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
Query
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
q
h
r s t
e
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
c
d e f g h i i
s v w
12 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
7/18
Table Compression Enables Powerful Archival
Compressed - High
Compressed - Medium
Compressed - Low
Not Compressed
Older data accessed less frequently
Compress to save space and cost
Oldest data is compressed the most, recent is
compressed the leastCompressed tables are fully available for
queries (true online archival)
13 PGCon 2009, Ottawa 2009 Aster Data Systems
What is MapReduce and Why Should I Care?
What is MapReduce?
Popularized by Google
http://labs.google.com/papers/mapreduce.html
Processes data in parallel across distributed cluster
Why is MapReduce significant?
Empowers ordinary developers
Write application logic, not debug cluster communication code
Why is In-Database MapReduce significant?
Unites MapReduce with SQL: power invoked from SQL
Develop SQL/MR functions with common languages
14 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
8/18
Aster In-Database MapReduce
Slide 15
Aster nClusterAster nCluster
Data Store Engine
In-Database MapReduce
SQL/MR Functions
PGCon 2009, Ottawa 2009 Aster Data Systems15
Users, analysts,applications
In-Database MapReduce
Extensible framework (MapReduce + SQL)
Flexible: Map-Reduce expressiveness, languages, polymorphism
Performance: Massively parallel, computational push-down
Availability: Fault isolation, resource management
Out-of-process executables
Does not use PL/* for custom code execution
Can execute Map and Reduce functions in any language that has a
runtime on Linux (e.g. Java, Python, C#, C/C++, Ruby, Perl, R,
etc)
Standard PostgreSQL APIs to send/receive data to executables
Fault isolation, security and resource management for arbitrary
user code
PGCon 2009, Ottawa 2009 Aster Data Systems16
-
7/28/2019 135_PGCon 2009 - Aster v6
9/18
Live QueriesLive Queries
Always On: Minimize Planned Downtime
Backup &Restore
Backup &Restore
DataBackup
DataBackup
Add CapacityAdd CapacityRebalance DataRebalance Data
Load & ExportLoad & Export
17 PGCon 2009, Ottawa 2009 Aster Data Systems
Precision Scaling
When more CPU/memory/capacity are needed, new nodes can be added for scale-out.
Precision Scaling uses standard PostgreSQL APIs to migrate vWorker partitions to
new nodes either for load balancing (more compute power) or capacity expansion
Example: Assume Workers 1/2/3 are 100% CPU-bottlenecked. Incorporation adds anew Worker4 node and migrates over vWorker partitions D/H/L. As a result of load-balancing, CPU-utilization drops to 75% per node, eliminating hotspots.
.
!
"
#
$
&
'
$
$
18 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
10/18
Replication
! &
H
#
% "
%
# ! #
4
" %
&
"
! )
0 "
)
%
"
'
H
#
4
# #
v w v w
t u t
y
y
v
v
y
v
t u
t u u t
v
u t
u
v
u
v
t u t
x
v
t u
t
v
v
y
u
y
v
v
t
y y
u
y y
v
t x
v
y
!
"
# ) ! & !
"
& 1 " )
%
" # )
( #
"
!
H
#
0 " $
%
u
y
u
v v
t
t
y
v
y
&
t '
u (
v
y
u
u x
& ) 0
v
y
v
t
v
x
v
u
y
19 PGCon 2009, Ottawa 2009 Aster Data Systems
Fault Tolerance & Automatic Online Failover
Replication Failover
Automatic, non-disruptive, graceful performance impact
Replication Restoration
Delta-based (fast) and online (non-disruptive)
1
2 3 4 5 3 6
1
2 3 4 5 3 7
1
2 3 4 5 3 8
9 @
5 5 A B
C D E F G
H
I P
$
&
!
"
1
2 3 4 5 3 R
1
2 3 4 5 3 S
!
#
$
"
#
&
20 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
11/18
Using Commodity Hardware
.
.
.
v
v
u
v
s
y
t
x u
v
x
v
s t
s
y
&
x
v
s
y y
(
y
v
x
v
t
'
0
'
0
0
' % ' (
y
(
y
u
v
(
0
x u
v
!
"
#
!
#
$
"
! %
" &
#
'
(
&
0
t
u
u
'
t )
y
v
% s
)
21 PGCon 2009, Ottawa 2009 Aster Data Systems
Scaling OnScaling On--Demand to aDemand to a PetaBytePetaByte
Commodity Hardware
2 TB Building Block
Dell, HP, IBM, Intel x86
16 GB Memory
2.4TB of Storage
8 Disks
$5k to $10k Node
0
1 2 1
U
3
b a
` i 4 V
5
More Blocks = More Power
Massive Power Per Rack
160 Cores
640 GB RAM
48 TB SAS
22 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
12/18
Heterogeneous Hardware Support
Heterogeneous HW support enables customers to addcheaper/faster nodes to existing systems for investment
protectionMix-n-match different servers as you grow (faster CPUs, more
memory, bigger disk drives, different vendors, etc)
...
...
V i ` i
....
CPU
Memory
Disk
23 PGCon 2009, Ottawa 2009 Aster Data Systems
Topics
Introduction to frontline data warehouses
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A
24 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
13/18
Error Logging in COPY
Separate good from malformed tuples
Per tuple error information
Errors to capture
Type mismatch (e.g. text vs. int)
Check constraints (e.g. int < 0)
Malformed chars (e.g. invalid UTF-8 seq.)
Missing / extra column
Low-performance overhead
Activated on-demand using environment variables
25 PGCon 2009, Ottawa 2009 Aster Data Systems
Error Logging in COPY
26 PGCon 2009, Ottawa 2009 Aster Data Systems
Detailed error context is logged along with tuple content
-
7/28/2019 135_PGCon 2009 - Aster v6
14/18
Error Logging Performance
1 million tuples COPY performance
27 PGCon 2009, Ottawa 2009 Aster Data Systems
Auto-partitioning in COPY
COPY into a parent table route tuples
directly in the child table with matching
constraints
COPY y2008 FROM data.txt
PGCon 2009, Ottawa 2009 Aster Data Systems28
-
7/28/2019 135_PGCon 2009 - Aster v6
15/18
Auto-partitioning in loading
Activated on-demand
set tuple_routing_in_copy = 0/1;
Configurable LRU cache size
set tuple_routing_cache_size = 3;
COPY performance into parent table similar to
direct COPY in child table if data is sorted
Will leverage partitioning information in the
future (WIP for 8.5)
PGCon 2009, Ottawa 2009 Aster Data Systems29
Other contributions
Temporary tables and 2PC transactions
[Auto-]Partitioning infrastructure
Regression test suite
LFI (http://lfi.sourceforge.net/)
Fault injection at the library level or below
out-of-memory conditions, network connection errors, interrupted
system calls, data corruption, hardware failures, etc
Lightning talk on Friday!
PGCon 2009, Ottawa 2009 Aster Data Systems30
-
7/28/2019 135_PGCon 2009 - Aster v6
16/18
Topics
Introduction to Aster and data warehousing
PetaByte warehouse design with PostgreSQL
Aster contributions to PostgreSQL
Q&A
31 PGCon 2009, Ottawa 2009 Aster Data Systems
PetaByte Warehouses with PostgreSQL
Unmodified PostgreSQL
Always Parallel MPP architecture
Always On on-line operations
In-Database MapReduce
PetaByte on commodity Hardware
32 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
17/18
Learn morewww.asterdata.com
Free TDWI report on advanced analytics:
asterdata.com/mapreduce
Free Gartner webcast on mission-critical DW:
asterdata.com/gartner
Contact us
Aster Data Systems
33 PGCon 2009, Ottawa 2009 Aster Data Systems
Bonus slides
34 PGCon 2009, Ottawa 2009 Aster Data Systems
-
7/28/2019 135_PGCon 2009 - Aster v6
18/18
nCluster Components
I G
( 0 8 9
S
g
V V
U
b
p q
a
q
T
W
T
W !
h
i V ! i q "
b
# $ " # $ % $ " #
&
)
2
G 0
1
2
3
C
D
C 4 8 0
A
C 8 C
D
7 3
3
C
D
C 4 8 0
A
C 8 C
D
7 3
2 5
A
7 8
D
2 5
A
7 8
D
nCluster
6
7
8
9
"
@
35 PGCon 2009, Ottawa 2009 Aster Data Systems
Aster Loader
AV q
5
S
q
U
q W B
T
W !
C D E
G H I G P Q R S T U V R
W
R X U
W
X Y S U R
W
U R V R
W
R
W
S R ` X
a
U S b U S
a c
S d R ` Y U e
c
W W
U ` U Y
g
i p
F q r s
p
H P t u
W
b U Q
c
R V U S X R S U d R b b U V
W
c w
x
c
S
g
U S X
c
`
x
c
S
g
U S `
c
V U X
y
a
W
U S b R S
W W
c
` ` T Q
c
R V U S
b R S R U
c
R V u ` u U V R
W
R `
W
c w
x
c
S
g
U S X
G H G
r
Q ` U R S X Y R R e U
c
R V b U S
a c
S d R ` Y U
a
S
c
d Q
c
R V U S X
W
c
x
c
S
g
U S X
Scalable PartitioningC D E
G H I G P R S
W W
c
` ` T d u X
W
e U X Y R R e U R ` V `
c
W
d b U V U
W
U
c
R V ` T b S
c
Y U X X
i p
F q r s
p
H P R Y
Q
c
R V U S Y
c
`
W
R ` X R R S
W W
c
` U S
Y
u X U X R ` R T
c
S
W
d
W
c
R X X T ` V R
W
R `
W
c
e u Y
g
U
W
X R Y
e u Y
g
U
W
X u ` u U d R b b U V
W
c
W
`
W
U Y u X
W
U S
G H G
r
P R X
W
`
W
U T U `
W
b R S
W W
c
` ` T V u S ` T d R X X
w
U X Y R U V R
W
R
c
R V X
q 3
U
a
q W
5 U T
W !
C D E
G H I G P R U V `
c
V U X d R
c
X U V R
W
R
c
S X T `
a
Y R `
W
V S
c
b
c
R V b U S
a c
S d R ` Y U
i p
F q r s
p
H
a
R `
c
V U
a
R X
c
W
U S X
W
S U R d X Y
c
`
W
` u U
W
c c
R V
b U S
a c
S d R ` Y U
W
y
V d ` X Y R ` U R X
S U Y
c w
U S
c
X
W
V R
W
R e S U
c
R V ` T u
g
U U V U S V R
W
R
G H G
r
R
W
R
c
X X b S
c
W
U Y
W
c
` b U S
a c
S d R ` Y U Y
c
` X X
W
U ` Y V u S ` T `
c
V U
a
R u S U
36 PGCon 2009, Ottawa 2009 Aster Data Systems