135_pgcon 2009 - aster v6

7/28/2019 135_PGCon 2009 - Aster v6

1/18

Building PetaByte Warehouses withUnmodified PostgreSQL

Emmanuel CecchetMember of Research Staff

May 21st, 2009

Topics

Introduction to frontline data warehouses

PetaByte warehouse design with PostgreSQL

Aster contributions to PostgreSQL

Q&A

2 PGCon 2009, Ottawa 2009 Aster Data Systems

7/28/2019 135_PGCon 2009 - Aster v6

2/18

Enterprise Data Warehouse Under Stress

EnterpriseData Warehouse

EnterpriseData Warehouse

PGCon 2009, Ottawa 2009 Aster Data Systems3

Offloading The Enterprise Data Warehouse

Enterprise DataWarehouse

Enterprise DataWarehouse

Frontline DataWarehouse

Frontline DataWarehouse

Archival DataWarehouse

Archival DataWarehouse

Petabytes detailed data

Low cost/TB Flexible compression On-line access Aging out to off-line

Petabytes source data

24 x7 Availability TB/Day Load capacity In-database transforms Rapid data access


7/28/2019 135_PGCon 2009 - Aster v6

3/18

!

"

#

$

" %

& ! ' ( # '

'

"

! & ! ' )

0 "

1

"

!

2

&

%

3

"

( & !

2

)

% 4

0

3

!

"

& !

"

5

6

7 # ' # )

"

) ) )

" % %

&

8

) # '

"

9 # ) # ' #

$

"

) )

" % %

&

8

) # '

"

6

B

C

D

6

C

F

)

$

" %

# ) G

0 " H

"

( !

0

P

"

) )

$

"

Requirements for Frontline Data Warehouses


Who is Aster Data Systems?

Aster nCluster is a software-only RDBMS for large-scalefrontline data warehousing

High performance Always Parallel MPP architecture

High availability Always On on-line operations

High value analytics In-Database MapReduce

Low cost Petabytes on commodity HW


7/28/2019 135_PGCon 2009 - Aster v6

4/18

! "

# $ $ $ % & &

"

!

H "

!

'

( ) 0 1 2 3 4 5 3 0

6

7 ( 8 )9

'

( ) 0 1

@ A B

C

D

0 1

'

( ) 0 1 2 3 4 5 3 0

6

7 ( 8 )9

'

( ) 0 1

@ A B

C

D

0 1

E G

7 7 ) 1 H

I B P

7 Q

I

3 C 9

D

Q 1

E G

7 7 ) 1 H

I B P

7 Q

I

3 C 9

D

Q 1

MySpace Frontline Data Warehouse

R S

T U U T

V W X Y ` W

a b c d e

f g

S h

` i p q r

s t u

v

u x t x

y

t

R S

T U U T

V W X Y ` W

a b c d e

f g

S h

` i p q r

s t u

v

u x t x

y

t

"

!

" ! !

j k l m n

o


Topics




Q&A


7/28/2019 135_PGCon 2009 - Aster v6

5/18

Petabyte Datawarehouse Design

PostgreSQL as a building blockDo not hack PostgreSQL to serve the distributed database

Build on top of mainline PostgreSQL

Use standard Postgres APIs

Service Oriented ArchitectureHierarchical query planning and optimization

Shared nothing replication treating Postgres and Linux as a

service

Compression at the OS level transparently to PostgreSQLIn-database Map-Reduce out-of-process


Aster nCluster Database


Queries/AnswersQueries/Answers

Queen Server Group

QueriesQueries

DataData

Worker Server Group

Loader/Exporter Server Group

Aster nCluster Database

Queen Node u

y y y

v

t u x

v

t u

y y

u

t

u u

s t t

u

v

y

x

y

Worker Node

! " # $ # % & # ' " # 0 " # 1

# 0 2 1

" # 3 0 1 #

"

5

" # 0 1 #

" $ $ 1

1 1 $

# 1

6

7

1 $ "

7

# 8 " #

9

#

Loader Node@

A 6 $ 1

1 A 1 B C D " # E 2

9 F

$ #

G

1 #

"

7

$ 1

1

" 1 0 0 # " 0 # 1

H I

P

#

Q

7

H I

$ $ 1

1 1 # " 8 " #

9

#

7/28/2019 135_PGCon 2009 - Aster v6

6/18

Query Processing: How It Works

A: Users send queries to the Queen (via ACT, ODBC/JDBC, BI tools, etc.)

B: The Queen parses the query, creates an optimal distributed plan, sends subqueriesto the Workers, and supervises the processing

C: Workers execute their subqueries (locally or via distributed communication)

D: The Queen aggregates Worker results and returns the final result to the user

g

7 8 0 8

S

g

V V

U

b

p

S

p

S

7 8 0 8

7 8 0 8

7 8 0 8.

2 !

D

Q

P

"

#

$


Table Compression Architecture

How It Works%

& ' & ( ) 0 1 2 3 4 5 ) ) 5

6 7 6

5 0 1 2 3 4 5 ) ) 5

6 8

5 1 @ 1 A ( 0 &

6

& ' &

8

& ) 5

B

' 1 4 5

6

C

6

( D D 5 4 5 C ' ' &

8

5 ) 3 & 0 5 )

6

5 3 5 C

6

( C A 1 C 0 1 2 3 4 5 ) ) ( 1 C 5 E 5

Architecture Benefits:F

( A G 0 1 2 3 4 5 ) ) ( 1 C 4 & ' ( 1 ) H

8

( A A 5 4

8

1 0

I

) ( P 5 ) Q ( 5

6

2 1 4 5 0 1 ) ' ) & E ( C A )

%

& ' &

8

& ) 5 ' 4 & C ) 3 & 4 5 C ' H 5 & ) 5 1 D D S ' S 4 5 T 1 ) ' A 4 5 ) S 3 A 4 &

6

5 )

U

1 C 0 S 4 4 5 C 0 Q H G ( A G V 3 5 4 D 1 4 2 & C 0 5 2 S ' ( V ' &

8

5 0 1 2 3 4 5 ) ) ( 1 C

T 5 4 D 1 4 2 & C 0 5

Y

5 @ V 4 1 @ ` a S 5 4 ( 5 )

6

1 C b ' C 5 5

6

D S ' &

8

5

6

5 0 1 2 3 4 5 ) ) ( 1 C

Logical Database (eg. Schema)

Physical Tablespacesc

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

d y

c

d e f g h i i

d y

c

d e f g h i i

d y

c

d e f g h i i

d y

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

d

c

d e f g h i i h

r

d

c

d e f g h i i h

r

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

Query

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

q

h

r s t

e

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w

c

d e f g h i i

s v w


7/28/2019 135_PGCon 2009 - Aster v6

7/18

Table Compression Enables Powerful Archival

Compressed - High

Compressed - Medium

Compressed - Low

Not Compressed

Older data accessed less frequently

Compress to save space and cost

Oldest data is compressed the most, recent is

compressed the leastCompressed tables are fully available for

queries (true online archival)


What is MapReduce and Why Should I Care?

What is MapReduce?

Popularized by Google

http://labs.google.com/papers/mapreduce.html

Processes data in parallel across distributed cluster

Why is MapReduce significant?

Empowers ordinary developers

Write application logic, not debug cluster communication code

Why is In-Database MapReduce significant?

Unites MapReduce with SQL: power invoked from SQL

Develop SQL/MR functions with common languages


7/28/2019 135_PGCon 2009 - Aster v6

8/18

Aster In-Database MapReduce

Slide 15

Aster nClusterAster nCluster

Data Store Engine

In-Database MapReduce

SQL/MR Functions


Users, analysts,applications


Extensible framework (MapReduce + SQL)

Flexible: Map-Reduce expressiveness, languages, polymorphism

Performance: Massively parallel, computational push-down

Availability: Fault isolation, resource management

Out-of-process executables

Does not use PL/* for custom code execution

Can execute Map and Reduce functions in any language that has a

runtime on Linux (e.g. Java, Python, C#, C/C++, Ruby, Perl, R,

etc)

Standard PostgreSQL APIs to send/receive data to executables

Fault isolation, security and resource management for arbitrary

user code


7/28/2019 135_PGCon 2009 - Aster v6

9/18

Live QueriesLive Queries

Always On: Minimize Planned Downtime

Backup &Restore

Backup &Restore

DataBackup

DataBackup

Add CapacityAdd CapacityRebalance DataRebalance Data

Load & ExportLoad & Export


Precision Scaling

When more CPU/memory/capacity are needed, new nodes can be added for scale-out.

Precision Scaling uses standard PostgreSQL APIs to migrate vWorker partitions to

new nodes either for load balancing (more compute power) or capacity expansion

Example: Assume Workers 1/2/3 are 100% CPU-bottlenecked. Incorporation adds anew Worker4 node and migrates over vWorker partitions D/H/L. As a result of load-balancing, CPU-utilization drops to 75% per node, eliminating hotspots.

.

!

"

#

$

&

'

$

$


7/28/2019 135_PGCon 2009 - Aster v6

10/18

Replication

! &

H

#

% "

%

# ! #

4

" %

&

"

! )

0 "

)

%

"

'

H

#

4

# #

v w v w

t u t

y

y

v

v

y

v

t u

t u u t

v

u t

u

v

u

v

t u t

x

v

t u

t

v

v

y

u

y

v

v

t

y y

u

y y

v

t x

v

y

!

"

# ) ! & !

"

& 1 " )

%

" # )

( #

"

!

H

#

0 " $

%

u

y

u

v v

t

t

y

v

y

&

t '

u (

v

y

u

u x

& ) 0

v

y

v

t

v

x

v

u

y


Fault Tolerance & Automatic Online Failover

Replication Failover

Automatic, non-disruptive, graceful performance impact

Replication Restoration

Delta-based (fast) and online (non-disruptive)

1

2 3 4 5 3 6

1

2 3 4 5 3 7

1

2 3 4 5 3 8

9 @

5 5 A B

C D E F G

H

I P

$

&

!

"

1

2 3 4 5 3 R

1

2 3 4 5 3 S

!

#

$

"

#

&


7/28/2019 135_PGCon 2009 - Aster v6

11/18

Using Commodity Hardware

.

.

.

v

v

u

v

s

y

t

x u

v

x

v

s t

s

y

&

x

v

s

y y

(

y

v

x

v

t

'

0

'

0

0

' % ' (

y

(

y

u

v

(

0

x u

v

!

"

#

!

#

$

"

! %

" &

#

'

(

&

0

t

u

u

'

t )

y

v

% s

)


Scaling OnScaling On--Demand to aDemand to a PetaBytePetaByte

Commodity Hardware

2 TB Building Block

Dell, HP, IBM, Intel x86

16 GB Memory

2.4TB of Storage

8 Disks

$5k to $10k Node

0

1 2 1

U

3

b a

` i 4 V

5

More Blocks = More Power

Massive Power Per Rack

160 Cores

640 GB RAM

48 TB SAS


7/28/2019 135_PGCon 2009 - Aster v6

12/18

Heterogeneous Hardware Support

Heterogeneous HW support enables customers to addcheaper/faster nodes to existing systems for investment

protectionMix-n-match different servers as you grow (faster CPUs, more

memory, bigger disk drives, different vendors, etc)

...

...

V i ` i

....

CPU

Memory

Disk


Topics




Q&A


7/28/2019 135_PGCon 2009 - Aster v6

13/18

Error Logging in COPY

Separate good from malformed tuples

Per tuple error information

Errors to capture

Type mismatch (e.g. text vs. int)

Check constraints (e.g. int < 0)

Malformed chars (e.g. invalid UTF-8 seq.)

Missing / extra column

Low-performance overhead

Activated on-demand using environment variables


Error Logging in COPY


Detailed error context is logged along with tuple content

7/28/2019 135_PGCon 2009 - Aster v6

14/18

Error Logging Performance

1 million tuples COPY performance


Auto-partitioning in COPY

COPY into a parent table route tuples

directly in the child table with matching

constraints

COPY y2008 FROM data.txt


7/28/2019 135_PGCon 2009 - Aster v6

15/18

Auto-partitioning in loading

Activated on-demand

set tuple_routing_in_copy = 0/1;

Configurable LRU cache size

set tuple_routing_cache_size = 3;

COPY performance into parent table similar to

direct COPY in child table if data is sorted

Will leverage partitioning information in the

future (WIP for 8.5)


Other contributions

Temporary tables and 2PC transactions

[Auto-]Partitioning infrastructure

Regression test suite

LFI (http://lfi.sourceforge.net/)

Fault injection at the library level or below

out-of-memory conditions, network connection errors, interrupted

system calls, data corruption, hardware failures, etc

Lightning talk on Friday!


7/28/2019 135_PGCon 2009 - Aster v6

16/18

Topics

Introduction to Aster and data warehousing



Q&A


PetaByte Warehouses with PostgreSQL

Unmodified PostgreSQL

Always Parallel MPP architecture

Always On on-line operations


PetaByte on commodity Hardware


7/28/2019 135_PGCon 2009 - Aster v6

17/18

Learn morewww.asterdata.com

Free TDWI report on advanced analytics:

asterdata.com/mapreduce

Free Gartner webcast on mission-critical DW:

asterdata.com/gartner

Contact us

[email protected]

Aster Data Systems


Bonus slides


7/28/2019 135_PGCon 2009 - Aster v6

18/18

nCluster Components

I G

( 0 8 9

S

g

V V

U

b

p q

a

q

T

W

T

W !

h

i V ! i q "

b

# $ " # $ % $ " #

&

)

2

G 0

1

2

3

C

D

C 4 8 0

A

C 8 C

D

7 3

3

C

D

C 4 8 0

A

C 8 C

D

7 3

2 5

A

7 8

D

2 5

A

7 8

D

nCluster

6

7

8

9

"

@


Aster Loader

AV q

5

S

q

U

q W B

T

W !

C D E

G H I G P Q R S T U V R

W

R X U

W

X Y S U R

W

U R V R

W

R

W

S R ` X

a

U S b U S

a c

S d R ` Y U e

c

W W

U ` U Y

g

i p

F q r s

p

H P t u

W

b U Q

c

R V U S X R S U d R b b U V

W

c w

x

c

S

g

U S X

c

`

x

c

S

g

U S `

c

V U X

y

a

W

U S b R S

W W

c

` ` T Q

c

R V U S

b R S R U

c

R V u ` u U V R

W

R `

W

c w

x

c

S

g

U S X

G H G

r

Q ` U R S X Y R R e U

c

R V b U S

a c

S d R ` Y U

a

S

c

d Q

c

R V U S X

W

c

x

c

S

g

U S X

Scalable PartitioningC D E

G H I G P R S

W W

c

` ` T d u X

W

e U X Y R R e U R ` V `

c

W

d b U V U

W

U

c

R V ` T b S

c

Y U X X

i p

F q r s

p

H P R Y

Q

c

R V U S Y

c

`

W

R ` X R R S

W W

c

` U S

Y

u X U X R ` R T

c

S

W

d

W

c

R X X T ` V R

W

R `

W

c

e u Y

g

U

W

X R Y

e u Y

g

U

W

X u ` u U d R b b U V

W

c

W

`

W

U Y u X

W

U S

G H G

r

P R X

W

`

W

U T U `

W

b R S

W W

c

` ` T V u S ` T d R X X

w

U X Y R U V R

W

R

c

R V X

q 3

U

a

q W

5 U T

W !

C D E

G H I G P R U V `

c

V U X d R

c

X U V R

W

R

c

S X T `

a

Y R `

W

V S

c

b

c

R V b U S

a c

S d R ` Y U

i p

F q r s

p

H

a

R `

c

V U

a

R X

c

W

U S X

W

S U R d X Y

c

`

W

` u U

W

c c

R V

b U S

a c

S d R ` Y U

W

y

V d ` X Y R ` U R X

S U Y

c w

U S

c

X

W

V R

W

R e S U

c

R V ` T u

g

U U V U S V R

W

R

G H G

r

R

W

R

c

X X b S

c

W

U Y

W

c

` b U S

a c

S d R ` Y U Y

c

` X X

W

U ` Y V u S ` T `

c

V U

a

R u S U


135_pgcon 2009 - aster v6

Documents