PS1 PSPS Object Data Manager Design
slide 1
PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 2
Detail Design
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 3
Zones (spatial partitioning and indexing algorithm)
Partition and bin the data into declination zones
• ZoneID = floor((dec + 90.0) / zoneHeight)
• Few tricks required to handle spherical geometry
Place the data close on disk
• Cluster index on ZoneID and RA
Fully implemented in SQL
Efficient
• Nearby searches
• Cross-Match (especially)
Fundamental role in addressing the critical requirements
• Data volume management
• Association speed
• Spatial capabilities
Zones
[Figure: sky map illustrating the declination zones; axes are Right Ascension (RA) and Declination (Dec)]
slide 4
Zoned Table
ObjID   ZoneID*   RA      Dec     CX  CY  CZ  …
1           0       0.0   -90.0
2       20250     180.0     0.0
3       20250     181.0     0.0
4       40500     360.0   +90.0

* ZoneHeight = 8 arcsec in this example
ZoneID = floor((dec + 90.0) / zoneHeight)
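As a rough illustration (not the production loader code), the ZoneID column of such a table can be populated directly from the formula above; the table and column names here are hypothetical:

-- Hypothetical sketch: assign each object to a declination zone.
-- @zoneHeight is in degrees (8 arcsec = 8.0/3600 degrees in the example above).
DECLARE @zoneHeight float;
SET @zoneHeight = 8.0 / 3600.0;

UPDATE ZonedTable
SET    zoneID = FLOOR((dec + 90.0) / @zoneHeight);

-- A clustered index on (zoneID, ra) then places nearby objects close on disk:
-- CREATE CLUSTERED INDEX cidx_zone ON ZonedTable (zoneID, ra);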
slide 5
SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2  ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
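The query assumes a small helper table (here called zoneZone, as in the JOIN above) that lists, for each zone, the zones within the search radius and the RA margin (alpha) to use at that declination. A minimal sketch of such a table, with hypothetical column types, is:

-- Hypothetical zoneZone helper table for the cross-match query above.
-- For each pair of zones within the search radius @r, alpha is the RA
-- half-width to search, widened toward the poles where RA circles shrink.
CREATE TABLE zoneZone (
    zoneID1 int   NOT NULL,
    zoneID2 int   NOT NULL,
    alpha   float NOT NULL,     -- RA margin in degrees for this zone pair
    PRIMARY KEY (zoneID1, zoneID2)
);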
slide 6
Good CPU Usage
slide 7
Partitions
SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.
Partitioning makes management and access of large tables and indexes more efficient
• Enables parallel I/O
• Reduces the amount of data that needs to be accessed
• Related tables can be aligned and collocated in the same place, speeding up JOINs
slide 8
Partitions
2 key elements
• Partition function
  – Specifies how the table or index is partitioned
• Partition scheme
  – Using a partition function, the scheme specifies the placement of the partitions on filegroups
Data can be managed very efficiently using partition switching (see the sketch below)
• Add a table as a partition to an existing table
• Switch a partition from one partitioned table to another
• Reassign a partition to form a single table
Main requirement
• The table must be constrained on the partitioning column
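A minimal T-SQL sketch of these pieces, with hypothetical object names and ObjectID boundaries (illustrative only, not the PS1 DDL):

-- Hypothetical partition function and scheme on ObjectID ranges.
CREATE PARTITION FUNCTION pfObjID (bigint)
    AS RANGE RIGHT FOR VALUES (100000000000000, 200000000000000, 300000000000000);

CREATE PARTITION SCHEME psObjID
    AS PARTITION pfObjID TO (fg1, fg2, fg3, fg4);   -- one filegroup per partition

CREATE TABLE Objects (
    objID bigint NOT NULL PRIMARY KEY,
    ra    float  NOT NULL,
    dec   float  NOT NULL
) ON psObjID (objID);

-- Partition switching: move a fully loaded staging table in as partition 2.
-- The staging table must live on the same filegroup and carry a CHECK
-- constraint on objID that matches the partition boundaries.
ALTER TABLE Objects_stage SWITCH TO Objects PARTITION 2;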
slide 9
Partitions
For the PS1 design,
• Partitions mean filegroup partitions
• Tables are partitioned into ranges of ObjectID, which correspond to declination ranges
• ObjectID boundaries are selected so that each partition has a similar number of objects (see the sketch below)
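One way to pick such boundaries, sketched here under the assumption that ObjectID ordering follows declination (ZoneID is the most significant part of the ID) and with hypothetical table names, is to split the sorted ObjectIDs into equal-count buckets:

-- Hypothetical sketch: ObjectID boundaries that give 4 equal-sized partitions.
SELECT MAX(objID) AS boundary
FROM (
    SELECT objID, NTILE(4) OVER (ORDER BY objID) AS bucket
    FROM   Objects
) t
GROUP BY bucket
ORDER BY bucket;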
slide 10
Distributed Partitioned Views
Tables participating in a Distributed Partitioned View (DPV) reside in different databases, which in turn reside on different instances or different (linked) servers
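A minimal sketch of such a view (hypothetical linked-server, database, and table names) simply UNIONs the member tables across linked servers:

-- Hypothetical distributed partitioned view gluing slice tables together.
-- Each member table carries a CHECK constraint on objID so the optimizer
-- can eliminate remote partitions that cannot match a query.
CREATE VIEW Detections AS
    SELECT * FROM SliceSvr1.PS1_p1.dbo.Detections_p1
    UNION ALL
    SELECT * FROM SliceSvr2.PS1_p2.dbo.Detections_p2
    UNION ALL
    SELECT * FROM SliceSvr3.PS1_p3.dbo.Detections_p3;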
slide 11
Concept: Slices
In the PS1 design, the bigger tables will be partitioned across servers
To avoid confusion with the File Group Partitioning, we call them “Slices”
Data is glued together using Distributed Partitioned Views
The ODM will manage slices. Using slices improves system scalability.
For the PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.
ObjectID boundaries are selected so that each slice has a similar number of objects.
slide 12
Detail Design Outline
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 13
PS1 Distributed DB system
[Diagram: the PS1 database on the main server holds the full PartitionsMap, Objects, LnkToObj, and Meta tables plus a Detections partitioned view; slice databases P1 … Pm hold the partitioned tables [Objects_p1 … pm], [LnkToObj_p1 … pm], [Detections_p1 … pm] and Meta; loader databases LoadAdmin and LoadSupport1 … LoadSupportn hold objZoneIndx, orphans, Detections, and LnkToObj staging tables. Servers are connected as linked servers and queried through the Query Manager (QM) and Web Based Interface (WBI).
Legend: database; full table; [partitioned table]; output table; partitioned view.]
slide 14
Design Decisions: ObjID
Objects have their positional information encoded in their objID
• fGetPanObjID (ra, dec, zoneH)
• ZoneID is the most significant part of the ID
It gives scalability, performance, and spatial functionality
Object tables are range partitioned according to their object ID
slide 15
ObjectID Clusters Data Spatially
ObjectID = 087941012871550661
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
RA = 101.287155
ObjectID is unique when objects are separated by >0.0043 arcsec
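From the worked example above, the encoding appears to pack the zone number, the RA in microdegrees, and the fractional offset within the zone into one 18-digit integer. The following is a hedged reconstruction of fGetPanObjID inferred from this single example, not the actual PSPS code:

-- Hypothetical reconstruction of fGetPanObjID from the example on this slide:
-- ObjectID = zoneID * 1e13 + round(ra * 1e6) * 1e4 + round(fraction-of-zone * 1e4)
CREATE FUNCTION fGetPanObjID (@ra float, @dec float, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zid float;
    SET @zid = (@dec + 90.0) / @zoneH;                             -- e.g. 8794.0661
    RETURN CAST(FLOOR(@zid) AS bigint) * CAST(10000000000000 AS bigint)  -- zone number
         + CAST(ROUND(@ra * 1000000.0, 0) AS bigint) * 10000             -- RA in microdegrees
         + CAST(ROUND((@zid - FLOOR(@zid)) * 10000.0, 0) AS bigint);      -- offset within the zone
END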
slide 16
Design Decisions: DetectID
Detections have their positional information encoded in the detection identifier
• fGetDetectID (dec, observationID, runningID, zoneH)
• Primary key (objID, detectionID), to align detections with objects within partitions
• Provides efficient access to all detections associated to one object
• Provides efficient access to all detections of nearby objects
slide 17
DetectionID Clusters Data in Zones
DetectID = 0879410500001234567
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
ObservationID = 1050000
Running ID = 1234567
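Analogously to the ObjectID, the example above suggests the DetectID packs the zone number, the observation identifier, and a running number. A hedged reconstruction, again inferred only from this example:

-- Hypothetical reconstruction of fGetDetectID from the example on this slide:
-- DetectID = zoneID * 1e14 + observationID * 1e7 + runningID
CREATE FUNCTION fGetDetectID
    (@dec float, @observationID bigint, @runningID bigint, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zoneID bigint;
    SET @zoneID = FLOOR((@dec + 90.0) / @zoneH);                   -- e.g. 8794
    RETURN @zoneID * CAST(100000000000000 AS bigint)               -- zone number
         + @observationID * 10000000                               -- e.g. 1050000
         + @runningID;                                             -- e.g. 1234567
END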
slide 18
ODM Capacity
5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
• 1.5×10^11 P2 detections
• 8.3×10^10 cumulative sky (stack) detections
• 5.5×10^9 celestial objects
together with their linkages.
slide 19
PS1 Table Sizes - Monolithic
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits       5.07    10.16    15.20    17.74
StackToObj         0.92     1.84     2.76     3.22
StackModelFits     1.15     2.29     3.44     4.01
P2PsfFits          7.87    15.74    23.61    27.54
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       3.19     6.03     8.87    10.29
Indexes +20%       4.37     8.21    12.04    13.96
Total             26.21    49.24    72.23    83.74
Sizes are in TB
slide 20
What goes into the main server
[Diagram: the main PS1 database keeps the full Objects, LnkToObj, Meta, and PartitionsMap tables; slices P1 … Pm are reached through linked servers.
Legend: database; full table; [partitioned table]; output table; distributed partitioned view.]
slide 21
What goes into slices
[Diagram: each slice database P1 … Pm keeps its partitioned tables [Objects_pi], [LnkToObj_pi], [Detections_pi] plus local copies of PartitionsMap and Meta; the main PS1 database keeps the full PartitionsMap, Objects, LnkToObj, and Meta tables, connected through linked servers.
Legend: database; full table; [partitioned table]; output table; distributed partitioned view.]
slide 23
Duplication of Objects & LnkToObj
Objects are distributed across slices Objects, P2ToObj, and StackToObj are duplicated in the
slices to parallelize “inserts” & “updates” Detections belong into their object’s slice Orphans belong to the slice where their position would
allocate them• Orphans near slices’ boundaries will need special
treatment Objects keep their original object identifier
• Even though positional refinement might change their zoneID and therefore the most significant part of their identifier
slide 24
Glue = Distributed Views
[Diagram: a Detections distributed partitioned view on the main PS1 database unions the partitioned tables [Objects_p1 … pm], [LnkToObj_p1 … pm], [Detections_p1 … pm] held in slices P1 … Pm over linked servers, alongside the full PartitionsMap, Objects, LnkToObj, and Meta tables.
Legend: database; full table; [partitioned table]; output table; distributed partitioned view.]
slide 25
Partitioning in Main Server
Main server is partitioned (Objects) and collocated (LnkToObj) by objID
Slices are partitioned (Objects) and collocated (LnkToObj) by objID
[Diagram: the Query Manager (QM) and Web Based Interface (WBI) sit in front of the PS1 database and the slices P1 … Pm over linked servers.]
slide 26
PS1 Table Sizes - Main Server
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits          –        –        –        –
StackToObj         0.92     1.84     2.76     3.22
StackModelFits        –        –        –        –
P2PsfFits             –        –        –        –
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       0.41     0.46     0.52     0.55
Indexes +20%       0.99     1.46     1.92     2.15
Total              5.96     8.74    11.51    12.90
Sizes are in TB (– = not stored on the main server)
slide 27
PS1 Table Sizes - Each Slice
Table            Year 1   Year 2   Year 3   Year 3.5
                 (m=4)    (m=8)    (m=10)   (m=12)
Objects            0.58     0.29     0.23     0.19
StackPsfFits       1.27     1.27     1.52     1.48
StackToObj         0.23     0.23     0.28     0.27
StackModelFits     0.29     0.29     0.34     0.33
P2PsfFits          1.97     1.97     2.36     2.30
P2ToObj            0.33     0.33     0.40     0.39
Other Tables       0.75     0.81     1.00     1.01
Indexes +20%       1.08     1.04     1.23     1.19
Total              6.50     6.23     7.36     7.16
Sizes are in TB (m = number of slices)
slide 28
PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes +20%       5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB
slide 29
Detail Design Outline
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 30
PS1 Distributed DB system
[Diagram: the distributed DB system from slide 13, repeated as context for ingest: loader databases (LoadAdmin, LoadSupport1 … LoadSupportn) stage objZoneIndx, orphans, Detections, and LnkToObj tables and feed the main PS1 database and the slices P1 … Pm over linked servers, behind the Query Manager (QM) and Web Based Interface (WBI).
Legend: database; full table; [partitioned table]; output table; partitioned view.]
slide 31
“Insert” & “Update”
SQL Insert and Update are expensive operations due to logging and re-indexing
In the PS1 design, Insert and Update have been re-factored into sequences of:
Merge + Constrain + Switch Partition
Frequency
• f1: daily
• f2: at least monthly
• f3: TBD (likely to be every 6 months)
slide 32
Ingest Workflow
[Diagram: CSV detections are zoned (DZone) and cross-matched (X) against ObjectsZ at 1" (DXO_1a); detections with no match are cross-matched again at 2" (DXO_2a); a Resolve step then produces the P2PsfFits, P2ToObj, and Orphans tables.]
slide 33
Ingest @ frequency = f1
[Diagram: the LOADER builds ObjectsZ, P2PsfFits, P2ToObj, and Orphans; SLICE_1 receives Orphans_1, P2ToPsfFits_1, and P2ToObj_1 next to its Objects_1 and Stack*_1 partitions; MAIN holds Metadata+, Objects, P2ToObj, and StackToObj.]
slide 34
Updates @ frequency = f2
[Diagram: same layout as slide 33 (LOADER, SLICE_1 with Orphans_1, P2ToPsfFits_1, P2ToObj_1, Objects_1, Stack*_1, and MAIN with Metadata+, Objects, P2ToObj, StackToObj); at frequency f2 the Objects table is refreshed.]
slide 35
Updates @ frequency = f2
[Diagram: continuation of the f2 update; refreshed Objects flow from the LOADER into Objects_1 on the slice and into Objects on MAIN.]
slide 36
Snapshots @ frequency = f3
[Diagram: at frequency f3 a snapshot of the Objects table is taken on MAIN, next to Metadata+, P2ToObj, and StackToObj.]
slide 37
Batch Update of a Partition
[Diagram: partitions A1, A2, A3 are merged into a single staging table with SELECT INTO; SELECT INTO … WHERE then splits the merged data into B1, B2, B3, each of which gets a primary-key index; finally each Bi is switched back in as a partition.]
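A hedged T-SQL sketch of this merge + constrain + switch sequence for one partition, with hypothetical names and boundary values (not the PS1 production scripts):

-- Hypothetical boundaries: partition 2 holds 100000000000000 <= objID < 200000000000000.
-- The staging table must live on the same filegroup as partition 2.

-- 1. Merge: pull the existing partition plus the new detections into a staging table.
SELECT *
INTO   B2_stage
FROM   merged                        -- existing partition A2 plus the new rows
WHERE  objID >= 100000000000000 AND objID < 200000000000000;

-- 2. Constrain: add the primary-key index and the CHECK constraint that
--    matches the partition boundaries (required before switching in).
ALTER TABLE B2_stage
  ADD CONSTRAINT pk_B2 PRIMARY KEY (objID, detectID);
ALTER TABLE B2_stage
  ADD CONSTRAINT ck_B2 CHECK (objID >= 100000000000000 AND objID < 200000000000000);

-- 3. Switch: metadata-only operations that swap the staging table in as partition 2.
ALTER TABLE Detections_p1 SWITCH PARTITION 2 TO Old_B2;        -- move the old data out
ALTER TABLE B2_stage      SWITCH TO Detections_p1 PARTITION 2; -- move the new data in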
slide 38
Scaling-out
Apply a ping-pong strategy to satisfy query performance during ingest:
2 × (1 main + m slices)
[Diagram: two copies of the full system; each PS1 main database (with PartitionsMap, Objects, LnkToObj, Meta, and the Detections view) and its slices P1 … Pm (with [Objects_pi], [LnkToObj_pi], [Detections_pi]) sit behind the Query Manager (QM) over linked servers.
Legend: database / duplicate database; full table; [partitioned table]; partitioned view / duplicate partitioned view.]
slide 39
Scaling-out
More robustness, fault-tolerance, and reliability call for:
3 × (1 main + m slices)
[Diagram: as on the previous slide, with a third copy of the main database and its slices.]
slide 40
Adding New Slices
SQL Server range partitioning capabilities make it easy (see the sketch below)
• Recalculate partitioning limits
• Transfer data to new slices
• Remove data from slices
• Define and apply new partitioning schema
• Add new partitions to main server
• Apply new partitioning schema to main server
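A hedged sketch of the repartitioning step on the main server, continuing the hypothetical names from the earlier partition-scheme sketch (the real limits would come from the recalculated ObjectID ranges):

-- Point the partition scheme at the filegroup that will hold the new range,
-- then split the existing range at the new ObjectID boundary.
ALTER PARTITION SCHEME psObjID NEXT USED fg_new;
ALTER PARTITION FUNCTION pfObjID() SPLIT RANGE (250000000000000);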
slide 41
Adding New Slices
slide 42
Detail Design Outline
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 44
ODM Ingest Performance
5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate*
* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.
Nominal daily data rate:
• 1.5×10^11 / 3.5 / 365 = 1.2×10^8 P2 detections / day
• 8.3×10^10 / 3.5 / 365 = 6.5×10^7 stack detections / day
slide 45
Number of Objects
                 miniProto    myProto      Prototype    PS1
SDSS* Stars      5.7 × 10^4   1.3 × 10^7   1.1 × 10^8
SDSS* Galaxies   9.1 × 10^4   1.1 × 10^7   1.7 × 10^8
Galactic Plane   1.5 × 10^6   3 × 10^6     1.0 × 10^9
TOTAL            1.6 × 10^6   2.6 × 10^7   1.3 × 10^9   5.5 × 10^9
* “SDSS” includes a mirror of the 11.3 < Dec < 30 objects to Dec < 0
Total GB of CSV loaded data: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation started: October 15th 2007; finished: October 29th 2007 (??)
Includes
• 10 epochs of P2PsfFits detections
• 1 epoch of Stack detections
slide 47
Prototype in Context
Survey                Objects      Detections
SDSS DR6              3.8 × 10^8
2MASS                 4.7 × 10^8
USNO-B                1.0 × 10^9
Prototype             1.3 × 10^9   1.4 × 10^10
PS1 (end of survey)   5.5 × 10^9   2.3 × 10^11
slide 48
Size of Prototype Database
Table            Main    Slice1  Slice2  Slice3  Loader   Total
Objects           1.30    0.43    0.43    0.43    1.30     3.89
StackPsfFits      6.49       –       –       –       –     6.49
StackToObj        6.49       –       –       –       –     6.49
StackModelFits    0.87       –       –       –       –     0.87
P2PsfFits            –    4.02    3.90    3.35    0.37    11.64
P2ToObj              –    4.02    3.90    3.35    0.12    11.39
Total            15.15    8.47    8.23    7.13    1.79    40.77
Extra Tables      0.87    4.89    4.77    4.22    6.86    21.61
Grand Total      16.02   13.36   13.00   11.35    8.65    62.38
Table sizes are in billions of rows
slide 49
Size of Prototype Database
Table             Main    Slice1  Slice2  Slice3  Loader    Total
Objects           547.6   165.4   165.3   165.3   137.1    1180.6
StackPsfFits      841.5       –       –       –       –     841.6
StackToObj        300.9       –       –       –       –     300.9
StackModelFits    476.7       –       –       –       –     476.7
P2PsfFits             –   879.9   853.0   733.5    74.7    2541.1
P2ToObj               –   125.7   121.9   104.8     3.8     356.2
Total            2166.7  1171.0  1140.2  1003.6   215.6    5697.1
Extra Tables      207.9   987.1   960.2   840.7   957.3    3953.2
Allocated / Free 1878.0  1223.0  1300.0  1121.0   666.0    6188.0
Grand Total      4252.6  3381.1  3400.4  2965.3  1838.9   15838.3
Table sizes are in GB
9.6 TB of data in a distributed database
slide 50
Well-Balanced Partitions
Server    Partition   Rows          Fraction   Dec Range
Main      1           432,590,598   33.34%     32.59
Slice 1   1           144,199,105   11.11%     14.29
Slice 1   2           144,229,343   11.11%      9.39
Slice 1   3           144,162,150   11.12%      8.91
Main      2           432,456,511   33.33%     23.44
Slice 2   1           144,261,098   11.12%      8.46
Slice 2   2           144,073,972   11.10%      7.21
Slice 2   3           144,121,441   11.11%      7.77
Main      3           432,496,648   33.33%     81.98
Slice 3   1           144,270,093   11.12%     11.15
Slice 3   2           144,090,071   11.10%     14.72
Slice 3   3           144,136,484   11.11%     56.10
slide 51
Ingest and Association Times
Task                               Measured Minutes
Create Detections Zone Table              39.62
X(0.2") 121M × 1.3B                       65.25
Build #noMatches Table                     1.50
X(1") 12k × 1.3B                           0.65
Build #allMatches Table (121M)             6.58
Build Orphans Table                        0.17
Create P2PsfFits Table                    11.63
Create P2ToObj Table                      14.00
Total of Measured Times                  140.40
slide 52
Ingest and Association Times
Task                               Estimated Minutes
Compute DetectionID, HTMID                30
Remove NULLs                              15
Index P2PsfFits on ObjID                  15
Slices Pulling Data from Loader            5
Resolve 1 Detection - N Objects           10
Total of Estimated Times                  75
(Estimates range from educated guesses to wild guesses)
slide 53
Total Time to I/A daily Data
Task                              Time (hours, binary)   Time (hours, CSV)
Ingest 121M Detections (binary)          0.32
Ingest 121M Detections (CSV)                                    0.98
Total of Measured Times                  2.34                   2.34
Total of Estimated Times                 1.25                   1.25
Total Time to I/A Daily Data             3.91                   4.57
Requirement: less than 12 hours (more than 2800 detections / s)
Detection processing rate: 8600 to 7400 detections / s
Margin on requirement: 3.1 to 2.6
Using multiple loaders would improve performance
slide 54
Insert Time @ slices
Task                               Estimated Minutes
Import P2PsfFits (binary out/in)          20.45
Import P2PsfFits (binary out/in)           2.68
Import Orphans                             0.00
Merge P2PsfFits                           58
Add constraint P2PsfFits                 193
Merge P2ToObj                             13
Add constraint P2ToObj                    54
Total of Measured Times                  362
6 h with 8 partitions/slice (~1.3 × 10^9 detections/partition)
(Educated guess)
slide 55
Detections Per Partition
Years   Total Detections   Slices   Partitions per Slice   Total Partitions   Detections per Partition
0.0     0.00                  4              8                    32           0.00
1.0     4.29 × 10^10          4              8                    32           1.34 × 10^9
1.0     4.29 × 10^10          8              8                    64           6.7 × 10^8
2.0     8.57 × 10^10          8              8                    64           1.34 × 10^9
2.0     8.57 × 10^10         10              8                    80           1.07 × 10^9
3.0     1.29 × 10^11         10              8                    80           1.61 × 10^9
3.0     1.29 × 10^11         12              8                    96           1.34 × 10^9
3.5     1.50 × 10^11         12              8                    96           1.56 × 10^9
slide 56
Total Time for Insert @ slice
Task                           Time (hours)
Total of Measured Times            0.25
Total of Estimated Times           5.3
Total Time for daily insert        6
Daily insert may operate in parallel with daily ingest and association.
Requirement: less than 12 hours
Margin on requirement: 2.0
Using more slices will improve insert performance.
slide 57
Summary
Ingest + Association < 4 h using 1 loader (@f1 = daily)
• Scales with the number of servers
• Current margin on requirement: 3.1
• Room for improvement
Detection insert @ slices (@f1 = daily)
• 6 h with 8 partitions/slice
• May happen in parallel with loading
Detection links insert @ main (@f2 < monthly)
• Unknown
• 6 h available
Objects insert & update @ slices (@f2 < monthly)
• Unknown
• 6 hours available
Objects update @ main server (@f2 < monthly)
• Unknown
• 12 h available; transfer can be pipelined as soon as objects have been processed
slide 58
Risks
Estimates of Insert & Update at slices could be underestimated
• Need more empirical evaluation exercising parallel I/O
Estimates and layout of disk storage could be underestimated
• Merges and indexes require 2× the data size