PS1 PSPS Object Data Manager Design
slide 1
PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 2
Detail Design
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 3
Zones (spatial partitioning and indexing algorithm)
Partition and bin the data into declination zones
• ZoneID = floor((dec + 90.0) / zoneHeight)
• Few tricks required to handle spherical geometry
Place the data close on disk
• Cluster index on ZoneID and RA
Fully implemented in SQL
Efficient
• Nearby searches
• Cross-Match (especially)
Fundamental role in addressing the critical requirements
• Data volume management
• Association speed
• Spatial capabilities
Zones
[Figure: sky map illustrating the declination zones; axes are Right Ascension (RA) and Declination (Dec)]
slide 4
Zoned Table
ObjID   ZoneID*   RA      Dec     CX  CY  CZ  …
1           0       0.0   -90.0
2       20250     180.0     0.0
3       20250     181.0     0.0
4       40500     360.0   +90.0

* ZoneHeight = 8 arcsec in this example
ZoneID = floor((dec + 90.0) / zoneHeight)
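As a rough illustration (not the production loader code), the ZoneID column of such a table can be populated directly from the formula above; the table and column names here are hypothetical:

-- Hypothetical sketch: assign each object to a declination zone.
-- @zoneHeight is in degrees (8 arcsec = 8.0/3600 degrees in the example above).
DECLARE @zoneHeight float;
SET @zoneHeight = 8.0 / 3600.0;

UPDATE ZonedTable
SET    zoneID = FLOOR((dec + 90.0) / @zoneHeight);

-- A clustered index on (zoneID, ra) then places nearby objects close on disk:
-- CREATE CLUSTERED INDEX cidx_zone ON ZonedTable (zoneID, ra);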
slide 5
SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2  ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
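The query assumes a small helper table (here called zoneZone, as in the JOIN above) that lists, for each zone, the zones within the search radius and the RA margin (alpha) to use at that declination. A minimal sketch of such a table, with hypothetical column types, is:

-- Hypothetical zoneZone helper table for the cross-match query above.
-- For each pair of zones within the search radius @r, alpha is the RA
-- half-width to search, widened toward the poles where RA circles shrink.
CREATE TABLE zoneZone (
    zoneID1 int   NOT NULL,
    zoneID2 int   NOT NULL,
    alpha   float NOT NULL,     -- RA margin in degrees for this zone pair
    PRIMARY KEY (zoneID1, zoneID2)
);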
slide 6
Good CPU Usage
slide 7
Partitions
SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.
Partitioning makes management and access of large tables and indexes more efficient
• Enables parallel I/O
• Reduces the amount of data that needs to be accessed
• Related tables can be aligned and collocated in the same place, speeding up JOINs
slide 8
Partitions
2 key elements
• Partition function
  – Specifies how the table or index is partitioned
• Partition scheme
  – Using a partition function, the scheme specifies the placement of the partitions on filegroups
Data can be managed very efficiently using partition switching (see the sketch below)
• Add a table as a partition to an existing table
• Switch a partition from one partitioned table to another
• Reassign a partition to form a single table
Main requirement
• The table must be constrained on the partitioning column
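A minimal T-SQL sketch of these pieces, with hypothetical object names and ObjectID boundaries (illustrative only, not the PS1 DDL):

-- Hypothetical partition function and scheme on ObjectID ranges.
CREATE PARTITION FUNCTION pfObjID (bigint)
    AS RANGE RIGHT FOR VALUES (100000000000000, 200000000000000, 300000000000000);

CREATE PARTITION SCHEME psObjID
    AS PARTITION pfObjID TO (fg1, fg2, fg3, fg4);   -- one filegroup per partition

CREATE TABLE Objects (
    objID bigint NOT NULL PRIMARY KEY,
    ra    float  NOT NULL,
    dec   float  NOT NULL
) ON psObjID (objID);

-- Partition switching: move a fully loaded staging table in as partition 2.
-- The staging table must live on the same filegroup and carry a CHECK
-- constraint on objID that matches the partition boundaries.
ALTER TABLE Objects_stage SWITCH TO Objects PARTITION 2;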
slide 9
Partitions
For the PS1 design,
• Partitions mean filegroup partitions
• Tables are partitioned into ranges of ObjectID, which correspond to declination ranges
• ObjectID boundaries are selected so that each partition has a similar number of objects (see the sketch below)
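One way to pick such boundaries, sketched here under the assumption that ObjectID ordering follows declination (ZoneID is the most significant part of the ID) and with hypothetical table names, is to split the sorted ObjectIDs into equal-count buckets:

-- Hypothetical sketch: ObjectID boundaries that give 4 equal-sized partitions.
SELECT MAX(objID) AS boundary
FROM (
    SELECT objID, NTILE(4) OVER (ORDER BY objID) AS bucket
    FROM   Objects
) t
GROUP BY bucket
ORDER BY bucket;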
slide 10
Distributed Partitioned Views
Tables participating in a Distributed Partitioned View (DPV) reside in different databases, which in turn reside on different instances or different (linked) servers
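A minimal sketch of such a view (hypothetical linked-server, database, and table names) simply UNIONs the member tables across linked servers:

-- Hypothetical distributed partitioned view gluing slice tables together.
-- Each member table carries a CHECK constraint on objID so the optimizer
-- can eliminate remote partitions that cannot match a query.
CREATE VIEW Detections AS
    SELECT * FROM SliceSvr1.PS1_p1.dbo.Detections_p1
    UNION ALL
    SELECT * FROM SliceSvr2.PS1_p2.dbo.Detections_p2
    UNION ALL
    SELECT * FROM SliceSvr3.PS1_p3.dbo.Detections_p3;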
slide 11
Concept: Slices
In the PS1 design, the bigger tables will be partitioned across servers
To avoid confusion with the File Group Partitioning, we call them “Slices”
Data is glued together using Distributed Partitioned Views
The ODM will manage slices. Using slices improves system scalability.
For the PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.
ObjectID boundaries are selected so that each slice has a similar number of objects.
slide 12
Detail Design Outline
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 13
PS1 Distributed DB system
[Diagram: the PS1 database on the main server holds the full PartitionsMap, Objects, LnkToObj, and Meta tables plus a Detections partitioned view; slice databases P1 … Pm hold the partitioned tables [Objects_p1 … pm], [LnkToObj_p1 … pm], [Detections_p1 … pm] and Meta; loader databases LoadAdmin and LoadSupport1 … LoadSupportn hold objZoneIndx, orphans, Detections, and LnkToObj staging tables. Servers are connected as linked servers and queried through the Query Manager (QM) and Web Based Interface (WBI).
Legend: database; full table; [partitioned table]; output table; partitioned view.]
slide 14
Design Decisions: ObjID
Objects have their positional information encoded in their objID
• fGetPanObjID (ra, dec, zoneH)
• ZoneID is the most significant part of the ID
It gives scalability, performance, and spatial functionality
Object tables are range partitioned according to their object ID
slide 15
ObjectID Clusters Data Spatially
ObjectID = 087941012871550661
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
RA = 101.287155
ObjectID is unique when objects are separated by >0.0043 arcsec
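From the worked example above, the encoding appears to pack the zone number, the RA in microdegrees, and the fractional offset within the zone into one 18-digit integer. The following is a hedged reconstruction of fGetPanObjID inferred from this single example, not the actual PSPS code:

-- Hypothetical reconstruction of fGetPanObjID from the example on this slide:
-- ObjectID = zoneID * 1e13 + round(ra * 1e6) * 1e4 + round(fraction-of-zone * 1e4)
CREATE FUNCTION fGetPanObjID (@ra float, @dec float, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zid float;
    SET @zid = (@dec + 90.0) / @zoneH;                             -- e.g. 8794.0661
    RETURN CAST(FLOOR(@zid) AS bigint) * CAST(10000000000000 AS bigint)  -- zone number
         + CAST(ROUND(@ra * 1000000.0, 0) AS bigint) * 10000             -- RA in microdegrees
         + CAST(ROUND((@zid - FLOOR(@zid)) * 10000.0, 0) AS bigint);      -- offset within the zone
END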
slide 16
Design Decisions: DetectID
Detections have their positional information encoded in the detection identifier
• fGetDetectID (dec, observationID, runningID, zoneH)
• Primary key (objID, detectionID), to align detections with objects within partitions
• Provides efficient access to all detections associated to one object
• Provides efficient access to all detections of nearby objects
slide 17
DetectionID Clusters Data in Zones
DetectID = 0879410500001234567
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
ObservationID = 1050000
Running ID = 1234567
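Analogously to the ObjectID, the example above suggests the DetectID packs the zone number, the observation identifier, and a running number. A hedged reconstruction, again inferred only from this example:

-- Hypothetical reconstruction of fGetDetectID from the example on this slide:
-- DetectID = zoneID * 1e14 + observationID * 1e7 + runningID
CREATE FUNCTION fGetDetectID
    (@dec float, @observationID bigint, @runningID bigint, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zoneID bigint;
    SET @zoneID = FLOOR((@dec + 90.0) / @zoneH);                   -- e.g. 8794
    RETURN @zoneID * CAST(100000000000000 AS bigint)               -- zone number
         + @observationID * 10000000                               -- e.g. 1050000
         + @runningID;                                             -- e.g. 1234567
END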
slide 18
ODM Capacity
5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
• 1.5×10^11 P2 detections
• 8.3×10^10 cumulative sky (stack) detections
• 5.5×10^9 celestial objects
together with their linkages.
slide 19
PS1 Table Sizes - Monolithic
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits       5.07    10.16    15.20    17.74
StackToObj         0.92     1.84     2.76     3.22
StackModelFits     1.15     2.29     3.44     4.01
P2PsfFits          7.87    15.74    23.61    27.54
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       3.19     6.03     8.87    10.29
Indexes +20%       4.37     8.21    12.04    13.96
Total             26.21    49.24    72.23    83.74
Sizes are in TB
slide 20
What goes into the main server
[Diagram: the main PS1 database keeps the full Objects, LnkToObj, Meta, and PartitionsMap tables; slices P1 … Pm are reached through linked servers.
Legend: database; full table; [partitioned table]; output table; distributed partitioned view.]
slide 21
What goes into slices
[Diagram: each slice database P1 … Pm keeps its partitioned tables [Objects_pi], [LnkToObj_pi], [Detections_pi] plus local copies of PartitionsMap and Meta; the main PS1 database keeps the full PartitionsMap, Objects, LnkToObj, and Meta tables, connected through linked servers.
Legend: database; full table; [partitioned table]; output table; distributed partitioned view.]
slide 23
Duplication of Objects & LnkToObj
Objects are distributed across slices Objects, P2ToObj, and StackToObj are duplicated in the
slices to parallelize “inserts” & “updates” Detections belong into their object’s slice Orphans belong to the slice where their position would
allocate them• Orphans near slices’ boundaries will need special
treatment Objects keep their original object identifier
• Even though positional refinement might change their zoneID and therefore the most significant part of their identifier
slide 24
Glue = Distributed Views
[Diagram: a Detections distributed partitioned view on the main PS1 database unions the partitioned tables [Objects_p1 … pm], [LnkToObj_p1 … pm], [Detections_p1 … pm] held in slices P1 … Pm over linked servers, alongside the full PartitionsMap, Objects, LnkToObj, and Meta tables.
Legend: database; full table; [partitioned table]; output table; distributed partitioned view.]
slide 25
Partitioning in Main Server
Main server is partitioned (Objects) and collocated (LnkToObj) by objID
Slices are partitioned (Objects) and collocated (LnkToObj) by objID
[Diagram: the Query Manager (QM) and Web Based Interface (WBI) sit in front of the PS1 database and the slices P1 … Pm over linked servers.]
slide 26
PS1 Table Sizes - Main Server
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits          –        –        –        –
StackToObj         0.92     1.84     2.76     3.22
StackModelFits        –        –        –        –
P2PsfFits             –        –        –        –
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       0.41     0.46     0.52     0.55
Indexes +20%       0.99     1.46     1.92     2.15
Total              5.96     8.74    11.51    12.90
Sizes are in TB (– = not stored on the main server)
slide 27
PS1 Table Sizes - Each Slice
Table            Year 1   Year 2   Year 3   Year 3.5
                 (m=4)    (m=8)    (m=10)   (m=12)
Objects            0.58     0.29     0.23     0.19
StackPsfFits       1.27     1.27     1.52     1.48
StackToObj         0.23     0.23     0.28     0.27
StackModelFits     0.29     0.29     0.34     0.33
P2PsfFits          1.97     1.97     2.36     2.30
P2ToObj            0.33     0.33     0.40     0.39
Other Tables       0.75     0.81     1.00     1.01
Indexes +20%       1.08     1.04     1.23     1.19
Total              6.50     6.23     7.36     7.16
Sizes are in TB (m = number of slices)
slide 28
PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes +20%       5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB
slide 29
Detail Design Outline
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 30
PS1 Distributed DB system
[Diagram: the distributed DB system from slide 13, repeated as context for ingest: loader databases (LoadAdmin, LoadSupport1 … LoadSupportn) stage objZoneIndx, orphans, Detections, and LnkToObj tables and feed the main PS1 database and the slices P1 … Pm over linked servers, behind the Query Manager (QM) and Web Based Interface (WBI).
Legend: database; full table; [partitioned table]; output table; partitioned view.]
slide 31
“Insert” & “Update”
SQL Insert and Update are expensive operations due to logging and re-indexing
In the PS1 design, Insert and Update have been re-factored into sequences of:
Merge + Constrain + Switch Partition
Frequency
• f1: daily
• f2: at least monthly
• f3: TBD (likely to be every 6 months)
slide 32
Ingest Workflow
[Diagram: CSV detections are zoned (DZone) and cross-matched (X) against ObjectsZ at 1" (DXO_1a); detections with no match are cross-matched again at 2" (DXO_2a); a Resolve step then produces the P2PsfFits, P2ToObj, and Orphans tables.]
slide 33
Ingest @ frequency = f1
[Diagram: the LOADER builds ObjectsZ, P2PsfFits, P2ToObj, and Orphans; SLICE_1 receives Orphans_1, P2ToPsfFits_1, and P2ToObj_1 next to its Objects_1 and Stack*_1 partitions; MAIN holds Metadata+, Objects, P2ToObj, and StackToObj.]
slide 34
Updates @ frequency = f2
[Diagram: same layout as slide 33 (LOADER, SLICE_1 with Orphans_1, P2ToPsfFits_1, P2ToObj_1, Objects_1, Stack*_1, and MAIN with Metadata+, Objects, P2ToObj, StackToObj); at frequency f2 the Objects table is refreshed.]
slide 35
Updates @ frequency = f2
[Diagram: continuation of the f2 update; refreshed Objects flow from the LOADER into Objects_1 on the slice and into Objects on MAIN.]
slide 36
Snapshots @ frequency = f3
[Diagram: at frequency f3 a snapshot of the Objects table is taken on MAIN, next to Metadata+, P2ToObj, and StackToObj.]
slide 37
Batch Update of a Partition
[Diagram: partitions A1, A2, A3 are merged into a single staging table with SELECT INTO; SELECT INTO … WHERE then splits the merged data into B1, B2, B3, each of which gets a primary-key index; finally each Bi is switched back in as a partition.]
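A hedged T-SQL sketch of this merge + constrain + switch sequence for one partition, with hypothetical names and boundary values (not the PS1 production scripts):

-- Hypothetical boundaries: partition 2 holds 100000000000000 <= objID < 200000000000000.
-- The staging table must live on the same filegroup as partition 2.

-- 1. Merge: pull the existing partition plus the new detections into a staging table.
SELECT *
INTO   B2_stage
FROM   merged                        -- existing partition A2 plus the new rows
WHERE  objID >= 100000000000000 AND objID < 200000000000000;

-- 2. Constrain: add the primary-key index and the CHECK constraint that
--    matches the partition boundaries (required before switching in).
ALTER TABLE B2_stage
  ADD CONSTRAINT pk_B2 PRIMARY KEY (objID, detectID);
ALTER TABLE B2_stage
  ADD CONSTRAINT ck_B2 CHECK (objID >= 100000000000000 AND objID < 200000000000000);

-- 3. Switch: metadata-only operations that swap the staging table in as partition 2.
ALTER TABLE Detections_p1 SWITCH PARTITION 2 TO Old_B2;        -- move the old data out
ALTER TABLE B2_stage      SWITCH TO Detections_p1 PARTITION 2; -- move the new data in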
slide 38
Scaling-out
Apply a ping-pong strategy to satisfy query performance during ingest:
2 × (1 main + m slices)
[Diagram: two copies of the full system; each PS1 main database (with PartitionsMap, Objects, LnkToObj, Meta, and the Detections view) and its slices P1 … Pm (with [Objects_pi], [LnkToObj_pi], [Detections_pi]) sit behind the Query Manager (QM) over linked servers.
Legend: database / duplicate database; full table; [partitioned table]; partitioned view / duplicate partitioned view.]
slide 39
Scaling-out
More robustness, fault-tolerance, and reliability call for:
3 × (1 main + m slices)
[Diagram: as on the previous slide, with a third copy of the main database and its slices.]
slide 40
Adding New Slices
SQL Server range partitioning capabilities make it easy (see the sketch below)
• Recalculate partitioning limits
• Transfer data to new slices
• Remove data from slices
• Define and apply new partitioning schema
• Add new partitions to main server
• Apply new partitioning schema to main server
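A hedged sketch of the repartitioning step on the main server, continuing the hypothetical names from the earlier partition-scheme sketch (the real limits would come from the recalculated ObjectID ranges):

-- Point the partition scheme at the filegroup that will hold the new range,
-- then split the existing range at the new ObjectID boundary.
ALTER PARTITION SCHEME psObjID NEXT USED fg_new;
ALTER PARTITION FUNCTION pfObjID() SPLIT RANGE (250000000000000);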
slide 41
Adding New Slices
slide 42
Detail Design Outline
• General Concepts
• Distributed Database architecture
• Ingest Workflow
• Prototype
slide 44
ODM Ingest Performance
5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate*
* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.
Nominal daily data rate:
• 1.5×10^11 / 3.5 / 365 = 1.2×10^8 P2 detections / day
• 8.3×10^10 / 3.5 / 365 = 6.5×10^7 stack detections / day
slide 45
Number of Objects
                 miniProto    myProto      Prototype    PS1
SDSS* Stars      5.7 × 10^4   1.3 × 10^7   1.1 × 10^8
SDSS* Galaxies   9.1 × 10^4   1.1 × 10^7   1.7 × 10^8
Galactic Plane   1.5 × 10^6   3 × 10^6     1.0 × 10^9
TOTAL            1.6 × 10^6   2.6 × 10^7   1.3 × 10^9   5.5 × 10^9
* “SDSS” includes a mirror of the 11.3 < Dec < 30 objects to Dec < 0
Total GB of CSV loaded data: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation started: October 15th 2007; finished: October 29th 2007 (??)
Includes
• 10 epochs of P2PsfFits detections
• 1 epoch of Stack detections
slide 47
Prototype in Context
Survey                Objects      Detections
SDSS DR6              3.8 × 10^8
2MASS                 4.7 × 10^8
USNO-B                1.0 × 10^9
Prototype             1.3 × 10^9   1.4 × 10^10
PS1 (end of survey)   5.5 × 10^9   2.3 × 10^11
slide 48
Size of Prototype Database
Table            Main    Slice1  Slice2  Slice3  Loader   Total
Objects           1.30    0.43    0.43    0.43    1.30     3.89
StackPsfFits      6.49       –       –       –       –     6.49
StackToObj        6.49       –       –       –       –     6.49
StackModelFits    0.87       –       –       –       –     0.87
P2PsfFits            –    4.02    3.90    3.35    0.37    11.64
P2ToObj              –    4.02    3.90    3.35    0.12    11.39
Total            15.15    8.47    8.23    7.13    1.79    40.77
Extra Tables      0.87    4.89    4.77    4.22    6.86    21.61
Grand Total      16.02   13.36   13.00   11.35    8.65    62.38
Table sizes are in billions of rows
slide 49
Size of Prototype Database
Table             Main    Slice1  Slice2  Slice3  Loader    Total
Objects           547.6   165.4   165.3   165.3   137.1    1180.6
StackPsfFits      841.5       –       –       –       –     841.6
StackToObj        300.9       –       –       –       –     300.9
StackModelFits    476.7       –       –       –       –     476.7
P2PsfFits             –   879.9   853.0   733.5    74.7    2541.1
P2ToObj               –   125.7   121.9   104.8     3.8     356.2
Total            2166.7  1171.0  1140.2  1003.6   215.6    5697.1
Extra Tables      207.9   987.1   960.2   840.7   957.3    3953.2
Allocated / Free 1878.0  1223.0  1300.0  1121.0   666.0    6188.0
Grand Total      4252.6  3381.1  3400.4  2965.3  1838.9   15838.3
Table sizes are in GB
9.6 TB of data in a distributed database
slide 50
Well-Balanced Partitions
Server    Partition   Rows          Fraction   Dec Range
Main      1           432,590,598   33.34%     32.59
Slice 1   1           144,199,105   11.11%     14.29
Slice 1   2           144,229,343   11.11%      9.39
Slice 1   3           144,162,150   11.12%      8.91
Main      2           432,456,511   33.33%     23.44
Slice 2   1           144,261,098   11.12%      8.46
Slice 2   2           144,073,972   11.10%      7.21
Slice 2   3           144,121,441   11.11%      7.77
Main      3           432,496,648   33.33%     81.98
Slice 3   1           144,270,093   11.12%     11.15
Slice 3   2           144,090,071   11.10%     14.72
Slice 3   3           144,136,484   11.11%     56.10
slide 51
Ingest and Association Times
Task                               Measured Minutes
Create Detections Zone Table              39.62
X(0.2") 121M × 1.3B                       65.25
Build #noMatches Table                     1.50
X(1") 12k × 1.3B                           0.65
Build #allMatches Table (121M)             6.58
Build Orphans Table                        0.17
Create P2PsfFits Table                    11.63
Create P2ToObj Table                      14.00
Total of Measured Times                  140.40
slide 52
Ingest and Association Times
Task                               Estimated Minutes
Compute DetectionID, HTMID                30
Remove NULLs                              15
Index P2PsfFits on ObjID                  15
Slices Pulling Data from Loader            5
Resolve 1 Detection - N Objects           10
Total of Estimated Times                  75
(Estimates range from educated guesses to wild guesses)
slide 53
Total Time to I/A daily Data
Task                              Time (hours, binary)   Time (hours, CSV)
Ingest 121M Detections (binary)          0.32
Ingest 121M Detections (CSV)                                    0.98
Total of Measured Times                  2.34                   2.34
Total of Estimated Times                 1.25                   1.25
Total Time to I/A Daily Data             3.91                   4.57
Requirement: less than 12 hours (more than 2800 detections / s)
Detection processing rate: 8600 to 7400 detections / s
Margin on requirement: 3.1 to 2.6
Using multiple loaders would improve performance
slide 54
Insert Time @ slices
Task                               Estimated Minutes
Import P2PsfFits (binary out/in)          20.45
Import P2PsfFits (binary out/in)           2.68
Import Orphans                             0.00
Merge P2PsfFits                           58
Add constraint P2PsfFits                 193
Merge P2ToObj                             13
Add constraint P2ToObj                    54
Total of Measured Times                  362
6 h with 8 partitions/slice (~1.3 × 10^9 detections/partition)
(Educated guess)
slide 55
Detections Per Partition
Years   Total Detections   Slices   Partitions per Slice   Total Partitions   Detections per Partition
0.0     0.00                  4              8                    32           0.00
1.0     4.29 × 10^10          4              8                    32           1.34 × 10^9
1.0     4.29 × 10^10          8              8                    64           6.7 × 10^8
2.0     8.57 × 10^10          8              8                    64           1.34 × 10^9
2.0     8.57 × 10^10         10              8                    80           1.07 × 10^9
3.0     1.29 × 10^11         10              8                    80           1.61 × 10^9
3.0     1.29 × 10^11         12              8                    96           1.34 × 10^9
3.5     1.50 × 10^11         12              8                    96           1.56 × 10^9
slide 56
Total Time for Insert @ slice
Task                           Time (hours)
Total of Measured Times            0.25
Total of Estimated Times           5.3
Total Time for daily insert        6
Daily insert may operate in parallel with daily ingest and association.
Requirement: less than 12 hours
Margin on requirement: 2.0
Using more slices will improve insert performance.
slide 57
Summary
Ingest + Association < 4 h using 1 loader (@f1 = daily)
• Scales with the number of servers
• Current margin on requirement: 3.1
• Room for improvement
Detection insert @ slices (@f1 = daily)
• 6 h with 8 partitions/slice
• May happen in parallel with loading
Detection links insert @ main (@f2 < monthly)
• Unknown
• 6 h available
Objects insert & update @ slices (@f2 < monthly)
• Unknown
• 6 hours available
Objects update @ main server (@f2 < monthly)
• Unknown
• 12 h available; transfer can be pipelined as soon as objects have been processed
slide 58
Risks
Estimates of Insert & Update at slices could be underestimated
• Need more empirical evaluation exercising parallel I/O
Estimates and layout of disk storage could be underestimated
• Merges and indexes require 2× the data size