1 there goes the neighborhood! spatial (or n-dimensional) search in a relational world jim gray,...
TRANSCRIPT
1
There Goes the Neighborhood!Spatial (or N-Dimensional) Search
in a Relational World
Jim Gray, Microsoft
Alex Szalay, Johns Hopkins U.
2
3
Background• I have been working with
Astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB
• A great Web Services app• The work here
joint with Alex Szalay
• SkyServer.Sdss.Org is first installment,
• SkyQuery.Netis second installment (federated web services)
4
Outline
• How to do spatial lookup:– The old way: HTM– The new way: zoned lookup
• A load workflow system
• Embedded documentation:literate programming for SQL DDL
5
Spatial Data Access – SQL extension
Szalay, Kunszt, Brunner http://www.sdss.jhu.edu/htm
• Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins
• Every object has a 20-deep Mesh ID• Given a spatial definition,
routine returns up to 10 covering triangles
• Spatial query is then up to 10 range queries
• Fast: 1,000 triangles / second / Ghz2,2
2,1
2,0
2,3
2,3,0
2,3,12,3,2 2,3,3
2,2
2,1
2,0
2,32,2
2,1
2,0
2,3
2,3,0
2,3,12,3,2 2,3,3
2,3,0
2,3,12,3,2 2,3,3
6
A typical call-- find objects within 1 arcminute of (60,20)select objID, ra, dec from PhotoObj as p, fHtmCover(60,20,1) as triangle where p.htmID between triangle.startHtmID and triangle.endHtmID and <geometry test on (ra,dec) – (60,20) < 1 arcmin>
-- or better yetselect objID, ra, dec, distance from dbo.fGetNearbyObjEq(60,20,1)
careful distance test rejects
false positives
Coarse distance test
Coarse filterCoarse filter Correct filterCorrect filter
7
Integration with CLR Makes it Nicer
• Peter Kukol converted 500 lines of external stored procedure “glue code” to 50 lines of C# code.
• Now we are converting library to C#
• Also, Cross Apply is VERY useful select objID, count(*)from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
8
But…
• Wanted a faster way to do this: some computations were taking toooooo long (see below).
• Wanted to define areas in relational form.
• Wanted a portable way that works on any relational system.
• So, developed a “constraint database” approach – see below.
9
The Idea:Equations Define Subspaces
• For (x,y) above the lineax+by > c
• Reverse the space by-ax + -by > -c
• Intersect 3 half-spaces: a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3
x
y
x=c/a
y=c/b
ax + by = c
x
y
10
The Idea:Equations Define Subspaces
a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3x
y
select count(*)from convex where a*@x + b*@y < c
3
2
22
1 1
1
select count(*)from convex where a*@x + b*@y > c
x
y
0
1
11
2 2
2
11
Domain is Union of Convex Hulls
• Simple volumes are unions of convex hulls.
• Higher order curves also work
• Complex volumes have holes and their holes have holes. (that is harder).
Not a convex hull
+
12
Now in Relational Termscreate table HalfSpace (
domainID int not null -- domain name foreign key references Domain(domainID), convexID int not null, -- grouping a set of ½ spaces halfSpaceID int identity(), -- a particular ½ space a float not null, -- the (a,b,..) parameters b float not null, -- defining the ½ space c float not null, -- the constraint (“c” above) primary key (domainID, convexID, halfSpaceID)
(x,y) inside a convex if it is inside all lines of the convex(x,y) inside a convex if it is NOT OUTSIDE ANY line of the convex
Convexes containing point (@x,@y):select convexID -- return the convex hullsfrom HalfSpace -- from the constraintswhere (@x * a + @y * b) < c -- point outside the line?group by all convexID -- insist no line of convexhaving count(*) = 0 -- is outside (count outside ==
0)
13
All Domains Containing this Point
• The group by is supported by the domain/convex index, so it’s a sequential scan (pre-sorted!).
select distinct domainID -- return domains
from HalfSpace -- from constraints
where (@x * a + @y * b) < c -- point outside
group by all domainID, convexID -– never happens having count(*) = 0 -- count outside == 0
14
The Algebra is Simple (Boolean)@domainID = spDomainNew (@type varchar(16), @comment varchar(8000))@convexID = spDomainNewConvex (@domainID int)@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @a float, @b float, @c float)@returnCode = spDomainDrop(@domainID)
select * from fDomainsContainPoint(@x float, @y float) Once constructed they can be manipulated with the Boolean operations.@domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))@domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))@domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))
15
What! No Bounding Box?
• Bounding box limits search.A subset of the convex hulls.
• If query runs at 3M half-space/sec then no need for bounding box, unless you have more than 10,000 lines.
• But, if you have a lot of half-spaces then bounding box is good.
16
OK: “solved” Areas Contain Point? What about: Points near point?
• Table-valued function find points near a point
– Select * from fGetNearbyEq(ra,dec,r)• Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/
– Space filling curve, bounding triangles…– Standard approach
• 13 ms/call… So 70 objects/second.• Too slow, so pre-compute neighbors:
Materialized view.• At 70 objects/sec: takes 6 months
to compute materialized view on billion objects.
17
Zone Based Spatial Join• Divide space into zones• Key points by Zone, offset
(on the sphere this need wrap-around margin.)
• Point search look in a few zonesat a limited offset: ra ± ra bounding box that has
1-π/4 false positives• All inside the relational engine• Avoids “impedance mismatch” • Can “batch” all-all comparisons• 33x faster and parallel
6 days, not 6 months!
r ra-zoneMax
√(r2+(ra-zoneMax)2)cos(radians(zoneMax))
zoneMax
x
Ra ± x
18
In SQL: points near point
select o1.objID -- find objectsfrom zone o1 -- in the zoned tablewhere o1.zoneID between -- where zone #
floor((@dec-@r)/@zoneHeight) and -- overlaps the circleceiling((@dec+@r)/@zoneHeight)
and o1.ra between @ra - @r and @ra + @r -- quick filter on ra and o1.dec between @dec-@r and @dec+@r -- quick filter on dec and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))))
< @r -- careful filter on distance
Eliminates the ~ 21% = 1-π/4False positives
Bounding box
19
Quantitative Evaluation: 7x faster than external stored proc:
(linkage is expensive)time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate.
time vs. best time @ various radius. A zoneHeight of 4” is near-optimal
Rows vs elapsed timefit is 1.46+2.2e-4*r^2 ms/asec
1
10
100
1000
10 100 1000
r (asec)
tim
e (
se
c)
7.5 asec
15 asec30 asec1 amin
2 amin4 amin
64 aminr 2̂ fit
Relative time vs zone height (asec)4 minute zone is near optimal
2 & 8 minute are slower
1.00
1.10
1.20
1.30
1.40
1.50
1.60
1.70
1.80
1.90
2.00
10 100 1000r (asec)
tim
e vs
bes
t
7 asec15 asec30 asec1 amin2 amin4 amin16 amin1 degree
20
All Neighbors of All points(can Batch Process the Joins)
• A 5x additional speedup (35x in total)for @deltaZone in {-1, 0, 1} example ignores some spherical geometry details in paper
insert neighbors -- insert one zone's neighbors select o1.objID as objID, -- object pairs
o2.objID as NeighborObjID, .. other fields elided
from zone o1 join zone o2 -- join 2 zones on o1.zoneID-@deltaZone = o2.zoneID -- using zone number and ra
and o2.ra between o1.ra - @r and o1.ra + @r -- points near rawhere -- elided margin logic, see paper. and o2.dec between o1.dec-@r and o1.dec+@r -- quick filter on dec and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2))
< @r -- careful filter on distance
21
Spatial Stuff Summary• Easy
– Point in polygon– Polygons containing points– (instance and batch)
• Works in higher dimensions
• Side note: Spherical polygons are – hard in 2-space– Easy in 3-space
22
Spatial Stuff Summary• Constraint databases are in
– Streams (data is query, query is in DB)– Notification: subscription in DB, data is query– Spatial: constraints in DB, data is query
• You can express constraints as rows
• Then You – Can evaluate LOTS of predicates per second– Can do set algebra on the predicates.
• Benefits from SQL parallelism
• SQL == Prolog // DataLog?
23
References
• Representing Polygon Areas and Testing Point-in-Polygon
Containment in a Relational Database http://research.microsoft.com/~Gray/papers/Polygon.doc
• A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc
24
Outline
• How to do spatial lookup:– The old way: HTM– The new way: zoned lookup
• A load workflow system
• Embedded documentation:literate programming for SQL DDL
25
Loading consists of Many Tasks• Very simply:
capture, analyze to produce catalog info, convert to sql,
validate, import,
index,
• Each of these steps has many sub-steps• We learned from the TerraServer that
(1) LOADING IS WHERE THE TIME GOES.(2) You get to load it again when you discover better data you discover a bug in the data
you discover a better design.(3) Essentially, you are always loading or preparing to load.
• Everyone “knows this” but you have to experience it to grasp it.
TelescopeObservation
GenerateCatalogs
Validatecatalogs
Load toDatabase
26
The SkyServer Load Manager• Built a workflow engine with SQL Agent
(our batch job scheduler) and DTS• State machine is in database
visible to all worker nodes• Workers at each node pull work • Step is a stored procedure
– logs to load monitor database – 3 levels of logging: job, step, phase– Logs and reporting are VERY useful
• Automatic some manual backout to fix problems
• Demo http://skyserver.pha.jhu.edu/admin/tasklist.asp?all=1
27
Outline
• How to do spatial lookup:– The old way: HTM– The new way: zoned lookup
• A load workflow system
• Embedded documentation:literate programming for SQL DDL
28
How do you document your schema?
29
Knuth’s Literate Programming
• Put documentation in the program• Tool generates manual from program..
In the new world: tool generates online hypertext from program.
• We Annotated every table, view, function, .. with tags:unitscommentreference to other tables (flags, star schema).
• Program scans DDL, generates web site.• http://skyserver.sdss.org/en/help/docs/browser.asp
http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?userid=1BHUJQ3VNI&isbn=0937073806&itm=1
30
CREATE FUNCTION fGetNearbyObjXYZ (@nx float, @ny float, @nz float, @r float)---------------------------------------------------------------/H Returns table of primary objects within @r arcmins of an xyz point (@nx,@ny, @nz).---------------------------------------------------------------/T There is no limit on the number of objects returned, but there are about 40 per sq arcmin.--/T <p>returned table: --/T <li> objID bigint PRIMARY KEY, -- Photo primary object identifier--/T <li> run int NOT NULL, -- run that observed this object --/T <li> camcol int NOT NULL, -- camera column that observed the object--/T <li> field int NOT NULL, -- field that had the object--/T <li> rerun int NOT NULL, -- computer processing run that discovered the object--/T <li> type int NOT NULL, -- type of the object (3=Galaxy, 6= star, see PhotoType in DBconstants)--/T <li> cx float NOT NULL, -- x,y,z of unit vector to this object--/T <li> cy float NOT NULL,--/T <li> cz float NOT NULL,--/T <li> htmID bigint, -- Hierarchical Triangular Mesh id of this object--/T <li> distance float -- distance in arc minutes to this object from the ra,dec.--/T <br> Sample call to find PhotoObjects within 5 arcminutes of xyz -.0996,-.1,0--/T <br><samp>--/T <br>select *--/T <br>from dbo.fGetNearbyObjXYZ(-.996,-.1,0,5) --/T </samp> --/T <br>see also fGetNearbyObjEq, fGetNearestObjXYZ, fGetNearestObjXYZ------------------------------------------------------------- RETURNS @proxtab TABLE ( objID bigint PRIMARY KEY,
31
CREATE TABLE Chunk (---------------------------------------------------------------------------------/H Contains basic data for a Chunk----/T A Chunk is a unit for SDSS data export. --/T It is a part of an SDSS stripe, a 2.5 degree wide cylindrical segment --/T aligned at a great circle between the survey poles. --/T A Chunk has had both strips completely observed. Since --/T the SDSS camera has gaps between its 6 columns of CCDs, each stripe has --/T to be scanned twice (these are the strips) resulting in 12 slightly --/T overlapping narrow observation segments. <P>--/T Only those parts of a stripe are ready for export where the observation --/T is complete, hence the declaration of a chunk, usually consisting of 2 runs. -------------------------------------------------------------------------------
chunkNumber int NOT NULL, --/D Unique chunk identifierstartMu int NOT NULL , --/D Starting mu value --/U arcsecendMu int NOT NULL , --/D Ending mu value --/U arcsec[stripe] int NOT NULL , --/D Stripe numberexportVersion varchar(32) NOT NULL , --/D Export Version
)
32
SkyServer Object Browserhttp://skyserver.sdss.org/en/help/docs/browser.asp
• VB program scans DDL, generates web site.
• It is VERY usefulFree Text Search VERY useful
33
Outline• How to do spatial lookup:
– The old way: HTM– The new way: zoned lookup
• A load workflow system
• Embedded documentation:literate programming for SQL DDL
• Not shown: A C# web service (http://skyservice.pha.jhu.edu/SdssCutout) or better yet!
http://skyservice.pha.jhu.edu/dr1/imgcutout/chart.asp