Title: Indexing a Sphere
1Indexing a Sphere
- Jim Gray, Microsoft
- Alex Szalay, Johns Hopkins University
- Gyorgy Fekete, Maria Nieto Santisteban (JHU)
- April 2005
23 Ways We Do Spatial?
- Hierarchical mesh (extension to SQL)
- Uses table valued stored procedures
- Acts as a new spatial access method
- Ported to Yukon CLR for a 3x speedup.
- Zones fits SQL well
- Surprisingly simple good.
- Constraints a novel idea
- Lets us do algebra on regions.
- PaperThere Goes the Neighborhood Relational
Algebra for Spatial Data Search
3Sky Points and Regions
- Resources and Footprints
- Sky coverage of surveys
- Spatial pattern of observations
- Regions with bad data (masks)
- Data Management
- Organize objects by sky regions into buckets
- Density of objects highly variable on the sky
- Need a hierarchical partitioning
- Need to map to (btree) index in databases
- Search database by region
4Typical Questions
- Is this point in the survey?
- Give me all objects in this region
- What surveys covers this part of the sky?
- What is the common area of these surveys?
- Give me all good objects (not in bad areas)
- Cross-match these two catalogs
- Give me cumulative counts (Histogram) over area
- Compute fast spherical transforms of densities
- Interpolate sparsely sample functions
(extinction maps, dust temperature, )
5 Typical Calls
-- find objects within 1 arcminute of
(60,20) select objID, ra, dec from PhotoObj
as p, fHtmCover(60,20,1) as triangle
where p.htmID between triangle.startHtmID
and triangle.endHtmID and
ltgeometry test on (ra,dec) (60,20) lt 1
arcmingt -- or better yet select objID, ra, dec,
distance from dbo.fGetNearbyObjEq(60,20,1)
Coarse distance test
careful distance test rejects false positives
-- Also, Cross Apply is VERY useful select objID,
count()from PhotoObj p cross apply
dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
6PreliminariesEquations Define Subspaces
- For (x,y) above the lineaxby gt c
- Reverse the space by-ax -by gt -c
- Intersect a 3 volumes a1x b1y gt c1 a2x b2y
gt c2 a3x b3y gt c3
7Preliminaries
- Spherical geometry is really hard
- Distance
- acos((sin(radians(lat1))sin(radians(lat2)))(cos(r
adians(lat1)))cos(radians(lat2))cos(radians(abs(lo
n1-lon2))))) - Area
- forget it!
- But, astronomers have it easy
- The geoid is not a sphere!
- A trick do everything in 3-space with planes
8Escape From Surface of Sphere to 3D Avoids
Spherical Geometry
- Circle C defined by plane intersect sphere.
- Plane is cx,cy,cz unit vector length
cos(radians(radius)) - Point P (px,py,pz) Inside circle C ifPC gt
length - no transcendental functions just 3 multiplies
and a compare. - Arbitrary polygons are intersections of these
regions.
9Project Sphere onto Octahedron Szalay, Kunszt,
Brunner
- Every point belongs to one of 8 faces
- one of 8 triangles
10Recursively Divide Triangles(into 4
sub-triangles)
1
- This is a space-filling curveHausdorf, Hilbert,
Peano, Z-transform
11Space-Filling CurveEach Triangle Maps to The
Real Line 0..1
0.12,0.13)
So triangles correspond to ranges All points
inside the triangle are inside the range.
12Spatial Data Access SQL extension
- Szalay, Kunszt, Brunner http//www.sdss.jhu.edu/ht
m - Added Hierarchical Triangular Mesh (HTM)
table-valued function for spatial joins - Every object has a 20-deep Mesh ID
- Given a spatial definition,routine returns up to
10 covering triangles - Spatial query is then up to 10 range queries
- Fast 1,000 triangles / second / Ghz
13How Know if point is inside triangle?
- Answer
- Triangle sides are 3 great circles C1, C2, C3
- So (PC1 gt L1) (PC2 gt L2) (PC3 gt L3)
Given an area (spherical polygon) what triangles
cover it?
- Answer
- Recursive descent
- Triangle overlaps if any corner is inside
polygonand/or any triangle edge intersects
polygon. - Return list of start-stop pairs.
HTM Library computes triangles overlapping an
area.
14Find Points Inside Area
- fGetNearbyObjEq(ra,dec,r)fGetNearbyObjXyz(x,y,z,r
)fGetNearestObjEq(ra,dec,r)fGetNearestObjEq(x,y
,z,r) - fGetObjInside(areaSpec) -- string area
description - Alex added Healpix,Igloo is also a nice iso-area
decomposition
15Digression How does Search Work?
- Coarse filter (index) reduces search space
- Btree, QuadTree, HashTable, Bitmap, Rtree,
- Careful test discards false positives
- Efficiency
Careful test
True false
True
16DigressionWhat is an Index?
- New organization of the data
- Quick way to go from attribute to values
- A alt key ? primary key map
- HTM index is a quick (and clustered) way to go
from HTM triangle ID (range) to all points in
the triangle. area ? points
17DigressionB-Trees are Good at Range Lookups
- If you can tell me the triangles
- T (start, stop)
- covering an area
- select from points P join triangles T on
P.HTM_ID between T.start and t.stop - Returns all points inside the triangles
18B-Trees are Good at Range Lookups
- If you can tell me the triangles
- T (start, stop)
- covering an area
- select from points P join triangles
dbo.fHTMcover(_at_ra,_at_dec,_at_r) T on P.HTM_ID
between T.start and t.stop - Returns all points inside the triangles
points
Triangles
Nested Loops Join
Start Stop
HTM Index
19We are Done
using System using System.Collections using
System.Data using System.Data.Sql using
System.Data.SqlTypes namespace xphtm public
partial class HTM SqlFunction(IsDeterministi
c true, IsPrecise true) public static
long HTMlookup(int level, double RA, double dec)
double x, y, z HtmTrixel ht
new HtmTrixel() SpatialVector.radec2cartesia
n(RA, dec, out x, out y, out z) return
ht.cartesian2HID(x, y, z, level)
SqlFunctionAttribute(FillRowMethodName
"FillCoverRow" ,TableDefinition
"bigint HTMID_start, bigint HTMID_end")
public static IEnumerable fHTMcover(float ra,
float dec, float radius) ArrayList
rowSetAnswer new ArrayList() Int64,
arrayAnswer null arrayAnswer
HtmCover.Circle(ra,dec,radius) int i 0
for (int i 0 i lt (arrayAnswer.Length / 2)
i) CoverRow rp new CoverRow(arrayAnswer
i,0, arrayAnsweri, 1)
rowSetAnswer.Add(rp) return (IEnumerable)
rowSetAnswer public static void
FillCoverRow(object source, out long
HTMIDstart, out long HTMIDend) CoverRow
row (CoverRow)source HTMIDstart
row._HTMIDstart HTMIDend row._HTMIDend
public class CoverRow public
long _HTMIDstart public long _HTMIDend
public CoverRow(long HTMIDstart, long HTMIDend)
_HTMIDstart HTMIDstart _HTMIDend
HTMIDend
Returns HTM ID of a point
Returns table of HTM triangles
SQL wrapper
20History/Performance
- HTM has ½ false positive on circle (all sizes)
- Pre-Yukon
- xp wrapper gt500 lines C
- Table linkage expensive (400 µs), cover 1.3ms
- Yukon
- Code MUCH simpler.
- Scalar linkage 18 µs, table linkage 80µs
- HTM lookup 30 µs, cover 500 µs
µs/call SQL 2k TSQL SQL 2k TSQL SQL 05 C SQL 05 C
SQL 2k elapsed cpu elapsed cpu
EchoInteger () 16 15 18 14
EchoStringT() 17 15 18 16
TVF(1) 344 321 82 78
TVF(10) 599 565 91 88
TVF(100) 6,507 2,868 191 181
TVF(1000) 68,678 26,256 1,113 1,076
Linkage costs (HTM is 400 µs more)
21Poster Child App
- SkyServer lookups
- Points near point
- Points in polygon
- Polygons overlap polygon
- 2,000 lookups/second/cpu
- Can add 10,000 points /second/cpu
22HTM Library
- Current implementation in C, public domain.
- automatic code generation for C and Java
- A DB2 port (ROE), 2MASS using with Illustra
- Used at over 20 places in the world
- 15 SkyQuery surveys have HTM
- Easy to add HTM to your data
- In the VO, the ADQL SkyNode supports the spatial
language
http//www.sdss.jhu.edu/htm/
23Summary of HTM
- Good for Points in area.
- Efficiency comes from going to 3-space.
- Uses table-valued function as an index generator.
- New approach to bounding boxes.
- Nested-loop join with function on outside.
- Not good for area-overlaps-area(but can simplify
areas and test for empty)
24To Repeat (for area algebra) Equations Define
Subspaces
- For (x,y) above the lineaxby gt c
- Reverse the space by-ax -by gt -c
- Intersect a 3 volumes a1x b1y gt c1 a2x b2y
gt c2 a3x b3y gt c3
25Domain is Union of Convex Areas
Not a convex
- Regions are unions of convex areas
- Higher order curves also work
- Complex areas have holes and their holes have
holes. (that is harder).
Holes are harder
It is the union of two convexs
26Now in Relational Terms
- create table HalfSpace (
- domainID int not null -- domain name
- foreign key
references Domain(domainID), - convexID int not null, --
grouping a set of ½ spaces - halfSpaceID int
identity(), -- a particular ½ space - x float not null,
-- the (a,b,..) parameters - y float not null,
-- defining the ½ space - z float not null,
- c float not null, -- the
constant (c above) - primary key (domainID,
convexID, halfSpaceID) - (x,y,z) inside a convex if it is inside all
lines of the convex - (x,y,z) inside a convex if it is NOT OUTSIDE ANY
line of the convex - select convexID -- return the convex
- from HalfSpace -- from the constraints
- where _at_x x _at_y y _at_x z lt c -- point
outside the line? - group by all convexID -- consider all the
lines of a convexID - having count() 0 -- count outside 0
27The Algebra is Simple (Boolean)
- _at_domainID spDomainNew (_at_type varchar(16),
_at_comment varchar(8000)) - _at_convexID spDomainNewConvex (_at_domainID int)
- _at_halfSpaceID spDomainNewConvexConstraint
(_at_domainID int, _at_convexID int, - _at_x float, _at_y
float, _at_z float, _at_c float) - _at_returnCode spDomainDrop(_at_domainID)
- select from fDomainsContainPoint(_at_x float, _at_y
float, _at_z float) -
- Once constructed they can be manipulated with the
Boolean operations. - _at_domainID spDomainOr (_at_domainID1 int,
_at_domainID2 int, - _at_type varchar(16), _at_comment
varchar(8000)) - _at_domainID spDomainAnd (_at_domainID1 int,
_at_domainID2 int, - _at_type varchar(16), _at_comment
varchar(8000)) - _at_domainID spDomainNot (_at_domainID1 int,
- _at_type varchar(16), _at_comment
varchar(8000))
28Defining Areas (Regions)
- Circle Center point radiusRectangle Corner
points of squarePolygon sequence corner points
or edges - Region union of convexes
- As with all spatial, many coordinate systems
(equatorial, Cartesian, galactic, ) - Have a linear syntax
- CIRCLE J2000 RA DEC radius
- CONVEX CARTESIAN 1 0 0 0 1 0 0 0 1
- Have advanced XML syntax for space-time. Spec
http//www.ivoa.net/Documents/latest/STC-X.htmlPe
rson Arnold Rots http//hea-www.harvard.edu/arot
s/ - Similar to GIS, but better suited for astronomy
29String (easy) Areas (or struct in C world)
- circleSpec CIRCLE J2000 ra dec radArcMin
- CIRCLE CARTESIAN x y z radArcMin
- rectSpec RECT J2000 ra dec2
- polySpec POLY J2000 ra dec3
- POLY CARTESIAN x y z 3
- convexSpec CONVEX x y z d
- regionSpec REGION convexSpec
- areaSpec circleSpec rectSpec
- polySpec regionSpec
30Regions
- Easy to compare regions.
- Easy to combine regions (union, difference,
intersection, buffer,) - Easy to find points in region
- Easy to find regions containing points
- Easy to find regions overlap region.
31Some Details
- HTM library used to simplify (detect nulls)
- Compute (in TSQL now, C later)
- Corner points of region
- Area of region
- Bounding circle of region
- Center point bounding circle is spatial
index. - See Simplify a Convex Szalay/Gray.
32What! No Bounding Box?
- Bounding box limits search.A subset of the
convexs . - If query runs at 3M halfspace/sec then no need
for bounding box, unless you have more than
10,000 lines. - But, if you have a lot of half-spaces then
bounding box is good. - We have bounding circles.
33Things Can Get Complex
This is convex!
34Poster Child App
- Used as footprint service.
- Take many footprints
- Fuzz them (buffer) to make coarser
footprintconvex hull - See if two footprints overlap
- 20 lines of code 130 lines of logic/comments
35Zones Comparing Massive Point Datasets
- Want to cross match 109 objects with another
109 objects - Using the HTM code, that is a LOT of calls (106
seconds ( a year)) - Want to do it in an hour.
36Zone Approach
- Divide space into zones
- Key points by Zone, offset(on the sphere this
need wrap-around margin.) - Point search look in a few zones at a limited
offset ra r a bounding box that has
1-p/4 false positives - All inside the relational engine
- Avoids impedance mismatch
- Can batch all-all comparisons
- faster and 60,000x parallel1 hours, not 6
months! - This is Maria Nieto Santistebans PhD thesis
r
ra-zoneMax
x
v(r2(ra-zoneMax)2) cos(radians(zoneMax))
zoneMax
Ra x
37In SQL points near point
Ignoring some spherical geometry details, see
paper.
select o1.objID -- find objects from zone o1
-- in the zoned table where o1.zoneID
between -- where zone
floor((_at_dec-_at_r)/_at_zoneHeight) and -- overlaps
the circle ceiling((_at_dec_at_r)/_at_zoneHeight)
and o1.ra between _at_ra - _at_r and _at_ra _at_r --
quick filter on ra and o1.dec between _at_dec-_at_r
and _at_dec_at_r -- quick filter on dec and
( (sqrt( power(o1.cx-_at_cx,2)power(o1.cy-_at_cy,2)pow
er(o1.cz-_at_cz,2)))) lt _at_r -- careful filter
on distance
Bounding box
Eliminates the 21 1-p/4 False positives
Competes with HTM (faster because no xp linkage),
but . What about comparing 1 B points with 1B
points, cross match. Still too slow.
38Quantitative Evaluation 7x faster than external
stored proc (linkage is expensive)
- time vs. radius for neighbors function _at_
various zone heights. Any small zone height is
adequate.
- time vs. best time _at_ various radius. A
zoneHeight of 4 is near-optimal
39All Neighbors of All points(can Batch Process
the Joins)
- A 5x additional speedup (35x in total)for
_at_deltaZone in -1, 0, 1 example ignores some
spherical geometry details in paper
insert neighbors -- insert one zone's
neighbors select o1.objID as objID, -- object
pairs o2.objID as NeighborObjID, .. other
fields elided from zone o1 join zone o2 --
join 2 zones on o1.zoneID-_at_deltaZone
o2.zoneID -- using zone number and ra and
o2.ra between o1.ra - _at_r and o1.ra _at_r --
points near ra where -- elided margin
logic, see paper. and o2.dec between o1.dec-_at_r
and o1.dec_at_r -- quick filter on dec and
sqrt(power(o1.x-o2.x,2)power(o1.y-o2.y,2)power(o
1.z-o2.z,2)) lt _at_r -- careful filter on
distance
Many other speedups, and then there is
parallelism.
40Poster Child Apps
- CrossMatch SDSS with itself (250M obj)Producing
neighbors within 30 arcseconds - Was 4,000 hours (estimated)
- Now 2 hours
- MaxBCG (cross-match SDSS to find clusters)
- 9,167 cpu hours with files (normalize to 2,300
hrs?) - 240 cpu hours, 44 elapsed hours with zones
- See DB meets Grid paper
- Crossmatch
- USNO (US Naval Observatory Catalog) 1.2 B obj
- 2MASS (2 meter all sky survey) 0.5B objects
41More Poster ChildUSNO-2Mass CrossMatch
- Sequential first try, 22 hours
- Optimize code 6.25 hours
- Note
- data on separate machine
- part of zone time is pull of data
42Zones allow 60,000 Parallel Jobs Partition
Parallelism 3.7 hours
2MASSUSNOBZoneZoneComparison
MergeAnswer
Build Index
Source Tables
Zoned Tables
2MASS?USNOB
350 Mrec 12 GB
2MASS 471 Mrec 140 GB
0-1
64 Mrec 2 GB
2MASS 471 Mrec 65 GB
00
260 Mrec 9 GB
USNOB 1.1 Brec 233 GB
USNOB 1.1 Brec 106 GB
350 Mrec 12 GB
01
26 Mrec 1 GB
USNOB?2MASS
2 hours
1.2 hour
.5 hour
43Pipeline Parallelism 2.5 hours Or as fast as we
can read USNOB .5 hours
2MASSUSNOBZoneZoneComparison
MergeAnswer
Build Index
Source Tables
Zones
2MASS?USNOB
350 Mrec 12 GB
2MASS 471 Mrec 140 GB
0-1
64 Mrec 2 GB
Next zone
00
260 Mrec 9 GB
USNOB 1.1 Brec 233 GB
Next zone
350 Mrec 12 GB
01
26 Mrec 1 GB
USNOB?2MASS
2 hours
.5 hour
44Spatial Stuff Summary
- Easy
- Point in polygon
- Polygons containing points
- (instance and batch)
- Works in higher dimensions
- Side note Spherical polygons are
- hard in 2-space
- Easy in 3-space
45Spatial Stuff Summary
- Constraint databases are cool
- Streams (data is query, query is in DB)
- Notification subscription in DB, data is query
- Spatial constraints in DB, data is query
- You can express constraints as rows
- Then You
- Can evaluate LOTS of predicates per second
- Can do set algebra on the predicates.
- Benefits from SQL parallelism
- SQL Prolog // DataLog? ?
46In SQL
select o1.objID -- find objects from zone o1
-- in the zoned table where o1.zoneID
between -- where zone floor((_at_dec-_at_r)/_at_zon
eHeight) and -- overlaps the circle floor((_at_dec
_at_r)/_at_zoneHeight) and o1.ra between _at_ra - _at_r
and _at_ra _at_r -- quick filter on ra and o1.dec
between _at_dec-_at_r and _at_dec_at_r -- quick filter on
dec and ( (sqrt( power(o1.cx-_at_cx,2)power
(o1.cy-_at_cy,2)power(o1.cz-_at_cz,2)))) lt _at_r --
careful filter on distance
Bounding box
Eliminates the 21 1-p/4 False positives
47References\\barcfs\public\Gray\Spatial
- There Goes the Neighborhood Relational Algebra
for Spatial Data Search, J. Gray A. S. Szalay
G. Fekete M. A. Nieto-Santisteban W. O'Mullane
A. R. Thakar G. Heber A. H. Rots
http//research.microsoft.com/research/pubs/view.
aspx?tr_id736 - Roots of a Convex, Alex Szalay, Jim Gray, March
2005 - Representing Polygon Areas and Testing
Point-in-Polygon Containment in a Relational
Database Jim Gray, Alex Szalay - spRegion.SQL Jim Gray, Alex Szalay
- SQL2005 CTP Table-Valued Function Performance vs
SQL2000, Jim Gray, Peter Kukol - Preliminary Study of SQL2005 C HTM procedures
Jim Gray, Peter Kukol, Gyorgy Fekete, Alex
Szalay - Computing the Match Table Jim Gray, Alex Szalay,
Robert Lupton, Jeff Munn - When Database Systems Meet the GridM. A.
Nieto-Santisteban, J. Gray, A. S. Szalay, J.
Annis, A. R. Thakar, W. J. OMullane - Creating Sectors for SkyServer A. Szalay, G.
Fekete, T. Budavari, J. Gray, A. Pope, A. Thakar - Partitioning the Sky for Spatial Data Tasks Jim
Gray, Maria Nieto Santisteban, Alex Szalay - Spatial Clustering of Galaxies in Large
DatasetsA. S. Szalay, T. Budavari, A. Connolly,
J. Gray, T. Matsubara, A. Pope, I. Szapudi