Indexing a Sphere - PowerPoint PPT Presentation

About This Presentation
Title:

Indexing a Sphere

Description:

Acts as a new 'spatial access method' Ported to Yukon CLR for a 3x speedup. ... The geoid is not a sphere! A trick: do everything in 3-space with planes ... – PowerPoint PPT presentation

Number of Views:569
Avg rating:3.0/5.0
Slides: 47
Provided by: gray48
Category:
Tags: geoid | indexing | sphere

less

Transcript and Presenter's Notes

Title: Indexing a Sphere


1
Indexing a Sphere
  • Jim Gray, Microsoft
  • Alex Szalay, Johns Hopkins University
  • Gyorgy Fekete, Maria Nieto Santisteban (JHU)
  • April 2005

2
3 Ways We Do Spatial?
  • Hierarchical mesh (extension to SQL)
  • Uses table valued stored procedures
  • Acts as a new spatial access method
  • Ported to Yukon CLR for a 3x speedup.
  • Zones fits SQL well
  • Surprisingly simple good.
  • Constraints a novel idea
  • Lets us do algebra on regions.
  • PaperThere Goes the Neighborhood Relational
    Algebra for Spatial Data Search

3
Sky Points and Regions
  • Resources and Footprints
  • Sky coverage of surveys
  • Spatial pattern of observations
  • Regions with bad data (masks)
  • Data Management
  • Organize objects by sky regions into buckets
  • Density of objects highly variable on the sky
  • Need a hierarchical partitioning
  • Need to map to (btree) index in databases
  • Search database by region

4
Typical Questions
  • Is this point in the survey?
  • Give me all objects in this region
  • What surveys covers this part of the sky?
  • What is the common area of these surveys?
  • Give me all good objects (not in bad areas)
  • Cross-match these two catalogs
  • Give me cumulative counts (Histogram) over area
  • Compute fast spherical transforms of densities
  • Interpolate sparsely sample functions
    (extinction maps, dust temperature, )

5
Typical Calls
-- find objects within 1 arcminute of
(60,20) select objID, ra, dec from PhotoObj
as p, fHtmCover(60,20,1) as triangle
where p.htmID between triangle.startHtmID
and triangle.endHtmID and
ltgeometry test on (ra,dec) (60,20) lt 1
arcmingt -- or better yet select objID, ra, dec,
distance from dbo.fGetNearbyObjEq(60,20,1)
Coarse distance test
careful distance test rejects false positives
-- Also, Cross Apply is VERY useful select objID,
count()from PhotoObj p cross apply
dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
6
PreliminariesEquations Define Subspaces
  • For (x,y) above the lineaxby gt c
  • Reverse the space by-ax -by gt -c
  • Intersect a 3 volumes a1x b1y gt c1 a2x b2y
    gt c2 a3x b3y gt c3

7
Preliminaries
  • Spherical geometry is really hard
  • Distance
  • acos((sin(radians(lat1))sin(radians(lat2)))(cos(r
    adians(lat1)))cos(radians(lat2))cos(radians(abs(lo
    n1-lon2)))))
  • Area
  • forget it!
  • But, astronomers have it easy
  • The geoid is not a sphere!
  • A trick do everything in 3-space with planes

8
Escape From Surface of Sphere to 3D Avoids
Spherical Geometry
  • Circle C defined by plane intersect sphere.
  • Plane is cx,cy,cz unit vector length
    cos(radians(radius))
  • Point P (px,py,pz) Inside circle C ifPC gt
    length
  • no transcendental functions just 3 multiplies
    and a compare.
  • Arbitrary polygons are intersections of these
    regions.

9
Project Sphere onto Octahedron Szalay, Kunszt,
Brunner
  • Every point belongs to one of 8 faces
  • one of 8 triangles

10
Recursively Divide Triangles(into 4
sub-triangles)
1
  • This is a space-filling curveHausdorf, Hilbert,
    Peano, Z-transform

11
Space-Filling CurveEach Triangle Maps to The
Real Line 0..1
0.12,0.13)
So triangles correspond to ranges All points
inside the triangle are inside the range.
12
Spatial Data Access SQL extension
  • Szalay, Kunszt, Brunner http//www.sdss.jhu.edu/ht
    m
  • Added Hierarchical Triangular Mesh (HTM)
    table-valued function for spatial joins
  • Every object has a 20-deep Mesh ID
  • Given a spatial definition,routine returns up to
    10 covering triangles
  • Spatial query is then up to 10 range queries
  • Fast 1,000 triangles / second / Ghz

13
How Know if point is inside triangle?
  • Answer
  • Triangle sides are 3 great circles C1, C2, C3
  • So (PC1 gt L1) (PC2 gt L2) (PC3 gt L3)

Given an area (spherical polygon) what triangles
cover it?
  • Answer
  • Recursive descent
  • Triangle overlaps if any corner is inside
    polygonand/or any triangle edge intersects
    polygon.
  • Return list of start-stop pairs.

HTM Library computes triangles overlapping an
area.
14
Find Points Inside Area
  • fGetNearbyObjEq(ra,dec,r)fGetNearbyObjXyz(x,y,z,r
    )fGetNearestObjEq(ra,dec,r)fGetNearestObjEq(x,y
    ,z,r)
  • fGetObjInside(areaSpec) -- string area
    description
  • Alex added Healpix,Igloo is also a nice iso-area
    decomposition

15
Digression How does Search Work?
  • Coarse filter (index) reduces search space
  • Btree, QuadTree, HashTable, Bitmap, Rtree,
  • Careful test discards false positives
  • Efficiency

Careful test
True false
True
16
DigressionWhat is an Index?
  • New organization of the data
  • Quick way to go from attribute to values
  • A alt key ? primary key map
  • HTM index is a quick (and clustered) way to go
    from HTM triangle ID (range) to all points in
    the triangle. area ? points

17
DigressionB-Trees are Good at Range Lookups
  • If you can tell me the triangles
  • T (start, stop)
  • covering an area
  • select from points P join triangles T on
    P.HTM_ID between T.start and t.stop
  • Returns all points inside the triangles

18
B-Trees are Good at Range Lookups
  • If you can tell me the triangles
  • T (start, stop)
  • covering an area
  • select from points P join triangles
    dbo.fHTMcover(_at_ra,_at_dec,_at_r) T on P.HTM_ID
    between T.start and t.stop
  • Returns all points inside the triangles

points
Triangles
Nested Loops Join
Start Stop
HTM Index
19
We are Done
using System using System.Collections using
System.Data using System.Data.Sql using
System.Data.SqlTypes namespace xphtm public
partial class HTM SqlFunction(IsDeterministi
c true, IsPrecise true) public static
long HTMlookup(int level, double RA, double dec)
double x, y, z HtmTrixel ht
new HtmTrixel() SpatialVector.radec2cartesia
n(RA, dec, out x, out y, out z) return
ht.cartesian2HID(x, y, z, level)
SqlFunctionAttribute(FillRowMethodName
"FillCoverRow" ,TableDefinition
"bigint HTMID_start, bigint HTMID_end")
public static IEnumerable fHTMcover(float ra,
float dec, float radius) ArrayList
rowSetAnswer new ArrayList() Int64,
arrayAnswer null arrayAnswer
HtmCover.Circle(ra,dec,radius) int i 0
for (int i 0 i lt (arrayAnswer.Length / 2)
i) CoverRow rp new CoverRow(arrayAnswer
i,0, arrayAnsweri, 1)
rowSetAnswer.Add(rp) return (IEnumerable)
rowSetAnswer public static void
FillCoverRow(object source, out long
HTMIDstart, out long HTMIDend) CoverRow
row (CoverRow)source HTMIDstart
row._HTMIDstart HTMIDend row._HTMIDend
public class CoverRow public
long _HTMIDstart public long _HTMIDend
public CoverRow(long HTMIDstart, long HTMIDend)
_HTMIDstart HTMIDstart _HTMIDend
HTMIDend
Returns HTM ID of a point
Returns table of HTM triangles
SQL wrapper
20
History/Performance
  • HTM has ½ false positive on circle (all sizes)
  • Pre-Yukon
  • xp wrapper gt500 lines C
  • Table linkage expensive (400 µs), cover 1.3ms
  • Yukon
  • Code MUCH simpler.
  • Scalar linkage 18 µs, table linkage 80µs
  • HTM lookup 30 µs, cover 500 µs

µs/call SQL 2k TSQL SQL 2k TSQL SQL 05 C SQL 05 C
SQL 2k elapsed cpu elapsed cpu
EchoInteger () 16 15 18 14
EchoStringT() 17 15 18 16
TVF(1) 344 321 82 78
TVF(10) 599 565 91 88
TVF(100) 6,507 2,868 191 181
TVF(1000) 68,678 26,256 1,113 1,076
Linkage costs (HTM is 400 µs more)
21
Poster Child App
  • SkyServer lookups
  • Points near point
  • Points in polygon
  • Polygons overlap polygon
  • 2,000 lookups/second/cpu
  • Can add 10,000 points /second/cpu

22
HTM Library
  • Current implementation in C, public domain.
  • automatic code generation for C and Java
  • A DB2 port (ROE), 2MASS using with Illustra
  • Used at over 20 places in the world
  • 15 SkyQuery surveys have HTM
  • Easy to add HTM to your data
  • In the VO, the ADQL SkyNode supports the spatial
    language

http//www.sdss.jhu.edu/htm/
23
Summary of HTM
  • Good for Points in area.
  • Efficiency comes from going to 3-space.
  • Uses table-valued function as an index generator.
  • New approach to bounding boxes.
  • Nested-loop join with function on outside.
  • Not good for area-overlaps-area(but can simplify
    areas and test for empty)

24
To Repeat (for area algebra) Equations Define
Subspaces
  • For (x,y) above the lineaxby gt c
  • Reverse the space by-ax -by gt -c
  • Intersect a 3 volumes a1x b1y gt c1 a2x b2y
    gt c2 a3x b3y gt c3

25
Domain is Union of Convex Areas
Not a convex
  • Regions are unions of convex areas
  • Higher order curves also work
  • Complex areas have holes and their holes have
    holes. (that is harder).

Holes are harder

It is the union of two convexs
26
Now in Relational Terms
  • create table HalfSpace (
  • domainID int not null -- domain name
  • foreign key
    references Domain(domainID),
  • convexID int not null, --
    grouping a set of ½ spaces
  • halfSpaceID int
    identity(), -- a particular ½ space
  • x float not null,
    -- the (a,b,..) parameters
  • y float not null,
    -- defining the ½ space
  • z float not null,
  • c float not null, -- the
    constant (c above)
  • primary key (domainID,
    convexID, halfSpaceID)
  • (x,y,z) inside a convex if it is inside all
    lines of the convex
  • (x,y,z) inside a convex if it is NOT OUTSIDE ANY
    line of the convex
  • select convexID -- return the convex
  • from HalfSpace -- from the constraints
  • where _at_x x _at_y y _at_x z lt c -- point
    outside the line?
  • group by all convexID -- consider all the
    lines of a convexID
  • having count() 0 -- count outside 0

27
The Algebra is Simple (Boolean)
  • _at_domainID spDomainNew (_at_type varchar(16),
    _at_comment varchar(8000))
  • _at_convexID spDomainNewConvex (_at_domainID int)
  • _at_halfSpaceID spDomainNewConvexConstraint
    (_at_domainID int, _at_convexID int,
  • _at_x float, _at_y
    float, _at_z float, _at_c float)
  • _at_returnCode spDomainDrop(_at_domainID)
  • select from fDomainsContainPoint(_at_x float, _at_y
    float, _at_z float)
  • Once constructed they can be manipulated with the
    Boolean operations.
  • _at_domainID spDomainOr (_at_domainID1 int,
    _at_domainID2 int,
  • _at_type varchar(16), _at_comment
    varchar(8000))
  • _at_domainID spDomainAnd (_at_domainID1 int,
    _at_domainID2 int,
  • _at_type varchar(16), _at_comment
    varchar(8000))
  • _at_domainID spDomainNot (_at_domainID1 int,
  • _at_type varchar(16), _at_comment
    varchar(8000))

28
Defining Areas (Regions)
  • Circle Center point radiusRectangle Corner
    points of squarePolygon sequence corner points
    or edges
  • Region union of convexes
  • As with all spatial, many coordinate systems
    (equatorial, Cartesian, galactic, )
  • Have a linear syntax
  • CIRCLE J2000 RA DEC radius
  • CONVEX CARTESIAN 1 0 0 0 1 0 0 0 1
  • Have advanced XML syntax for space-time. Spec
    http//www.ivoa.net/Documents/latest/STC-X.htmlPe
    rson Arnold Rots http//hea-www.harvard.edu/arot
    s/
  • Similar to GIS, but better suited for astronomy

29
String (easy) Areas (or struct in C world)
  • circleSpec CIRCLE J2000 ra dec radArcMin
  • CIRCLE CARTESIAN x y z radArcMin
  • rectSpec RECT J2000 ra dec2
  • polySpec POLY J2000 ra dec3
  • POLY CARTESIAN x y z 3
  • convexSpec CONVEX x y z d
  • regionSpec REGION convexSpec
  • areaSpec circleSpec rectSpec
  • polySpec regionSpec

30
Regions
  • Easy to compare regions.
  • Easy to combine regions (union, difference,
    intersection, buffer,)
  • Easy to find points in region
  • Easy to find regions containing points
  • Easy to find regions overlap region.

31
Some Details
  • HTM library used to simplify (detect nulls)
  • Compute (in TSQL now, C later)
  • Corner points of region
  • Area of region
  • Bounding circle of region
  • Center point bounding circle is spatial
    index.
  • See Simplify a Convex Szalay/Gray.

32
What! No Bounding Box?
  • Bounding box limits search.A subset of the
    convexs .
  • If query runs at 3M halfspace/sec then no need
    for bounding box, unless you have more than
    10,000 lines.
  • But, if you have a lot of half-spaces then
    bounding box is good.
  • We have bounding circles.

33
Things Can Get Complex
This is convex!
34
Poster Child App
  • Used as footprint service.
  • Take many footprints
  • Fuzz them (buffer) to make coarser
    footprintconvex hull
  • See if two footprints overlap
  • 20 lines of code 130 lines of logic/comments

35
Zones Comparing Massive Point Datasets
  • Want to cross match 109 objects with another
    109 objects
  • Using the HTM code, that is a LOT of calls (106
    seconds ( a year))
  • Want to do it in an hour.

36
Zone Approach
  • Divide space into zones
  • Key points by Zone, offset(on the sphere this
    need wrap-around margin.)
  • Point search look in a few zones at a limited
    offset ra r a bounding box that has
    1-p/4 false positives
  • All inside the relational engine
  • Avoids impedance mismatch
  • Can batch all-all comparisons
  • faster and 60,000x parallel1 hours, not 6
    months!
  • This is Maria Nieto Santistebans PhD thesis

r
ra-zoneMax
x
v(r2(ra-zoneMax)2) cos(radians(zoneMax))
zoneMax
Ra x
37
In SQL points near point
Ignoring some spherical geometry details, see
paper.
select o1.objID -- find objects from zone o1
-- in the zoned table where o1.zoneID
between -- where zone
floor((_at_dec-_at_r)/_at_zoneHeight) and -- overlaps
the circle ceiling((_at_dec_at_r)/_at_zoneHeight)
and o1.ra between _at_ra - _at_r and _at_ra _at_r --
quick filter on ra and o1.dec between _at_dec-_at_r
and _at_dec_at_r -- quick filter on dec and
( (sqrt( power(o1.cx-_at_cx,2)power(o1.cy-_at_cy,2)pow
er(o1.cz-_at_cz,2)))) lt _at_r -- careful filter
on distance
Bounding box
Eliminates the 21 1-p/4 False positives
Competes with HTM (faster because no xp linkage),
but . What about comparing 1 B points with 1B
points, cross match. Still too slow.
38
Quantitative Evaluation 7x faster than external
stored proc (linkage is expensive)
  • time vs. radius for neighbors function _at_
    various zone heights. Any small zone height is
    adequate.
  • time vs. best time _at_ various radius. A
    zoneHeight of 4 is near-optimal


39
All Neighbors of All points(can Batch Process
the Joins)
  • A 5x additional speedup (35x in total)for
    _at_deltaZone in -1, 0, 1 example ignores some
    spherical geometry details in paper

insert neighbors -- insert one zone's
neighbors select o1.objID as objID, -- object
pairs o2.objID as NeighborObjID, .. other
fields elided from zone o1 join zone o2 --
join 2 zones on o1.zoneID-_at_deltaZone
o2.zoneID -- using zone number and ra and
o2.ra between o1.ra - _at_r and o1.ra _at_r --
points near ra where -- elided margin
logic, see paper. and o2.dec between o1.dec-_at_r
and o1.dec_at_r -- quick filter on dec and
sqrt(power(o1.x-o2.x,2)power(o1.y-o2.y,2)power(o
1.z-o2.z,2)) lt _at_r -- careful filter on
distance
Many other speedups, and then there is
parallelism.
40
Poster Child Apps
  • CrossMatch SDSS with itself (250M obj)Producing
    neighbors within 30 arcseconds
  • Was 4,000 hours (estimated)
  • Now 2 hours
  • MaxBCG (cross-match SDSS to find clusters)
  • 9,167 cpu hours with files (normalize to 2,300
    hrs?)
  • 240 cpu hours, 44 elapsed hours with zones
  • See DB meets Grid paper
  • Crossmatch
  • USNO (US Naval Observatory Catalog) 1.2 B obj
  • 2MASS (2 meter all sky survey) 0.5B objects

41
More Poster ChildUSNO-2Mass CrossMatch
  • Sequential first try, 22 hours
  • Optimize code 6.25 hours
  • Note
  • data on separate machine
  • part of zone time is pull of data

42
Zones allow 60,000 Parallel Jobs Partition
Parallelism 3.7 hours
2MASSUSNOBZoneZoneComparison
MergeAnswer
Build Index
Source Tables
Zoned Tables
2MASS?USNOB
350 Mrec 12 GB
2MASS 471 Mrec 140 GB
0-1
64 Mrec 2 GB
2MASS 471 Mrec 65 GB
00
260 Mrec 9 GB
USNOB 1.1 Brec 233 GB
USNOB 1.1 Brec 106 GB
350 Mrec 12 GB
01
26 Mrec 1 GB
USNOB?2MASS
2 hours
1.2 hour
.5 hour
43
Pipeline Parallelism 2.5 hours Or as fast as we
can read USNOB .5 hours
2MASSUSNOBZoneZoneComparison
MergeAnswer
Build Index
Source Tables
Zones
2MASS?USNOB
350 Mrec 12 GB
2MASS 471 Mrec 140 GB
0-1
64 Mrec 2 GB
Next zone
00
260 Mrec 9 GB
USNOB 1.1 Brec 233 GB
Next zone
350 Mrec 12 GB
01
26 Mrec 1 GB
USNOB?2MASS
2 hours
.5 hour
44
Spatial Stuff Summary
  • Easy
  • Point in polygon
  • Polygons containing points
  • (instance and batch)
  • Works in higher dimensions
  • Side note Spherical polygons are
  • hard in 2-space
  • Easy in 3-space

45
Spatial Stuff Summary
  • Constraint databases are cool
  • Streams (data is query, query is in DB)
  • Notification subscription in DB, data is query
  • Spatial constraints in DB, data is query
  • You can express constraints as rows
  • Then You
  • Can evaluate LOTS of predicates per second
  • Can do set algebra on the predicates.
  • Benefits from SQL parallelism
  • SQL Prolog // DataLog? ?

46
In SQL
select o1.objID -- find objects from zone o1
-- in the zoned table where o1.zoneID
between -- where zone floor((_at_dec-_at_r)/_at_zon
eHeight) and -- overlaps the circle floor((_at_dec
_at_r)/_at_zoneHeight) and o1.ra between _at_ra - _at_r
and _at_ra _at_r -- quick filter on ra and o1.dec
between _at_dec-_at_r and _at_dec_at_r -- quick filter on
dec and ( (sqrt( power(o1.cx-_at_cx,2)power
(o1.cy-_at_cy,2)power(o1.cz-_at_cz,2)))) lt _at_r --
careful filter on distance
Bounding box
Eliminates the 21 1-p/4 False positives
47
References\\barcfs\public\Gray\Spatial
  • There Goes the Neighborhood Relational Algebra
    for Spatial Data Search, J. Gray A. S. Szalay
    G. Fekete M. A. Nieto-Santisteban W. O'Mullane
    A. R. Thakar G. Heber A. H. Rots
    http//research.microsoft.com/research/pubs/view.
    aspx?tr_id736
  • Roots of a Convex, Alex Szalay, Jim Gray, March
    2005
  • Representing Polygon Areas and Testing
    Point-in-Polygon Containment in a Relational
    Database Jim Gray, Alex Szalay
  • spRegion.SQL Jim Gray, Alex Szalay
  • SQL2005 CTP Table-Valued Function Performance vs
    SQL2000, Jim Gray, Peter Kukol
  • Preliminary Study of SQL2005 C HTM procedures
    Jim Gray, Peter Kukol, Gyorgy Fekete, Alex
    Szalay
  • Computing the Match Table Jim Gray, Alex Szalay,
    Robert Lupton, Jeff Munn
  • When Database Systems Meet the GridM. A.
    Nieto-Santisteban, J. Gray, A. S. Szalay, J.
    Annis, A. R. Thakar, W. J. OMullane
  • Creating Sectors for SkyServer A. Szalay, G.
    Fekete, T. Budavari, J. Gray, A. Pope, A. Thakar
  • Partitioning the Sky for Spatial Data Tasks Jim
    Gray, Maria Nieto Santisteban, Alex Szalay
  • Spatial Clustering of Galaxies in Large
    DatasetsA. S. Szalay, T. Budavari, A. Connolly,
    J. Gray, T. Matsubara, A. Pope, I. Szapudi
Write a Comment
User Comments (0)
About PowerShow.com