Title: Association techniques for the Virtual Observatory
1. Association techniques for the Virtual Observatory
2. Why associations are crucial to the Virtual Observatory
- The essence of the VO is database federation
- Usually DBs of independent origin
- No links between entries in different DBs
- Such links needed for prototypical VO query
- e.g. "give me all galaxies in region A of the sky with an optical/X-ray flux ratio greater than X which are not detected in the radio to a limiting flux of Y"
[Figure labels: Optical, X-ray, Radio]
3. Why you might think associations are easy to make
- Natural spatial indexing to astro databases
- Plus uncertainties on positions, in general
- Just perform matching by proximity
- Simple-ish methods for doing this (Clive)
- Some practical issues for distributed case
- Data volumes
- think about transfers and performance
- Metadata for interoperability
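Matching by proximity can be sketched in a few lines. The example below is a minimal illustration, not any particular service's method: the catalogue layout (tuples of id, RA, Dec, positional uncertainty), the function names and the 3-sigma acceptance radius are all assumptions made for the sketch, and the brute-force loop ignores the data-volume issues noted above.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def match_by_proximity(cat_a, cat_b, n_sigma=3.0):
    """For each source in cat_b, find the nearest cat_a entry within
    n_sigma times the combined positional uncertainty.

    Each catalogue is a list of (id, ra_deg, dec_deg, sigma_arcsec) tuples
    (a hypothetical layout). Returns {id_b: (id_a, separation_arcsec) or None}.
    Brute force, O(N_A * N_B): fine for a sketch, not for very large catalogues.
    """
    matches = {}
    for id_b, ra_b, dec_b, sig_b in cat_b:
        best = None
        for id_a, ra_a, dec_a, sig_a in cat_a:
            sep_arcsec = angular_separation_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            limit = n_sigma * math.hypot(sig_a, sig_b)
            if sep_arcsec <= limit and (best is None or sep_arcsec < best[1]):
                best = (id_a, sep_arcsec)
        matches[id_b] = best
    return matches

# Toy data (invented for illustration only).
optical = [("o1", 150.0000, 2.0000, 0.2),
           ("o2", 150.0010, 2.0000, 0.2),
           ("o3", 151.0000, 3.0000, 0.2)]
xray = [("x1", 150.0002, 2.0001, 1.5),
        ("x2", 140.0000, -5.0000, 1.5)]

result = match_by_proximity(optical, xray)
```

Here `x1` has two optical candidates inside its acceptance radius and the nearest is returned, while `x2` has none and maps to `None`.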
4. SkyQuery (www.skyquery.net)
- Restriction to SQL Server databases and .NET
- Requires special facilities at data centres? (Greg)
- Matching by proximity alone
5. Matching by proximity is not always adequate
Need astrophysical information to know which of the red objects is the most likely counterpart to the cyan source
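6. General Case
- Database A
- Positions $(\mathrm{RA}_i, \mathrm{Dec}_i)$ for $i = 1, \dots, N_A$
- Pos. uncertainties $(\sigma_{\mathrm{RA},i}, \sigma_{\mathrm{Dec},i})$ or $(\sigma_{X,i}, \sigma_{Y,i})$ or $\sigma_i$ or $\sigma$
- Other attributes $A_{ij}$ for $j = 1, \dots, M_A$
- Ditto for Database B
- $(N_A, N_B)$ may be up to $10^9$
- $(M_A, M_B)$ may be $10^2$
- $<10$ likely to be used in association procedure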
7. General Requirements
- Users can readily assess whether associations are suitable for their analysis
- Transparency of method used
- Figure of merit for each association
- User-supplied association methods(?)
- Performance: pre-computation vs. on-the-fly
- Incorporating astrophysical prior knowledge, but not biasing associations unduly
- Often new classes of source involved
8. Likelihood Ratio technique(s)
- Likelihood Ratio, $\mathrm{LR}_{ij}$, for association of the $i$th entry of DB A and the $j$th of B, defined to be
$$\mathrm{LR}_{ij} = \frac{\text{prob. that } A_i \text{ is true counterpart of } B_j}{\text{prob. that } A_i \text{ is not true counterpart of } B_j}$$
- Choose $i$ that maximises $\mathrm{LR}_{ij}$
9. LR example
- A is an optical catalogue, with magnitudes $m$ and negligible positional errors
- Gaussian positional uncertainty, $e(x, y)$, for B
- Then, $\mathrm{LR}_{ij} = n_{A,\mathrm{ID}}(m_i)\, e(x_j, y_j) / n_A(m_i)$
- Problems
- Might not know form of $n_{A,\mathrm{ID}}(m_i)$
- Might have several populations in B
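The LR expression in this example can be evaluated directly once forms for the two magnitude distributions are chosen. The sketch below assumes a circular Gaussian for e(x, y) and uses two invented toy functions, `n_a_toy` and `n_id_toy`, purely as stand-ins for n_A(m) and n_A,ID(m); neither comes from the slides, and the candidate values are made up.

```python
import math

def gaussian_positional_density(dx, dy, sigma):
    """Circular 2-D Gaussian e(x, y) at offset (dx, dy); units follow sigma."""
    norm = 2.0 * math.pi * sigma * sigma
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / norm

def likelihood_ratio(m, dx, dy, sigma, n_id, n_a):
    """LR = n_A,ID(m) * e(dx, dy) / n_A(m) for one candidate."""
    return n_id(m) * gaussian_positional_density(dx, dy, sigma) / n_a(m)

def n_a_toy(m):
    """Assumed surface density of all optical objects (per mag per arcsec^2),
    rising steeply towards faint magnitudes."""
    return 1e-4 * 10 ** (0.4 * (m - 20.0))

def n_id_toy(m):
    """Assumed magnitude distribution of true counterparts (per mag):
    a unit-width Gaussian centred on m = 19."""
    return math.exp(-0.5 * (m - 19.0) ** 2) / math.sqrt(2.0 * math.pi)

# Candidates for one B source: (name, magnitude, dx_arcsec, dy_arcsec).
candidates = [("faint_close", 23.0, 0.5, 0.0),
              ("bright_far", 19.0, 2.0, 1.0)]
sigma_b = 1.5  # assumed positional uncertainty of the B source, arcsec

best = max(candidates,
           key=lambda c: likelihood_ratio(c[1], c[2], c[3], sigma_b,
                                          n_id_toy, n_a_toy))
nearest = min(candidates, key=lambda c: math.hypot(c[2], c[3]))
```

With these toy inputs the candidate that maximises LR is not the nearest one, which is the situation in which proximity alone is inadequate.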
10. If $n_{A,\mathrm{ID}}(m_i)$ is not known
- Estimate it
- Compare $n_A(m)$ around source positions with $n_A(m)$ for full database A
- Learn it
- Use EM algorithm to learn form of $n_{A,\mathrm{ID}}(m_i)$ (Emma Taylor, PhD thesis)
- Circumvent it
- Set $n_{A,\mathrm{ID}}(m_i) = \text{const.}$ and normalise $\mathrm{LR}_{ij}$ using randomly-located fictitious sources
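The "estimate it" option amounts to a background subtraction on magnitude histograms. The following is one possible sketch under stated assumptions: the function names, the 1-magnitude bins and the input layout are hypothetical, and the background is taken to be the full-catalogue histogram scaled by the ratio of search area to catalogue area.

```python
import math
from collections import Counter

def magnitude_histogram(mags, bin_width=1.0):
    """Count objects per magnitude bin, keyed by the lower edge of each bin."""
    return Counter(math.floor(m / bin_width) * bin_width for m in mags)

def estimate_counterpart_counts(mags_near_sources, mags_full,
                                area_near, area_full, bin_width=1.0):
    """Excess of objects near B source positions over the expected background.

    mags_near_sources: magnitudes of A objects inside the search regions
    mags_full:         magnitudes of all A objects
    area_near/full:    total search area and full catalogue area (same units)
    Returns {bin_lower_edge: excess_count}, clipped at zero.
    """
    near = magnitude_histogram(mags_near_sources, bin_width)
    full = magnitude_histogram(mags_full, bin_width)
    scale = area_near / area_full
    bins = sorted(set(near) | set(full))
    return {b: max(0.0, near[b] - full[b] * scale) for b in bins}

# Toy numbers (invented): the search regions cover 10% of the catalogue area.
mags_full = [18.5] * 10 + [19.5] * 20 + [20.5] * 40
mags_near = [18.2] * 5 + [19.1, 19.7] + [20.3] * 4

excess = estimate_counterpart_counts(mags_near, mags_full,
                                     area_near=10.0, area_full=100.0)
```

Only the brightest bin shows an excess over the scaled background here, so in this toy case the estimated counterpart distribution is concentrated at bright magnitudes.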
11. But
- All of these methods require statistics on A
- e.g. $n_A(m)$
- or histogram of any other attribute(s)
- The more complicated the physical model (e.g. multiple source populations in B), the more complicated the statistics that are needed
- Not an insurmountable problem: just lots of count() queries
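A magnitude histogram of this kind can be obtained with a single grouped count query. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a real archive database; the table name `catalogue_a`, its columns and the toy rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogue_a (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
)
rows = [(1, 150.0, 2.0, 18.3), (2, 150.1, 2.1, 19.7), (3, 150.2, 2.2, 19.2),
        (4, 150.3, 2.3, 20.9), (5, 150.4, 2.4, 20.1), (6, 150.5, 2.5, 20.5)]
conn.executemany("INSERT INTO catalogue_a VALUES (?, ?, ?, ?)", rows)

def magnitude_counts(connection, bin_width=1.0):
    """Return [(bin_lower_edge, count), ...] for the whole catalogue.

    CAST truncates towards zero, which equals the bin's lower edge
    for the positive magnitudes used here.
    """
    query = """
        SELECT CAST(mag / ? AS INTEGER) * ? AS mag_bin, COUNT(*) AS n
        FROM catalogue_a
        GROUP BY mag_bin
        ORDER BY mag_bin
    """
    return connection.execute(query, (bin_width, bin_width)).fetchall()

counts = magnitude_counts(conn)
```

More complicated models would need the same kind of query grouped on additional attributes, which is where the number of count queries grows.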
12. Pre-computing cross-neighbours
- LR usually chooses between a few candidates
- Pre-compute and store cross-neighbours
- At least for the few, very large DBs
- Can then allow many probabilistic models to be used following the initial proximity cut
[Figure: databases A, B and C, with CrossNeighbours (B,C) and CrossNeighbours (C,B) tables]
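The idea of storing cross-neighbours once and applying different models afterwards can be sketched as follows. This is an illustration under assumptions: the catalogue layout, the function names and the two scoring rules are hypothetical, the separation uses a small-angle flat-sky approximation that ignores RA wrap-around, and a real service would use a spatial index rather than a double loop.

```python
import math
from collections import defaultdict

def build_cross_neighbours(cat_a, cat_b, radius_arcsec):
    """Pre-compute {id_b: [(id_a, separation_arcsec), ...]} within a radius.

    Catalogues are lists of (id, ra_deg, dec_deg) tuples. B sources with no
    neighbour inside the radius do not appear in the result.
    """
    table = defaultdict(list)
    for id_b, ra_b, dec_b in cat_b:
        cos_dec = math.cos(math.radians(dec_b))
        for id_a, ra_a, dec_a in cat_a:
            dx = (ra_a - ra_b) * cos_dec * 3600.0
            dy = (dec_a - dec_b) * 3600.0
            sep = math.hypot(dx, dy)
            if sep <= radius_arcsec:
                table[id_b].append((id_a, sep))
    return dict(table)

def associate(cross_neighbours, score):
    """Pick, for each B source, the stored candidate with the highest score.

    score(id_a, separation_arcsec) is any user-supplied function.
    """
    return {id_b: max(cands, key=lambda c: score(*c))[0]
            for id_b, cands in cross_neighbours.items()}

# Toy data (invented for illustration).
cat_a = [("a1", 10.0000, 0.0), ("a2", 10.0005, 0.0), ("a3", 20.0, 0.0)]
cat_b = [("b1", 10.0001, 0.0)]
mags_a = {"a1": 23.0, "a2": 18.0, "a3": 19.0}

cross = build_cross_neighbours(cat_a, cat_b, radius_arcsec=5.0)

# Two different models applied to the same stored proximity cut.
by_nearest = associate(cross, lambda id_a, sep: -sep)
by_brightest = associate(cross, lambda id_a, sep: -mags_a[id_a])
```

The expensive spatial step runs once; each model afterwards only touches the short candidate lists.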
13. Distributed Association Service?
- cf. Distributed Annotation Server
- Allows third-party annotation in bio DBs
- "inferred function of this gene is junk"
- Can be included in queries (somehow)
- Select whatever from BioDB where not "function is junk"
- Some sort of join between BioDB and the Distributed Annotation Server
14. Distributed Association Service (2)
- Is something like this needed in the VO?
- Easier than adding extra columns to tables
- What would it contain?
- References to original databases
- "entry N in DB A is entry M in DB B"
- Descriptions of methods used
- Links to literature references (ADS/CDS)
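One way to picture such a service is as a separate table of association records that is joined to the original databases at query time. The sketch below is hypothetical throughout: the schema, column names and rows are invented, and a single in-memory `sqlite3` database stands in for what would really be several independent services.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE optical (id INTEGER PRIMARY KEY, mag REAL);
    CREATE TABLE xray (id INTEGER PRIMARY KEY, flux REAL);

    -- Association records: "entry id_a in db_a is entry id_b in db_b",
    -- plus the method used, a figure of merit and a literature reference.
    CREATE TABLE associations (
        db_a TEXT, id_a INTEGER,
        db_b TEXT, id_b INTEGER,
        method TEXT,
        figure_of_merit REAL,
        reference TEXT
    );

    INSERT INTO optical VALUES (1, 18.2), (2, 21.5);
    INSERT INTO xray VALUES (10, 3.0e-14), (11, 8.0e-15);
    INSERT INTO associations VALUES
        ('optical', 1, 'xray', 10, 'likelihood ratio', 0.97, 'placeholder reference'),
        ('optical', 2, 'xray', 11, 'proximity only', 0.40, 'placeholder reference');
""")

def matched_pairs(connection, min_merit):
    """Join both catalogues through the association records,
    keeping only matches at or above a figure-of-merit threshold."""
    query = """
        SELECT o.id, x.id, o.mag, x.flux, s.method
        FROM associations AS s
        JOIN optical AS o ON s.db_a = 'optical' AND s.id_a = o.id
        JOIN xray AS x ON s.db_b = 'xray' AND s.id_b = x.id
        WHERE s.figure_of_merit >= ?
        ORDER BY o.id
    """
    return connection.execute(query, (min_merit,)).fetchall()

confident = matched_pairs(conn, 0.9)
everything = matched_pairs(conn, 0.0)
```

Keeping the method and figure of merit alongside each record lets a user filter associations to suit their own analysis without altering the original tables.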
15. Associations in the VO
- Basically, something like Greg's picture
- Start with a large dose of SkyQuery
- Add possibility of running user-defined algorithms on dataset from proximity cut
- Pre-compute cross-neighbours for big DBs
- Distributed Association Service to record matches made, and methods used?