Four Talks - PowerPoint PPT Presentation

1 / 35
About This Presentation
Title:

Four Talks

Description:

Storage/archive NOT the problem (it's almost free) Curating/Publishing is expensive. ... So, the Internet is the world's best telescope: It has data on every ... – PowerPoint PPT presentation

Number of Views:56
Avg rating:3.0/5.0
Slides: 36
Provided by: jimg178
Category:
Tags: best | finder | four | free | people | talks

less

Transcript and Presenter's Notes

Title: Four Talks


1
Four Talks
Jim Gray, Microsoft Alex Szalay, Ani Thakar, Jan
Vandenberg, JHU Chris Stoughton, Fermilab
  • The article we actually wroteOnline Scientific
    Publication, Curation, Archiving
  • The advertised talk Computer Science Challenges
    in the VO
  • A Web Server (SkyServer) tour
  • A Web Service (SdssCutout) tour
  • These lead up to Alex Szalays talk on Web
    Services

2
The Paper We WroteOnline Scientific
Publication, Curation, Archiving
  • Jim Gray, Microsoft
  • Alex Szalay, Ani Thakar, Jan Vandenberg, JHU
  • Chris Stoughton, Fermilab

3
Outline
  • Virtual Observatory will be an ecosystem
    ofauthors, curators, publishers, archivers,
    readerscontributing using shared data.
  • The process and rolesauthor, curator, publisher,
    archiver roles are changing
  • Ephemeral derivedMust capture ephemeral
    information.All design metadata info is
    ephemeralCan tradeoff recomputing derived data
  • EconomicsPublish/Archive cost is zero,
    Author/Curator cost dominates
  • SDSSdata inflationwhat we are doing

4
Premise
  • Once published, scientific data needs to be
    available forever,so that the science can be
    reproduced/extended.
  • What does that mean?
  • Data
  • Ephemeral Data could not be reproduced
  • Stable data could be drived from emphemeral
    data.
  • Meta-data how the data was collected/derivedis
    ephemeral
  • Must be preserved
  • Includes design docs, software, email, pubs,
    personal notes

5
Changing Roles
  • Exponential growth
  • Projects last at least 3-5 years
  • Project data online during project lifetime.
  • Data sent to central archive only at the end of
    the project
  • At any instant, only 1/8 of data is centralized
  • New project responsibilities
  • Becoming Publishers and Curators
  • Larger fraction of budget spent on software
  • Standards are needed
  • Easier data interchange, fewer tools
  • Templates are needed
  • Much development duplicated, wasted

6
Publishing Data
Roles Authors Publishers Curators Archives Consume
rs
Traditional Scientists Journals Libraries Archives
Scientists
Emerging Collaborations Project web site DataDoc
Archives Digital Archives Scientists
7
The Core Problem No Economic Model
  • The archive user has not yet been born. How can
    he pay you to curate the data?
  • The Scientist gathered data for his own
    purposeWhy should he pay (invest time) for your
    needs?
  • Answer to both thats the scientific method
  • Curating data (documenting the design, the
    acquisition and the processing)Is very hard and
    there is no reward for doing it.The results are
    rewarded, not the process of getting them.
  • Storage/archive NOT the problem (its almost
    free)
  • Curating/Publishing is expensive.

8
What SDSS is Doing Capture the Bits
  • Best-effort documenting data and process.
  • Publishing data often by UPS( 5TB today (dr1)
    and so 15k for a copy)
  • Replicating data on 3 continents.
  • EVERYTHING online (tape data is dead data)
  • Archiving all email, discussions, .
  • Keeping all web-logs.
  • Now we need to figure out how to organize/search
    all this metadata.

9
SDSS Data Inflation Data Pyramid
  • Level 2Derived data products 10x smaller But
    there are many catalogs.
  • Publish new edition each year
  • Fixes bugs in data.
  • Must preserve old editions
  • Creates data pyramid
  • Store each edition
  • 1, 2, 3, 4 N N2 bytes
  • Net Data Inflation L2 L1
  • Level 1AGrows 5TB pixels/year growing to
    25TB 2 TB/y compressed growing to 13TB 4
    TB today (level 1A in NASA terms)

10
Summary
  • Virtual Observatory will be an ecosystem
    ofauthors, curators, publishers, archivers,
    readerscontributing using shared data.
  • The process and roles are changing author
    project publisher curator
  • Ephemeral stable data Capture ephemeral
    information. All design metadata info is
    ephemeralCan tradeoff recomputing derived data
  • EconomicsAuthor/Curate cost dominates
  • SDSSData Inflation, Data Pyramid

11
Four Talks
  • The article we actually wroteOnline Scientific
    Publication, Curation, Archiving
  • The advertised talk Computer Science Challenges
    in the VO
  • A Web Server (SkyServer) tour
  • A Web Service (SdssCutout) tour
  • These lead up to Alex Szalays talk on Web
    Services

12
The Advertised TalkComputer Science Challenges
in the VO
  • Jim Gray, Microsoft

13
Virtual Observatory
  • Premise Most data is (or could be online)
  • So, the Internet is the worlds best telescope
  • It has data on every part of the sky
  • In every measured spectral band optical, x-ray,
    radio..
  • As deep as the best instruments (2 years ago).
  • It is up when you are up.The seeing is always
    great (no working at night, no clouds no moons
    no..).
  • Its a smart telescope links objects and
    data to literature on them.

14
Virtual ObservatoryData Federation of Web
Services
  • Massive datasets live near their owners
  • Near the instruments software pipeline
  • Near the applications
  • Near data knowledge and curation
  • Computer centers become Data Centers
  • Archives are replicated for
  • Performance
  • Availability/Reliability
  • Each Archive publishes a web service
  • Schema documents the data
  • Methods on objects (queries)
  • Scientists get personalized extracts
  • Uniform access to multiple Archives
  • A common global schema

15
Some Unique Things About Astro Data
  • There is a desire to compare data from different
    instruments
  • Most astronomers publish their data (especially
    surveys)
  • Combining data from different instruments gives
    more info
  • Szalay observes Metcalfs law utility grows as
    N2
  • This is less true in some other fields
  • Its tractable
  • sizes fit in current regimes (10s of terabytes
    today)
  • tasks fit Beowulfs
  • Astro data is great sandbox for CS research.
  • High-dimensional data
  • Temporal, spatial, image datatypes
  • Few privacy/commercial concerns
  • There is lots of it

16
My 1 Challenge going beyond files(a file is an
array of bytes)Science vs Commerce
  • Data in files FTP a local copy /subset.ASCII or
    Binary.
  • Each scientist builds own analysis toolkit
  • Analysis is tcl script of toolkit on local data.
  • Some simple visualization tools x vs y
  • Data in a database
  • Standard reports for standard things.
  • Report writers for non-standard things
  • GUI tools to explore data.
  • Decision trees
  • Clustering
  • Anomaly finders

17
Butsome science is hitting a wallFTP and GREP
are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1PB 10,000 disks
  • At some point you need indices to limit
    search parallel data search and analysis
    search and analysis tools
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min ( 1 /GB)
  • 2 days and 1K
  • 3 years and 1M

18
Whats needed?(not drawn to scale)
19
CS Challenges For Astronomers
  • Objectify your field
  • Precisely define what you are talking about.
  • Objects and Methods / Attributes
  • This is REALLY difficult.
  • UCDs are a great start but, there is a long way
    to go
  • Software is like entropy, it always increases.
    -- Norman Augustine, Augustines Laws
  • Beware of legacy software cost can eat you
    alive
  • Share software where possible.
  • Use standard software where possible.
  • Expect it will cost you 25 to 40 of project. ?
  • Explain what you want to do with the VO
  • 20 queries or something like that.

20
Challenge to Data Miners Linear and Sub-Linear
Algorithms
Techniques
  • Today most correlation / clustering
    algorithmsare polynomial N2 or N3 or
  • N2 is VERY big when N is big (1018 is big)
  • Need sub-linear algorithms
  • Current approaches are near optimal given
    current assumptions.
  • So, need new assumptionsprobably heuristic and
    approximate

21
Challenge to Data Miners Rediscover Astronomy
  • Astronomy needs deep understanding of physics.
  • But, some was discovered as variable
    correlations then explained with physics.
  • Famous example Hertzsprung-Russell Diagramstar
    luminosity vs color (temperature)
  • Challenge 1 (the student test) How much of
    astronomy can data mining discover?
  • Challenge 2 (the Turing test)Can data mining
    discover NEW correlations?

22
Plumbers Organize and Search Petabytes
  • Automate
  • instrument-to-archive pipelinesIt is is a messy
    business very labor intensiveMost current
    designs do not scale (too many manual
    steps)BaBar (1TB/day) and ESO pipeline seem
    promising.A job-scheduling or workflow system
  • Physical Database design access
  • Data access patterns are difficult to anticipate
  • Aggressively and automatically use indexing,
    sub-setting.
  • Search in parallel
  • Goals
  • Answer easy queries in 10 seconds.
  • Answer hard queries (correlations) in 10 minutes.

23
Q How can a computer scientist help,
without learning a LOT of Astronomy?A Scenario
Design 20 questions.
  • Astronomers proposed 20 questions Typical of
    things they want to do Each would require a
    week (or month) of programming in tcl / C/
    FTP
  • Goal, make it easy to answer questions
  • DB and tools design motivated by this goal
  • Implemented DB utility procedures
  • JHU Built GUI for Linux clients

24
The 20 Queries
  • Q11 Find all elliptical galaxies with spectra
    that have an anomalous emission line.
  • Q12 Create a grided count of galaxies with u-ggt1
    and rlt21.5 over 60ltdeclinationlt70, and 200ltright
    ascensionlt210, on a grid of 2, and create a map
    of masks over the same grid.
  • Q13 Create a count of galaxies for each of the
    HTM triangles which satisfy a certain color cut,
    like 0.7u-0.5g-0.2ilt1.25 rlt21.75, output it in
    a form adequate for visualization.
  • Q14 Find stars with multiple measurements and
    have magnitude variations gt0.1. Scan for stars
    that have a secondary object (observed at a
    different time) and compare their magnitudes.
  • Q15 Provide a list of moving objects consistent
    with an asteroid.
  • Q16 Find all objects similar to the colors of a
    quasar at 5.5ltredshiftlt6.5.
  • Q17 Find binary stars where at least one of them
    has the colors of a white dwarf.
  • Q18 Find all objects within 30 arcseconds of one
    another that have very similar colors that is
    where the color ratios u-g, g-r, r-I are less
    than 0.05m.
  • Q19 Find quasars with a broad absorption line in
    their spectra and at least one galaxy within 10
    arcseconds. Return both the quasars and the
    galaxies.
  • Q20 For each galaxy in the BCG data set
    (brightest color galaxy), in 160ltright
    ascensionlt170, -25ltdeclinationlt35 count of
    galaxies within 30"of it that have a photoz
    within 0.05 of that galaxy.
  • Q1 Find all galaxies without unsaturated pixels
    within 1' of a given point of ra75.327,
    dec21.023
  • Q2 Find all galaxies with blue surface
    brightness between and 23 and 25 mag per square
    arcseconds, and -10ltsuper galactic latitude (sgb)
    lt10, and declination less than zero.
  • Q3 Find all galaxies brighter than magnitude 22,
    where the local extinction is gt0.75.
  • Q4 Find galaxies with an isophotal surface
    brightness (SB) larger than 24 in the red band,
    with an ellipticitygt0.5, and with the major axis
    of the ellipse having a declination of between
    30 and 60arc seconds.
  • Q5 Find all galaxies with a deVaucouleours
    profile (r¼ falloff of intensity on disk) and the
    photometric colors consistent with an elliptical
    galaxy. The deVaucouleours profile
  • Q6 Find galaxies that are blended with a star,
    output the deblended galaxy magnitudes.
  • Q7 Provide a list of star-like objects that are
    1 rare.
  • Q8 Find all objects with unclassified spectra.
  • Q9 Find quasars with a line width gt2000 km/s and
    2.5ltredshiftlt2.7.
  • Q10 Find galaxies with spectra that have an
    equivalent width in Ha gt40Å (Ha is the main
    hydrogen spectral line.)

Also some good queries at http//www.sdss.jhu.edu
/ScienceArchive/sxqt/sxQT/Example_Queries.html
25
Two kinds of SDSS data in an SQL DB(objects and
images all in DB)
  • 15M Photo Objects 400 attributes

50K Spectra with 30 lines/ spectrum
26
An Easy QueryQ15 Find asteroids
  • Sounds hard but there are 5 pictures of the
    object at 5 different times (color filters) and
    so can see velocity.
  • Image pipeline computes velocity.
  • Computing it from the 5 color x,y would also be
    fast
  • Finds 1,303 objects in 3 minutes,
    140MBps. (could go 2x faster with more disks)

select objId, dbo.fGetUrlEq(ra,dec) as url
--return object ID url sqrt(power(rowv,2)powe
r(colv,2)) as velocity from photoObj --
check each object. where (power(rowv,2)
power(colv, 2)) -- square of velocity
between 50 and 1000 -- huge values error
27
Q15 Fast Moving Objects
  • Find near earth asteroids

SELECT r.objID as rId, g.objId as gId,
dbo.fGetUrlEq(g.ra, g.dec) as url FROM PhotoObj
r, PhotoObj g WHERE r.run g.run and
r.camcolg.camcol and abs(g.field-r.field)lt2
-- nearby -- the red selection criteria and
((power(r.q_r,2) power(r.u_r,2)) gt 0.111111
) and r.fiberMag_r between 6 and 22 and
r.fiberMag_r lt r.fiberMag_g and r.fiberMag_r lt
r.fiberMag_i and r.parentID0 and r.fiberMag_r lt
r.fiberMag_u and r.fiberMag_r lt
r.fiberMag_z and r.isoA_r/r.isoB_r gt 1.5 and
r.isoA_rgt2.0 -- the green selection
criteria and ((power(g.q_g,2) power(g.u_g,2))
gt 0.111111 ) and g.fiberMag_g between 6 and 22
and g.fiberMag_g lt g.fiberMag_r and
g.fiberMag_g lt g.fiberMag_i and g.fiberMag_g lt
g.fiberMag_u and g.fiberMag_g lt g.fiberMag_z and
g.parentID0 and g.isoA_g/g.isoB_g gt 1.5 and
g.isoA_g gt 2.0 -- the matchup of the pair and
sqrt(power(r.cx -g.cx,2) power(r.cy-g.cy,2)power
(r.cz-g.cz,2))(10800/PI())lt 4.0 and
abs(r.fiberMag_r-g.fiberMag_g)lt 2.0
28
(No Transcript)
29
Data Visualization(and human-computer interface)
  • Make it easy to ask questions
  • Make it easy to understand the answers.
  • Bad news we have had no takers on the
    visualization 20 questions
  • This is still a VERY retro area.
  • But. The following demos show some progress.

30
Four Talks
  • The article we actually wroteOnline Scientific
    Publication, Curation, Archiving
  • The advertised talk Computer Science Challenges
    in the VO
  • A Web Server (SkyServer) tour
  • A Web Service (SdssCutout) tour
  • These lead up to Alex Szalays talk on Web
    Services

31
SkyServer Tourhttp//skyserver.sdss.org/
  • Shows benefit of a database
  • everything online
  • Easy to find things index helps
  • Automatic parallel search is essential
  • Beware
  • Im a lunatic re using databases for everything
  • Most people do not put images in DB
  • I do, because it is
  • Simpler
  • Easier to manage
  • The right thing to do.

32
Four Talks
  • The article we actually wroteOnline Scientific
    Publication, Curation, Archiving
  • The advertised talk Computer Science Challenges
    in the VO
  • A Web Server (SkyServer) tour
  • A Web Service (SdssCutout) tour
  • Leads up to Alex Szalays talk on Web Services

33
Whats a Web Service
  • Web SERVER
  • Given a url parameters
  • Returns a web page (often dynamic)
  • Web SERVICE
  • Given a XML document (soap msg)
  • Returns an XML document
  • Tools make this look like an RPC.
  • F(x,y,z) returns (u, v, w)
  • Distributed objects for the web.
  • naming, discovery, security,..
  • Internet-scale distributed computing

You
Web Server
http url
Web page
Your program
Web Service
soap
Data In your address space
objectin xml
34
Data Federations of Web Services
  • Massive datasets live near their owners
  • Near the instruments software pipeline
  • Near the applications
  • Near data knowledge and curation
  • Super Computer centers become Super Data Centers
  • Each Archive publishes web services
  • Schema documents the data
  • Methods on objects (queries)
  • Scientists get personalized extracts
  • Uniform access to multiple Archives
  • A common global schema

Federation
35
Grid and Web Services Synergy
  • I believe the Grid will be many web services
  • IETF standards Provide
  • Naming
  • Authorization / Security / Privacy
  • Distributed Objects
  • Discovery, Definition, Invocation, Object Model
  • Higher level services workflow, transactions,
    DB,..
  • Synergy commercial Internet Grid tools

36
SDSS Cutouthttp//SkyService.pha.jhu.edu/SdssCuto
ut/
  • A simple web service
  • You can have a copy of the code
  • Needs an online database backend

37
Four Talks
  • The article we actually wroteOnline Scientific
    Publication, Curation, Archiving
  • The advertised talk Computer Science Challenges
    in the VO
  • A Web Server (SkyServer) tour
  • A Web Service (SdssCutout) tour
  • Leads up to Alex Szalays talk on Web Services

38
References and Links
  • SkyServer
  • http//skyserver.sdss.org/
  • http//SkyService.pha.jhu.edu/SdssCutout/
  • Virtual Observatory
  • http//www.us-vo.org/
  • http//www.voforum.org/
  • World-Wide Telescope
  • paper in ScienceV.293 pp. 2037-2038. 14 Sept
    2001. (MS-TR-2001-77 word or pdf.)
  • SDSS DB
  • Get your personal copy athttp//research.microsof
    t.com/gray/sdss
Write a Comment
User Comments (0)
About PowerShow.com