Title: Where The Rubber Meets the Sky: Giving Access to Science Data
1Where The Rubber Meets the Sky: Giving Access to
Science Data
- Jim Gray
- Microsoft Research
- Gray_at_Microsoft.com
- http://research.Microsoft.com/Gray
- Alex Szalay
- Johns Hopkins University
- Szalay_at_JHU.edu
2Outline
- Want to build a TerraServer for Hungary?
- My view of eScience
3TerraServer / TerraService: http://terraService.Net
/ http://TerraServer-USA.com/
- USGS Photo of US
- Online since June 1998
- Operated by Microsoft
- 20 TB data source
- 10 M web hits/day
- A web service
- Our laboratory
- I recommend you clone it for Hungary
- 100x less data (92k km²), very useful
- Education, land management, science
- Info framework.
4TerraServer Today: LOW TCO
- Storage Bricks
- Commodity servers
- 4 TB raw / 2 TB Raid1 SATA storage
- Dual 2 GHz CPUs, 4 GB RAM
- Bunch
- 3 Bricks TerraServer data
- Data partitioned
- Low-Cost Availability: Pair and Spare
- RAID1 Mirroring
- Mirrored Bunches
- Spare Brick
- Web Application
- Load balances mirrors
- Uses surviving database on failure
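The availability scheme on this slide (the web application load-balances across mirrors and uses the surviving database on failure) can be sketched as follows. This is a minimal illustration under stated assumptions, not TerraServer's code: `MirroredStore` and the callable replicas are hypothetical stand-ins for the mirrored databases.

```python
import itertools


class MirroredStore:
    """Round-robins reads across mirrored replicas and skips replicas that fail.

    Each replica is a hypothetical callable taking a key and returning a value,
    or raising ConnectionError when that database is down.
    """

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._next = itertools.cycle(range(len(self.replicas)))

    def read(self, key):
        last_error = None
        # Try each replica at most once, starting from the next one in rotation.
        for _ in range(len(self.replicas)):
            idx = next(self._next)
            try:
                return self.replicas[idx](key)
            except ConnectionError as exc:
                last_error = exc  # this mirror is down; fall through to the survivor
        raise RuntimeError("all mirrors failed") from last_error


def healthy_replica(key):
    return f"tile:{key}"


def failed_replica(key):
    raise ConnectionError("brick offline")


store = MirroredStore([failed_replica, healthy_replica])
tile = store.read("x10_y20")  # served by the surviving mirror
```

With both replicas healthy, successive reads alternate between them; with one down, every read lands on the survivor.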
5Outline
- Want to build a TerraServer for Hungary?
- My view of eScience
6New Science Paradigms
- Thousand years ago: science was empirical
- describing natural phenomena
- Last few hundred years: theoretical branch
- using models, generalizations
- Last few decades: a computational branch
- simulating complex phenomena
- Today: data exploration (eScience)
- unify theory, experiment, and simulation
- using data management and statistics
- Data captured by instruments or generated by
simulator
- Processed by software
- Scientist analyzes database / files
7Information Avalanche and eScience
- In science, industry, government, ...
- better observational instruments and
- better simulations
- producing a data avalanche
- New emphasis on informatics
- Capturing, Organizing, Summarizing, Analyzing,
Visualizing
- Each science is objectifying itself
- Defining core concepts
- Integrating all data and literature online
- Hungary could be a leader in this (you have the
Martians, great tech education)
Image courtesy C. Meneveau and A. Szalay _at_ JHU
BaBar, Stanford
PE Gene Sequencer, from http://www.genome.uci.edu/
Space Telescope
8The Big Picture
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it?
- How to coexist with others?
- Data Query and Visualization tools
- Support/training
- Performance
- Execute queries in a minute
- Batch (big) query scheduling
9The Virtual Observatory
- Premise: most data is (or could be) online
- The Internet is the world's best telescope
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray,
radio...
- As deep as the best instruments (2 years ago).
- It is up when you are up
- The seeing is always great
- It's a smart telescope: links objects and
data to literature
- Software is the capital expense
- Share, standardize, reuse..
10What X-info Needs from us (CS) (not drawn to
scale)
11Data Access Hitting a Wall
- Current science practice based on data download
(FTP/GREP)
- Will not scale to the datasets of
tomorrow
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh!, and 1 PB is about 5,000 disks
- At some point you need indices to limit
search, and parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (about $1)
- ... 1 TB in 2 days and $1K
- ... 1 PB in 3 years and $1M
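The scaling argument on this slide is simple arithmetic: a sequential scan grows linearly with data size, while an index lookup grows only logarithmically. The sketch below assumes a sustained scan rate of 10 MB/s (an illustrative figure, not one stated on the slide) and a binary search over a sorted index; both function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes, bytes_per_second=10e6):
    """Time to read every byte once at an assumed sustained rate (default 10 MB/s)."""
    return size_bytes / bytes_per_second


def index_probes(num_records):
    """Comparisons needed by a binary search over a sorted index of num_records keys."""
    return math.ceil(math.log2(num_records)) if num_records > 1 else 1


terabyte_days = scan_seconds(1e12) / SECONDS_PER_DAY    # roughly a day at 10 MB/s
petabyte_years = scan_seconds(1e15) / SECONDS_PER_YEAR  # roughly three years at 10 MB/s
probes_for_trillion = index_probes(10**12)              # a few dozen comparisons
```

At the assumed rate the results land in the same order of magnitude as the slide's figures (days for a terabyte, years for a petabyte), whereas an indexed lookup over a trillion records needs only about 40 comparisons.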
12Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: dark matter, dark energy, turbulence,
ecosystem dynamics
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N², likelihood
techniques N³
- As data and computers grow at Moore's Law, we
can only keep up with N log N
- A way out?
- Relax the notion of optimal (data is fuzzy, answers are
approximate)
- Don't assume infinite computational resources or
memory
- Requires combination of statistics and computer
science
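To make the N² versus N log N contrast concrete, here is a small sketch that counts pairs of points lying within a distance r of each other, the basic ingredient of a correlation function. It is a one-dimensional toy, chosen for brevity, and is not the method used by any survey pipeline; the function names are hypothetical.

```python
from bisect import bisect_right


def count_pairs_naive(xs, r):
    """O(N^2): compare every pair of points."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )


def count_pairs_sorted(xs, r):
    """O(N log N): sort once, then binary-search the right edge of each window."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        hi = bisect_right(s, x + r)  # first index whose value exceeds x + r
        total += hi - i - 1          # points after i that fall inside the window
    return total


points = [0, 1, 3, 7, 8]
naive = count_pairs_naive(points, 2)
fast = count_pairs_sorted(points, 2)
```

Both functions return the same count; only the second keeps pace when the number of points grows by orders of magnitude. In two or three dimensions the same idea needs a spatial index (a grid or tree) in place of the sort.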
13Smart Data: Unifying DB and Analysis
- There is too much data to move around
- Do data manipulations at database
- Build custom procedures and functions into DB
- Unify data access and analysis
- Examples
- Statistical sampling and analysis
- Temporal and spatial indexing
- Pixel processing
- Automatic parallelism
- Auto (re)organize
- Scalable to Petabyte datasets
Move Mohamed to the mountain, not the mountain to
Mohamed.
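As an illustration of building custom functions into the database and moving only the answer, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a science archive. The table `photo`, its columns, and the function `color_ug` are hypothetical names invented for this example.

```python
import sqlite3


def color_summary(rows, cut):
    """Count objects whose u-g color exceeds `cut` and return their mean color.

    The color computation runs inside the database engine; only the
    two-number summary is returned to the caller.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE photo (obj_id INTEGER PRIMARY KEY, u REAL, g REAL)")
    conn.executemany("INSERT INTO photo VALUES (?, ?, ?)", rows)
    # Register a custom scalar function so SQL queries can call it per row.
    conn.create_function("color_ug", 2, lambda u, g: u - g)
    count, mean = conn.execute(
        "SELECT COUNT(*), AVG(color_ug(u, g)) FROM photo WHERE color_ug(u, g) > ?",
        (cut,),
    ).fetchone()
    conn.close()
    return count, mean


catalog = [(1, 19.0, 18.0), (2, 20.5, 18.0), (3, 21.0, 20.75)]
summary = color_summary(catalog, 0.5)
```

The caller never sees the raw rows: the filter and the aggregate both execute where the data lives, which is the point of the slide.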
14Experiment Budgets: ¼ to ½ Software
- Millions of lines of code
- Repeated for experiment after experiment
- Not much sharing or learning
- Let's work to change this
- Identify generic tools
- Workflow schedulers
- Databases and libraries
- Analysis packages
- Visualizers
- Software for
- Instrument scheduling
- Instrument control
- Data gathering
- Data reduction
- Database
- Analysis
- Visualization
Simulations (computational science) are ½
software
15How to Help?
- Can't learn the discipline before you
start (takes 4 years)
- Can't go native: you are a CS person, not a
bio person
- Have to learn how to communicate
- Have to learn the language
- Have to form a working relationship with domain
expert(s)
- Have to find problems that leverage your skills
16Working Cross-Culture: A Way to Engage With
Domain Scientists
- Find someone who is desperate for help
- Communicate in terms of scenarios
- Work on a problem that gives 100x benefit
- Weeks/task vs hours/task
- Solve 20% of the problem
- The other 80% will take decades
- Prototype
- Go from working-to-working; always have
- Something to show
- Clear next steps
- Clear goal
- Avoid death-by-collaboration-meetings.
17Working Cross-Culture -- 20 Questions: A Way to
Engage With Domain Scientists
- Astronomers proposed 20 questions
- Typical of things they want to do
- Each would require a week or more in old way
(programming in tcl / C / FTP)
- Goal: make it easy to answer questions
- This goal motivates DB and tools design
18The 20 Queries
- Q1: Find all galaxies without saturated pixels
within 1' of a given point of ra=75.327,
dec=21.023.
- Q2: Find all galaxies with blue surface
brightness between 23 and 25 mag per square
arcsecond, and -10 […]
- Q3: Find all galaxies brighter than magnitude 22,
where the local extinction is >0.75.
- Q4: Find galaxies with an isophotal surface
brightness (SB) larger than 24 in the red band,
with an ellipticity >0.5, and with the major axis
of the ellipse having a declination of between
30 and 60 arc seconds.
- Q5: Find all galaxies with a de Vaucouleurs
profile (r^(1/4) falloff of intensity on disk) and the
photometric colors consistent with an elliptical
galaxy. The de Vaucouleurs profile […]
- Q6: Find galaxies that are blended with a star,
output the deblended galaxy magnitudes.
- Q7: Provide a list of star-like objects that are
1% rare.
- Q8: Find all objects with unclassified spectra.
- Q9: Find quasars with a line width >2000 km/s and
2.5 […]
- Q10: Find galaxies with spectra that have an
equivalent width in Ha >40Å (Ha is the main
hydrogen spectral line.)
- Q11: Find all elliptical galaxies with spectra
that have an anomalous emission line.
- Q12: Create a gridded count of galaxies with u-g>1
and r […] ascension […] of masks over the same grid.
- Q13: Create a count of galaxies for each of the
HTM triangles which satisfy a certain color cut,
like 0.7u-0.5g-0.2i […] a form adequate for visualization.
- Q14: Find stars with multiple measurements and
have magnitude variations >0.1. Scan for stars
that have a secondary object (observed at a
different time) and compare their magnitudes.
- Q15: Provide a list of moving objects consistent
with an asteroid.
- Q16: Find all objects similar to the colors of a
quasar at 5.5 […]
- Q17: Find binary stars where at least one of them
has the colors of a white dwarf.
- Q18: Find all objects within 30 arcseconds of one
another that have very similar colors, that is,
where the color ratios u-g, g-r, r-i are less
than 0.05m.
- Q19: Find quasars with a broad absorption line in
their spectra and at least one galaxy within 10
arcseconds. Return both the quasars and the
galaxies.
- Q20: For each galaxy in the BCG data set
(brightest color galaxy), in 160 […] ascension […]
galaxies within 30" of it that have a photoz
within 0.05 of that galaxy.
Also some good queries at
http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
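Several of the 20 queries are spatial, for example Q1 asks for objects within 1' of a given point. The sketch below shows the geometry behind such a cone search using the haversine formula for angular separation. It scans a plain Python list; a real archive would use a spatial index (Q13 mentions HTM triangles) instead of testing every object. The function names and the dictionary layout are assumptions made for this example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (degrees in)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(objects, ra, dec, radius_arcmin):
    """Return the objects whose position lies within radius_arcmin of (ra, dec)."""
    radius_deg = radius_arcmin / 60.0
    return [
        o
        for o in objects
        if angular_separation_deg(o["ra"], o["dec"], ra, dec) <= radius_deg
    ]


sky = [
    {"id": 1, "ra": 75.327, "dec": 21.023},  # at the search center
    {"id": 2, "ra": 75.327, "dec": 21.033},  # 0.6 arcminutes north
    {"id": 3, "ra": 75.327, "dec": 21.123},  # 6 arcminutes north
]
nearby = cone_search(sky, ra=75.327, dec=21.023, radius_arcmin=1.0)
```

Here `nearby` contains the first two objects and excludes the third, which lies well outside the one-arcminute radius.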
19http://SkyServer.sdss.org
- Solves the 20 queries
- Has 150 hours of online instruction
- Translated to Hungarian
- Professional astronomers use it as the SDSS
Science Catalog Analysis Service.
- Clone operating in Hungary.
20SkyQuery (http://skyquery.net/)
- Distributed Query tool using a set of web
services
- Many astronomy archives from Pasadena, Chicago,
Baltimore, Cambridge (England)
- Has grown from 4 to 15 archives, now becoming
an international standard
- Allows queries like
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) […] 6,6.5) AND o.type=3 and (o.I - t.m_j)>2
21SkyQuery Structure
- Portal is
- Plans Query (2 phase)
- Integrates answers
- Is itself a web service
- Each SkyNode publishes
- Schema Web Service
- Database Web Service
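The slide only says that the portal plans the query in two phases and integrates the answers. One plausible reading, sketched below purely as an assumption, is that phase one asks each node how many rows qualify and phase two visits the nodes from smallest to largest, cross-matching as it goes. The in-memory `nodes` dictionary stands in for the SkyNode web services, and every name here is hypothetical.

```python
def plan_query(nodes, predicate):
    """Phase 1 (assumed): count qualifying rows per node, order smallest first."""
    counts = {
        name: sum(1 for row in rows if predicate(row))
        for name, rows in nodes.items()
    }
    return sorted(counts, key=counts.get)


def run_query(nodes, predicate, match):
    """Phase 2 (assumed): follow the plan, cross-matching node by node."""
    plan = plan_query(nodes, predicate)
    if not plan:
        return plan, []
    # Start with the smallest node so intermediate results stay small.
    result = [(row,) for row in nodes[plan[0]] if predicate(row)]
    for name in plan[1:]:
        candidates = [row for row in nodes[name] if predicate(row)]
        # Keep only tuples that the next node can extend with a matching row.
        result = [
            tup + (row,)
            for tup in result
            for row in candidates
            if match(tup[-1], row)
        ]
    return plan, result


archives = {
    "A": [{"id": 1, "x": 0.0}, {"id": 2, "x": 5.0}, {"id": 3, "x": 9.0}],
    "B": [{"id": 10, "x": 0.1}],
}
plan, matches = run_query(
    archives,
    predicate=lambda row: True,
    match=lambda a, b: abs(a["x"] - b["x"]) < 0.5,
)
```

Ordering by count means the portal ships the smallest candidate set between archives, which matters when the archives sit in different cities.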
22MyDB: eScience Workbench
- Prototype of bringing analysis to the data
- Everybody gets a workspace (database)
- Executes analysis at the data
- Store intermediate results there
- Long queries run in batch
- Results shared within groups
- Only fetch the final results
- Extremely successful: matches work patterns
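The workspace pattern on this slide (store intermediate results next to the data, fetch only the final answer) can be sketched with Python's built-in `sqlite3` module standing in for the archive. The table names `photo` and `mydb_bright` and both helper functions are hypothetical.

```python
import sqlite3


def make_archive(rows):
    """Build a toy archive table; stands in for the shared science database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE photo (obj_id INTEGER PRIMARY KEY, r REAL)")
    conn.executemany("INSERT INTO photo VALUES (?, ?)", rows)
    return conn


def analyze_in_workspace(conn, r_limit):
    """Keep the intermediate selection server-side and return only a summary."""
    # The intermediate result lives in a workspace table beside the data.
    conn.execute("CREATE TABLE mydb_bright (obj_id INTEGER, r REAL)")
    conn.execute(
        "INSERT INTO mydb_bright SELECT obj_id, r FROM photo WHERE r < ?",
        (r_limit,),
    )
    # Only this small final answer is fetched by the client.
    return conn.execute("SELECT COUNT(*), MIN(r) FROM mydb_bright").fetchone()


archive = make_archive([(1, 17.5), (2, 19.0), (3, 16.25)])
final_answer = analyze_in_workspace(archive, r_limit=18.0)
```

Later queries can join against `mydb_bright` without recomputing it, and a group sharing the workspace sees the same intermediate table.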
23Summary
- Computational Science
- Simulation
- Data Bases
- Analysis (organization and mining)
- needed by simulations and
- Experiments
- Visualization
- Each Science X
- Has a comp-X branch
- getting an X-info branch
- Objectifying that science: defining terms
precisely
- This broadening is multi-disciplinary
- Pair good domain scientist with good computer
scientist
- Chemistry is important
- A concrete way to approach Grid-computing.
24Outline
- Want to build a TerraServer for Hungary?
- Could be done inexpensively (if you have the
data)
- Microsoft would license the software to you
- My view of eScience and Hungary
- Hungary can't lead in hardware
- Hungary CAN lead in software
- Algorithms: data mining, analysis
- Tools that implement the algorithms
- Systems: learn by doing
- Could start an industry, fits EU agenda.