eScience -- A Transformed Scientific Method

1
eScience -- A Transformed Scientific Method

Jim Gray,
eScience Group,
Microsoft Research
http://research.microsoft.com/Gray
in collaboration with Alex Szalay
Dept. of Physics and Astronomy
Johns Hopkins University
http://www.sdss.jhu.edu/szalay/

2
Talk Goals

Explain eScience (and what I am doing)
Recommend that CSTB foster tools for:
data capture (lab info management systems)
data curation (schemas, ontologies, provenance)
data analysis (workflow, algorithms, databases, data visualization)
data + doc publication (active docs, data-doc integration)
peer review (editorial services)
access (doc + data archives and overlay journals)
scholarly communication (wikis for each article and dataset)

3
eScience: What Is It?

Synthesis of information technology and science.
Science methods are evolving (tools).
Science is being codified/objectified. How do we represent scientific information and knowledge in computers?
Science faces a data deluge. How do we manage and analyze the information?
Scientific communication is changing: publishing data and literature (curation, access, preservation).

4
Science Paradigms

A thousand years ago: science was empirical, describing natural phenomena
Last few hundred years: a theoretical branch, using models and generalizations
Last few decades: a computational branch, simulating complex phenomena
Today: data exploration (eScience), unifying theory, experiment, and simulation
Data captured by instruments or generated by simulators
Processed by software
Information/knowledge stored in computers
Scientist analyzes databases/files using data management and statistics

5
X-Info

The evolution of X-Info and Comp-X
for each discipline X
How to codify and represent our knowledge

The Generic Problems

Data ingest
Managing a petabyte
Common schema
How to organize it
How to reorganize it
How to share with others

Query and Vis tools
Building and executing models
Integrating data and Literature
Documenting experiments
Curation and long-term preservation

6
Experiment Budgets: ¼ to ½ Software

Millions of lines of code
Repeated for experiment after experiment
Not much sharing or learning
CS can change this
Build generic tools
Workflow schedulers
Databases and libraries
Analysis packages
Visualizers

Software for
Instrument scheduling
Instrument control
Data gathering
Data reduction
Database
Analysis
Modeling
Visualization

7
Experiment Budgets: ¼ to ½ Software

Millions of lines of code
Repeated for experiment after experiment
Not much sharing or learning
CS can change this
Build generic tools
Workflow schedulers
Databases and libraries
Analysis packages
Visualizers

Software for
Instrument scheduling
Instrument control
Data gathering
Data reduction
Database
Analysis
Modeling
Visualization

Action item: Foster tools and foster tool support
8
Project Pyramids
In most disciplines there are a few "giga" projects, several "mega" consortia, and then many small labs.
Often some instrument creates the need for a giga- or mega-project:
Polar station
Accelerator
Telescope
Remote sensor
Genome sequencer
Supercomputer
Tier 1, 2, 3 facilities to use instrument data
9
Pyramid Funding

Giga projects need giga funding: Major Research Equipment grants
Need projects at all scales
Computing example: supercomputers, departmental clusters, lab clusters
Technical + social issues
Fully fund giga projects, fund ½ of smaller projects; they get matching funds from other sources
"Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World," IEEE Computer, V. 39.1, pp. 110-112, January 2006.

10
Action item: Invest in tools at all levels
11
Need Lab Info Management Systems (LIMSs)

Pipeline: instrument + simulator data to archive, then publish to web.
NASA: Level 0 (raw) data, Level 1 (calibrated), Level 2 (derived)
Needs a workflow tool to manage the pipeline
Build prototypes.
Examples:
SDSS, LifeUnderYourFeet, MBARI Shore Side Data System.
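
A minimal sketch of a raw-to-derived pipeline of this kind. Only the Level 0/1/2 naming comes from the slide; the `Product` record, the `calibrate` and `derive` stages, and the linear gain/offset calibration are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Product:
    level: int     # processing level: 0 raw, 1 calibrated, 2 derived
    values: list   # the data carried by this product
    history: list  # names of the pipeline steps applied so far


def calibrate(raw, gain=2.0, offset=-1.0):
    """Level 0 -> Level 1: apply a (hypothetical) linear calibration."""
    vals = [gain * v + offset for v in raw.values]
    return Product(1, vals, raw.history + ["calibrate"])


def derive(cal):
    """Level 1 -> Level 2: reduce calibrated samples to one summary value."""
    return Product(2, [mean(cal.values)], cal.history + ["derive"])


def run_pipeline(raw_values):
    """Run every stage in order and return all three products for archiving."""
    level0 = Product(0, list(raw_values), ["ingest"])
    level1 = calibrate(level0)
    level2 = derive(level1)
    return [level0, level1, level2]


products = run_pipeline([1.0, 2.0, 3.0])
```

Keeping the step history on each product is one simple way to let the archive record how every derived value was produced.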

12
Need Lab Info Management Systems (LIMSs)
Action item: Foster generic LIMS

Pipeline: instrument + simulator data to archive, then publish to web.
NASA: Level 0 (raw) data, Level 1 (calibrated), Level 2 (derived)
Needs a workflow tool to manage the pipeline
Build prototypes.
Examples:
SDSS, LifeUnderYourFeet, MBARI Shore Side Data System.

13
Science Needs Info Management

Simulators produce lots of data
Experiments produce lots of data
Standard practice
each simulation run produces a file
each instrument-day produces a file
each process step produces a file
files have descriptive names
files have similar formats (described elsewhere)
Projects have millions of files (or soon will)
No easy way to manage or analyze the data.

14
Data Analysis

Looking for:
Needles in haystacks: the Higgs particle
Haystacks: dark matter, dark energy
Needles are easier than haystacks
Global statistics have poor scaling
Correlation functions are N^2, likelihood techniques N^3
We can only do N log N
Must accept approximate answers: new algorithms
Requires a combination of
statistics
computer science
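
As a toy illustration of the scaling gap, the sketch below counts close pairs among points on a line in two ways: by checking every pair (N^2) and by sorting and sliding a window (N log N). The one-dimensional setting and the function names are assumptions for the example; real correlation-function codes work in more dimensions and with other data structures.

```python
import random


def count_pairs_naive(xs, r):
    """Count pairs (i < j) with |xs[i] - xs[j]| <= r by checking every pair: O(N^2)."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )


def count_pairs_sorted(xs, r):
    """Same count via sorting plus a sliding window: O(N log N)."""
    s = sorted(xs)
    count = 0
    left = 0
    for right in range(len(s)):
        # Advance left until s[left] is within r of s[right].
        while s[right] - s[left] > r:
            left += 1
        # Every point in s[left:right] forms a close pair with s[right].
        count += right - left
    return count


random.seed(0)
points = [random.uniform(0.0, 100.0) for _ in range(500)]
```

Both functions return the same count; only the amount of work differs as N grows.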

15
Analysis and Databases

Much statistical analysis deals with
Creating uniform samples
data filtering
Assembling relevant subsets
Estimating completeness
Censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
Traditionally performed on files
These tasks are better done in a structured store with
indexing,
aggregation,
parallelism,
query, analysis,
visualization tools.
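
A small sketch of two of the listed tasks (censoring bad data and building a histogram) done inside a structured store rather than on files. It uses Python's built-in `sqlite3` module; the `obs` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, mag REAL, quality INTEGER)")
conn.executemany(
    "INSERT INTO obs (mag, quality) VALUES (?, ?)",
    [(14.2, 1), (14.8, 1), (15.1, 0), (15.6, 1), (15.9, 1), (16.3, 1)],
)
# An index lets range filters on mag avoid scanning the whole table.
conn.execute("CREATE INDEX obs_mag ON obs (mag)")

histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE quality = 1          -- censor rows flagged as bad
    GROUP BY bin               -- one-magnitude-wide histogram bins
    ORDER BY bin
    """
).fetchall()
```

The filtering, binning, and counting all happen in one declarative query, so the store is free to choose how to index or parallelize the work.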

16
Data Delivery: Hitting a Wall
FTP and GREP are not adequate

You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years
Oh!, and 1 PB is about 4,000 disks
At some point you need indices to limit search, plus parallel data search and analysis
This is where databases can help

You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (= 1 $/GB)
FTP 1 TB: 2 days and 1K$
FTP 1 PB: 3 years and 1M$
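
To illustrate why an index limits the search, here is a minimal sketch contrasting a full scan (the GREP approach) with a sorted index queried by binary search. The record layout and function names are made up for the example, and a real database index is considerably more elaborate.

```python
from bisect import bisect_left, bisect_right


def scan_range(records, lo, hi):
    """Full scan: touch every record, like GREP over a file."""
    return [r for r in records if lo <= r[0] <= hi]


def build_index(records):
    """Sort once by key so later range lookups can binary-search."""
    ordered = sorted(records, key=lambda r: r[0])
    keys = [r[0] for r in ordered]
    return keys, ordered


def index_range(index, lo, hi):
    """Indexed lookup: two binary searches, then read only the matching slice."""
    keys, ordered = index
    return ordered[bisect_left(keys, lo):bisect_right(keys, hi)]


# 10007 records with distinct, shuffled integer keys.
records = [((i * 7919) % 10007, f"row{i}") for i in range(10007)]
index = build_index(records)
```

The scan does work proportional to the whole dataset on every query, while the indexed lookup pays the sorting cost once and then does work proportional to log N plus the size of the answer.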

17
Accessing Data

If there is too much data to move around, take the analysis to the data!
Do all data manipulations at the database
Build custom procedures and functions in the database
Automatic parallelism guaranteed
Easy to build in custom functionality
Databases and procedures are being unified
Example: temporal and spatial indexing
Pixel processing
Easy to reorganize the data
Multiple views, each optimal for certain analyses
Building hierarchical summaries is trivial
Scalable to petabyte datasets

Active databases!
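
A minimal sketch of building a custom function into the database, using SQLite through Python's `sqlite3.Connection.create_function`. This is a small-scale stand-in for the server-side procedures the slide has in mind; the `FLUX_TO_MAG` function, the zero point of 22.5, and the `objects` table are hypothetical.

```python
import math
import sqlite3


def flux_to_mag(flux, zero_point):
    """Convert a flux to a magnitude; returns None (SQL NULL) for non-positive flux."""
    if flux is None or flux <= 0:
        return None
    return zero_point - 2.5 * math.log10(flux)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call FLUX_TO_MAG(flux, zp).
conn.create_function("FLUX_TO_MAG", 2, flux_to_mag)
conn.execute("CREATE TABLE objects (name TEXT, flux REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?)",
    [("a", 100.0), ("b", 1.0), ("c", 0.0)],
)

# The filter runs next to the data; only matching rows leave the database.
bright = conn.execute(
    "SELECT name, FLUX_TO_MAG(flux, 22.5) AS mag FROM objects "
    "WHERE FLUX_TO_MAG(flux, 22.5) < 20 ORDER BY name"
).fetchall()
```
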
18
Analysis and Databases

Much statistical analysis deals with
Creating uniform samples
data filtering
Assembling relevant subsets
Estimating completeness
Censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
Traditionally performed on files
These tasks are better done in a structured store with
indexing,
aggregation,
parallelism,
query, analysis,
visualization tools.

Action item: Foster data management, data analysis, and data visualization algorithms and tools
19
Let 100 Flowers Bloom

Comp-X has some nice tools
Beowulf
Condor
BOINC
Matlab
These tools grew from the community
It's HARD to see a common pattern
Linux vs. FreeBSD: why was Linux more successful? Community, personality, timing, ...???
Lesson: let 100 flowers bloom.

20
Talk Goals

Explain eScience (and what I am doing)
Recommend that CSTB foster tools for:
data capture (lab info management systems)
data curation (schemas, ontologies, provenance)
data analysis (workflow, algorithms, databases, data visualization)
data + doc publication (active docs, data-doc integration)
peer review (editorial services)
access (doc + data archives and overlay journals)
scholarly communication (wikis for each article and dataset)

21
All Scientific Data Online

Many disciplines overlap and use data from other
sciences.
Internet can unify all literature and data
Go from literature to computation to data back
to literature.
Information at your fingertips: for everyone, everywhere
Increase Scientific Information Velocity
Huge increase in Science Productivity

22
Unlocking Peer-Reviewed Literature

Agencies and foundations are mandating that research be public domain.
NIH ($30 B/y, 40k PIs, ...) (see http://www.taxpayeraccess.org/)
Wellcome Trust
Japan, China, Italy, South Africa, ...
Public Library of Science ...
Other agencies will follow NIH

23
How Does the New Library Work?

Who pays for storage and access (unfunded mandate)?
It's cheap: 1 milli-dollar per access
But curation is not cheap
Author/Title/Subject/Citation/...
Dublin Core is great, but
NLM has a 6,000-line XSD for documents
http://dtd.nlm.nih.gov/publishing
Need to capture document structure from author
Sections, figures, equations, citations, ...
Automate curation
NCBI-PubMedCentral is doing this
Preparing for 1M articles/year
Automate it!

24
PubMed Central International

Information at your fingertips
Deployed in US, China, England, Italy, South Africa, Japan
UK PMCI: http://ukpmc.ac.uk/
Each site can accept documents
Archives replicated
Federate through web services
Working to integrate Word/Excel/... with PubMedCentral, e.g. WordML, XSD, ...
To be clear: NCBI is doing 99.99% of the work.

25
Overlay Journals

Articles and Data in public archives
Journal title page in public archive.
All covered by Creative Commons License
permits copy/distribute
requires attribution
http://creativecommons.org/

Data Archives
26
Overlay Journals

Articles and Data in public archives
Journal title page in public archive.
All covered by Creative Commons License
permits copy/distribute
requires attribution
http://creativecommons.org/

Journal Management System
Data Archives
27
Overlay Journals

Articles and Data in public archives
Journal title page in public archive.
All covered by Creative Commons License
permits copy/distribute
requires attribution
http://creativecommons.org/

Journal Collaboration System
Journal Management System
Data Archives
28
Overlay Journals
Action item: Do for other sciences what NLM has done for BIO (Genbank-PubMedCentral)

Articles and Data in public archives
Journal title page in public archive.
All covered by Creative Commons License
permits copy/distribute
requires attribution
http://creativecommons.org/

Journal Collaboration System
Journal Management System
Data Archives
29
Better Authoring Tools

Extend Authoring tools to
capture document metadata (NLM tagset)
represent documents in standard format
WordML (ECMA standard)
capture references
Make active documents (words and data).
Easier for authors
Easier for archives

30
Conference Management Tool

Currently a conference peer-review system (300
conferences)
Form committee
Accept Manuscripts
Declare interest/recuse
Review
Decide
Form program
Notify
Revise

31
Publishing Peer Review

Add publishing steps
Form committee
Accept Manuscripts
Declare interest/recuse
Review
Decide
Form program
Notify
Revise
Publish

improve author-reader experience
Manage versions
Capture data
Interactive documents
Capture Workshop
presentations
proceedings
Capture classroom ConferenceXP
Moderated discussions of published articles
Connect to Archives

32
Why Not a Wiki?

Peer-Review is different
It is very structured
It is moderated
There is a degree of confidentiality
Wiki is egalitarian
It's a conversation
It's completely transparent
Don't get me wrong:
Wikis are great
SharePoints are great
But... Peer-Review is different.
And, incidentally, review of proposals, projects, ... is more like peer-review.
Let's have a moderated wiki re published literature
PLoS-One is doing this

33
Why Not a Wiki?
Action item: Foster new document authoring and publication models and tools

Peer-Review is different
It is very structured
It is moderated
There is a degree of confidentiality
Wiki is egalitarian
It's a conversation
It's completely transparent
Don't get me wrong:
Wikis are great
SharePoints are great
But... Peer-Review is different.
And, incidentally, review of proposals, projects, ... is more like peer-review.
Let's have a moderated wiki re published literature
PLoS-One is doing this

34
So What about Publishing Data?

The answer is 42.
But:
What are the units?
How precise? How accurate? 42.5 ± .01
Show your work: data provenance
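
One way to make these questions concrete is to publish a value together with its unit, uncertainty, and processing history. The sketch below is only an illustration: the `Measurement` class, the centimetre unit, and the provenance strings are all invented, since the slide deliberately leaves them open.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str           # e.g. a cgs unit; "cm" below is an arbitrary choice
    uncertainty: float  # same unit as value
    provenance: tuple = field(default_factory=tuple)  # ordered processing steps

    def with_step(self, step):
        """Return a copy whose provenance records one more processing step."""
        return Measurement(
            self.value, self.unit, self.uncertainty, self.provenance + (step,)
        )

    def __str__(self):
        return f"{self.value} ± {self.uncertainty} {self.unit}"


answer = Measurement(42.5, "cm", 0.01, ("raw reading from instrument",))
answer = answer.with_step("calibration applied")
```

A bare `42` answers none of the slide's questions; `str(answer)` at least carries the unit and the error bar, and `answer.provenance` shows the work.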

35
Thought Experiment

You have collected some data and want to publish science based on it.
How do you publish the data so that others can read it and reproduce your results in 100 years?
How do you document the collection process?
How do you document data processing (scrubbing and reducing the data)?
Where do you put it?

36
Objectifying Knowledge

This requires agreement about
Units: cgs
Measurements: who/what/when/where/how
CONCEPTS:
What's a planet, star, galaxy, ...?
What's a gene, protein, pathway, ...?
Need to objectify science:
what are the objects?
what are the attributes?
what are the methods (in the OO sense)?
This is mostly Physics/Bio/Eco/Econ/... but CS can do generic things

37
Objectifying Knowledge

This requires agreement about
Units: cgs
Measurements: who/what/when/where/how
CONCEPTS:
What's a planet, star, galaxy, ...?
What's a gene, protein, pathway, ...?
Need to objectify science:
what are the objects?
what are the attributes?
what are the methods (in the OO sense)?
This is mostly Physics/Bio/Eco/Econ/... but CS can do generic things

Warning! Painful discussions ahead:
The "O" word: Ontology
The "S" word: Schema
The "CV" words: Controlled Vocabulary
Domain experts do not agree
38
The Best Example: Entrez-GenBank
http://www.ncbi.nlm.nih.gov/

Sequence data deposited with Genbank
Literature references Genbank ID
BLAST searches Genbank
Entrez integrates and searches
PubMedCentral
PubChem
Genbank
Proteins, SNP,
Structure,..
Taxonomy
Many more

39
Publishing Data

Exponential growth
Projects last at least 3-5 years
Data sent upwards only at the end of the project
Data will never be centralized
More responsibility on projects
Becoming Publishers and Curators
Data will reside with projects
Analyses must be close to the data

40
Data Pyramid

Very extended distribution of data sets: data on all scales!
Most datasets are small and manually maintained (Excel spreadsheets)
Total volume dominated by multi-TB archives
But small datasets have real value
Most data is born digital: collected via electronic sensors or generated by simulators.

41
Data Sharing/Publishing

What is the business model (reward/career
benefit)?
Three tiers (power law!!!)
(a) big projects
(b) value added, refereed products
(c) ad-hoc data, on-line sensors, images,
outreach info
We have largely done (a)
Need Journal for Data to solve (b)
Need VO-Flickr (a simple interface) (c)
Mashups are emerging in science
Need an integrated environment for virtual
excursions for education (C. Wong)

42
The Best Example: Entrez-GenBank
http://www.ncbi.nlm.nih.gov/
Action item: Foster digital data libraries (not metadata, real data) and integration with literature

Sequence data deposited with Genbank
Literature references Genbank ID
BLAST searches Genbank
Entrez integrates and searches
PubMedCentral
PubChem
Genbank
Proteins, SNP,
Structure,..
Taxonomy
Many more

43
Talk Goals

Explain eScience (and what I am doing)
Recommend that CSTB foster tools for:
data capture (lab info management systems)
data curation (schemas, ontologies, provenance)
data analysis (workflow, algorithms, databases, data visualization)
data + doc publication (active docs, data-doc integration)
peer review (editorial services)
access (doc + data archives and overlay journals)
scholarly communication (wikis for each article and dataset)

44
backup

45
Astronomy

Help build world-wide telescope
All astronomy data and literature online and
cross indexed
Tools to analyze the data
Built SkyServer.SDSS.org
Built Analysis system
MyDB
CasJobs (batch job)
OpenSkyQuery: federation of 20 observatories.
Results
It works and is used every day
Spatial extensions in SQL 2005
A good example of Data Grid
Good examples of Web Services.

46
World Wide Telescope: Virtual Observatory
http://www.us-vo.org/
http://www.ivoa.net/

Premise: Most data is (or could be) online
So, the Internet is the world's best telescope:
It has data on every part of the sky
In every measured spectral band: optical, x-ray, radio, ...
As deep as the best instruments (2 years ago).
It is up when you are up. The "seeing" is always great (no working at night, no clouds, no moons, no ...).
It's a smart telescope: links objects and data to literature on them.

47
Why Astronomy Data?

It has no commercial value
No privacy concerns
Can freely share results with others
Great for experimenting with algorithms
It is real and well documented
High-dimensional data (with confidence intervals)
Spatial data
Temporal data
Many different instruments from many different
places and many different times
Federation is a goal
There is a lot of it (petabytes)

48
Time and Spectral Dimensions: The Multiwavelength Crab Nebula
Crab star, 1054 AD
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ Caltech.
49
SkyServer.SDSS.org

A modern archive
Access to Sloan Digital Sky Survey: spectroscopic and optical surveys
Raw Pixel data lives in file servers
Catalog data (derived objects) lives in Database
Online query to any and all
Also used for education
150 hours of online Astronomy
Implicitly teaches data analysis
Interesting things
Spatial data search
Client query interface via Java Applet
Query from Emacs, Python, ...
Cloned by other surveys (a template design)
Web services are core of it.

50
SkyServer: SkyServer.SDSS.org

Like the TerraServer, but looking the other way: a picture of ¼ of the universe
Sloan Digital Sky Survey data: pixels + data mining
About 400 attributes per object
Spectrograms for 1% of objects

51
Demo of SkyServer

Shows standard web server
Pixel/image data
Point and click
Explore one object
Explore sets of objects (data mining)

52
SkyQuery (http://skyquery.net/)

Distributed query tool using a set of web services
Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
Has grown from 4 to 15 archives, now becoming an international standard
Web service poster child
Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o, t) < 3.5
  AND AREA(181.3, -0.76, 6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2
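
To give a feel for what a cross-match between two catalogs involves, here is a simplified stand-in written in plain Python. It pairs objects purely by angular distance and treats the threshold as arcseconds; both choices are assumptions for illustration, since the slide does not spell out how XMATCH is defined or what units its 3.5 threshold uses. The catalog rows are invented.

```python
import math


def ang_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcseconds between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def cross_match(cat_a, cat_b, max_sep_arcsec):
    """Pair every object in cat_a with every object in cat_b closer than the limit.

    Brute force over all pairs: fine for a demo, far too slow for a real survey.
    """
    return [
        (a["objId"], b["objId"])
        for a in cat_a
        for b in cat_b
        if ang_sep_arcsec(a["ra"], a["dec"], b["ra"], b["dec"]) < max_sep_arcsec
    ]


optical = [{"objId": 1, "ra": 181.3000, "dec": -0.7600},
           {"objId": 2, "ra": 181.4000, "dec": -0.7000}]
infrared = [{"objId": 10, "ra": 181.3005, "dec": -0.7601},
            {"objId": 11, "ra": 185.0000, "dec": 2.0000}]

matches = cross_match(optical, infrared, 3.5)
```

In a federated setting the two catalogs live at different archives, so where each part of this computation runs matters as much as the computation itself.
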
53
SkyQuery Structure

Each SkyNode publishes
Schema Web Service
Database Web Service

Portal is
Plans Query (2 phase)
Integrates answers
Is itself a web service

54
Schema (aka metadata)

Everyone starts with the same schema: <stuff/>
Then they start arguing about semantics.
Virtual Observatory: http://www.ivoa.net/
Metadata based on Dublin Core: http://www.ivoa.net/Documents/latest/RM.html
Universal Content Descriptors (UCD): http://vizier.u-strasbg.fr/doc/UCD.htx
Captures quantitative concepts and their units
Reduced from 100,000 tables in literature to 1,000 terms
VOTable: a schema for answers to questions, http://www.us-vo.org/VOTable/
Common queries: Cone Search and Simple Image Access Protocol, SQL
Registry: http://www.ivoa.net/Documents/latest/RMExp.html (still a work in progress)

55
SkyServer/SkyQuery Evolution: MyDB and Batch Jobs

Problem: need multi-step data analysis (not just a single query).
Solution: allow personal databases on the portal.
Problem: some queries are monsters.
Solution: batch schedule on the portal; deposit the answer in the personal database.
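
The two solutions can be sketched together: queries go into a queue, a batch runner executes them, and each answer is saved as a table in the user's personal area for later steps. This is an illustrative toy built on `sqlite3`, not the actual MyDB or CasJobs design; the `submit` and `run_batch` helpers, the `mydb_<user>_<table>` naming, and the sample catalog are hypothetical.

```python
import sqlite3
from collections import deque

portal = sqlite3.connect(":memory:")
portal.execute("CREATE TABLE catalog (objId INTEGER, r REAL, type INTEGER)")
portal.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [(1, 17.2, 3), (2, 19.8, 6), (3, 18.1, 3), (4, 21.0, 3)],
)

batch_queue = deque()


def submit(user, result_table, select_sql):
    """Queue a query; its answer will land in the user's personal table."""
    batch_queue.append((f"mydb_{user}_{result_table}", select_sql))


def run_batch():
    """Drain the queue, materializing each answer as a table.

    Names and SQL are interpolated directly here; a real portal would
    validate them before execution.
    """
    while batch_queue:
        table, select_sql = batch_queue.popleft()
        portal.execute(f"CREATE TABLE {table} AS {select_sql}")


# Step 1: a (potentially long) query whose answer is saved for later steps.
submit("alice", "galaxies", "SELECT objId, r FROM catalog WHERE type = 3")
run_batch()

# Step 2: a follow-up query that runs against the saved answer only.
bright = portal.execute(
    "SELECT objId FROM mydb_alice_galaxies WHERE r < 19 ORDER BY objId"
).fetchall()
```
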

56
Ecosystem Sensor Net: LifeUnderYourFeet.Org

Small sensor net monitoring soil
Sensors feed to a database
Helping build a system to collect and organize data.
Working on data analysis tools
Prototype for other LIMS (Laboratory Information Management Systems)

57
RNA Structural Genomics

Goal: predict secondary and tertiary structure from sequence. Deduce tree of life.
Technique: analyze sequence variations sharing a common structure across the tree of life
Representing structurally aligned sequences is a key challenge
Creating a database-driven alignment workbench accessing public and private sequence data

58
VHA Health Informatics

VHA: largest standardized electronic medical records system in the US.
Design, populate, and tune a 20 TB data warehouse and analytics environment
Evaluate population health and treatment outcomes
Support epidemiological studies
7 million enrollees
5 million patients
Example milestones:
1 billionth vital sign loaded in April '06
30 minutes to population-wide obesity analysis (next slide)
Discovered seasonality in blood pressure -- NEJM, fall '06

59
HDR Vitals-Based Body Mass Index Calculation on VHA FY04 Population
Source: VHA Corporate Data Warehouse
Total patients: 23,876 (0.7%), 701,089 (21.6%), 1,177,093 (36.2%), 1,347,098 (41.5%); overall 3,249,156 (100%)
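
The shares quoted on this slide can be rechecked from the counts. The short sketch below does that arithmetic; the group names are generic placeholders, because the transcript does not preserve the category labels.

```python
# Patient counts per group as listed on the slide. The group labels are
# not given in the transcript, so generic placeholders are used.
counts = {
    "group 1": 23_876,
    "group 2": 701_089,
    "group 3": 1_177_093,
    "group 4": 1_347_098,
}

total = sum(counts.values())
# Share of each group in percent, rounded to one decimal as on the slide.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}
```
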
