1
eScience -- A Transformed Scientific Method
  • Jim Gray,
  • eScience Group,
  • Microsoft Research
  • http://research.microsoft.com/Gray
  • in collaboration with Alex Szalay
  • Dept. of Physics & Astronomy
  • Johns Hopkins University
  • http://www.sdss.jhu.edu/szalay/

2
Talk Goals
  • Explain eScience (and what I am doing)
  • Recommend CSTB foster tools for:
  • data capture (lab info management systems)
  • data curation (schemas, ontologies, provenance)
  • data analysis (workflow, algorithms, databases,
    data visualization)
  • data and document publication (active docs, data-doc
    integration)
  • peer review (editorial services)
  • access (document and data archives and overlay journals)
  • Scholarly communication (wikis for each article
    and dataset)

3
eScience: What is it?
  • Synthesis of information technology and science.
  • Science methods are evolving (tools).
  • Science is being codified/objectified. How do we
    represent scientific information and knowledge in
    computers?
  • Science faces a data deluge. How do we manage and
    analyze information?
  • Scientific communication is changing:
  • publishing data and literature (curation,
    access, preservation)

4
Science Paradigms
  • A thousand years ago: science was empirical
  • describing natural phenomena
  • Last few hundred years: a theoretical branch
  • using models, generalizations
  • Last few decades: a computational branch
  • simulating complex phenomena
  • Today: data exploration (eScience)
  • unify theory, experiment, and simulation
  • Data captured by instruments or generated by
    simulator
  • Processed by software
  • Information/knowledge stored in computer
  • Scientist analyzes database / files using data
    management and statistics

5
X-Info
  • The evolution of X-Info and Comp-X
    for each discipline X
  • How to codify and represent our knowledge

The Generic Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it
  • How to reorganize it
  • How to share with others
  • Query and Vis tools
  • Building and executing models
  • Integrating data and Literature
  • Documenting experiments
  • Curation and long-term preservation

6
Experiment Budgets: ¼ to ½ Software
  • Millions of lines of code
  • Repeated for experiment after experiment
  • Not much sharing or learning
  • CS can change this
  • Build generic tools
  • Workflow schedulers
  • Databases and libraries
  • Analysis packages
  • Visualizers
  • Software for
  • Instrument scheduling
  • Instrument control
  • Data gathering
  • Data reduction
  • Database
  • Analysis
  • Modeling
  • Visualization

7
Experiment Budgets: ¼ to ½ Software
Action item: Foster tools and foster tool support
8
Project Pyramids
In most disciplines there are a few giga
projects, several mega consortia, and then
many small labs. Often some instrument creates
the need for a giga- or mega-project:
  • Polar station
  • Accelerator
  • Telescope
  • Remote sensor
  • Genome sequencer
  • Supercomputer
Tier 1, 2, 3 facilities to use instrument data
9
Pyramid Funding
  • Giga projects need giga funding: Major Research
    Equipment grants
  • Need projects at all scales
  • computing example: supercomputers,
    departmental clusters, lab clusters
  • technical and social issues
  • Fully fund giga projects, fund ½ of smaller
    projects; they get matching funds from other
    sources
  • "Petascale Computational Systems: Balanced
    Cyber-Infrastructure in a Data-Centric World,"
    IEEE Computer, V. 39.1, pp 110-112, January
    2006.

10
Action item: Invest in tools at all levels
11
Need Lab Info Management Systems (LIMSs)
  • Pipeline instrument and simulator data to archive
    and publish to web.
  • NASA Level 0 (raw) data, Level 1
    (calibrated), Level 2 (derived)
  • Needs workflow tool to manage pipeline
  • Build prototypes.
  • Examples:
  • SDSS, LifeUnderYourFeet, MBARI Shore Side Data
    System.
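
The raw, calibrated, and derived levels above can be pictured as a small staged pipeline. This is only a minimal sketch: the linear calibration, the `gain` and `offset` values, and all function names are assumptions made up for illustration, not part of any NASA, SDSS, or MBARI system.

```python
from statistics import mean

def calibrate(raw_counts, gain=0.5, offset=-1.0):
    """Level 0 -> Level 1: apply a hypothetical linear calibration."""
    return [gain * c + offset for c in raw_counts]

def derive(calibrated):
    """Level 1 -> Level 2: reduce calibrated readings to summary products."""
    return {"n": len(calibrated), "mean": mean(calibrated), "max": max(calibrated)}

def run_pipeline(raw_counts):
    """Run the stages in order and keep every level so each can be archived."""
    level1 = calibrate(raw_counts)
    level2 = derive(level1)
    return {"level0": list(raw_counts), "level1": level1, "level2": level2}

archive = run_pipeline([10, 12, 14])  # made-up instrument counts
print(archive["level2"])
```

Keeping all three levels in the archive record is a design choice of the sketch: it lets a later reader re-derive Level 2 from Level 0 if the calibration changes.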

12
Need Lab Info Management Systems (LIMSs)
Action item: Foster generic LIMS

13
Science Needs Info Management
  • Simulators produce lots of data
  • Experiments produce lots of data
  • Standard practice
  • each simulation run produces a file
  • each instrument-day produces a file
  • each process step produces a file
  • files have descriptive names
  • files have similar formats (described elsewhere)
  • Projects have millions of files (or soon will)
  • No easy way to manage or analyze the data.

14
Data Analysis
  • Looking for
  • Needles in haystacks: the Higgs particle
  • Haystacks: dark matter, dark energy
  • Needles are easier than haystacks
  • Global statistics have poor scaling
  • Correlation functions are N^2, likelihood
    techniques N^3
  • We can only do N log N
  • Must accept approximate answers; new algorithms
  • Requires combination of
  • statistics
  • computer science
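
To make the scaling point concrete, here is a toy one-dimensional pair count done two ways: testing every pair (N^2) and sorting once followed by a binary search per point (N log N). It is a sketch under simplifying assumptions (1-D integer positions, a hard distance cut `r`); real correlation-function codes work in higher dimensions and are not shown here.

```python
from bisect import bisect_right

def count_pairs_bruteforce(xs, r):
    """O(N^2): compare every pair of points."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )

def count_pairs_sorted(xs, r):
    """O(N log N): sort once, then binary-search the right edge per point."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        hi = bisect_right(s, x + r)  # first index whose value exceeds x + r
        total += hi - i - 1          # neighbours to the right of x within r
    return total

points = [0, 1, 3, 7, 8, 20]  # made-up positions
print(count_pairs_bruteforce(points, 2), count_pairs_sorted(points, 2))
```

Both functions return the same count; only the amount of work differs as N grows.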

15
Analysis and Databases
  • Much statistical analysis deals with
  • Creating uniform samples
  • data filtering
  • Assembling relevant subsets
  • Estimating completeness
  • Censoring bad data
  • Counting and building histograms
  • Generating Monte-Carlo subsets
  • Likelihood calculations
  • Hypothesis testing
  • Traditionally performed on files
  • These tasks better done in structured store with
  • indexing,
  • aggregation,
  • parallelism
  • query, analysis,
  • visualization tools.
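
As a small illustration of doing filtering, censoring, and histogram building in a structured store, the sketch below uses Python's built-in `sqlite3` module. The table name `obs`, its columns, and the sample rows are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
conn.execute("CREATE INDEX obs_mag ON obs (mag)")  # index to limit the search
rows = [
    (1, 14.2, 0), (2, 14.8, 0), (3, 15.1, 1),  # flag = 1 marks bad data
    (4, 15.6, 0), (5, 16.3, 0), (6, 16.9, 0),
]
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)

# Censor flagged rows, select a magnitude range, and count per 1-mag bin.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE flag = 0 AND mag BETWEEN 14 AND 17
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()
print(histogram)  # [(14, 2), (15, 1), (16, 2)]
```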

16
Data Delivery: Hitting a Wall
FTP and GREP are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years
  • Oh! And 1 PB is about 4,000 disks
  • At some point you need indices to limit
    search, and parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (about $1/GB)
  • 1 TB takes 2 days and $1K
  • 1 PB takes 3 years and $1M
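
The arithmetic behind these figures can be sketched by extrapolating linearly from the "1 GB in a minute" rate. The single sequential reader and the 1,000-way split are assumptions for illustration. The extrapolation gives roughly 0.7 days per TB and 1.9 years per PB, the same order of magnitude as the rounded figures on the slide.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def scan_seconds(n_bytes, bytes_per_second):
    """Time to read n_bytes sequentially at a fixed rate."""
    return n_bytes / bytes_per_second

rate = 1e9 / 60  # assumed: 1 GB per minute with one sequential reader

tb_days = scan_seconds(1e12, rate) / SECONDS_PER_DAY
pb_years = scan_seconds(1e15, rate) / SECONDS_PER_YEAR
# Assumed: the petabyte is split evenly across 1,000 readers working in parallel.
pb_hours_parallel = scan_seconds(1e15, rate * 1000) / 3600

print(f"1 TB: {tb_days:.2f} days, 1 PB: {pb_years:.2f} years")
print(f"1 PB across 1,000 parallel readers: {pb_hours_parallel:.1f} hours")
```

The last line is the case for parallel search: the same scan drops from years to hours when the data is partitioned.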

17
Accessing Data
  • If there is too much data to move around,
  • take the analysis to the data!
  • Do all data manipulations at database
  • Build custom procedures and functions in the
    database
  • Automatic parallelism guaranteed
  • Easy to build-in custom functionality
  • Databases and procedures being unified
  • Example: temporal and spatial indexing
  • Pixel processing
  • Easy to reorganize the data
  • Multiple views, each optimal for certain analyses
  • Building hierarchical summaries is trivial
  • Scalable to Petabyte datasets

active databases!
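
One way to picture "custom functions in the database" is to register a user-defined function with the engine and call it from SQL. The sketch below does this with Python's built-in `sqlite3` module. The table, column names, and coordinates are assumptions for illustration, and SQLite does not provide the automatic parallelism mentioned above, so only the "take the analysis to the data" part is shown.

```python
import math
import sqlite3

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("angular_sep_deg", 4, angular_sep_deg)
conn.execute("CREATE TABLE objects (name TEXT, ra_deg REAL, dec_deg REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [("a", 181.30, -0.76), ("b", 181.35, -0.80), ("c", 190.00, 5.00)],
)

# The filter runs inside the database; only matching rows come back.
nearby = conn.execute(
    """
    SELECT name FROM objects
    WHERE angular_sep_deg(ra_deg, dec_deg, 181.3, -0.76) < 0.1
    ORDER BY name
    """
).fetchall()
print(nearby)  # [('a',), ('b',)]
```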
18
Analysis and Databases
Action item: Foster data management, data analysis,
data visualization, algorithms & tools
19
Let 100 Flowers Bloom
  • Comp-X has some nice tools
  • Beowulf
  • Condor
  • BOINC
  • Matlab
  • These tools grew from the community
  • It's HARD to see a common pattern
  • Linux vs FreeBSD: why was Linux more
    successful? Community, personality, timing, ...???
  • Lesson: let 100 flowers bloom.

20
Talk Goals
  • Explain eScience (and what I am doing)
  • Recommend CSTB foster tools for:
  • data capture (lab info management systems)
  • data curation (schemas, ontologies, provenance)
  • data analysis (workflow, algorithms, databases,
    data visualization)
  • data and document publication (active docs, data-doc
    integration)
  • peer review (editorial services)
  • access (document and data archives and overlay journals)
  • Scholarly communication (wikis for each article
    and dataset)

21
All Scientific Data Online
  • Many disciplines overlap and use data from other
    sciences.
  • Internet can unify all literature and data
  • Go from literature to computation to data back
    to literature.
  • Information at your fingertips for
    everyone, everywhere
  • Increase Scientific Information Velocity
  • Huge increase in Science Productivity

22
Unlocking Peer-Reviewed Literature
  • Agencies and Foundations mandating research be
    public domain.
  • NIH ($30 B/y, 40k PIs, ...) (see
    http://www.taxpayeraccess.org/)
  • Wellcome Trust
  • Japan, China, Italy, South Africa, ...
  • Public Library of Science ...
  • Other agencies will follow NIH

23
How Does the New Library Work?
  • Who pays for storage and access (unfunded mandate)?
  • It's cheap: 1 milli-dollar per access
  • But curation is not cheap:
  • Author/Title/Subject/Citation/...
  • Dublin Core is great, but...
  • NLM has a 6,000-line XSD for documents
    http://dtd.nlm.nih.gov/publishing
  • Need to capture document structure from author
  • Sections, figures, equations, citations, ...
  • Automate curation
  • NCBI-PubMedCentral is doing this
  • Preparing for 1M articles/year
  • Automate it!

24
Pub Med Central International
  • Information at your fingertips
  • Deployed US, China, England, Italy, South Africa,
    Japan
  • UK PMCI: http://ukpmc.ac.uk/
  • Each site can accept documents
  • Archives replicated
  • Federate thru web services
  • Working to integrate Word/Excel/... with
    PubMedCentral, e.g. WordML, XSD, ...
  • To be clear: NCBI is doing 99.99% of the work.

25
Overlay Journals
  • Articles and Data in public archives
  • Journal title page in public archive.
  • All covered by Creative Commons License
  • permits copy/distribute
  • requires attribution
  • http://creativecommons.org/

Data Archives
26
Overlay Journals
Journal Management System
Data Archives
27
Overlay Journals
Journal Collaboration System
Journal Management System
Data Archives
28
Overlay Journals
Action item: Do for other sciences what NLM has
done for bio: GenBank-PubMedCentral
Journal Collaboration System
Journal Management System
Data Archives
29
Better Authoring Tools
  • Extend Authoring tools to
  • capture document metadata (NLM tagset)
  • represent documents in standard format
  • WordML (ECMA standard)
  • capture references
  • Make active documents (words and data).
  • Easier for authors
  • Easier for archives

30
Conference Management Tool
  • Currently a conference peer-review system (300
    conferences)
  • Form committee
  • Accept Manuscripts
  • Declare interest/recuse
  • Review
  • Decide
  • Form program
  • Notify
  • Revise

31
Publishing & Peer Review
  • Add publishing steps
  • Form committee
  • Accept Manuscripts
  • Declare interest/recuse
  • Review
  • Decide
  • Form program
  • Notify
  • Revise
  • Publish
  • improve author-reader experience
  • Manage versions
  • Capture data
  • Interactive documents
  • Capture Workshop
  • presentations
  • proceedings
  • Capture classroom ConferenceXP
  • Moderated discussions of published articles
  • Connect to Archives

32
Why Not a Wiki?
  • Peer-Review is different
  • It is very structured
  • It is moderated
  • There is a degree of confidentiality
  • Wiki is egalitarian
  • It's a conversation
  • It's completely transparent
  • Don't get me wrong:
  • Wikis are great
  • SharePoints are great
  • But... Peer-Review is different.
  • And, incidentally, review of proposals,
    projects, ... is more like peer-review.
  • Let's have a moderated wiki re: published literature;
    PLoS-One is doing this

33
Why Not a Wiki?
Action item: Foster new document authoring and
publication models and tools

34
So What about Publishing Data?
  • The answer is 42.
  • But
  • What are the units?
  • How precise? How accurate? 42.5 ± .01
  • Show your work: data provenance
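
A published number that answers these questions has to carry its unit, its uncertainty, and a record of how it was produced. The sketch below is one hypothetical way to bundle them. The class name, fields, the unit "cm", and the provenance strings are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    value: float
    uncertainty: float
    unit: str
    provenance: tuple = ()  # ordered record of how the value was produced

    def derived(self, step, value, uncertainty):
        """Return a new measurement and append the processing step."""
        return Measurement(value, uncertainty, self.unit, self.provenance + (step,))

    def __str__(self):
        return f"{self.value} ± {self.uncertainty} {self.unit}"

raw = Measurement(42.7, 0.05, "cm", ("hypothetical instrument, run 7",))
published = raw.derived("hypothetical offset correction", 42.5, 0.01)

print(published)             # 42.5 ± 0.01 cm
print(published.provenance)
```

The object is immutable, so each processing step produces a new value with a longer provenance trail instead of overwriting the old one.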

35
Thought Experiment
  • You have collected some data and want to publish
    science based on it.
  • How do you publish the data so that others can
    read it and reproduce your results in 100
    years?
  • How do you document the collection process?
  • How do you document data processing (scrubbing
    and reducing the data)?
  • Where do you put it?

36
Objectifying Knowledge
  • This requires agreement about
  • Units: cgs
  • Measurements: who/what/when/where/how
  • CONCEPTS:
  • What's a planet, star, galaxy, ...?
  • What's a gene, protein, pathway?
  • Need to objectify science
  • what are the objects?
  • what are the attributes?
  • What are the methods (in the OO sense)?
  • This is mostly Physics/Bio/Eco/Econ/... But CS
    can do generic things

37
Objectifying Knowledge
Warning! Painful discussions ahead:
  • The "O" word: Ontology
  • The "S" word: Schema
  • The "CV" words: Controlled Vocabulary
  • Domain experts do not agree
38
The Best Example: Entrez-GenBank
http://www.ncbi.nlm.nih.gov/
  • Sequence data deposited with Genbank
  • Literature references Genbank ID
  • BLAST searches Genbank
  • Entrez integrates and searches
  • PubMedCentral
  • PubChem
  • Genbank
  • Proteins, SNP,
  • Structure,..
  • Taxonomy
  • Many more

39
Publishing Data
  • Exponential growth
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
  • More responsibility on projects
  • Becoming Publishers and Curators
  • Data will reside with projects
  • Analyses must be close to the data

40
Data Pyramid
  • Very extended distribution of data sets
  • data on all scales!
  • Most datasets are small, and manually maintained
    (Excel spreadsheets)
  • Total volume dominated by multi-TB archives
  • But, small datasets have real value
  • Most data is born digital: collected via
    electronic sensors or generated by simulators.

41
Data Sharing/Publishing
  • What is the business model (reward/career
    benefit)?
  • Three tiers (power law!!!)
  • (a) big projects
  • (b) value added, refereed products
  • (c) ad-hoc data, on-line sensors, images,
    outreach info
  • We have largely done (a)
  • Need Journal for Data to solve (b)
  • Need VO-Flickr (a simple interface) (c)
  • Mashups are emerging in science
  • Need an integrated environment for virtual
    excursions for education (C. Wong)

42
The Best Example: Entrez-GenBank
http://www.ncbi.nlm.nih.gov/
Action item: Foster digital data libraries (not
metadata, real data) and integration with
literature

43
Talk Goals
  • Explain eScience (and what I am doing)
  • Recommend CSTB foster tools for:
  • data capture (lab info management systems)
  • data curation (schemas, ontologies, provenance)
  • data analysis (workflow, algorithms, databases,
    data visualization)
  • data and document publication (active docs, data-doc
    integration)
  • peer review (editorial services)
  • access (document and data archives and overlay journals)
  • Scholarly communication (wikis for each article
    and dataset)

44
backup

45
Astronomy
  • Help build world-wide telescope
  • All astronomy data and literature online and
    cross indexed
  • Tools to analyze the data
  • Built SkyServer.SDSS.org
  • Built Analysis system
  • MyDB
  • CasJobs (batch job)
  • OpenSkyQuery: federation of 20 observatories.
  • Results
  • It works and is used every day
  • Spatial extensions in SQL 2005
  • A good example of Data Grid
  • Good examples of Web Services.

46
World Wide Telescope / Virtual Observatory
http://www.us-vo.org/
http://www.ivoa.net/
  • Premise: most data is (or could be) online
  • So, the Internet is the world's best telescope:
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray,
    radio, ...
  • As deep as the best instruments (2 years ago).
  • It is up when you are up. The seeing is always
    great (no working at night, no clouds, no moons,
    no ...).
  • It's a smart telescope: links objects and
    data to literature on them.

47
Why Astronomy Data?
  • It has no commercial value
  • No privacy concerns
  • Can freely share results with others
  • Great for experimenting with algorithms
  • It is real and well documented
  • High-dimensional data (with confidence intervals)
  • Spatial data
  • Temporal data
  • Many different instruments from many different
    places and many different times
  • Federation is a goal
  • There is a lot of it (petabytes)

48
Time and Spectral Dimensions: The Multiwavelength
Crab Nebulae
Crab star 1054 AD
X-ray, optical, infrared, and radio views of
the nearby Crab Nebula, which is now in a state
of chaotic expansion after a supernova explosion
first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ CalTech.
49
SkyServer.SDSS.org
  • A modern archive
  • Access to Sloan Digital Sky Survey spectroscopic
    and optical surveys
  • Raw Pixel data lives in file servers
  • Catalog data (derived objects) lives in Database
  • Online query to any and all
  • Also used for education
  • 150 hours of online Astronomy
  • Implicitly teaches data analysis
  • Interesting things
  • Spatial data search
  • Client query interface via Java Applet
  • Query from Emacs, Python, ...
  • Cloned by other surveys (a template design)
  • Web services are core of it.

50
SkyServer: SkyServer.SDSS.org
  • Like the TerraServer, but looking the other way:
    a picture of ¼ of the universe
  • Sloan Digital Sky Survey data: pixels and data
    mining
  • About 400 attributes per object
  • Spectrograms for 1% of objects

51
Demo of SkyServer
  • Shows standard web server
  • Pixel/image data
  • Point and click
  • Explore one object
  • Explore sets of objects (data mining)

52
SkyQuery (http://skyquery.net/)
  • Distributed query tool using a set of web
    services
  • Many astronomy archives from Pasadena, Chicago,
    Baltimore, Cambridge (England)
  • Has grown from 4 to 15 archives, now becoming
    international standard
  • WebService Poster Child
  • Allows queries like:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5
  AND AREA(181.3,-0.76,6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2
53
SkyQuery Structure
  • Each SkyNode publishes
  • Schema Web Service
  • Database Web Service
  • Portal is
  • Plans Query (2 phase)
  • Integrates answers
  • Is itself a web service
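
The node and portal roles listed above can be sketched with in-memory stand-ins. Everything here is hypothetical: the class and method names, the sample rows, and in particular the reading of "2 phase" as "first ask each node for a match count, then run the query on the nodes in ascending order of count" are assumptions for illustration, not a description of the actual SkyQuery implementation.

```python
class SkyNode:
    """Hypothetical in-memory stand-in for one archive's web services."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # list of dicts, one per catalog object

    def schema(self):
        """Schema service: report the available column names."""
        return sorted(self.rows[0].keys()) if self.rows else []

    def count(self, predicate):
        """Cheap probe used when planning."""
        return sum(1 for row in self.rows if predicate(row))

    def query(self, predicate):
        """Database service: return the matching rows."""
        return [row for row in self.rows if predicate(row)]


class Portal:
    def __init__(self, nodes):
        self.nodes = nodes

    def plan(self, predicate):
        # Phase 1 (assumed): order nodes by how many rows would match.
        return sorted(self.nodes, key=lambda node: node.count(predicate))

    def run(self, predicate):
        # Phase 2 (assumed): execute in planned order and integrate answers.
        answers = []
        for node in self.plan(predicate):
            for row in node.query(predicate):
                answers.append({"archive": node.name, **row})
        return answers


sdss = SkyNode("sdss", [{"obj": 1, "mag": 17.0}, {"obj": 2, "mag": 17.5}, {"obj": 3, "mag": 19.0}])
twomass = SkyNode("twomass", [{"obj": 10, "mag": 16.0}, {"obj": 11, "mag": 20.0}])
portal = Portal([sdss, twomass])

def bright(row):
    return row["mag"] < 18

results = portal.run(bright)
print([r["archive"] for r in results])
```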

54
Schema (aka metadata)
  • Everyone starts with the same schema:
    <stuff/>. Then they start arguing about semantics.
  • Virtual Observatory: http://www.ivoa.net/
  • Metadata based on Dublin Core:
    http://www.ivoa.net/Documents/latest/RM.html
  • Universal Content Descriptors (UCD):
    http://vizier.u-strasbg.fr/doc/UCD.htx
    Captures quantitative concepts and their units.
    Reduced from 100,000 tables in literature to 1,000
    terms
  • VOtable: a schema for answers to questions
    http://www.us-vo.org/VOTable/
  • Common queries: Cone Search and Simple Image
    Access Protocol, SQL
  • Registry: http://www.ivoa.net/Documents/latest/RMExp.html
    still a work in progress.

55
SkyServer/SkyQuery Evolution MyDB and Batch Jobs
  • Problem: need multi-step data analysis (not just
    single query).
  • Solution: allow personal databases on portal
  • Problem: some queries are monsters
  • Solution: batch schedule on portal. Deposits
    answer in personal database.
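
The personal-database and batch-queue pattern described above can be sketched in a few lines. This is an illustrative toy, not the MyDB or CasJobs implementation: the `photo` table, the user name, and the `submit` and `run_batch` helpers are all hypothetical.

```python
import sqlite3
from collections import deque

catalog = sqlite3.connect(":memory:")  # stand-in for the shared archive
catalog.execute("CREATE TABLE photo (obj_id INTEGER, r REAL)")
catalog.executemany("INSERT INTO photo VALUES (?, ?)", [(1, 17.2), (2, 19.5), (3, 21.0)])

my_db = {}           # user -> {table name -> rows}: stand-in for personal databases
job_queue = deque()  # first-in, first-out batch queue

def submit(user, target_table, sql, params=()):
    """Queue a query; nothing runs until the batch scheduler picks it up."""
    job_queue.append((user, target_table, sql, params))

def run_batch():
    """Run queued jobs in order and deposit each answer in the user's space."""
    while job_queue:
        user, target_table, sql, params = job_queue.popleft()
        rows = catalog.execute(sql, params).fetchall()
        my_db.setdefault(user, {})[target_table] = rows

submit("alice", "bright", "SELECT obj_id, r FROM photo WHERE r < ? ORDER BY obj_id", (20.0,))
run_batch()
print(my_db["alice"]["bright"])  # [(1, 17.2), (2, 19.5)]
```

Because the answer lands in a named personal table, a later step can start from that table instead of rerunning the expensive query.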

56
Ecosystem Sensor Net: LifeUnderYourFeet.Org
  • Small sensor net monitoring soil
  • Sensors feed to a database
  • Helping build system to collect and organize data.
  • Working on data analysis tools
  • Prototype for other LIMS (Laboratory Information
    Management Systems)

57
RNA Structural Genomics
  • Goal: predict secondary and tertiary structure
    from sequence. Deduce tree of life.
  • Technique: analyze sequence variations sharing
    a common structure across tree of life
  • Representing structurally aligned sequences is
    a key challenge
  • Creating a database-driven alignment workbench
    accessing public and private sequence data

58
VHA Health Informatics
  • VHA: largest standardized electronic medical
    records system in US.
  • Design, populate and tune a 20 TB Data Warehouse
    and Analytics environment
  • Evaluate population health and treatment
    outcomes,
  • Support epidemiological studies
  • 7 million enrollees
  • 5 million patients
  • Example milestones:
  • 1 billionth vital sign loaded in April 06
  • 30 minutes to population-wide obesity analysis
    (next slide)
  • Discovered seasonality in blood pressure -- NEJM
    fall 06

59
HDR Vitals Based Body Mass Index Calculation on
VHA FY04 Population. Source: VHA Corporate Data
Warehouse
Total Patients: 23,876 (0.7%), 701,089 (21.6%),
1,177,093 (36.2%), 1,347,098 (41.5%),
3,249,156 (100%)