Title: Data- and Compute-Driven Transformation of Modern Science
1 Data- and Compute-Driven Transformation of Modern Science
Edward Seidel, Assistant Director, Mathematical and Physical Sciences, NSF (Director, Office of Cyberinfrastructure)
2 Profound Transformation of Science: Gravitational Physics
- Galileo, Newton usher in birth of modern science, c. 1600
- Problem: single particle (apple) in gravitational field (general 2-body problem already too hard)
- Methods
- Data: notebooks (Kbytes)
- Theory driven by data
- Computation: calculus by hand (1 Flop/s)
- Collaboration: 1 brilliant scientist, 1-2 students
3 Profound Transformation of Science: Collision of Two Black Holes
- Science result: The Pair of Pants
- Year: 1972
- Team size: 1 person (S. Hawking)
- Computation: Flop/s
- Data produced: Kbytes (text, hand-drawn sketch)
- 400 years later: same!
- Science result: The Pair of Pants
- Year: 1994
- Team size: 10
- Data produced: 50 Mbytes
4 Move to 3D: 1000x more data!
- Science result: 3D collision
- Year: 1998
- Team size: 15
- Data produced: 50 Gbytes
5 Just Ahead, Complexity of Universe: LHC, Gamma-ray bursts!
- Gamma-ray bursts!
- GR now soluble: complex problems in relativistic astrophysics
- All energy emitted in lifetime of sun bursts out in a few seconds: what are they?! Colliding BH-NS? SN?
- GR, hydrodynamics, nuclear physics, radiation transport, neutrinos, magnetic fields: globally distributed collab!
- Scalable algorithms, complex AMR codes, viz, PFlops-week, PB output!
- LHC: Higgs particle?
- 10K scientists, 33 countries, 25 PB
- Planetary lab for scientific discovery!
Remote Instrument
6 Grand Challenge Communities Combine It All... Where is it going to go?
Same CI useful for black holes, hurricanes
7 Framing the Question: Science is Radically Revolutionized by CI
- Modern science
- Data- and compute-intensive
- Integrative
- Multiscale Collaborations for Complexity
- Individuals, groups
- Teams, communities
- Must Transition NSF CI approach to support
- Integrative, multiscale
- 4 centuries of constancy, 4 decades of 10^9 to 10^12 change!
But such radical change cannot be adequately
addressed with (our current) incremental approach!
We still think like this
Students take note!
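The "4 centuries of constancy, 4 decades" bullet on this slide is read here as a growth factor of 10^9 to 10^12 over 40 years. A minimal sketch of what that implies as a doubling time, assuming steady exponential growth across the whole span (an idealization, not a claim made in the slides); the function name is illustrative only:

```python
import math

def doubling_time_years(growth_factor: float, span_years: float) -> float:
    """Years per doubling, assuming steady exponential growth over the span."""
    doublings = math.log2(growth_factor)
    return span_years / doublings

# Assumed reading of the slide: 10^9 to 10^12 total growth over 40 years.
for exponent in (9, 12):
    years = doubling_time_years(10.0 ** exponent, 40.0)
    print(f"10^{exponent} over 40 years: doubling every {years * 12:.0f} months")
```

Under these assumptions the capability doubles roughly every 12 to 16 months, which is the pace the "incremental approach" remark is set against.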
8 Data Crisis: Information Big Bang
Scientific Computing and Imaging Institute,
University of Utah
9 Explosive Trends in Data Growth
- Comparative Metagenomics
- DNA sequencing of entire families of organisms
- Already hundreds of TB, thousands of users
- HD Collaborations and Optiportals
- Multichannel HD, gigapixel visualizations
- Petascale-Exascale simulation
- They generate peta-exabytes per simulation!
- Square Kilometer Array
- 3000 radio receivers, 1 km^2 area!
- 19 countries! Possibly beginning in 201X, operational 202X
- Data: exabyte per week! Analysis: Exaflops!
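To give the Square Kilometer Array figure a sense of scale, an exabyte per week can be converted into a sustained data rate. A rough sketch, assuming a decimal (SI) exabyte of 10^18 bytes and data arriving evenly over the week; both are assumptions for illustration, and the function name is hypothetical:

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds
EXABYTE = 10 ** 18                   # decimal (SI) exabyte; an assumption

def sustained_rate_bytes_per_s(bytes_per_week: float) -> float:
    """Average rate if the weekly volume were spread evenly over the week."""
    return bytes_per_week / SECONDS_PER_WEEK

rate = sustained_rate_bytes_per_s(EXABYTE)
print(f"{rate / 1e12:.2f} TB/s, or {rate * 8 / 1e12:.1f} Tbit/s")
```

Under these assumptions the instrument would need to move on the order of 1.65 terabytes every second, around the clock.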
10 Provenance in Science
Source: Juliana Freire, U of Utah
- Provenance is as important as the result
- Not a new issue
- Lab notebooks have been used for a long time
- What is new?
- Large volumes of data
- Complex analyses, computational processes
- Writing notes is no longer an option
- GC Communities require open, sharable data, standards, metadata
[Figure: lab notebook page, DNA recombination by Lederberg, with callouts "When", "Annotation", "Observed data"]
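Where hand-written notes no longer scale, the notebook entry can be generated by the analysis itself. The following is a minimal sketch of that idea, not the approach of any system referenced in the slides: every name here (`run_with_provenance`, `sha256_of`, `rescale`) is hypothetical, and it assumes inputs and outputs are JSON-serializable.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def sha256_of(obj) -> str:
    """Hash a JSON-serializable object in a stable, key-sorted form."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_with_provenance(step, data, **params):
    """Run step(data, **params) and return (result, provenance record)."""
    started = datetime.now(timezone.utc).isoformat()
    result = step(data, **params)
    record = {
        "step": step.__name__,            # what was run
        "parameters": params,             # how it was configured
        "input_sha256": sha256_of(data),  # which data went in
        "output_sha256": sha256_of(result),
        "python_version": platform.python_version(),
        "started_utc": started,           # the "when" of the notebook page
    }
    return result, record

def rescale(values, factor=1.0):
    """Toy analysis step used only for illustration."""
    return [v * factor for v in values]

result, record = run_with_provenance(rescale, [1.0, 2.0, 3.0], factor=2.0)
print(json.dumps(record, indent=2))
```

The record plays the role of the annotated notebook page: it captures when the step ran, what it observed, and how it was configured, in a form that can be shared alongside the data.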
11 Data Deluge Drives Change at NSF
There is a major shift in science towards data-intensive methods. NSF is responding:
- Data Issues resonate the most across NSF units!
- DataNet: $100M investment in Sustainable Archive Access Partners; development of a widely accessible network of interactive data archives. Driven by today's grand challenges, integrating multiple disciplines
- INTEROP: Community-led interoperability; interdisciplinary, community approaches to combine and re-use data in ways not envisioned by their creators
- Data-intensive computing: SDSC Gordon facility
- NSF Data Policy: The Data Working Group (NSF-wide group of Program Directors) is working to assure that data are shared within and across disciplines
12 NSF Vision and National CI Blueprint
Software
Science is becoming unreproducible in this environment. Validation? Provenance? Reproducibility?
Track 1
13 The Shift Towards Data: Implications
- All science is becoming data-dominated
- Experiment, computation, theory
- Totally new methodologies
- Algorithms, mathematics
- All disciplines, from science and engineering to arts and humanities
- End-to-end networking becomes a critical part of the CI ecosystem
- Campuses, please note!
- How do we train data-intensive scientists?
- Data policy becomes critical!
14 Recent NSF Activities on Data Policy and Implementation
15 Fundamental points on data and publication policy
Who pays? The NSF? The Institution? What is
the cost model? What is reasonable?
Where is it placed? Author web site? Library?
NSF sites?
- Publicly funded scientific data and publications should be available, and science benefits
- There has to be a place to keep data, and a way to access it
- There needs to be an affordable, sustainable cost model for this
What data must be made available? Raw data?
Peer reviewed? When is it available? 6 months?
1 year? After publication?
How long is it made available? How do we enforce
it post-award?
There is great variability in requirements across science communities; peer review can help guide this process.
16 Changes Coming for Data!
- Long-standing NSF Policy on Data (Proposal & Award Policies & Procedures Guide)
- "Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data... created or gathered in the course of work under NSF grants"
- NSF will soon require a Data Management Plan (DMP), subject to peer review; criterion for award
- The DMP will be in the form of a 2-page supplementary document to the proposal
- It will not be possible to submit proposals without such a document
- Customization by discipline, program necessary
17 Upcoming Implementation of NSF Data Policy
- Directorate-Specific Issues, Peer Review
- Many details are implemented/enforced via peer review and Program Officer discretion, including things like embargo period, standards, etc.
- The challenge at NSF is that no one size fits all, so each Directorate will be responsible for its own recommendations for DMP content, appropriate institutional repositories, etc.
- This does not address Open Access as applied
18 Electronic Access to Scientific Publications
19 Why is this Important?
- Science requires it
- Scientific progress is accelerated by making publications available and searchable
- Results in one community need to easily propagate to another for multidisciplinary complex problem solving
- Search technologies can be brought to bear
- Publications need to be associated with rich information: videos of simulations, supporting data, simulation and analysis codes
- Equality and broadening participation
- Young scientists at smaller universities are at a needless disadvantage without it. They may lose journal access. This hurts science and puts talent at risk
- US Administration focus on transparency and accountability
20 Current Activities
- We have begun serious discussions within NSF on these issues
- National Science Board Committee on Data started
- Goals similar to those for Data
- We have had numerous visits from funding agencies from around the world
- Primary topic: what is NSF doing on OA?
- Discussion with various publishers, libraries to explore options
- Quality of science relies on peer review systems of best journals; need a way to support OA
21 On Working with Publishers
- Quality of science, identification of talented scientists: we rely on the peer review systems of the best journals
- NSF receives an assurance that the work done on a grant meets a standard
- Universities use impact factors as part of their tenure and promotion process
- "I believe it is in the interests of science, and hence the public interest, to help journals find a viable OA business model."
- Bernard Schutz, Presentation to NSF, May 2009
22 Final Remarks
- Science is becoming collaborative and data dominated
- We are accelerating efforts to advance NSF in all aspects of data
- Science requires that data be open and accessible; we are working to achieve this
- All forms of data are important, and must be more tightly connected in the future
- Collections, software, publications
- Time is of the essence