Title: Tony Hey
1 e-Science and its Implications for the Library
Community
- Tony Hey
- Corporate Vice President
- Technical Computing
- Microsoft Corporation
Life Sciences
Earth Sciences
Computer and Information Sciences
New Materials, Technologies and Processes
Multidisciplinary Research
2Licklider's Vision
- Lick had this concept: all of the stuff
linked together throughout the world, that you
can use a remote computer, get data from a remote
computer, or use lots of computers in your job
- Larry Roberts, Principal Architect of the ARPANET
3Physics and the Web
- Tim Berners-Lee developed the Web at CERN as a
tool for exchanging information between the
partners in physics collaborations
- The first Web Site in the USA was a link to the
SLAC library catalogue
- It was the international particle physics
community who first embraced the Web
- Killer application for the Internet
- Transformed modern world: academia, business and
leisure
4Beyond the Web?
- Scientists developing collaboration technologies
that go far beyond the capabilities of the Web
- To use remote computing resources
- To integrate, federate and analyse information
from many disparate, distributed, data resources
- To access and control remote experimental
equipment
- Capability to access, move, manipulate and mine
data is the central requirement of these new
collaborative science applications
- Data held in file or database repositories
- Data generated by accelerators or telescopes
- Data gathered from mobile sensor networks
5What is e-Science?
- e-Science is about global collaboration in
key areas of science, and the next generation of
infrastructure that will enable it
- John Taylor
- Director General of Research Councils
- UK, Office of Science and Technology
6The e-Science Vision
- e-Science is about multidisciplinary science and
the technologies to support such distributed,
collaborative scientific research
- Many areas of science are in danger of being
overwhelmed by a data deluge from new
high-throughput devices, sensor networks,
satellite surveys
- Areas such as bioinformatics, genomics, drug
design, engineering, healthcare require
collaboration between different domain experts
- e-Science is a shorthand for a set of
technologies to support collaborative networked
science
7e-Science Vision and Reality
- Vision
- Oceanographic sensors - Project Neptune
- Joint US-Canadian proposal
- Reality
- Chemistry: The Comb-e-Chem Project
- Annotation, Remote Facilities and e-Publishing
8http://www.neptune.washington.edu/
18The Comb-e-Chem Project
Automatic Annotation
Video Data Stream
HPC Simulation
Data Mining and Analysis
Structures Database
Diffractometer
Combinatorial Chemistry Wet Lab
National X-Ray Service
Middleware
19National Crystallographic Service
Send sample material to NCS service
Search materials database and predict properties
using Grid computations
Download full data on materials of interest
Collaborate in e-Lab experiment and obtain
structure
20A digital lab book replacement that chemists were
able to use, and liked
21Monitoring laboratory experiments using a broker
delivered over GPRS on a PDA
22Crystallographic e-Prints
Direct Access to Raw Data from scientific
papers
Raw data sets can be very large - stored at UK
National Datastore using SRB software
23 eBank Project
Undergraduate Students
Digital Library
Graduate Students
E-Scientists
Grid
E-Experimentation
Entire E-Science Cycle: encompassing
experimentation, analysis, publication, research,
learning
24Support for e-Science
- Cyberinfrastructure and e-Infrastructure
- In the US, Europe and Asia there is a common
vision for the cyberinfrastructure required to
support the e-Science revolution
- Set of Middleware Services supported on top of
high bandwidth academic research networks
- Similar to vision of the Grid as a set of
services that allows scientists and industry
to routinely set up Virtual Organizations for
their research or business
- Many companies emphasize computing cycle aspect
of Grids
- The Microsoft Grid vision is more about data
management than about compute clusters
25Six Key Elements for a Global
Cyberinfrastructure for e-Science
- High bandwidth Research Networks
- Internationally agreed AAA Infrastructure
- Development Centers for Open Standard Grid
Middleware
- Technologies and standards for Data Provenance,
Curation and Preservation
- Open access to Data and Publications via
Interoperable Repositories
- Discovery Services and Collaborative Tools
26The Web Services Magic Bullet
27Computational Modeling
28Technical Computing in Microsoft
- Radical Computing
- Research in potential breakthrough technologies
- Advanced Computing for Science and Engineering
- Application of new algorithms, tools and
technologies to scientific and engineering
problems
- High Performance Computing
- Application of high performance clusters and
database technologies to industrial applications
29New Science Paradigms
- Thousand years ago: Experimental Science
  - description of natural phenomena
- Last few hundred years: Theoretical Science
  - Newton's Laws, Maxwell's Equations
- Last few decades: Computational Science
  - simulation of complex phenomena
- Today: e-Science or Data-centric Science
  - unify theory, experiment, and simulation
  - using data exploration and data mining
- Data captured by instruments
- Data generated by simulations
- Processed by software
- Scientist analyzes databases/files
- (With thanks to Jim Gray)
30Advanced Computing for Science and Engineering
Bioinformatics
Energy Science
Engineering
Earth Science
. . .
31Top 500 Supercomputer Trends
Clusters over 50%
Industry usage rising
x86 is winning
GigE is gaining
32Key Issues for e-Science
- Workflows
- The LEAD Project
- The Data Chain
- From Acquisition to Preservation
- Scholarly Communication
- Open Access to Data and Publications
33 The LEAD Project
Better predictions for Mesoscale weather
34The LEAD Vision
- Analysis/Assimilation
- Quality Control
- Retrieval of Unobserved Quantities
- Creation of Gridded Fields
Prediction/Detection: PCs to Teraflop Systems
- Product Generation,
- Display,
- Dissemination
Models and Algorithms Driving Sensors
The CS challenge: Build a virtual eScience
laboratory to support experimentation and
education leading to this vision.
- End Users
- NWS
- Private Companies
- Students
35Composing LEAD Services
- Need to construct workflows that are:
- Data Driven
- The weather input stream defines the nature of
the computation
- Persistent and Agile
- An agent mines a data stream and notices an
interesting feature. This event may trigger a
workflow scenario that has been waiting for
months
- Adaptive
- The weather changes
- Workflow may have to change on-the-fly
- Resources
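The workflow properties listed on this slide can be illustrated with a small sketch. It is not LEAD's actual implementation: the class `PersistentWorkflow`, the function `run_agent`, the reflectivity readings and the 50.0 threshold are all hypothetical names and values chosen for illustration. The trigger is evaluated against the data stream itself (data driven), the workflow object sits idle until a matching event arrives (persistent), and its list of steps can be changed while the agent is running (adaptive).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class PersistentWorkflow:
    """Hypothetical workflow that stays dormant until a matching event arrives."""
    name: str
    trigger: Callable[[float], bool]       # data driven: decided by the stream itself
    steps: List[Callable[[float], str]]    # adaptive: steps may be swapped at run time
    log: List[str] = field(default_factory=list)

    def on_reading(self, reading: float) -> bool:
        """Run every step if the reading counts as an interesting feature."""
        if not self.trigger(reading):
            return False
        for step in self.steps:
            self.log.append(step(reading))
        return True


def run_agent(stream: Iterable[float], workflow: PersistentWorkflow) -> int:
    """Mine a data stream and count how often the workflow was triggered."""
    return sum(1 for reading in stream if workflow.on_reading(reading))


if __name__ == "__main__":
    # Illustrative readings and threshold only; not taken from the LEAD project.
    wf = PersistentWorkflow(
        name="storm-forecast",
        trigger=lambda value: value >= 50.0,
        steps=[lambda v: f"assimilate {v}", lambda v: f"forecast {v}"],
    )
    print(run_agent([12.0, 31.5, 55.2, 20.1], wf), wf.log)
```

Resource allocation, the last item on the slide, is left out of the sketch; a real system would also need to schedule the triggered steps onto available compute resources.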
36Example LEAD Workflow
37The e-Science Data Chain
- Data Acquisition
- Data Ingest
- Metadata
- Annotation
- Provenance
- Data Storage
- Curation
- Preservation
38The Data Deluge
- In the next 5 years e-Science projects will
produce more scientific data than has been
collected in the whole of human history
- Some normalizations
- The Bible: 5 Megabytes
- Annual refereed papers: 1 Terabyte
- Library of Congress: 20 Terabytes
- Internet Archive (1996-2002): 100 Terabytes
- In many fields new high throughput devices,
sensors and surveys will be producing Petabytes
of scientific data
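To put the normalizations on this slide next to the petabyte scale, the short calculation below works out how many copies of each quoted collection fit in one petabyte. It assumes decimal (SI) units, which the slide does not specify, and the names `SLIDE_SIZES` and `copies_per_petabyte` are introduced here only for illustration.

```python
# Sizes quoted on the slide, in bytes, assuming decimal (SI) prefixes.
MB, TB, PB = 10**6, 10**12, 10**15

SLIDE_SIZES = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}


def copies_per_petabyte(size_in_bytes: int) -> int:
    """Return how many whole copies of a collection fit in one petabyte."""
    return PB // size_in_bytes


if __name__ == "__main__":
    for name, size in SLIDE_SIZES.items():
        print(f"1 PB holds {copies_per_petabyte(size):,} x {name}")
```

Under these assumptions a single petabyte corresponds to 50 times the Library of Congress figure and 10 times the Internet Archive figure quoted above.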
39The Problem for the e-Scientist
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it?
- How to coexist and cooperate with others?
- Data Query and Visualization tools
- Support/training
- Performance
- Execute queries in a minute
- Batch (big) query scheduling
40Digital Curation?
- In 20 years can guarantee that the operating
system and spreadsheet program and the hardware
used to store data will not exist
- Need research curation technologies such as
workflow, provenance and preservation
- Need to liaise closely with individual research
communities, data archives and libraries
- The UK has set up the Digital Curation Centre
in Edinburgh with Glasgow, UKOLN and CCLRC
- Attempt to bring together skills of scientists,
computer scientists and librarians
41Digital Curation Centre
- Actions needed to maintain and utilise digital
data and research results over entire life-cycle
- For current and future generations of users
- Digital Preservation
- Long-run technological/legal accessibility and
usability
- Data curation in science
- Maintenance of body of trusted data to represent
current state of knowledge
- Research in tools and technologies
- Integration, annotation, provenance, metadata,
security ...
42Berlin Declaration 2003
- To promote the Internet as a functional
instrument for a global scientific knowledge base
and for human reflection
- Defines open access contributions as including:
- original scientific research results, raw data
and metadata, source materials, digital
representations of pictorial and graphical
materials and scholarly multimedia material
43NSF Atkins Report on Cyberinfrastructure
- the primary access to the latest findings in a
growing number of fields is through the Web, then
through classic preprints and conferences, and
lastly through refereed archival papers
- archives containing hundreds or thousands of
terabytes of data will be affordable and
necessary for archiving scientific and
engineering information
44MIT DSpace Vision
- Much of the material produced by faculty,
such as datasets, experimental results and rich
media data as well as more conventional
document-based material (e.g. articles and
reports) is housed on an individual's hard drive
or department Web server. Such material is often
lost forever as faculty and departments change
over time.
45Publishing Data Analysis Is Changing
Roles        Authors         Publishers        Curators          Archives          Consumers
Traditional  Scientists      Journals          Libraries         Archives          Scientists
Emerging     Collaborations  Project web site  DataDoc Archives  Digital Archives  Scientists
46Data Publishing: The Background
- In some areas, notably biology, databases
are replacing (paper) publications as a medium of
communication
- These databases are built and maintained with a
great deal of human effort
- They often do not contain source experimental
data, sometimes just annotation/metadata
- They borrow extensively from, and refer to, other
databases
- You are now judged by your databases as well as
your (paper) publications
- Upwards of 1000 (public databases) in genetics
47Data Publishing: The issues
- Data integration
- Tying together data from various sources
- Annotation
- Adding comments/observations to existing data
- Becoming a new form of communication
- Provenance
- Where did this data come from?
- Exporting/publishing in agreed formats
- To other programs as well as people
- Security
- Specifying/enforcing read/write access to parts
of your data
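Several of the issues on this slide (integration, annotation, provenance and export in an agreed format) can be shown together in a minimal sketch. The classes `DataRecord` and `Annotation`, the function `integrate`, and the choice of JSON as the agreed format are assumptions made for illustration; they are not taken from any project named in the talk. Security, meaning read/write access control over parts of the data, is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Annotation:
    """A comment or observation attached to existing data."""
    author: str
    comment: str


@dataclass
class DataRecord:
    """Hypothetical published data set carrying its own provenance."""
    identifier: str
    values: List[float]
    source: str                                   # provenance: where the data came from
    derived_from: List[str] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, author: str, comment: str) -> None:
        """Add an annotation without modifying the underlying values."""
        self.annotations.append(Annotation(author, comment))

    def to_json(self) -> str:
        """Export in a format readable by other programs as well as people."""
        return json.dumps(asdict(self), sort_keys=True)


def integrate(identifier: str, records: List[DataRecord]) -> DataRecord:
    """Tie together data from several sources, recording what it was derived from."""
    merged = [value for record in records for value in record.values]
    parents = [record.identifier for record in records]
    return DataRecord(identifier, merged, source="integration", derived_from=parents)


if __name__ == "__main__":
    a = DataRecord("run-a", [1.0, 2.0], source="lab-1")
    b = DataRecord("run-b", [3.0], source="lab-2")
    combined = integrate("combined", [a, b])
    combined.annotate("reviewer", "values from run-b look high")
    print(combined.to_json())
```

Keeping `derived_from` as a list of identifiers rather than embedded copies means the provenance chain can be followed back to the source records without duplicating their data.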
48Interoperable Repositories?
- Paul Ginsparg's arXiv at Cornell has demonstrated
new model of scientific publishing
- Electronic version of preprints hosted on the
Web
- David Lipman of the NIH National Library of
Medicine has developed PubMedCentral as
repository for NIH funded research papers
- Microsoft funded development of portable PMC
now being deployed in UK and other countries
- Stevan Harnad's self-archiving EPrints project
in Southampton provides a basis for OAI-compliant
Institutional Repositories
- Many national initiatives around the world moving
towards mandating deposition of full text of
publicly funded research papers in repositories
49Microsoft Strategy for e-Science
- Microsoft intends to work with the scientific
and library communities:
- to define open standard and/or interoperable
high-level services, work flows and tools
- to assist the community in developing open
scholarly communication and interoperable
repositories
50Acknowledgements
- With special thanks to Kelvin Droegemeier,
Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim
Gray, Yike Guo, Liz Lyon and Beth Plale