Title: New Generation Database Systems: XML Databases and Grid-based Digital Libraries
1 New Generation Database Systems: XML Databases and Grid-based Digital Libraries
- University of California, Berkeley
- School of Information
- IS 257 Database Management
2 Lecture Outline
- XML and DBMS
- The Grid and DBMS
- The Grid
- Data Grids
- Grid-based DBMS
4 Standards: XML/SQL
- As part of SQL3, an extension providing a mapping from XML to DBMS is being created, called XML/SQL
- The (draft) standard is very complex, but the ideas are actually pretty simple
- Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY
5 Standards: XML/SQL
- That table can be mapped to
<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> etc.
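The table-to-XML mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the mapping defined by the standard: it assumes an in-memory SQLite database standing in for the DBMS, the helper name table_to_xml is hypothetical, and SALARY is stored as text only so that the value prints exactly as on the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table_name):
    """Map a table to <TABLE><row><COLUMN>value</COLUMN>...</row>...</TABLE>.

    Hypothetical helper for illustration; NULL handling and type mapping
    are deliberately left out.
    """
    # table_name is interpolated into the query, so it must be a trusted name.
    cursor = conn.execute(f"SELECT * FROM {table_name}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table_name)
    for record in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            # One child element per column, named after the column.
            ET.SubElement(row, column).text = str(value)
    return root


# In-memory stand-in for the EMPLOYEE table described in the lecture.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE "
    "(EMPNO TEXT, FIRSTNAME TEXT, LASTNAME TEXT, BIRTHDATE TEXT, SALARY TEXT)"
)
conn.execute(
    "INSERT INTO EMPLOYEE VALUES "
    "('000020', 'John', 'Smith', '1955-08-21', '52300.00')"
)

xml_text = ET.tostring(table_to_xml(conn, "EMPLOYEE"), encoding="unicode")
print(xml_text)
```

The element names come straight from the table and column names, which is the core of the idea: the relational schema determines the XML structure.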
6 Standards: XML/SQL
- In addition, the standard says that XML Schemas must be generated for each table, and also allows relations to be managed by nesting records from tables in the XML.
- Variants of this are incorporated into the latest versions of ORACLE
- (Slides from Oracle Web Site on ORACLE XML)
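One way to picture "nesting records from tables in the XML" is to place the rows of a child table inside the matching row of a parent table. The sketch below is an assumption-laden illustration, not the standard's actual nesting rules: the DEPARTMENT table, the DEPTNO and DEPTNAME columns, and the function name nest_employees are all hypothetical and do not appear in the lecture.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: DEPARTMENT and the DEPTNO link are invented for illustration.
departments = [("D01", "Planning")]
employees = [("000020", "John", "Smith", "D01")]


def nest_employees(departments, employees):
    """Nest each department's EMPLOYEE rows inside its DEPARTMENT row."""
    root = ET.Element("DEPARTMENT")
    for deptno, deptname in departments:
        dept_row = ET.SubElement(root, "row")
        ET.SubElement(dept_row, "DEPTNO").text = deptno
        ET.SubElement(dept_row, "DEPTNAME").text = deptname
        # The relationship is expressed by containment instead of a foreign key.
        nested = ET.SubElement(dept_row, "EMPLOYEE")
        for empno, first, last, emp_deptno in employees:
            if emp_deptno == deptno:
                emp_row = ET.SubElement(nested, "row")
                ET.SubElement(emp_row, "EMPNO").text = empno
                ET.SubElement(emp_row, "FIRSTNAME").text = first
                ET.SubElement(emp_row, "LASTNAME").text = last
    return root


nested_xml = ET.tostring(nest_employees(departments, employees), encoding="unicode")
print(nested_xml)
```

The join column disappears from the nested rows because the containment itself carries the relationship.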
7 Lecture Outline
- XML and DBMS
- The Grid and DBMS
- The Grid
- Data Grids
- Grid-based DBMS
8 Grid-based Digital Libraries
- So what's this Grid thing anyhow?
- Data Grids and Distributed Storage
- Grid-Based IR
- Grid-Based Digital Libraries
- This lecture borrows heavily from presentations by Ian Foster (Argonne National Laboratory and University of Chicago), Reagan Moore, and others from San Diego Supercomputer Center
9 The Grid: On-Demand Access to Electricity
[Figure: axes labeled "Quality, economies of scale" and "Time"]
Source: Ian Foster
10 By Analogy, A Computing Grid
- Decouples production and consumption
- Enable on-demand access
- Achieve economies of scale
- Enhance consumer flexibility
- Enable new devices
- On a variety of scales
- Department
- Campus
- Enterprise
- Internet
Source: Ian Foster
11 What is the Grid?
- The short answer is that, whereas the Web is a
service for sharing information over the
Internet, the Grid is a service for sharing
computer power and data storage capacity over the
Internet. The Grid goes well beyond simple
communication between computers, and aims
ultimately to turn the global network of
computers into one vast computational resource.
- Source: The Global Grid Forum
12 Not Exactly a New Idea
- "The time-sharing computer system can unite a group of investigators ... one can conceive of such a facility as an intellectual public utility."
- Fernando Corbato and Robert Fano, 1966
- "We will perhaps see the spread of computer utilities, which, like present electric and telephone utilities, will service individual homes and offices across the country."
- Len Kleinrock, 1967
Source: Ian Foster
13 But, Things are Different Now
- Networks are far faster (and cheaper)
- Faster than computer backplanes
- Computing is very different than pre-Net
- Our computers have already disintegrated
- E-commerce increases size of demand peaks
- Entirely new applications and social structures
- We've learned a few things about software
Source: Ian Foster
14 Computing isn't Really Like Electricity
- I import electricity but must export data
- Computing is not interchangeable but highly heterogeneous: data, sensors, services, ...
- This complicates things, but also means that the sum can be greater than the parts
- Real opportunity: construct new capabilities dynamically from distributed services
- Raises three fundamental questions
- Can I really achieve economies of scale?
- Can I achieve QoS across distributed services?
- Can I identify apps that exploit synergies?
Source: Ian Foster
15 Why the Grid? (1) Revolution in Science
- Pre-Internet
- Theorize and/or experiment, alone or in small teams; publish paper
- Post-Internet
- Construct and mine large databases of observational or simulation data
- Develop simulations and analyses
- Access specialized devices remotely
- Exchange information within distributed multidisciplinary teams
Source: Ian Foster
16 Why the Grid? (2) Revolution in Business
- Pre-Internet
- Central data processing facility
- Post-Internet
- Enterprise computing is highly distributed, heterogeneous, inter-enterprise (B2B)
- Business processes increasingly computing- and data-rich
- Outsourcing becomes feasible => service providers of various sorts
Source: Ian Foster
17 The Information Grid
- Imagine a web of data
- Machine Readable
- Search, Aggregate, Transform, Report On, Mine Data using more computers and fewer humans
- Scalable
- Machines are cheap: can buy 50 machines with 100 GB of memory and 100 TB disk for under $100K, and dropping
- Network is now faster than disk
- Flexible
- Move data around without breaking the apps
Source: S. Banerjee, O. Alonso, M. Drake - ORACLE
18 The Foundations are Being Laid
19 Data Grid Problem
- Enable a geographically distributed community of thousands to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data
- Note that this problem
- Is common to many areas of science
- Overlaps strongly with other Grid problems
20 Data Grids for High Energy Physics
Image courtesy Harvey Newman, Caltech
21 Grids and Open Standards
[Figure: labels "App-specific Services" and "Custom solutions"; axes "Increased functionality, standardization" and "Time"]
22 The Grid as Enabler of 21st Century Science
- Entirely new approaches to enquiry based on
- Deep analysis of huge quantities of data
- Interdisciplinary collaboration
- Large-scale simulation
- Smart instrumentation
- Enabled by an infrastructure that provides access to, and integration of, resources and services without regard for location
23 Not only Science
- The Database world is moving to the Grid for large-scale applications
- Oracle 10g is specifically designed to exploit clustered/grid computing using RACs (Real Application Clusters)
- An example from the Information/Publishing world
- Presentation from Oracle about Thomson Legal's use of Oracle 10g and RACs