Title: LegoDB
1LegoDB
Customizing Relational Storage for XML
Documents Timothy Sutherland Sachin Patidar
2Managing XML Data
- XML has become widely used for exchange of data
over the Web -
- XML is extensible and flexible it can be used in
applications with widely different requirements -
- There is no one-size-fits-all solution for all
applications -
- What are procedures to store, query and publish
XML data? - Need adaptable and flexible solutions
-
- LegoDB is a component-based XML data management
system - The database-anywhere paradigm
- portable and adaptable to any data and any
environment
3Motivation challenges
- Challenges
- mismatch between nested tree structure and flat
tuples of the relational model - Inducing the flexibility to handle the wide
domain of application - But storing and querying XML data in an RDBMS is
a non-trivial task
- Motivation
- reuse of well developed features
- concurrency control
- crash recovery
- query processors
- wide variety of XML applications
- Integrate with existing data stored in an RDBMS
4Mapping XML schema to relations
- Question ?
- Can different XML schemas validate the exact same
set of documents? - Yes
- Different but equivalent regular expression can
describe the contents of a given element. - (a(bc)) ((a,b) (a,c))
- Sub-elements of an element can be referred to
directly, or can be referred to by a type name
5Sample XML DatasetInternet Movie Database
-
-
- Fugitive, The
- 1993
-
-
- Roger Ebert
- Two thumbs up!
- This is a fun action movie,
- Harrison Ford at his best.
-
-
-
-
- The standard Hollywood summer
- movie strikes back.
-
X
Files,The 1994 4
Fallen
Angel Larry
Shaw
6DTD and XML Schema
7Question ?
- Can you find more storage mapping relations?
- By performing a sequence of transformations (i.e.
rewritings) which preserve the semantics of the
schema.
8Mapping XML Schema into tables
Inline as many elements as possible
Partition reviews table one for NYTimes, and one
for rest
Split show table Into TV and Movies
9Querying XML
- Presence of schema for XML documents
- For applications to interpret data
- For issuing queries
- Find the title, year and box office proceeds
- For all 2001 movies
- For v in document (imbdata)/imbd/show
- Where v/year2001
- Return v/title, v/year, v/box_office
10XML and Relational Databases
- There is a mismatch between the relational model
and - that of XML
-
- Relational Normalized, flat and fragmented
-
- XML Un-normalized, nested and monolithic
-
- How to store XML data into relational tables?
- Need to map the nested and irregular XML data
into flat and - regular tables
- How to evaluate XML queries over relational
tables? - Need to map XQuery into SQL
11Problem Storing XML in RDMS
Taken from Juliana Freires presentation
12Queries
13Mapping an XML Schema into Tables
Different applications requires specific mappings
for best performance
Publish W1Q1 0.4, Q20.4, Q30.1,
Q40.1 Lookup W2Q10.1, Q20.1, Q30.4, Q40.4
Taken from Juliana Freires presentation
14The LegoDB Storage Mapping Engine
- An optimization approach
- automatically explores a space of possible
mappings - selects the mapping which has the lowest cost for
a given application -
- Basic Principles
- Application-driven takes into account schema,
data statistics and query workload - Logical/physical independence interface is
XML-based ( XML - Schema, XQuery, XML data statistics)
- Leverage existing technology XML standards
XML-specific - operations for generating space of mappings
relational optimizer for evaluating configurations
15LegoDB
- Create a p-schema for input XML schema
- Obtain cost estimates with input of data
statistics and XQuery workload. - Search space of alternative storage
configurations to achieved an efficient mapping
for a given application.
16Architecture of the Mapping Engine
Cost (SQi)
RSi Relational Schema/Queries/Stats
PSi Physical Schema
17XML Schema to Relations
- How to transform a XML schema to a relation?
- P-Schema
- Type Show
- show _at_type String,
- year Integer,
- title String
- Review
- Type Review
- review String
18XML Schema to Relations
- For a type T and relation R
- R1- Create one relation R for each T.
- R2- Create a key for each T
- R3- Create a foreign key for all parents of T
- R4- Create columns for R for every physical type
in T - R5- Allow null values in R for every optional
type in T
19XML Schema to Relations
- Type Show
- show _at_type String,
- year Integer,
- title String
- Review
- Type Review
- review String
20Types of XML Transformations
- Inlining/Outlining
- Union Factorization/Distribution
- Repetition Merge/Split
- Wildcard Rewritings
21Inlining/Outlining
- Attributes can be outlined by removing them
from a relation and using a foreign key to relate
them to a table. - Inlining is the exact opposite.
Type TV seasonsInteger, Description, Episo
de Type Description descriptionString
Type TV seasonsInteger descriptionString
Episode
22Union Factorization/Distribution
- ((a,(bc)) (a,ba,c))
- (at1t2 at1at2)
Type Show show _at_type String, titleString
, yearInteger, (MovieTV) Type Movie
box_officeInteger, video_salesInt, Type
TV seasonsInteger, descriptionString, Ep
isode
Type Show show (_at_typeString, titleStrin
g, yearInteger, box_officeInteger, video_s
alesInteger) (_at_typeString, titleString
, yearInteger, seasonsInteger, description
String, Episode)
23Repetition Merge/Split
Type Show show _at_type String, titleString
, yearInteger, Aka, Aka0,
Type Show show _at_type String, titleString
, yearInteger, Aka1,
24Wildcard Rewritings
- We might want to access specific elements in a
wildcard, such as NYTReview
Type Reviews review (NYTReview
OtherReview) Type NYTReview
nytString Type OtherReview (!nyt)String
Type Review reviewString
25Finding the best pSchema
- Use a Greedy Search
- Search until a good result is found
- 1. Get Initial/Current Schema
- 2. Get schema cost
- 3. Apply transformations to the schema
- 4. Select the best schema cost from the
transformations - 5. If the cost is better than the current schema,
continue the search, mark this schema as the
current schema. Otherwise stop searching.
26Example Search
27Problem?
- With the way that this algorithm is set up can
you find a major oversight? - Remember how the relational data is created
- Sample XML Data
- Sample XML Queries
28Problem
- A problem can be that the relative number of each
query type is not taken into consideration. - For example, what will happen if 90 of queries
are to gather a review for a website, while that
is only 1 of 25 queries in the system. Query
distribution is not uniform!
29Problem...SolvedKind Of...
- If we take into account the frequency of a query
30Related Work
- STORED- Storing Semistructured Data
- SilkRoute- Converting Relational Data to XML
- StatiX- XML Schema statistics framework
31Conclusions
- LegoDB is an excellent way to take Cost of a
query into account when transforming an XML
document to the relational model - Although LegoDB does an excellent job of
transforming XML compared to static models, more
work can be done on how to analyze how the
frequency of queries affect the cost of the
relational model.