Title: Lore: A Database Management System for Semistructured Data
1Lore A Database Management System for
Semistructured Data
2LORE
- Lore Lightweight Object Repository
- http//WWW-DB.Stanford.EDU/lore/demo/
3Lore - motivation
- The data may be irregular and thus not conform to
a rigid schema. - Relational data model has null values, and OO
models have inheritance and complex objects. Both
have difficulties in designing schemas to
incorporate irregular data. - It may be difficult to decide in advance on a
single, correct schema, The structure of the data
may evolve rapidly, data elements may change
types, or data not conforming to previous
structure may be added.
4- Thus, there is a need for management of
- semi-structured data!
- Lore system manages semi-structured data. The
data managed by Lore is not confined to a schema
and it may be irregular or incomplete. - OEM is the Lores data model. OEM - object
Exchange Model - graph based self-describing
object instance model where nodes are objects and
edges are labeled with attribute names and leaf
nodes have atomic values - Lore is light weight object repository and Lorel
is Lores query language.
5- Path expression queries
- Automatic type coercion
- Use of Data guides Structural summary of the
database
6- Objects with Oid
- Atomic objects no outgoing edge and contain a
value from one of the basic atomic types such as
Integer, real, string, gif, java, audio etc. - Complex objects may have outgoing edges
- Names Sp. Labels that serves as aliases for
objects and as entry points to the database.
7Object Exchange Model - OEM
- Motivation - information exchange and extraction
- Each value exchanged is given an explicit label.
- Object ?temp-in-Fahrenheit, integer, 80? -
temp-in-Fahrenheit is the label. Each object is
self-describing, with a label, type and value. - ?set-of-temps, set, cmpnt1, cmpnt2 ?
- cmpnt1 is ?temp-in-Fahrenheit, integer, 80?
- cmpnt2 is ?temp-in-Celsius, integer, 20?
8Labels
- Plays two roles
- identifying an object (component)
- identifying the meaning of an object (component)
?person-record, set, cmpnt1, cmpnt2, cmpnt3 ?
cmpnt1 is ?person-name, string, Fred?
cmpnt2 is ?office-num-in-bldg-5, integer, 333?
cmpnt3 is ?department, string, toy?
- Person-name both identifies cmpnt1 and coveys its
meaning.
- In relational data this corresponds to .
9Labels - Issues
- What does the label mean?
- Database of labels
- Ontology of labels - within each source
- Labels are relative (more specific) to the source
of the data object. - Similar labels from different sources need to be
resolved.
- Labels provide the flexibility in representing
- object structure
10Self-describing data models
- Have been in existence for a long time? Why
additional interest now?
- Use the nature of self-describing data model
for information exchange, and to extend the model
to include object nesting. - To provide an appropriate object request language
(query facility)
11OEM - Specification
- Each object in OEM has the following structure
- Label A variable character string describing
what the object represents. - Type The data type of the objects value. Each
is either an atom type, or type set. - Value A variable-length value of the object.
- Object-ID A unique variable-length identifier
for the object or null.
12OEM - Summary
- OEM is an information exchange model. It does not
specify how objects are stored at source.
- OEM does specify how objects are received at a
client, but after objects are received they can
be stored in any way the client likes.
- Each source has a distinguished object with
lexical identifier root.
- Note the schema-less nature of OEM is
particularly useful when a client does not know
in advance the labels or structure of OEM objects.
13- ltbiblio,set,doc1,doc2,,docngt
- doc1 is ltdoc, set, auths1, topic1, call-no1gt
- auths1 is ltauth-set,set auth11gt
- auth11 is ltauth-ln, string, Ullmangt
- topic1 is lttopic, string,Databasesgt
- call-no1 is ltinternal-call-no, integer, 25gt
- doc2 is ltdoc, set, auths2, topic2, call-no2gt
- auths2 is ltauth-set,set auth21, auth22,
auth23gt - auth21 is ltauth-ln, string, Ahogt
- auth22 is ltauth-ln, string, Hopcroftgt
- auth23 is ltauth-ln, string, Ullmangt
-
Example
- topic2 is lttopic, string,Algorithmsgt
- call-no1 is ltdewey-decimal, string, BR273gt
- docn is ltdoc, set, authsn, topicn, call-nongt
- authsn is ltauth,string, Crichtongt
- topic1 is lttopic, string,Dinosaursgt
- call-no1 is ltfictional-call-no, integer, 95gt
- biblio is the root object.
14OEM - QL
- SELECT Fetch-expression
- FROM Object
- WHERE Condition
- The result of this query is itself an object,
with special label answer - ?answer, set, obj1, obj2, , objn ?
- Each returned obji is a component of object
specified in the From clause of the query, where
the component is located by the Fetch-expression
and satisfies the Condition.
15Path
- The notion of path is used in both
Fetch-Expression in the Select clause and the
condition in the Where clause. - Path describes traversals through an object using
subobject structure and labels. - Example biblio.doc.auth
- Paths are used in Fetch-Expression to specify
which components are are returned in the answer
object. - Paths are used in the condition to qualify the
fetched objects or other (related) components in
the same object structure.
16Queries - Simple
- Retrieve the topic of each document for which
Ullman is one of the authors - SELECT biblio.doc.topic
- FROM root
- WHERE biblio.doc.auth-set.auth-ln Ullman
- Intuitively, the querys where clause finds all
paths through subobject structure with the
sequence of labels biblio,doc,auth-set,auth-ln
such that the object at the end of the path has
value Ullman. - ltanswer, set, obj1, obj2gt
- obj1 is lttopic, string, Databasesgt
- obj2 is lttopic, string, Algorithmsgt
17Queries - wild-cards
- Retrieve all documents with internal call number
- SELECT biblio.?.topic
- FROM root
- WHERE biblio.?.internal-call-no
- ? label matches any label. For this query,
the doc labels can be replaced by any other
strings and query would produce the same result.
By convention, two occurrences of ? In the same
query must match the same label unless variables
are used. - ltanswer, set, obj1gt
- obj1 is lttopic, string, Databasesgt
18Queries - wild-paths
- Retrieve all documents with internal call number
- SELECT .topic
- FROM root
- WHERE .internal-call-no
- Symbol matches any path of length one or
more. The use of followed by a single label is
a convenient and common way to locate objects
with a certain label in complex structure.
Similar to ?, two occurrences of in the same
query must match the same sequence of labels,
unless variables are used. - ltanswer, set, obj1gt
- obj1 is lttopic, string, Databasesgt
19Queries - variables
- Retrieve each document for which both
Hopcroft and Aho are co-authors - SELECT biblio.doc
- FROM root
- WHERE biblio.doc.auth-set.auth-ln(a1)Aho and
- biblio.doc.auth-set.auth-ln(a1)H
opcroft - Here, the query finds all the paths with
structure biblio, doc, auth-set, and with two
distinct path completions with label auth with
values Aho and Hopcroft - ltanswer, set, obj1gt
- obj1 is the complete doc2
20OEM (Cont.)
- Examples
- Object 3 is complex, and its subobjects are 8,
9, 10, and 11. - Object 7 is atomic and has value Clark.
- DBGroup is a name that denotes object 1.(Names
are entry points into the database).
21An OEM Database
DBGroup
1
Member
Project
Member
Member
Project
Member
2
3
4
5
6
Name
Project
Name
Office
Project
Age
Name
Age
Office
Office
9
11
8
10
12
13
14
7
15
16
Clark
Smith
46
Gates 252
Lore
Tsimmis
Jones
28
Room
Building
Room
Building
17
18
19
20
CIS
411
CIS
252
22Lorel Queries - Simple Path Expression
- Retrieve the offices of members with age greater
than 30 years - Query SELECT DBGroup.Member.Office
- WHERE DBGroup.Member.Age gt 30
- Result Office Gates 252
- Office
- Building CIS
- Room 411
23Lorel Query Rewrite
- Previous query rewritten to
- select Ofrom DBGroup.Member M, M.Office Owhere
exists y in M.Age y lt 30 - Comparison on age transformed to existential
condition. - Since all properties are set-valued in OEM.
- A user can ask DBGroup.Member.Age lt 30 regardless
of whether Age is single valued, set valued, or
unknown.
24Lorel Query Rewrite
- Why?
- Breaking query into simple path expressions
necessary for query optimization. - Need to explicitly handle coercion.
- Atomic objects and values. 0.5 lt 0.9 should
return true - Comparing objects and sets of objects.
DBGroup.Member.Age is a set of objects.
25Queries - General Path Expression
- Query SELECT DBGroup.Member.Name
- WHERE DBGroup.Member.Office(.Room.Cubicle)?
- Like 252
- Result Name Jones
- Name Smith
- Room matches all labels starting from Room, like
Room68. stands for disjunction. ? indicates
that the label pattern is optional. like 252
specifies that the data value should end with
string 252.
26Queries - SubQueries
Retrieve Lore project members who work on other
projects Query SELECT M.Name, ( SELECT
M.Project.Title WHERE M.Project.Title !
Lore) FROM DBGroup.Member M WHERE
M.Project.Title Lore Result Member Name
Jones Title Tsimmis
27Data Guides
- A DataGuide is a concise and accurate summary of
the structure of an OEM database (stored as OEM
database itself, kind of like the system
catalog). - Why?
- No explicit schema, how do we formulate
meaningful queries? - Large databases (cant just view graph
structure). - What if a path expression doesnt exist (waste).
- Each possible path expression is encoded once.
289, 13
29DataGuides As Histograms
- Each object in the dataguide can have a link to
its corresponding target set. - A target set is a set of oids reachable by that
path. - TS of DBGroup.Member.Age is 9, 13.
- This is a path index. Can find set of objects
reachable by a particular path. - Can store statistics in DataGuide
- For example, the of atomic objects of each type
reachable by p.
30Conclusions
- Takes advantage of the structure where it exists.
- Handles lack of structure well (data type
coercion, general path expressions). - Query language allows users to get and update
data from semistructured sources. - DataGuide allows users to determine what paths
exist, and gives useful statistical information - Lore does facilitate query and updates on
semi-structural databases
31OEM vs. XML
- OEMs objects correspond to elements in XML
- Sub-elements in XML are inherently ordered.
- XML elements may optionally include a list of
attribute value pairs. - Graph structure for multiple incoming edges
specified in XML with references (ID, IDREF
attributes). i.e. the Project attribute.
32OEM to XML
- Example
- ltMember project5 6gt ltnamegtJoneslt/namegt lta
gegt46lt/agegt ltofficegt ltbuildinggtgateslt/buildin
ggt ltroomgt252lt/roomgt lt/officegtlt/membergt - This corresponds to rightmost member in the
example OEM, where project is an attribute.
33Query Optimization for Semistructured Data
34(No Transcript)
35External Data Manager
- Enables retrieval of information from other data
sources, transparent to the user. - An external object in Lore is a placeholder for
the external data and specifies how lore
interacts with an external data source. - The spec for an external object includes
- Location of a wrapper program to fetch and
convert data to OEM, time interval until fetched
information becomes stale, and a set of arguments
used to limit info fetched from external source.
36Layer 0
- Access to Lore API, Applications
- Parser textual query as input and parse tree as
output - Preprocessor input parse tree to OQL like query
- Query plan and query optimizer (index etc)
37Layer 1
- Object manager translation layer between OEM to
lower level files, compare objects, - Query operator execute query plan, perform
simple coercion, iterating over subobjects of a
complex object
38- QUERY
- select M.Name,
- (select M.Project.Title where M.Project.Title !
"Lore" ) - from DBGroup.Member M
- where M.Project.Title "Lore"
- RESULT
- Member
- Name "Jones"
- Title "Tsimmis"
39- update P.Member
- ( select DBGroup.Member
- where DBGroup.Member.Name "Clark" )
- from DBGroup.Project P
- where P.Title "Lore" or
- P.Title "Tsimmis"
- P.member specifies to add member edges between
P and every object returned by subquery.
40Query Execution Plan
- Iterative approach in query processing
- Execution begins at the top of the query plan,
with each node in the plan requesting a tuple at
a time from its children and performing some
operation on the tuple. - After a node completes its operation, it passes a
resulting tuple to its parent. -
41- Object assignment OA is a simple data
structures containing slots corresponding to
range variables in the query and some additional
slots - Each slot within an OA will hold the oid of a
vertex on a data path currently being considered
by the query engine - Example OA1 holds oid for member smith, then
OA2 and OA3 can hold the oids for one of smiths
office subobjects and one of his age subobjects
resp.
42- Query Operators
- Each operator takes a number of arguments with
the last argument being OA slot that will contain
the result of the operation. - Select , project has no target slot.
- Scan returns all oids that are subobjects of a
given object - Scan (starting OA slot, path exp., Target OA
slot) - Scan until no subobjects that satisfies path
expression - Scan (OA1, office, OA2) place into slot OA2
one at a time all office subobjects appearing in
slot OA1.
43- Join/select/project nearly identical to RDBMS
- Project is to limit which objects should be
returned - Select applied predicate to the object identified
by the oid in the OA slot specified - Aggregate implements quantification and
aggregation node calls its child exhaustively ,
storing the results temp. or computing
aggregation - A new object is created when no more valid OAs
-
44- select O
- from DBGroup.Member M,
- M.Office O
- where exists A in M.Age A gt 30
45(No Transcript)
46Primary Operators
- Setop, ArithOp, Create Set and Groupby
- Setop handles union, intersect,except
- Arithop addition, multiplication
- Createset is to package the results of an
arbitrary subquery before proceeding its called
its child exhaustively , storing each oid
returned as part of a newly created complex
object. - It stores oid for the new set of objects within
target slot - Group by handles subquery that includes a groupby
expression
47- select M.Name, count(M.Publication)
- from DBGroup.Member M
- where M.Dept "CS"
- select (select N from M.Name N), count(select P
- from M.Publication P)
- from DBGroup.Member M
- where exists D in M.Dept D "CS"
48(No Transcript)
49Indexing
- Value Index vindex
- Lindex a link (edge) index
- Lindex (oid, label)
- and returns the oids of all parents via the
specified label (provides parent pointers) - Vindex (label,operator,value) returns all
atomic objects having an incoming edge with the
specified label and a value satisfying the
specified operator (lt)
50Index Query Plan
- Bottem Up
- Locate all objects with desired values and
appropriately labeled incoming edges via vindex - Using Lindex then traverse up from these objects
to match the path exp
51New Query Operators
- Vindex
- Lindex
- Once
- Named-obj
52Indexes
- Vindex(l, op, value, x) places into x all atomic
objects that satisfy the op value condition
with an incoming edge labeled l. - Vindex(Age, lt, 30,y) places into y objects with
age lt 30. - Lindex(x, l, y) places into x all objects that
are parents of y via edge labeled l. - Lindex(x, Age, y) places into x all parents of
y via label Age.
53Indexes (cont.)
- Bindex(l, x, y) finds all parent-child object
pairs connected by a label l. - Bindex(Age, x, y) locates all parent-child
pairs with label Age. - Pindex(PathExpression, x) placed into x all
objects reachable via the path expression. - Pindex(A.B x, x.C y, y) places into y all
objects reachable by going from A to B to C. - Uses DataGuide.
54- Vindex as B tress
- Lindex Linear Hashing
- String Vindex index entries for all string
based atomic values - Real Vindex index entries for all numeric based
atomic values - String-coerced-to real Vindex
- Contains all strings values that can be coerced
into an integer or real
55- If the value is of type string then
- Lookup in string index
- If the value can be coerced to a real then lookup
in the coerced value in real vindex - If the value is of type real (integer) then
- Lookup in real vindex
- Also lookup in the string-coerced to real vindex
56(No Transcript)
57Simple Query
- select Ofrom DBGroup.Member M, M.Office Owhere
exists y in M.Age y lt 30 - Possible plans
- Top-down (similar to pointer-chasing,
nested-loops join) - Use Vindex to check y lt 30, traverse backwards
from child to parent using Lindex(bottom-up). - Hybrid, both top down and bottom up. Meet in
middle.
58Coercion for Basic op.
- Arg2 String Real Int
- Arg1
- String - Stirng Real both Real
- Real Stirng Real - Int Real
- Int both Real Int Real -