Lore: A Database Management System for Semistructured Data - PowerPoint PPT Presentation

About This Presentation
Title:

Lore: A Database Management System for Semistructured Data

Description:

Atomic objects have no outgoing edges and are types such as int, real, string, gif, java, etc. ... and max values of all atomic objects of each type reachable ... – PowerPoint PPT presentation

Number of Views:63
Avg rating:3.0/5.0
Slides: 47
Provided by: gabrielh
Category:

less

Transcript and Presenter's Notes

Title: Lore: A Database Management System for Semistructured Data


1
Lore A Database Management System for
Semistructured Data
2
Why?
  • Although data may exhibit some structure it may
    be too varied or irregular to map to a fixed
    schema.
  • Relational DBMS might use null values in this
    case.
  • May be difficult to decide in advance on a
    specific schema.
  • Data elements may change types.
  • Structure changes a lot (lots of schema
    modifications).

3
Semistructured Data
  • Examples
  • Data from the web
  • Overall site structure may change often.
  • It would be nice to be able to query a web site.
  • Data integrated from multiple, heterogeneous data
    sources.
  • Information sources change, or new sources added.

4
Object Exchange Model (OEM)
  • Data in this model can be thought of as a labeled
    directed graph.
  • Schema-less and self-describing.
  • Vertices in graph are objects.
  • Each object has a unique object identifier (oid),
    such as 5.
  • Atomic objects have no outgoing edges and are
    types such as int, real, string, gif, java, etc.
  • All other objects that have outgoing edges are
    called complex objects.

5
OEM (Cont.)
  • Examples
  • Object 3 is complex, and its subobjects are 8,
    9, 10, and 11.
  • Object 7 is atomic and has value Clark.
  • DBGroup is a name that denotes object 1.(Names
    are entry points into the database).

6
(No Transcript)
7
OEM to XML
  • Example
  • ltMember project5 6gt ltnamegtJoneslt/namegt lta
    gegt46lt/agegt ltofficegt ltbuildinggtgateslt/buildin
    ggt ltroomgt252lt/roomgt lt/officegtlt/membergt
  • This corresponds to rightmost member in the
    example OEM, where project is an attribute.

8
Lorel Query Language
  • Need query language that supports path
    expressions for traversing graph data and
    handling of typeless data.
  • A simple path expression is a name followed by a
    sequence of labels.
  • DBGroup.Member.Office.
  • Set of objects that can be reached starting with
    the DBGroup object, following edges labels member
    and then office.

9
Lorel (cont.)
  • Example
  • select DBGroup.Member.Officewhere
    DBGroup.Member.Age lt 30
  • Result
  • Office Gates 252
  • Office
  • Building CIS
  • Room 411

10
Lorel Query Rewrite
  • Previous query rewritten to
  • select Ofrom DBGroup.Member M, M.Office Owhere
    exists y in M.Age y lt 30
  • Comparison on age transformed to existential
    condition.
  • Since all properties are set-valued in OEM.
  • A user can ask DBGroup.Member.Age lt 30 regardless
    of whether Age is single valued, set valued, or
    unknown.

11
Lorel Query Rewrite
  • Why?
  • Breaking query into simple path expressions
    necessary for query optimization.
  • Need to explicitly handle coercion.
  • Atomic objects and values. 0.5 lt 0.9 should
    return true
  • Comparing objects and sets of objects.
    DBGroup.Member.Age is a set of objects.

12
Lorel (cont.)
  • General path expressions are loosely specified
    patterns for labels in the database.(
    disjunction, ? label pattern optional)
  • Example
  • select DBGroup.Member.Namewhere
    DBGroup.Member.Office(.Room.Cubicle)? like
    252
  • Result
  • Name JonesName Smith

13
Query and Update Processing
  • Query is parsed
  • Parse tree is preprocessed and translated to new
    OQL-like query.
  • Query plan constructed.
  • Query optimization.
  • Opt. query plan executed.

14
System Architecture
15
Iterators and Object Assignments
  • Use recursive iterator approach
  • execution begins at top of query plan
  • each node in the plan requests a tuple at a time
    from its children and performs some operation on
    the tuple(s).
  • pass result tuples up to parent.

16
Object Assignments (OAs)
  • OA is a data structure containing slots for range
    variables with additional slots depending on the
    query.
  • Each slot within an OA will holds the oid of a
    vertex on a path being considered by the query
    engine.
  • Example if OA1 holds oid for Smith then OA2
    and OA3 can hold the oids for one of Smiths
    Office objects and Age objects.

17
(No Transcript)
18
Query Operators
  • For example, the Scan operator returns all oids
    that are subobjects of a given object following a
    specified path expression.
  • Scan (StartingOASlot, Path_expression,
    TargetOASlot)
  • For each oid in StartingOASlot, check to see if
    object satisfies path_expression and place oid
    into TargetOASlot.
  • Other operators include Join, Project, Select,
    Aggregation, etc.
  • Join node like nested-loop join in relational
    DBMS.

19
Query Optimization
  • Does only a few optimizations
  • Push selection ops down query tree.
  • Eliminate/combine redundant query operators.
  • Explores query plans that use indexes where
    possible.
  • Two kinds of indexes
  • Lindex (link index) provide parent pointers impl.
    as hashing.
  • Vindex (value index) impl. as B-trees

20
Indexes
  • Because of non-strict typing system, have String
    Vindex, Real Vindex, and String-coerced-to-real
    Vindex.
  • Separate B-Trees for each type are constructed.
  • Using Vindex for comparison (e.g. Age lt 30)
    consider the following
  • If type is string, do lookup in String Vindex
  • If can convert to real the do lookup in
    String-coerced-to-real Vindex.
  • If type is real?

21
(No Transcript)
22
Other issues
  • Update query operator example
  • Update(Create_Edge, OA1, OA5, Member)
  • Create edge from results in OA1 to OA5 labeled
    Member.
  • Lore arranges objects in physical disk pages,
    each page with a number of slots with a single
    object in each slot.
  • Objects placed according to first-fit algorithm.
  • Supports large objects spanning multiple pages.
  • Objects clustered in depth-first manner (since
    Scan traverses depth-first).
  • Garbage collector removes unreachable objects.

23
External Data Manager
  • Enables retrieval of information from other data
    sources, transparent to the user.
  • An external object in Lore is a placeholder for
    the external data and specifies how lore
    interacts with an external data source.
  • The spec for an external object includes
  • Location of a wrapper program to fetch and
    convert data to OEM, time interval until fetched
    information becomes stale, and a set of arguments
    used to limit info fetched from external source.

24
Data Guides
  • A DataGuide is a concise and accurate summary of
    the structure of an OEM database (stored as OEM
    database itself, kind of like the system
    catalog).
  • Why?
  • No explicit schema, how do we formulate
    meaningful queries?
  • Large databases (cant just view graph
    structure).
  • What if a path expression doesnt exist (waste).
  • Each possible path expression is encoded once.

25
9, 13
26
DataGuides As Histograms
  • Each object in the dataguide can have a link to
    its corresponding target set.
  • A target set is a set of oids reachable by that
    path.
  • TS of DBGroup.Member.Age is 9, 13.
  • This is a path index. Can find set of objects
    reachable by a particular path.
  • Can store statistics in DataGuide (more in next
    paper).
  • For example, the of atomic objects of each type
    reachable by p.

27
Conclusions
  • Takes advantage of the structure where it exists.
  • Handles lack of structure well (data type
    coercion, general path expressions).
  • Query language allows users to get and update
    data from semistructured sources.
  • DataGuide allows users to determine what paths
    exist, and gives useful statistical information

28
Query Optimization for Semistructured Data
29
OEM vs. XML
  • OEMs objects correspond to elements in XML
  • Sub-elements in XML are inherently ordered.
  • XML elements may optionally include a list of
    attribute value pairs.
  • Graph structure for multiple incoming edges
    specified in XML with references (ID, IDREF
    attributes). i.e. the Project attribute.

30
Indexes
  • Vindex(op, value, l, x) places into x all atomic
    objects that satisfy the op value condition
    with an incoming edge labeled l.
  • Vindex(Age, lt, 30,y) places into y objects with
    age lt 30.
  • Lindex(x, l, y) places into x all objects that
    are parents of y via edge labeled l.
  • Lindex(x, Age, y) places into x all parents of
    y via label Age.

31
Indexes (cont.)
  • Bindex(l, x, y) finds all parent-child object
    pairs connected by a label l.
  • Bindex(Age, x, y) locates all parent-child
    pairs with label Age.
  • Pindex(PathExpression, x) placed into x all
    objects reachable via the path expression.
  • Pindex(A.B x, x.C y, y) places into y all
    objects reachable by going from A to B to C.
  • Uses DataGuide.

32
Simple Query
  • select Ofrom DBGroup.Member M, M.Office Owhere
    exists y in M.Age y lt 30
  • Possible plans
  • Top-down (similar to pointer-chasing,
    nested-loops join)
  • Use Vindex to check y lt 30, traverse backwards
    from child to parent using Lindex(bottom-up).
  • Hybrid, both top down and bottom up. Meet in
    middle.

33
Select xFrom A.B xWhere exists y in x.C y 5
34
Query Plan Generation (Overview)
  • Logical query plan generator creates high-level
    execution strategy.
  • Physical query plan enumerator uses statistics
    and a cost model to transform logical query plan
    into an estimated best physical plan that lies
    within their search space.

35
Logical Query Plans (cont.)
  • Glue node represents a rotation point that has
    as its children two independent subplans.
  • Rotating the order between independent components
    yields different plans.
  • Marks place where execution order is not fixed.
  • Discover node chooses best way to bind variables
    x and y.
  • Chain node chooses best evaluation of a path
    expression.

36
Logical query plan forSelect xFrom
DBGroup.Member xWhere exists y in x.Age ylt30
from clause
where clause
37
Physical Query Plans
38
Physical Query Plans (cont.)
  • Scan(x, l, y) places into y all objects that are
    subojects of x via edge labeled l.
  • Top-down (pointer chasing).
  • Lindex plan is bottom-up approach.
  • Bindex Locate edges whose label appears
    infrequently in database.
  • NLJ left subplan passes variables to right
    subplan.

39
Statistics
  • I/O metric uses estimated of objects fetched.
  • For every label subpath p of length lt k
  • Of atomic objects of each type reachable by p
  • Min, and max values of all atomic objects of each
    type reachable by p
  • Of instances of path p, denoted p
  • Of distinct objects reachable by p, denoted
    pd
  • Of l-labeled subobjects of all objects
    reachable by p
  • Of incoming l-labeled edges to any instance of
    p, denoted pl

40
Plan Enumeration
  • Doesnt consider joining two simple path
    expressions together unless they share a common
    variable.
  • Pindex is used only when path expression begins
    with a name and no variable except the last is
    used in the query.
  • Select clause always executes last.
  • Doesnt try to reorder multiple independent path
    expressions.

41
(No Transcript)
42
Results
  • Used XML database about movies. Database graph
    contained 62,256 nodes and 130,402 edges.
  • Experiment 1 Select DB.Movie.Title
  • Best plan is Pindex, followed by top-down
  • Worst plan is Bindex, with hash joins.

43
Results (cont.)
  • Experiment 2 All Movies with a Genre of
    Comedy
  • Where clause is very selective, bottom-up does a
    Vindex for Comedy with incoming edge Genre

44
Results (cont.)
  • Experiment 3 Query with two existentially
    quantified variables in the where clause.
  • Errors due to bad estimates of atomic value
    distributions and set operation costs.

45
Results (cont.)
  • Experiment 4 Select movies with certain quality
    rating.
  • Quality ratings uncommon in database so optimizer
    chooses to find all ratings via Bindex, and then
    work bottom-up.

46
Conclusions
  • Cost estimates are accurate and select the best
    plan most of the time
  • Execution times of best and worst plans for a
    given query can differ by many orders of
    magnitude.
  • Best strategy is highly dependent upon the query
    and database (Query optimization is good for XML
    data).
Write a Comment
User Comments (0)
About PowerShow.com