1
CMPUT 692 Course Project
  • Efficient Index Structures for Semi-Structured
    Data

Dean Cheng, April 15, 2005, Department of Computing
Science, University of Alberta
2
Outline
  • Introduction
  • DataGuides
  • APEX
  • XISS
  • Index Fabric
  • ViST
  • XRegion
  • Discussion
  • Conclusion/Recommendation

3
Introduction
  • XML has become the gold standard for data
    exchange, and XML documents keep growing in size
  • An efficient means of querying XML data is needed
  • XML data can be highly irregular, and XML queries
    can be complex
  • Solution: use indexes

4
DataGuides
  • One of the earliest works on indexing
    semi-structured data
  • Models XML data as a graph and queries as paths
  • Stores all paths from root to leaves
  • Handles simple paths efficiently (no wildcards, no
    branching)
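
A minimal sketch of the path-summary idea, assuming a tree-shaped document held as nested (label, children) tuples. The toy document and the function name are hypothetical; a strong DataGuide over graph-shaped data needs more machinery than this tree-only version. Each distinct label path is recorded once, and a simple path query becomes a single lookup:

```python
from collections import defaultdict

# Hypothetical toy document: each node is (label, [children]).
doc = ("restaurants", [
    ("restaurant", [("name", []), ("phone", [])]),
    ("restaurant", [("name", []), ("owner", [("name", [])])]),
])

def build_path_summary(root):
    """Map every distinct label path (dotted string) to the nodes it reaches."""
    summary = defaultdict(list)
    stack = [(root, root[0])]
    while stack:
        node, path = stack.pop()
        summary[path].append(node)
        for child in node[1]:
            stack.append((child, path + "." + child[0]))
    return summary

summary = build_path_summary(doc)
# A simple path query (no wildcard, no branching) is one dictionary lookup.
names = summary["restaurants.restaurant.name"]
print(len(summary), "distinct paths;", len(names), "restaurant names")
```

A query with a wildcard or a branch cannot be answered by one lookup in this structure, which is the limitation the later indexes address.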

5
APEX (1)
  • Motivation: storing all paths from root to leaves
    is very inefficient for complex queries
    (traversal and join costs)
  • Stores only paths of length two, plus frequent
    query paths mined from the query workload;
    queries are answered by joins
  • More flexible and faster than the strong DataGuide

6
APEX (2)
  • Frequent pattern mining example
  • Let the required paths be A, B, C, D, B.D
  • Let the query workload be A.D, C, A.D
  • Let minSup be 0.6 (remove paths whose count < 2)
  • Counts: A = 2, B = 0, C = 1, D = 2, A.D = 2, B.D = 0
  • Updated required paths: A, B, C, D, A.D (a path
    of length 1 is always in the required path set)
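
The example above can be reproduced with a short sketch. It assumes that support is counted over the contiguous sub-paths of each workload query, which matches the counts on the slide; the function names are hypothetical and this is not the APEX implementation itself:

```python
from collections import Counter

def subpaths(query):
    """All contiguous label sub-paths of a dotted path, e.g. A.D -> A, D, A.D."""
    labels = query.split(".")
    return {".".join(labels[i:j])
            for i in range(len(labels))
            for j in range(i + 1, len(labels) + 1)}

def update_required_paths(required, workload, min_sup):
    counts = Counter()
    for query in workload:
        counts.update(subpaths(query))  # each query counts a sub-path once
    threshold = min_sup * len(workload)  # 0.6 * 3 = 1.8, so a count of 2 is needed
    frequent = {p for p, c in counts.items() if c >= threshold}
    # Paths of length 1 always stay in the required path set.
    always = {p for p in required if "." not in p}
    return always | frequent, counts

required = ["A", "B", "C", "D", "B.D"]
workload = ["A.D", "C", "A.D"]
updated, counts = update_required_paths(required, workload, 0.6)
print(sorted(updated))
```

B.D is dropped because no workload query contains it, while A.D is added because it appears in two of the three queries.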

7
APEX (3)
  • Experimental setup: Pentium III 866 MHz platform
    with MS Windows 2000 and 512 MB of main memory
  • Datasets: Play (regular), FlixML (irregular),
    GedML (highly irregular)

8
APEX (4)
9
XISS (1)
  • Similar to APEX, breaks XML data down into
    subunits: B+Tree indexes for element, attribute,
    name, value and structure
  • Introduces a numbering scheme <order, size> to
    determine ancestor-descendant relationships in
    constant time
  • Uses join algorithms to produce results

10
XISS (2)
  • Extended preorder (order) and a range of
    descendants (size). Y is a descendant of X iff
    order(X) < order(Y) <= order(X) + size(X)
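
A minimal sketch of the numbering scheme, assuming a tree of (name, children) tuples with unique node names (a hypothetical representation). It uses tight preorder numbers and exact descendant counts, whereas the extended preorder can leave gaps for future insertions, and it uses the inclusive upper bound order(X) + size(X):

```python
def assign_labels(tree):
    """Return {name: (order, size)}: preorder number and descendant count."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        name, children = node
        counter += 1
        order = counter
        # Descendants of this node: each child plus that child's descendants.
        size = sum(visit(child) + 1 for child in children)
        labels[name] = (order, size)
        return size

    visit(tree)
    return labels

def is_ancestor(labels, x, y):
    """Constant-time test: is x an ancestor of y?"""
    order_x, size_x = labels[x]
    order_y, _ = labels[y]
    return order_x < order_y <= order_x + size_x

tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
labels = assign_labels(tree)
print(labels)
print(is_ancestor(labels, "b", "d"), is_ancestor(labels, "b", "e"))
```

The test needs only the two labels, so no tree traversal is required at query time.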

11
Intermediate (XISS and APEX)
  • Advantages
  • Can handle both simple and complex queries
  • APEX introduces automatic detection and update of
    frequent query paths
  • XISS uses B+Tree for all indexes, taking advantage
    of RDBMS technologies
  • Disadvantage
  • Join cost of combining the smaller subunits

12
Index Fabric
  • Uses a Patricia trie to index strings; stores all
    paths from root to leaves, plus refined paths
  • Key points: compact and balanced, therefore very
    efficient for indexing strings and very efficient
    for simple paths
  • Requires a DBA to define the refined paths, unlike
    APEX, which detects frequent paths automatically
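
A rough sketch of indexing root-to-leaf paths as strings. It swaps in a plain, uncompressed dictionary-based trie rather than a Patricia trie, and the one-letter tag codes (I for invoice, B for buyer, S for seller, N for name) and record ids are hypothetical, chosen only for illustration:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.records = []  # ids of records whose key ends at this node

def insert(root, key, record_id):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.records.append(record_id)

def prefix_search(root, prefix):
    """Return the sorted ids of all records whose key starts with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.records)
        stack.extend(current.children.values())
    return sorted(found)

root = TrieNode()
# Each key is an encoded root-to-leaf path followed by the leaf value.
insert(root, "IBN:ABC Corp", 1)
insert(root, "ISN:Acme", 1)
insert(root, "IBN:XYZ Ltd", 2)

print(prefix_search(root, "IBN:"))     # records that have a buyer name
print(prefix_search(root, "IBN:ABC"))  # buyer names starting with ABC
```

A simple path query turns into one string lookup, which is why this style of index does well on simple paths. A Patricia trie would additionally collapse single-child chains to keep the structure compact.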

13
ViST (1)
  • Previous indexes handle branching path queries by
    decomposing the query into multiple sub-queries
    and joining the sub-query results to form the
    final answer
  • ViST encodes paths at the structure level, so
    complex queries need not be decomposed
  • The encoding uses both paths and values

14
ViST (2)
  • Preorder sequence: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
  • Encoding: (symbol, prefix) pairs

15
ViST (3)
  • PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
  • Find orders with Boston sellers and NY buyers
    (v5 = Boston, v7 = NY), encoded as (P, ε)(S, P)(L,
    PS)(v5, PSL)(B, P)(L, PB)(v7, PBL); no join is
    required
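
The query encoding above can be generated with a short sketch, assuming the query tree is held as nested (symbol, children) tuples (a hypothetical representation) and writing the empty prefix as an empty string. It covers only the encoding step, not ViST's sequence matching:

```python
def encode(node, prefix=""):
    """Preorder list of (symbol, prefix) pairs; prefix is the path from the root."""
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        pairs.extend(encode(child, prefix + symbol))
    return pairs

# Orders (P) with a seller (S) located (L) in v5 and a buyer (B) located in v7.
query = ("P", [
    ("S", [("L", [("v5", [])])]),
    ("B", [("L", [("v7", [])])]),
])
pairs = encode(query)
print(pairs)
```

Because the branching structure is carried in the prefixes, the whole query stays a single sequence rather than a set of sub-queries to be joined.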

16
ViST (4)
  • Use suffix tree

17
ViST (5)
  • RIST: uses a static labeling scheme to determine
    ancestor-descendant relationships in the suffix
    tree
  • ViST: uses a dynamic labeling scheme to determine
    ancestor-descendant relationships, so no suffix
    tree is required (hence the name Virtual Suffix
    Tree); implemented using B+Trees

18
ViST (6)
  • Experimental setup: B+Tree API from the Berkeley
    DB library on a Linux machine with a 662 MHz
    Pentium III CPU and 256 MB of main memory
  • Datasets: DBLP and XMARK

19
XRegion
  • A generic method for mapping XML data into a
    relational database schema. This is important no
    matter which index structure is used.
  • Partitions according to the cardinalities of node
    occurrences: reduces data fragmentation and
    stores related data in one table, giving less I/O
    and join cost
  • Ancestor-descendant relationships are kept in a
    meta table for efficient lookup

20
Discussion: Desired Properties (1)
  • Handle both simple and complex path expressions
    while limiting the traversal cost and the join
    cost required for answering queries
  • A labeling system for ancestor-descendant
    relationships is important for reducing traversal
    cost
  • Take advantage of existing relational database
    technologies (B+Tree, as used by XISS and ViST)

21
Discussion: Desired Properties (2)
  • The index structure should allow dynamic data
    insertion, deletion, structural changes, etc.
  • Use the query workload to mine frequent query
    paths, as in APEX
  • Use a good mapping strategy such as XRegion (less
    data fragmentation, less I/O and join cost)
  • Relying on relational database technologies helps
    in handling the increasing size of XML data

22
Conclusion/Recommendation
  • ViST seems to have the most desirable properties
  • XRegion outperforms other generic mapping methods
  • APEX is the only one that utilizes the query
    workload
  • Recommendation: use XRegion to map XML data into a
    relational database, use ViST to index the paths
    in XRegion's meta table, and use frequent query
    path mining to provide refined-path functionality
    dynamically

23
References
  • [1] Chin-Wan Chung, Jun-Ki Min, and Kyuseok Shim.
    APEX: An adaptive path index for XML data. In
    SIGMOD Conference, pages 121-132, 2002.
  • [2] Brian Cooper, Neal Sample, Michael J.
    Franklin, Gisli R. Hjaltason, and Moshe Shadmon.
    A fast index for semistructured data. In VLDB,
    pages 341-350, 2001.
  • [3] Roy Goldman and Jennifer Widom. DataGuides:
    Enabling query formulation and optimization in
    semistructured databases. In VLDB, pages 436-445,
    1997.
  • [4] Quanzhong Li and Bongki Moon. Indexing and
    querying XML data for regular path expressions.
    In VLDB, pages 361-370, 2001.
  • [5] Haixun Wang, Sanghyun Park, Wei Fan, and
    Philip S. Yu. ViST: A dynamic index method for
    querying XML data by tree structures. In SIGMOD
    Conference, pages 110-121, 2003.
  • [6] Meng Xue. XRegion: A structure-based approach
    to storing XML data in relational databases.
    University of Alberta Computing Science Master of
    Science Thesis, 2004.

24
Thanks!
  • Questions?