Xyleme - PowerPoint PPT Presentation

About This Presentation
Title:

Xyleme


Slides: 21
Provided by: UMR
Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

Title: Xyleme


1
Xyleme
  • A Dynamic Warehouse for XML Data of the Web

2
Motivation
  • Efficient storage for huge quantities of XML
    data.
  • Query processing.
  • Data acquisition strategies to build the
    repository.
  • Change control with services such as query
    subscription.
  • Semantic data integration.

3
Architecture
  • Xyleme is functionally organized in four levels:
  • Physical level (the Natix repository).
  • Logical level (data acquisition and query
    processing).
  • Application level (change management and semantic
    data integration).
  • Interface level (interface with the web and
    interface with the Xyleme clients).

4
Architecture
5
The Natix Repository
  • Xyleme requires efficient, updatable storage
    for XML data.
  • The existing approaches can be divided into two
    categories:
  • Flat streams
  • Metamodeling
  • Natix uses a hybrid approach.

6
Natix Repository
  • Instead of storing each tree node in a separate
    record, Natix stores whole documents (or subtrees
    of documents) together in one record.
  • Typical data trees may not fit on a single page,
    so a data tree is distributed over several
    pages.

[Figure: Physical tree, showing nodes f1 to f7, r1 to r3, proxy objects (p1, p2) and helper aggregate objects (h2).]

7
Natix Repository
  • A certain amount of insertions, removals and
    updates of objects stored in this way would lead
    to an unfavorable distribution of the data.
  • To avoid this, large objects are split
    semantically, based on the underlying tree
    structure.
  • The data tree is partitioned into subtrees, and
    each subtree is stored in a single record that is
    smaller than a page.
  • Connected subtrees residing in other records are
    represented by proxy objects.
  • A proxy object holds the RID (record identifier)
    of the record that contains the subtree it
    represents.
  • Substituting all proxies by their respective
    subtrees reconstructs the original data tree.
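The proxy substitution described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Node` and `Proxy` classes, the `reconstruct` function and the record names r1 to r3 are hypothetical, records are held in an in-memory dictionary instead of disk pages, and helper aggregate objects are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Proxy:
    """Stands in for a subtree that is stored in another record."""
    rid: str  # identifier of the record holding the subtree


@dataclass
class Node:
    """A node of the data tree stored inside a record."""
    label: str
    children: List[Union["Node", Proxy]] = field(default_factory=list)


def reconstruct(rid: str, records: Dict[str, Node]) -> Node:
    """Rebuild the logical tree stored under `rid` by replacing every proxy
    with the subtree of the record it points to."""
    def expand(node: Node) -> Node:
        out = Node(node.label)
        for child in node.children:
            if isinstance(child, Proxy):
                out.children.append(reconstruct(child.rid, records))
            else:
                out.children.append(expand(child))
        return out

    return expand(records[rid])


def to_tuple(node: Node):
    """Nested (label, children) form, convenient for comparing trees."""
    return (node.label, [to_tuple(c) for c in node.children])


# Hypothetical layout: one logical tree spread over three records.
records = {
    "r1": Node("f1", [Proxy("r2"), Node("f4"), Proxy("r3")]),
    "r2": Node("f2", [Node("f3")]),
    "r3": Node("f5", [Node("f6"), Node("f7")]),
}
tree = reconstruct("r1", records)
```

Each record stays small because it stores only its own subtree plus the RIDs of the records below it; the full tree exists only after the proxies have been followed.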

8
Natix Repository
  • Inserting nodes
  • To insert a node into the logical data tree as a
    child node of f1, it must be decided where in the
    physical tree the insert should take place.
  • In Natix this choice may be determined by a
    configuration parameter.
  • After an insertion location has been decided, it
    is possible that the designated record's disk
    page is full.
  • So the record has to be split.

9
Natix Repository
  • Splitting a record
  • A record's subtree before a split

10
Natix Repository
  • Record assembly for the subtree

11
Natix Repository
  • Split Matrix
  • Element (i, j) of the split matrix expresses the
    desired clustering behavior of a node x with
    label j as a child of a node y with label i.
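A split matrix can be pictured as a lookup table keyed by (parent label, child label). The sketch below is an assumption-laden illustration: the slides do not list the values a matrix element may take, so the three settings used here (always separate, keep together, leave it to the split algorithm) are assumed, and the labels, `SPLIT_MATRIX` and `plan_children` are hypothetical names.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Clustering(Enum):
    SEPARATE = "separate"  # assumed: child always goes into its own record
    TOGETHER = "together"  # assumed: child stays with its parent if possible
    DEFAULT = "default"    # assumed: the split algorithm decides


# Hypothetical split matrix, keyed by (parent label i, child label j).
SPLIT_MATRIX: Dict[Tuple[str, str], Clustering] = {
    ("article", "title"): Clustering.TOGETHER,
    ("article", "body"): Clustering.SEPARATE,
}


def clustering(parent_label: str, child_label: str) -> Clustering:
    """Look up the desired clustering; unlisted pairs fall back to DEFAULT."""
    return SPLIT_MATRIX.get((parent_label, child_label), Clustering.DEFAULT)


def plan_children(parent_label: str, child_labels: List[str]) -> Dict[Clustering, List[str]]:
    """Group the children of a node by the clustering the matrix asks for."""
    plan: Dict[Clustering, List[str]] = {c: [] for c in Clustering}
    for label in child_labels:
        plan[clustering(parent_label, label)].append(label)
    return plan


plan = plan_children("article", ["title", "author", "body"])
```

A table like this lets an administrator keep small, frequently co-accessed children next to their parent while pushing large children into records of their own.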

12
Query Processing
  • Query processing in Xyleme is similar to OQL,
    except that:
  • Xyleme operates on XML documents that can be
    viewed as trees, whereas OQL is defined on
    graphs of objects.
  • Pattern matching on trees is used to extract
    information in Xyleme, whereas OQL does not
    provide this facility. This is done with a
    complex algebraic operator named pattern scan.

13
Query Processing
  • The pattern scan operator is implemented using an
    index mechanism named XyIndex, which is an
    extension of full-text index (FTI)
    technology.
  • Standard FTI returns the documents in which a
    word occurs.
  • XyIndex adds annotations that position each
    occurrence of a word within a document relative
    to the other words.
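The difference between a plain full-text index and a position-annotated one can be sketched as follows. The slides do not say which annotation scheme XyIndex uses, so the interval numbering below is an assumed stand-in, and `build_index` and `contains` are hypothetical names. For brevity, tags and text words share one index, and text that follows a closing tag (tail text) is ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count


def build_index(docs):
    """Map each word (tag name or text token) to (doc_id, start, end) entries.

    An element gets the interval spanning everything inside it; a text word
    gets a single position, stored as start == end.
    """
    index = defaultdict(list)
    for doc_id, xml_text in docs.items():
        counter = count()

        def visit(elem):
            start = next(counter)
            words = [(w.lower(), next(counter)) for w in (elem.text or "").split()]
            for child in elem:
                visit(child)
            end = next(counter)
            index[elem.tag].append((doc_id, start, end))
            for word, pos in words:
                index[word].append((doc_id, pos, pos))

        visit(ET.fromstring(xml_text))
    return index


def contains(index, ancestor, descendant):
    """Ids of documents where `descendant` occurs inside an `ancestor` element."""
    hits = set()
    for doc_a, start, end in index.get(ancestor, []):
        for doc_d, d_start, d_end in index.get(descendant, []):
            if doc_a == doc_d and start < d_start and d_end < end:
                hits.add(doc_a)
    return hits


docs = {
    "d1": "<book><title>XML warehouse</title><author>Smith</author></book>",
    "d2": "<book><title>Web crawling</title><author>XML</author></book>",
}
index = build_index(docs)
plain_hits = {doc_id for doc_id, _, _ in index["xml"]}  # what a standard FTI returns
title_hits = contains(index, "title", "xml")            # needs the position annotations
```

Both documents contain the word "xml", but only the annotated index can tell that it appears under a title element in d1 alone, which is the kind of test a tree pattern needs.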

14
Data Acquisition
  • Crawl the web in search of XML data.
  • Refresh pages to keep the repository up to date.
  • Several crawlers can be used simultaneously and
    only XML pages are stored. HTML pages are used to
    discover new links.
  • The critical issue is deciding which document to
    read/refresh next.
  • The decision to read/refresh each page is based
    on the minimization of a global cost function
    under some constraint.
  • The constraint is the average number of pages
    that Xyleme is willing to read per time period.
  • The cost function is the dissatisfaction of users
    being presented with stale data.

15
Data Acquisition
  • More precisely, the decision is based on criteria
    such as:
  • Subscription and publication
  • Temporal information such as last-time-read or
    change rate
  • Page importance
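A refresh policy of this kind can be sketched as a greedy selection under a page budget. The slides do not give the actual cost function, so the formula below is an illustrative assumption: the chance that a page has changed is modeled as 1 - exp(-change_rate * age), weighted by importance, with a boost for subscribed pages. `Page`, `staleness_cost`, `pick_pages` and the example URLs are hypothetical.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    importance: float       # e.g. a link-based score in [0, 1]
    change_rate: float      # estimated changes per day
    last_read: float        # time of the last read, in days
    subscribed: bool = False


def staleness_cost(page, now, subscription_boost=2.0):
    """Assumed cost of leaving `page` unrefreshed at time `now`."""
    age = now - page.last_read
    p_changed = 1.0 - math.exp(-page.change_rate * age)
    weight = page.importance * (subscription_boost if page.subscribed else 1.0)
    return weight * p_changed


def pick_pages(pages, now, budget):
    """Greedy choice: refresh the `budget` pages whose staleness costs most."""
    return heapq.nlargest(budget, pages, key=lambda p: staleness_cost(p, now))


pages = [
    Page("http://example.org/a.xml", importance=0.9, change_rate=0.1, last_read=9.0),
    Page("http://example.org/b.xml", importance=0.2, change_rate=2.0, last_read=0.0),
    Page("http://example.org/c.xml", importance=0.5, change_rate=1.0, last_read=8.0,
         subscribed=True),
    Page("http://example.org/d.xml", importance=0.1, change_rate=0.01, last_read=9.5),
]
chosen = pick_pages(pages, now=10.0, budget=2)
```

The budget plays the role of the constraint (pages read per time period), and the summed cost of the pages left unrefreshed stands in for the users' dissatisfaction with stale data.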

16
Change Control
  • Change control is useful because the users may
    not only be interested in the current values but
    also in their evolution.
  • The BULD diff algorithm is used for change control.
  • The algorithm is illustrated with the following
    example.
  • Let D1 and D2 be two XML documents, D2 being the
    more recent one.
  • The starting point of the algorithm is to match
    the largest identical parts of both
    documents.
  • This is done by registering in a map a unique
    signature for each subtree of D1.
  • Then every subtree of D2, starting from the
    largest, is considered in order to find an
    identical registered subtree of D1.
  • Then the parents are matched if they have the
    same label.
  • The fact that parents are matched helps detect
    matches between descendants.
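The matching steps above can be sketched as follows. This is a simplified illustration, not the BULD implementation itself: signatures are SHA-1 hashes over tag, text and child signatures, attributes are ignored, and the final step in which matched parents help match remaining descendants is omitted. `annotate` and `match` are hypothetical names.

```python
import hashlib
import xml.etree.ElementTree as ET


def annotate(root):
    """Return ({element: (signature, size)}, {child: parent}) for a document."""
    info, parent = {}, {}

    def visit(elem):
        child_sigs, size = [], 1
        for child in elem:
            parent[child] = elem
            sig, child_size = visit(child)
            child_sigs.append(sig)
            size += child_size
        raw = "|".join([elem.tag, (elem.text or "").strip(), *child_sigs])
        sig = hashlib.sha1(raw.encode("utf-8")).hexdigest()
        info[elem] = (sig, size)
        return sig, size

    visit(root)
    return info, parent


def match(d1_root, d2_root):
    """Map elements of D2 to the elements of D1 they are matched with."""
    info1, parent1 = annotate(d1_root)
    info2, parent2 = annotate(d2_root)

    # Register a signature for each subtree of D1.
    by_sig = {}
    for elem, (sig, _) in info1.items():
        by_sig.setdefault(sig, []).append(elem)

    matched, used = {}, set()  # D2 element -> D1 element; D1 elements taken

    def match_identical(e2, e1):
        matched[e2] = e1
        used.add(e1)
        for c2, c1 in zip(e2, e1):
            match_identical(c2, c1)

    # Consider the subtrees of D2 from the largest to the smallest.
    for e2 in sorted(info2, key=lambda e: info2[e][1], reverse=True):
        if e2 in matched:
            continue
        candidates = by_sig.get(info2[e2][0], [])
        e1 = next((c for c in candidates if c not in used), None)
        if e1 is None:
            continue
        match_identical(e2, e1)
        # Match the parents as long as they carry the same label.
        p2, p1 = parent2.get(e2), parent1.get(e1)
        while (p2 is not None and p1 is not None and p2 not in matched
               and p1 not in used and p2.tag == p1.tag):
            matched[p2] = p1
            used.add(p1)
            p2, p1 = parent2.get(p2), parent1.get(p1)
    return matched


d1 = ET.fromstring(
    "<catalog><product><name>tv</name><price>100</price></product>"
    "<product><name>radio</name><price>20</price></product></catalog>")
d2 = ET.fromstring(
    "<catalog><product><name>tv</name><price>120</price></product>"
    "<product><name>radio</name><price>20</price></product></catalog>")
matched = match(d1, d2)
```

In this example only the changed price element of D2 stays unmatched: the unchanged product is matched as a whole, and the changed product is still matched through its identical name child and the same-label parent rule.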

17
Change Control
18
Semantic Data Integration
  • Queries in Xyleme are formulated using the
    structure of the documents. In some areas, people
    are defining standard DTDs, but most companies
    publishing in XML have their own.
  • Users cannot be expected to know all of the
    hundreds of DTDs.
  • Xyleme provides a view mechanism that enables
    users to query a single structure.
  • Defining views manually is a tedious process.
    RDF could be used by the designer of a DTD to
    provide some extra knowledge, but this field is
    too young.
  • Thus natural language and machine learning
    techniques have been used in Xyleme.

19
Semantic Data Integration
  • First task is to classify DTDs into domains based
    on statistical analysis of the similarities
    between words found in the different DTDs.
    Similarity is based on ontologies.
  • Once an abstract DTD has been defined to
    structure a particular domain, the next task is
    to generate the semantic connections between
    elements of the abstract DTD and elements of the
    concrete ones.
  • The problem now is to map paths to paths.
  • Not all tags along a path are necessarily words.

20
Conclusions
  • The main feature distinguishing Xyleme from
    other systems is that Xyleme is based on
    warehousing.
  • Warehousing makes queries that require joins
    over pages distributed across the web feasible.
  • Warehousing also allows precise alerts about
    changes in pages of interest.
  • Problems with data integration.