Lecture 24: XML Data Management - PowerPoint PPT Presentation

1
Lecture 24 XML Data Management
Nov. 17, 2006 ChengXiang Zhai
Most slides are from Ning Zhang's presentation:
www2.cs.uh.edu/ceick/3480/XML-3480.ppt
2
What is XML?
  • XML documents have elements and attributes
  • Elements (indicated by begin and end tags)
  • can be nested but cannot interleave each other
  • can have an arbitrary number of sub-elements
  • can have free text as values
  • <chap title="Introduction To XML">
  • some free text
  • <sect title="What is XML?"> </sect>
  • <sect title="Elements"> </sect>
  • <sect title="Why XML?"> </sect>
  • possibly more free text
  • </chap>

[Callouts in the example: begin element, attribute, end element]
Elements with the same name can be nested
3
XML
  • <bibliography>
  • <book> <title> Foundations </title>
  • <author> Abiteboul </author>
  • <author> Hull </author>
  • <author> Vianu </author>
  • <publisher> Addison Wesley </publisher>
  • <year> 1995 </year>
  • </book>
  • </bibliography>

XML describes the content: easy for applications
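
To illustrate why tagged content is easy for applications to consume, here is a minimal sketch that reads the bibliography with Python's standard `xml.etree.ElementTree` module. The function name `list_books` and the choice of returned fields are assumptions made for this example, not part of the lecture.

```python
import xml.etree.ElementTree as ET

BIB = """
<bibliography>
  <book>
    <title>Foundations</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

def list_books(xml_text):
    """Return one dict per <book>, keyed by the element names in the document."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        books.append({
            "title": book.findtext("title").strip(),
            # <author> may repeat, so collect every occurrence
            "authors": [a.text.strip() for a in book.findall("author")],
            "year": int(book.findtext("year")),
        })
    return books

books = list_books(BIB)
print(books)
```

The application never needs an external schema here: the element names in the document are enough to locate each field.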
4
Document Type Definitions (DTDs) as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title, section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> </section>
    <section> </section>
  </section>
</paper>
XML documents can be nested arbitrarily deep
5
XML for Representing Data
[Figure: the persons table (name, phone: John 3634; Sue 6343; Dick 6363) and the same data as an XML tree: persons → row → name, phone]
  • <persons>
  • <row> <name>John</name>
  • <phone> 3634</phone></row>
  • <row> <name>Sue</name>
  • <phone> 6343</phone></row>
  • <row> <name>Dick</name>
  • <phone> 6363</phone></row>
  • </persons>
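
The table-to-tree mapping on this slide can be sketched in a few lines. This is an illustrative conversion, assuming the `persons` / `row` / column-name nesting used on the slide; the helper name `rows_to_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table_name, columns, rows):
    """Build <table><row><col>value</col>...</row>...</table> from relational rows."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for col, val in zip(columns, values):
            # the column name becomes an element name, repeated in every row
            ET.SubElement(row, col).text = str(val)
    return root

persons = rows_to_xml("persons", ["name", "phone"],
                      [("John", 3634), ("Sue", 6343), ("Dick", 6363)])
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```

Note how the column names, which appear once in a relational schema, are written out again for every row.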

6
XML vs Data Models
  • XML is self-describing
  • Schema elements become part of the data
  • Relational schema: persons(name, phone)
  • In XML, <persons>, <name>, <phone> are part of the
    data, and are repeated many times
  • Consequence: XML is much more flexible
  • XML = semistructured data

7
Semi-structured Data Explained
  • Missing attributes
  • Repeated attributes

<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>
← no phone!
<person> <name> Mary</name>
<phone>2345</phone>
<phone>3456</phone> </person>
← two phones!
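
A short sketch of how an application can cope with missing and repeated attributes: treat every field as a list that may have zero, one, or many entries. The wrapping `<people>` element and the function name `phones_by_name` are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

PEOPLE = """
<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>
"""

def phones_by_name(xml_text):
    """Map each person's name to the (possibly empty) list of phone numbers."""
    root = ET.fromstring(xml_text)
    return {
        person.findtext("name"): [ph.text for ph in person.findall("phone")]
        for person in root.findall("person")
    }

directory = phones_by_name(PEOPLE)
print(directory)
```

A fixed relational schema would need NULLs for Joe and a second table (or extra columns) for Mary; here both cases fall out of the same loop.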
8
Semistructured Data Explained
  • Attributes with different types in different
    objects
  • Nested collections
  • Heterogeneous collections
  • <db> contains both <book>s and <publisher>s

<person> <name> <first> John </first>
<last> Smith </last>
</name>
<phone>1234</phone> </person>
← structured name!
9
Why XML?
  • Database Side: XML is a new way to organize data
  • Relational databases organize data in tables
  • XML documents organize data in ordered trees
  • Document Side: XML is a semantic markup language
  • HTML focuses on presentation, while plain text has
    no structure
  • XML focuses on semantics/structure in the data

[Figure: XML tree with a chap root and sect nodes]
<html> <h1> Chapter 1 </h1> some free text
<h2> Section 1 </h2> some more free text
<h3> Section 1.1 </h3> </html>
10
Data Management: Relational vs. XML
  • Relational data are well organized: fully
    structured (more strict)
  • E-R modeling to model the data structures in the
    application
  • E-R diagram is converted to relational tables and
    integrity constraints (relational schemas)
  • XML data are semi-structured (more flexible)
  • Schemas may be unfixed or unknown (flexible:
    anyone can author a document)
  • Suitable for data integration (data on the web,
    data exchange between different enterprises).

11
More about Relational vs. XML
  • XML is not meant to replace relational database
    systems
  • RDBMSs are well suited for OLTP applications
    (e.g., electronic banking), which have 1000 small
    transactions per minute.
  • XML is suitable for data exchange among
    heterogeneous data sources (e.g., Web services),
    allowing them to talk to each other.

12
Uses of XML
  • As document representation language
  • XML can be transformed to other formats (e.g., by
    XSLT)
  • XML → HTML
  • XML → LaTeX, BibTeX
  • XML → PDF
  • DocBook (standard schema for authoring
    document/book)

13
Uses of XML (cont.)
  • As data integration and exchange language
  • Web services (SOAP, WSDL, UDDI)
  • Amazon.com, eBay, Microsoft MapPoint,
  • Domain-specific data exchange schemas (>1000)
  • legal document exchange language
  • business information exchange
  • RSS: XML news feed
  • CNN, slashdot, blogs,

14
Uses of XML (cont.)
  • In general, appropriate for any data having
    hierarchical structure
  • Email
  • Header: from, to, cc, bcc
  • Body: my message, replied email
  • Network log file
  • IP address, time, request type, error code

15
Exporting Relational Data to XML
  • Product(pid, name, weight)
  • Company(cid, name, address)
  • Makes(pid, cid, price)

[Figure: E-R diagram with entities product and company connected by the relationship makes]
16
Export data grouped by companies
  • <db><company> <name> GizmoWorks </name>
  • <address> Tacoma </address>
  • <product> <name> gizmo </name>
  • <price> 19.99 </price>
  • </product>
  • <product> </product>
  • </company>
  • <company> <name> Bang </name>
  • <address> Kirkland </address>
  • <product> <name> gizmo </name>
  • <price> 22.99 </price>
  • </product>
  • </company>
  • </db>

Redundant representation of products
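
The company-grouped export can be sketched as a nested loop over the three relations. The primary-key values and the weight below are made up for illustration (the slides give only names, addresses, and prices), and `export_by_company` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Product(pid, name, weight), Company(cid, name, address), Makes(pid, cid, price).
# The ids and the weight are invented for this sketch.
products = {1: ("gizmo", 10)}
companies = {1: ("GizmoWorks", "Tacoma"), 2: ("Bang", "Kirkland")}
makes = [(1, 1, 19.99), (1, 2, 22.99)]

def export_by_company(products, companies, makes):
    """Nest each company's products under it, copying product data as needed."""
    db = ET.Element("db")
    for cid, (cname, address) in companies.items():
        company = ET.SubElement(db, "company")
        ET.SubElement(company, "name").text = cname
        ET.SubElement(company, "address").text = address
        for pid, made_by, price in makes:
            if made_by != cid:
                continue
            product = ET.SubElement(company, "product")
            ET.SubElement(product, "name").text = products[pid][0]
            ET.SubElement(product, "price").text = f"{price:.2f}"
    return db

db = export_by_company(products, companies, makes)
product_names = [n.text for n in db.findall("./company/product/name")]
prices = [p.text for p in db.iter("price")]
print(product_names, prices)
```

The single `gizmo` tuple shows up once per company that makes it, which is the redundancy the slide points out.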
17
The DTD
  • <!ELEMENT db (company*)>
  • <!ELEMENT company (name, address, product*)>
  • <!ELEMENT product (name, price)>
  • <!ELEMENT name (#PCDATA)>
  • <!ELEMENT address (#PCDATA)>
  • <!ELEMENT price (#PCDATA)>

18
Export Data by Products
  • <db> <product> <name> Gizmo </name>
  • <manufacturer>
  • <name> GizmoWorks </name>
  • <price> 19.99 </price>
  • <address> Tacoma </address>
  • </manufacturer>
  • <manufacturer>
  • <name> Bang </name>
  • <price> 22.99 </price>
  • <address> Kirkland </address>
  • </manufacturer>
  • </product>
  • <product> <name> OneClick </name>
  • </db>

Redundant Representation of companies
19
Reminds us of the network data model
20
A Data-Integration View of XML
  • What should be the underlying data model for DI
    contexts?
  • relational model is not an ideal choice
  • Developed semi-structured data model
  • started with the OEM (object exchange model) in
    the project Lore
  • Then XML came along
  • It is now the most well-known semi-structured
    data model
  • Generating much research in the DB community
  • Current standards: XMLSchema, XQuery
    (http://www.w3.org/XML/Query/)

21
XML Databases
  • Advantages
  • Manage large volume of XML data
  • Provide high-level declarative language
  • Efficiently evaluate complex queries
  • XML Data Management Issues
  • XML Data Model
  • XML Query Languages
  • XML Query Processing and Optimization

22
XML Data Model
  • Hierarchical data model
  • An XML document is an ordered tree
  • Nodes in the tree are labeled with element names.
  • Element nesting relationship corresponds to
    parent-child relationship

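[Figure: the chap document as an ordered tree: chap has an @title attribute ("Introduction to XML"), a free-text child ("some free text"), and sect children, each with its own @title (e.g., "What is XML?")]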

23
XML Schema Languages
  • Schema language defines the structure
  • Document Type Definition (DTD)
  • Context-free grammar
  • Structurally richer than relational schema
    definition language because of recursion.
  • XML Schema
  • Also context-free
  • Richer than DTD because of data type definitions
    (integer, date, sequence).

24
XML Query Languages
  • XPath
  • 13 axes (navigation directions in the tree)
  • child (/), descendant (//), following-sibling,
    following
  • NameTest, predicates
  • E.g.,
  • doc("bib.xml")//book[title="Harry Potter"]/ISBN
  • XQuery (superset of XPath)
  • FLWOR expression
  • for $x in doc("bib.xml")//book[title = "Harry Potter"]/ISBN,
  • $y in doc("imdb.xml")//movie
  • where $y//novel/ISBN = $x
  • return $y//title
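
The two queries can be approximated with the limited XPath subset in Python's `xml.etree.ElementTree`. This is only a sketch: the two small documents are invented stand-ins for bib.xml and imdb.xml, and the FLWOR expression is emulated with nested loops rather than run by an XQuery engine.

```python
import xml.etree.ElementTree as ET

# Invented sample data standing in for bib.xml and imdb.xml.
BIB = ET.fromstring("""
<bib>
  <book><title>Harry Potter</title><ISBN>111</ISBN></book>
  <book><title>Foundations</title><ISBN>222</ISBN></book>
</bib>""")

IMDB = ET.fromstring("""
<imdb>
  <movie><title>HP Movie</title><novel><ISBN>111</ISBN></novel></movie>
  <movie><title>Other Movie</title><novel><ISBN>333</ISBN></novel></movie>
</imdb>""")

# XPath: //book[title="Harry Potter"]/ISBN
isbns = [e.text for e in BIB.findall(".//book[title='Harry Potter']/ISBN")]

# FLWOR emulation: for $x ..., $y ... where $y//novel/ISBN = $x return $y//title
titles = []
for x in isbns:
    for y in IMDB.findall(".//movie"):
        if any(n.text == x for n in y.findall(".//novel/ISBN")):
            titles.extend(t.text for t in y.findall(".//title"))

print(isbns, titles)
```

The nested loops make the join between the two documents explicit; a real XQuery processor is free to pick a better join strategy.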

25
Important Problems in XML Data Management
  • How to store XML data?
  • How to efficiently evaluate XPath/XQuery
    languages?
  • Efficient physical operators
  • Query optimization
  • How to support XML update languages?
  • How to support transaction management?
  • Recovery management?

26
XML Storage
  • Extended Relational Storage
  • Convert XML documents to relational tables
  • Native Storage
  • Treat XML elements as first-class citizens
  • Hybrid of Relational and Native Storage
  • XML documents can be stored in columns of
    relational tables (XML typed column)

27
Extended Relational Storage
  • Edge-based Storage Scheme (Florescu and Kossmann
    99)
  • Each node has an ID
  • Each tuple in the edge table consists of
    (parentID, childID, type of data, reference to
    data)
  • Pro: easy to convert XML to relational tables
  • Con: impossible to answer path queries such as
    //a//b using SQL (needs a transitive closure
    operator)
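
A minimal sketch of an edge table, using SQLite from Python. Two simplifications are assumptions of this example: the tuple stores the tag and the text value inline instead of a type and a reference, and the transitive closure that `//a//b` needs is supplied by a recursive common table expression, which SQL dialects that support `WITH RECURSIVE` provide.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = "<r><a><x><b>1</b></x><b>2</b></a><b>3</b></r>"

def shred(xml_text):
    """Assign node IDs in document order and emit (parent, child, tag, value) tuples."""
    root = ET.fromstring(xml_text)
    edges, counter = [], [0]

    def visit(elem, parent_id):
        counter[0] += 1
        node_id = counter[0]
        edges.append((parent_id, node_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return edges

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (parent INTEGER, child INTEGER, tag TEXT, value TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", shred(DOC))

# //a//b: collect every node below an <a>, then keep the <b> nodes.
QUERY = """
WITH RECURSIVE below_a(id) AS (
    SELECT child FROM edge
    WHERE parent IN (SELECT child FROM edge WHERE tag = 'a')
    UNION
    SELECT e.child FROM edge e JOIN below_a d ON e.parent = d.id
)
SELECT e.value FROM edge e JOIN below_a d ON e.child = d.id
WHERE e.tag = 'b' ORDER BY e.child
"""
result = [row[0] for row in conn.execute(QUERY)]
print(result)
```

Without the recursive step, each `//` would need one self-join per possible depth, which is why the descendant axis is the weak point of this scheme.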

28
Extended Relational Storage
  • Path-based Storage Scheme: XRel (Yoshikawa et al.
    01)
  • Each node corresponds to a tuple in the table
  • Each tuple keeps a rooted path to the node (e.g.,
    /article/chap/sec/sec/@title)
  • Pro: also easy to convert XML to tables
  • Con: answering path queries, such as //a//b,
    needs expensive string pattern matching
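
The path-based idea can be sketched by storing one rooted path per node and matching `//a//b` with SQL `LIKE` patterns. This is a simplified illustration, not XRel's actual schema; the table layout and the two-pattern translation are assumptions of this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = "<r><a><x><b>1</b></x><b>2</b></a><b>3</b><ab><b>4</b></ab></r>"

def rooted_paths(xml_text):
    """Emit (rooted path, text value) for every element."""
    root = ET.fromstring(xml_text)
    rows = []

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        rows.append((path, (elem.text or "").strip()))
        for child in elem:
            visit(child, path)

    visit(root, "")
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (path TEXT, value TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?)", rooted_paths(DOC))

# //a//b: b is a child of a ('%/a/b') or a deeper descendant ('%/a/%/b').
values = [row[0] for row in conn.execute(
    "SELECT value FROM node "
    "WHERE path LIKE '%/a/b' OR path LIKE '%/a/%/b' ORDER BY value")]
print(values)
```

Every pattern starts with a wildcard, so an ordinary index on `path` cannot help and each query scans the strings, which is the cost the slide warns about. Note also that SQLite's `LIKE` is case-insensitive for ASCII by default and treats `_` as a wildcard, so real tag names would need more careful escaping.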

29
Extended Relational Storage
  • Node-based Storage Scheme: Niagara, TIMBER, etc.
    (Zhang et al. 01)
  • Each node is encoded with begin and end
    integers.
  • Begin corresponds to the node's position in a
    pre-order traversal of the tree; end corresponds
    to its position in a post-order traversal.
  • Pro: checking parent-child/ancestor-descendant
    relationships is efficient (constant time using
    begin and end)
  • Con: inefficient for updating XML
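
A sketch of interval labeling, assuming a common variant in which a single counter ticks once when a node is entered and once when it is left, so that each node's (begin, end) interval contains the intervals of all its descendants. The extra `level` field is an assumption added here so that parent-child can be told apart from ancestor-descendant.

```python
import xml.etree.ElementTree as ET

def label(xml_text):
    """Return [(tag, begin, end, level)] in document order."""
    root = ET.fromstring(xml_text)
    labels, clock = [], [0]

    def visit(elem, level):
        clock[0] += 1
        begin = clock[0]
        slot = len(labels)
        labels.append(None)          # reserve the slot to keep document order
        for child in elem:
            visit(child, level + 1)
        clock[0] += 1
        labels[slot] = (elem.tag, begin, clock[0], level)

    visit(root, 0)
    return labels

def is_ancestor(a, d):
    """Constant-time test: a's interval strictly contains d's interval."""
    return a[1] < d[1] and d[2] < a[2]

def is_parent(a, d):
    return is_ancestor(a, d) and d[3] == a[3] + 1

nodes = label("<a><b><c/></b><d/></a>")
a, b, c, d = nodes
print(nodes)
```

Inserting a node shifts the labels of everything after it in document order, which is why updates are expensive under this scheme.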

30
Native Storage
  • Subtree partition-based scheme: Natix (Kanne and
    Moerkotte 00)
  • A large XML tree is partitioned into small
    subtrees, each of which fits into one disk
    page
  • Introduces proxy and aggregate nodes to connect
    different subtrees
  • Pro: easy to update and traverse
  • Con: complex update algorithm; frequent
    deletions/additions may degrade the page usage ratio

31
Native Storage
  • Binary tree-based scheme: Arb (Koch 03)
  • Convert a tree with an arbitrary number of children
    to a binary tree (first child translates to left
    child; next sibling translates to right child)
  • Tree nodes are stored in document order
  • Each node has 2 bits indicating whether it has a
    left child and whether it has a right child
  • Pro: easy to do depth-first search (DFS)
    traversal
  • Con: inefficient to do next_sibling navigation
    and hard to update
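
The first-child/next-sibling encoding can be sketched as a flat list of records in document order, each carrying the two bits from the slide. This is an illustration of the idea only, not Arb's on-disk format; `to_binary_records` and `next_sibling` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

def to_binary_records(xml_text):
    """Return [(tag, has_left, has_right)] in document order.

    left = first child, right = next sibling."""
    root = ET.fromstring(xml_text)
    records = []

    def visit(elem, has_next_sibling):
        children = list(elem)
        records.append((elem.tag, bool(children), has_next_sibling))
        for i, child in enumerate(children):
            visit(child, i + 1 < len(children))

    visit(root, False)
    return records

def next_sibling(records, i):
    """Index of node i's next sibling, found by skipping its whole left subtree."""
    _, has_left, has_right = records[i]
    if not has_right:
        return None
    j = i + 1
    if has_left:
        pending = 1                      # nodes still expected in the left subtree
        while pending:
            _, left, right = records[j]
            pending += left + right - 1  # this node is consumed, its children are owed
            j += 1
    return j

records = to_binary_records("<a><b><c/><e/></b><d/></a>")
print(records, next_sibling(records, 1))
```

A DFS is just a left-to-right scan of `records`, but `next_sibling` has to walk over every record of the subtree in between, which matches the pro and con listed on the slide.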

32
Native Storage
  • String-based scheme: NoK (Zhang 04)
  • Convert a tree to a parenthesized string
  • E.g., a having b and c as children is converted
    to ab)c)) by a DFS of the tree, with )
    representing end-of-subtree
  • Tree can be reconstructed from the string
  • A long string can be cut into substrings that
    fit into disk pages
  • Page header can contain simple statistics to
    expedite next_sibling navigation
  • Pro: particularly optimized for DFS navigational
    evaluation plan
  • Con: inefficient for breadth-first search (BFS)
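
The string encoding and its inverse can be sketched as follows. The representation of a tree as `(label, [children])` and the restriction to single-character labels are assumptions made to keep the example short.

```python
def encode(tree):
    """Encode (label, [children]) by DFS; ')' marks the end of each subtree."""
    label, children = tree
    return label + "".join(encode(child) for child in children) + ")"

def decode(text):
    """Rebuild the tree from the string using a stack of open nodes."""
    stack = [("", [])]                  # sentinel that collects the root
    for ch in text:
        if ch == ")":
            node = stack.pop()
            stack[-1][1].append(node)   # attach the finished subtree to its parent
        else:
            stack.append((ch, []))
    return stack[0][1][0]

tree = ("a", [("b", []), ("c", [])])
encoded = encode(tree)
print(encoded)
```

Reading the string left to right is exactly a DFS, so navigational plans that follow document order touch the pages sequentially.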

33
Hybrid of Relational and Native Storage
  • All major commercial RDBMS vendors (IBM, Oracle,
    Microsoft and Sybase) support an XML type in their
    RDBMSs
  • A table can have a column whose type is XML
  • When inserting a tuple in the table, the XML
    field could be an XML document
  • XML documents are stored natively

34
Hybrid of Relational and Native Storage
  • IBM DB2 UDB
  • System RX: XML storage is similar to Natix
  • Microsoft SQL Server
  • Uses BLOB (binary large object) to represent XML
    documents
  • Oracle
  • Can use multiple formats
  • CLOB (character large object)
  • Serialized object
  • Shredded relational table

35
XML Path Processing
  • Extended Relational Approach
  • Translate XML queries to SQL statements
  • Native Approach (may be based on extended
    relational storage)
  • Join-based approach
  • Navigational approach
  • Hybrid approach

36
Extended Relational Query Processing
  • Regular expression based approach: XRel
    (Yoshikawa et al. 01)
  • Linear path expressions (without branches) are
    translated to regular expressions on strings
    (rooted paths)
  • Use the LIKE predicate in SQL to evaluate
    regular expressions
  • Pro: easy to implement
  • Con: cannot answer branching path queries

37
Extended Relational Query Processing
  • Dynamic Interval based approach: DI (DeHaan et
    al. 03)
  • Use the node labeling (begin,end) interval
    storage scheme
  • Dynamically calculate (begin,end) intervals for
    resulting nodes given a path/FLWOR expression
  • Pro: can handle all types of queries including
    FLWOR expressions
  • Con: inefficient for answering complex path
    queries

38
Native Path Query Processing
  • Merge-Join based approach: Multi-predicate Merge
    Join (MPMGJN) algorithm (Zhang et al. 01)
  • Modify the merge join algorithm to reduce
    unnecessary comparisons
  • Keep the position p of the last successful
    comparison in the right input stream
  • The next item from the left input stream starts
    scanning from position p.

39
Native Path Query Processing
  • Stack-based Structural Join (Wu et al. 02)
  • Improve the MPMGJN algorithm
  • Do not look back but keep all ancestors in a
    stack
  • When comparing the new item, just compare it with
    the top of the stack
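
A sketch of a stack-based ancestor-descendant join over (begin, end) intervals, in the spirit of the slide. It assumes both inputs are sorted by begin and come from one properly nested document; the function name and tuple layout are assumptions of this example rather than the published algorithm's exact form.

```python
def structural_join(ancestors, descendants):
    """Join two lists of (begin, end) intervals, each sorted by begin.

    Returns every (a, d) pair where interval a contains interval d."""
    out, stack = [], []
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()              # top ended before a began: no longer open
            stack.append(a)
            i += 1
        # Only the top needs checking: everything below it contains it.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        out.extend((a, d) for a in stack)
    return out

# a1 = (1, 12) contains a2 = (4, 9); four b nodes are spread across them.
a_list = [(1, 12), (4, 9)]
b_list = [(2, 3), (5, 6), (7, 8), (10, 11)]
pairs = structural_join(a_list, b_list)
print(pairs)
```

Each input is scanned once and no position is ever revisited, because the stack remembers the chain of ancestors that are still open.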

40
Native Path Query Processing
  • Holistic Twig Join (Bruno et al. 02)
  • Improve the stack-based structural join algorithm
  • Use one join algorithm for the whole path
    expression instead of one join for one step
  • Reduce the overhead to produce and store
    intermediate results

41
Native Path Query Processing
  • Natix (Brantner et al. 05)
  • Translate each step into a logical navigational
    operator Unnest-Map
  • Each unnest-map operator is translated into a
    physical operator that performs tree traversal on
    the Natix storage
  • Physical optimization can be performed on the
    physical navigational operators to reduce
    cross-cluster I/O.

42
Native Path Query Processing
  • IBM DB2 XNav (Josifovski et al. 04)
  • XML path expressions are translated into automata
  • The automaton is constructed dynamically while
    traversing the XML tree in DFS
  • Physical I/O can be optimized by navigating to
    next_sibling without traversing the whole subtree

43
Native Path Query Processing
  • Tree automata (Koch 03)
  • The tree automaton needs two passes over the tree
  • The first traversal is a bottom-up deterministic
    tree automaton to determine which states are
    reachable
  • The second traversal is a top-down tree automaton
    to prune the reachable states and compute
    predicates.

44
Hybrid Processing
  • BlossomTree (Zhang 04, Zhang 05)
  • Navigational approach is efficient for
    parent-child navigation
  • Join-based approach is efficient for
    ancestor-descendant
  • BlossomTree approach identifies sub-expressions,
    Next-of-Kin (NoK), that are efficient for
    navigational approach.
  • Use navigational approach for NoK subexpressions
    and use structural joins to join intermediate
    results

45
XML Indexing
  • Structural Index
  • Clustering tree nodes by their structural
    similarity
  • Index is a graph, in which each vertex is an
    equivalence class of similar XML tree nodes
  • Path query evaluation amounts to navigational
    evaluation on the graph

46
Overview of Cost-based Optimization
  • Query Optimization depends on
  • How much knowledge about the data we have?
  • How intelligent we can be in making use of the
    knowledge (within a time constraint)?
  • The cost of a plan is heavily dependent on
  • The cost model of each operator
  • The cardinality/selectivity of each operator

47
Cardinality Estimation
  • Full path summarization: DataGuide (Goldman 97)
    and PathTree (Aboulnaga 01)
  • Summarize all distinct paths in XML documents in
    a graph
  • Cardinality information is annotated on graph
    vertices
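
A minimal sketch of full path summarization: count how many elements share each distinct rooted path. For brevity the summary is kept as a flat counter keyed by path string rather than as a graph with annotated vertices; the sample document and the name `path_counts` are invented for this example.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def path_counts(xml_text):
    """Map every distinct rooted path to the number of elements reached by it."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(root, "")
    return counts

DOC = ("<db><book><author/><author/></book><book><author/></book>"
       "<paper><author/></paper></db>")
summary = path_counts(DOC)
print(summary["/db/book/author"])
```

The cardinality of any rooted simple path is then a single lookup, at the price of a summary that grows with the number of distinct paths.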

48
Cardinality Estimation
  • Partial path summarization: Markov Table
    (Aboulnaga 01)
  • Keep sub-paths and cardinality information in a
    table
  • Cardinalities for longer paths are calculated from
    partial paths.
  • Can use additional compression methods to
    accommodate Internet-scale databases
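
A sketch of how a longer path can be estimated from short sub-paths, assuming a table of paths of length 1 and 2 and the usual first-order Markov assumption: f(t1/t2/.../tn) ≈ f(t1/t2) · Π f(ti/ti+1) / f(ti) for i = 2..n-1. The sample document and helper names are invented for this example, and the compression step mentioned on the slide is omitted.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def markov_table(xml_text):
    """Count every tag (length-1 path) and every parent/child pair (length-2 path)."""
    root = ET.fromstring(xml_text)
    table = Counter()

    def visit(elem):
        table[(elem.tag,)] += 1
        for child in elem:
            table[(elem.tag, child.tag)] += 1
            visit(child)

    visit(root)
    return table

def estimate(table, path):
    """Estimate the cardinality of a simple path such as 'db/book/author'."""
    tags = tuple(path.strip("/").split("/"))
    if len(tags) <= 2:
        return float(table[tags])        # stored exactly
    est = float(table[tags[:2]])
    for prev, cur in zip(tags[1:], tags[2:]):
        if table[(prev,)] == 0:
            return 0.0
        # expected number of `cur` children per `prev` element
        est *= table[(prev, cur)] / table[(prev,)]
    return est

DOC = ("<db><book><author/><author/></book><book><author/></book>"
       "<paper><author/></paper></db>")
table = markov_table(DOC)
print(estimate(table, "db/book/author"))
```

The table size depends only on the number of distinct tags and tag pairs, not on how deep or how large the documents are.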

49
Cardinality Estimation
  • Structural clustered summarization: XSketch
    (Neoklis 02) and TreeSketch (Neoklis 04)
  • Similar idea to the clustering-based index
  • XSketch uses forward and backward stability, and
    TreeSketch uses count stability as the similarity
    measure
  • Heuristics to reduce the graph to fit a memory budget

50
Cardinality Estimation
  • Decompression-based approach: XSEED (Zhang 06)
  • XML documents are compressed into a small kernel
    with edge cardinality labels
  • Kernel can be decompressed into XML document with
    cardinality annotations
  • Navigational path operator can be reused on the
    decompressed XML document for cardinality
    estimation

51
Cost Modeling
  • Statistical Learning Cost Model: COMET (Zhang
    05)
  • Relational operator cost modeling is performed by
    analyzing the source code
  • XML operators are much more complex than
    relational operators; therefore, an analytical
    approach is too time-consuming
  • Statistical learning approach needs a training
    set of queries and learns the cost model from the
    input parameters and real cost.
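
To make the idea of learning a cost model from (input parameter, measured cost) pairs concrete, here is a toy sketch. Ordinary least squares on one feature is swapped in as the learner purely for illustration, and the training numbers are synthetic; neither is taken from COMET.

```python
def fit_linear(xs, ys):
    """Closed-form least-squares fit of y ≈ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic training set: nodes visited by an operator vs. its measured cost.
visited = [10, 20, 30, 40]
cost = [25.0, 45.0, 65.0, 85.0]

intercept, slope = fit_linear(visited, cost)

def predict(nodes_visited):
    """Predicted cost for a new query, using the learned parameters."""
    return intercept + slope * nodes_visited

print(intercept, slope, predict(50))
```

The optimizer only needs estimates of the input parameters for a new query; no hand-derived formula for the operator is required.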

52
What does XML Offer?
  • Two major points raised by XML from a data
    management viewpoint
  • Schema last
  • Complex network-oriented data model

53
Schema Last
  • Application categories
  • Rigidly structured data
  • Rigidly structured data with some text fields
  • Semi-structured data (need to handle semantic
    heterogeneity)
  • Text
  • Very few examples of the 3rd category
  • The 3rd category can be converted to 1 and 2.

54
XML Data Model
  • XML Records can be hierarchical as in IMS
  • Have links as in CODASYL
  • Have set-based attributes as in SDM
  • Inherit from other records as in SDM
  • And others that are known to be hard to implement
  • Possible scenarios
  • XMLSchema will fail
  • A data-oriented subset of XMLSchema will be
    proposed
  • Repeat the great debate
  • Lessons
  • L16: Schema-last is probably a niche market
  • L17: XQuery is pretty much OR SQL with a
    different syntax
  • L18: XML will not solve the semantic
    heterogeneity either inside or outside the
    enterprise

55
Future of XML
  • Likely a hot topic for many years for both data
    exchanges and data integration
  • Likely will become a common playground for DB
    and IR researchers (e.g., the INEX initiative
    http://qmir.dcs.qmul.ac.uk/INEX/)
  • Many challenges to solve!
  • Would XML converge to either relational DB search
    or free text search?