1
Management of XML Documents without Schema in
Relational Database Systems
  • Workshop
  • Objects, <XML> and Databases
  • OOPSLA 2001, Tampa
  • Thomas Kudrass
  • Leipzig University of Applied Sciences
  • Department of Computer Science and Mathematics

2
Overview
  • Introduction
  • Motivation
  • Main Issues
  • Structure-Oriented Approach
  • Storage / Data Model
  • Queries
  • Evaluation
  • Opaque Approach
  • Storage
  • Queries (vs. XPath)
  • Evaluation
  • Prototype Implementation
  • Interface
  • Experience
  • Outlook

3
Motivation
  • XML is used in
  • data publishing (document-centric documents)
  • data exchange (data-centric documents)
  • Why XML Documents without Schema?
  • generated by programs
  • mostly data-centric documents, e.g., account
    statements
  • high update frequency of the document structure
    → evolving schemas
  • Problems
  • How to deal with XML documents without DTD / XML
    Schema in databases?
  • Evaluate approaches
  • Use relational database systems

4
Main Issues
  • Evaluate Storage and Retrieval Methods
  • no defined document schema
  • platform Oracle 8i
  • XML-to-Relational Mapping Approaches
  • structure-oriented decomposition
  • opaque approach
  • Identify Parameters of a Unified XML-DB Interface
  • Implement a Testbed
  • qualitative assessment
  • performance of both approaches

5
Structure-Oriented Approach
  • Characteristics
  • decomposition of an XML document into smaller
    units (elements)
  • depends on the document structure only
  • target system: relational DBMS with a generic schema
  • Variety of Mapping Methods
  • model XML documents as directed, ordered, labeled
    graphs and map them to tables
  • proposed algorithms
  • edge tables (Florescu, Kossmann)
  • universal table
  • inlining techniques (Shanmugasundaram et al.)
  • model-based fragmentation
  • Monet XML model

6
Storing XML Data in Relations
  • XML-QL Data Model
  • <tree>
    <person age="55"> <name>Peter</name> </person>
    <person age="38"> <name>Mary</name>
    <address>Fruitdale Ave.</address> </person>
    </tree>

7
Data Model
[Figure: diagram of the generic data model; only the cardinality labels (1, n, 0/1) survive in this transcript]

8
Import Algorithm
  • <tree>
  • <person age="36">
  • <name>Peter</name>
  • <address>
  • <street>Main Road 4</street>
  • <zip>04236</zip>
  • <city>Leipzig</city>
  • </address>
  • </person>
  • </tree>
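
The import step can be sketched in Python as a recursive walk that writes one edge row per element or attribute. This is only an illustration under stated assumptions: the table and column names are borrowed from the SQL shown on the later slides, but the full schema, the ID numbering, and the use of SQLite in place of Oracle 8i are assumptions, not the prototype's actual implementation.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

# Assumed generic schema; only the names appear on the slides.
SCHEMA = """
CREATE TABLE tblLeafs (LeafId INTEGER PRIMARY KEY, Value TEXT);
CREATE TABLE tblAttrs (AttrId INTEGER PRIMARY KEY, Value TEXT);
CREATE TABLE tblEdge (
    SourceId INTEGER, TargetId INTEGER, EdgeName TEXT,
    Type TEXT, Depth INTEGER, LeafId INTEGER, AttrId INTEGER
);
"""


def import_document(conn, xml_text):
    """Decompose one XML document into edge, leaf, and attribute rows."""
    ids = itertools.count(1)  # node IDs in document order; 0 is the document root

    def add_value(table, value):
        cur = conn.execute(f"INSERT INTO {table} (Value) VALUES (?)", (value,))
        return cur.lastrowid

    def visit(elem, source_id, depth):
        node_id = next(ids)
        text = (elem.text or "").strip()
        # Only the first text node is kept (one text per element).
        leaf_id = add_value("tblLeafs", text) if text else None
        conn.execute(
            "INSERT INTO tblEdge VALUES (?, ?, ?, 'element', ?, ?, NULL)",
            (source_id, node_id, elem.tag, depth, leaf_id),
        )
        for name, value in elem.attrib.items():
            attr_id = add_value("tblAttrs", value)
            conn.execute(
                "INSERT INTO tblEdge VALUES (?, ?, ?, 'attribute', ?, NULL, ?)",
                (node_id, next(ids), name, depth + 1, attr_id),
            )
        for child in elem:
            visit(child, node_id, depth + 1)

    visit(ET.fromstring(xml_text), 0, 0)
    conn.commit()


DOC = """<tree><person age="36"><name>Peter</name><address>
<street>Main Road 4</street><zip>04236</zip><city>Leipzig</city>
</address></person></tree>"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
import_document(conn, DOC)
edge_count = conn.execute("SELECT COUNT(*) FROM tblEdge").fetchone()[0]
leaf_count = conn.execute("SELECT COUNT(*) FROM tblLeafs").fetchone()[0]
```

For the sample document this yields eight edge rows (seven elements plus the age attribute) and four leaf rows, which hints at why larger documents produce many tuples.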

9
Query Processing
  • Query Language
  • an XML query language is most appropriate
  • the data model of our solution is based on XML-QL
    → XML-QL is the preferred choice
  • XML-Relational Mismatch
  • a relational DBMS understands SQL only
  • → requires translation from XML-QL to SQL
  • generate the result document from the tuples
    retrieved

10
Query Processing
  • XML-QL Query → Parser → Object Structure
  • Object Structure → Generate SQL Statement → SQL Statement
  • SQL Statement → Execute SQL Statement (against the DB) → Row Set
  • Row Set → Construct Result Document → XML Document

11
Generate an SQL-Statement
  • XML-QL Query
  • CONSTRUCT <result>
  • WHERE
  • <person>
  • <name>$n</name>
  • <address>$a</address>
  • </person>
  • CONSTRUCT
  • <person>
  • <name>$n</name>
  • <address>$a</address>
  • </person>
  • </result>
  • SQL Statement
  • SELECT DISTINCT
  • B.Type AS n_Type,
  • B.TargetId AS n_TargetId,
  • B.Depth AS n_Depth,
  • C.Value AS n_Value,
  • D.Type AS a_Type,
  • D.TargetId AS a_TargetId,
  • D.Depth AS a_Depth,
  • E.Value AS a_Value
  • FROM
  • tblEdge A,tblEdge B,tblLeafs C,
  • tblEdge D,tblLeafs E
  • WHERE
  • (A.EdgeName = 'person') AND
  • (A.TargetId = B.SourceId) AND
  • (B.EdgeName = 'name') AND
  • (B.LeafId = C.LeafId(+)) AND
  • (A.TargetId = D.SourceId) AND
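
A rough Python sketch of the translation step, for a one-level pattern only. The function `pattern_to_sql` and its arguments are hypothetical, SQLite stands in for Oracle, and ANSI `LEFT JOIN` replaces the Oracle `(+)` outer-join operator. The SQL on the slide is cut off, so the conditions generated for the address branch are assumed by symmetry with the name branch.

```python
import sqlite3


def pattern_to_sql(parent, bindings):
    """Build SQL for <parent><child>$var</child>...</parent>.

    bindings maps a variable name to a child element name. Names are
    interpolated directly, so this sketch assumes trusted input.
    """
    select, joins = [], []
    for i, (var, child) in enumerate(bindings.items()):
        edge, leaf = f"E{i}", f"L{i}"
        select.append(f"{leaf}.Value AS {var}_Value")
        joins.append(
            f"JOIN tblEdge {edge} ON {edge}.SourceId = P.TargetId "
            f"AND {edge}.EdgeName = '{child}' "
            f"LEFT JOIN tblLeafs {leaf} ON {leaf}.LeafId = {edge}.LeafId"
        )
    return (
        f"SELECT DISTINCT {', '.join(select)} FROM tblEdge P "
        + " ".join(joins)
        + f" WHERE P.EdgeName = '{parent}'"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblLeafs (LeafId INTEGER PRIMARY KEY, Value TEXT);
CREATE TABLE tblEdge (SourceId, TargetId, EdgeName, LeafId);
-- Hand-built rows for the two-person sample document (IDs are assumed).
INSERT INTO tblLeafs VALUES (1, 'Peter'), (2, 'Mary'), (3, 'Fruitdale Ave.');
INSERT INTO tblEdge VALUES
  (0, 1, 'tree', NULL),
  (1, 2, 'person', NULL), (2, 3, 'name', 1),
  (1, 4, 'person', NULL), (4, 5, 'name', 2), (4, 6, 'address', 3);
""")

sql = pattern_to_sql("person", {"n": "name", "a": "address"})
rows = conn.execute(sql).fetchall()
```

Only Mary is returned, because the pattern requires both a name and an address child and Peter has no address.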

12
Construct Result Document
  • Result Tuple
  • Subtree Reconstruction
  • SELECT
  • A.EdgeName,
  • A.Type,
  • Al.Value AS A_LeafVal,
  • Aa.Value AS A_AttrVal
  • FROM
  • tblEdge A,
  • tblLeafs Al,
  • tblAttrs Aa
  • WHERE
  • A.SourceId = 5 AND
  • A.leafId = Al.leafId(+) AND
  • A.attrId = Aa.attrId(+)
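
Subtree reconstruction can be pictured as repeating the query above for each child node. The following Python sketch is an assumption-laden illustration: it additionally selects `TargetId` so that it can recurse, orders children by `TargetId` (assuming IDs were assigned in document order), and uses SQLite with `LEFT JOIN` instead of Oracle's `(+)` syntax. The sample rows are hand-built.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblLeafs (LeafId INTEGER PRIMARY KEY, Value TEXT);
CREATE TABLE tblAttrs (AttrId INTEGER PRIMARY KEY, Value TEXT);
CREATE TABLE tblEdge (SourceId, TargetId, EdgeName, Type, LeafId, AttrId);
INSERT INTO tblLeafs VALUES
  (1, 'Peter'), (2, 'Main Road 4'), (3, '04236'), (4, 'Leipzig');
INSERT INTO tblAttrs VALUES (1, '36');
INSERT INTO tblEdge VALUES
  (1, 2, 'person',  'element',   NULL, NULL),
  (2, 3, 'age',     'attribute', NULL, 1),
  (2, 4, 'name',    'element',   1,    NULL),
  (2, 5, 'address', 'element',   NULL, NULL),
  (5, 6, 'street',  'element',   2,    NULL),
  (5, 7, 'zip',     'element',   3,    NULL),
  (5, 8, 'city',    'element',   4,    NULL);
""")

CHILDREN_SQL = """
SELECT A.TargetId, A.EdgeName, A.Type, Al.Value, Aa.Value
FROM tblEdge A
LEFT JOIN tblLeafs Al ON A.LeafId = Al.LeafId
LEFT JOIN tblAttrs Aa ON A.AttrId = Aa.AttrId
WHERE A.SourceId = ?
ORDER BY A.TargetId
"""


def fill(elem, node_id):
    """Attach the attributes and child elements of node_id to elem."""
    for target, name, kind, leaf_val, attr_val in conn.execute(
        CHILDREN_SQL, (node_id,)
    ).fetchall():
        if kind == "attribute":
            elem.set(name, attr_val)
        else:
            child = ET.SubElement(elem, name)
            child.text = leaf_val
            fill(child, target)  # recurse into the child's own subtree


address = ET.Element("address")
fill(address, 5)
xml_out = ET.tostring(address, encoding="unicode")
```

Each level of the subtree costs one more query in this sketch, so deep documents need many round trips to rebuild.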

13
Query Result
  • XML Result Document
  • <result>
  • <person>
  • <name>Peter</name>
  • <address>
  • <street>Main Road 4</street>
  • <zip>04236</zip>
  • <city>Leipzig</city>
  • </address>
  • </person>
  • </result>

14
Advantages
  • Vendor Independence
  • no specific DBMS features needed
  • Stability
  • High Flexibility of Queries
  • retrieve and update single values
  • full SQL functionality can be used
  • Well-Suited for Structure-Oriented Queries
  • structures are represented in tables

15
Drawbacks
  • Information Loss
  • Comments
  • Processing Instructions
  • Prolog
  • CDATA Sections
  • Entities
  • Restrictions
  • only one text (content) per element
  • <element>
  • Text1
  • <subelement/>
  • Text2 → lost
  • </element>
  • element text stored as VARCHAR(n), n < 4000
  • Increased Load Time
  • sample document: 3.3 MB, 130,000 tuples, 13
    minutes
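
The mixed-content restriction can be reproduced with a few lines of Python. This is a sketch of a hypothetical mapping that keeps one text value per element, not the prototype's code. In `xml.etree.ElementTree`, the text before the first child is the element's `text`, while text after a child is that child's `tail`.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<element>Text1<subelement/>Text2</element>")

stored_text = elem.text    # text before the first child element
lost_text = elem[0].tail   # text after <subelement/>, attached to the child

# Hypothetical single-text mapping: one (element name, text) row per element.
rows = [(e.tag, (e.text or "").strip()) for e in elem.iter()]
```

`Text2` appears in none of the rows, so a mapping of this shape cannot restore it on export.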

16
Opaque Approach
  • Characteristics
  • XML document stored as Large Object (LOB)
  • document completely preserved
  • Storage
  • INSERT INTO tblXMLClob VALUES (1, 'person.xml',
  • '<person>
  • <name>Mary</name>
  • </person>')
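
A minimal Python sketch of opaque storage, assuming SQLite with a `TEXT` column as a stand-in for an Oracle CLOB. The column name `DocName` is an assumption; `DocId` and `content` follow the query shown on the interMedia Text slide. The sample document deliberately contains a prolog and a comment, which the structure-oriented approach would drop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblXMLClob (DocId INTEGER PRIMARY KEY, DocName TEXT, content TEXT)"
)

doc = (
    '<?xml version="1.0"?>\n'
    "<!-- generated by a program -->\n"
    "<person>\n  <name>Mary</name>\n</person>"
)
# Parameter binding avoids quoting problems with markup inside SQL strings.
conn.execute("INSERT INTO tblXMLClob VALUES (?, ?, ?)", (1, "person.xml", doc))

stored = conn.execute(
    "SELECT content FROM tblXMLClob WHERE DocId = ?", (1,)
).fetchone()[0]
```

The stored value is byte-for-byte the document that was inserted, including whitespace, prolog, and comment.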

17
Oracle interMedia Text
  • Query Facilities of interMedia Text
  • full-text retrieval (word matching only)
  • path expression only together with content search
  • no range queries
  • Example in interMedia Text
  • SELECT DocId FROM tblXMLClob WHERE
  • CONTAINS(content, '(Mary WITHIN name) WITHIN
    person') > 0
  • XML Full-Text Index
  • Autosectioner Index
  • XML Sectioner Index
  • WITHIN operator
  • text_subquery WITHIN elementname
  • searches the entire text content of the named tag

18
Comparison of Queries
interMedia Text
  • returns document IDs
  • word matching (default)
  • no existence test for elements or attributes
  • restricted set of path expressions using WITHIN,
    e.g. (xml WITHIN title) WITHIN book
  • limited attribute value searches, no nesting of
    attribute searches
  • numeric and date values are not type-converted
  • no range searches on attribute values
XPath
  • returns document fragments
  • substring matching
  • search for existing elements or attributes
  • path expressions → structure-oriented queries,
    e.g. //Book/Title[contains(., "xml")]
  • searches for attribute values and element text
    can be combined
  • also considers decimal values
  • range searches possible using filters
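
Two of the differences listed above (word matching versus substring matching, and document IDs versus fragments) can be imitated in plain Python. Both functions are hypothetical stand-ins: `word_within` only mimics the behaviour of a `WITHIN` text query and `substring_fragments` mimics an XPath `contains()` test; neither calls interMedia Text or a real XPath engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document store: DocId -> XML text (stands in for CLOB rows).
docs = {
    1: "<person age='38'><name>Mary</name></person>",
    2: "<person age='55'><name>Maryann</name></person>",
}


def word_within(word, tag):
    """Text-index style: whole-word match inside <tag>, returns document IDs."""
    hits = []
    for doc_id, text in docs.items():
        for elem in ET.fromstring(text).iter(tag):
            if word in re.findall(r"\w+", elem.text or ""):
                hits.append(doc_id)
                break  # one hit is enough to report the document
    return hits


def substring_fragments(needle, tag):
    """XPath style: substring match inside <tag>, returns the fragments."""
    return [
        ET.tostring(elem, encoding="unicode")
        for text in docs.values()
        for elem in ET.fromstring(text).iter(tag)
        if needle in "".join(elem.itertext())
    ]


ids = word_within("Mary", "name")
fragments = substring_fragments("Mary", "name")
```

The word search reports only document 1, while the substring search returns the `name` fragments of both documents.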

19
XPath Queries with PL/SQL
  • Prerequisite
  • XDK for PL/SQL installed on the server
  • Parse CLOB into DOM representation

20
Advantages
  • Information Preservation
  • Handling of Large Documents
  • appropriate for document-centric documents with
    little structure and prose-rich elements
  • Different XML Document APIs
  • interMedia Text: restricted set of XPath
    functionality
  • XPath: requires generating a DOM of the document
    before querying

21
Drawbacks
  • Restricted Expressiveness of Text Queries
  • Performance vs. Accuracy of Query Results
  • interMedia Text queries on CLOBs are faster than
    the DOM API
  • sample document: 12.5 MB, parse time 3 min, load
    time 5 min
  • Restrictions of Indexes
  • maximum length of tag names for indexing (incl.
    namespace): 64 bytes
  • Problems with Markup
  • character entities
  • Vendor Dependence
  • text engines are proprietary, e.g., Oracle
    interMedia
  • Stability
  • maximum document size: 50 MB
  • memory errors may occur with smaller documents

22
User Interface

23
XML Database Interface
[Figure: interface between the client and the XML database; surviving labels: Client, XML Document, DocID / Doc Name, XML Query, DocList]
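
One way to picture such an interface is a small Python protocol that both storage approaches could implement. Everything here is hypothetical: the method names and signatures are assumptions derived only from the operations named in the slides (import, query, export, delete), and `OpaqueStore` is an in-memory stand-in rather than the prototype.

```python
from typing import Dict, List, Protocol, Tuple
import xml.etree.ElementTree as ET


class XMLStore(Protocol):
    """Hypothetical unified XML database interface."""

    def import_document(self, name: str, xml_text: str) -> int: ...
    def query(self, path: str) -> List[str]: ...
    def export_document(self, doc_id: int) -> str: ...
    def delete_document(self, doc_id: int) -> None: ...


class OpaqueStore:
    """Keeps each document as unparsed text, like a CLOB column."""

    def __init__(self) -> None:
        self._docs: Dict[int, Tuple[str, str]] = {}
        self._next_id = 1

    def import_document(self, name: str, xml_text: str) -> int:
        ET.fromstring(xml_text)  # raises ParseError for malformed input
        doc_id, self._next_id = self._next_id, self._next_id + 1
        self._docs[doc_id] = (name, xml_text)
        return doc_id

    def query(self, path: str) -> List[str]:
        # Parse on demand and return matching fragments from all documents.
        return [
            ET.tostring(elem, encoding="unicode")
            for _, text in self._docs.values()
            for elem in ET.fromstring(text).findall(path)
        ]

    def export_document(self, doc_id: int) -> str:
        return self._docs[doc_id][1]

    def delete_document(self, doc_id: int) -> None:
        del self._docs[doc_id]


store: XMLStore = OpaqueStore()
doc_id = store.import_document("person.xml", "<person><name>Mary</name></person>")
names = store.query("./name")
```

A structure-oriented store could satisfy the same protocol with a different query language, which is where parameters such as query type and result granularity would come in.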

24
Comparison
Structure-Oriented Approach
  • Import: loss of information, time-consuming,
    produces lots of tuples
  • Queries (XML-QL): fast, new document as result
Opaque Approach
  • Import: no loss of information, faster than
    structure-oriented decomposition
  • Queries (interMedia Text): fast, only document IDs
    as result
  • Queries (XPath): high response time, flexible
    granularity of results

25
Implementation Experience
  • Import Problems
  • structure-oriented decomposition
  • SAX parser produces a FillBuf error for XML
    documents > 3 MB
  • limitations of VARCHAR columns (max. 4000 bytes)
  • opaque approach
  • OutOfMemory error during import of XML documents
    > 4 MB
  • reason: too little heap size in Java
  • use start option -Xmx<HeapSize>
  • Queries
  • opaque approach
  • OutOfMemory error when parsing CLOB into DB
  • increase java_pool_size (100-150 MB)
  • increase shared_pool_size (150-200 MB)
  • Export and Delete without Problems

26
Outlook
  • Experience
  • problems with larger documents in both approaches
  • no universal solution for all requirements
  • Co-Existence of Multiple Storage Approaches
  • integrate different storage engines
  • combine structure-oriented decomposition and
    opaque approach
  • need for a generic XML data type
  • New Data Model for Structure-Oriented Approach
  • reduce loss of information
  • XML Database Interface / Middleware
  • combines different approaches
  • parameterize the XML database interface