Processing of structured documents - PowerPoint PPT Presentation

Provided by: helenaah
1
Processing of structured documents
  • Part 6

2
XML Query language
  • W3C 20.12.2001 working drafts
  • A data model (XQuery 1.0 and XPath 2.0)
  • XQuery 1.0 Formal Semantics (June 2001)
  • XQuery 1.0 A Query Language for XML
  • influenced by the work of many research groups
    and query languages
  • goal: a query language that is broadly
    applicable across all types of XML data sources

3
Usage scenarios
  • Human-readable documents
  • perform queries on structured documents and
    collections of documents, such as technical
    manuals,
  • to retrieve individual documents,
  • to generate tables of contents,
  • to search for information in structures found
    within a document, or
  • to generate new documents as the result of a query

4
Usage scenarios
  • Data-oriented documents
  • perform queries on the XML representation of
    database data, object data, or other traditional
    data sources
  • to extract data from these sources
  • to transform data into new XML representations
  • to integrate data from multiple heterogeneous
    data sources
  • the XML representation of data sources may be
    either physical or virtual
  • data may be physically encoded in XML, or an XML
    representation of the data may be produced

5
Usage scenarios
  • Mixed-model documents
  • perform both document-oriented and data-oriented
    queries on documents with embedded data, such as
    catalogs, patient health records, employment
    records
  • Administrative data
  • perform queries on configuration files, user
    profiles, or administrative logs represented in
    XML
  • Native XML repositories (databases)

6
Usage scenarios
  • Filtering streams
  • perform queries on streams of XML data to process
    the data (logs of email messages, network
    packets, stock market data, newswire feeds, EDI)
  • to filter and route messages represented in XML
  • to extract data from XML streams
  • to transform data in XML streams
  • DOM
  • perform queries on DOM structures to return sets
    of nodes that meet the specified criteria

7
Usage scenarios
  • Multiple syntactic environments
  • queries may be used in many environments
  • a query might be embedded in a URL, an XML page,
    or a JSP or ASP page
  • represented by a string in a program written in a
    general-purpose programming language
  • provided as an argument on the command-line or
    standard input

8
Requirements
  • Query language syntax
  • the XML Query Language may have more than one
    syntax binding
  • one query language syntax must be convenient for
    humans to read and write
  • one query language syntax must be expressed in
    XML in a way that reflects the underlying
    structure of the query
  • Declarativity
  • the language must be declarative
  • it must not enforce a particular evaluation
    strategy

9
Requirements
  • Reliance on XML Information Set
  • the XML Query data model relies on information
    provided by XML Processors and Schema Processors
  • it must ensure that it does not require
    information that is not made available by such
    processors
  • Datatypes
  • the data model must represent both XML 1.0
    character data and the simple and complex types
    of the XML Schema specification
  • Schema availability
  • queries must be possible whether or not a schema
    is available

10
Requirements: functionality
  • Support operations (selection, projection,
    aggregation, sorting, etc.) on all data types
  • Choose a part of the data based on content or
    structure
  • Also operations on hierarchy and sequence of
    document structures
  • Structural preservation and transformation
  • Preserve the relative hierarchy and sequence of
    input document structures in the query results
  • Transform XML structures and create new XML
    structures
  • Combination and joining
  • Combine related information from different parts
    of a given document or from multiple documents

11
Requirements: functionality
  • References
  • Queries must be able to traverse intra- and
    inter-document references
  • Closure property
  • The result of an XML document query is also an
    XML document (usually not valid but well-formed)
  • The results of a query can be used as input to
    another query
  • Extensibility
  • The query language should support the use of
    externally defined functions on all datatypes of
    the data model

12
XQuery
  • Design goals
  • a small, easily implementable language
  • queries are concise and easily understood
  • flexible enough to query a broad spectrum of XML
    information sources (incl. both databases and
    documents)
  • a human-readable query syntax
  • features borrowed from many languages
  • Quilt, XPath, XQL, XML-QL, SQL, OQL, Lorel, ...

13
XQuery vs. other XML activities
  • XQuery 1.0 and XPath 2.0 Data Model
  • semantics of XQuery is defined in the XQuery
    Formal Semantics
  • type system is based on the type system of XML
    Schema
  • path expressions (for navigating in hierarchic
    documents): the path expressions of XPath 2.0
  • the XML-based syntax is described in XQueryX

14
XQuery
  • A query is represented as an expression
  • several kinds of expressions -> several forms
  • expressions can be nested with full generality
  • the input and output of a query are instances of
    a data model (XQuery 1.0 and XPath 2.0 Data
    Model)
  • a fragment of a document or a collection of
    documents may lack a common root and may be
    modeled as an ordered forest of nodes

15
An instance of the Data Model - an ordered forest
16
XQuery expressions
  • path expressions
  • element constructors
  • FLWR ("flower": for-let-where-return) expressions
  • expressions involving operators and functions
  • conditional expressions
  • quantified expressions
  • expressions that test or modify datatypes

17
Path expressions
  • the result of a path expression is an ordered
    list of nodes (document order)
  • each node includes its descendant nodes -> the
    result is an ordered forest
  • the top-level nodes in the result are ordered
    according to their position in the original
    hierarchy (in top-down, left-right order)
  • no duplicate nodes

18
Element constructors
  • An element constructor creates an XML element
  • consists of a start tag and an end tag, enclosing
    an optional list of expressions that provide the
    content of the element
  • the start tag may also specify the values of one
    or more attributes
  • typical use:
  • nested inside another expression that binds
    variables that are used in the element constructor

19
Example
  • Generate an <emp> element containing an empid
    attribute and nested <name> and <job> elements.
    The values of the attribute and nested elements
    are specified elsewhere.

    <emp empid={$id}>
      <name> {$n} </name>
      <job> {$j} </job>
    </emp>
20
Element constructors
  • In an element constructor, curly braces {}
    delimit enclosed expressions, distinguishing them
    from literal text
  • enclosed expressions are evaluated and replaced
    by their value, whereas material outside curly
    braces is simply treated as literal text
  • an enclosed expression may evaluate to any
    sequence of nodes and/or simple values
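
The split between literal markup and enclosed expressions can be mimicked outside XQuery. The following is only a rough Python sketch using the standard library's xml.etree.ElementTree; the function name make_emp and the sample values are made up for illustration. The fixed element and attribute names play the role of literal text, and the function arguments play the role of the enclosed expressions.

```python
import xml.etree.ElementTree as ET

def make_emp(emp_id, name, job):
    """Build an <emp> element; the arguments stand in for enclosed expressions."""
    emp = ET.Element("emp", {"empid": str(emp_id)})  # attribute value is computed
    ET.SubElement(emp, "name").text = name           # element content is computed
    ET.SubElement(emp, "job").text = job
    return emp

# Hypothetical sample values, not taken from the slides.
emp_xml = ET.tostring(make_emp(12345, "Smith", "Clerk"), encoding="unicode")
print(emp_xml)
```

Unlike an XQuery constructor, this builds the tree through API calls, so there is no way to confuse literal text with an expression.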

21
Computed element constructors
  • Generate an element with a computed name,
    containing nested elements named <description>
    and <price>

    element {$tagname} {
      <description> {$d} </description>,
      <price> {$p} </price>
    }
22
FLWR expressions
  • Constructed from for, let, where, and return
    clauses
  • compare SQL select-from-where
  • clauses must appear in a specific order
  • 1. for/let, 2. where, 3. return
  • a FLWR expression binds values to one or more
    variables and then uses these variables to
    construct a result (in general, an ordered forest
    of nodes)

23
A flow of data in a FLWR expression
24
for clauses
  • A for clause introduces one or more variables,
    associating each variable with an expression that
    returns a list of nodes (e.g. a path expression)
  • the result of a for clause is a list of tuples,
    each of which contains a binding for each of the
    variables
  • each variable in a for clause can be thought of
    as iterating over the nodes returned by its
    respective expression

25
let clauses
  • A let clause is also used to bind one or more
    variables to one or more expressions
  • a let clause binds each variable to the value of
    its respective expression without iteration
  • results in a single binding for each variable
  • Compare:
  • for $x in /library/book -> many bindings (books)
  • let $x := /library/book -> single binding (a list
    of books)

26
for/let clauses
  • A FLWR expression may contain several for and let
    clauses
  • each of these clauses may contain references to
    variables bound in previous clauses
  • the result of the for/let sequence: an ordered
    list of tuples of bound variables
  • the number of tuples generated by the for/let
    sequence: the product of the cardinalities of the
    node-lists returned by the expressions in the for
    clauses

27
for/let clauses
    let $s := (<one/>, <two/>, <three/>)
    return <out>{$s}</out>

Result:

    <out> <one/> <two/> <three/> </out>
28
for/let clauses
    for $s in (<one/>, <two/>, <three/>)
    return <out>{$s}</out>

Result:

    <out><one/></out>
    <out><two/></out>
    <out><three/></out>
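
The difference between the two bindings can be imitated in Python. This is a loose sketch, not XQuery semantics in full: the helper name wrap is invented here, and ElementTree serializes empty elements with a space before the slash.

```python
import xml.etree.ElementTree as ET

items = [ET.Element(tag) for tag in ("one", "two", "three")]

def wrap(nodes):
    """Return the serialized <out> element containing copies of the given nodes."""
    out = ET.Element("out")
    for node in nodes:
        out.append(ET.Element(node.tag))
    return ET.tostring(out, encoding="unicode")

# let: the variable is bound once, to the whole sequence -> a single result
let_result = [wrap(items)]

# for: the variable is bound to each item in turn -> one result per item
for_result = [wrap([s]) for s in items]

print(let_result)
print(for_result)
```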
29
for/let clauses
    for $i in (1,2), $j in (3,4)
    return <tuple><i> {$i} </i> <j> {$j} </j></tuple>

Result:

    <tuple><i>1</i><j>3</j></tuple>
    <tuple><i>1</i><j>4</j></tuple>
    <tuple><i>2</i><j>3</j></tuple>
    <tuple><i>2</i><j>4</j></tuple>
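
When the for expressions are independent of each other, the list of binding tuples is a Cartesian product, with the last variable varying fastest. A minimal Python sketch, assuming independent sequences (the name binding_tuples is illustrative only):

```python
from itertools import product

def binding_tuples(*sequences):
    """One tuple per combination of values; the last variable varies fastest."""
    return list(product(*sequences))

tuples = binding_tuples((1, 2), (3, 4))
for i, j in tuples:
    print(f"<tuple><i>{i}</i><j>{j}</j></tuple>")
```

The tuple count is the product of the sequence lengths, here 2 x 2 = 4. If a later for clause refers to a variable bound by an earlier one, the sequences are no longer independent and a plain product does not apply.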
30
where clause
  • Each of the binding tuples generated by the for
    and let clauses can be filtered by an optional
    where clause
  • only those tuples for which the condition in the
    where clause is true are used to invoke the
    return clause
  • the where clause may contain several predicates
    connected by and, or, and not
  • predicates usually contain references to the
    bound variables

31
where clause
  • Variables bound by a for clause represent a
    single node
  • -> scalar predicates, e.g. $p/color = "Red"
  • Variables bound by a let clause may represent
    lists of nodes
  • -> list-oriented predicates, e.g. avg($p/price) >
    100

32
return clause
  • The return clause generates the output of the
    FLWR expression
  • a node, an ordered forest of nodes, or a primitive
    value
  • is executed on each tuple
  • contains an expression that often contains
    element constructors, references to bound
    variables, and nested subexpressions

33
Examples
  • Assume a document named bib.xml
  • contains a list of <book> elements
  • each <book> contains a <title> element, one or
    more <author> elements, a <publisher> element, a
    <year> element, and a <price> element

34
List the titles of books published by Addison Wesley after 1998

    <bib>
      {
        for $b in document("bib.xml")//book
        where $b/publisher = "Addison Wesley"
          and $b/year > 1998
        return
          <book year={$b/year}>
            {$b/title}
          </book>
      }
    </bib>
35
Result could be...

    <bib>
      <book year="1999">
        <title>TCP/IP Illustrated</title>
      </book>
      <book year="2000">
        <title>Advanced Programming in the Unix environment</title>
      </book>
    </bib>
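
The for / where / return steps of this query map onto an ordinary loop with a filter. Below is a rough Python sketch using the standard library; the sample document BIB is made up to match the structure described for bib.xml, and recent_titles is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the bib.xml described in the slides.
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title><author>A</author>
    <publisher>Addison Wesley</publisher><year>1999</year><price>1</price></book>
  <book><title>Advanced Programming in the Unix environment</title><author>A</author>
    <publisher>Addison Wesley</publisher><year>2000</year><price>2</price></book>
  <book><title>Example Book</title><author>B</author>
    <publisher>Example Press</publisher><year>1995</year><price>3</price></book>
</bib>"""

def recent_titles(bib_root, publisher, after_year):
    result = ET.Element("bib")
    for b in bib_root.iter("book"):                       # for $b in ...//book
        if (b.findtext("publisher") == publisher          # where ... and ...
                and int(b.findtext("year")) > after_year):
            book = ET.SubElement(result, "book", {"year": b.findtext("year")})
            ET.SubElement(book, "title").text = b.findtext("title")  # return
    return result

found = recent_titles(ET.fromstring(BIB), "Addison Wesley", 1998)
titles = [t.text for t in found.iter("title")]
years = [b.get("year") for b in found.iter("book")]
print(titles, years)
```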
36
List each publisher and the average price of its books

    for $p in distinct-values(document("bib.xml")//publisher)
    let $a := avg(document("bib.xml")//book[publisher = $p]/price)
    return
      <publisher>
        <name> {$p/text()} </name>
        <avgprice> {$a} </avgprice>
      </publisher>
37
List the publishers who have published more than 100 books

    <big_publishers>
      {
        for $p in distinct-values(document("bib.xml")//publisher)
        let $b := document("bib.xml")//book[publisher = $p]
        where count($b) > 100
        return $p
      }
    </big_publishers>
38
Invert the structure of the input document so that each distinct
author element contains a list of book-titles

    <author_list>
      {
        let $input := document("bib.xml")
        for $a in distinct-values($input//author)
        return
          <author>
            <name> {$a/text()} </name>
            <books>
              {
                for $b in $input//book
                where $b/author = $a
                return $b/title
              }
            </books>
          </author>
      }
    </author_list>
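
The inversion follows the same two-level pattern in any language: collect the distinct authors, then for each author scan the books again. A small Python sketch under assumed sample data (the document and the name titles_by_author are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one book with two authors, one with a single author.
BIB = """<bib>
  <book><title>First</title><author>Ann</author><author>Bob</author></book>
  <book><title>Second</title><author>Ann</author></book>
</bib>"""

def titles_by_author(bib_root):
    # distinct-values(...//author): distinct names, in order of first appearance
    authors = list(dict.fromkeys(a.text for a in bib_root.iter("author")))
    inverted = {}
    for name in authors:                                  # outer for
        inverted[name] = [
            book.findtext("title")
            for book in bib_root.iter("book")             # nested for
            if name in [a.text for a in book.findall("author")]  # where
        ]
    return inverted

by_author = titles_by_author(ET.fromstring(BIB))
print(by_author)
```

The membership test mirrors the comparison in the where clause, which succeeds if any author of the book matches.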
39
Make an alphabetic list of publishers; within each publisher, make a
list of books (title and price), in descending order by price

    for $p in distinct-values(document("bib.xml")//publisher)
    return
      <publisher>
        <name> {$p/text()} </name>
        {
          for $b in document("bib.xml")//book[publisher = $p]
          return
            <book>
              {$b/title}
              {$b/price}
            </book> sortby(price descending)
        }
      </publisher> sortby(name)
40
Operators in expressions
  • Expressions can be constructed using infix and
    prefix operators; nested expressions inside
    parentheses can serve as operands
  • arithmetic and logical operators; collection
    operators (union, intersect, except)

41
Queries on sequence
  • XQuery uses the precedes and follows operators to
    express conditions based on sequence
  • the following example involves a surgical report
    that contains procedure, incision and anesthesia
    elements
  • the query returns a critical sequence that
    contains all elements and nodes found between the
    1st and 2nd incisions of the 1st procedure

42
Queries on sequence
    <critical-sequence>
      {
        let $proc := //procedure[1]
        for $n in $proc//node()
        where $n follows ($proc//incision)[1]
          and $n precedes ($proc//incision)[2]
        return $n
      }
    </critical-sequence>
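
One way to picture follows and precedes is as comparisons of positions in document order. The Python sketch below makes two simplifying assumptions: the sample report is invented, and ElementTree only exposes element nodes, so text nodes are not returned as separate items.

```python
import xml.etree.ElementTree as ET

# Hypothetical surgical report; element names other than procedure,
# incision and anesthesia are made up.
REPORT = """<report>
  <procedure>
    <anesthesia>local</anesthesia>
    <incision>first</incision>
    <instrument>clamp</instrument>
    <observation>stable</observation>
    <incision>second</incision>
    <instrument>suture</instrument>
  </procedure>
</report>"""

def between_first_two_incisions(procedure):
    order = list(procedure.iter())          # elements in document order
    cuts = [i for i, e in enumerate(order) if e.tag == "incision"]
    if len(cuts) < 2:
        return []
    # after the 1st incision and before the 2nd one
    return order[cuts[0] + 1:cuts[1]]

first_procedure = ET.fromstring(REPORT).find("procedure")
critical = [e.tag for e in between_first_two_incisions(first_procedure)]
print(critical)
```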
43
Conditional expressions
  • if-then-else
  • conditional expressions can be nested and used
    wherever a value is expected
  • assume a library has many holdings (element
    <holding> with a type attribute that identifies
    its type, e.g. book or journal). All holdings
    have a title and other nested elements that
    depend on the type of holding

44
Make a list of holdings, ordered by title. For journals, include the
editor, and for all others, include the author

    for $h in //holding
    return
      <holding>
        {$h/title,
         if ($h/@type = "Journal")
         then $h/editor
         else $h/author}
      </holding> sortby(title)
45
Quantifiers
  • It may be necessary to test for existence of some
    element that satisfies a condition, or to
    determine whether all elements in some collection
    satisfy a condition
  • -> existential and universal quantifiers

46
Find titles of books in which both sailing and windsurfing are
mentioned in the same paragraph

    for $b in //book
    where some $p in $b//para satisfies
      contains($p/text(), "sailing")
      and contains($p/text(), "windsurfing")
    return $b/title
47
Find titles of books in which sailing is mentioned in every paragraph

    for $b in //book
    where every $p in $b//para satisfies
      contains($p/text(), "sailing")
    return $b/title
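
The two quantifiers correspond closely to Python's any and all. The sketch below uses invented sample books; titles_some and titles_every are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
BOOKS = """<books>
  <book><title>Wind and Water</title>
    <para>sailing and windsurfing compared</para>
    <para>more about sailing</para></book>
  <book><title>Harbour Notes</title>
    <para>sailing basics</para>
    <para>windsurfing basics</para></book>
  <book><title>Calm Seas</title>
    <para>sailing at dawn</para>
    <para>sailing at dusk</para></book>
</books>"""

def para_text(para):
    return "".join(para.itertext())

def titles_some(root, *words):
    """some $p satisfies: at least one paragraph contains all the words."""
    return [b.findtext("title") for b in root.iter("book")
            if any(all(w in para_text(p) for w in words) for p in b.iter("para"))]

def titles_every(root, word):
    """every $p satisfies: all paragraphs contain the word."""
    return [b.findtext("title") for b in root.iter("book")
            if all(word in para_text(p) for p in b.iter("para"))]

root = ET.fromstring(BOOKS)
both = titles_some(root, "sailing", "windsurfing")
everywhere = titles_every(root, "sailing")
print(both, everywhere)
```

Note that all() over an empty collection is true, so in this sketch a book without paragraphs would satisfy the universal condition.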
48
Filtering
  • Function filter (in the XQuery core function library)
  • one parameter: an expression that evaluates to an
    ordered forest of nodes
  • filter returns copies of some of the nodes in the
    original document
  • order and hierarchy are preserved
  • nodes that are copied: nodes that are present at
    any level in the original document and are also
    top-level nodes of the forest returned by the
    parameter

49
Action of filter on a hierarchy
  • filter($C//(A | B))

50
Prepare a table of contents for the document cookbook.xml, containing
nested sections and their titles

    let $b := document("cookbook.xml")
    return
      <toc>
        { filter($b//(section | section/title | section/title/text())) }
      </toc>
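
The pruning behaviour of filter can be approximated with a recursive walk: each selected node is copied and attached to the copy of its nearest selected ancestor, and unselected nodes are skipped. This Python sketch is an approximation only. The cookbook content and all function names are made up, and because ElementTree has no separate text nodes, selecting section/title/text() is imitated by a second predicate that says when to copy an element's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical cookbook document.
COOKBOOK = (
    "<cookbook><title>Recipes</title>"
    "<section><title>Soups</title><p>Intro text</p>"
    "<section><title>Cold soups</title><p>More text</p></section>"
    "</section>"
    "<section><title>Desserts</title></section>"
    "</cookbook>"
)

def filter_forest(root, keep, keep_text):
    """Copy the elements selected by keep, preserving order and hierarchy."""
    forest = []

    def visit(elem, parent, nearest_copy):
        target = nearest_copy
        if keep(elem, parent):
            copy = ET.Element(elem.tag, dict(elem.attrib))
            if keep_text(elem, parent):
                copy.text = elem.text
            if nearest_copy is None:
                forest.append(copy)         # becomes a top-level node
            else:
                nearest_copy.append(copy)   # nested under nearest kept ancestor
            target = copy
        for child in elem:
            visit(child, elem, target)

    visit(root, None, None)
    return forest

def is_section_title(elem, parent):
    return elem.tag == "title" and parent is not None and parent.tag == "section"

def is_toc_node(elem, parent):
    return elem.tag == "section" or is_section_title(elem, parent)

toc = ET.Element("toc")
toc.extend(filter_forest(ET.fromstring(COOKBOOK), is_toc_node, is_section_title))
toc_xml = ET.tostring(toc, encoding="unicode")
print(toc_xml)
```

The cookbook's own title and the paragraphs are dropped, while the nesting of sections is kept.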
51
Other built-in functions
  • A core library of built-in functions
  • document: returns the root node of a named
    document
  • all functions of the XPath core function library
  • all the aggregation functions of SQL:
    avg, sum, count, max, min
  • distinct-values: eliminates duplicates from a
    list
  • empty: returns true if and only if its argument
    is an empty list

52
User-defined functions
  • Users are allowed to define their own functions
  • each function definition must:
  • declare the datatypes of its parameters and
    result
  • provide an expression that defines how the result
    of the function is computed from its parameters
  • when a function is invoked, its arguments must be
    valid instances of the declared parameter types
  • the result must also be a valid instance of its
    declared type

53
Functions
  • Example: assume a purchase order bound to
    variable $po1

    <complexType name="USAddress">
    <element name="shipTo" type="po:USAddress"/>
    <element name="billTo" type="po:USAddress"/>

    define function timezone(element of type po:USAddress $a)
      returns integer
      ...

    Call:

    timezone($po1/shipTo) - timezone($po1/billTo)
54
Querying relational data
  • A lot of data is stored in relational databases
  • an XML query language should be able to access
    this data
  • Example: suppliers and parts
  • Table S: supplier numbers (sno) and names (sname)
  • Table P: part numbers (pno) and descriptions
    (descrip)
  • Table SP: relationships between suppliers and the
    parts they supply, including the price (price) of
    each part from each supplier

55
One possible XML representation of relational data
56
SQL vs. XQuery
  • SQL

    SELECT pno
    FROM p
    WHERE descrip LIKE '%Gear%'
    ORDER BY pno

  • XQuery

    for $p in document("p.xml")//p_tuple
    where contains($p/descrip, "Gear")
    return $p/pno sortby(.)
57
Grouping
  • Many relational queries involve forming data into
    groups and applying some aggregation function
    such as count or avg to each group
  • in SQL: GROUP BY and HAVING clauses
  • Example: Find the part number and average price
    for parts that have at least 3 suppliers

58
Grouping: SQL

    SELECT pno, avg(price) AS avgprice
    FROM sp
    GROUP BY pno
    HAVING count(*) >= 3
    ORDER BY pno
59
Grouping: XQuery

    for $pn in distinct-values(document("sp.xml")//pno)
    let $sp := document("sp.xml")//sp_tuple[pno = $pn]
    where count($sp) >= 3
    return
      <well_supplied_item>
        {$pn}
        <avgprice> {avg($sp/price)} </avgprice>
      </well_supplied_item> sortby(pno)
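
The group-then-filter logic can be sketched in Python with a dictionary of lists. The rows below are invented sample data standing in for the SP table, and well_supplied is an illustrative name. Unlike the XQuery version, which rescans the document for each distinct part number, this sketch groups in a single pass.

```python
from collections import defaultdict

# Hypothetical (sno, pno, price) rows standing in for the SP table.
SP_ROWS = [
    ("s1", "p1", 10.0), ("s2", "p1", 20.0), ("s3", "p1", 30.0),
    ("s1", "p2", 5.0), ("s2", "p2", 7.0),
]

def well_supplied(sp_rows, min_suppliers=3):
    prices = defaultdict(list)
    for _sno, pno, price in sp_rows:          # GROUP BY pno
        prices[pno].append(price)
    return [
        (pno, sum(group) / len(group))        # avg(price)
        for pno, group in sorted(prices.items())   # ORDER BY pno
        if len(group) >= min_suppliers        # HAVING count(*) >= 3
    ]

items = well_supplied(SP_ROWS)
print(items)
```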
60
Joins
  • Joins combine data from multiple sources into a
    single query result
  • Example: Return a flat list of supplier names
    and their part descriptions, in alphabetic order

    for $sp in document("sp.xml")//sp_tuple,
        $p in document("p.xml")//p_tuple[pno = $sp/pno],
        $s in document("s.xml")//s_tuple[sno = $sp/sno]
    return
      <sp_pair>
        {$s/sname, $p/descrip}
      </sp_pair> sortby(sname, descrip)
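
The same three-way join can be sketched in Python with dictionaries keyed on the join columns. All data and names below are invented for illustration; the dictionaries stand in for the S and P tables and the list for SP.

```python
# Hypothetical sample data.
S = {"s1": "Acme", "s2": "Bolt Co"}              # sno -> sname
P = {"p1": "Gear", "p2": "Axle"}                 # pno -> descrip
SP = [("s1", "p1"), ("s1", "p2"), ("s2", "p1")]  # (sno, pno)

def supplier_part_pairs(s, p, sp):
    """Inner join: one (sname, descrip) pair per SP row with matches in S and P."""
    pairs = [(s[sno], p[pno]) for sno, pno in sp if sno in s and pno in p]
    return sorted(pairs)                         # sortby(sname, descrip)

pairs = supplier_part_pairs(S, P, SP)
print(pairs)
```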
61
Example left outer join
  • Return names of all the suppliers in alphabetic
    order, including those that supply no parts;
    inside each supplier element, list the
    descriptions of all the parts it supplies, in
    alphabetic order

    for $s in document("s.xml")//s_tuple
    return
      <supplier>
        {$s/sname,
         for $sp in document("sp.xml")//sp_tuple[sno = $s/sno],
             $p in document("p.xml")//p_tuple[pno = $sp/pno]
         return $p/descrip sortby(.)}
      </supplier> sortby(sname)
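
The outer-join effect comes from the nesting: the outer loop emits every supplier, and the inner query may simply return nothing. A short Python sketch with invented data (parts_by_supplier is an illustrative name):

```python
# Hypothetical sample data; supplier s3 supplies no parts.
S = {"s1": "Acme", "s2": "Bolt Co", "s3": "Zenith"}  # sno -> sname
P = {"p1": "Gear", "p2": "Axle"}                     # pno -> descrip
SP = [("s1", "p1"), ("s1", "p2"), ("s2", "p1")]      # (sno, pno)

def parts_by_supplier(s, p, sp):
    result = []
    for sno, sname in sorted(s.items(), key=lambda item: item[1]):  # sortby(sname)
        descrips = sorted(                                          # sortby(.)
            p[pno] for supplier, pno in sp if supplier == sno and pno in p
        )
        result.append((sname, descrips))   # emitted even when descrips is empty
    return result

suppliers = parts_by_supplier(S, P, SP)
print(suppliers)
```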