Processing of structured documents - PowerPoint PPT Presentation

1
Processing of structured documents
  • Spring 2001
  • Helena Ahonen-Myka

2
Course organization
  • 581290-5 laudatur course, 3 cu
  • lectures (in Finnish)
  • 27.2.-5.4. Tue 12-14, Thu 10-12
  • exceptions: no lectures on 6.3. and 8.3.
  • exercise sessions
  • 6.3.-5.4. Tue 10-12 A318 (in English?), Thu
    12-14 C454 (in Finnish 22.3. at 8-10)
  • course assistant Olli Lahti
  • not obligatory

3
Project work
  • an XML application that is constructed during the
    course
  • a framework is given in the first lecture
  • in connection with the exercises, more
    requirements are given
  • a report has to be returned by 12.4.

4
Requirements
  • Exam (Wed 11.4. at 16-20): 45 points
  • Project: 15 points
  • Exercises: 5 extra points
  • Maximum of points: 60

5
Outline (preliminary)
  • 1. Introduction
  • 2. Descriptions of structure
  • context-free grammars
  • XML DTD, XML Schema
  • 3. Programming interfaces
  • SAX, DOM
  • 4. Querying structured documents
  • XML Query

6
Outline...
  • 5. Transforming structured documents
  • XSL (XSLT, formatting objects)
  • presentation issues
  • 6. Document architectures
  • 7. Metadata: RDF
  • 8. Compressing XML data
  • 9. ...

7
1. Introduction
8
Structured documents
  • Document?
  • A structured representation of (textual)
    information on some medium
  • normally for a human reader
  • messages, manuals, memos, books
  • also to/from/between applications
  • source code, program-generated mail, EDI
    (electronic data interchange)
  • static - dynamic

9
Presentation and structure
  • Presentation informs the human reader about the
    meaning of text and the role of its parts
  • markup indicating the presentation or the
    meaning of different parts of text
  • originally hand-written annotations for the
    typesetter
  • nowadays primarily codes embedded in digital
    documents

10
Markup
  • Procedural markup
  • formatting commands (start boldface, produce an
    empty line, indent 5mm)
  • Descriptive markup
  • indicating the logical structure of text using
    chosen names

11
Structured documents?
  • Generally speaking any text is structured
    (punctuation, words, sentences)
  • but especially descriptively marked-up documents
  • especially if they adhere to a rigorous
    specification of structure.

12
Document
<memo importance="high" date="19990323">
  <from>Paul V. Biron</from>
  <to>Ashok Malhotra</to>
  <subject>Latest draft</subject>
  <body>
    We need to discuss the latest draft <emph>immediately</emph>.
    Either email me at <email>mailto:paul.v.biron@kp.org</email>
    or call <phone>555-9876</phone>
  </body>
</memo>
13
Data
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <street>123 IBM Ave.</street>
    <city>Hawthorne</city>
    <state>NY</state>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>
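
As a rough illustration of markup used as data, the following Python sketch parses a shortened copy of the invoice with the standard-library `xml.etree.ElementTree` module and reads individual fields by name. The variable names are illustrative only and are not part of the course material.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the invoice from the slide.
INVOICE = """
<invoice>
  <orderDate>19990121</orderDate>
  <shipDate>19990125</shipDate>
  <billingAddress>
    <name>Ashok Malhotra</name>
    <city>Hawthorne</city>
    <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
</invoice>
"""

invoice = ET.fromstring(INVOICE)

# Descriptive markup lets a program address parts of the data by name.
order_date = invoice.findtext("orderDate")
city = invoice.findtext("billingAddress/city")
child_names = [child.tag for child in invoice]

print(order_date, city, child_names)
```
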
14
<body>
  <p><b>Order date</b> 19990121</p>
  <p><b>Shipping date</b> 19990125</p>
  <p><b>Address</b></p>
  <table>
    <tr><th>name<th>street<th>city<th>state<th>zip
    <tr><td>Ashok Malhotra <td>123 IBM Ave. <td>Hawthorne <td>NY <td>10532-0000
  </table>
  <p>Phone 555-1234</p>
  <p>Fax 555-4321</p>
</body>
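
The presentation markup on this slide can be generated from the structured invoice data rather than written by hand. The sketch below is one possible way to do that in Python; the `LABELS` mapping and the `to_html` function are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

INVOICE = (
    "<invoice><orderDate>19990121</orderDate>"
    "<shipDate>19990125</shipDate></invoice>"
)

# Hypothetical mapping from element names to human-readable labels.
LABELS = {"orderDate": "Order date", "shipDate": "Shipping date"}

def to_html(xml_text):
    """Render each child of the root element as a labelled paragraph."""
    root = ET.fromstring(xml_text)
    lines = ["<body>"]
    for child in root:
        label = LABELS.get(child.tag, child.tag)
        lines.append(f"<p><b>{escape(label)}</b> {escape(child.text or '')}</p>")
    lines.append("</body>")
    return "\n".join(lines)

html_page = to_html(INVOICE)
print(html_page)
```

A second function with different labels or layout could produce another presentation from the same data, which is the point of keeping structure and presentation apart.
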
15
Theses of structured documenting
  • Separation of structure and presentation
  • markup of structure and other (meta) information
    should be done
  • at creation time
  • for future needs
  • rigor of markup
  • automatization of processing

16
Advantages of structure
  • Better control over documents
  • guidance of writing, validation of structure
  • higher-precision retrieval (conditions for parts)
  • reuse of information
  • automated processing
  • control of uniform style

17
Advantages of structure
  • Transport of documents between different
    environments and applications
  • archival of documents
  • storing in databases
  • multiuse of documents
  • different layout styles
  • paper, online, CD-ROM, pda
  • different versions

18
Disadvantages of structure
  • Start-up costs
  • design of document structures
  • conversion of legacy (non-structured) documents
  • implementation/adaptation of tools, procedures
    and policies
  • attitudes of authors
  • from a producer of a final publication to an
    information-feeding clerk?

19
2. Project work
  • The goal: everyone builds a (non-trivial) XML
    application that can be used during the course to
    practise different concepts and methods
  • Example: I would need a system to track the work
    of my Master's thesis students

20
A wish list
  • I want to store information about my students,
    e.g., name, contact information, scheduled
    meetings and deadlines, comments, problems,
    deals, links to the drafts and the homepages of
    the students, etc.
  • As a primary interface I'd like to have a web
    page (with forms)

21
A wish list: functions
  • I want to add information using the HTML form on
    the web page (easily!)
  • I want to have a listing on the web page of 1)
    all the students 2) information about one student
  • I need also other listings (e.g. simple ASCII)
    for reporting the state of my students (or just a
    list of my current students)

22
And now you...
  • Design an application that is somehow similar
    to mine
  • set of persons (or other objects) with
    information (e.g. your customer contacts)
  • some parts free text
  • several different ways to use the data, e.g.
    several listings (both content and presentation)

23
Requirements
  • More requirements follow later...
  • return a report by 12.4.
  • The report should include
  • (short) requirements analysis
  • descriptions of the structure (DTD, Schema)
  • other designs, architecture, ...
  • Some kind of a working prototype
  • not necessarily the whole system

24
3. Structure descriptions
  • Regular expressions, context-free grammars
  • XML Document type definitions
  • XML Schema

25
Regular expressions
  • A way to describe sets of strings over an alphabet
    (of chars, events, elements)
  • many uses
  • text searching (e.g. emacs, grep, perl)
  • in grammatical formalisms (e.g. XML DTDs)
  • relevant for document structures: what kind of
    structural content is allowed for different
    document components

26
Regular expressions
  • A regular expression over alphabet $\Sigma$ is either
  • $\emptyset$ (an empty set)
  • $\epsilon$ (epsilon; sometimes lambda, $\lambda$)
  • $a$, where $a \in \Sigma$
  • $R \mid S$ (choice; sometimes $R \cup S$)
  • $R S$ (catenation) or
  • $R^*$ (Kleene closure)
  • where $R$ and $S$ are regular expressions

27
Regular expressions
  • Regular expression E denotes a language (a set of
    strings) L(E)
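  • $L(\emptyset) = \emptyset$ (empty set)
  • $L(\epsilon) = \{\epsilon\}$ (singleton set of empty string)
  • $L(a) = \{a\}$ (singleton set of $a \in \Sigma$)
  • $L(R \mid S) = L(R) \cup L(S) = \{w \mid w \in L(R) \text{ or } w \in L(S)\}$
  • $L(RS) = L(R)L(S) = \{xy \mid x \in L(R) \text{ and } y \in L(S)\}$
  • $L(R^*) = L(R)^* = \{x_1 \cdots x_n \mid x_k \in L(R),\ k = 1, \ldots, n;\ n \ge 0\}$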
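
The three operators can be tried out with Python's `re` module, whose syntax stands in here for the notation of the slide. This is only a sketch over the assumed alphabet {a, b}; `in_language` is a helper name invented for the example.

```python
import re

# Alphabet {a, b}; Python regex syntax replaces the slide notation.
choice = re.compile(r"a|b")        # L(R | S) = L(R) ∪ L(S)
catenation = re.compile(r"ab")     # L(RS) = L(R)L(S)
closure = re.compile(r"(?:ab)*")   # L(R*): zero or more strings from L(R)

def in_language(regex, word):
    """Return True if the whole word belongs to the language of regex."""
    return regex.fullmatch(word) is not None

print(in_language(choice, "b"))
print(in_language(catenation, "ab"))
print(in_language(closure, ""))
print(in_language(closure, "aba"))
```

Note the use of `fullmatch`: membership in L(E) concerns the whole string, not a substring.
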

28
Example
  • top-level structure of a document
  • $\Sigma$ = {title, author, date, sect}
  • title followed by an optional list of authors,
    followed by an optional date, followed by one or
    more sections
  • title auth* (date | $\epsilon$) sect sect*
  • common abbreviations
  • E? = (E | $\epsilon$), E+ = E E*
  • -> title auth* date? sect+
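
A minimal sketch of checking a sequence of element names against this content model, assuming each name is encoded as the name followed by a space so that an ordinary character-level regular expression can be used. The function name `matches_model` is made up for the example.

```python
import re

# Content model from the slide: title auth* date? sect+
# Every element name is followed by a space so that names cannot run together.
MODEL = re.compile(r"title (?:auth )*(?:date )?(?:sect )+")

def matches_model(names):
    """Return True if the sequence of element names is in the model's language."""
    encoded = "".join(name + " " for name in names)
    return MODEL.fullmatch(encoded) is not None

print(matches_model(["title", "auth", "auth", "sect"]))
print(matches_model(["title", "date"]))
```
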

29
Context-free grammars
  • Used widely for syntax specification (programming
    languages)
  • $G = (V, \Sigma, P, S)$
  • $V$: the alphabet of the grammar $G$; $V = \Sigma \cup N$
  • $\Sigma$: the set of terminal symbols;
    $N = V - \Sigma$: the set of nonterminal symbols
  • $P$: set of productions
  • $S \in N$: the start symbol

30
Productions and derivations
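  • Productions $A \to \omega$, where $A \in N$, $\omega \in V^*$
  • e.g. $A \to aBa$ (1)
  • Let $\gamma, \delta \in V^*$. String $\gamma$ derives $\delta$ directly,
    $\gamma \Rightarrow \delta$, if
  • $\gamma = \alpha A \beta$, $\delta = \alpha \omega \beta$ for some $\alpha, \beta \in V^*$, and $A \to \omega$
    is a production of the grammar
  • e.g. $AA \Rightarrow AaBa$ (assuming prod. 1 above)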
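
The definition of a direct derivation can be sketched in a few lines of Python. The sketch assumes single-character symbols (upper-case nonterminals, lower-case terminals) and contains only production (1); `direct_derivations` is a name chosen for this illustration.

```python
# Production (1) from the slide: A -> aBa.
# Upper-case letters are nonterminals, lower-case letters are terminals.
PRODUCTIONS = {"A": ["aBa"]}

def direct_derivations(gamma):
    """Return every string that gamma derives in exactly one step."""
    results = set()
    for i, symbol in enumerate(gamma):
        for omega in PRODUCTIONS.get(symbol, []):
            # gamma = alpha A beta  becomes  delta = alpha omega beta
            alpha, beta = gamma[:i], gamma[i + 1:]
            results.add(alpha + omega + beta)
    return results

steps = direct_derivations("AA")
print(sorted(steps))
```

Either occurrence of A may be rewritten, so the string AA has two direct derivations, one of which is the AaBa of the slide.
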

31
Language generated by a context-free grammar
  • $\gamma$ derives $\delta$, $\gamma \Rightarrow^* \delta$, if there is a sequence of
    0 or more direct derivations that transforms $\gamma$ to
    $\delta$
  • The language generated by a CFG $G$:
  • $L(G) = \{w \in \Sigma^* \mid S \Rightarrow^* w\}$
  • $L(G)$ is a set of strings; to model structural
    elements, we consider parse trees
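
To make the definition of L(G) concrete, the following sketch enumerates the short strings generated by a small grammar. The grammar S -> aSb | (empty string) is a hypothetical example that does not appear in the slides, and pruning by the number of terminals is sufficient for this grammar but not for every CFG.

```python
from collections import deque

# Hypothetical grammar, not from the slides: S -> aSb | (empty string).
PRODUCTIONS = {"S": ["aSb", ""]}

def language_up_to(max_len, start="S"):
    """Collect the terminal strings of length <= max_len derivable from start."""
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        nonterminals = [i for i, s in enumerate(form) if s in PRODUCTIONS]
        if not nonterminals:
            words.add(form)          # only terminals left: a word of L(G)
            continue
        i = nonterminals[0]          # expand the leftmost nonterminal
        for omega in PRODUCTIONS[form[i]]:
            new_form = form[:i] + omega + form[i + 1:]
            terminals = sum(1 for s in new_form if s not in PRODUCTIONS)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words

print(sorted(language_up_to(4)))
```
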

32
Parse trees of a CFG
  • Aka syntax trees or derivation trees
  • nodes labelled by symbols of $V$ (or by $\epsilon$)
  • internal nodes by nonterminals, root by start
    symbol
  • leaves using terminal symbols (or $\epsilon$)
  • parent with label $A$ can have children labeled by
    $X_1, \ldots, X_k$ only if $A \to X_1 \cdots X_k$ is a production

33
CFGs for document structures
  • Nonterminals represent document structures
  • e.g. Ref -> AuthorList Title PublData;
    AuthorList -> Author AuthorList; AuthorList -> $\epsilon$
  • problem
  • obscures the relation of elements (the last
    Author several hierarchical levels away from Ref)
    -> solution: extended CFGs

34
Extended CFGs (ECFGs)
  • Like CFGs, but right-hand sides of productions
    are regular expressions over $V$, e.g. Ref ->
    Author* Title PublData
  • Let $\gamma, \delta \in V^*$. String $\gamma$ derives $\delta$ directly,
    $\gamma \Rightarrow \delta$, if
  • $\gamma = \alpha A \beta$, $\delta = \alpha \omega \beta$ for some $\alpha, \beta \in V^*$, and $A \to E$
    is a production such that $\omega \in L(E)$
  • e.g. Ref => Author Author Author Title PublData

35
Language generated by an ECFG
  • Defined similarly to CFGs
  • Theorem: Languages generated by extended and
    ordinary CFGs are the same

36
Parse trees of an ECFG
  • Similar to parse trees of an ordinary CFG, except
    that
  • parent with label $A$ can have children labeled by
    $X_1, \ldots, X_k$ when $A \to E$ is a production such that
    $X_1 \cdots X_k \in L(E)$
  • -> an internal node may have arbitrarily many
    children (e.g. Authors below a Ref node)
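
A small sketch of this condition, assuming trees are written as `(label, children)` pairs and the right-hand side of the Ref production is hand-translated into a Python regular expression over space-terminated names. The representation and the name `is_parse_tree` are choices made only for this example.

```python
import re

# ECFG production from the slides: Ref -> Author* Title PublData.
# The right-hand side is a regular expression over space-terminated names.
PRODUCTIONS = {"Ref": re.compile(r"(?:Author )*Title PublData ")}

def is_parse_tree(node):
    """node is (label, children); labels without a production act as leaves."""
    label, children = node
    if label not in PRODUCTIONS:
        return not children
    child_labels = "".join(child[0] + " " for child in children)
    if PRODUCTIONS[label].fullmatch(child_labels) is None:
        return False
    return all(is_parse_tree(child) for child in children)

good = ("Ref", [("Author", []), ("Author", []), ("Author", []),
                ("Title", []), ("PublData", [])])
bad = ("Ref", [("Title", []), ("Author", []), ("PublData", [])])
print(is_parse_tree(good), is_parse_tree(bad))
```

All three Author nodes are direct children of Ref, which is exactly what the ordinary CFG with the recursive AuthorList could not express.
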

37
What is XML?
  • W3C Recommendation Feb 1998
  • metalanguage that can be used to define markup
    languages
  • gives syntax for defining extended context-free
    grammars
  • XML documents that adhere to the ECFG are strings
    in the language
  • document types (grammars) - document instances
    (strings in the language)

38
XML encoding of structure
  • XML document essentially a parenthesized linear
    encoding of a parse tree
  • corresponds to a preorder walk
  • start of inner node (element) A denoted by a
    start tag <A>, end denoted by end tag </A>
  • leaves are strings (or empty elements)
  • certain extensions (especially attributes)
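
The encoding can be sketched as a recursive preorder walk. The tree representation below (`(label, children)` pairs with plain strings as text leaves) and the function name `encode` are assumptions made for the illustration; attributes are left out.

```python
from xml.sax.saxutils import escape

def encode(node):
    """Serialise a tree in preorder; string leaves become character data."""
    if isinstance(node, str):
        return escape(node)              # protect &, < and > in text
    label, children = node
    if not children:
        return f"<{label}/>"             # empty element
    inner = "".join(encode(child) for child in children)
    return f"<{label}>{inner}</{label}>"

tree = ("memo", [("from", ["Paul V. Biron"]), ("emph", ["now"]), ("br", [])])
encoded = encode(tree)
print(encoded)
```

The start tag is written when a node is entered and the end tag when all of its children have been visited, so properly nested tags fall out of the recursion.
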

39
Terminal symbols in practice
  • Leaves of parse trees are labeled by single
    characters (symbols of $\Sigma$)
  • too granular in practice; instead, terminal
    symbols which stand for all values of a type
  • e.g. #PCDATA in XML for variable-length content
    of data characters
  • richer data types in proposed XML schema
    formalisms

40
XML logical structure
  • Elements
  • correspond to internal nodes of the parse tree
  • unique root element -> document is a single parse
    tree
  • indicated by matching (case-sensitive!) tags
    <ElementTypeName> ... </ElementTypeName>
  • can contain text and/or subelements
  • can be empty
  • <elem-type></elem-type>
  • <br />

41
Logical structure
  • Attributes
  • name-value pairs attached to elements
  • metadata, usually not treated as content
  • e.g. <div class="preface" date="990126">
  • also
  • <!-- comments -->
  • <?note this text would be passed to the
    application as a processing instruction named
    note?>
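
In a typical programming interface, attributes appear as a name-value mapping attached to the element, separate from its content. A brief sketch with Python's `xml.etree.ElementTree`; the element content "Foreword" is invented for the example.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div class="preface" date="990126">Foreword</div>')

# Attributes are exposed as a dictionary of name-value pairs.
attributes = dict(div.attrib)
content = div.text

print(attributes, content)
```
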

42
Document type declaration
  • Provides a grammar (document type definition,
    DTD) for a class of documents
  • syntax
  • <!DOCTYPE root-type-name SYSTEM "ex.dtd"
        <!-- external subset in file ex.dtd -->
      [ <!-- internal subset may come here --> ]>
  • external and internal subset make up the DTD;
    internal has higher precedence

43
XML declaration
  • <?xml version="1.0" encoding="UTF-8"
    standalone="yes" ?>

44
Defining the structure: DTD
  • document type definition (DTD)
  • content model for each element
  • describes how the elements are formed from the
    other elements and text
  • defines which attributes an element may/must
    have, and their default values
  • content models are regular expressions

45
Markup declarations
  • Element type declarations (similar to productions
    of ECFGs)
  • attribute-list declarations (for declared element
    types)
  • entity declarations
  • notation declarations

46
Element type declarations
  • The general form is
  • <!ELEMENT elem-type-name (E)>
  • where E is a content model: a
    regular expression over element names

47
Regular expression syntax
  • + 1 or more
  • * 0 or more
  • ? 0 or 1
  • | choice (one has to be chosen)
  • () grouping
  • , order

48
Examples of definitions
  • <!ELEMENT name (fname, lname)>
  • <!ELEMENT address (name, street, ((city, state,
    zipcode) | (zipcode, city)))>
  • <!ELEMENT contact
    (address, phone, email?)>
  • <!ELEMENT contact2 (address |
    phone | email)>
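
Because content models are regular expressions, an element-only content model can be translated mechanically into an ordinary regular expression and used to check a list of child element names. The sketch below assumes models without #PCDATA; `model_to_regex` and `children_match` are hypothetical helper names.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model):
    """Translate an element-only DTD content model into a Python regex."""
    compact = re.sub(r"\s+", "", model)
    # Each name becomes a group matching the name plus a separating space.
    named = NAME.sub(lambda m: f"(?:{m.group(0)} )", compact)
    # ',' means sequence, i.e. plain concatenation; ? * + | ( ) carry over.
    return named.replace(",", "")

def children_match(model, names):
    """Return True if the child names satisfy the content model."""
    encoded = "".join(name + " " for name in names)
    return re.fullmatch(model_to_regex(model), encoded) is not None

print(children_match("(address, phone, email?)", ["address", "phone"]))
print(children_match("(address | phone | email)", ["address", "phone"]))
```
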

49
DTD for the Invoice example
<!DOCTYPE invoice [
  <!ELEMENT invoice (orderDate, shipDate, billingAddress, voice, fax?)>
  <!ELEMENT orderDate (#PCDATA)>
  <!ELEMENT shipDate (#PCDATA)>
  <!ELEMENT billingAddress (name, street, city, state, zip)>
  <!ELEMENT voice (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
]>
50
Attribute-list declarations
  • Name, data type and possible default value for
    each attribute for a given element type
  • Example
  • <!ATTLIST FIG
  •     id    ID          #IMPLIED
  •     descr CDATA       #REQUIRED
  •     class (a | b | c) "a">
  • semantics mainly up to the application

51
Mixed, empty and arbitrary content
  • Mixed content
  • <!ELEMENT P (#PCDATA | I | IMG)*>
  • may contain text (#PCDATA) and elements
  • Empty content
  • <!ELEMENT IMG EMPTY>
  • Arbitrary content
  • <!ELEMENT X ANY>
  • <!ELEMENT X (#PCDATA | choice-of-all-declared-element-types)*>
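
Mixed content shows up in programming interfaces as text interleaved with child elements. In Python's `xml.etree.ElementTree`, the text before the first child is stored in `text` and the text following a child in that child's `tail`. The sample paragraph and the helper `items` are invented for this sketch.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<P>Some <I>mixed</I> content<IMG/> here</P>")

def items(elem):
    """List the text runs and child elements of elem in document order."""
    result = []
    if elem.text:
        result.append(("text", elem.text))
    for child in elem:
        result.append(("element", child.tag))
        if child.tail:
            result.append(("text", child.tail))
    return result

content = items(p)
print(content)
```
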

52
Entities
  • Character entities, e.g. &lt;
  • amp, lt, gt, apos, quot are built-in
  • general entities are shorthand notations:
    <!ENTITY HY "University of Helsinki">
  • physical storage units comprising a document
  • parsed entities
  • <!ENTITY chapt1 SYSTEM "chapter1.xml">
  • elements in entities must nest properly
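
A short sketch of entity expansion as seen by an application, using the HY entity of the slide declared in an internal subset. It relies on the expat-based parser behind Python's `xml.etree.ElementTree`, which replaces references to internal general entities and to the built-in entities with their text; the `note` element is made up for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE note [
  <!ENTITY HY "University of Helsinki">
]>
<note>&HY; &lt; &amp;</note>"""

note = ET.fromstring(DOC)

# The application receives replacement text, not the entity references.
print(note.text)
```
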

53
Unparsed entities
  • External (binary) files
  • declarations
  • <!NOTATION TIFF SYSTEM "bin/xv">
  • <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA
    TIFF>
  • <!ATTLIST IMG file ENTITY #REQUIRED>
  • usage
  • <IMG file="fig123"/>

54
Parameter entities
  • A way to parameterize and modularize DTDs
  • <!ENTITY % stattr "status (draft | ready) 'draft'">
  • <!ATTLIST chap %stattr;>
  • <!ATTLIST sect %stattr;>

55
Note
  • elements cannot overlap
  • container elements must have end tags
  • empty elements: <br />
  • all names are case-sensitive
  • attribute values must be delimited by quotation
    marks

56
XML processing model
  • A processor (parser)
  • reads XML documents
  • passes data to an application
  • XML Specification tells how to read, what to pass

57
XML Information set
  • An XML document's information set consists of a
    number of information items
  • an information item is an abstract representation
    of some part of an XML document
  • each information item has a set of associated
    properties

58
XML Information set
  • Tree structure provided by the processor (no
    special interface is specified)
  • e.g. entities expanded to their replacement text,
    attributes with their default values
  • properties e.g. for each element its child
    elements and attributes

59
Namespaces
  • An XML document may contain multiple markup
    vocabularies
  • reuse of existing markup, e.g. including HTML
    markup in some document type
  • An XML namespace is a collection of names,
    identified by a URI reference, which are used in
    XML documents as element types and attribute names

60
Namespace prefix declaration and use
  • <x xmlns:edi="http://ecommerce.org/schema">
  • <edi:price units="Euro">32.18</edi:price>
  • <lineItem edi:taxClass="exempt">Baby
    food</lineItem>
  • </x>
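
A namespace-aware processor resolves each prefix to its URI, so an application works with (URI, local name) pairs rather than with prefixes. A sketch with Python's `xml.etree.ElementTree`, which writes such names as `{uri}local`; the prefix map `ns` is local to this example.

```python
import xml.etree.ElementTree as ET

DOC = """<x xmlns:edi="http://ecommerce.org/schema">
  <edi:price units="Euro">32.18</edi:price>
  <lineItem edi:taxClass="exempt">Baby food</lineItem>
</x>"""

root = ET.fromstring(DOC)

# Prefixes used in queries are mapped to URIs by the caller.
ns = {"edi": "http://ecommerce.org/schema"}
price = root.find("edi:price", ns)
line_item = root.find("lineItem")
tax_class = line_item.get("{http://ecommerce.org/schema}taxClass")

print(price.tag, price.text, tax_class)
```
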

61
XML Schema
  • DTDs have drawbacks
  • They can only define the element structure and
    attributes
  • They cannot define any database-like constraints
    for elements
  • Value (min, max, etc.)
  • Type (integer, string, etc.)
  • DTDs are not written in XML and cannot thus be
    processed with the same tools as XML documents,
    XSL(T), etc.
  • XML Schema
  • Is written in XML
  • Avoids most of the DTD drawbacks

62
XML Schema
  • XML Schema Part 1: Structures
  • Element structure definition as with DTD
    Elements, attributes, also enhanced ways to
    control structures
  • XML Schema Part 2: Datatypes
  • Primitive datatypes (string, boolean, float,
    etc.)
  • Derived datatypes from primitive datatypes (time,
    recurringDate)
  • Constraining facets for each datatype (minLength,
    maxLength, pattern, precision, etc.)
  • Information about Schemas
  • http://www.w3c.org/XML/Schema/

63
Complex and simple types
  • complex types allow elements in their content
    and may have attributes
  • simple types cannot have element content and
    cannot have attributes

64
Reminder: DTD declarations
  • <!ELEMENT name (fname, lname)>
  • <!ELEMENT address (name, street, ((city, state,
    zipcode) | (zipcode, city)))>
  • <!ELEMENT contact
    (address, phone, email?)>
  • <!ELEMENT fname (#PCDATA)>

65
Example: USAddress type
<xsd:complexType name="USAddress">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city" type="xsd:string"/>
    <xsd:element name="state" type="xsd:string"/>
    <xsd:element name="zip" type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN"
                 use="fixed" value="US"/>
</xsd:complexType>
66
Example: PurchaseOrderType
<xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
    <xsd:element name="shipTo" type="USAddress"/>
    <xsd:element name="billTo" type="USAddress"/>
    <xsd:element ref="comment" minOccurs="0"/>
    <xsd:element name="items" type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
67
Notes
  • element declarations for shipTo and billTo
    associate different element names with the same
    complex type
  • attribute declarations must reference simple
    types
  • element comment declared elsewhere in the schema
    (here reference only)

68
continues
  • element is optional, if minOccurs = 0
  • maximum number of times an element may appear:
    maxOccurs
  • attributes may appear once or not at all
  • use attribute is used in an attribute declaration
    to indicate whether the attribute is required or
    optional, and if optional, whether the value is
    fixed or whether there is a default

69
More examples
<items>
  <item partNum="872-AA">
    <productName>Lawnmower</productName>
    <quantity>1</quantity>
    <price>148.95</price>
    <comment>Confirm this is electric</comment>
  </item>
  <item partNum="926-AA">
    <productName>Baby Monitor</productName>
    <quantity>1</quantity>
    <price>39.98</price>
    <shipDate>1999-05-21</shipDate>
  </item>
</items>
70
<xsd:complexType name="Items">
  <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
    <xsd:complexType>
      <xsd:element name="quantity">
        <xsd:simpleType base="xsd:positiveInteger">
          <xsd:maxExclusive value="100"/>
        </xsd:simpleType>
      </xsd:element>
      <xsd:element name="price" type="xsd:decimal"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
      <xsd:attribute name="partNum" type="Sku"/>
    </xsd:complexType>
  </xsd:element>
</xsd:complexType>
<xsd:simpleType name="Sku">
  <xsd:pattern value="\d{3}-[A-Z]{2}"/>
</xsd:simpleType>
71
Patterns
<xsd:simpleType name="Sku">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
</xsd:simpleType>
  • three digits followed by a hyphen followed by
    two upper-case ASCII letters
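
The Sku pattern (three digits, a hyphen, two upper-case letters) can be tried out with Python's `re` module, whose syntax for this particular expression coincides with that of XML Schema. A schema pattern must match the whole value, which corresponds to `fullmatch` here; `is_sku` is a name chosen for the sketch.

```python
import re

# Same expression as in the pattern facet: \d{3}-[A-Z]{2}
SKU = re.compile(r"\d{3}-[A-Z]{2}")

def is_sku(value):
    """Return True if the whole value matches the Sku pattern."""
    return SKU.fullmatch(value) is not None

print(is_sku("872-AA"), is_sku("872-aa"))
```
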

72
Building content models
  • <xsd:sequence>: fixed order
  • <xsd:choice>: (1) choice of alternatives
  • <xsd:group>: grouping (also named)
  • <xsd:all>: no order specified

73
Well-formed XML documents
  • documents that adhere to the formal requirements
    (syntax) of the XML specification
  • if a document is not well-formed, it is not an
    XML document (and the XML tools do not have to
    process it)
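
Well-formedness can be tested simply by attempting to parse. The sketch below uses Python's `xml.etree.ElementTree`, whose parser raises `ParseError` on input that is not well-formed; the function name `is_well_formed` is an assumption of the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<a><b/></a>"))      # properly nested
print(is_well_formed("<a><b></a></b>"))   # overlapping elements
```
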

74
Valid documents
  • a document is a valid XML-document, if it is
    well-formed and adheres to the structure defined
    in the DTD given
  • XML-processor can be validating or non-validating
  • sometimes validity is important, sometimes not
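
The parsers in Python's standard library are non-validating, so the following is only a hand-written sketch of what a validity check involves, not a DTD validator. It assumes the invoice content models (orderDate, shipDate, billingAddress, voice, optional fax) translated by hand into regular expressions over space-terminated child names; `MODELS` and `is_valid` are names invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models for part of the invoice DTD.
# Elements that are not listed are assumed to contain text only.
MODELS = {
    "invoice": r"orderDate shipDate billingAddress voice (?:fax )?",
    "billingAddress": r"name street city state zip ",
}

def is_valid(elem):
    """Check the children of elem, and recursively of its descendants."""
    children = "".join(child.tag + " " for child in elem)
    model = MODELS.get(elem.tag, "")   # "" allows no child elements
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)

GOOD = ("<invoice><orderDate>19990121</orderDate><shipDate>19990125</shipDate>"
        "<billingAddress><name>A</name><street>S</street><city>C</city>"
        "<state>NY</state><zip>1</zip></billingAddress>"
        "<voice>555-1234</voice></invoice>")
BAD = ("<invoice><shipDate>19990125</shipDate>"
       "<orderDate>19990121</orderDate></invoice>")

print(is_valid(ET.fromstring(GOOD)), is_valid(ET.fromstring(BAD)))
```

Both sample documents are well-formed; only the first one also follows the declared structure.
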