Effective XML - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Effective XML


1
Effective XML
  • Elliotte Rusty Harold
  • elharo@metalab.unc.edu
  • http://www.cafeconleche.org/

2
Part I: Syntax
3
Stay with XML 1.0
  • XML 1.1:
    • New name characters
    • C0 control characters
    • C1 control characters
    • NEL
    • Undeclare namespace prefixes
  • Incompatible with:
    • Most XML parsers
    • W3C and RELAX NG schema languages
    • XOM, JDOM

4
Part II: Structure
5
The XML Stack
6
Allow All XML syntax
  • CDATA sections
  • Entity references
  • Processing instructions
  • Comments
  • Numeric character references
  • Document type declarations
  • These are different ways of representing the same
    core content, not different information

7
Distinguish text from markup
  • A DocBook element:
    <programlisting><![CDATA[<value>
    <double>28657</double></value>]]></programlisting>
  • The content is: <value> <double>28657</double></value>
  • This is the same:
    <programlisting>&lt;value&gt;
    &lt;double&gt;28657&lt;/double&gt;
    &lt;/value&gt;</programlisting>
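As a minimal sketch of this point, the following Java program parses a CDATA version and an escaped version of the same listing with the JDK's built-in DOM parser and compares the text each one reports. The class and method names are illustrative only and are not from the presentation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: compares the text content of two spellings of one listing.
class TextVersusMarkup {

    // Parse a small document held in a string and return the root element's text.
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String cdata = "<programlisting><![CDATA[<value><double>28657</double></value>]]>"
                     + "</programlisting>";
        String escaped = "<programlisting>&lt;value&gt;&lt;double&gt;28657&lt;/double&gt;"
                       + "&lt;/value&gt;</programlisting>";

        // Both spellings carry the same characters; neither contains a child element.
        System.out.println(rootText(cdata));
        System.out.println(rootText(cdata).equals(rootText(escaped)));
    }
}
```

Code that depends on whether the author chose a CDATA section or character escapes is depending on syntax sugar rather than on content.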

8
The reverse problem
  • Tools that create XML from strings
  • Tree-based editors like <Oxygen/> or XML Spy
  • WYSIWYG applications like OpenOffice Writer
  • Programming APIs such as DOM, JDOM, and XOM
  • The tool automatically escapes reserved
    characters like <, >, or &.
  • Just because something looks like an XML tag does
    not mean it is an XML tag.
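The escaping behavior can be sketched with the JDK's DOM and identity-transform APIs. The element name `para` and the class name are assumptions made for illustration; the point is only that a string handed to `createTextNode` is treated as text, so anything that looks like a tag is escaped on output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: builds <para>text</para> through DOM and serializes it.
class AutomaticEscaping {

    static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element para = doc.createElement("para");
        // The string becomes a text node, not markup, whatever it looks like.
        para.appendChild(doc.createTextNode(text));
        doc.appendChild(para);

        // An identity transform is the standard JAXP way to write a DOM tree out.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The < and & in the input are written as &lt; and &amp; in the output.
        System.out.println(serialize("<b>bold</b> & more"));
    }
}
```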

9
White space matters
  • Parsers report all white space in element
    content, including boundary white space
  • An xml:space attribute is for the client
    application only, not the parser
  • White space in attribute values is normalized
  • Parsers do not report white space in the prolog,
    epilog, the document type declaration, and tags.
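A small sketch of the first bullet, assuming the JDK's default DOM parser and a document with no DTD: indentation between child elements shows up as extra text nodes. The `list` and `item` element names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: counts the child nodes of the root element.
class BoundaryWhiteSpace {

    static int childCount(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        String indented = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>";
        String compact = "<list><item>a</item><item>b</item></list>";

        // Indented form: text, item, text, item, text.
        System.out.println(childCount(indented)); // prints 5
        // Compact form: just the two item elements.
        System.out.println(childCount(compact));  // prints 2
    }
}
```

Code that assumes `getFirstChild()` is the first element will therefore break as soon as someone pretty-prints the document.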

10
Make structure explicit through markup
  • Bad:
    <Transaction>Withdrawal 2003 12 15 200.00</Transaction>
  • Better:
    <Transaction type="withdrawal">
      <Date>2003-12-15</Date>
      <Amount>200.00</Amount>
    </Transaction>

11
Store metadata in attributes
  • Material the reader doesn't want to see
  • URLs
  • IDs
  • Styles
  • Revision dates
  • Author's name
  • No substructure
  • Revision tracking
  • Citations
  • No multiple elements

12
Remember mixed content
  • Narrative documents
  • Record-like documents
  • The RSS problem:
    <item>
      <title>Xerlin 1.3 released</title>
      <description>
        Xerlin 1.3, an open source XML Editor written in
        Java, has been released. Users can extend the
        application via custom editor interfaces for
        specific DTDs. New features in version 1.3 include
        XML Schema support, WebDAV capabilities, and
        various user interface enhancements. Java 1.2
        or later is required.
      </description>
      <link>http://www.cafeconleche.org/news2003April7</link>
    </item>

13
What you really want is this
<description>
  <p><a href="http://www.xerlin.org"><strong>Xerlin 1.3</strong></a>, an open
  source XML Editor written in Java, has been released. Users can extend the
  application via custom editor interfaces for specific DTDs. New features in
  version 1.3 include:</p>
  <ul>
    <li>XML Schema support</li>
    <li>WebDAV capabilities</li>
    <li>Various user interface enhancements</li>
  </ul>
  <p>Java 1.2 or later is required.</p>
</description>
14
What people do is this
<description>&lt;p&gt;&lt;a href="http://www.xerlin.org"&gt;&lt;strong&gt;Xerlin 1.3&lt;/strong&gt;&lt;/a&gt;, an
open source XML Editor written in Java, has been
released. Users can extend the application via
custom editor interfaces for specific DTDs. New
features in version 1.3 include:&lt;/p&gt;
&lt;ul&gt; &lt;li&gt;XML Schema support&lt;/li&gt;
&lt;li&gt;WebDAV capabilities&lt;/li&gt;
&lt;li&gt;Various user interface enhancements&lt;/li&gt;
&lt;/ul&gt; &lt;p&gt;Java 1.2 or later is
required.&lt;/p&gt; </description>
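To see what is lost when markup is escaped into a description, here is a rough sketch that counts the child elements a parser finds in each form. It uses shortened versions of the Xerlin item; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: counts element children of the root element.
class EscapedMarkup {

    static int childElementCount(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String real = "<description><p><strong>Xerlin 1.3</strong> has been released.</p>"
                    + "</description>";
        String escaped = "<description>&lt;p&gt;&lt;strong&gt;Xerlin 1.3&lt;/strong&gt; "
                       + "has been released.&lt;/p&gt;</description>";

        System.out.println(childElementCount(real));    // prints 1: the p element
        System.out.println(childElementCount(escaped)); // prints 0: only a string
    }
}
```

With the escaped form, XPath, XSLT, and schema validation all see one opaque string, so a consumer has to run a second parse over the text to recover the paragraph and list structure.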
15
Prefer URLs to unparsed entities and notations
  • URLs are simple and well understood
  • Notations and unparsed entities are confusing and
    little used
  • URLs don't require the DTD to be read
  • Many APIs don't even support notations and
    unparsed entities

16
Part III: Semantics
17
Use processing instructions for process-specific
content
  • For a very particular, even local, process
  • Describes how a particular process acts on the
    data in the document
  • Does not describe or add to the content itself
  • A unit that can be treated in isolation
  • Content is not XML-like.
  • Applies to the entire document

18
Processing instructions are not appropriate when
  • Content is closely related to the content of the
    document itself
  • Structure extends beyond a single processing
    instruction
  • Needs to be validated

19
Include all information in instance documents
  • Not all parsers read the DTD
  • Especially browsers
  • Beware
  • Default attribute values
  • Parsed entity references
  • XInclude
  • ID type dependence (XPath, DOM, etc.)

20
Encode binary data using quoted printable and/or
Base64
  • Quoted printable works well for mostly text
  • Base-64 for non-text data
  • Can you link to the data with a URL instead?
  • Can you bundle the data with XML using zip, jar,
    XOP, or MIME?
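For the Base64 case, a minimal sketch using `java.util.Base64` from the JDK might look like this. The `data` element and its `encoding` attribute are invented for the example and are not a standard vocabulary.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: round-trips raw bytes through an XML element.
class BinaryInXml {

    // Base64 output uses only letters, digits, '+', '/', and '=', so no XML escaping is needed.
    static String wrap(byte[] data) {
        return "<data encoding=\"base64\">" + Base64.getEncoder().encodeToString(data) + "</data>";
    }

    static byte[] unwrap(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String text = doc.getDocumentElement().getTextContent();
        // The MIME decoder ignores line breaks and spaces a pretty-printer may have added.
        return Base64.getMimeDecoder().decode(text);
    }

    public static void main(String[] args) throws Exception {
        byte[] original = {0x00, (byte) 0xFF, 0x10, (byte) 0x80};
        String xml = wrap(original);
        System.out.println(xml);
        System.out.println(Arrays.equals(original, unwrap(xml))); // prints true
    }
}
```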

21
Use namespaces for modularity and extensibility
  • Not hard: simple cases can use one default
    namespace
  • http URIs are normally preferred
  • DTD validation is tricky
  • Code to namespace URIs, not prefixes
  • Avoid namespace prefixes in element content and
    attribute values
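"Code to namespace URIs, not prefixes" can be sketched with the namespace-aware DOM calls in the JDK. The namespace URI and element names below are placeholders chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: finds Amount elements by namespace URI and local name.
class NamespaceNotPrefix {

    // Placeholder namespace URI, assumed for illustration only.
    static final String NS = "http://www.example.com/statement";

    static int countAmounts(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories are not namespace aware by default
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagNameNS(NS, "Amount").getLength();
    }

    public static void main(String[] args) throws Exception {
        String defaultNs = "<Statement xmlns='" + NS + "'><Amount>200.00</Amount></Statement>";
        String prefixed = "<mb:Statement xmlns:mb='" + NS + "'>"
                        + "<mb:Amount>200.00</mb:Amount></mb:Statement>";

        // Same URI, different prefixes: both documents match.
        System.out.println(countAmounts(defaultNs)); // prints 1
        System.out.println(countAmounts(prefixed));  // prints 1
    }
}
```

A lookup on the qualified name `mb:Amount` would have missed the first document, even though the two are equivalent.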

22
Reuse XHTML for generic narrative content
  • <!ENTITY % xhtml1 SYSTEM
    "http://www.w3.org/TR/xhtml1/DTD/strict.dtd"> %xhtml1;
  • <!ELEMENT description %Block;>

23
Choose the right schema language for the job
  • DTDs
  • The W3C XML Schema Language
  • RELAX NG
  • Schematron

24
Use only what you need
  • You need
  • Well-formed XML 1.0
  • A parser
  • You probably need
  • Namespaces
  • You may not need
  • DTDs
  • Schemas
  • XInclude
  • SOAP
  • WS-Kitchen-Sink
  • etc.

25
Always use a parser
  • Can't use regular expressions:
  • Detecting encoding
  • Comments and processing instructions that contain
    tags
  • CDATA sections
  • Unexpected placement of spaces and line breaks
    within tags
  • Default attribute values
  • Character and entity references
  • Malformed documents
  • Internal DTD Subset
  • Why not?
  • Unfamiliarity with parsers
  • Too slow
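Two of the failure modes listed above (a comment that contains tags, and a line break inside a tag) are enough to trip a typical regular expression. The sketch below contrasts a naive pattern with the JDK's DOM parser; the sample document and class name are made up for the example.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: extracts a title with a regex and with a parser.
class ParserNotRegex {

    static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    static String titleByRegex(String xml) {
        Matcher m = TITLE.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    static String titleByParser(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><!-- <title>old draft</title> -->\n"
                   + "<title\n>Xerlin 1.3 &amp; more</title></item>";

        // The regex matches inside the comment and never sees the real start tag.
        System.out.println(titleByRegex(xml));  // prints old draft
        // The parser skips the comment, accepts the line break, and resolves &amp;.
        System.out.println(titleByParser(xml)); // prints Xerlin 1.3 & more
    }
}
```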

26
Layer Functionality
27
Program to standard APIs
  • Easier to deploy in Java 1.4/1.5
  • Different implementations have different
    performance characteristics
  • SAX is fast
  • DOM interoperates
  • Semi-standard
  • JDOM
  • XOM
  • Bleeding edge
  • StAX
  • JAXB

28
Read the complete DTD
  • Be conservative in what you generate; be liberal
    in what you accept
  • Important content from DTD
  • Default attribute values
  • Namespace declarations
  • Entity references
  • ID types

29
Navigate with XPath
  • More robust against unexpected structure
  • Allow optimization by engine
  • Easier to code; enhanced programmer productivity
  • Might be slower
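A short sketch of XPath navigation using `javax.xml.xpath` from the JDK. It reuses the Transaction markup from the earlier slide; the `Statement` wrapper element and the class name are assumptions added for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads the amount of a withdrawal with one XPath expression.
class XPathNavigation {

    static String withdrawalAmount(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The expression states what is wanted, not how to walk child nodes to reach it.
        return xpath.evaluate("//Transaction[@type='withdrawal']/Amount", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Statement>\n"
                   + "  <!-- comments and white space do not disturb the query -->\n"
                   + "  <Transaction type=\"withdrawal\">\n"
                   + "    <Date>2003-12-15</Date>\n"
                   + "    <Amount>200.00</Amount>\n"
                   + "  </Transaction>\n"
                   + "</Statement>";
        System.out.println(withdrawalAmount(xml)); // prints 200.00
    }
}
```

The equivalent hand-written DOM loop would need to skip white-space text nodes and the comment explicitly, which is where the robustness claim comes from.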

30
Validate inside your program with schemas
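One way to do this in Java is the `javax.xml.validation` package. The sketch below is only an illustration under stated assumptions: the one-element schema is invented for the example, and a real program would more likely load its schema from a file or classpath resource.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class: validates a document against an in-memory W3C schema.
class InProgramValidation {

    // Toy schema, assumed for illustration: a single Amount element holding a decimal.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Amount' type='xs:decimal'/>"
        + "</xs:schema>";

    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validity errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<Amount>200.00</Amount>"));      // prints true
        System.out.println(isValid("<Amount>two hundred</Amount>")); // prints false
    }
}
```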
31
Part IV: Implementation
32
Write documents in Unicode
  • Prefer UTF-8
  • Smaller in English
  • ASCII compatible
  • Normalization
  • É, ü, ì and so forth
  • NFC
  • ICU
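The normalization bullet can be illustrated with `java.text.Normalizer`, which ships with the JDK. This is a stand-in for the ICU library named on the slide, chosen so the sketch has no external dependency; the class name is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class: shows why text should be normalized to NFC before comparison.
class NfcNormalization {

    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00C9";    // É as a single code point
        String decomposed = "E\u0301"; // E followed by a combining acute accent

        // The two strings render identically but compare as different.
        System.out.println(composed.equals(decomposed));        // prints false
        // After NFC normalization they are the same sequence of code points.
        System.out.println(toNfc(decomposed).equals(composed)); // prints true
    }
}
```

Normalizing on output means element names, attribute values, and text that look the same also compare the same in downstream tools.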

33
Avoid Vendor Lock-in. Beware:
  • Opaque, binary data used in place of marked up
    text.
  • Over-abbreviated, inobvious names like F17354 and
    grgyt
  • APIs that hide the XML
  • Products that focus on the "Infoset"
  • Alternate serializations of XML
  • Patented formats

34
Hang on to your relational database
35
Document Namespaces with RDDL
<!DOCTYPE html PUBLIC "-//XML-DEV//DTD XHTML RDDL 1.0//EN"
  "http://www.rddl.org/rddl-xhtml.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/">
<head>
  <title>MegaBank Statement Markup Language (MBSML)</title>
</head>
<body>
<p>
  This is the XML namespace for the
  <a href="http://developer.megabank.com/xml/">MegaBank Statement Markup Language</a>.
</p>
<rddl:resource
  xlink:type="simple"
  xlink:href="http://developer.megabank.com/xml/spec.html"
  xlink:role="http://www.w3.org/TR/html4/"
  xlink:arcrole="http://www.rddl.org/purposes#normative-reference">
  <p>
    The <a href="http://developer.megabank.com/xml/spec.html">MegaBank
    Statement Markup Language Specification 1.0</a>
  </p>
</rddl:resource>
</body>
</html>
36
Pick the correct MIME type
  • application/xml
  • Not text/xml!
  • Don't use charset
  • application/mathml+xml
  • image/svg+xml
  • application/xslt+xml

37
TagSoup Your HTML
38
Catalog common resources
  • <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
              uri="file:///opt/xml/docbook/docbookx.dtd"/>
    </catalog>

39
Compress if space is a problem
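import java.io.*;
import java.util.zip.*;
import javax.xml.parsers.*;
import org.w3c.dom.Document;
import org.apache.xml.serialize.*; // Xerces serializer classes

// output: doc is an existing org.w3c.dom.Document
OutputStream fout = new FileOutputStream("data.xml.gz");
OutputStream out = new GZIPOutputStream(fout);
OutputFormat format = new OutputFormat(doc);
XMLSerializer output = new XMLSerializer(out, format);
output.serialize(doc);
out.close(); // closing writes the gzip trailer

// input
InputStream fin = new FileInputStream("data.xml.gz");
InputStream in = new GZIPInputStream(fin);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
Document doc = parser.parse(in);
// work with the document...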
40
To Learn More
  • This Presentation:
    http://cafeconleche.org/slides/sd2007west/effectivexml
  • Effective XML: 50 Specific Ways to Improve Your
    XML Documents
  • Elliotte Rusty Harold
  • Addison-Wesley, 2003
  • ISBN 0-321-15040-6
  • $44.99
  • http://cafeconleche.org/books/effectivexml