Effective XML

1
Effective XML

Elliotte Rusty Harold
elharo@metalab.unc.edu
http://www.cafeconleche.org/

2
Part I: Syntax
3
Stay with XML 1.0

XML 1.1:
New name characters
C0 control characters
C1 control characters
NEL
Undeclare namespace prefixes
Incompatible with:
Most XML parsers
W3C and RELAX NG schema languages
XOM, JDOM

4
Part II: Structure
5
The XML Stack
6
Allow All XML syntax

CDATA sections
Entity references
Processing instructions
Comments
Numeric character references
Document type declarations
These are different ways of representing the same core
content, not different information

7
Distinguish text from markup

A DocBook element:
<programlisting><![CDATA[<value>
<double>28657</double></value>]]></programlisting>
The content is:
<value> <double>28657</double></value>
This is the same:
<programlisting>&lt;value&gt;
&lt;double&gt;28657&lt;/double&gt;
&lt;/value&gt;</programlisting>

8
The reverse problem

Tools that create XML from strings:
Tree-based editors like <Oxygen/> or XML Spy
WYSIWYG applications like OpenOffice Writer
Programming APIs such as DOM, JDOM, and XOM
The tool automatically escapes reserved
characters like <, >, or &.
Just because something looks like an XML tag does
not mean it is an XML tag.
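
The two points above (CDATA and escaped text carry the same content, and APIs escape strings for you) can be checked with a short sketch. It assumes the JAXP DOM parser and identity transformer bundled with the JDK; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TextVersusMarkup {

    // Parse a document and return the text content of its root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    // Put a plain string inside <programlisting> through the DOM API and serialize it.
    static String serialize(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("programlisting");
        root.appendChild(doc.createTextNode(text));
        doc.appendChild(root);

        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String cdata = "<programlisting><![CDATA[<value><double>28657</double></value>]]></programlisting>";
        String escaped = "<programlisting>&lt;value&gt;&lt;double&gt;28657&lt;/double&gt;&lt;/value&gt;</programlisting>";
        // Both spellings give the application the same characters.
        System.out.println(rootText(cdata).equals(rootText(escaped)));
        // The string is treated as text, so the serializer escapes the "<".
        System.out.println(serialize("<value>28657</value>"));
    }
}
```

The string passed to createTextNode looks like a tag, but the API stores it as text, so no value element is created.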

9
White space matters

Parsers report all white space in element
content, including boundary white space
An xml:space attribute is for the client
application only, not the parser
White space in attribute values is normalized
Parsers do not report white space in the prolog,
epilog, the document type declaration, and tags.
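
A minimal sketch of the first and third bullets, assuming the JDK's default (non-validating) DOM parser; the sample documents and class name are illustrative only. It counts the child nodes reported for an indented and a compact document, then reads back an attribute whose value contains a literal line break.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class WhiteSpaceDemo {

    static Element root(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Number of child nodes the parser reports for the root element.
    static int childCount(String xml) throws Exception {
        return root(xml).getChildNodes().getLength();
    }

    // Attribute value on the root element, after the parser has normalized it.
    static String attribute(String xml, String name) throws Exception {
        return root(xml).getAttribute(name);
    }

    public static void main(String[] args) throws Exception {
        String indented = "<a>\n  <b>x</b>\n</a>";
        String compact = "<a><b>x</b></a>";
        // The indented form has two extra text nodes of boundary white space.
        System.out.println(childCount(indented) + " vs " + childCount(compact));
        // The literal line break inside the attribute value becomes a space.
        System.out.println("[" + attribute("<a v=\"x\ny\"/>", "v") + "]");
    }
}
```

Code that walks children by index therefore breaks as soon as someone pretty-prints the document; select by name or type instead.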

10
Make structure explicit through markup

Bad:
<Transaction>Withdrawal 2003 12 15
200.00</Transaction>
Better:
<Transaction type="withdrawal">
<Date>2003-12-15</Date>
<Amount>200.00</Amount>
</Transaction>

11
Store metadata in attributes

Material the reader doesn't want to see:
URLs
IDs
Styles
Revision dates
Author's name
No substructure
Revision tracking
Citations
No multiple elements

12
Remember mixed content

Narrative documents
Record-like documents
The RSS problem
<item>
<title>Xerlin 1.3 released</title>
<description>
Xerlin 1.3, an open source XML Editor written in
Java, has been released. Users can extend the
application via custom editor interfaces for
specific DTDs. New features in version 1.3 include
XML Schema support, WebDAV capabilities, and
various user interface enhancements. Java 1.2
or later is required.
</description>
<link>http://www.cafeconleche.org/#news2003April7</link>
</item>

13
What you really want is this:
<description>
<p><a href="http://www.xerlin.org"><strong>Xerlin 1.3</strong></a>, an open
source XML Editor written in Java, has been
released. Users can extend the application via
custom editor interfaces for specific DTDs. New
features in version 1.3 include</p>
<ul>
<li>XML Schema support</li>
<li>WebDAV capabilities</li>
<li>Various user interface enhancements</li>
</ul>
<p>Java 1.2 or later is required.</p>
</description>
14
What people do is this:
<description>&lt;p&gt;&lt;a href="http://www.xerlin.org"&gt;&lt;strong&gt;Xerlin 1.3&lt;/strong&gt;&lt;/a&gt;, an
open source XML Editor written in Java, has been
released. Users can extend the application via
custom editor interfaces for specific DTDs. New
features in version 1.3 include&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;XML Schema support&lt;/li&gt;
&lt;li&gt;WebDAV capabilities&lt;/li&gt;
&lt;li&gt;Various user interface enhancements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Java 1.2 or later is required.&lt;/p&gt;
</description>
15
Prefer URLs to unparsed entities and notations

URLs are simple and well understood
Notations and unparsed entities are confusing and
little used
URLs don't require the DTD to be read
Many APIs don't even support notations and
unparsed entities

16
Part III: Semantics
17
Use processing instructions for process-specific
content

For a very particular, even local, process
Describes how a particular process acts on the
data in the document
Does not describe or add to the content itself
A unit that can be treated in isolation
Content is not XML-like.
Applies to the entire document

18
Processing instructions are not appropriate when

Content is closely related to the content of the
document itself
Structure extends beyond a single processing
instruction
Needs to be validated

19
Include all information in instance documents

Not all parsers read the DTD
Especially browsers
Beware
Default attribute values
Parsed entity references
XInclude
ID type dependence (XPath, DOM, etc.)
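
A small sketch of the default-attribute hazard, assuming the JDK's DOM parser, which reads the internal DTD subset. The doc element and its status attribute are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultAttributeDemo {

    // The instance itself is just <doc/>; the status value lives only in the DTD.
    static final String XML =
            "<!DOCTYPE doc [\n"
          + "  <!ATTLIST doc status CDATA \"draft\">\n"
          + "]>\n"
          + "<doc/>";

    static Attr statusAttribute() throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement();
        return root.getAttributeNode("status");
    }

    public static void main(String[] args) throws Exception {
        Attr status = statusAttribute();
        // The value is supplied by the DTD, and the DOM flags it as not specified.
        System.out.println(status.getValue() + ", specified=" + status.getSpecified());
    }
}
```

A processor that skips the DTD would see no status attribute at all, which is the argument for writing such values into the instance document.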

20
Encode binary data using quoted printable and/or
Base64

Quoted printable works well for mostly text
Base64 for non-text data
Can you link to the data with a URL instead?
Can you bundle the data with XML using zip, jar,
XOP, or MIME?
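
As a rough illustration of the Base64 option, the sketch below uses java.util.Base64 from the JDK. The payload element and its encoding attribute are hypothetical names, not part of any standard vocabulary.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Payload {

    // Wrap raw bytes in a hypothetical <payload> element as Base64 text.
    // The Base64 alphabet contains no '<' or '&', so no XML escaping is needed.
    static String toElement(byte[] data) {
        return "<payload encoding=\"base64\">"
                + Base64.getEncoder().encodeToString(data)
                + "</payload>";
    }

    // Decode element text; the MIME decoder skips line breaks and spaces
    // that an editor or pretty-printer may have added.
    static byte[] fromText(String text) {
        return Base64.getMimeDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, 0x10, (byte) 0x80};
        System.out.println(toElement(data));
        System.out.println(Arrays.equals(data, fromText(" AP8Q\n gA== ")));
    }
}
```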

21
Use namespaces for modularity and extensibility

Not hard: simple cases can use one default
namespace
http URIs are normally preferred
DTD validation is tricky
Code to namespace URIs, not prefixes
Avoid namespace prefixes in element content and
attribute values
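
"Code to namespace URIs, not prefixes" might look like the following sketch, assuming the JDK DOM parser with namespace awareness switched on. The namespace URI and element names are placeholders.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceLookup {

    // Hypothetical namespace URI used only for this example.
    static final String NS = "http://example.com/ns/statement";

    // Count Amount elements in the namespace, whatever prefix the author chose.
    static int countAmounts(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagNameNS(NS, "Amount").getLength();
    }

    public static void main(String[] args) throws Exception {
        String defaulted = "<Statement xmlns=\"" + NS + "\"><Amount>200.00</Amount></Statement>";
        String prefixed = "<mb:Statement xmlns:mb=\"" + NS + "\"><mb:Amount>200.00</mb:Amount></mb:Statement>";
        String unrelated = "<Statement xmlns=\"http://example.com/other\"><Amount>200.00</Amount></Statement>";
        System.out.println(countAmounts(defaulted));
        System.out.println(countAmounts(prefixed));
        System.out.println(countAmounts(unrelated));
    }
}
```

Matching on the qualified name "mb:Amount" would miss the first document and could falsely match a document that binds mb to some other URI.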

22
Reuse XHTML for generic narrative content

<!ENTITY % xhtml1 SYSTEM "http://www.w3.org/TR/xhtml1/DTD/strict.dtd">
%xhtml1;
<!ELEMENT description %Block;>

23
Choose the right schema language for the job

DTDs
The W3C XML Schema Language
RELAX NG
Schematron

24
Use only what you need

You need
Well-formed XML 1.0
A parser
You probably need
Namespaces
You may not need
DTDs
Schemas
XInclude
SOAP
WS-Kitchen-Sink
etc.

25
Always use a parser

Can't use regular expressions:
Detecting encoding
Comments and processing instructions that contain
tags
CDATA sections
Unexpected placement of spaces and line breaks
within tags
Default attribute values
Character and entity references
Malformed documents
Internal DTD Subset
Why not?
Unfamiliarity with parsers
Too slow
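
To make a few of the failure modes above concrete, here is a sketch that pits a naive regular expression against the JDK DOM parser. The sample document is contrived: it hides a tag in a comment, breaks a line inside the real start-tag, and uses an entity reference.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParserVersusRegex {

    static final String XML =
            "<doc><!-- <title>Old title</title> --><title\n>Real &amp; true</title></doc>";

    // Naive approach: grab whatever sits between the first <title> and </title>.
    static String titleByRegex(String xml) {
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    // Parser approach: comments are skipped, tag layout is irrelevant,
    // and entity references are resolved.
    static String titleByParser(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("regex:  " + titleByRegex(XML));
        System.out.println("parser: " + titleByParser(XML));
    }
}
```

The regular expression returns the commented-out title and never sees the real one; patching the pattern for each case ends up reimplementing a parser badly.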

26
Layer Functionality
27
Program to standard APIs

Easier to deploy in Java 1.4/1.5
Different implementations have different
performance characteristics
SAX is fast
DOM interoperates
Semi-standard
JDOM
XOM
Bleeding edge
StAX
JAXB

28
Read the complete DTD

Be conservative in what you generate; liberal in
what you accept
Important content from DTD
Default attribute values
Namespace declarations
Entity references
ID types

29
Navigate with XPath

More robust against unexpected structure
Allow optimization by engine
Easier to code: enhanced programmer productivity
Might be slower
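
The robustness claim can be sketched with the javax.xml.xpath API in the JDK. The Transaction vocabulary is borrowed from the earlier slide; the wrapper elements in the second document are invented.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathNavigation {

    // Find the withdrawal amount wherever the Transaction element sits in the tree.
    static String withdrawalAmount(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//Transaction[@type='withdrawal']/Amount",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String flat =
                "<Transaction type=\"withdrawal\">"
              + "<Date>2003-12-15</Date><Amount>200.00</Amount>"
              + "</Transaction>";
        String nested =
                "<Statement><Transactions>"
              + "<Transaction type=\"deposit\"><Amount>50.00</Amount></Transaction>"
              + "<Transaction type=\"withdrawal\"><Amount>200.00</Amount></Transaction>"
              + "</Transactions></Statement>";
        // The same expression works on both layouts without any tree-walking code.
        System.out.println(withdrawalAmount(flat));
        System.out.println(withdrawalAmount(nested));
    }
}
```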

30
Validate inside your program with schemas
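
One way to do this in Java, sketched with the javax.xml.validation API and a W3C XML Schema held in a string. The one-element schema is a stand-in; a real program would load its own schema, and could use RELAX NG or Schematron with other libraries.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaCheck {

    // Stand-in schema: a single Amount element that must hold a decimal number.
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"Amount\" type=\"xs:decimal\"/>"
          + "</xs:schema>";

    // Returns true if the document is valid against XSD, false otherwise.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            // With no error handler installed, validation errors are thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<Amount>200.00</Amount>"));
        System.out.println(isValid("<Amount>lots</Amount>"));
    }
}
```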
31
Part IV: Implementation
32
Write documents in Unicode

Prefer UTF-8
Smaller in English
ASCII compatible
Normalization
É, ü, ì and so forth
NFC
ICU
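
For the normalization bullet, a short sketch using java.text.Normalizer from the JDK rather than ICU (ICU4J offers a comparable API). The class name is illustrative.

```java
import java.text.Normalizer;

class NfcDemo {

    // Convert a string to Unicode Normalization Form C (composed characters).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "E\u0301"; // E followed by a combining acute accent
        String composed = "\u00C9";    // the single precomposed character
        // The two strings display alike but are not equal until normalized.
        System.out.println(decomposed.equals(composed));
        System.out.println(toNfc(decomposed).equals(composed));
    }
}
```

Normalizing before writing a document means that string comparisons on names and content do not fail on visually identical text.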

33
Avoid Vendor Lock-in: Beware

Opaque, binary data used in place of marked up
text.
Over-abbreviated, non-obvious names like F17354 and
grgyt
APIs that hide the XML
Products that focus on the "Infoset"
Alternate serializations of XML
Patented formats

34
Hang on to your relational database
35
Document Namespaces with RDDL
<!DOCTYPE html PUBLIC "-//XML-DEV//DTD XHTML RDDL 1.0//EN"
  "http://www.rddl.org/rddl-xhtml.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:rddl="http://www.rddl.org/">
<head>
<title>MegaBank Statement Markup Language (MBSML)</title>
</head>
<body>
<p>This is the XML namespace for the
<a href="http://developer.megabank.com/xml/">MegaBank Statement Markup Language</a>.</p>
<rddl:resource xlink:type="simple"
  xlink:href="http://developer.megabank.com/xml/spec.html"
  xlink:role="http://www.w3.org/TR/html4/"
  xlink:arcrole="http://www.rddl.org/purposes#normative-reference">
<p>The <a href="http://developer.megabank.com/xml/spec.html">MegaBank Statement Markup Language Specification 1.0</a></p>
</rddl:resource>
</body>
</html>
36
Pick the correct MIME type

application/xml
Not text/xml!
Don't use charset
application/mathml+xml
image/svg+xml
application/xslt+xml

37
TagSoup Your HTML
38
Catalog common resources

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
        uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>

39
Compress if space is a problem
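// Output: serialize the document through a GZIP stream.
// OutputFormat and XMLSerializer are Xerces classes (org.apache.xml.serialize);
// the stream classes come from java.io and java.util.zip.
OutputStream fout = new FileOutputStream("data.xml.gz");
OutputStream out = new GZIPOutputStream(fout);
OutputFormat format = new OutputFormat(doc);
XMLSerializer output = new XMLSerializer(out, format);
output.serialize(doc);
out.close(); // finishes the GZIP stream and closes the file

// Input: decompress on the fly while parsing.
InputStream fin = new FileInputStream("data.xml.gz");
InputStream in = new GZIPInputStream(fin);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
Document doc = parser.parse(in);
// work with the document...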
40
To Learn More