Title: Ace104 Lecture 6
1Ace104 Lecture 6
- Parsing XML into programming languages
2Parsing XML
- Goal: read XML files into data structures in
programming languages
- Possible strategies
- Parse by hand with some reusable libraries
- Parse into generic tree structure
- Parse as sequence of events
- Automagically parse to language-specific objects
3Parsing by-hand
- Advantages
- Complete control
- Good for simple needs; can build on a regex package
- Disadvantages
- Must write the initial code yourself, even if it
becomes generalized
- Pretty tedious and error prone.
- Gets very hard when using a schema or DTD to
validate
- No one does this anymore
4Parsing into generic tree structure
- Advantages
- Industry-wide, language-neutral W3C standard
exists called DOM (Document Object Model)
- Learning DOM for one language makes it easy to
learn for any other
- As of JAXP 1.2, support for Schema
- Have to write much less code to get XML into
something you want to manipulate in your program
- Disadvantages
- Non-intuitive API; doesn't take full advantage of
Java
- Still quite a bit of work
5What is JAXP?
- JAXP: Java API for XML Processing
- In the Java language, the definitions of these
standard APIs (together with the XSLT API) comprise
a set of interfaces known as JAXP
- Java also provides standard implementations
together with a vendor pluggability layer
- Some of these come standard with J2SDK; others
are only available with the Web Services Developer
Pack
- We will study these shortly
6Another alternative
- JDOM: native Java published API for representing
XML as a tree
- Like DOM but much more Java-specific, object
oriented
- However, not supported by other languages
- Also, no support for schema
- dom4j: another alternative
7JAXB
- JAXB: Java Architecture for XML Binding
- Defines an API for automagically representing XML
schema as collections of Java classes.
- Most convenient for application programming
- Will cover next class
8DOM
9About DOM
- Stands for Document Object Model
- A World Wide Web Consortium (W3C) standard
- Standard constantly adding new features: Level 3
Core became a W3C Recommendation in April 2004
- We'll cover most of the basics. There's always
more, and it's always changing.
10DOM abstraction layer in Java -- architecture
Emphasis is on allowing vendors to supply their
own DOM Implementation without requiring change
to source code
Returns specific parser implementation
org.w3c.dom.Document
11Sample Code
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.DocumentBuilder;
    import org.w3c.dom.Document;

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    /* set some factory options here */
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(xmlFile);

- The factory instance determines the parser
implementation. It can be changed with a runtime
System property. The JDK has a default. Xerces much
better.
- From the factory one obtains an instance of the
parser
- xmlFile can be a java.io.File, an InputStream,
etc.
- The imports are listed for reference. Notice that
the Document class comes from the W3C-specified
bindings.
12Validation
- Note that by default the parser will not validate
against a schema or DTD
- As of JAXP 1.2, Java provides a default parser
that can handle most schema features
- See next slide for details on how to set up
13Important: Schema validation
    String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

Next, you need to configure DocumentBuilderFactory to
generate a namespace-aware, validating parser
that uses XML Schema:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setValidating(true);
    try {
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
    } catch (IllegalArgumentException x) {
        // Happens if the parser does not support JAXP 1.2
        ...
    }
14Associating document with schema
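- An XML file can be associated with a schema in
two ways
- Directly in the XML file, in the regular way
- Programmatically from Java
- The latter is done as
- factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource));
- Here JAXP_SCHEMA_SOURCE is the property name
"http://java.sun.com/xml/jaxp/properties/schemaSource"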
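A minimal sketch of the programmatic route, assuming the JDK's default parser. The schema is handed to the factory through the schemaSource property as an in-memory stream instead of a File. The student/id schema, the class name SchemaValidationSketch, and the isValid helper are made up for illustration. An ErrorHandler is installed because validation errors are otherwise only reported, and parsing carries on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class SchemaValidationSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema: a <student> element holding one integer <id>.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='student'><xs:complexType><xs:sequence>"
      + "<xs:element name='id' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true if the document parses and satisfies SCHEMA.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            // The schema language must be set before the schema source.
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            factory.setAttribute(JAXP_SCHEMA_SOURCE,
                new ByteArrayInputStream(SCHEMA.getBytes(StandardCharsets.UTF_8)));
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Turn validation errors into exceptions.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or rejected by the schema
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<student><id>42</id></student>"));
        System.out.println(isValid("<student><id>forty-two</id></student>"));
    }
}
```

The first call should be accepted and the second rejected, since "forty-two" is not an xs:int.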
15A few notes
- Factory allows ease of switching parser
implementations
- Java provides a simple DOM implementation, but much
better to use a vendor-supplied one when doing serious
work
- Xerces, part of the Apache project, is installed on
the cluster as an Eclipse plugin. We'll use it next week.
- Note that some properties are not supported by
all parser implementations.
16Document object
- Once a Document object is obtained, there is a rich
API to manipulate it.
- First call is usually
- Element root = doc.getDocumentElement();
- This gets the root element of the Document as an
instance of the Element interface
- Note that Element extends Node and has methods
getNodeType(), getNodeName(), getNodeValue(), and
getChildNodes()
17Types of Nodes
- Note that there are many types of Nodes (i.e.
subinterfaces of Node)
- Attr, CDATASection, Comment, Document,
DocumentFragment, DocumentType, Element, Entity,
EntityReference, Notation, ProcessingInstruction,
Text
- Each of these has a special and non-obvious
associated type, value, and name.
- Standards are language-neutral and are specified
on the chart on the following slide
- Important: keep this chart nearby when using DOM
18(Chart of the type, name, and value for each node type; not transcribed)
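Since the chart itself is not reproduced here, a small program can print the type, name, and value for a few node kinds instead. This is only a sketch: the class name NodeInfoSketch, the describe and parse helpers, and the sample note document are made up for illustration. The accessors are the standard org.w3c.dom.Node methods.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeInfoSketch {
    // Returns "type|name|value" for one node.
    static String describe(Node n) {
        return n.getNodeType() + "|" + n.getNodeName() + "|" + n.getNodeValue();
    }

    // Parses an XML string with the default (non-validating) parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<note id='7'>hi<!--c--></note>");
        Node root = doc.getDocumentElement();
        System.out.println(describe(doc));   // Document node
        System.out.println(describe(root));  // Element node
        System.out.println(describe(root.getAttributes().getNamedItem("id"))); // Attr node
        NodeList kids = root.getChildNodes(); // Text node, then Comment node
        for (int i = 0; i < kids.getLength(); i++) {
            System.out.println(describe(kids.item(i)));
        }
    }
}
```

Element and Document nodes report a null value, while Text, Comment, and Attr nodes carry their content in the value. This is the non-obvious part the bullets above warn about.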
19DOM Exercise
Write a function to do a depth-first printout of the
node information of a given XML file, called as
recursePrint(root). Assume you have access to the
following:
- printNodeInfo(Node node): prints the name, type,
and value of the input node.
- boolean Node.hasChildNodes(): to check if a node
has any children
- NodeList Node.getChildNodes(): to get a list of all
children nodes
- Node NodeList.item(int num): to select the num-th
child node

    public static void recursePrint(Node node)
20DOM Exercise Answer
    public static void recursePrint(Node node) {
        printNodeInfo(node);
        if (!node.hasChildNodes()) return;
        NodeList nodes = node.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            // visit each child, and its subtree, in document order
            recursePrint(nodes.item(i));
        }
    }
21Transforming XML
22The JAXP Transformation Packages
- JAXP Transformation APIs
- javax.xml.transform
- This package defines the factory class you use to
get a Transformer object. You then configure the
transformer with input (Source) and output
(Result) objects, and invoke its transform()
method to make the transformation happen. The
source and result objects are created using
classes from one of the other three packages.
- javax.xml.transform.dom
- Defines the DOMSource and DOMResult classes that
let you use a DOM as an input to or output from a
transformation.
- javax.xml.transform.sax
- Defines the SAXSource and SAXResult classes that
let you use a SAX event generator as input to a
transformation, or deliver SAX events as output
to a SAX event processor.
- javax.xml.transform.stream
- Defines the StreamSource and StreamResult classes
that let you use an I/O stream as an input to or
output from a transformation.
23Transformer Architecture
24Writing DOM to XML
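    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;

    public class WriteDOM {
        public static void main(String[] argv) throws Exception {
            // parse the input file into a DOM tree
            File f = new File(argv[0]);
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(f);
            // a transformer with no stylesheet copies source to result
            TransformerFactory tFactory = TransformerFactory.newInstance();
            Transformer transformer = tFactory.newTransformer();
            DOMSource source = new DOMSource(document);
            StreamResult result = new StreamResult(System.out);
            transformer.transform(source, result);
        }
    }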
25Creating a DOM from scratch
- Sometimes you may want to create a DOM tree
directly in memory. This is done with
- DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
- DocumentBuilder builder = factory.newDocumentBuilder();
- Document document = builder.newDocument();
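As a rough illustration, the empty Document can then be filled in with createElement and appendChild and written out with an identity Transformer, as on the "Writing DOM to XML" slide. The class name BuildDomSketch and the students/student content are invented for this sketch.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDomSketch {
    // Builds a tiny tree in memory and returns it serialized as XML text.
    static String build() {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

            // Nodes are created by the Document, then attached to the tree.
            Element root = document.createElement("students");
            document.appendChild(root);
            Element student = document.createElement("student");
            student.setAttribute("id", "1");
            student.appendChild(document.createTextNode("Ada"));
            root.appendChild(student);

            // Identity transform: DOM in, character stream out.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```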
26Manipulating Nodes
- Once the root node is obtained, typical tree
methods exist to manipulate other elements
- boolean node.hasChildNodes()
- NodeList node.getChildNodes()
- Node node.getNextSibling()
- Node node.getParentNode()
- String node.getNodeValue()
- String node.getNodeName()
- String node.getTextContent()
- void node.setNodeValue(String nodeValue)
- Node node.insertBefore(Node newChild, Node refChild)
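A short sketch of a few of these calls working together, assuming a small tree built in memory. The class name NodeEditSketch and the sequence/number content are illustrative only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeEditSketch {
    // Builds <sequence>, reorders and edits its children, then lists them.
    static String reorder() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("sequence");
        doc.appendChild(root);

        Element five = doc.createElement("number");
        five.setTextContent("5");
        root.appendChild(five);

        Element three = doc.createElement("number");
        three.setTextContent("3");
        root.insertBefore(three, five);          // 3 now precedes 5

        // An element's text lives in a child Text node; edit it there.
        five.getFirstChild().setNodeValue("8");

        // Walk the children with getFirstChild / getNextSibling.
        StringBuilder sb = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            sb.append(n.getParentNode().getNodeName()).append('/')
              .append(n.getNodeName()).append('=')
              .append(n.getTextContent()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(reorder()); // prints "sequence/number=3 sequence/number=8"
    }
}
```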
27JDOM
28JDOM Motivation (from Elliotte Rusty Harold)
- Unfortunately DOM suffers from a number of design
flaws and limitations that make it less than
ideal as a Java API for processing XML
- DOM had to be backwards compatible with the
hackish, poorly thought out, unplanned object
models used in third generation web browsers.
- DOM was designed by a committee trying to
reconcile differences between the object models
implemented by Netscape, Microsoft, and other
vendors. They needed a solution that was at least
minimally acceptable to everybody, which resulted
in an API that's maximally acceptable to no one.
- DOM is a cross-language API defined in IDL, and
thus limited to those features and classes that
are available in essentially all programming
languages, including not fully object-oriented
scripting languages like JavaScript and Visual
Basic. It is a lowest common denominator API. It
does not take full advantage of Java, nor does it
adhere to Java best practices, naming
conventions, and coding standards.
- DOM must work for both HTML (not just XHTML, but
traditional malformed HTML) and XML.
29Some sample JDOM
To create the element <fibonacci/>

In JDOM:
    Element element = new Element("fibonacci");

In DOM:
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    DOMImplementation impl = builder.getDOMImplementation();
    Document doc = impl.createDocument(null, "Fibonacci_Numbers", null);
    Element element = doc.createElement("fibonacci");

In JDOM, adding text and an attribute:
    Element element = new Element("fibonacci");
    element.setText("8");
    element.setAttribute("index", "6");

Extremely simple and intuitive!
30More JDOM
- To create this element

    <sequence>
      <number>3</number>
      <number>5</number>
    </sequence>

- use

    Element element = new Element("sequence");
    Element firstNumber = new Element("number");
    Element secondNumber = new Element("number");
    firstNumber.setText("3");
    secondNumber.setText("5");
    element.addContent(firstNumber);
    element.addContent(secondNumber);
31Parsing XML file with JDOM
    import org.jdom.*;
    import org.jdom.input.SAXBuilder;
    import java.io.IOException;
    import java.util.*;

    public class ElementLister {

        public static void main(String[] args) {
            if (args.length == 0) {
                System.out.println("Usage: java ElementLister URL");
                return;
            }
            SAXBuilder builder = new SAXBuilder();
            try {
                Document doc = builder.build(args[0]);
                Element root = doc.getRootElement();
                listChildren(root, 0);
            }
            // indicates a well-formedness error
            catch (JDOMException e) {
                System.out.println(args[0] + " is not well-formed.");
                System.out.println(e.getMessage());
            }
            catch (IOException e) {
                System.out.println(e);
            }
        }

        public static void listChildren(Element current, int depth) {
            printSpaces(depth);
            System.out.println(current.getName());
            List children = current.getChildren();
            Iterator iterator = children.iterator();
            while (iterator.hasNext()) {
                Element child = (Element) iterator.next();
                listChildren(child, depth + 1);
            }
        }

        private static void printSpaces(int n) {
            for (int i = 0; i < n; i++) {
                System.out.print(' ');
            }
        }
    }
32SAX
- Simple API for XML
33About SAX
- SAX in Java is hosted on SourceForge
- SAX is not a W3C standard
- Originated purely in Java
- Other languages have chosen to implement in their
own ways based on this prototype
34SAX vs.
- Please don't compare unrelated things
- SAX is an alternative to DOM, but realize that
DOM is often built on top of SAX
- SAX and DOM do not compete with JAXP
- They do both compete with JAXB implementations
35How a SAX parser works
- A SAX parser scans an XML stream on the fly and
responds to certain parsing events as it
encounters them.
- This is very different from reading an entire
XML document into memory.
- Much faster, requires less memory.
- However, you need to reparse if you need to revisit
data.
36Obtaining a SAX parser
- Important classes
- javax.xml.parsers.SAXParserFactory
- javax.xml.parsers.SAXParser
- javax.xml.parsers.ParserConfigurationException
    // get the parser
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    // parse the document
    saxParser.parse(new File(argv[0]), handler);
37DefaultHandler
- Note that an event handler has to be passed to
the SAX parser.
- This must implement the interface
- org.xml.sax.ContentHandler
- Easier to extend the adapter
- org.xml.sax.helpers.DefaultHandler
38Overriding Handler methods
- Most important methods to override
- void startDocument()
- Called once when document parsing begins
- void endDocument()
- Called once when parsing ends
- void startElement(...)
- Called each time an element begin tag is
encountered
- void endElement(...)
- Called each time an element end tag is
encountered
- void characters(...)
- Called one or more times between startElement and
endElement calls to deliver accumulated character data
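To see the order in which these callbacks fire, one can record them in a list. A minimal sketch, assuming the JDK's default SAX parser; the class name SaxTraceSketch and the trace helper are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTraceSketch {
    // Parses the XML text and returns the handler events in firing order.
    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }
            @Override public void startElement(String namespaceURI, String sName,
                                               String qName, Attributes attrs) {
                events.add("start:" + qName);
            }
            @Override public void endElement(String namespaceURI, String sName,
                                             String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // prints [startDocument, start:a, start:b, end:b, end:a, endDocument]
        System.out.println(trace("<a><b/></a>"));
    }
}
```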
39startElement
    public void startElement(
        String namespaceURI,  // namespace URI, if a namespace is associated
        String sName,         // non-qualified (simple) name
        String qName,         // qualified name
        Attributes attrs)     // list of attributes

- Attribute info is obtained by querying the Attributes
object.
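Querying the Attributes object might look like the sketch below, which walks the attributes by index with getLength, getQName, and getValue. The class name SaxAttributesSketch, the attributes helper, and the sample student document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxAttributesSketch {
    // Returns one "element@attribute=value" entry per attribute encountered.
    static List<String> attributes(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String namespaceURI, String sName,
                                     String qName, Attributes attrs) {
                // Attributes is only valid during this call, so copy what is needed.
                for (int i = 0; i < attrs.getLength(); i++) {
                    found.add(qName + "@" + attrs.getQName(i) + "=" + attrs.getValue(i));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(attributes("<student id='1'><name first='Ada'/></student>"));
    }
}
```

A single attribute can also be looked up by name with attrs.getValue("id"), which returns null when the attribute is absent.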
40Characters
    public void characters(
        char[] buf,   // buffer of chars accumulated
        int offset,   // index of the first char in buf
        int len)      // number of chars

- Note: characters may be called more than once
between begin tag / end tag
- Also, mixed-content elements require careful
handling
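Because characters may arrive in several pieces, a common approach is to append each piece to a buffer and only read the buffer in endElement. The sketch below does this for simple leaf elements. The class name SaxTextSketch and the leafTexts helper are illustrative, and the reset-on-start strategy is deliberately naive: it would lose text in mixed-content elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextSketch {
    // Returns "name=text" for every element that directly ends with non-blank text.
    static List<String> leafTexts(String xml) {
        List<String> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String namespaceURI, String sName,
                                     String qName, Attributes attrs) {
                buf.setLength(0);             // new element: drop text seen so far
            }
            @Override
            public void characters(char[] ch, int offset, int len) {
                buf.append(ch, offset, len);  // may run several times per element
            }
            @Override
            public void endElement(String namespaceURI, String sName, String qName) {
                String text = buf.toString().trim();
                if (!text.isEmpty()) out.add(qName + "=" + text);
                buf.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            leafTexts("<students><name>Ada</name><name>Al &amp; Bo</name></students>"));
    }
}
```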
41Entity references
- Recall that entity references are special
character sequences for referring to characters
that have special meaning in XML syntax
- &lt; is <
- &gt; is >
- In SAX these are automatically converted and
passed to the characters stream unless they are
part of a CDATA section
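The difference can be checked with a handler that simply concatenates everything passed to characters. A sketch, with the class name EntityTextSketch and the sample input chosen for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityTextSketch {
    // Returns all character data in the document, concatenated.
    static String text(String xml) {
        StringBuilder buf = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int offset, int len) {
                buf.append(ch, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Outside CDATA the references are expanded; inside CDATA they are literal text.
        System.out.println(text("<m>&lt;b&gt; <![CDATA[&lt;b&gt;]]></m>"));
    }
}
```

The first part of the input comes through as the characters <b>, while the CDATA part comes through unchanged as the literal text &lt;b&gt;.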
42Choosing a Parser
- Choosing your Parser Implementation
- If no other factory class is specified, the
default SAXParserFactory class is used. To use a
different manufacturer's parser, you can change
the value of the system property that points
to it. You can do that from the command line,
like this:
- java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...
- The factory name you specify must be a fully
qualified class name (all package prefixes
included). For more information, see the
documentation of the newInstance() method of the
SAXParserFactory class.
43Validating SAX Parsers
    String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

Next, you need to configure SAXParserFactory to
generate a namespace-aware, validating parser.
Unlike DocumentBuilderFactory, SAXParserFactory has
no setAttribute method, so the schema language is
set as a property on the SAXParser itself:

    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.setValidating(true);
    SAXParser saxParser = factory.newSAXParser();
    try {
        saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
    } catch (SAXNotRecognizedException x) {
        // Happens if the parser does not support JAXP 1.2
        ...
    }
44Transforming arbitrary data structures using SAX
and Transformer
45Goal
- Now that we know SAX and a little about
Transformations, there are some cool things we
can do.
- One immediate thing is to create XML files from
plain text files with the help of a faux SAX
parser
- Turns out to be more robust than doing it by hand
- Recall that transformers easily let us go between
any source and result by arbitrary wirings of
- StreamSource / StreamResult
- SAXSource / SAXResult
- DOMSource / DOMResult
- We used this to write a DOM tree to an XML file
- Now we will use a SAXSource together with a
StreamResult to convert our text file
47Strategy
- We construct our own SAX parser, i.e. a class that
implements the XMLReader interface
- This class must have a parse method (among
others)
- We use parse to read our input file and fire the
appropriate SAX events, rather than hand-coding
the Strings ourselves.
48Main snippet
    public static void main(String[] argv) throws Exception {
        // Create SAX parser
        StudentReader saxReader = new StudentReader();
        // Create transformer
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        // Use text file as Transformer source
        FileReader fr = new FileReader("students.txt");
        BufferedReader br = new BufferedReader(fr);
        InputSource inputSource = new InputSource(br);
        SAXSource source = new SAXSource(saxReader, inputSource);
        // Use text as result
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
49XMLReader implementation
- To have a valid SAXSource we need a class that
implements the XMLReader interface
- public void parse(InputSource input)
- public void setContentHandler(ContentHandler handler)
- public ContentHandler getContentHandler()
- .
- .
- .
- Shown are the important methods for a simple app
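The course's StudentReader is not shown here, so the following is only a sketch of the idea under simplifying assumptions: each non-empty input line becomes one <student> element inside a <students> root. The class name LineReaderSketch, the element names, and the toXml helper are invented for illustration. The remaining XMLReader methods are filled in as trivial getters, setters, and no-ops.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

class LineReaderSketch implements XMLReader {
    private ContentHandler handler;
    private ErrorHandler errorHandler;
    private EntityResolver resolver;
    private DTDHandler dtdHandler;

    // Reads plain text and fires SAX events as if it were parsing XML.
    public void parse(InputSource input) throws IOException, SAXException {
        Attributes none = new AttributesImpl();
        BufferedReader br = new BufferedReader(input.getCharacterStream());
        handler.startDocument();
        handler.startElement("", "students", "students", none);
        String line;
        while ((line = br.readLine()) != null) {
            if (line.trim().isEmpty()) continue;   // skip blank lines
            char[] text = line.trim().toCharArray();
            handler.startElement("", "student", "student", none);
            handler.characters(text, 0, text.length);
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    public void parse(String systemId) throws SAXException {
        throw new SAXException("this sketch only reads an InputSource character stream");
    }

    public void setContentHandler(ContentHandler h) { handler = h; }
    public ContentHandler getContentHandler() { return handler; }
    public void setErrorHandler(ErrorHandler h) { errorHandler = h; }
    public ErrorHandler getErrorHandler() { return errorHandler; }
    public void setEntityResolver(EntityResolver r) { resolver = r; }
    public EntityResolver getEntityResolver() { return resolver; }
    public void setDTDHandler(DTDHandler h) { dtdHandler = h; }
    public DTDHandler getDTDHandler() { return dtdHandler; }
    // Features and properties are ignored in this sketch.
    public boolean getFeature(String name) { return false; }
    public void setFeature(String name, boolean value) { }
    public Object getProperty(String name) { return null; }
    public void setProperty(String name, Object value) { }

    // Wires the reader into an identity Transformer and returns the XML text.
    static String toXml(String text) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new SAXSource(new LineReaderSketch(),
                                      new InputSource(new StringReader(text))),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("Ada\n\nAl & Bo\n"));
    }
}
```

Because the text goes through the serializer as SAX events, special characters such as & are escaped automatically. This is one reason the approach is more robust than concatenating strings by hand.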
50See Course Examples for details
51JAXB
- Java Architecture for XML Binding
52What is JAXB?
- JAXB defines the behavior of a standard set of
tools and interfaces that automatically generate
Java class files from XML schema
- JAXB is a framework or architecture, not an
implementation.
- Sun provides a reference implementation of JAXB
with the Web Services Developer Pack, available
as a separate download:
http://java.sun.com/webservices/downloads/webservicespack.html
53JAXB vs. DOM and SAX
- JAXB is a higher level construct than DOM or SAX
- DOM represents XML documents as generic trees
- SAX represents XML documents as generic event
streams
- JAXB represents XML documents as Java classes
with properties that are specific to the
particular XML document
- E.g. book.xml becomes Book.java with getTitle,
setTitle, etc.
- JAXB thus requires almost no knowledge of XML to
be able to programmatically process XML documents!
54High-level comparison
- Before diving into details of JAXB, it's good to
see a bird's-eye view of the difference between
JAXB and SAX and/or DOM-like parsers
- Study the books/ examples under the examples/jaxb
directory on the course website
55JAXB steps
- We start by assuming that you have a valid
installation of the Java Web Services Developer Pack
version 3. We cover these installation details
later
- Using JAXB then requires several steps
- Run the binding compiler on the schema file to
automagically produce the appropriate Java class
files
- Compile the Java class files (the ant tool helps
here)
- Study the autogenerated API to learn what Java
types have been created
- Create a program that unmarshals an XML document
into these elementary data structures
56Running binding compiler
- <install_dir>/jaxb/bin/xjc.sh -p test.jaxb books.xsd -d work
- xjc.sh: executes the binding compiler
- -p test.jaxb: place resulting class files in
package test.jaxb
- books.xsd: run compiler on schema books.xsd
- -d work: place resulting files in directory
called work/
- Note that this creates a huge number of files
that together represent the content of the
books.xsd schema as a set of Java classes
- It is not necessary to know all of these classes.
We'll study them only at a high level so we can
understand how to use them
57Example students.xsd
58Generated interfaces
- xjc.sh -p test.lottery students.xsd
- This generates the following interfaces
- test/lottery/ObjectFactory.java
- Contains methods for generating instances of the
interfaces
- test/lottery/Students.java
- Represents the root node <students>
- test/lottery/StudentsType.java
- Represents the unnamed type of each student object
59Generated implementations
- Each interface is implemented in the impl
directory
- test/lottery/impl/StudentsImpl.java
- Vendor-specific implementation of the Students
interface
- test/lottery/impl/StudentsTypeImpl.java
- Vendor-specific implementation of the
StudentsType interface
60Compilation
- Next, the generated classes must be compiled
- javac students/*.java students/impl/*.java
- CLASSPATH requires many jar files
- jaxb/lib/*.jar
- jwsdp-shared/lib/*.jar
- jaxp/lib/**/*.jar
- Note: an ant buildfile (like a Java makefile)
makes this much easier. More on this later
61Generated docs
- Java API docs for these classes are generated in
- students/docs/api/*.html
- After bindings are generated, one usually works
directly through these API docs to learn how to
access/manipulate the XML data.
62Sample Programs
63Sample Programs
- Easiest way to learn is to cover certain generic
sample cases. These are all on the course website
under ace104/lesson6/examples
- Summary of examples
- student/
- Use JAXB to read an XML document composed of a
single student complex type
- student/
- Same, but for an XML document composed of a
sequence of such student types of indefinite
length
- purchaseOrder/
- Another read example, but for a more complex
schema
64Sample programs, cont
- Course examples, cont
- create-marshal
- Purchase-order example modified to create in
memory and write to XML
- modify-marshal
- Purchase-order example modified to read XML,
change it and write back to XML
- Study these examples!
65Some additional JAXB details
66Binding Data Types
- Default Java datatype bindings can be found at
- http://java.sun.com/webservices/docs/1.3/tutorial/doc/JAXBWorks5.html
- These defaults can be changed if required for an
application
- Also, name bindings are fairly standard changes of
names to things acceptable in the Java programming
language
- See other binding rules on subsequent pages
67Default binding rules summary
- The JAXB binding model follows the default
binding rules summarized below
- Bind the following to Java package
- XML Namespace URI
- Bind the following XML Schema components to Java
content interface
- Named complex type
- Anonymous inlined type definition of an element
declaration
- Bind to typesafe enum class
- A named simple type definition with a basetype
that derives from "xsd:NCName" and has
enumeration facets.
- Bind the following XML Schema components to a
Java Element interface
- A global element declaration to an Element
interface.
- Local element declaration that can be inserted
into a general content list.
- Bind to Java property
- Attribute use
- Particle with a term that is an element reference
or local element declaration.
68End