Title: XML Tools
1. XML Tools
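2. XML Processing
[Figure: XML processing pipeline. The document parser (well-formedness checks, reference expansion) turns an XML document into an XML infoset; the document validator checks it against a DTD or XML schema and produces an annotated XML infoset for the application. A storage system is also shown.]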
3. Tools for XML Processing
- DOM: a language-neutral interface for manipulating XML data
  - requires that the entire document be in memory
- SAX: push-based stream processing
  - hard to write non-trivial applications
- XPath: a declarative tree-navigation language
  - beautiful and easy to use
  - is part of many other languages
- XSLT: a language for transforming XML based on templates
  - very ugly!
- XQuery: a full-fledged query language
  - influenced by OQL
- XmlPull: pull-based stream processing
  - far better than SAX, but not a standard yet
4. DOM
- The Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content and structure of XML documents.
- The following is part of the DOM interface:

    public interface Node {
        public String getNodeName ();
        public String getNodeValue ();
        public NodeList getChildNodes ();
        public NamedNodeMap getAttributes ();
    }
    public interface Element extends Node {
        // returns all descendant elements with the given tag name
        public NodeList getElementsByTagName ( String name );
    }
    public interface Document extends Node {
        public Element getDocumentElement ();
    }
    public interface NodeList {
        public int getLength ();
        public Node item ( int index );
    }
5. DOM Example

    import java.io.File;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    class Test {
        public static void main ( String[] args ) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new File("depts.xml"));
            NodeList nodes = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                NodeList ndl = n.getChildNodes();
                for (int k = 0; k < ndl.getLength(); k++) {
                    Node m = ndl.item(k);
                    // compare strings with equals(), not ==
                    if ( "dept".equals(m.getNodeName())
                         && "cse".equals(m.getFirstChild().getNodeValue()) ) {
                        NodeList ncl = ((Element) m).getElementsByTagName("tel");
                        for (int j = 0; j < ncl.getLength(); j++)
                            ...
                    }
                }
            }
        }
    }

- The loops implement the XPath step: /dept[text()="cse"]/tel/text()
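For comparison, the same kind of selection can be written directly as an XPath query using the JDK's javax.xml.xpath package. The following is a minimal sketch, not the slides' own code: the class name XPathDemo, the helper select, and the inline sample document are assumptions made for illustration, since the contents of depts.xml are not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDemo {
    // Hypothetical sample data; the real depts.xml is not shown in the slides.
    static final String SAMPLE =
        "<depts><school>"
        + "<dept>cse<tel>555-0100</tel><tel>555-0101</tel></dept>"
        + "<dept>math<tel>555-0200</tel></dept>"
        + "</school></depts>";

    // Evaluates an XPath expression and returns the values of the selected nodes.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++)
                values.add(nodes.item(i).getNodeValue());
            return values;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // prints [555-0100, 555-0101]
        System.out.println(select(SAMPLE, "//dept[text()='cse']/tel/text()"));
    }
}
```

The three nested loops collapse into a single expression, which is the main argument for XPath over hand-written DOM navigation.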
6. Better Programming

    import java.io.File;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    import java.util.Vector;

    // A sequence of DOM nodes that supports XPath-like navigation steps
    class Sequence extends Vector<Node> {
        Sequence () { super(); }

        // start with a sequence that contains only the document's root element
        Sequence ( String filename ) throws Exception {
            super();
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new File(filename));
            add(doc.getDocumentElement());
        }

        // all children with the given tag name, for every node in the sequence
        Sequence child ( String tagname ) {
            Sequence result = new Sequence();
            for (int i = 0; i < size(); i++) {
                Node n = elementAt(i);
                NodeList c = n.getChildNodes();
                for (int k = 0; k < c.getLength(); k++)
                    if (c.item(k).getNodeName().equals(tagname))
                        result.add(c.item(k));
            }
            return result;
        }

        void print () {
            for (int i = 0; i < size(); i++)
                System.out.println(elementAt(i).toString());
        }
    }

    class DOM {
        public static void main ( String[] args ) throws Exception {
            (new Sequence("cs.xml")).child("gradstudent").child("name").print();
        }
    }
7. SAX
- SAX (Simple API for XML) allows you to process a document as it is being read
  - in contrast to DOM, which requires the entire document to be read before it takes any action
- The SAX API is event based
  - The XML parser sends events, such as the start or the end of an element, to an event handler, which processes the information
8. Parser Events
- Receive notification of the beginning of a document
    void startDocument ()
- Receive notification of the end of a document
    void endDocument ()
- Receive notification of the beginning of an element
    void startElement ( String namespace, String localName,
                        String qName, Attributes atts )
- Receive notification of the end of an element
    void endElement ( String namespace, String localName,
                      String qName )
- Receive notification of character data
    void characters ( char[] ch, int start, int length )
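To see the order in which a parser fires these callbacks, a handler can simply record each one. The sketch below is illustrative only: the class name EventLogger, the helper trace, and the abbreviations SD, SE, C, EE, ED (start document, start element, characters, end element, end document) are assumptions chosen for this example. Note that a real parser may deliver one run of text in several characters calls.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EventLogger extends DefaultHandler {
    final StringBuilder log = new StringBuilder();

    @Override public void startDocument() { log.append("SD "); }
    @Override public void endDocument() { log.append("ED"); }

    @Override
    public void startElement(String namespace, String localName,
                             String qName, Attributes atts) {
        log.append("SE ").append(qName).append(' ');
    }

    @Override
    public void endElement(String namespace, String localName, String qName) {
        log.append("EE ").append(qName).append(' ');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        log.append("C ").append(new String(ch, start, length)).append(' ');
    }

    // Parses the given XML text and returns the recorded event trace.
    static String trace(String xml) {
        try {
            EventLogger handler = new EventLogger();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.log.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // prints: SD SE a SE b C x EE b EE a ED
        System.out.println(trace("<a><b>x</b></a>"));
    }
}
```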
9. SAX Example: a Printer

    import java.io.FileReader;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    class Printer extends DefaultHandler {
        public Printer () { super(); }
        public void startDocument () {}
        public void endDocument () { System.out.println(); }
        public void startElement ( String uri, String name,
                                   String tag, Attributes atts ) {
            System.out.print("<" + tag + ">");
        }
        public void endElement ( String uri, String name, String tag ) {
            System.out.print("</" + tag + ">");
        }
        public void characters ( char[] text, int start, int length ) {
            System.out.print(new String(text,start,length));
        }
    }
10. The Child Handler

    class Child extends DefaultHandler {
        DefaultHandler next;   // the next handler in the pipeline
        String ptag;           // the tagname of the child
        boolean keep;          // are we keeping or skipping events?
        short level;           // the depth level of the current element

        public Child ( String s, DefaultHandler n ) {
            super();
            next = n; ptag = s;
            keep = false; level = 0;
        }
        public void startDocument () throws SAXException {
            next.startDocument();
        }
        public void endDocument () throws SAXException {
            next.endDocument();
        }
11. The Child Handler (cont.)

        public void startElement ( String nm, String ln, String qn, Attributes a ) throws SAXException {
            // an element at depth 1 is a child of the stream's top-level element
            if (level++ == 1)
                keep = ptag.equals(qn);
            if (keep)
                next.startElement(nm,ln,qn,a);
        }
        public void endElement ( String nm, String ln, String qn ) throws SAXException {
            if (keep)
                next.endElement(nm,ln,qn);
            // back at depth 1: the kept child has ended
            if (--level == 1)
                keep = false;
        }
        public void characters ( char[] text, int start, int length ) throws SAXException {
            if (keep)
                next.characters(text,start,length);
        }
    }
12. Forming the Pipeline

    class SAX {
        public static void main ( String[] args ) throws Exception {
            SAXParserFactory pf = SAXParserFactory.newInstance();
            SAXParser parser = pf.newSAXParser();
            DefaultHandler handler
                = new Child("gradstudent",
                      new Child("name",
                          new Printer()));
            parser.parse(new InputSource(new FileReader("cs.xml")),
                         handler);
        }
    }

[Figure: SAX parser -> Child: gradstudent -> Child: name -> Printer]
13. Example
- Input stream:

    <department>
      <deptname>
        Computer Science
      </deptname>
      <gradstudent>
        <name>
          <lastname>
            Smith
          </lastname>
          <firstname>
            John
          </firstname>
        </name>
      </gradstudent>
      ...
    </department>

- SAX events (SD = start document, SE = start element, C = characters, EE = end element, ED = end document):

    SD
    SE department
    SE deptname   C Computer Science   EE deptname
    SE gradstudent
    SE name
    SE lastname   C Smith   EE lastname
    SE firstname  C John    EE firstname
    EE name
    EE gradstudent
    ...
    EE department
    ED

- The events flow through the pipeline stages in order: Child: gradstudent, then Child: name, then Printer.
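The effect of the pipeline on this input can be checked with a small self-contained program. This is a sketch under stated assumptions: Child repeats the handler logic from the slides, while PipelineDemo, Collector (a stand-in for Printer that appends to a string instead of printing), and the whitespace-free inline SAMPLE are names and data introduced here for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PipelineDemo {
    // Hypothetical input, written without whitespace between tags.
    static final String SAMPLE =
        "<department><deptname>Computer Science</deptname>"
        + "<gradstudent><name><lastname>Smith</lastname>"
        + "<firstname>John</firstname></name></gradstudent></department>";

    // Last stage: serializes the events it receives into a string.
    static class Collector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        @Override public void startElement(String u, String n, String tag, Attributes a) {
            out.append('<').append(tag).append('>');
        }
        @Override public void endElement(String u, String n, String tag) {
            out.append("</").append(tag).append('>');
        }
        @Override public void characters(char[] text, int start, int length) {
            out.append(text, start, length);
        }
    }

    // Forwards only the children named ptag of the stream's top-level elements.
    static class Child extends DefaultHandler {
        final DefaultHandler next;
        final String ptag;
        boolean keep = false;
        int level = 0;
        Child(String s, DefaultHandler n) { ptag = s; next = n; }
        @Override public void startElement(String nm, String ln, String qn, Attributes a)
                throws SAXException {
            if (level++ == 1) keep = ptag.equals(qn);
            if (keep) next.startElement(nm, ln, qn, a);
        }
        @Override public void endElement(String nm, String ln, String qn)
                throws SAXException {
            if (keep) next.endElement(nm, ln, qn);
            if (--level == 1) keep = false;
        }
        @Override public void characters(char[] text, int start, int length)
                throws SAXException {
            if (keep) next.characters(text, start, length);
        }
    }

    // Runs the gradstudent/name pipeline over the given XML text.
    static String run(String xml) {
        try {
            Collector sink = new Collector();
            DefaultHandler handler = new Child("gradstudent", new Child("name", sink));
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return sink.out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // prints <name><lastname>Smith</lastname><firstname>John</firstname></name>
        System.out.println(run(SAMPLE));
    }
}
```

The first Child drops the deptname subtree and passes the gradstudent subtree on; the second Child then sees gradstudent as its top-level element and keeps only the name subtree.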
14. XmlPull
- Unlike SAX, you pull events from the document
- Create a pull parser:
    XmlPullParser xpp;
    xpp = factory.newPullParser();
- Pull the next event with xpp.next(); xpp.getEventType() returns the type of the current event
- Types of events:
  - START_TAG
  - END_TAG
  - TEXT
  - START_DOCUMENT
  - END_DOCUMENT
- More information at
  - http://www.xmlpull.org/
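XmlPull needs an external library, so the sketch below swaps in StAX (javax.xml.stream), the pull-style API that ships with the JDK, to illustrate the same idea: the application asks the parser for the next event instead of being called back. The class name PullDemo and the event labels, chosen to mirror the XmlPull event names, are assumptions for this example; the StAX calls themselves are not part of the XmlPull API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullDemo {
    // Pulls events one at a time and returns a readable trace of them.
    static List<String> events(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int type = reader.next();   // the application drives the parser
                switch (type) {
                    case XMLStreamConstants.START_ELEMENT:
                        out.add("START_TAG " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.add("END_TAG " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        out.add("TEXT " + reader.getText());
                        break;
                    case XMLStreamConstants.END_DOCUMENT:
                        out.add("END_DOCUMENT");
                        break;
                    default:
                        break;              // ignore comments, PIs, etc.
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [START_TAG a, START_TAG b, TEXT x, END_TAG b, END_TAG a, END_DOCUMENT]
        System.out.println(events("<a><b>x</b></a>"));
    }
}
```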
15. Better XmlPull Events

    class Attributes {
        public String[] names;
        public String[] values;
    }
    abstract class Event {
    }
    class StartTag extends Event {
        public String tag;
        public Attributes attributes;
    }
    class EndTag extends Event {
        public String tag;
    }
    class CData extends Event {
        public String text;
    }
    class EOS extends Event {}
16. Iterators

    import java.io.FileReader;
    import org.xmlpull.v1.XmlPullParser;
    import org.xmlpull.v1.XmlPullParserFactory;

    abstract class Iterator {
        abstract public void open ();    // open the stream iterator
        abstract public void close ();   // close the stream iterator
        abstract public Event next ();   // get the next tuple from stream
    }
    abstract class Filter extends Iterator {
        Iterator input;
    }
17. Document Reader

    class Document extends Iterator {
        String path;
        int state;
        FileReader reader;
        XmlPullParser xpp;
        static XmlPullParserFactory factory;

        Event getEvent () {
            int eventType = xpp.getEventType();
            if (eventType == XmlPullParser.START_TAG) {
                int len = xpp.getAttributeCount();
                String[] names = new String[len];
                String[] values = new String[len];
                for (int i = 0; i < len; i++) {
                    names[i] = xpp.getAttributeName(i);
                    values[i] = xpp.getAttributeValue(i);
                }
                return new StartTag(xpp.getName(),new Attributes(names,values));
            } else if (eventType == XmlPullParser.END_TAG)
                return new EndTag(xpp.getName());
            ...
18. Document Reader (cont.)

        public void open () {
            reader = new FileReader(path);
            xpp = factory.newPullParser();
            xpp.setInput(reader);
            state = 0;
        }
        public void close () {
            reader.close();
        }
        public Event next () {
            if (state > 0) {
                state++;
                if (state == 2)
                    return new EOS();
            }
            Event e = getEvent();
            if (xpp.getEventType() != XmlPullParser.END_DOCUMENT)
                xpp.next();
            return e;
        }
19. The Child Iterator

    class Child extends Filter {
        String tag;
        short nest;     // the nesting level of the event
        boolean keep;   // are we in keeping mode?

        public void open () { keep = false; nest = 0; input.open(); }

        public Event next () {
            while (true) {
                Event t = input.next();
                if (t instanceof EOS)
                    return t;
                else if (t instanceof StartTag) {
                    if (nest++ == 1) {
                        keep = tag.equals(((StartTag) t).tag);
                        if (!keep)
                            continue;
                    }
                } else if (t instanceof EndTag)
                    if (--nest == 1 && keep) {
                        keep = false;
                        ...
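Because the Child iterator above is cut off, the following self-contained sketch shows one way the idea can be completed and exercised without an XML parser. Everything here is an assumption for illustration: the base class is named Iter (to avoid clashing with java.util.Iterator), ListSource replays a fixed list of events in place of the Document reader, the event classes have constructors the slides do not show, and the tail of Child.next is a guess that mirrors the SAX Child handler.

```java
import java.util.Arrays;
import java.util.List;

class PullPipelineDemo {
    abstract static class Event {}
    static class StartTag extends Event { final String tag; StartTag(String t) { tag = t; } }
    static class EndTag extends Event { final String tag; EndTag(String t) { tag = t; } }
    static class CData extends Event { final String text; CData(String t) { text = t; } }
    static class EOS extends Event {}

    abstract static class Iter {
        abstract void open();
        abstract Event next();
    }

    // Hypothetical source: replays a fixed list of events, then EOS forever.
    static class ListSource extends Iter {
        final List<Event> events;
        int pos;
        ListSource(Event... events) { this.events = Arrays.asList(events); }
        void open() { pos = 0; }
        Event next() { return pos < events.size() ? events.get(pos++) : new EOS(); }
    }

    // Returns only events inside children named tag of the input's top-level elements.
    static class Child extends Iter {
        final String tag;
        final Iter input;
        int nest;
        boolean keep;
        Child(String tag, Iter input) { this.tag = tag; this.input = input; }
        void open() { keep = false; nest = 0; input.open(); }
        Event next() {
            while (true) {
                Event t = input.next();
                if (t instanceof EOS) return t;
                if (t instanceof StartTag) {
                    if (nest++ == 1) keep = tag.equals(((StartTag) t).tag);
                } else if (t instanceof EndTag) {
                    if (--nest == 1 && keep) { keep = false; return t; }
                }
                if (keep) return t;     // otherwise skip the event and pull again
            }
        }
    }

    // Pulls events until EOS and serializes them as text.
    static String run(Iter it) {
        StringBuilder out = new StringBuilder();
        it.open();
        for (Event e = it.next(); !(e instanceof EOS); e = it.next()) {
            if (e instanceof StartTag) out.append('<').append(((StartTag) e).tag).append('>');
            else if (e instanceof EndTag) out.append("</").append(((EndTag) e).tag).append('>');
            else out.append(((CData) e).text);
        }
        return out.toString();
    }

    static String demo() {
        Iter source = new ListSource(
            new StartTag("department"),
            new StartTag("deptname"), new CData("CS"), new EndTag("deptname"),
            new StartTag("gradstudent"),
            new StartTag("name"), new CData("Smith"), new EndTag("name"),
            new EndTag("gradstudent"),
            new EndTag("department"));
        return run(new Child("name", new Child("gradstudent", source)));
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints <name>Smith</name>
    }
}
```

In contrast to the SAX pipeline, the last stage drives the computation: each call to next pulls from the stage before it until an event worth returning turns up.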
20. XSL Transformation
- A stylesheet specification language for converting XML documents into various forms (XML, HTML, plain text, etc.).
- Can transform each XML element into another element, add new elements to the output, or remove elements.
- Can rearrange and sort elements, test and make decisions about which elements to display, and much more.
- Based on XPath:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <students>
          <xsl:copy-of select="//student/name"/>
        </students>
      </xsl:template>
    </xsl:stylesheet>
21. XSLT Templates
- XSL uses XPath to define parts of the source document that match one or more predefined templates.
- When a match is found, XSLT will transform the matching part of the source document into the result document.
- The parts of the source document that do not match any explicit template are processed by the default templates, so only their text content ends up in the result document.
- Form:

    <xsl:template match="XPath expression">
      ...
    </xsl:template>

- The default (implicit) templates visit all nodes and strip out all tags:

    <xsl:template match="*|/">
      <xsl:apply-templates/>
    </xsl:template>
    <xsl:template match="text()|@*">
      <xsl:value-of select="."/>
    </xsl:template>
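The effect of the default templates can be observed by running a stylesheet that defines no templates at all. The sketch below uses the JDK's javax.xml.transform API; the class name DefaultTemplatesDemo, the transform helper, and the sample input are assumptions made for illustration. The stylesheet sets the output method to text so that no XML declaration is written.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class DefaultTemplatesDemo {
    // A stylesheet with no templates: only the default templates apply.
    static final String EMPTY_STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "</xsl:stylesheet>";

    // Hypothetical sample input.
    static final String SAMPLE =
        "<department><deptname>Computer Science</deptname>"
        + "<gradstudent><name>Smith</name></gradstudent></department>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // All tags are stripped; prints Computer ScienceSmith
        System.out.println(transform(EMPTY_STYLESHEET, SAMPLE));
    }
}
```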
22. Other XSLT Elements
- <xsl:value-of select="XPath expression"/>
  - selects the string value of the first node that matches and adds it to the output stream of the transformation, e.g. <xsl:value-of select="//books/book/author"/>.
- <xsl:copy-of select="XPath expression"/>
  - copies the entire XML element to the output stream of the transformation.
- <xsl:apply-templates select="XPath expression"/>
  - applies the template rules to the nodes selected by the XPath expression.
- <xsl:element name="{XPath expression}"> ... </xsl:element>
  - adds an element to the output with a tag name derived from the XPath expression.
- Example:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="employee">
        <b> <xsl:apply-templates select="node()"/> </b>
      </xsl:template>
      <xsl:template match="surname">
        <i> <xsl:value-of select="."/> </i>
      </xsl:template>
    </xsl:stylesheet>
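As a quick check of how the two templates in this example interact, the stylesheet can be applied to a small document from Java. This is a sketch under assumptions: the class name EmployeeDemo, the transform helper, and the sample employee record (including its salary element) are invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EmployeeDemo {
    // The employee/surname stylesheet, written without whitespace between tags.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"employee\">"
        + "<b><xsl:apply-templates select=\"node()\"/></b>"
        + "</xsl:template>"
        + "<xsl:template match=\"surname\">"
        + "<i><xsl:value-of select=\".\"/></i>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical input record.
    static final String SAMPLE =
        "<employee><surname>Smith</surname><salary>100</salary></employee>";

    // Applies the stylesheet text to the XML text, omitting the XML declaration.
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // prints <b><i>Smith</i>100</b>
        System.out.println(transform(STYLESHEET, SAMPLE));
    }
}
```

The salary element has no template of its own, so the default templates reduce it to its text, while surname is wrapped in i by its explicit template.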
23. Copy the Entire Document

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <xsl:apply-templates/>
      </xsl:template>
      <xsl:template match="text()">
        <xsl:value-of select="."/>
      </xsl:template>
      <xsl:template match="*">
        <xsl:element name="{name(.)}">
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:template>
    </xsl:stylesheet>

- Note: this copies elements and text only; attributes are not visited by <xsl:apply-templates/> and are therefore dropped.
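The sketch below runs this copying stylesheet on a document that has an attribute, next to the commonly used identity template built on xsl:copy, which also copies attributes. The identity template is a standard XSLT idiom and not taken from the slides; the class name CopyDemo, the transform helper, and the sample input are likewise assumptions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CopyDemo {
    static final String HEADER =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">";

    // The element-by-element copy: rebuilds elements and text, ignores attributes.
    static final String SLIDE_COPY = HEADER
        + "<xsl:template match=\"/\"><xsl:apply-templates/></xsl:template>"
        + "<xsl:template match=\"text()\"><xsl:value-of select=\".\"/></xsl:template>"
        + "<xsl:template match=\"*\">"
        + "<xsl:element name=\"{name(.)}\"><xsl:apply-templates/></xsl:element>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // The usual identity template (an idiom, not from the slides).
    static final String IDENTITY = HEADER
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical sample input with one attribute.
    static final String SAMPLE = "<a id=\"1\"><b>x</b></a>";

    // Applies the stylesheet text to the XML text, omitting the XML declaration.
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SLIDE_COPY, SAMPLE));  // prints <a><b>x</b></a>
        System.out.println(transform(IDENTITY, SAMPLE));    // prints <a id="1"><b>x</b></a>
    }
}
```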
24. More on XSLT
- Conflict resolution: more specific templates override more general templates. Templates are assigned default priorities, but these can be overridden using priority="n" in a template.
- Modes can be used to group templates together. A template without a mode attribute belongs to the default (empty) mode.

    <xsl:template match="*" mode="A">
      <xsl:apply-templates mode="B"/>
    </xsl:template>

- Conditional and loop statements:

    <xsl:if test="XPath predicate"> body </xsl:if>
    <xsl:for-each select="XPath"> body </xsl:for-each>

- Variables can be used to name data:

    <xsl:variable name="x"> value </xsl:variable>

- Variables are referenced as $x in XPaths.
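The three constructs can be combined in one small stylesheet. The sketch below lists the students whose GPA reaches a threshold held in a variable. The class name ControlDemo, the transform helper, the variable name min, and the sample student data are all assumptions for illustration; the variable is bound with a select attribute rather than element content, so that it holds a number.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ControlDemo {
    // for-each loops over students, if filters them, $min names the threshold.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:variable name=\"min\" select=\"3.5\"/>"
        + "<xsl:for-each select=\"//student\">"
        + "<xsl:if test=\"gpa &gt;= $min\">"
        + "<xsl:value-of select=\"name\"/>;"
        + "</xsl:if>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical sample input.
    static final String SAMPLE =
        "<students>"
        + "<student><name>Ann</name><gpa>3.9</gpa></student>"
        + "<student><name>Bob</name><gpa>3.1</gpa></student>"
        + "<student><name>Eve</name><gpa>3.5</gpa></student>"
        + "</students>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(STYLESHEET, SAMPLE));  // prints Ann;Eve;
    }
}
```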
25. Using XSLT

    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.w3c.dom.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.stream.*;
    import java.io.*;

    class XSLT {
        public static void main ( String[] argv ) throws Exception {
            File stylesheet = new File("x.xsl");
            File xmlfile = new File("a.xml");
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document document = db.parse(xmlfile);
            StreamSource stylesource = new StreamSource(stylesheet);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer(stylesource);
            DOMSource source = new DOMSource(document);