Title: Perl/XML::DOM - reading and writing XML from Perl
1. Perl/XML::DOM - reading and writing XML from Perl
- Dr. Andrew C.R. Martin
- martin_at_biochem.ucl.ac.uk
- http://www.bioinf.org.uk/
2. Aims and objectives
- Refresh the structure of an XML (or XHTML) document
- Know the problems in reading and writing XML
- Understand the requirements of XML parsers and the two main types
- Know how to write code using the DOM parser
- PRACTICAL: write a script to read XML
3. An XML refresher!
    <mutants>
      <mutant_group native='1abc01'>
        <structure>
          <method>x-ray</method>
          <resolution>1.8</resolution>
          <rfactor>0.20</rfactor>
        </structure>
        <mutant domid='2bcd01'>
          <structure>
            <method>x-ray</method>
            <resolution>1.8</resolution>
            <rfactor>0.20</rfactor>
          </structure>
          <mutation res='L24' native='ALA' subs='ARG' />
        </mutant>
      </mutant_group>
    </mutants>
4. Writing XML
- Writing XML is straightforward
- Generate XML from a Perl script using print() statements
- However, you must make sure that:
  - tags are correctly nested
  - quote marks are correctly paired
  - international character sets are handled correctly
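The pitfalls listed above can be illustrated with a minimal sketch that uses only core Perl. The helper names `xml_escape` and `species_xml` are hypothetical (they do not come from any module), and the species data is borrowed from the example used later in these slides. The script itself has to nest the tags, pair the quote marks and escape special characters; nothing checks the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML
sub xml_escape
{
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: return one <species> element as a string
sub species_xml
{
    my ($name, $common_name) = @_;
    return "  <species name='" . xml_escape($name) . "'>\n"
         . "    <common-name>" . xml_escape($common_name) . "</common-name>\n"
         . "  </species>\n";
}

print "<?xml version='1.0'?>\n";
print "<data>\n";
print species_xml('Felix domesticus', 'cat');
print species_xml('Drosophila melanogaster', 'fruit fly');
print "</data>\n";
```

This sketch does nothing about character encodings; data that is not plain ASCII would also need an encoding declaration and matching output encoding.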
5. Reading XML
- As simple or complex as you wish!
- Full control over XML
  - simple pattern may suffice
- Otherwise, may be dangerous
  - may rely on un-guaranteed formatting
    <mutants><mutant_group native='1abc01'><structure>
    <method> x-ray</method><resolution>1.8</resolution
    ><rfactor>0.20 </rfactor></structure><mutant
    domid='2bcd01'><structure> <method>x-ray</method>
    <resolution>1.8</resolution> <rfactor>0.20</rfactor
    ></structure><mutation res='L24' native='ALA'
    subs='ARG'/></mutant></mutant_group></mutants>
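To make the danger concrete, here is a small sketch in core Perl. The helper `naive_resolutions` is hypothetical: it stands for the kind of line-by-line pattern match that works only while the XML happens to be laid out one element per line. The second input has a line break inside a closing tag, which is still well-formed XML, and the pattern silently finds nothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: a naive line-by-line pattern match for <resolution>
sub naive_resolutions
{
    my ($xml) = @_;
    my @found;
    foreach my $line (split /\n/, $xml)
    {
        push @found, $1 if $line =~ m{<resolution>([\d.]+)</resolution>};
    }
    return @found;
}

# The same data twice: once tidily formatted, once with a line break
# inside the closing tag (still well-formed XML)
my $tidy    = "<structure>\n  <resolution>1.8</resolution>\n</structure>\n";
my $wrapped = "<structure><resolution>1.8</resolution\n></structure>\n";

my @tidy_found    = naive_resolutions($tidy);
my @wrapped_found = naive_resolutions($wrapped);

printf "tidy:    %d value(s) found\n", scalar @tidy_found;
printf "wrapped: %d value(s) found\n", scalar @wrapped_found;
```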
6. XML Parsers
- Clear rules for data boundaries and hierarchy
- Predictable and unambiguous
- Parser translates XML into
  - stream of events
  - complex data object
7. XML Parsers
A good parser will handle:
- Different sources of data
  - files
  - character strings
  - remote references
- Different character encodings
  - standard Latin
  - Japanese
- Checking for well-formedness errors
8. XML Parsers
- Read stream of characters
  - differentiate markup and data
- Optionally replace entity references
  - (e.g. &lt; with <)
- Assemble complete document
  - disparate (perhaps remote) sources
- Report syntax and validation errors
- Pass data to client program
9. XML Parsers
- If XML has no syntax errors it is 'well formed'
- With a DTD, a validating parser will check that it matches: 'valid'
10. XML Parsers
- Writing a good parser is a lot of work!
- A lot of testing needed
- Fortunately, many parsers available
11. Getting data to your program
- Parser can generate 'events'
- Tags are converted into events
- Events triggered in your program as the document is read
- Parser acts as a pipeline converting XML into processed chunks of data sent to your program: an 'event stream'
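As a rough illustration of an event stream, the sketch below registers three handlers with XML::Parser (one of the modules covered later in these slides) and simply logs each event as it arrives. The `@events` array and the wording of the log lines are illustrative choices, and the input string is borrowed from the species example used later.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my @events;     # log of the events, in the order the parser reports them

my $parser = XML::Parser->new(
    Handlers => {
        # called for each opening tag, with its attributes as a list
        Start => sub {
            my ($expat, $tag, %attr) = @_;
            push @events, "start $tag";
        },
        # called for each closing tag
        End => sub {
            my ($expat, $tag) = @_;
            push @events, "end $tag";
        },
        # called for character data (which may arrive in several pieces)
        Char => sub {
            my ($expat, $text) = @_;
            push @events, "text '$text'" if $text =~ /\S/;
        },
    },
);

$parser->parse("<species name='Felix domesticus'><common-name>cat</common-name></species>");

print "$_\n" foreach @events;
```

Nothing is kept by the parser between events: if the program needs the attribute of `<species>` when it later sees the text of `<common-name>`, it has to remember it itself.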
12. Getting data to your program
- OR
- XML converted into a tree structure
- Reflects organization of the XML
- Whole document read into memory before your program gets access
13. Pros and cons
- Data structure
  - More convenient
  - Can access data in any order
  - Code usually simpler
  - May be impossible to handle very large files
  - Need more processor time
  - Need more memory
- Event stream
  - Faster to access limited data
  - Use less memory
  - Parser loses data at the next event
  - More complex code
- In the parser, everything is likely to be event-driven
  - tree-based parsers create a data structure from the event stream
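The last point can be sketched in core Perl. The event list below is written by hand to stand in for what an event-driven parser would report, and `build_tree` is a hypothetical helper that folds those events into a nested data structure using a stack of open elements. Real tree-based parsers are far more complete; this shows only the principle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hand-written event stream, of the kind an event-driven parser would
# generate for: <species><common-name>cat</common-name></species>
my @events = (
    [ start => 'species'     ],
    [ start => 'common-name' ],
    [ text  => 'cat'         ],
    [ end   => 'common-name' ],
    [ end   => 'species'     ],
);

# Hypothetical helper: fold an event stream into a tree of hashes
sub build_tree
{
    my @stream = @_;
    my $root   = { name => '#document', children => [] };
    my @stack  = ($root);           # the currently open elements

    foreach my $event (@stream)
    {
        my ($type, $value) = @$event;
        if($type eq 'start')
        {
            my $node = { name => $value, children => [] };
            push @{ $stack[-1]{children} }, $node;  # attach to its parent
            push @stack, $node;                     # it is now the open element
        }
        elsif($type eq 'text')
        {
            push @{ $stack[-1]{children} }, { name => '#text', value => $value };
        }
        elsif($type eq 'end')
        {
            pop @stack;                             # the element is complete
        }
    }
    return $root;
}

my $tree    = build_tree(@events);
my $species = $tree->{children}[0];
my $cname   = $species->{children}[0];
print "$cname->{name}: $cname->{children}[0]{value}\n";
```

Once the tree exists the program can visit nodes in any order, at the cost of holding the whole document in memory.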
14. SAX and DOM
- de facto standard APIs for XML parsing
- SAX (Simple API for XML)
  - event-stream API
  - originally for Java, but now for several programming languages (including Perl)
  - development promoted by Peter Murray-Rust, 1997-8
- DOM (Document Object Model)
  - W3C standard tree-based parser
  - platform- and language-neutral
  - allows update of document content and structure as well as reading
15. Perl XML parsers
- Many parsers available
- Differ in three major ways
  - parsing style (event driven or data structure)
  - 'standards-completeness'
  - speed (implementation in C or pure Perl)
16. Perl XML parsers
- XML::Simple
  - Very easy to use
  - Designed for simple applications
  - Can't handle 'mixed content'
    - tags containing both data and other tags
    <p>This is <b>mixed</b> content</p>
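For data without mixed content, XML::Simple turns a document into ordinary Perl hashes and arrays. The sketch below is one way this might look, using the species data from later in these slides. The options `ForceArray => 1` and `KeyAttr => []` are choices made for this example, so that the shape of the result does not depend on the module's defaults.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

my $xml = <<'END_XML';
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
  <species name='Drosophila melanogaster'>
    <common-name>fruit fly</common-name>
    <conservation status='not endangered' />
  </species>
</data>
END_XML

# Every child element becomes an array reference (ForceArray) and no
# attribute is used as a hash key (KeyAttr), so access is uniform
my $data = XMLin($xml, ForceArray => 1, KeyAttr => []);

foreach my $species (@{ $data->{species} })
{
    my $cname  = $species->{'common-name'}[0];
    my $status = $species->{conservation}[0]{status};
    print "$cname ($species->{name}) $status\n";
}
```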
17. Perl XML parsers
- XML::Parser
  - Oldest Perl XML parser
  - Reasonably fast and flexible
  - Not very standards-compliant
  - Is a wrapper to 'expat'
    - probably the first C XML parser, written by James Clark
18. XML::Parser
Simple example - check well-formedness

    use XML::Parser;

    my $xmlfile = shift @ARGV;      # the file to parse

    # initialize parser object and parse the file
    my $parser = XML::Parser->new( ErrorContext => 2 );
    eval { $parser->parsefile( $xmlfile ); };

    # report error or success
    if( $@ )
    {
        $@ =~ s/at \/.*?$//s;       # remove module line number
        print STDERR "\nERROR in '$xmlfile':\n$@\n";
    }
    else
    {
        print STDERR "'$xmlfile' is well-formed\n";
    }
19. Perl XML parsers
- XML::DOM
  - Implements W3C DOM Level 1
  - Built on top of XML::Parser
  - Very good: fast, stable and complete
  - Limited extended functionality
- XML::SAX
  - Implements SAX2 wrapper to Expat
  - Fast, stable and complete
20. Perl XML parsers
- XML::LibXML
  - Wrapper around GNOME libxml2
  - Very fast, complete and stable
  - Validating/non-validating
  - DOM and SAX support
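For comparison with the XML::DOM code later in these slides, here is a brief sketch of reading the same species data with XML::LibXML. It uses the module's XPath methods (`findnodes`, `findvalue`) next to the DOM-style `getAttribute`. The XPath expressions are simply one way to address this particular document.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'END_XML';
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
  <species name='Drosophila melanogaster'>
    <common-name>fruit fly</common-name>
    <conservation status='not endangered' />
  </species>
</data>
END_XML

# Parse the string into a document tree
my $doc = XML::LibXML->load_xml(string => $xml);

# Select each <species> element, then read its data relative to it
foreach my $species ($doc->findnodes('/data/species'))
{
    my $name   = $species->getAttribute('name');
    my $cname  = $species->findvalue('common-name');
    my $status = $species->findvalue('conservation/@status');
    print "$cname ($name) $status\n";
}
```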
21. Perl XML parsers
- XML::Twig
  - DOM-like parser, BUT
  - Allows you to define elements which can be parsed as discrete units
  - 'twigs' (small branches of a tree)
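A minimal sketch of the twig idea, again using the species data from later in these slides: a handler is registered for the `species` element, XML::Twig calls it each time one complete `<species>` twig has been parsed, and the handler then purges what it has finished with. The `@lines` array is only there to collect the results.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'END_XML';
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
  <species name='Drosophila melanogaster'>
    <common-name>fruit fly</common-name>
    <conservation status='not endangered' />
  </species>
</data>
END_XML

my @lines;      # one line of output per <species> twig

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once for each complete <species> element
        species => sub {
            my ($t, $species) = @_;
            my $cname  = $species->first_child_text('common-name');
            my $name   = $species->att('name');
            my $status = $species->first_child('conservation')->att('status');
            push @lines, "$cname ($name) $status";
            $t->purge;      # free the memory used by twigs already handled
        },
    },
);

$twig->parse($xml);
print "$_\n" foreach @lines;
```

Within the handler the code is tree-like, but only one twig needs to be in memory at a time, which is the attraction for large files.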
22. Perl XML parsers
- Several others
- More specialized, adding
  - XPath (to select data from XML document)
  - re-formatting (XSLT or other methods)
  - ...
23. XML::DOM
- DOM is a standard API
  - once learned, moving to a different language is straightforward
  - moving between implementations is also easy
- Suppose we want to extract some data from an XML file...
24. XML::DOM
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
      <species name='Drosophila melanogaster'>
        <common-name>fruit fly</common-name>
        <conservation status='not endangered' />
      </species>
    </data>
- We want:
    cat (Felix domesticus) not endangered
    fruit fly (Drosophila melanogaster) not endangered
25.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::DOM;

    my $file   = shift @ARGV;
    my $parser = XML::DOM::Parser->new();
    my $doc    = $parser->parsefile($file);

    foreach my $species ($doc->getElementsByTagName('species'))
    {
        # text of the <common-name> child element
        my $common_name  = $species->getElementsByTagName('common-name');
        my $cname        = $common_name->item(0)->getFirstChild->getNodeValue;
        # 'name' attribute of <species> itself
        my $name         = $species->getAttribute('name');
        # 'status' attribute of the <conservation> child element
        my $conservation = $species->getElementsByTagName('conservation');
        my $status       = $conservation->item(0)->getAttribute('status');
        print "$cname ($name) $status\n";
    }

    $doc->dispose();
27.
->getElementsByTagName returns an array (when called in list context). Here we work through the array:

    foreach my $species ($doc->getElementsByTagName('species'))
28.
    <species name='Felix domesticus'>
      <common-name>cat</common-name>
      <conservation status='not endangered' />
    </species>

    my $common_name = $species->getElementsByTagName('common-name');
    my $cname       = $common_name->item(0)->getFirstChild->getNodeValue;
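The reason for the `->getFirstChild` step is that, in the DOM, the text "cat" is not stored in the `<common-name>` element itself but in a separate text node that is the element's first child. The short sketch below parses a one-species string (rather than a file, to keep it self-contained) and prints the node names to show this; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse(
    "<species name='Felix domesticus'><common-name>cat</common-name></species>");

# The <common-name> element node, and the text node inside it
my $element = $doc->getElementsByTagName('common-name')->item(0);
my $child   = $element->getFirstChild;

# Copy out what we need before the document is disposed of
my $element_name = $element->getNodeName;
my $child_name   = $child->getNodeName;
my $child_value  = $child->getNodeValue;

$doc->dispose();

print "element node: $element_name\n";
print "child node:   $child_name, value '$child_value'\n";
```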
29. Attributes
30.
    <species name='Felix domesticus'>
      <common-name>cat</common-name>
      <conservation status='not endangered' />
    </species>

<conservation> is an empty element: there are no child elements, so we don't need ->getFirstChild

    my $conservation = $species->getElementsByTagName('conservation');
    my $status       = $conservation->item(0)->getAttribute('status');
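A small sketch to confirm the point: for an empty element the data is held in the attribute, and the element has no child nodes at all. The input is again a string rather than a file, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse(
    "<species name='Felix domesticus'><conservation status='not endangered' /></species>");

my $conservation = $doc->getElementsByTagName('conservation')->item(0);

# An empty element has no child nodes; its data is in the attribute
my $has_children = $conservation->hasChildNodes ? 'yes' : 'no';
my $status       = $conservation->getAttribute('status');

$doc->dispose();

print "has child nodes: $has_children\n";
print "status: $status\n";
```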
32. XML::DOM
- Note
  - Not necessary to use variable names that match the tags, but it is a very good idea!
  - There are many, many more functions, but this set covers most needs
33. Writing XML with XML::DOM
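34.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::DOM;

    my $nspecies    = 2;
    my @names       = ('Felix domesticus', 'Drosophila melanogaster');
    my @commonNames = ('cat', 'fruit fly');
    my @consStatus  = ('not endangered', 'not endangered');

    my $doc    = XML::DOM::Document->new;
    my $xml_pi = $doc->createXMLDecl('1.0');
    print $xml_pi->toString;

    my $root = $doc->createElement('data');
    for(my $i=0; $i<$nspecies; $i++)
    {
        my $species = $doc->createElement('species');
        $species->setAttribute('name', $names[$i]);
        $root->appendChild($species);

        # <common-name> holds its value in a child text node
        my $cname = $doc->createElement('common-name');
        $cname->appendChild($doc->createTextNode($commonNames[$i]));
        $species->appendChild($cname);

        # <conservation> is an empty element with a 'status' attribute
        my $conservation = $doc->createElement('conservation');
        $conservation->setAttribute('status', $consStatus[$i]);
        $species->appendChild($conservation);
    }
    print $root->toString;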
36.
After creating the first <species> element and setting its 'name' attribute:

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus' />
    </data>
38.
After adding the <common-name> element:

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
      </species>
    </data>
40.
After adding the <conservation> element:

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
    </data>
42.
After the loop has run for both species:

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
      <species name='Drosophila melanogaster'>
        <common-name>fruit fly</common-name>
        <conservation status='not endangered' />
      </species>
    </data>
43. Summary - reading XML
- Create a parser
    $parser = XML::DOM::Parser->new();
- Parse a file
    $doc = $parser->parsefile('filename');
- Extract all elements matching tag-name
    $element_set = $doc->getElementsByTagName('tag-name');
- Extract first element of a set
    $element = $element_set->item(0);
- Extract first child of an element
    $child_element = $element->getFirstChild;
- Extract text from a text node (e.g. the first child above)
    $text = $child_element->getNodeValue;
- Get the value of a tag's attribute
    $text = $element->getAttribute('attribute-name');
44. Summary - writing XML
- Create an XML document structure
    $doc = XML::DOM::Document->new;
- Utility to create an XML header
    $header = $doc->createXMLDecl('1.0');
- Create a tagged element
    $element = $doc->createElement('tag-name');
- Set an attribute for an element
    $element->setAttribute('attrib-name', 'value');
- Append a child element to an element
    $parent_element->appendChild($child_element);
- Create a text node element
    $element = $doc->createTextNode('text');
- Print a document structure as a string
    print $root_element->toString;
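One practical benefit of building the document with these calls, rather than with print() statements, is that the module takes care of nesting and of escaping special characters when the structure is turned into a string. The sketch below is a rough check of that, under the assumption that XML::DOM escapes `<` and `&` on output: it stores awkward text, serializes it, and then re-parses the result to confirm it is well-formed. The element name `note` and attribute name `source` are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

my $doc  = XML::DOM::Document->new;
my $root = $doc->createElement('data');

# Hypothetical element and attribute holding characters special to XML
my $note = $doc->createElement('note');
$note->setAttribute('source', 'Smith & Jones');
$note->appendChild($doc->createTextNode('resolution < 2.0 & rfactor < 0.25'));
$root->appendChild($note);

# Serialize the header and the element tree to a single string
my $xml = $doc->createXMLDecl('1.0')->toString . "\n" . $root->toString;
print "$xml\n";

# Re-read the generated string; parse() dies if it is not well-formed
my $ok = eval { XML::DOM::Parser->new()->parse($xml); 1 };
print $ok ? "well-formed\n" : "NOT well-formed: $@";
```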
45. Summary
- Two types of parser
  - Event-driven
  - Data structure
- Writing a good parser is difficult!
- Many parsers available
- XML::DOM for reading and writing data