Perl/XML::DOM - reading and writing XML from Perl

1
Perl/XML::DOM - reading and writing XML from Perl
  • Dr. Andrew C.R. Martin
  • martin@biochem.ucl.ac.uk
  • http://www.bioinf.org.uk/

2
Aims and objectives
  • Refresh the structure of an XML (or XHTML)
    document
  • Know the problems in reading and writing XML
  • Understand the requirements of XML parsers and
    the two main types
  • Know how to write code using the DOM parser
  • PRACTICAL: write a script to read XML

3
An XML refresher!
    <mutants>
      <mutant_group native='1abc01'>
        <structure>
          <method>x-ray</method>
          <resolution>1.8</resolution>
          <rfactor>0.20</rfactor>
        </structure>
        <mutant domid='2bcd01'>
          <structure>
            <method>x-ray</method>
            <resolution>1.8</resolution>
            <rfactor>0.20</rfactor>
          </structure>
          <mutation res='L24' native='ALA' subs='ARG' />
        </mutant>
      </mutant_group>
    </mutants>

4
Writing XML
  • Writing XML is straightforward
  • Generate XML from a Perl script using print()
    statements
  • However, you must take care of
      • tags correctly nested
      • quote marks correctly paired
      • international character sets
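The pitfalls listed above are easy to hit with bare print() statements. The following is a minimal sketch, not taken from the slides: the helper names xml_escape and mutation_tag are hypothetical, and the sketch only deals with nesting and quoting. Character encodings are not handled here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that have special meaning
# in XML so that data cannot break tag nesting or attribute quoting.
sub xml_escape
{
    my ($text) = @_;
    $text =~ s/&/&amp;/g;      # must be done first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: build one empty <mutation> element as a string
sub mutation_tag
{
    my ($res, $native, $subs) = @_;
    return sprintf("<mutation res='%s' native='%s' subs='%s' />",
                   xml_escape($res), xml_escape($native), xml_escape($subs));
}

print "<mutants>\n";
print "  ", mutation_tag('L24', 'ALA', 'ARG'), "\n";
print "</mutants>\n";
```

Keeping the escaping in one place means every print() goes through the same rules, but it is still up to the script to open and close tags in the right order.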

5
Reading XML
  • As simple or complex as you wish!
  • If you have full control over the XML
      • a simple pattern match may suffice
  • Otherwise, pattern matching may be dangerous
      • it may rely on un-guaranteed formatting

    <mutants><mutant_group native='1abc01'><structure>
    <method> x-ray</method><resolution>1.8</resolution
    ><rfactor>0.20 </rfactor></structure><mutant
    domid='2bcd01'><structure> <method>x-ray</method>
    <resolution>1.8</resolution> <rfactor>0.20</rfactor
    ></structure><mutation res='L24' native='ALA'
    subs='ARG'/></mutant></mutant_group></mutants>
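The danger of relying on layout can be sketched with a line-based pattern. This is an illustrative example only: the function name resolutions_by_line and both test strings are assumptions made for the sketch. The same pattern that finds the resolution in neatly indented XML finds nothing once the line breaks move, even though both strings are well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Line-based pattern: assumes each <resolution> element sits on its own line
sub resolutions_by_line
{
    my ($xml) = @_;
    my @found;
    foreach my $line (split /\n/, $xml)
    {
        push @found, $1
            if $line =~ /^\s*<resolution>([\d.]+)<\/resolution>\s*$/;
    }
    return @found;
}

# One element per line
my $pretty  = "<structure>\n  <method>x-ray</method>\n"
            . "  <resolution>1.8</resolution>\n</structure>\n";

# Same data, but the line break falls inside the closing tag
my $compact = "<structure><method>x-ray</method>"
            . "<resolution>1.8</resolution\n></structure>\n";

printf "pretty:  %d match(es)\n", scalar(resolutions_by_line($pretty));
printf "compact: %d match(es)\n", scalar(resolutions_by_line($compact));
```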
6
XML Parsers
  • Clear rules for data boundaries and hierarchy
  • Predictable and unambiguous
  • Parser translates XML into
      • a stream of events
      • a complex data object

7
XML Parsers
A good parser will handle
  • Different sources of data
      • files
      • character strings
      • remote references
  • Different character encodings
      • standard Latin
      • Japanese
  • Checking for well-formedness errors

8
XML Parsers
  • Read stream of characters
      • differentiate markup and data
  • Optionally replace entity references
      • (e.g. &lt; with <)
  • Assemble complete document
      • from disparate (perhaps remote) sources
  • Report syntax and validation errors
  • Pass data to client program

9
XML Parsers
  • If XML has no syntax errors it is 'well-formed'
  • With a DTD, a validating parser will check that it
    matches: if so, it is 'valid'

10
XML Parsers
  • Writing a good parser is a lot of work!
  • A lot of testing needed
  • Fortunately, many parsers available

11
Getting data to your program
  • Parser can generate 'events'
  • Tags are converted into events
  • Events triggered in your program as the document
    is read
  • Parser acts as a pipeline converting XML into
    processed chunks of data sent to your program
  • an 'event stream'

12
Getting data to your program
  • OR
  • XML converted into a tree structure
  • Reflects organization of the XML
  • Whole document read into memory before your
    program gets access

13
Pros and cons
  • Data structure
      • More convenient
      • Can access data in any order
      • Code usually simpler
      • May be impossible to handle very large files
      • Needs more processor time
      • Needs more memory
  • Event stream
      • Faster to access limited data
      • Uses less memory
      • Parser loses data at the next event
      • More complex code
  • In the parser, everything is likely to be
    event-driven
      • tree-based parsers create a data structure from
        the event stream

14
SAX and DOM
  • De facto standard APIs for XML parsing
  • SAX (Simple API for XML)
      • event-stream API
      • originally for Java, but now for several
        programming languages (including Perl)
      • development promoted by Peter Murray-Rust, 1997-8
  • DOM (Document Object Model)
      • W3C standard tree-based API
      • platform- and language-neutral
      • allows update of document content and structure
        as well as reading

15
Perl XML parsers
  • Many parsers available
  • Differ in three major ways
      • parsing style (event-driven or data structure)
      • 'standards-completeness'
      • speed (implementation in C or pure Perl)

16
Perl XML parsers
  • XML::Simple
      • Very easy to use
      • Designed for simple applications
      • Can't handle 'mixed content'
          • tags containing both data and other tags

    <p>This is <b>mixed</b> content</p>
17
Perl XML parsers
  • XML::Parser
      • Oldest Perl XML parser
      • Reasonably fast and flexible
      • Not very standards-compliant
      • Is a wrapper to 'expat'
          • probably the first C XML parser, written by
            James Clark

18
XML::Parser
Simple example - check well-formedness

    use XML::Parser;

    my $xmlfile = shift @ARGV;      # the file to parse

    # initialize parser object and parse the file
    my $parser = XML::Parser->new( ErrorContext => 2 );
    eval { $parser->parsefile( $xmlfile ); };

    # report error or success
    if( $@ )
    {
        $@ =~ s/at \/.*?$//s;       # remove module line number
        print STDERR "\nERROR in '$xmlfile':\n$@\n";
    }
    else
    {
        print STDERR "'$xmlfile' is well-formed\n";
    }

19
Perl XML parsers
  • XML::DOM
      • Implements W3C DOM Level 1
      • Built on top of XML::Parser
      • Very good: fast, stable and complete
      • Limited extended functionality
  • XML::SAX
      • Implements SAX2, wrapper to Expat
      • Fast, stable and complete

20
Perl XML parsers
  • XML::LibXML
      • Wrapper around GNOME libxml2
      • Very fast, complete and stable
      • Validating/non-validating
      • DOM and SAX support

21
Perl XML parsers
  • XML::Twig
      • DOM-like parser, BUT
      • Allows you to define elements which can be parsed
        as discrete units
          • 'twigs' (small branches of a tree)

22
Perl XML parsers
  • Several others
  • More specialized, adding
      • XPath (to select data from XML document)
      • re-formatting (XSLT or other methods)
      • ...

23
XML::DOM
  • DOM is a standard API
      • once learned, moving to a different language is
        straightforward
      • moving between implementations is also easy
  • Suppose we want to extract some data from an XML
    file...

24
XML::DOM

    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
      <species name='Drosophila melanogaster'>
        <common-name>fruit fly</common-name>
        <conservation status='not endangered' />
      </species>
    </data>

  • We want

    cat (Felix domesticus) not endangered
    fruit fly (Drosophila melanogaster) not endangered

25
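    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::DOM;

    my $file   = shift @ARGV;                  # XML file to read
    my $parser = XML::DOM::Parser->new();
    my $doc    = $parser->parsefile($file);    # whole document as a tree

    # Work through every <species> element in the document
    foreach my $species ($doc->getElementsByTagName('species'))
    {
        # Text of the <common-name> child element
        my $common_name = $species->getElementsByTagName('common-name');
        my $cname = $common_name->item(0)->getFirstChild->getNodeValue;

        # Attribute of <species> itself
        my $name = $species->getAttribute('name');

        # Attribute of the empty <conservation> child element
        my $conservation = $species->getElementsByTagName('conservation');
        my $status = $conservation->item(0)->getAttribute('status');

        print "$cname ($name) $status\n";
    }
    $doc->dispose();                           # free the document tree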

27
->getElementsByTagName returns an array (a list of
element nodes when called in list context). Here we
work through the array:

    foreach $species ($doc->getElementsByTagName('species'))
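The return value depends on how the method is called. The sketch below is an illustration under stated assumptions: the function name species_names is hypothetical, the XML is parsed from a string with parse() rather than from a file, and XML::DOM (a CPAN module) is assumed to be installed. In list context the call gives an ordinary Perl list; in scalar context it gives a node-list object that supports getLength and item.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;        # CPAN module, assumed to be installed

# Hypothetical helper: return the number of <species> elements
# followed by the value of each element's 'name' attribute
sub species_names
{
    my ($xml) = @_;
    my $parser = XML::DOM::Parser->new();
    my $doc    = $parser->parse($xml);       # parse a string, not a file

    # List context: an ordinary Perl list of element nodes
    my @names = map { $_->getAttribute('name') }
                $doc->getElementsByTagName('species');

    # Scalar context: a node list with getLength() and item()
    my $nodelist = $doc->getElementsByTagName('species');
    my $count    = $nodelist->getLength;

    $doc->dispose();
    return ($count, @names);
}

my ($count, @names) = species_names(
    "<data><species name='Felix domesticus'/>"
  . "<species name='Drosophila melanogaster'/></data>");
print "$count species: ", join(', ', @names), "\n";
```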
28
    <species name='Felix domesticus'>
      <common-name>cat</common-name>
      <conservation status='not endangered' />
    </species>

    $common_name = $species->getElementsByTagName('common-name');
    $cname = $common_name->item(0)->getFirstChild->getNodeValue;
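The chain item(0)->getFirstChild->getNodeValue is needed because, in the DOM tree, the text 'cat' is not stored in the <common-name> element itself but in a text node that is its first child. A minimal sketch of this, assuming XML::DOM is installed and using a hypothetical function name common_name_text:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;        # CPAN module, assumed to be installed

# Hypothetical helper: return a flag saying whether the first child of
# <common-name> is a text node, followed by the text it holds
sub common_name_text
{
    my ($xml) = @_;
    my $doc = XML::DOM::Parser->new()->parse($xml);

    my $element = $doc->getElementsByTagName('common-name')->item(0);
    my $child   = $element->getFirstChild;           # the text node

    my $is_text = ($child->getNodeType == XML::DOM::TEXT_NODE()) ? 1 : 0;
    my $text    = $child->getNodeValue;              # the text itself

    $doc->dispose();
    return ($is_text, $text);
}

my ($is_text, $text) = common_name_text(
    "<species name='Felix domesticus'><common-name>cat</common-name></species>");
print "first child is a text node: $is_text, value: $text\n";
```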
29
Attributes

30
    <species name='Felix domesticus'>
      <common-name>cat</common-name>
      <conservation status='not endangered' />
    </species>

This is an empty element: it has no children and its
data is held in an attribute, so we don't need
->getFirstChild

    $conservation = $species->getElementsByTagName('conservation');
    $status = $conservation->item(0)->getAttribute('status');
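The difference from the text case can be checked directly. In this sketch (the function name conservation_status is hypothetical and XML::DOM is assumed to be installed) the empty <conservation> element has no first child at all, while its attribute is read from the element itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;        # CPAN module, assumed to be installed

# Hypothetical helper: return a flag saying whether <conservation>
# has any child node, followed by its 'status' attribute
sub conservation_status
{
    my ($xml) = @_;
    my $doc = XML::DOM::Parser->new()->parse($xml);

    my $conservation = $doc->getElementsByTagName('conservation')->item(0);

    # An empty element has no children, so there is no first child
    my $has_child = defined($conservation->getFirstChild) ? 1 : 0;

    # The attribute belongs to the element node itself
    my $status = $conservation->getAttribute('status');

    $doc->dispose();
    return ($has_child, $status);
}

my ($has_child, $status) = conservation_status(
    "<species><conservation status='not endangered' /></species>");
print "has child: $has_child, status: $status\n";
```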
32
XML::DOM
  • Note
      • It is not necessary to use variable names that
        match the tags, but it is a very good idea!
      • There are many, many more functions, but this set
        covers most needs

33
Writing XML with XML::DOM
34
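    #!/usr/bin/perl
    use XML::DOM;

    $nspecies    = 2;
    @names       = ('Felix domesticus', 'Drosophila melanogaster');
    @commonNames = ('cat', 'fruit fly');
    @consStatus  = ('not endangered', 'not endangered');

    $doc = XML::DOM::Document->new;
    $xml_pi = $doc->createXMLDecl ('1.0');
    print $xml_pi->toString;
    $root = $doc->createElement('data');

    for($i=0; $i<$nspecies; $i++)
    {
        $species = $doc->createElement('species');
        $species->setAttribute('name', $names[$i]);
        $root->appendChild($species);

        # The rest of the loop is reconstructed from the output
        # shown on the following slides
        $cname = $doc->createElement('common-name');
        $text  = $doc->createTextNode($commonNames[$i]);
        $cname->appendChild($text);
        $species->appendChild($cname);

        $cons = $doc->createElement('conservation');
        $cons->setAttribute('status', $consStatus[$i]);
        $species->appendChild($cons);
    }
    print $root->toString;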

36
    <?xml version='1.0' ?>
    <data>
      <species name='Felix domesticus' />
    </data>
38
    <?xml version='1.0' ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
      </species>
    </data>
40
    <?xml version='1.0' ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
    </data>
42
    <?xml version='1.0' ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
      <species name='Drosophila melanogaster'>
        <common-name>fruit fly</common-name>
        <conservation status='not endangered' />
      </species>
    </data>
43
Summary - reading XML
  • Create a parser
        $parser = XML::DOM::Parser->new();
  • Parse a file
        $doc = $parser->parsefile('filename');
  • Extract all elements matching tag-name
        $element_set = $doc->getElementsByTagName('tag-name');
  • Extract first element of a set
        $element = $element_set->item(0);
  • Extract first child of an element
        $child = $element->getFirstChild;
  • Extract text from a (text node) child
        $text = $child->getNodeValue;
  • Get the value of a tag's attribute
        $text = $element->getAttribute('attribute-name');

44
Summary - writing XML
  • Create an XML document structure
        $doc = XML::DOM::Document->new;
  • Utility to create an XML header
        $header = $doc->createXMLDecl('1.0');
  • Create a tagged element
        $element = $doc->createElement('tag-name');
  • Set an attribute for an element
        $element->setAttribute('attrib-name', 'value');
  • Append a child element to an element
        $parent_element->appendChild($child_element);
  • Create a text node
        $element = $doc->createTextNode('text');
  • Print a document structure as a string
        print $root_element->toString;
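One reason to build the document through the DOM rather than with print() is that nesting and escaping are handled by the library. The sketch below is illustrative only: the function name species_xml and the sample text containing '&' are assumptions, XML::DOM is assumed to be installed, and attaching the root element to the document with appendChild is a choice made here, not something shown on the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;        # CPAN module, assumed to be installed

# Hypothetical helper: build <data><species name=...><common-name>...
# and return it as a string
sub species_xml
{
    my ($name, $common) = @_;

    my $doc  = XML::DOM::Document->new;
    my $root = $doc->createElement('data');
    $doc->appendChild($root);                # attach root to the document

    my $species = $doc->createElement('species');
    $species->setAttribute('name', $name);
    $root->appendChild($species);

    # The text is stored as a text node; special characters
    # are escaped when the tree is converted to a string
    my $cname = $doc->createElement('common-name');
    $cname->appendChild($doc->createTextNode($common));
    $species->appendChild($cname);

    my $xml = $root->toString;
    $doc->dispose();
    return $xml;
}

print species_xml('Felix domesticus', 'cat & kitten'), "\n";
```

Because every element is created and appended as a node, a mismatched or wrongly nested tag cannot be produced by accident.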

45
Summary
  • Two types of parser
      • Event-driven
      • Data structure
  • Writing a good parser is difficult!
  • Many parsers available
  • XML::DOM for reading and writing data