Introduction to XML: A Librarian's Perspective

D. Khanna -- Introduction to XML -- 7/25/99

1
Introduction to XML: A Librarian's Perspective
  • Delphine Khanna, Rutgers University. Palinet,
    July/August 1999

2
Overview of the workshop
  • What is XML? How does it work?
  • Why XML? What is it going to change?
  • Overview of some XML-related standards.
  • XML in libraries (standards projects).
  • Practical skills:
  • Creating an XML document,
  • Creating an XSL style sheet,
  • Working with MS Internet Explorer 5.0.

3
Workshop Web site
  • http://scc01.rutgers.edu/ceth/intromat/xml/
  • Contents:
  • This slide presentation,
  • XML Samples used in this workshop,
  • List of useful Web links and other resources.

4
A First Look at XML
5
Basics
  • Simplified definition: XML is a kind of
    super-HTML in which you can define your own tags.

6
Term Clarification
  • XML can be called:
  • an encoding format,
  • a language,
  • a standard.
  • We will prefer "standard".

7
The XML Family: A whole family of standards
  • XML,
  • XSL,
  • XLink, XPointer,
  • Namespaces,
  • RDF,
  • XML Schemas,
  • DOM,
  • and more

8
XML: Who? When?
  • XML family developed by W3C.
  • Very recent:
  • XML 1.0: February 1998.
  • Namespaces: January 1999.
  • RDF: February 1999.
  • XLink, XPointer, XSL, XML Schemas: still working
    drafts.

9
XML: To develop the next generation of Web
applications
  • People want to do more sophisticated things with
    the Web.
  • HTML is too limited for that.
  • Need for a more powerful language: XML.

10
XML Hype: 2 myths
  • XML will replace everything.
  • (HTML, back-end relational databases, etc.)
  • XML is completely different from Web technologies
    we had before.

11
Why is XML better?
12
Let's look at a typical HTML document
  • <BODY>
  • <H1>Lines Written in Early Spring</H1>
  • <H2>William Wordsworth</H2>
  • <P>I heard a thousand blended notes,<BR>
  • While in a grove I sate reclined,<BR>
  • In that sweet mood when pleasant thoughts<BR>
  • Bring sad thoughts to the mind.</P>
  • <P>To her fair works did nature link<BR>
  • The human soul that through me ran<BR>
  • And much it griev'd my heart to think<BR>
  • What man has made of man.</P>
  • </BODY>

13
What is the problem?
  • To do fancier things with documents, we
    need to make their logical structure explicit.
  • Otherwise, software applications:
  • do not know what is what,
  • have no handle on the documents.

14
Why XML is better: Overall
  • HTML:
  • Encoding too vague and messy.
  • Logical structure is not clearly encoded.
  • XML:
  • Allows us to create clean, structured documents
    in which the logical structure is fully
    explicit.

15
The same document in XML
  • <?xml version="1.0"?>
  • <POEM>
  • <TITLE>Lines Written in Early Spring</TITLE>
  • <AUTHOR><FIRSTNAME>William</FIRSTNAME><LASTNAME>Wordsworth</LASTNAME></AUTHOR>
  • <STANZA>
  • <LINE N="1">I heard a thousand blended notes,</LINE>
  • <LINE N="2">While in a grove I sate reclined,</LINE>
  • <LINE N="3">In that sweet mood when pleasant thoughts</LINE>
  • <LINE N="4">Bring sad thoughts to the mind.</LINE>
  • </STANZA>
  • <STANZA>
  • <LINE N="5">To her fair works did nature link</LINE>
  • <LINE N="6">The human soul that through me ran</LINE>
  • <LINE N="7">And much it griev'd my heart to think</LINE>
  • <LINE N="8">What man has made of man.</LINE>
  • </STANZA>
  • </POEM>
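
Once the structure is explicit, a program can pick out the title, the author, or a numbered line by name. A minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the workshop materials) and using only the first stanza; the function name summarize is made up for illustration:

```python
import xml.etree.ElementTree as ET

# First stanza of the poem, encoded with the element names used on the slide.
POEM_XML = """<?xml version="1.0"?>
<POEM>
  <TITLE>Lines Written in Early Spring</TITLE>
  <AUTHOR><FIRSTNAME>William</FIRSTNAME><LASTNAME>Wordsworth</LASTNAME></AUTHOR>
  <STANZA>
    <LINE N="1">I heard a thousand blended notes,</LINE>
    <LINE N="2">While in a grove I sate reclined,</LINE>
    <LINE N="3">In that sweet mood when pleasant thoughts</LINE>
    <LINE N="4">Bring sad thoughts to the mind.</LINE>
  </STANZA>
</POEM>"""


def summarize(xml_text):
    """Return (title, author, {line number: text}) by addressing elements by name."""
    poem = ET.fromstring(xml_text)
    title = poem.findtext("TITLE")
    author = " ".join(
        [poem.findtext("AUTHOR/FIRSTNAME"), poem.findtext("AUTHOR/LASTNAME")]
    )
    # The N attribute gives every line a handle that software can use.
    lines = {int(line.get("N")): line.text for line in poem.iter("LINE")}
    return title, author, lines


title, author, lines = summarize(POEM_XML)
print(title, "by", author)
print("Line 4:", lines[4])
```

The HTML version offers no comparable handle: a program would have to guess that the H2 holds the author and that each BR ends a verse line.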

16
Why XML is better: Reason 1
  • HTML: One single, fixed tag set
  • <H1>, <H2>, <P>, <IMG>, etc.
  • XML: You can define your own tag set
  • <POEM>, <STANZA>, <LINE N="18">.
  • <P>, <ABSTRACT>, <FOOTNOTE>, <BIBL_ENTRY>.
  • => Possible to describe the logical structure
    exactly.

17
Why XML is better: Reason 2
  • HTML: Lack of syntax control.
    <A><P>Hello</A></P><P> is considered OK.
  • XML: Documents have to be at least
    well-formed. <P><A>Hello</A></P> is the only
    acceptable form.
  • => Code much cleaner.

18
Why XML is better: Reason 3
  • HTML: Logical structure and display are mixed up
  • <P>, <H1>, <H2>.
  • This text is <FONT COLOR=blue>important</FONT>.
  • XML: Clear distinction between logical structure
    and display
  • This text is <EMPH>important</EMPH>.
  • <EMPH> = <FONT COLOR=blue>
  • => Code much cleaner.

19
By the way, HTML is not that bad
  • HTML:
  • Really simple: attractive to basic users.
  • Works fine for basic Web pages.
  • XML:
  • Clearly more complex: will scare off basic users.
  • Probably overkill for basic Web pages.

20
What will XML change? Or, why do we need to
make the logical structure explicit?
21
Different displays for different output devices
  • Regular computer screens,
  • Pocket computers, Palm Pilot,
  • WebTV,
  • Audio (visually-impaired, cars),
  • Braille,
  • Print.

22
Term Clarification
  • The Web is based on a client/server architecture.

23
Server-side: Databases should speak to each other
  • A very successful model:
  • relational databases on the server side.
  • Next step: data integration.
  • Example 1: An online bookstore,
  • Example 2: Medical records,
  • Example 3: An index that knows which journals
    are available in the library.

24
XML: representing structured data
  • If XML can represent structured text, it can also
    represent structured data.
  • XML is also very good at representing mixed data
    seamlessly.

25
XML for Interchange: Example of a converted R-DB
record
  • <?xml version="1.0"?>
  • <PACKAGE>
  • <ID>33456</ID>
  • <CATEGORY>Next day delivery</CATEGORY>
  • <SHIPPING_COST>15</SHIPPING_COST>
  • <LOC_DEPARTURE>New York City</LOC_DEPARTURE>
  • <LOC_ARRIVAL>Pittsburgh</LOC_ARRIVAL>
  • <DATE_DEPARTURE>07/30/1999</DATE_DEPARTURE>
  • <DATE_ARRIVAL>07/31/1999</DATE_ARRIVAL>
  • </PACKAGE>
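
A conversion of this kind can be sketched in a few lines. The following is only an illustration, assuming Python's standard xml.etree.ElementTree and a database row already read into a dictionary; the function name row_to_xml is hypothetical, and the column names are the ones on the slide:

```python
import xml.etree.ElementTree as ET


def row_to_xml(table, row):
    """Wrap one relational row (column name -> value) in an element named after the table."""
    record = ET.Element(table)
    for column, value in row.items():
        # One child element per column; every value is stored as text.
        ET.SubElement(record, column).text = str(value)
    return record


row = {
    "ID": 33456,
    "CATEGORY": "Next day delivery",
    "SHIPPING_COST": 15,
    "LOC_DEPARTURE": "New York City",
    "LOC_ARRIVAL": "Pittsburgh",
}
xml_text = ET.tostring(row_to_xml("PACKAGE", row), encoding="unicode")
print(xml_text)
```

The receiving side needs no knowledge of the sender's database: it reads the record back by element name.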

26
Client-side: The Web, more than an online
fax machine
  • Web browsers: thin clients
  • They just display documents.
  • Clients can do more:
  • Client workstation has a lot of unused power,
  • Less strain on the network and on the server,
  • Example: Viewing and sorting of a medical record.

27
Client-side: The Web, more than an online
fax machine (2)
  • Clients can do more:
  • Personalized and sophisticated processing
    possible.
  • Processing possibly provided by 3rd-party client
    applications.
  • Example: Bibliography manager.

28
XML: The nitty-gritty details
29
Term Clarification
  • Element,
  • Tag (opening tag / closing tag / delimiter),
  • Element content,
  • Attribute (name / value).
  • Example:
  • <AUTHOR TYPE="novelist">John Smith</AUTHOR>

30
Differences in Syntax between XML and HTML
  • XML declaration: <?xml version="1.0"?>
  • Every opening tag must have a closing tag.
  • Empty tags have a different syntax: <BR/>
  • Tags are case sensitive: <STANZA> is different from
    <stanza>

31
2-Level Syntax Control
  • XML documents can be:
  • Well-formed,
  • Valid.

32
Syntax Control: Well-formed documents
  • All XML documents must be well-formed.
  • XML parsers check the well-formedness.
  • Criteria of well-formedness:
  • Every opening tag must have a closing tag.
    Illegal: <P>Hello
  • No overlapping elements. Illegal:
    <DIV><P>Hello</DIV></P>
  • One unique root element.
33
Tree Representation
  • POEM (root)
  • Children of POEM: TITLE, AUTHOR, STANZA, STANZA
  • Children of AUTHOR: FIRSTNAME, LASTNAME
  • Children of each STANZA: LINE, LINE, LINE

34
Create your own XML document
  • The cooking recipe document
  • 1. Brainstorming on the structure of the
    document,
  • 2. Creation of the document with a template.

35
Editing XML Documents
  • Textpad + Internet Explorer 5 as a parser.
  • Caution: IE5 comes with limitations and
    proprietary features.
  • Alternative:
  • XML editor (e.g., SoftQuad's XMetaL).

36
To get started
  • Create file in Textpad and load it in IE5.
  • File extension: .xml.
  • Save regularly and reload in IE5.
  • Begin with:
  • <?xml version="1.0"?>
  • <RECIPE>
  • </RECIPE>

37
Document Type Definitions (DTD)
38
Document Type Definitions (DTD)
  • Formal way of defining the tags used in a series
    of documents.
  • A DTD:
  • specifies a list of tags,
  • defines the relationships between these tags.
  • Allows us to create consistency across a
    collection of documents (e.g., 5000 poems).

39
What does a DTD look like?
  • <!ELEMENT POEM (TITLE, AUTHOR, STANZA+)>
  • <!ELEMENT TITLE (#PCDATA)>
  • <!ELEMENT AUTHOR (FIRSTNAME, LASTNAME)>
  • <!ELEMENT FIRSTNAME (#PCDATA)>
  • <!ELEMENT LASTNAME (#PCDATA)>
  • <!ELEMENT STANZA (LINE+)>
  • <!ELEMENT LINE (#PCDATA)>
  • <!ATTLIST LINE N CDATA #REQUIRED>

40
Creating a DTD
  • Non-trivial task.
  • Higher level of expertise needed than for using a
    DTD:
  • In-depth knowledge of XML,
  • In-depth knowledge of the type of documents being
    described.
  • Preliminary Document Analysis.
  • A DTD can be dozens of pages long.

41
Syntax Control: Valid documents
  • Higher level of control than well-formed
    documents.
  • An XML document is valid if it conforms to its
    DTD.
  • To validate an XML Document, it is necessary to
    declare the name and location of its DTD.

42
XML DTD declaration
  • The DTD should be declared at the top of the XML
    document.
  • Local file:
  • <?xml version="1.0" standalone="no"?>
  • <!DOCTYPE POEM SYSTEM "poem.dtd">
  • URL:
  • <?xml version="1.0" standalone="no"?>
  • <!DOCTYPE POEM SYSTEM "http://scc01.rutgers.edu/ceth/intromat/xml/samples/poem/poem.dtd">

43
Validation with IE5
  • When loading a document, the IE5 parser does not
    validate it.
  • Possible to validate a document through a script.
  • Possible also to use a separate validating
    parser.
  • For instance, the Scholarly Technology Group's
    XML parser at Brown U.
  • (http://www.stg.brown.edu/service/xmlvalid/).
  • Validating vs. non-validating parsers.

44
Validation Strategy
  • For now, the best model:
  • When creating documents: use a validating parser
  • (like the Scholarly Technology Group's XML
    Parser).
  • When users download them: the parser only checks
    that they are well-formed.

45
Namespaces
  • Need to use elements from several DTDs in the
    same document.
  • Scheme to identify the source of each element.
  • Special case: the same element name used by 2 DTDs.

46
Namespace Example
  • <book xmlns:books='urn:loc.gov:books'
  • xmlns:isbn='url:ISBN:http://www.isbn.org/isbndtd'>
  • <books:title>Cheaper by the Dozen</books:title>
  • <isbn:number>1568491379</isbn:number>
  • <books:notes>This is a funny book</books:notes>
  • </book>
  • Note: Adapted from example in the Namespaces
    recommendation.

47
Namespace Example (2): Default Namespace
  • <book xmlns='urn:loc.gov:books'
  • xmlns:isbn='url:ISBN:http://www.isbn.org/isbndtd'>
  • <title>Cheaper by the Dozen</title>
  • <isbn:number>1568491379</isbn:number>
  • <notes>This is a funny book</notes>
  • </book>
  • Note: Adapted from example in the Namespaces
    recommendation.
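
The special case of two DTDs using the same element name can be tried out with a namespace-aware parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree; urn:loc.gov:books comes from the Namespaces recommendation, while the catalog element, the people prefix, and urn:example:people are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define an element called "title".
CATALOG_XML = """<catalog xmlns:books='urn:loc.gov:books'
         xmlns:people='urn:example:people'>
  <books:title>Cheaper by the Dozen</books:title>
  <people:title>Dr.</people:title>
</catalog>"""

# Prefix-to-URI mapping used for lookups; the prefixes are local shorthand only.
NS = {"books": "urn:loc.gov:books", "people": "urn:example:people"}

root = ET.fromstring(CATALOG_XML)
book_title = root.findtext("books:title", namespaces=NS)
honorific = root.findtext("people:title", namespaces=NS)

# Internally each name is expanded to {namespace URI}local-name.
expanded_names = [child.tag for child in root]

print(book_title, "/", honorific)
print(expanded_names)
```

The two title elements never collide because the parser compares the expanded names, not the prefixes.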

48
More good things about XML
49
Positive side-effects of XML (1)
  • XML fosters the development of community-based
    standards.
  • Concept of a 2-level standard, very powerful:
  • XML: universal,
  • DTDs: community-specific.
  • Now, developing a new standard amounts to writing
    a DTD.
  • Much easier than starting from scratch.
  • E.g., Xlit.

50
Positive side-effects of XML (2)
  • Widespread standards are stronger than those
    used by a limited community (regardless of their
    intrinsic value).
  • HL7 --> XML.
  • Easier to hire programmers.
  • More documentation available.
  • Actively maintained by very large base of people.

51
Positive side-effects of XML (3)
  • A set of standards bundled together is stronger
    than an isolated one.
  • Likely to appeal to more people (the Microsoft
    Office idea).
  • The standards reinforce each other.

52
Stylesheet Languages for XML
53
Stylesheet Languages for XML
  • Specify how to display logical elements.
  • XML supports 2 stylesheet languages:
  • CSS:
  • Quite limited,
  • But eases transition HTML --> XML.
  • XSL:
  • Very powerful,
  • Still a working draft.

54
Extensible Stylesheet Language (XSL)
  • 2 parts:
  • Transformations:
  • Transform the XML document (reorder, hide, add
    elements).
  • Formatting Objects (FO):
  • Attach formatting properties to XML elements.

55
XSL in IE 5.0
  • Supports transformations but not the FO.
  • Trick: transform DTD-specific XML elements into
    HTML elements.
  • Convenient because everybody knows HTML.

56
XSL-to-HTML Stylesheets: Syntax
  • Style Sheet Excerpt:
  • <xsl:template match="BOOK">
  • <P><xsl:apply-templates/></P>
  • </xsl:template>
  • <xsl:template match="AUTHOR">
  • <B><xsl:value-of/>:</B>
  • </xsl:template>
  • <xsl:template match="TITLE">
  • <xsl:value-of/>
  • </xsl:template>
  • XML Document Excerpt:
  • <BOOK>
  • <AUTHOR>Mary Brown</AUTHOR>
  • <TITLE>Easy Cooking</TITLE>
  • </BOOK>
  • <BOOK>
  • <AUTHOR>John Smith</AUTHOR>
  • <TITLE>101 Recipes</TITLE>
  • </BOOK>
  • <BOOK>
  • <AUTHOR>Sue Meyer</AUTHOR>
  • <TITLE>Italian Cuisine</TITLE>
  • </BOOK>
  • HTML Output:
  • <P><B>Mary Brown:</B> Easy Cooking</P>
  • <P><B>John Smith:</B> 101 Recipes</P>
  • <P><B>Sue Meyer:</B> Italian Cuisine</P>
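
The template mechanism can be imitated in a few lines to see how the rules combine. This is not an XSL processor, only a hand-rolled sketch in Python (standard library only): each element name maps to a rule, and the BOOK rule hands its children back to the rule table the way apply-templates does. The BOOKS wrapper element and all function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<BOOKS>
  <BOOK><AUTHOR>Mary Brown</AUTHOR><TITLE>Easy Cooking</TITLE></BOOK>
  <BOOK><AUTHOR>John Smith</AUTHOR><TITLE>101 Recipes</TITLE></BOOK>
  <BOOK><AUTHOR>Sue Meyer</AUTHOR><TITLE>Italian Cuisine</TITLE></BOOK>
</BOOKS>"""


def apply_templates(element):
    """Transform the child elements in document order and join the results."""
    return " ".join(transform(child) for child in element)


# One rule per element name, mirroring the three templates on the slide.
TEMPLATES = {
    "BOOK": lambda e: "<P>" + apply_templates(e) + "</P>",
    "AUTHOR": lambda e: "<B>" + e.text + ":</B>",
    "TITLE": lambda e: e.text,
}


def transform(element):
    """Apply the rule that matches the element name; with no rule, just descend."""
    rule = TEMPLATES.get(element.tag, apply_templates)
    return rule(element)


html_lines = [transform(book) for book in ET.fromstring(BOOKS_XML).iter("BOOK")]
for line in html_lines:
    print(line)
```

Changing the display means editing the rule table only; the XML document is left alone.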

57
Beginning of an XSL-to-HTML Stylesheet
  • <?xml version='1.0'?>
  • <xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  • <xsl:template match="/">
  • <xsl:apply-templates/>
  • </xsl:template>
  • <xsl:template match="POEM">
  • <HTML>
  • <BODY>
  • <xsl:apply-templates/>
  • </BODY>
  • </HTML>
  • </xsl:template>
  • </xsl:stylesheet>

58
Example of XSL-to-HTML Stylesheet
  • See poem.xsl at
  • http://scc01.rutgers.edu/ceth/intromat/xml/samples/poem/poem.xsl

59
Declaring an XSL Stylesheet in an XML document
  • Just after the XML declaration (and the DTD
    declaration if there is one).
  • Local file:
  • <?xml-stylesheet type="text/xsl" href="poem.xsl"?>
  • URL:
  • <?xml-stylesheet type="text/xsl" href="http://scc01.rutgers.edu/ceth/intromat/xml/samples/poem/poem.xsl"?>

60
Creating your own Stylesheet
  • The XSL-to-HTML recipe stylesheet
  • XSL stylesheets can be tricky.
  • Always use another stylesheet as a model.
  • Name the file recipe.xsl.
  • Make sure to declare it in the XML document.
  • <?xml-stylesheet type="text/xsl" href="recipe.xsl"?>
  • Always add one template at a time, and reload in
    IE5 to make sure it works.

61
Recipe Stylesheet: Step 1
  • <?xml version="1.0"?>
  • <xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  • <xsl:template match="/">
  • <xsl:apply-templates/>
  • </xsl:template>
  • </xsl:stylesheet>

62
Recipe Stylesheet: Step 2
  • <xsl:template match="RECIPE">
  • <HTML>
  • <BODY BGCOLOR="#FFFFCC">
  • <xsl:apply-templates/>
  • </BODY>
  • </HTML>
  • </xsl:template>
  • <xsl:template match="TITLE">
  • <H1><CENTER><FONT COLOR="red">
  • <xsl:value-of/>
  • </FONT></CENTER></H1>
  • </xsl:template>

63
Recipe Stylesheet: Step 3 and after
  • For the rest of the stylesheet, see the sample
    recipe.xsl at
  • http://scc01.rutgers.edu/ceth/intromat/xml/samples/recipe/recipe.xsl

64
Recipe Stylesheet: Advanced
  • Sorting the ingredients:
  • <xsl:template match="INGREDIENTLIST">
  • <HR/>
  • <H2><FONT COLOR="red">Ingredients</FONT></H2>
  • <UL><xsl:apply-templates select="INGREDIENT"
    order-by="PRODUCT/@AISLE; PRODUCT"/></UL>
  • <HR/>
  • </xsl:template>
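
The effect of this two-key sort (aisle first, then product name) can be checked outside the browser. A minimal sketch, assuming Python's standard xml.etree.ElementTree; the ingredient data and the function name sort_key are invented for illustration, and only the INGREDIENT, PRODUCT and AISLE names come from the slide:

```python
import xml.etree.ElementTree as ET

# Invented sample data using the element and attribute names from the slide.
INGREDIENTS_XML = """<INGREDIENTLIST>
  <INGREDIENT><PRODUCT AISLE="dairy">milk</PRODUCT></INGREDIENT>
  <INGREDIENT><PRODUCT AISLE="baking">sugar</PRODUCT></INGREDIENT>
  <INGREDIENT><PRODUCT AISLE="dairy">butter</PRODUCT></INGREDIENT>
  <INGREDIENT><PRODUCT AISLE="baking">flour</PRODUCT></INGREDIENT>
</INGREDIENTLIST>"""


def sort_key(ingredient):
    """Sort by the AISLE attribute of PRODUCT first, then by the product name."""
    product = ingredient.find("PRODUCT")
    return (product.get("AISLE"), product.text)


ingredients = ET.fromstring(INGREDIENTS_XML).findall("INGREDIENT")
ordered = [item.findtext("PRODUCT") for item in sorted(ingredients, key=sort_key)]
print(ordered)
```

The shopping list comes out grouped by aisle, which is only possible because the aisle is recorded explicitly in the markup.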

65
XML Formatting Objects: Example
  • <xsl:template match="title">
  • <fo:block font-weight="bold" font-color="rgb(0,255,255)"
    font-size="16pt">
  • <xsl:apply-templates/>
  • </fo:block>
  • </xsl:template>
  • Note: Adapted from stylesheet created by Lynn
    Lobash.

66
Some Other XML-related Standards
67
Linking Standards
  • HTML links:
  • Really primitive and limited.
  • Linking standards for XML:
  • Much more powerful.
  • 2 parts:
  • XLink (aka XLL),
  • XPointer (aka XLP).
  • Still working drafts.

68
XLink
  • To define links to one or several documents.
  • 2 types of links:
  • Simple,
  • Extended.

69
XLink: Simple link
  • Example:
  • <related_poem xml:link="simple" inline="false"
    href="poem1.xml">Go to related poem</related_poem>
  • Other attributes / Alternative values:
  • inline: true, false (link to same document vs.
    outside).
  • show: replace, new, embed.
  • actuate: user, auto.
  • title (a caption).
  • Similar to HTML links, but slightly fancier.

70
XLink: Simple links (2)
  • Example 2:
  • anchor:
  • <poem_anchor xml:link="simple" role="poem312"><title>Blue Mountain</title></poem_anchor>
  • link:
  • <related_poem xml:link="simple" inline="false"
    href="poem1.xml#poem312">Go to related
    poem</related_poem>
  • Similar to HTML <A NAME="poem312">.

71
XLink: Extended Link
  • One link, several targets.
  • For instance, the link "See related poems" would
    open as a list of links in a pop-up window.

72
XLink: Extended Link (2)
  • Example:
  • <related_poems xml:link="extended" inline="false"
    title="See related poems">
  • <poem_target xml:link="locator" inline="false"
    title="Blue Mountains" href="poem1.xml"/>
  • <poem_target xml:link="locator" inline="false"
    title="Pink Flowers" href="poem2.xml"/>
  • <poem_target xml:link="locator" inline="false"
    title="Sea of Green" href="poem3.xml"/>
  • </related_poems>

73
XPointer
  • To define links that target points within
    documents.
  • Special language to explain which spot is
    targeted.
  • In HTML:
  • Need to manually insert a tag <A NAME>.
  • Hence need to own the document.
  • With XPointer:
  • No need to add anything to the target document.

74
XPointer (2)
  • Example:
  • <related_poem xml:link="simple" inline="false"
    href="poem1.xml#root().child(2)">Go to related
    poem</related_poem>
  • Other possibilities:
  • root().child(3).child(4)
  • id(poem273)
  • root().descendant(2, stanza)
  • root().string(1, my heart)
  • span(root().child(3), root().child(5))
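
How such a pointer walks the document tree can be imitated with a toy resolver. This sketch, assuming Python's standard library, handles only root() followed by child(n) steps, counts child elements from 1, and is in no way an implementation of the XPointer draft; the function name resolve is hypothetical:

```python
import re
import xml.etree.ElementTree as ET

POEM_XML = """<POEM>
  <TITLE>Lines Written in Early Spring</TITLE>
  <AUTHOR><FIRSTNAME>William</FIRSTNAME><LASTNAME>Wordsworth</LASTNAME></AUTHOR>
  <STANZA>
    <LINE N="1">I heard a thousand blended notes,</LINE>
    <LINE N="2">While in a grove I sate reclined,</LINE>
    <LINE N="3">In that sweet mood when pleasant thoughts</LINE>
    <LINE N="4">Bring sad thoughts to the mind.</LINE>
  </STANZA>
</POEM>"""

CHILD_STEP = re.compile(r"child\((\d+)\)")


def resolve(root, pointer):
    """Follow root().child(n).child(m)... down the tree and return the target element."""
    steps = pointer.split(".")
    if steps[0] != "root()":
        raise ValueError("this sketch only handles pointers starting at root()")
    node = root
    for step in steps[1:]:
        match = CHILD_STEP.fullmatch(step)
        if match is None:
            raise ValueError("unsupported step: " + step)
        # child(n) is taken here as the n-th child element, counting from 1.
        node = list(node)[int(match.group(1)) - 1]
    return node


poem = ET.fromstring(POEM_XML)
print(resolve(poem, "root().child(2)").tag)
print(resolve(poem, "root().child(3).child(4)").text)
```

Nothing was added to the target document: the pointer relies only on the position of elements in the tree.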

75
Resource Description Framework (RDF)
  • Defines syntax for describing resources.
  • Metadata:
  • Similar to information in OPAC records.
  • Essential for identification and retrieval of
    documents.
  • Unlimited nesting.

76
RDF Example
  • <rdf:RDF>
  • <rdf:Description about="http://www.w3.org">
  • <Publisher>World Wide Web Consortium</Publisher>
  • <Title>W3C Home Page</Title>
  • <Date>1998-10-03T02:27</Date>
  • </rdf:Description>
  • </rdf:RDF>
  • Note: Adapted from example in the RDF
    recommendation.

77
RDF Example (with nesting)
  • <rdf:RDF>
  • <rdf:Description about="http://www.w3.org">
  • <Publisher>World Wide Web Consortium</Publisher>
  • <Title>W3C Home Page</Title>
  • <Date>1998-10-03T02:27</Date>
  • <Creator>
  • <Person about="http://www.w3.org/staffId/85740">
  • <Name>Ora Lassila</Name>
  • <Email>lassila@w3.org</Email>
  • </Person>
  • </Creator>
  • </rdf:Description>
  • </rdf:RDF>
  • Note: Adapted from example in the RDF
    recommendation.

78
RDF in practice
  • RDF defines only the syntax, not the content.
  • Can accommodate most document description
    schemes.
  • Example of use: implementation of Dublin Core.
  • Example of industry support: ABC News, CNN and
    Time Inc.
  • Both for HTML and XML documents.
  • Details of implementation not entirely clear.

79
XML Schemas
  • An alternative to DTDs.
  • Still a working draft.
  • Easier because it uses XML syntax.
  • Data typing possible, unlike DTDs
  • (integer, floating-point number, date, string, etc.).

80
XML Schema Example
  • <?xml version="1.0"?>
  • <Schema name="poemSchema"
  • xmlns="urn:schemas-microsoft-com:xml-data"
  • xmlns:dt="urn:schemas-microsoft-com:datatypes">
  • <ElementType name="POEM" content="eltOnly"
    model="closed">
  • <element type="TITLE"/>
  • <element type="AUTHOR"/>
  • <element type="STANZA" maxOccurs="*"/>
  • </ElementType>
  • <ElementType name="TITLE" content="textOnly"
    model="closed" dt:type="string"/>
  • <ElementType name="AUTHOR" content="eltOnly"
    model="closed">
  • <element type="FIRSTNAME"/>
  • <element type="LASTNAME"/>
  • </ElementType>
  • ...
  • </Schema>

81
Unicode
  • Default character encoding for XML.
  • Great improvement for encoding of non-western
    languages:
  • more than 65,000 characters,
  • Eventually will represent all alphabets and
    writing systems,
  • Also includes special typographic characters
    (e.g., ¼).

82
SGML, XML, HTML: What is the difference?
  • XML = SGML slightly simplified.
  • HTML = just an SGML DTD.
  • Can be easily converted to an XML DTD.
  • Relationship:
  • XML and SGML are meta-languages,
  • HTML is a language.

83
Searching XML Documents
84
Models for XML Repositories: Flat file system
  • A bunch of XML documents in a folder.
  • Native XML search engine:
  • an XML-aware Web site search engine.
  • XML Query Language: XQL
  • Still in development.
  • Find the word "milk" only when it appears in
    attribute DIETINFO2 of element PRODUCT.
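
The kind of structure-aware search described here can be approximated today with an ordinary XML parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree rather than XQL itself; the recipe data and the function name products_with_word are invented, and only the PRODUCT and DIETINFO2 names come from the slide:

```python
import xml.etree.ElementTree as ET

# Invented sample: "milk" appears in the title, in element text and in one attribute.
RECIPE_XML = """<RECIPE>
  <TITLE>Milk-free pancakes</TITLE>
  <INGREDIENT><PRODUCT DIETINFO2="contains milk">butter</PRODUCT></INGREDIENT>
  <INGREDIENT><PRODUCT DIETINFO2="vegan">soy milk</PRODUCT></INGREDIENT>
</RECIPE>"""


def products_with_word(xml_text, word, element="PRODUCT", attribute="DIETINFO2"):
    """Return the elements whose given attribute contains the word (case-insensitive)."""
    root = ET.fromstring(xml_text)
    return [
        node
        for node in root.iter(element)
        if word.lower() in node.get(attribute, "").lower().split()
    ]


hits = [node.text for node in products_with_word(RECIPE_XML, "milk")]
print(hits)
```

A plain full-text search for "milk" would match all three places; the structured query matches only the butter entry.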

85
Models for XML Repositories: Regular relational
databases
  • E.g., Web-based OPACs, Ovid, Amazon.
  • Back-end relational DBMS:
  • MS Access or Oracle, for instance.
  • Web interface:
  • uses scripts like CGI or Cold Fusion,
  • Easy to change the scripts to output XML instead
    of HTML,
  • Can even produce XML OR HTML according to the
    capabilities of the requesting browser.

86
Models for XML Repositories: XML-aware relational
DBs
  • Benefit from R-Databases AND XML advantages.
  • Mixed record:
  • Nested structured text difficult to map to R-DB.
  • However, many structured texts have a table-like
    section (the bibliographic information).
  • R-Databases very mature technology (data
    integrity, security, load balance, etc.).

87
Models for XML Repositories: XML-aware relational
DBs (2)
  • Example of Oracle:
  • Enhanced full-text capabilities:
  • indexing,
  • truncations, stemming, thesaurus, etc.,
  • XML-like searching,
  • can create SQL queries with embedded XML
    subqueries.
  • Automatic mapping:
  • R-DB record --> XML document,
  • XML document --> R-DB record,
  • Virtual flat file system.

88
Information Retrieval Standard for XML
  • Needed to implement cross-repository search:
  • To query across several XML servers seamlessly,
  • Whatever the implementation on the server side
    (Flat file system, R-DBMS, etc.).

89
Information Retrieval Standard: Z39.50
  • Used in the library community.
  • To query OPACs, indexes, etc.
  • Possible to specify:
  • A Query Language,
  • The format of the results,
  • A session protocol.

90
Information Retrieval Standard: Z39.50 and XML
  • Currently beginning to integrate XML:
  • Defined as a possible output format,
  • Some proposals to use XML as an alternative to
    BER for overall Z39.50 syntax.
  • Once XQL is stabilized it could be ported to
    Z39.50.
  • Good candidate to become the IR-Standard for XML.
  • Little known outside the library community.

91
XML in Libraries
92
Which library projects are already using XML/SGML?
  • Mostly academic institutions.
  • (as well as Library of Congress and NYPL.)
  • Usually in SGML.
  • (Very recent ones in XML.)
  • Mostly:
  • large and long-term digitization projects,
  • involving the digitization of numerous texts.
  • Converted to HTML on-the-fly.

93
Text Encoding Initiative (TEI)
  • Standard to encode primary sources in the
    Humanities.
  • SGML-based. (It is an SGML DTD.)
  • Currently being converted to XML.
  • Maintained by TEI Consortium.
  • Widely adopted in Humanities computing community.
  • Has spread to libraries.

94
Examples of TEI Projects
  • Special collections:
  • Library of Congress's American Memory Project,
  • Literary texts:
  • U. of Virginia's E-text Collection,
  • Brown's Women Writers Project,
  • Historical editions (MEP DTD):
  • Abraham Lincoln Papers,
  • Susan B. Anthony Papers.

95
Encoded Archival Description (EAD)
  • Finding Aids to Special Collections and Archives.
  • SGML/XML-based standard. (It is a DTD.)
  • Maintained by the Library of Congress.
  • Widely adopted.

96
Examples of EAD Projects
  • Among many others: California Digital Library's
    Online Archive of California.
  • Union database: RLG's Archival Resources Project
  • (MARC AMC records and EAD finding aids).

97
Materials Used by Libraries
  • Reference Materials:
  • Oxford English Dictionary,
  • American National Biography,
  • Electronic Journals:
  • Springer-Verlag's Link.

98
XML in Libraries: What will it change? (1)
  • EAD finding aids:
  • Offer precise and controlled search capabilities,
  • Make the creation of union databases possible.

99
XML in Libraries: What will it change? (2)
  • Full-text databases of primary sources:
  • Easy to search, display, etc.
  • Next step, union databases.
  • With precise and controlled search capabilities.
  • Full-text databases of e-journals, monographs.
  • Competition with PDF/page images, though.
  • Again next step, union databases.

100
XML in Libraries: What will it change? (3)
  • More sophisticated and customized clients:
  • Bibliography manager,
  • Concordance program.
  • New library standards based on XML:
  • TEI, EAD
  • MARC (!)
  • XML is not just a fad: more than 10 years of
    SGML-based TEI.

101
XML in Libraries: What will it change? (4)
  • XML is more likely than any other format to
    resist obsolescence:
  • Platform independent,
  • Open standard
  • (not proprietary),
  • Written in ASCII/Unicode plain text
  • (no binary encoding, the simplest text editor can
    read it),
  • Tags are human-readable.

102
Web-based reference: What will XML change? (1)
  • Topic-specific meta-search engines
  • e.g., job search or book search.
  • Will become ubiquitous.
  • Already exist but awkward for developers.
  • Small communities --> can agree on a DTD.
  • See the agreement of CNN et al. on RDF.
  • Can also work without common DTD.
  • In the vendors' interest.
  • All database-based --> easy conversion.

103
Web-based reference: What will XML change? (2)
  • General search will not improve for a long time:
  • A lot of legacy data:
  • The whole current WWW!
  • Numerous users will not switch to XML.
  • Especially basic users.
  • How to deal with thousands of different DTDs?

104
Should you use XML in your project today?
  • Are your data made of a repetition of similar
    objects? (e.g., 3000 poems)
  • Is your project database-based?
  • Is your project large?
  • Do you plan to:
  • deliver to different output devices?
  • integrate your project with others? (e.g. union
    database)
  • develop advanced capabilities? (server-side or
    client-side)