XML Information Retrieval

1
XML Information Retrieval
  • Hui Fang
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

Some slides are borrowed from Norbert Fuhr's XML
Tutorial.
2
Outline
  • XML basics
  • Research Topics
  • XML IR
      • Tasks
      • Retrieval methods
  • Clustering XML documents

3
XML standards
4
Basic XML
  • Hierarchical document format for information
    exchange on the WWW
  • Self-describing data (tags)
  • Nested element structure with a single root
  • Element data can have
      • Attributes
      • Sub-elements

(Slides from Jayavel Shanmugasundaram)
5
Example XML document
  <?xml version="1.0" encoding="ISO-8859-1" ?>
  <!-- Edited with XML Spy v4.2 -->
  <book>
    <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
    <author id="rbelew">
      <name>
        <firstname>Richard</firstname>
        <lastname>Belew</lastname>
      </name>
      <address>
        <city>San Diego</city>
        <zip>92093</zip>
      </address>
    </author>
  </book>

6
Tree structure of XML documents
book
  title: Finding...
  author (id = rbelew)
    name
      First name: Richard
      Last name: Belew
    address
      city: San Diego
      Zip code: 92093
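The tree view above can be reproduced programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `tree_lines` is hypothetical, and the book title is shortened for readability.

```python
import xml.etree.ElementTree as ET

# The example document from the previous slide (title shortened).
BOOK_XML = """<book>
  <title>Finding Out About</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>"""


def tree_lines(elem, depth=0):
    """Return one indented line per element: tag, attributes, and leaf text."""
    label = elem.tag
    if elem.attrib:
        label += " " + " ".join(f'{k}="{v}"' for k, v in elem.attrib.items())
    text = (elem.text or "").strip()
    if text:
        label += ": " + text
    lines = ["  " * depth + label]
    for child in elem:  # iterating an element yields its direct children
        lines.extend(tree_lines(child, depth + 1))
    return lines


lines = tree_lines(ET.fromstring(BOOK_XML))
print("\n".join(lines))
```

Each nesting level of the XML becomes one level of indentation, which is the hierarchy the retrieval methods later in the deck operate on.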
7
The basic XML standard does not deal with
  • Standardization of element names
    → XML namespaces
  • Structure of element content
    → XML DTDs
  • Data types of element content
    → XML Schema

8
XML namespace
Provide a method to avoid element name conflicts

  <table>
    <tr>
      <td>Apples</td>
      <td>Bananas</td>
    </tr>
  </table>

  <table>
    <name>GPA Table</name>
    <width>80</width>
    <length>120</length>
  </table>

9
XML namespace(Cont.)
Provide a method to avoid element name conflicts
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr>
      <h:td>Apples</h:td> <h:td>Bananas</h:td>
    </h:tr>
  </h:table>

  <f:table xmlns:f="http://www.w3schools.com/gpa">
    <f:name>GPA Table</f:name> <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
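A short sketch of how a namespace-aware parser keeps the two `table` elements apart. It uses Python's standard `xml.etree.ElementTree`; the wrapping `<root>` element is an assumption added only so that both fragments fit in one well-formed document.

```python
import xml.etree.ElementTree as ET

# Both tables in one document; <root> is a hypothetical wrapper element.
DOC = """<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:f="http://www.w3schools.com/gpa">
  <h:table><h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr></h:table>
  <f:table><f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length></f:table>
</root>"""

# Prefix-to-URI map used when searching; the prefixes themselves are arbitrary.
NS = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/gpa"}

root = ET.fromstring(DOC)
html_table = root.find("h:table", NS)
gpa_table = root.find("f:table", NS)

# ElementTree stores the expanded name "{namespace-uri}localname".
print(html_table.tag)
print(gpa_table.tag)

cells = [td.text for td in html_table.findall("h:tr/h:td", NS)]
table_name = gpa_table.find("f:name", NS).text
```

The parser identifies elements by namespace URI plus local name, so the two `table` elements no longer collide even though their local names are identical.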

10
XML Document Type Definition
Define the document structure with a list of
legal elements

  <?xml version="1.0"?>
  <!DOCTYPE note SYSTEM "note.dtd">
  <note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Have a rest!</body>
  </note>

note.dtd:

  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
11
Research Topics related to XML
12
Research Topics
  • IR areas
      • Retrieval Models
      • Query Languages
  • DB areas
      • Query Languages
      • System architecture
      • Apply relational DB technology to XML data
      • Streaming XML
      • XML Query Processing
      • XML indexing and compression

13
XML IR
14
INEX: Initiative for the Evaluation of XML
Retrieval
  • Documents: 12,107 articles in XML format
  • Queries: 30 content-only (CO) and
    30 content-and-structure (CAS)
  • Relevance assessments: by participating groups
  • Participants: 36 active groups in 2003

15
CO search task
  • Document as a hierarchical structure of nested
    elements
  • The type of elements is not considered
  • The query refers to content only
  • Query syntax as in standard text retrieval
  • Task: find the smallest subtree (element)
    satisfying the query
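The CO task can be illustrated with a deliberately simplified Boolean sketch: an element "satisfies" the query when its subtree text contains every query term, and only the most specific such elements are returned. Real CO retrieval ranks elements rather than filtering them; the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def terms_of(elem):
    """Lower-cased whitespace tokens of all text in the element's subtree."""
    return set(" ".join(elem.itertext()).lower().split())


def smallest_matches(elem, query_terms):
    """Return the most specific elements whose subtree contains every query term."""
    if not query_terms <= terms_of(elem):
        return []  # this subtree cannot satisfy the query
    deeper = [m for child in elem for m in smallest_matches(child, query_terms)]
    return deeper or [elem]  # prefer matching descendants over the element itself


ARTICLE = ET.fromstring("""<article>
  <sec><title>Augmented reality</title>
       <p>Augmented reality in medicine and surgery</p></sec>
  <sec><p>Unrelated text about databases</p></sec>
</article>""")

narrow = [e.tag for e in smallest_matches(ARTICLE, {"augmented", "medicine"})]
broad = [e.tag for e in smallest_matches(ARTICLE, {"augmented", "databases"})]
```

For the first query a single paragraph suffices, while the second query is only satisfied by the whole article, so the returned unit grows with the spread of the query terms.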

16
Example of CO Topic
  <INEX-Topic topic-id="45" query-type="CO" ct-no="056">
    <Title>
      <cw>augmented reality and medicine</cw>
    </Title>
    <Description>
      How virtual (or augmented) reality can contribute
      to improve the medical and surgical practice.
    </Description>
    <Narrative>
      In order to be considered relevant, a
      document/component must include considerations
      about applications of computer graphics and
      especially augmented (or virtual) reality to
      medicine (including surgery).
    </Narrative>
    <Keywords>
      Augmented virtual reality medicine surgery
      improve computer assisted aided image
    </Keywords>
  </INEX-Topic>

17
CAS search Task
  • Queries contain explicit references to the XML
    structure, by restricting
      • The context of interest
        <te>: target element
      • The context of certain search concepts
        (<cw>, <ce>) pairs

18
Example of CAS topic
  <INEX-Topic topic-id="09" query-type="CAS" ct-no="048">
    <Title>
      <te>article</te>
      <cw>non-monotonic reasoning</cw><ce>bdy/sec</ce>
      <cw>1999 2000</cw><ce>hdr//yr</ce>
      <cw>-calendar</cw><ce>tig/atl</ce>
      <cw>belief revision</cw>
    </Title>
    <Narrative>
      Retrieve all articles from the years 1999-2000
      that deal with works on non-monotonic reasoning.
      Do not retrieve CfPs/calendar entries.
    </Narrative>
    <Keywords>non-monotonic reasoning belief revision</Keywords>
  </INEX-Topic>

19
XML Retrieval Methods
  • XIRQL
  • XML query languages with IR-related features
  • Language models
  • JuruXML

20
XIRQL(I)
  • CO Approaches
  • Split document text into disjoint nodes
  • Index nodes separately
  • Aggregate indexing weights for higher-level
    elements (subtrees)

21
Index nodes as units for term weighting
  • Application of known indexing functions (e.g.
    tf-idf)
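A minimal sketch of treating index nodes, rather than whole documents, as the units for term weighting. The particular tf-idf variant (raw term frequency times log(N/df)) and the sample node contents are assumptions; the slides do not say which indexing function is used.

```python
import math
from collections import Counter


def tfidf_weights(index_nodes):
    """index_nodes maps a node id to the list of tokens in that node's own text."""
    n = len(index_nodes)
    df = Counter()  # number of index nodes containing each term
    for tokens in index_nodes.values():
        df.update(set(tokens))
    weights = {}
    for node_id, tokens in index_nodes.items():
        tf = Counter(tokens)
        weights[node_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


# Hypothetical index nodes of one document, each indexed separately.
nodes = {
    1: ["xml", "retrieval"],
    2: ["xql", "syntax", "xql"],
    3: ["xql", "examples"],
}
weights = tfidf_weights(nodes)
```

Because the statistics are computed per index node, a term that is frequent in one section but rare across nodes gets a high weight for that section only.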

22
Index nodes for relevance-oriented search
[Figure: example document tree. A document with class="H.3.3", author "John Smith" and title "XML Retrieval" contains chapters with headings and sections (section headings include "Introduction", "Syntax" and "Examples"); the index nodes are numbered 1 to 5.]
Q1: syntax ∧ example    Q2: XQL
23
Combining weights by disjunction
chapter: 0.3 XQL
  section1: 0.5 example
  section2: 0.8 XQL, 0.7 syntax
Need to return the most specific element satisfying
the query!
Q1: syntax ∧ example    Q2: XQL
24
Combining weights with augmentation weight
chapter: 0.3 XQL
  section1: 0.5 example
  section2: 0.8 XQL, 0.7 syntax
Q2: XQL
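A small worked sketch of the two combination schemes for query Q2 (XQL), using the weights from the slide: the chapter's own text has weight 0.3 for XQL and section2 has weight 0.8. The augmentation value 0.6 is an assumption chosen for illustration, and the function names are hypothetical.

```python
def prob_or(p, q):
    """Probabilistic disjunction of two independent events."""
    return p + q - p * q


def propagated_weight(own, child_weights, augmentation=1.0):
    """Combine an element's own weight with its children's weights.

    Each child weight is scaled by the augmentation factor before the
    disjunction; augmentation=1.0 is the plain disjunction.
    """
    weight = own
    for w in child_weights:
        weight = prob_or(weight, augmentation * w)
    return weight


section2 = 0.8
chapter_plain = propagated_weight(0.3, [section2])        # about 0.86
chapter_damped = propagated_weight(0.3, [section2], 0.6)  # about 0.636
```

With plain disjunction the chapter (about 0.86) outranks section2 (0.8), although the section is the more specific answer. Scaling propagated weights by a factor below 1 brings the chapter down to about 0.636, so the most specific element ranks first.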
25
XIRQL(II)
  • CAS approaches
  • Extension of XQL by
  • Weighting and ranking
  • Data types with vague predicates
  • Structural relativism

26
XQL Expressions
  • Path conditions
      • search for single elements: heading
      • parent-child: chapter/heading
      • ancestor-descendant: chapter//section
      • document root: /book/
  • Filter w.r.t. structure
      • //chapter[heading]
  • Filter w.r.t. content
      • /document[@class="H.3.3" and author="John Smith"]
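The path and filter expressions can be tried out with Python's standard `xml.etree.ElementTree`, which implements a subset of XPath rather than XQL, so this is only an approximation. The sample document and the `<collection>` wrapper are assumptions; the `and` filter is written as two chained predicates because ElementTree has no `and` operator.

```python
import xml.etree.ElementTree as ET

COLLECTION = ET.fromstring("""<collection>
  <document class="H.3.3">
    <author>John Smith</author>
    <title>XML Retrieval</title>
    <chapter>
      <heading>Introduction</heading>
      <section><heading>Syntax</heading></section>
    </chapter>
    <chapter>
      <section><heading>Examples</heading></section>
    </chapter>
  </document>
</collection>""")
doc = COLLECTION.find("document")

# Single elements: every heading, at any depth.
all_headings = [h.text for h in doc.iter("heading")]
# Parent-child: headings that are direct children of a chapter.
chapter_headings = [h.text for h in doc.findall("chapter/heading")]
# Ancestor-descendant: sections anywhere below a chapter.
sections = doc.findall("chapter//section")
# Filter w.r.t. structure: chapters that have a heading child.
chapters_with_heading = doc.findall(".//chapter[heading]")
# Filter w.r.t. content: attribute value and child text must both match.
matches = COLLECTION.findall("document[@class='H.3.3'][author='John Smith']")
```

Only the first chapter has a heading of its own, so the structural filter selects one chapter even though three headings exist in the document.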

27
Data types with vague predicates
  • Compares two values of a specific data type
      • E.g. near, broader, narrower
  • Returns a (probabilistic) matching value
  • E.g. search for an artist named Ulbrich, living
    in Frankfurt, Germany, about 100 years ago
      • → Ernst Olbrich, Darmstadt, 1899
      • P(Olbrich ≈ Ulbrich) = 0.8 (phonetic similarity)
      • P(1899 ≈ 1903) = 0.9 (numeric similarity)
      • P(Darmstadt ≈ Frankfurt) = 0.7 (geographic
        distance)
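One way to turn the three matching values into a single score is to treat the predicates as independent and multiply them. The independence assumption and the function name are illustrative choices, not something the slide states.

```python
import math


def combined_match(probabilities):
    """Probability that all vague predicates match, assuming independence."""
    return math.prod(probabilities)


# Matching values from the example: name, year, and place.
olbrich = {"name": 0.8, "year": 0.9, "place": 0.7}
score = combined_match(olbrich.values())  # about 0.504
```

A record that matched the name exactly but was far off on year and place would be weighed against this value, so no single vague predicate acts as a hard filter.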

28
Semantic Relativism
  • Drop the distinction between attributes and elements
      • author searches for an attribute or an element
  • Generalize to data types
      • personname searches for attributes/elements of
        a specific data type

29
Language models
  • Generate language models for each node in the
    tree
  • Combine the children language models using linear
    interpolation
  • Use EM approach to train the linear interpolation
    parameters
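The three steps above can be sketched as follows, assuming unigram maximum-likelihood models per node and a flat two-child tree. All function names and the toy data are hypothetical.

```python
from collections import Counter


def unigram_lm(tokens):
    """Maximum-likelihood unigram model of one node's text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(models, weights):
    """Parent model as a linear interpolation of its children's models."""
    vocab = set().union(*models)
    return {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
            for w in vocab}


def em_weights(models, tokens, iterations=50):
    """Estimate interpolation weights by EM on held-out tokens."""
    k = len(models)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for w in tokens:
            joint = [lam * m.get(w, 0.0) for lam, m in zip(weights, models)]
            z = sum(joint)
            if z == 0.0:
                continue  # token unseen by every component model
            for j in range(k):
                expected[j] += joint[j] / z  # E-step: responsibility of model j
        total = sum(expected)
        if total == 0.0:
            break
        weights = [e / total for e in expected]  # M-step: renormalize
    return weights


title_lm = unigram_lm(["bird", "song"])
body_lm = unigram_lm(["dog", "cat", "dog", "bird"])
mixed = interpolate([title_lm, body_lm], [0.5, 0.5])
learned = em_weights([title_lm, body_lm], ["dog", "dog", "cat"])
```

In this toy case the training tokens occur only in the body model, so EM shifts all of the interpolation weight to the body component.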

30
Element-specific language models---CO Approaches
31
Higher-level nodes: mixture of language models
Query: dog and cat
32
Type-specific language models--- CAS approaches
33
  • Return components of type x that have a
    component y containing the query term w
  • E.g. return documents whose title contains
    the word "bird"
  • E.g. return documents where the body's first
    section contains the word "dog"
34
JuruXML
  • Element-specific indexing with the vector space model
  • Transform the query into a set of (term, path) conditions
  • Vague matching of path conditions
  • Modified cosine similarity as the retrieval function

35
JuruXML(1)---Transform Query
36
JuruXML(2)---Vague matching of path conditions
37
JuruXML(3)---Retrieval function
  • Modified cosine similarity
      • wQ(ti, ciQ): query term weight of the pair (ti, ciQ)
      • wD(ti, ciD): indexing weight of the pair (ti, ciD)
        in the document
  • Standard cosine similarity
      • wQ(ti): query term weight of term ti
      • wD(ti): indexing weight of term ti in the
        document
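The formulas themselves are missing from this transcript, so the following is only a sketch of how the two similarities could differ. It assumes that the modified version sums over matching (term, context) pairs and multiplies each product of weights by a context resemblance score. The resemblance functions and all names here are hypothetical, not JuruXML's actual definitions.

```python
import math


def norm(weights):
    """Euclidean length of a weight vector stored as a dict."""
    return math.sqrt(sum(w * w for w in weights.values()))


def standard_cosine(wq, wd):
    """wq and wd map term -> weight."""
    dot = sum(w * wd.get(t, 0.0) for t, w in wq.items())
    denom = norm(wq) * norm(wd)
    return dot / denom if denom else 0.0


def exact_match(query_ctx, doc_ctx):
    """Strictest resemblance: contexts must be identical."""
    return 1.0 if query_ctx == doc_ctx else 0.0


def modified_cosine(wq, wd, resemblance=exact_match):
    """wq and wd map (term, context) -> weight."""
    dot = 0.0
    for (tq, cq), q_w in wq.items():
        for (td, cd), d_w in wd.items():
            if tq == td:  # same term, possibly different contexts
                dot += q_w * d_w * resemblance(cq, cd)
    denom = norm(wq) * norm(wd)
    return dot / denom if denom else 0.0


query = {("xql", "/chapter/heading"): 1.0}
document = {("xql", "/chapter/heading"): 0.6, ("xql", "/chapter/section"): 0.8}

strict = modified_cosine(query, document)
vague = modified_cosine(query, document,
                        resemblance=lambda cq, cd: 1.0 if cq == cd else 0.5)
```

With exact matching only the heading occurrence contributes. A vaguer resemblance function also gives partial credit to the occurrence in a different context.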

38
JuruXML(4)---Alternative approach (Merging
contexts)
  • For each query term (ti,ciQ) treat all matched
    document terms (ti,cjD) equally from the user
    perspective.
  • Define a weight function w(ciQ)
  • E.g.

39
Clustering XML documents
40
Document similarity
  • Document representation
      • document → N-dimensional vector
      • N document features
  • Feature sets
      • Text only
      • Tags only
      • Text + Tags
  • Feature weighting in the document vector
  • Similarity measure: vector similarity
      • E.g. cosine measure
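A minimal sketch of the three feature sets and the cosine measure. It assumes raw counts as feature weights and whitespace tokenization; the function names and the two toy documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text, feature_set="text+tags"):
    """Count features of one document; feature_set is 'text', 'tags' or 'text+tags'."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        if feature_set in ("tags", "text+tags"):
            counts["<" + elem.tag + ">"] += 1  # angle brackets keep tags apart from words
        if feature_set in ("text", "text+tags"):
            for chunk in (elem.text, elem.tail):
                if chunk:
                    counts.update(chunk.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    denom = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return dot / denom if denom else 0.0


doc_a = "<book><title>XML retrieval</title></book>"
doc_b = "<article><title>XML retrieval</title></article>"

similarity = {fs: cosine(features(doc_a, fs), features(doc_b, fs))
              for fs in ("text", "tags", "text+tags")}
```

The two documents have identical text but different root tags, so the similarity depends on the feature set chosen, which is the effect the clustering experiments compare.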

41
Clustering methods
  • Hierarchical clustering
      • Main weakness: quadratic complexity
  • Partitional clustering
      • K-means
      • Linear time complexity
      • Simplicity of its algorithm

42
K-Means clustering algorithm
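The slide's figure is not part of this transcript. The following is a generic sketch of the standard K-means loop: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing changes. Seeding with the first k points and using squared Euclidean distance are simplifying assumptions made here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Return (assignment, centroids) for the given points."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignment, centroids = kmeans(points, k=2)
```

Each iteration touches every point once per centroid, which is where the linear time complexity per iteration comes from.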
43
Measuring clustering quality
  • External quality: comparison of clusters with an
    external classification
      • Entropy: distribution of classes within clusters
      • Purity: largest class in a cluster / cluster size
  • Internal quality: average inter- and
    intra-cluster similarities
      • Cohesiveness (overall similarity)
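The two external measures can be sketched as follows. Each cluster is given as the list of true class labels of its members. Averaging per-cluster values weighted by cluster size is a common convention assumed here, as is that every cluster is non-empty.

```python
import math
from collections import Counter


def purity(clusters):
    """Size-weighted purity: members of each cluster's largest class / all members."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def entropy(clusters):
    """Size-weighted average of the class entropy (in bits) within each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        probs = [n / len(c) for n in Counter(c).values()]
        h = -sum(p * math.log2(p) for p in probs)
        result += len(c) / total * h
    return result


# Hypothetical clustering of five documents with known classes "a" and "b".
clusters = [["a", "a", "b"], ["b", "b"]]
overall_purity = purity(clusters)    # 0.8
overall_entropy = entropy(clusters)  # about 0.551
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.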

44
Discussion
  • Text alone gives the best results
  • Text + tags: problem with weighting of tags vs.
    terms

45
Conclusion
  • XML basics
  • XML Retrieval Tasks and methods
  • Clustering XML documents

46
Bayesian Networks
47
Context-dependent Retrieval
  • The score of an element is given by its
    RSV (Retrieval Status Value).
  • The RSV of a node depends on the RSVs of the
    nodes in its context (parent nodes).
  • The elements with the highest values are then
    presented to the user.

48
Bayesian Networks
49
Bayesian Networks(Cont.)