Title: XML Information Retrieval
1XML Information Retrieval
- Hui Fang
- Department of Computer Science
- University of Illinois at Urbana-Champaign
Some slides are borrowed from Norbert Fuhr's XML Tutorial.
2Outline
- XML basics
- Research Topics
- XML IR
- Tasks
- Retrieval methods
- Clustering XML documents
3XML standards
4Basic XML
- Hierarchical document format for information exchange on the WWW
- Self-describing data (tags)
- Nested element structure having a root
- Element data can have
- Attributes
- Sub-elements
(Slides from Jayavel Shanmugasundaram)
5Example XML document
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- Edited with XML Spy v4.2 -->
<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>
6Tree structure of XML documents
book
  title: Finding.
  author (id = rbelew)
    name
      First name: Richard
      Last name: Belew
    address
      city: San Diego
      Zip code: 92093
7Basic XML standard does not deal with
- Standardization of element names
- → XML namespaces
- Structure of element content
- → XML DTDs
- Data types of element content
- → XML schema
8XML namespace
Provide a method to avoid element name conflicts
<table>
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

<table>
  <name>GPA Table</name>
  <width>80</width>
  <length>120</length>
</table>
9XML namespace(Cont.)
Provide a method to avoid element name conflicts
<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td> <h:td>Bananas</h:td>
  </h:tr>
</h:table>

<f:table xmlns:f="http://www.w3schools.com/gpa">
  <f:name>GPA Table</f:name> <f:width>80</f:width>
  <f:length>120</f:length>
</f:table>
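A minimal sketch of how the two prefixed table elements stay distinct once parsed, using Python's standard xml.etree.ElementTree. The wrapping <root> element and the variable names are assumptions added for illustration; only the two namespace URIs come from the slide.

```python
import xml.etree.ElementTree as ET

# Both fragments from the slide, wrapped in a hypothetical <root> element
# so that they form one well-formed document.
doc = """<root>
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table xmlns:f="http://www.w3schools.com/gpa">
    <f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length>
  </f:table>
</root>"""

root = ET.fromstring(doc)

# Prefix-to-URI map used only when writing path expressions.
ns = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/gpa"}

html_cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
gpa_name = root.find("f:table/f:name", ns).text

# The parser stores the namespace URI in each tag, so the two "table"
# elements no longer share a name.
tags = [child.tag for child in root]
print(html_cells, gpa_name, tags)
```

The prefix itself is only shorthand; the element's identity is the pair (namespace URI, local name).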
10XML Document Type Definition
Define the document structure with a list of
legal elements
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Have a rest!</body>
</note>

<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
11Research Topics related to XML
12Research Topics
- IR areas
- Retrieval Models
- Query Languages
- DB areas
- Query Languages
- System architecture
- Apply relational DB technology to XML data
- Streaming XML
- XML Query Processing
- XML indexing and compression
13XML IR
14INEX: Initiative for the Evaluation of XML Retrieval
- Documents: 12,107 articles in XML format
- Queries: 30 content-only (CO)
- 30 content-and-structure (CAS)
- Relevance: assessments by participating groups
- Participants: 36 active groups in 2003
15CO search task
- Document as hierarchical structure of nested elements
- Type of elements is not considered
- Query refers to content only
- Query syntax as in standard text retrieval
- Task: find the smallest subtree (element) satisfying the query
16Example of CO Topic
<INEX-Topic topic-id="45" query-type="CO" ct-no="056">
  <Title>
    <cw>augmented reality and medicine</cw>
  </Title>
  <Description>
    How virtual (or augmented) reality can contribute to improve the medical and surgical practice.
  </Description>
  <Narrative>
    In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery).
  </Narrative>
  <Keywords>
    Augmented virtual reality medicine surgery improve computer assisted aided image
  </Keywords>
</INEX-Topic>
17CAS search Task
- Queries contain explicit references to the XML structure, by restricting
- The context of interest
- <te>: target element
- The context of certain search concepts
- (<cw>, <ce>) pairs
18Example of CAS topic
<INEX-Topic topic-id="09" query-type="CAS" ct-no="048">
  <Title>
    <te>article</te>
    <cw>non-monotonic reasoning</cw><ce>bdy/sec</ce>
    <cw>1999 2000</cw><ce>hdr//yr</ce>
    <cw>-calendar</cw><ce>tig/atl</ce>
    <cw>belief revision</cw>
  </Title>
  <Narrative>
    Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries.
  </Narrative>
  <Keywords>non-monotonic reasoning belief revision</Keywords>
</INEX-Topic>
19XML Retrieval Methods
- XIRQL
- XML query languages with IR-related features
- Language models
- JuruXML
20XIRQL(I)
- CO Approaches
- Split document text into disjoint nodes
- Index nodes separately
- Aggregate indexing weights for higher-level
elements (subtrees)
21Index nodes as units for term weighting
- Application of known indexing functions (e.g. tf-idf)
22Index nodes for relevance-oriented search
[Figure: example document tree. The root element document (class="H.3.3") has an author (John Smith), a title (XML Retrieval), and chapters with headings (Introduction; XML Query Lang. XQL) and sections with headings (Syntax; Examples) and text (This...; We describe syntax of XQL). Five index nodes are numbered 1 to 5.]
Q1: syntax ∧ example
Q2: XQL
23Combining weights by disjunction
chapter: 0.3 XQL
  section1: 0.5 example, 0.8 XQL
  section2: 0.7 syntax
Need to return the most specific element satisfying the query!
Q1: syntax ∧ example
Q2: XQL
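A small sketch of what combining by disjunction does to the weights on this slide. It assumes the weights are independent probabilities combined with a probabilistic OR, and that section1 carries "0.5 example, 0.8 XQL" while section2 carries "0.7 syntax" (the figure layout is not fully recoverable); all function and variable names are illustrative.

```python
def p_or(weights):
    """Probabilistic OR of independent weights: 1 - prod(1 - w)."""
    miss = 1.0
    for w in weights:
        miss *= 1.0 - w
    return 1.0 - miss

def aggregate(*nodes):
    """Combine the term weights of several index nodes by disjunction."""
    terms = set().union(*nodes)
    return {t: p_or([n[t] for n in nodes if t in n]) for t in terms}

chapter_own = {"XQL": 0.3}                    # weight in the chapter's own text
section1 = {"example": 0.5, "XQL": 0.8}       # assumed assignment, see lead-in
section2 = {"syntax": 0.7}

chapter = aggregate(chapter_own, section1, section2)

# Q1 = syntax AND example: only the chapter contains both terms.
q1_chapter = chapter["syntax"] * chapter["example"]

# Q2 = XQL: the aggregated chapter weight exceeds that of section1.
print(round(chapter["XQL"], 2), section1["XQL"], round(q1_chapter, 2))
```

Under these assumptions the chapter scores 0.86 for XQL against 0.8 for section1, so plain disjunction ranks the larger element first, which is the problem the slide points out.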
24Combining weights with augmentation weight
chapter: 0.3 XQL
  section1: 0.5 example, 0.8 XQL
  section2: 0.7 syntax
Q2: XQL
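A sketch of the same aggregation with an augmentation weight that downweights evidence propagated from a child element. The value 0.6 is an assumption chosen for illustration (the slide does not preserve one), as is the assignment of the 0.8 XQL weight to section1.

```python
AUGMENTATION = 0.6  # assumed value; not given on the slide

def p_or(a, b):
    """Probabilistic OR of two independent weights."""
    return a + b - a * b

chapter_own_xql = 0.3   # XQL weight in the chapter's own text
section1_xql = 0.8      # XQL weight in section1 (assumed assignment)

# The child's weight is multiplied by the augmentation weight before it is
# combined with the parent's own weight.
chapter_xql = p_or(chapter_own_xql, AUGMENTATION * section1_xql)

ranking = sorted(
    {"chapter": chapter_xql, "section1": section1_xql}.items(),
    key=lambda kv: kv[1],
    reverse=True,
)
print(ranking)
```

With this value the chapter drops to about 0.636, so for Q2 the more specific section1 (0.8) is ranked first.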
25XIRQL(II)
- CAS approaches
- Extension of XQL by
- Weighting and ranking
- Data types with vague predicates
- Structural relativism
26XQL Expressions
- Path conditions
- Search for single elements: heading
- Parent-child: chapter/heading
- Ancestor-descendant: chapter//section
- Document root: /book/*
- Filter wrt. structure: //chapter[heading]
- Filter wrt. content: /document[@class="H.3.3" and author="John Smith"]
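The path conditions above can be tried out with the XPath subset in Python's standard xml.etree.ElementTree, used here as a stand-in for XQL rather than an XQL implementation. The sample document and the wrapping <collection> element are assumptions. ElementTree has no "and" inside a predicate, so the content filter is written as two chained predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely following the slide's example tree.
root = ET.fromstring("""
<collection>
  <document class="H.3.3">
    <author>John Smith</author>
    <title>XML Retrieval</title>
    <chapter>
      <heading>Introduction</heading>
      <section><heading>Syntax</heading></section>
    </chapter>
    <chapter>
      <section><heading>Examples</heading></section>
    </chapter>
  </document>
</collection>
""")

# Single elements: every heading at any depth.
all_headings = [h.text for h in root.findall(".//heading")]

# Parent-child: headings that are direct children of a chapter.
child_headings = root.findall(".//chapter/heading")

# Ancestor-descendant: sections anywhere below a chapter.
sections = root.findall(".//chapter//section")

# Filter wrt. structure: chapters that have a heading child.
chapters_with_heading = root.findall(".//chapter[heading]")

# Filter wrt. content: attribute test and child-text test, chained.
matching_docs = root.findall("document[@class='H.3.3'][author='John Smith']")

print(all_headings, len(child_headings), len(sections),
      len(chapters_with_heading), len(matching_docs))
```

These are exact Boolean matches; XIRQL's extensions (weighting, vague predicates) are not modeled here.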
27Data types with vague predicates
- Compares two values of a specific data type
- E.g. near, broader, narrower
- Returns (probabilistic) matching value
- E.g. search for an artist named Ulbrich, living in Frankfurt, Germany, about 100 years ago
- → Ernst Olbrich, Darmstadt, 1899
- P(Olbrich ≈ Ulbrich) = 0.8 (phonetic similarity)
- P(1899 ≈ 1903) = 0.9 (numeric similarity)
- P(Darmstadt ≈ Frankfurt) = 0.7 (geographic distance)
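One way the three matching values could be turned into a single score for the candidate, sketched under the assumption that the conditions are conjoined and treated as independent probabilities (the slide gives only the individual values, not the combination rule).

```python
import math

# Matching values from the slide for the candidate "Ernst Olbrich, Darmstadt, 1899".
match = {
    "name":  0.8,   # P(Olbrich ≈ Ulbrich), phonetic similarity
    "year":  0.9,   # P(1899 ≈ 1903), numeric similarity
    "place": 0.7,   # P(Darmstadt ≈ Frankfurt), geographic distance
}

# Assumed combination: conjunction of independent events, i.e. the product.
score = math.prod(match.values())
print(round(score, 3))
```

A Boolean query would reject this candidate on all three conditions; with vague predicates it still receives a non-zero score and can be ranked.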
28Semantic Relativism
- Drop the distinction between attributes and elements
- author: searches for an attribute or element
- Generalize to data types
- personname: searches for attributes/elements of a specific data type
29Language models
- Generate language models for each node in the tree
- Combine the children's language models using linear interpolation
- Use an EM approach to train the linear interpolation parameters
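A minimal sketch of the interpolation step, assuming unsmoothed maximum-likelihood unigram models for the leaf nodes and fixed interpolation weights of 0.5 each instead of EM-trained ones. The element texts and function names are made up for illustration.

```python
from collections import Counter

def mle(text):
    """Maximum-likelihood unigram model of a text node."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(models, lambdas):
    """Parent model = sum_i lambda_i * child_model_i (lambdas sum to 1)."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    vocab = set().union(*models)
    return {w: sum(l * m.get(w, 0.0) for l, m in zip(lambdas, models))
            for w in vocab}

def query_likelihood(query, lm):
    """P(query | model) as a product of unigram probabilities."""
    p = 1.0
    for w in query.lower().split():
        p *= lm.get(w, 0.0)
    return p

title_lm = mle("dog training")
body_lm = mle("the cat chased the dog")
doc_lm = interpolate([title_lm, body_lm], [0.5, 0.5])  # fixed weights, assumed

print(query_likelihood("dog cat", title_lm), query_likelihood("dog cat", doc_lm))
```

Here the title alone gives the query "dog cat" probability 0, while the interpolated parent model covers both terms; a real system would also smooth with a collection model.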
30Element-specific language models---CO Approaches
31Higher-level nodes: mixture of language models
Query: dog and cat
32Type-specific language models--- CAS approaches
33[Figure with weights 0.5 and 0.5]
- Return components of type x that have a component y containing the query term w
- E.g. return documents whose title contains the word bird
- E.g. return documents where the body's first section contains the word dog
34JuruXML
- Element-specific indexing, vector space model
- Transform query into set of (term,path)-conditions
- Vague matching of path conditions
- Modified cosine similarity as retrieval function
35JuruXML(1)---Transform Query
36JuruXML(2)---Vague matching of path conditions
37JuruXML(3)---Retrieval function
- Modified cosine similarity
- w_Q(t_i, c_i^Q): query term weight of the pair (t_i, c_i^Q)
- w_D(t_i, c_i^D): indexing weight of the pair (t_i, c_i^D) in the document
- Standard cosine similarity
- w_Q(t_i): query term weight of term t_i
- w_D(t_i): indexing weight of term t_i in the document
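A sketch of a cosine-style score over (term, context) pairs in which each matching pair is scaled by a context resemblance value. The resemblance function cr below (a subsequence test with a length ratio) is an illustrative assumption and not necessarily the one used by JuruXML; the paths and weights are made up.

```python
import math

def is_subsequence(short, long):
    """True if the tags of `short` occur in `long` in the same order."""
    it = iter(long)
    return all(tag in it for tag in short)

def cr(query_ctx, doc_ctx):
    """Hypothetical context resemblance in [0, 1]; 1.0 for identical paths."""
    q = query_ctx.strip("/").split("/")
    d = doc_ctx.strip("/").split("/")
    return (1 + len(q)) / (1 + len(d)) if is_subsequence(q, d) else 0.0

def norm(weights):
    return math.sqrt(sum(w * w for w in weights))

def modified_cosine(query, doc):
    """Sum of w_Q * w_D * cr over pairs sharing a term, normalized by |Q||D|."""
    num = sum(wq * wd * cr(cq, cd)
              for (tq, cq), wq in query.items()
              for (td, cd), wd in doc.items()
              if tq == td)
    denom = norm(query.values()) * norm(doc.values())
    return num / denom if denom else 0.0

query = {("xql", "/chapter/heading"): 1.0}
doc_a = {("xql", "/chapter/heading"): 1.0}           # same context
doc_b = {("xql", "/chapter/section/heading"): 1.0}   # similar context
doc_c = {("xql", "/author"): 1.0}                    # unrelated context

print(modified_cosine(query, doc_a),
      modified_cosine(query, doc_b),
      modified_cosine(query, doc_c))
```

If every context is identical, cr is 1 everywhere and the score reduces to the standard cosine similarity over terms.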
38JuruXML(4)---Alternative approach (Merging contexts)
- For each query term (t_i, c_i^Q), treat all matched document terms (t_i, c_j^D) equally from the user perspective.
- Define a weight function w(c_i^Q)
39Clustering XML documents
40Document similarity
- Document representation
- Document → N-dimensional vector
- N: number of document features
- Feature sets
- Text only
- Tags only
- Text + tags
- Feature weighting in the document vector
- Similarity measure: vector similarity
- E.g. cosine measure
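A minimal sketch of the three feature sets and the cosine measure, assuming raw term frequency as the feature weight and a simple regular-expression tokenizer. Both are simplifications chosen for illustration, and the two sample documents are made up.

```python
import math
import re
from collections import Counter

def features(xml_text, mode="text+tags"):
    """Bag of features: words, opening tag names, or both."""
    # Tag features are written as "<name>" so they cannot collide with words.
    tags = ["<%s>" % t for t in re.findall(r"<([A-Za-z][\w.-]*)", xml_text)]
    words = re.sub(r"<[^>]+>", " ", xml_text).lower().split()
    if mode == "text":
        return Counter(words)
    if mode == "tags":
        return Counter(tags)
    return Counter(words) + Counter(tags)

def cosine(a, b):
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = "<book><title>XML retrieval</title></book>"
d2 = "<article><title>XML retrieval</title></article>"

sims = {mode: cosine(features(d1, mode), features(d2, mode))
        for mode in ("text", "tags", "text+tags")}
print(sims)
```

The two documents are identical on text, half-similar on tags, and in between when both are pooled without any relative weighting.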
41Clustering methods
- Hierarchical clustering
- Main weakness: quadratic complexity
- Partitional clustering
- K-means
- Linear time complexity
- Simplicity of its algorithm
42K-Means clustering algorithm
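A minimal sketch of the standard k-means loop (assign each point to its nearest centroid, then recompute centroids), since the slide's own figure is not preserved. It uses squared Euclidean distance on small 2-D points for readability; for document vectors a cosine-based variant would be the natural choice. The data, seed, and iteration count are assumptions.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct points as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point, O(n * k).
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # keep old centroid if cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Each iteration touches every point once per centroid, which is where the linear time complexity in the number of documents comes from.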
43Measuring clustering quality
- External quality: comparison of clusters with an external classification
- Entropy: distribution of classes within clusters
- Purity: largest class in a cluster / cluster size
- Internal quality: calculate average inter- and intra-cluster similarities
- Cohesiveness (overall similarity)
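A short sketch of the two external measures, assuming each cluster is given as the list of true class labels of its members and that per-cluster values are averaged with weights proportional to cluster size (a common convention, not stated on the slide). The labels are made up.

```python
import math
from collections import Counter

def purity(cluster_labels):
    """Largest class in the cluster divided by the cluster size."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def entropy(cluster_labels):
    """Entropy (in bits) of the class distribution within the cluster."""
    n = len(cluster_labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(cluster_labels).values())

# Hypothetical clustering of six documents with true classes "db" and "ir".
clusters = [["db", "db", "db", "ir"], ["ir", "ir"]]
total = sum(len(c) for c in clusters)

overall_purity = sum(len(c) / total * purity(c) for c in clusters)
overall_entropy = sum(len(c) / total * entropy(c) for c in clusters)
print(round(overall_purity, 3), round(overall_entropy, 3))
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.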
44Discussion
- Text alone gives the best results
- Text + tags: problem with weighting of tags vs. terms
45Conclusion
- XML basics
- XML Retrieval Tasks and methods
- Clustering XML documents
46Bayesian Networks
47Context-dependent Retrieval
- The score of one element is given by its RSV (Retrieval Status Value).
- The RSV of a node depends on the RSVs of the nodes in its context (parent nodes).
- Elements with the highest values are then presented to the user.
48Bayesian Networks
49Bayesian Networks(Cont.)