Transcript and Presenter's Notes

Title: Prof. Ray Larson


1
Lecture 21 XML Retrieval
Principles of Information Retrieval
  • Prof. Ray Larson
  • University of California, Berkeley
  • School of Information
  • Tuesday and Thursday, 10:30 am - 12:00 pm
  • Spring 2007
  • http://courses.ischool.berkeley.edu/i240/s07

2
Mini-TREC
  • Proposed Schedule
  • February 15: Database and previous queries
  • February 27: Report on system acquisition and
    setup
  • March 8: New queries for testing
  • April 19: Results due (next Thursday)
  • April 24 or 26: Results and system rankings
  • May 8: Group reports and discussion

3
Announcement
  • No Class on Tuesday (April 17th)

4
Today
  • Review
  • Geographic Information Retrieval
  • GIR Algorithms and evaluation based on a
    presentation to the 2004 European Conference on
    Digital Libraries, held in Bath, U.K.
  • XML and Structured Element Retrieval
  • INEX
  • Approaches to XML retrieval

Credit for some of the slides in this lecture
goes to Marti Hearst
5
Today
  • Review
  • Geographic Information Retrieval
  • GIR Algorithms and evaluation based on a
    presentation to the 2004 European Conference on
    Digital Libraries, held in Bath, U.K.
  • Web Crawling and Search Issues
  • Web Crawling
  • Web Search Engines and Algorithms

Credit for some of the slides in this lecture
goes to Marti Hearst
6
Introduction
  • What is Geographic Information Retrieval?
  • GIR is concerned with providing access to
    georeferenced information sources. It includes
    all of the areas of traditional IR research with
    the addition of spatially and geographically
    oriented indexing and retrieval.
  • It combines aspects of DBMS research, User
    Interface Research, GIS research, and Information
    Retrieval research.

7
Example Results display from CheshireGeo:
http://calsip.regis.berkeley.edu/pattyf/mapserver/cheshire2/cheshire_init.html
8
Other convex, conservative Approximations
9
Our Research Questions
  • Spatial Ranking
  • How effectively can the spatial similarity
    between a query region and a document region be
    evaluated and ranked based on the overlap of the
    geometric approximations for these regions?
  • Geometric Approximations and Spatial Ranking
  • How do different geometric approximations affect
    the rankings?
  • MBRs: the most popular approximation
  • Convex hulls: the highest-quality convex
    approximation

10
Spatial Ranking: Methods for computing spatial
similarity
11
Probabilistic Models: Logistic Regression
attributes
  • X1 = area of overlap(query region, candidate
    GIO) / area of query region
  • X2 = area of overlap(query region, candidate
    GIO) / area of candidate GIO
  • X3 = 1 - abs(fraction of overlap region that is
    onshore - fraction of candidate GIO that is
    onshore)
  • Where: the range for all variables is 0 (not
    similar) to 1 (same); a computation sketch
    follows below

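As an illustration (not the lecture's own code), these three attributes might be computed with the shapely geometry library; the land polygon used for the onshore fractions is an assumed input.

  from shapely.geometry import Polygon

  def spatial_features(query, gio, land):
      """X1, X2, X3 for a query region and a candidate GIO (all Polygons)."""
      overlap = query.intersection(gio)
      x1 = overlap.area / query.area    # share of the query region covered
      x2 = overlap.area / gio.area      # share of the candidate GIO covered
      onshore_overlap = (overlap.intersection(land).area / overlap.area
                         if overlap.area else 0.0)
      onshore_gio = gio.intersection(land).area / gio.area
      x3 = 1.0 - abs(onshore_overlap - onshore_gio)   # shorefactor
      return x1, x2, x3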
12
CA Named Places in the Test Collection: complex
polygons
13
CA Counties: Geometric Approximations (MBRs and
Convex Hulls; a comparison sketch follows below)
Average false area of the approximation:
MBRs 94.61, Convex Hulls 26.73
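A hedged sketch of this comparison, again with shapely; the toy polygon and the particular definition of false area (extra area as a percentage of the true area) are assumptions, and the slide's definition may differ.

  from shapely.geometry import Polygon

  def false_area_pct(true_poly, approx):
      return 100.0 * (approx.area - true_poly.area) / true_poly.area

  county = Polygon([(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)])  # toy shape
  print(false_area_pct(county, county.envelope))     # MBR: the larger false area
  print(false_area_pct(county, county.convex_hull))  # hull: the smaller false area
  # cf. the 94.61 (MBR) vs 26.73 (convex hull) averages for CA counties above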
14
CA User Defined Areas (UDAs) in the Test
Collection
15
Test Collection Query Regions: CA Counties
  • 42 of 58 counties referenced in the test
    collection metadata
  • 10 counties randomly selected as query regions to
    train LR model
  • 32 counties used as query regions to test model

16
LR model
  • X1 = area of overlap(query region, candidate
    GIO) / area of query region
  • X2 = area of overlap(query region, candidate
    GIO) / area of candidate GIO
  • Where: the range for both variables is 0 (not
    similar) to 1 (same)

17
Some of our Results
  • Mean Average Query Precision: the average of the
    precision values after each new relevant document
    is observed in a ranked list.

For metadata indexed by CA named place regions
  • These results suggest:
  • Convex hulls perform better than MBRs
  • An expected result, given that the convex hull is
    a higher-quality approximation
  • A probabilistic ranking based on MBRs can perform
    as well as, if not better than, a
    non-probabilistic ranking method based on convex
    hulls
  • This is interesting: since any approximation
    other than the MBR comes at greater expense, it
    suggests that exploring new ranking methods based
    on the MBR is a promising direction

For all metadata in the test collection
18
Some of our Results
  • Mean Average Query Precision: the average of the
    precision values after each new relevant document
    is observed in a ranked list.

For metadata indexed by CA named place regions
BUT: the inclusion of UDA-indexed metadata
reduces precision. This is because coarse
approximations of onshore or coastal geographic
regions will necessarily include much irrelevant
offshore area, and vice versa.
For all metadata in the test collection
19
Shorefactor Model
  • X1 = area of overlap(query region, candidate
    GIO) / area of query region
  • X2 = area of overlap(query region, candidate
    GIO) / area of candidate GIO
  • X3 = 1 - abs(fraction of query region
    approximation that is onshore - fraction of
    candidate GIO approximation that is onshore)
  • Where: the range for all variables is 0 (not
    similar) to 1 (same)

20
Some of our Results, with Shorefactor
For all metadata in the test collection
Mean Average Query Precision: the average of the
precision values after each new relevant document
is observed in a ranked list.
  • These results suggest:
  • The addition of the Shorefactor variable improves
    the model (LR 2), especially for MBRs
  • The improvement is not as dramatic for convex
    hull approximations because the problem that
    Shorefactor addresses is not as significant when
    areas are represented by convex hulls.

21
Results for All Data - MBRs
[Precision-recall graph]
22
Results for All Data - Convex Hull
[Precision-recall graph]
23
XML Retrieval
  • The following slides are adapted from
    presentations at INEX 2003-2005 and at the INEX
    Element Retrieval Workshop in Glasgow 2005, with
    some new additions for general context, etc.

24
INEX Organization
  • Organized By
  • University of Duisburg-Essen, Germany
  • Norbert Fuhr, Saadia Malik, and others
  • Queen Mary University of London, UK
  • Mounia Lalmas, Gabriella Kazai, and others
  • Supported By
  • DELOS Network of Excellence in Digital Libraries
    (EU)
  • IEEE Computer Society
  • University of Duisburg-Essen

25
XML Retrieval Issues
  • Using Structure?
  • Specification of Queries
  • How to evaluate?

26
Cheshire SGML/XML Support
  • Underlying native format for all data is SGML or
    XML
  • The DTD defines the database contents
  • Full SGML/XML parsing
  • SGML/XML Format Configuration Files define the
    database location and indexes
  • Various format conversions and utilities
    available for Z39.50 support (MARC, GRS-1)

27
SGML/XML Support
  • Configuration files for the Server are SGML/XML
  • They include elements describing all of the data
    files and indexes for the database.
  • They also include instructions on how data is to
    be extracted for indexing and how Z39.50
    attributes map to the indexes for a given
    database.

28
Indexing
  • Any SGML/XML tagged field or attribute can be
    indexed
  • B-Tree and Hash access via Berkeley DB
    (Sleepycat)
  • Stemming, keyword, exact keys and special keys
  • Mapping from any Z39.50 Attribute combination to
    a specific index
  • Underlying postings information includes term
    frequency for probabilistic searching (a postings
    sketch follows below)
  • Component extraction with separate component
    indexes

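A rough sketch of what term-frequency-bearing postings look like; the dict-of-dicts layout is illustrative only, not Cheshire's actual Berkeley DB structure.

  from collections import defaultdict

  postings = defaultdict(dict)   # term -> {component id: term frequency}

  def index_component(component_id, terms):
      """Record each term occurrence for a document or component."""
      for t in terms:
          postings[t][component_id] = postings[t].get(component_id, 0) + 1

  index_component("article1/sec2", ["text", "index", "compression", "text"])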
29
XML Element Extraction
  • A new search ElementSetName is XML_ELEMENT_
  • Any Xpath, element name, or regular expression
    can be included following the final underscore
    when submitting a present request
  • The matching elements are extracted from the
    records matching the search and delivered in a
    simple format.

30
XML Extraction
zselect sherlock
372 Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection 372
zfind topic mathematics
OK Status 1 Hits 26 Received 0 Set Default RecordSyntax UNKNOWN
zset recsyntax XML
zset elementset XML_ELEMENT_Fld245
zdisplay
OK Status 0 Received 10 Position 1 Set Default NextPosition 11 RecordSyntax XML 1.2.840.10003.5.109.10
<RESULT_DATA DOCID="1">
  <ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
    <Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
  </ITEM>
</RESULT_DATA>
etc.
31
TREC3 Logistic Regression
Probability of relevance is based on logistic
regression from a sample set of documents to
determine the values of the coefficients. At
retrieval time the probability estimate is
obtained from the log-odds

log O(R | Q, C) = b0 + b1·X1 + ... + b6·X6

for the 6 X attribute measures shown on the next
slide.
32
TREC3 Logistic Regression
  • Average Absolute Query Frequency
  • Query Length
  • Average Absolute Component Frequency
  • Document Length
  • Average Inverse Component Frequency
  • Number of Terms in both query and Component

(A small scoring sketch follows below.)
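Illustrative only: converting the fitted log-odds into a probability of relevance. The trained coefficient values are not reproduced here.

  import math

  def prob_relevance(b, x):
      """b = [b0, b1, ..., b6] from the regression fit;
      x = [X1, ..., X6], the attribute measures listed above."""
      log_odds = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
      return 1.0 / (1.0 + math.exp(-log_odds))   # logistic function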
33
Okapi BM25
score(Q, d) = sum over T in Q of
  w(1) · ((k1 + 1)·tf / (K + tf)) · ((k3 + 1)·qtf / (k3 + qtf))

  • Where:
  • Q is a query containing terms T
  • K is k1·((1 - b) + b·dl/avdl)
  • k1, b and k3 are parameters, usually 1.2, 0.75
    and 7-1000
  • tf is the frequency of the term in a specific
    document
  • qtf is the frequency of the term in a topic from
    which Q was derived
  • dl and avdl are the document length and the
    average document length measured in some
    convenient unit
  • w(1) is the Robertson-Sparck Jones weight
    (a scoring sketch follows below)

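A sketch of the scoring loop implied by this formula; the variable names follow the slide, and w(1) is used in its no-relevance-information (IDF-like) form.

  import math

  def bm25(query_terms, qtf, tf, df, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
      """tf/qtf/df are dicts of term statistics; N is the collection size."""
      K = k1 * ((1 - b) + b * dl / avdl)
      score = 0.0
      for t in query_terms:
          if tf.get(t, 0) == 0:
              continue
          # RSJ weight without relevance information (IDF-like form)
          w1 = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
          score += (w1 * ((k1 + 1) * tf[t] / (K + tf[t]))
                       * ((k3 + 1) * qtf[t] / (k3 + qtf[t])))
      return score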
34
Combining Boolean and Probabilistic Search
Elements
  • Two original approaches:
  • Boolean approach
  • Non-probabilistic fusion search: the set-merger
    approach is a weighted merger of document scores
    from separate Boolean and probabilistic queries

35
INEX 04 Fusion Search
[Diagram: multiple subqueries produce component
query results, which are fused/merged into a
final ranked list]
  • Merge multiple ranked and Boolean index searches
    within each query, and multiple component search
    result sets
  • Major components merged are Articles, Body,
    Sections, subsections, paragraphs

36
Merging and Ranking Operators
  • Extends the capabilities of merging to include
    merge operations within queries, used like
    Boolean operators (a fusion sketch follows below)
  • Fuzzy logic operators (not used for INEX):
  • !FUZZY_AND
  • !FUZZY_OR
  • !FUZZY_NOT
  • Containment operators: restrict components to, or
    coming from, a particular parent:
  • !RESTRICT_FROM
  • !RESTRICT_TO
  • Merge operators:
  • !MERGE_SUM
  • !MERGE_MEAN
  • !MERGE_NORM
  • !MERGE_CMBZ

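The exact !MERGE_CMBZ formula is internal to Cheshire; this sketch only shows the general idea of normalized score summation with a boost for items found by several subqueries (in the spirit of CombMNZ).

  def merge_cmbz(result_sets):
      """result_sets: list of dicts mapping item id -> raw score."""
      merged = {}
      for results in result_sets:
          if not results:
              continue
          top = max(results.values()) or 1.0
          for item, score in results.items():
              hits, total = merged.get(item, (0, 0.0))
              merged[item] = (hits + 1, total + score / top)  # max-normalize
      # enhancement: boost items found by several subqueries
      return sorted(merged, key=lambda i: merged[i][0] * merged[i][1],
                    reverse=True)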
37
New LR Coefficients
Estimates using INEX 03 relevance assessments for:
  • b1: Average Absolute Query Frequency
  • b2: Query Length
  • b3: Average Absolute Component Frequency
  • b4: Document Length
  • b5: Average Inverse Component Frequency
  • b6: Number of Terms in common between query and
    Component
38
INEX CO Runs
  • Three official runs and one later run - all
    title-only
  • Fusion - combines Okapi and LR using the
    MERGE_CMBZ operator
  • NewParms (LR) - uses only LR with the new
    parameters
  • Feedback - an attempt at blind relevance feedback
  • PostFusion - fusion of the new LR coefficients
    and Okapi

39
Query Generation - CO
  • 162 TITLE: Text and Index Compression
    Algorithms
  • QUERY: (topicshort @ Text and Index Compression
    Algorithms) !MERGE_CMBZ (alltitles @ Text and
    Index Compression Algorithms) !MERGE_CMBZ
    (topicshort @ Text and Index Compression
    Algorithms) !MERGE_CMBZ (alltitles @ Text and
    Index Compression Algorithms)
  • @ denotes the ranked-search operators: one form
    invokes Okapi, the other LR
  • !MERGE_CMBZ is a normalized score summation and
    enhancement

40
INEX CO Runs
Avg Prec by quantization:

Run        Strict    Generalized
FUSION     0.0642    0.0923
NEWPARMS   0.0582    0.0853
FDBK       0.0415    0.0390
POSTFUS    0.0690    0.0952
41
INEX VCAS Runs
  • Two official runs
  • FUSVCAS - Element fusion using LR and various
    operators for path restriction
  • NEWVCAS - Using the new LR coefficients for each
    appropriate index and various operators for path
    restriction

42
Query Generation - VCAS
  • 66 TITLE: //article[about(., intelligent
    transport systems)]//sec[about(., on-board route
    planning navigation system for automobiles)]
  • Submitted query: ((topic @ intelligent
    transport systems)) !RESTRICT_FROM ((sec_words @
    on-board route planning navigation system for
    automobiles))
  • Target elements: sec, ss1, ss2, ss3
    (a containment sketch follows below)

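Illustrative semantics for the containment restriction; the dict keys and the XPath-prefix test are assumptions, not Cheshire's implementation.

  def restrict_from(components, parents):
      """Keep only component results that fall inside a parent element
      matched by the other subquery (same document, nested XPath)."""
      kept = []
      for comp in components:
          if any(comp["doc"] == par["doc"]
                 and comp["xpath"].startswith(par["xpath"])
                 for par in parents):
              kept.append(comp)
      return kept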
43
VCAS Results
Avg Prec by quantization:

Run       Generalized   Strict
FUSVCAS   0.0321        0.0601
NEWVCAS   0.0270        0.0569
44
Heterogeneous Track
  • Approach uses Cheshire's Virtual Database
    options
  • Primarily a version of distributed IR
  • Each collection indexed separately
  • Search via Z39.50 distributed queries
  • Z39.50 attribute mapping is used to map query
    indexes to the appropriate elements in a given
    collection
  • Only LR was used, and collection results were
    merged using the probability of relevance for
    each collection result

45
INEX 2005 Approach
  • Used only Logistic regression methods
  • TREC3 with Pivot
  • TREC2 with Pivot
  • TREC2 with Blind Feedback
  • Used post-processing for specific tasks

46
Logistic Regression
Probability of relevance is based on logistic
regression from a sample set of documents to
determine the values of the coefficients. At
retrieval time the probability estimate is
obtained from the log-odds

log O(R | Q, C) = b0 + sum over i = 1..m of bi·Xi

for some set of m statistical measures, Xi,
derived from the collection and query.
47
TREC2 Algorithm
48
Blind Feedback
  • Term selection from top-ranked documents is based
    on the classic Robertson/Sparck Jones
    probabilistic model

For each term t, the relevance weight is

termwt(t) = log [ ((rt + 0.5)·(N - nt - R + rt + 0.5)) /
                  ((nt - rt + 0.5)·(R - rt + 0.5)) ]

where rt is the number of assumed-relevant
documents containing t, R the number of
assumed-relevant documents, nt the number of
documents in the collection containing t, and N
the collection size.
49
Blind Feedback
  • The top x new terms are taken from the top y
    documents
  • Each term in the top-y assumed-relevant set is
    ranked by termwt, and the top x are selected for
    inclusion in the query (see the sketch below)

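A sketch of this expansion step under the stated model; term_stats and the parameter names are illustrative.

  import math

  def rsj_weight(r, R, n, N):
      """r: assumed-relevant docs containing the term; R: assumed-relevant
      docs; n: collection docs containing the term; N: collection size."""
      return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                      ((n - r + 0.5) * (R - r + 0.5)))

  def expansion_terms(term_stats, R, N, x=10):
      """term_stats: term -> (r, n). Returns the top x terms by weight."""
      return sorted(term_stats,
                    key=lambda t: rsj_weight(term_stats[t][0], R,
                                             term_stats[t][1], N),
                    reverse=True)[:x]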
50
Pivot method
  • Based on the pivot weighting used by IBM Haifa in
    INEX 2004 (Mass and Mandelbrod)
  • Used 0.50 as the pivot in all cases
  • For the TREC3 and TREC2 runs, all component
    results are weighted by the article-level results
    for the matching article (a sketch follows below)

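One plausible reading of the pivot weighting (after Mass and Mandelbrod), interpolating the component estimate with its article's estimate; the exact formula used may differ.

  def pivot_score(component_prob, article_prob, pivot=0.5):
      """Weight a component's probability by its article's probability;
      pivot = 0.5 was used here in all cases."""
      return pivot * article_prob + (1.0 - pivot) * component_prob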
51
Adhoc Component Fusion Search
[Diagram: multiple subqueries produce component
query results, which are fused/merged into a raw
ranked list]
  • Merge multiple ranked component types
  • Major components merged are Article Body,
    Sections, paragraphs, figures

52
TREC3 Logistic Regression
Probability of relevance is based on logistic
regression from a sample set of documents to
determine the values of the coefficients. At
retrieval time the probability estimate is
obtained from the log-odds

log O(R | Q, C) = b0 + b1·X1 + ... + b6·X6
53
TREC3 Logistic Regression attributes
  • Average Absolute Query Frequency
  • Query Length
  • Average Absolute Component Frequency
  • Document Length
  • Average Inverse Component Frequency
  • Number of Terms in common between query and
    Component -- logged
54
TREC3 LR Coefficients
Estimates using INEX 03 relevance assessments for:
  • b1: Average Absolute Query Frequency
  • b2: Query Length
  • b3: Average Absolute Component Frequency
  • b4: Document Length
  • b5: Average Inverse Component Frequency
  • b6: Number of Terms in common between query and
    Component
55
CO.Focused
[Graphs: Generalized and Strict quantization results]

56
COS.Focused
[Graphs: Generalized and Strict quantization results]

57
CO.Thorough
[Graphs: Generalized and Strict quantization results]

58
COS.Thorough
[Graphs: Generalized and Strict quantization results]

59
CAS
[Graphs: Generalized and Strict quantization results]

60
Het. Element Retr. Overview
  • The Problem
  • Issues with Element Retrieval and Heterogeneous
    Retrieval
  • Possible Approaches
  • XPointer
  • Generic Metadata systems
  • E.g., Dublin Core
  • Other Metadata Systems

61
The Problem
  • The Adhoc track in INEX has dealt with a single
    DTD for one type of data (computer science
    journal articles)
  • In real-world environments, XML retrieval must
    deal with different DTDs, different genres of
    data and widely varying topical content

62
The Heterogeneous Track
  • Research Questions (2004)
  • For content-oriented queries, what methods are
    possible for determining which elements contain
    reasonable answers? Are pure statistical methods
    appropriate, or are ontology-based approaches
    also helpful?
  • What methods can be used to map structural
    criteria onto other DTDs?
  • Should mappings focus on element names only, or
    also deal with element content or semantics?
  • What are appropriate evaluation criteria for
    heterogeneous collections?

63
INEX 2004 Het Collection Tags
64
Issues with Element Retrieval for Heterogeneous
Retrieval
  • Conceptual issues (user view)
  • To actually specify structural elements for
    retrieval requires that the user know the
    structure of the items to be retrieved
  • As the number of DTDs or schemas increases, this
    task becomes more complex, both for specification
    and for understanding
  • For real-world XML retrieval, specifying
    structure effectively requires omniscience on the
    part of the user
  • The collection itself must be specified in some
    way (can the user know all of the collections?)
  • Users of INEX can't produce correct
    specifications for even one DTD

65
Issues with Element Retrieval for Heterogeneous
Retrieval
  • Practical issues (programmer's view)
  • Most of the same problems as the user view
  • As seen in earlier papers today, the system must
    provide an interface that the user can
    understand, but that maps to the complexities of
    the DTD(s)
  • But, once again, as the number of DTDs or schemas
    increases, this task becomes increasingly complex
    for the specification of the mappings
  • For real-world XML retrieval, specifying
    structure effectively requires omniscience on the
    part of the programmer to provide exhaustive
    mappings of the document elements to be retrieved
  • As Roelof noted earlier today, this can rapidly
    become a system that has too many options for a
    user to understand or use

66
Postulate of Impotence
  • In summation, we might suggest another "Postulate
    of Impotence" like those suggested by Swanson:
  • You can either have heterogeneous retrieval or
    precise element specifications in queries, but
    you cannot have both simultaneously

67
Possible Approaches
  • Generalized structure
  • Parent/child as in Xpath/Xpointer
  • What about flat structures? (like most
    collections in the Het track)
  • Abstract query elements
  • Use semantic representations in queries rather
    than structural representations
  • E.g., Title instead of //fm/tig/atl (one possible
    realization is sketched below)
  • What semantic representations can/should be used?

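Illustrative only: abstract query elements could be realized as a per-collection mapping from semantic names to concrete paths. The second collection and its path are hypothetical.

  ELEMENT_MAP = {
      "ieee_articles": {"Title": "//fm/tig/atl"},
      "other_dtd": {"Title": "//head/title"},   # hypothetical second DTD
  }

  def resolve(collection, abstract_name):
      """Map an abstract element name to this collection's concrete path."""
      return ELEMENT_MAP[collection].get(abstract_name)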
68
XPointer
  • Can specify collection-level identification
  • Basically a URN attached to an Xpath
  • Can also specify various string-matching
    constraints on Xpath
  • Might be useful in INEX Het Track for specifying
    relevance judgements
  • But, it doesn't address (or it worsens) the
    larger problem of dealing with large numbers of
    heterogeneous structures

69
Abstract Data Elements
  • The idea is to remove the requirement of precise
    and explicit specification of structural elements
    and replace them with abstract and implied
    specifications
  • Used in other heterogeneous retrieval systems
  • Z39.50/SRW (attribute sets and element sets)
  • Dublin Core (limited set of elements for search
    or retrieval)

70
Dublin Core
  • Simple metadata for describing internet resources
  • For Document-Like Objects
  • 15 Elements (in base DC)

71
Dublin Core Elements
  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Other Contributors
  • Date
  • Resource Type
  • Format
  • Resource Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights Management

72
Issues in Dublin Core
  • Lack of guidance on what to put into each element
  • How to structure or organize at the element
    level?
  • How to ensure consistency across descriptions for
    the same persons, places, things, etc.