Title: CS 430 INFO 430 Information Retrieval
1CS 430 / INFO 430 Information Retrieval
Lecture 7 Queries 1
2Course administration
3Queries Tasks and Applications
Task: Ad hoc Retrieval. Application: Search Systems.
  Which documents are relevant to an information need?
Task: Information Filtering. Application: Information Agents.
  Which news articles are interesting to a particular person?
Task: Text Routing. Application: Help-Desk Support.
  Who is an appropriate expert for a particular problem?
4Queries Choices
What is the optimal query for each application?
  Rich query languages
  Extending the Boolean model
  Relevance feedback and query refinement
  Automated query formulation using machine learning
Different approaches are needed for fielded information and free text.
5Query Language
A query language defines the syntax and the semantics of the queries in a given search system.
Factors to consider in designing a query language include:
  Service needs: What are the characteristics of the documents being searched? What need does the service satisfy?
  Human factors: Are the users trained, untrained, or both? What is the trade-off between the power of the language and ease of learning?
  Efficiency: Can the search system process all queries efficiently?
6Query Languages
Traditionally, query languages have fallen into two camps:
(a) Powerful and expressive languages that are neither easily readable nor writable by non-experts (e.g., SQL and XQuery).
(b) Simple and intuitive languages that are not powerful enough to express complex concepts (e.g., Google's query language).
7Query Languages the Common Query Language
The Common Query Language: a formal language for queries to information retrieval systems such as abstracting and indexing services, bibliographic catalogs, and museum collection information.
Objective: human readable, human writable, and intuitive, while maintaining the expressiveness of more complex languages.
Supports:
  Full text searching
  Boolean operators
  Fielded searching
8The Common Query Language
The Common Query Language is maintained by the Library of Congress.
http://www.loc.gov/standards/sru/cql/
The following examples are taken from the CQL Tutorial, A Gentle Introduction to CQL.
9The Common Query Language Examples
Simple queries:
  dinosaur
  comp.sources.misc
  "complete dinosaur"
  "the complete dinosaur"
  "ext->u.generic"
  "and"
Booleans:
  dinosaur or bird
  dinosaur and bird or dinobird
  (bird or dinosaur) and (feathers or scales)
  "feathered dinosaur" and (yixian or jehol)
  (((a and b) or (c not d) not (e or f and g)) and h not i) or j
10The Common Query Language Examples
Indexes (fielded searching):
  title = dinosaur
  title = ((dinosaur and bird) or dinobird)
  dc.title = saurischia
  bath.title = "the complete dinosaur"
  srw.serverChoice = foo
  srw.resultSet = bar
Index-set mapping:
  >dc=http://www.loc.gov/srw/index-sets/dc ... dc.title = dinosaur and dc.author = farlow
Definition of fields (Dublin Core):
  title and author use the Dublin Core definitions.
11The Common Query Language Examples
Proximity
The prox operator: prox/relation/distance/unit/ordering
Examples:
  complete prox dinosaur              [adjacent]
  (caudal or dorsal) prox vertebra
  ribs prox//5 chevrons               [near 5]
  ribs prox//0/sentence chevrons      [same sentence]
  ribs prox/>/0/paragraph chevrons    [not adjacent]
12The Common Query Language Examples
Relations:
  year > 1998
  title all "complete dinosaur"        [all terms in title]
  title any "dinosaur bird reptile"    [any term in title]
  title exact "the complete dinosaur"
  publicationYear < 1980
  numberOfWheels < 3
  numberOfPlates = 18
  lengthOfFemur > 2.4
  bioMass > 100
  numberOfToes <> 3
13The Common Query Language Examples
Relation modifiers:
  title all/stem "complete dinosaur"
  title any/relevant "dinosaur bird reptile"
  title exact/fuzzy "the complete dinosaur"
  author =/fuzzy tailor
The implementations of relevant and fuzzy are not defined by the query language.
14The Common Query Language Examples
Pattern matching:
  dinosaur*           [zero or more characters]
  *sauria
  man?raptor          [exactly one character]
  man?raptor*
  "the comp*saur"
  char\*              [literal "*"]
Word anchoring:
  title = "^the complete dinosaur"    [beginning of field]
  author = "bakker^"                  [end of field]
  author all "^kernighan ritchie"
  author any "^kernighan ^ritchie ^thompson"
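The masking rules can be illustrated by translating a masked term into a regular expression. This is a minimal sketch, assuming `*` stands for zero or more characters, `?` for exactly one character, and `\` makes the following character literal; the function names `mask_to_regex` and `matches` are hypothetical, and word anchoring is not handled.

```python
import re


def mask_to_regex(term):
    """Translate a masked term into a compiled regular expression.

    Assumed rules: '*' = zero or more characters, '?' = exactly one
    character, a backslash makes the next character literal.
    """
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # Escaped character: match it literally and skip past it.
            parts.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)


def matches(term, word):
    """Return True if the whole word matches the masked term."""
    return mask_to_regex(term).fullmatch(word) is not None


if __name__ == "__main__":
    for term, word in [("dinosaur*", "dinosaurs"),
                       ("*sauria", "ornithosauria"),
                       ("man?raptor", "maniraptor"),
                       ("man?raptor", "manraptor"),
                       ("char\\*", "char*")]:
        print(term, word, matches(term, word))
```

A production system would normally expand masked terms against its index of terms rather than scanning text with regular expressions; the translation here only makes the matching rules concrete.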
15The Common Query Language Examples
A complete example:
  dc.author = (kern* or ritchie) and
  (bath.title exact "the c programming language" or
  dc.title = elements prox///4 dc.title = programming) and
  subject any/relevant "style design analysis"
Find records whose author (in the Dublin Core sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.
16Problems with the Boolean model
Boolean is all or nothing:
  The Boolean model has no way to rank documents.
  The Boolean model allows for no uncertainty in assigning index terms to documents.
  The Boolean model has no provision for adjusting the importance of query terms.
17Boolean model as sets
d is either in the set A or not in A.
[Figure: document d and set A]
18Problems with the Boolean model
Counter-intuitive results:
Query q = a and b and c and d and e
  Document d has terms a, b, c and d, but not e.
  Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = a or b or c or d or e
  Document d1 has terms a, b, c, d, and e.
  Document d2 has term a, but not b, c, d or e.
  Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
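Both cases can be reproduced with plain set operations. The sketch below assumes each document is represented as a set of index terms; the `overlap` function is not part of the Boolean model and is included only to put a number on the intuition that some matches are better than others.

```python
def boolean_and(query_terms, doc_terms):
    """True only if every query term appears in the document."""
    return set(query_terms) <= set(doc_terms)


def boolean_or(query_terms, doc_terms):
    """True if at least one query term appears in the document."""
    return bool(set(query_terms) & set(doc_terms))


def overlap(query_terms, doc_terms):
    """Fraction of query terms present in the document (illustrative only)."""
    query_terms = set(query_terms)
    return len(query_terms & set(doc_terms)) / len(query_terms)


query = {"a", "b", "c", "d", "e"}
d = {"a", "b", "c", "d"}
d1 = {"a", "b", "c", "d", "e"}
d2 = {"a"}

if __name__ == "__main__":
    # AND query: d is rejected although it has four of the five terms.
    print(boolean_and(query, d), overlap(query, d))
    # OR query: d1 and d2 both match and cannot be told apart.
    print(boolean_or(query, d1), boolean_or(query, d2))
    print(overlap(query, d1), overlap(query, d2))
```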
19Extending the Boolean model
Term weighting:
  Give weights to terms in documents and/or queries.
  Combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets:
  Relax the boundaries of the sets used in Boolean retrieval.
20Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights:
  Add term weights.
  Weights are calculated by the standard method of term frequency * inverse document frequency.
Ranking:
  Calculate the results set by standard Boolean methods.
  Rank results by vector distances.
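The two-stage idea (Boolean selection, then ranking) can be sketched as follows. The slide does not give SIRE's exact formulas, so this sketch assumes raw term frequency times log(N / document frequency) for the weights and cosine similarity for the ranking; all function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency, log(N / df), for every term in the collection."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf(tokens, idf):
    """Weight each term by term frequency * inverse document frequency."""
    tf = Counter(tokens)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def boolean_then_rank(docs, required_terms):
    """Select documents containing all required terms, then rank them."""
    idf = idf_table(docs)
    query_vec = tfidf(list(required_terms), idf)
    hits = [name for name, tokens in docs.items()
            if set(required_terms) <= set(tokens)]
    scored = [(cosine(query_vec, tfidf(docs[name], idf)), name) for name in hits]
    # Highest similarity first; ties broken by document name.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = {
    "d1": ["dinosaur", "bird", "bird", "feathers"],
    "d2": ["dinosaur", "bird", "scales", "scales", "scales", "reptile"],
    "d3": ["dinosaur", "fossil"],
    "d4": ["bird", "song"],
}

if __name__ == "__main__":
    print(boolean_then_rank(docs, {"dinosaur", "bird"}))
```

Only d1 and d2 satisfy the Boolean query; d1 ranks first because the query terms make up a larger share of its weighted vector.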
21Expanding the results set in SIRE
SIRE (Syracuse Information Retrieval Experiment)
  The results set is created by standard Boolean retrieval.
  The user selects one document from the results set.
  Other documents in the collection are ranked by vector distance from this document.
This process allows the results set to be expanded, thus overcoming the all-or-nothing problem of Boolean retrieval.
SIRE used relevance feedback to refine the results set, as will be discussed in Lecture 18.
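The expansion step can be sketched by ranking every other document by its distance from the one the user selected. This is an illustration only: it assumes raw term-count vectors and cosine distance, since the slide does not say which vector distance SIRE used, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 - cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def expand_from(docs, selected):
    """Rank all other documents by distance from the selected document."""
    seed = Counter(docs[selected])
    others = [(cosine_distance(seed, Counter(tokens)), name)
              for name, tokens in docs.items() if name != selected]
    return [name for dist, name in sorted(others)]


docs = {
    "d1": ["dinosaur", "bird", "feathers"],
    "d2": ["dinosaur", "feathers", "fossil"],
    "d3": ["bird", "song"],
    "d4": ["stock", "market"],
}

if __name__ == "__main__":
    # Suppose the Boolean query "dinosaur and bird" returned only d1
    # and the user selected it.
    print(expand_from(docs, "d1"))
```

Here d2 is ranked closest to d1 even though it lacks the term bird and so would never be returned by the Boolean query dinosaur and bird.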
22Boolean model as fuzzy sets
d is more or less in A.
[Figure: document d and fuzzy set A]
23Fuzzy Sets Basic concept
A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
For a given query, calculate the similarity between the query and each document in the collection.
This calculation is needed for every document that has a non-zero weight for any of the terms in the query.
24Fuzzy Sets
Fuzzy set theory: $d_A$ is the degree of membership of an element in set $A$.
  intersection (and): $d_{A \cap B} = \min(d_A, d_B)$
  union (or): $d_{A \cup B} = \max(d_A, d_B)$
25Fuzzy Sets
Fuzzy set theory example

                      standard set theory    fuzzy set theory
d_A                   1    1    0    0       0.5  0.5  0    0
d_B                   1    0    1    0       0.7  0    0.7  0
and  d_{A \cap B}     1    0    0    0       0.5  0    0    0
or   d_{A \cup B}     1    1    1    0       0.7  0.5  0.7  0
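The min and max rules can be checked against the example values. The sketch below applies them to the four crisp (0 or 1) membership pairs and the four fuzzy pairs from the example; the function and variable names are illustrative.

```python
def fuzzy_and(d_a, d_b):
    """Membership in the intersection: the smaller of the two degrees."""
    return min(d_a, d_b)


def fuzzy_or(d_a, d_b):
    """Membership in the union: the larger of the two degrees."""
    return max(d_a, d_b)


# (d_A, d_B) pairs: standard set theory, then fuzzy set theory.
crisp_pairs = [(1, 1), (1, 0), (0, 1), (0, 0)]
fuzzy_pairs = [(0.5, 0.7), (0.5, 0), (0, 0.7), (0, 0)]

if __name__ == "__main__":
    for d_a, d_b in crisp_pairs + fuzzy_pairs:
        print(d_a, d_b, fuzzy_and(d_a, d_b), fuzzy_or(d_a, d_b))
```

With weights restricted to 0 and 1, min and max reduce to the ordinary Boolean and and or, so the standard model is a special case of the fuzzy one.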
26MMM Mixed Min and Max model
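Terms: $a_1, a_2, \ldots, a_n$
Document $d$, with index-term weights $d_1, d_2, \ldots, d_n$
$q_{or} = (a_1 \text{ or } a_2 \text{ or } \ldots \text{ or } a_n)$
Query-document similarity:
$S(q_{or}, d) = \lambda_{or} \cdot \max(d_1, d_2, \ldots, d_n) + (1 - \lambda_{or}) \cdot \min(d_1, d_2, \ldots, d_n)$
With regular Boolean logic, all $d_i = 1$ or $0$, and $\lambda_{or} = 1$.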
27MMM Mixed Min and Max model
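Terms: $a_1, a_2, \ldots, a_n$
Document $d$, with index-term weights $d_1, d_2, \ldots, d_n$
$q_{and} = (a_1 \text{ and } a_2 \text{ and } \ldots \text{ and } a_n)$
Query-document similarity:
$S(q_{and}, d) = \lambda_{and} \cdot \min(d_1, \ldots, d_n) + (1 - \lambda_{and}) \cdot \max(d_1, \ldots, d_n)$
With regular Boolean logic, all $d_i = 1$ or $0$, and $\lambda_{and} = 1$.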
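The two similarity formulas are short enough to sketch directly. In the code below the OR and AND coefficients are named `lam_or` and `lam_and` (a naming assumption), and the sample weights are made up for illustration.

```python
def mmm_or(weights, lam_or):
    """Similarity for an OR query: mostly the max weight, softened by the min."""
    return lam_or * max(weights) + (1 - lam_or) * min(weights)


def mmm_and(weights, lam_and):
    """Similarity for an AND query: mostly the min weight, softened by the max."""
    return lam_and * min(weights) + (1 - lam_and) * max(weights)


# Hypothetical document: strong weights for four query terms, none for the fifth.
weights = [0.9, 0.8, 0.7, 0.6, 0.0]

if __name__ == "__main__":
    # With the coefficient at 1 the AND score is just the minimum weight.
    print(mmm_and(weights, 1.0))
    # A coefficient below 1 gives partial credit for the terms that do match.
    print(mmm_and(weights, 0.7))
    print(mmm_or(weights, 0.7))
```

With binary weights and a coefficient of 1 the functions reproduce Boolean and and or; lowering the coefficient lets a document that misses one term of an and query still receive a non-zero score.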
28MMM Mixed Min and Max model
Experimental values:
  all $d_i = 1$ or $0$
  $\lambda_{and}$ in range $[0.5, 0.8]$
  $\lambda_{or} > 0.2$
Computational cost is low. Retrieval performance is much improved.
29Test data
        CISI    CACM    INSPEC
MMM     68      109     195
Percentage improvement over the standard Boolean model (average best precision).
Lee and Fox, 1988
30Reading
E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, Frakes, Chapter 15.
Methods based on fuzzy set concepts.