1
CS 430 / INFO 430 Information Retrieval
Lecture 27: Information Discovery in Practice
2
Course Administration
3
Searching and Browsing: The Human in the Loop
[Diagram: the human in the loop -- search index, return hits, browse documents, return objects]
4
Information Retrieval from Collections of Textual
Documents
  • Major Categories of Methods
  • Ranking by similarity to query (vector space
    model)
  • Exact matching (Boolean)
  • Ranking of matches by importance of documents
    (PageRank)
  • Combinations of methods
  • Example: Web search engines, such as Google and
    Yahoo, use a combination of methods, based on the
    first and third approaches, with the exact
    combination being chosen by machine learning.

5
Problems with the Boolean model
Counter-intuitive results:
Query q = a AND b AND c AND d AND e
Document d has terms a, b, c and d, but not e.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = a OR b OR c OR d OR e
Document d1 has terms a, b, c, d, and e. Document d2 has term a, but not b, c, d or e.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
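The two counter-intuitive examples above can be reproduced with a few lines of code. This is a sketch, not from the slides; the function names are illustrative.

```python
# Sketch: strict Boolean matching gives no partial credit and no ranking.

def matches_and(doc_terms, query_terms):
    """Boolean AND: every query term must appear in the document."""
    return all(t in doc_terms for t in query_terms)

def matches_or(doc_terms, query_terms):
    """Boolean OR: at least one query term must appear."""
    return any(t in doc_terms for t in query_terms)

query = {"a", "b", "c", "d", "e"}

# AND example: d has four of the five terms, yet is rejected outright.
d = {"a", "b", "c", "d"}
print(matches_and(d, query))   # False

# OR example: d1 matches all five terms, d2 matches only one,
# but the Boolean model reports each simply as a match.
d1 = {"a", "b", "c", "d", "e"}
d2 = {"a"}
print(matches_or(d1, query), matches_or(d2, query))  # True True
```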
6
Similarity between a Query and a Document in 3-Dimensional Term Vector Space
[Diagram: query vector q and document vector d in a space with axes t1, t2, t3; cos(θ), where θ is the angle between q and d, is used as a measure of similarity]
7
Zipf's Law
If the words in a collection are ranked by their frequency, with rank r and
frequency f, they roughly fit the relation r × (f/n) = c, where n is the
number of word occurrences in the collection (19 million in the example).
Different collections have different constants c. In English text, c tends
to be about 0.1.
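The relation can be checked numerically. This sketch uses a synthetic, exactly Zipfian set of frequencies (not English text, so c will not be 0.1 here); r × (f/n) comes out roughly constant across ranks:

```python
# Synthetic check of Zipf's law: frequencies fall off as f_r = 1000 // r.
freqs = [1000 // r for r in range(1, 6)]   # 1000, 500, 333, 250, 200
n = sum(freqs)                             # total word occurrences

# c = r * (f / n) for each rank r; by Zipf's law this is roughly constant.
cs = [r * f / n for r, f in enumerate(freqs, start=1)]
print([round(c, 2) for c in cs])   # [0.44, 0.44, 0.44, 0.44, 0.44]
```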
8
Weighting Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form
perform well in a wide variety of circumstances:
(weight of term i in document j) = (term frequency) × (inverse document frequency)
A standard tf.idf weighting scheme, for free text documents, is:
tij = tfij × idfi = (fij / mj) × (log2 (N/ni) + 1)   when ni > 0
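The formula above translates directly into code. A sketch with assumed example numbers; here mj is taken to be the maximum term frequency in document j, one common choice of normalizer (the slide does not define mj in this excerpt):

```python
import math

def tfidf(f_ij, m_j, N, n_i):
    """t_ij = (f_ij / m_j) * (log2(N / n_i) + 1), for n_i > 0.

    f_ij: frequency of term i in document j
    m_j : normalizer for document j (assumed: max term frequency in j)
    N   : number of documents in the collection
    n_i : number of documents containing term i
    """
    if n_i <= 0:
        return 0.0
    return (f_ij / m_j) * (math.log2(N / n_i) + 1)

# A term appearing 3 times in a document whose most frequent term appears
# 10 times, in a collection of 1000 documents, 10 of which contain the term:
print(round(tfidf(3, 10, 1000, 10), 3))   # 2.293
```

Note how the idf factor rewards rare terms: the same f_ij/m_j with n_i = 500 would give a much smaller weight.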
9
Organization of Files for Full Text Searching
[Diagram: a documents store, plus a word list (index file) mapping each term (ant, bee, cat, dog, elk, fox, gnu, hog) to a pointer into the postings file, which holds the inverted lists]
10
Postings FileA Linked List for Each Term
[Diagram: example postings file; each numbered term (1 abacus, 2 actor, 3 aspen, 4 atoll) heads a linked list of (document, position) entries]
A linked list for each term is convenient to
process sequentially, but slow to update when the
lists are long.
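A minimal in-memory sketch of the word list and postings structure described above; the (document id, position) payload is an assumed illustration:

```python
# Sketch: an inverted index mapping each term to a postings list of
# (document id, position) pairs, convenient to scan sequentially.

from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]}."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            postings[term].append((doc_id, pos))
    return postings

docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat and a dog"}
index = build_index(docs)
print(index["cat"])   # [(1, 1), (3, 1)]
print(index["dog"])   # [(2, 1), (3, 4)]
```

Appending at the end of a Python list is cheap, but inserting a posting into the middle of a long linked list is the slow update the slide warns about.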
11
Query Language
A query language defines the syntax and the semantics of the queries in a
given search system. Factors to consider in designing a query language include:
Service needs: What are the characteristics of the documents being searched?
What need does the service satisfy?
Human factors: Are the users trained, untrained, or both? What is the
trade-off between the power of the language and ease of learning?
Efficiency: Can the search system process all queries efficiently?
12
Latent Semantic Indexing
[Diagram: term-document matrix with a query vector; documents with cosine > 0.9 against the query are retrieved]
13
Probabilistic Principle
Basic concept: The probability that a document is relevant to a query is
assumed to depend only on the terms in the query and the terms used to index
the document. Given a user query q, the ideal answer set, R, is the set of
all relevant documents. Given a user query q and a document d in the
collection, the probabilistic model estimates the probability that the user
will find d relevant, i.e., that d is a member of R.
14
Binary Independence Retrieval Model (BIR)
Let x = (x1, x2, ... xn) be the term incidence vector for d: xi = 1 if term
i is in the document and 0 otherwise. We estimate P(d | R) by P(x | R).
If the index terms are independent:
P(x | R) = P(x1 ∧ x2 ∧ ... ∧ xn | R) = P(x1 | R) P(x2 | R) ... P(xn | R) = ∏ P(xi | R)
This is known as the Naive Bayes probabilistic model.
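The independence assumption turns the joint probability into a simple product. A sketch with assumed per-term probabilities (not from the slides):

```python
# Sketch: P(x | R) as a product of per-term probabilities under the
# Naive Bayes (term independence) assumption.

def p_x_given_r(x, p_term_given_r):
    """x: term incidence vector (0/1); p_term_given_r[i] = P(x_i = 1 | R)."""
    prob = 1.0
    for xi, pi in zip(x, p_term_given_r):
        # A present term contributes P(x_i = 1 | R); an absent one, 1 - that.
        prob *= pi if xi == 1 else (1.0 - pi)
    return prob

# Three index terms with P(x_i = 1 | R) = 0.9, 0.5, 0.2 (assumed values).
p = [0.9, 0.5, 0.2]
x = [1, 1, 0]          # document contains terms 1 and 2 only
print(round(p_x_given_r(x, p), 2))   # 0.9 * 0.5 * 0.8 = 0.36
```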
15
Relevance Feedback
[Diagram: documents plotted in term space; x marks non-relevant documents, o marks relevant documents. Hits from the initial query are contained in the gray shaded area. The reformulated query moves from the original query toward the optimal query.]
16
Adjusting Parameters Relevance Feedback
α, β and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback, by emphasizing the
relevant documents in the initial set. If β = 0, the weights provide
negative feedback, by reducing the emphasis on the non-relevant documents
in the initial set.
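The three-vector reformulation the weights refer to is the standard Rocchio formula, q' = αq + β(centroid of relevant docs) - γ(centroid of non-relevant docs); the slide does not spell it out, so this sketch assumes that standard form and typical weight values:

```python
# Sketch of Rocchio relevance feedback (assumed standard formulation).

def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Reformulate query vector q using judged document vectors."""
    dims = len(q)
    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    r, nr = centroid(relevant), centroid(non_relevant)
    return [alpha * q[i] + beta * r[i] - gamma * nr[i] for i in range(dims)]

q = [1.0, 0.0]
relevant = [[1.0, 1.0], [1.0, 0.0]]   # pull the query toward these
non_relevant = [[0.0, 1.0]]           # push the query away from these
print([round(v, 3) for v in rocchio(q, relevant, non_relevant)])  # [1.75, 0.225]
```

Setting gamma=0 gives pure positive feedback; setting beta=0 gives pure negative feedback, matching the two cases above.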
17
Evaluating Matching Recall and Precision
With matching methods, if information retrieval were perfect ... every hit
would be relevant to the original query, and every relevant item in the
body of information would be found.
Precision: the fraction (or percentage) of the hits that are relevant,
i.e., the extent to which the set of hits retrieved by a query satisfies
the requirement that generated the query.
Recall: the fraction (or percentage) of the relevant items that are found
by the query, i.e., the extent to which the query found all the items that
satisfy the requirement.
18
Evaluating RankingRecall and Precision
If information retrieval were perfect ... every document relevant to the
original information need would be ranked above every other document.
With ranking, precision and recall are functions of the rank order:
Precision(n): the fraction (or percentage) of the n most highly ranked
documents that are relevant.
Recall(n): the fraction (or percentage) of the relevant items that are in
the n most highly ranked documents.
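The two rank-based definitions translate directly into code. A sketch with an assumed ranked list and relevance judgments:

```python
# Sketch: precision and recall at rank n, per the definitions above.

def precision_at(n, ranked, relevant):
    """Fraction of the n most highly ranked documents that are relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

def recall_at(n, ranked, relevant):
    """Fraction of the relevant items found in the n most highly ranked."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / len(relevant)

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d4"}             # judged relevant set

print(precision_at(4, ranked, relevant))  # 2 of the top 4 are relevant: 0.5
print(round(recall_at(4, ranked, relevant), 3))  # 2 of 3 relevant found: 0.667
```

Note the trade-off: increasing n can only raise recall, but typically lowers precision.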
19
Relevance
Recall and precision depend on the concept of relevance. Relevance is a
context- and task-dependent property of documents.
"Relevance is the correspondence in context
between an information requirement statement ...
and an article (a document), that is, the extent
to which the article covers the material that is
appropriate to the requirement statement." F.
W. Lancaster, 1979
20
Characteristics of Evaluation Experiments
Corpus: Standard sets of documents that can be used for repeated experiments.
Topic statements: Formal statements of user information need, not related
to any query language or approach to searching.
Results set for each topic statement: Identify all relevant documents (or a
well-defined procedure for estimating all relevant documents).
Publication of results: Description of testing methodology, metrics, and results.
21
Structural Mark-up Example
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="VII">
      <title>Macbeths castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
Note that Macbeth appears in two different contexts. From Manning, et al., Chapter 10.
22
Web Search
Goal: Provide information discovery for large amounts of open access
material on the web.
Challenges: Volume of material -- several billion items, growing steadily.
Items created dynamically or in databases. Great variety -- length, formats,
quality control, purpose, etc. Inexperience of users -- range of needs.
Economic models to pay for the service. Mischievous Web sites.
23
Concept of Relevance and Importance
  • Document measures
  • Relevance, as conventionally defined, is binary
    (relevant or not relevant). It is usually
    estimated by the similarity between the terms in
    the query and each document.
  • Importance measures documents by their
    likelihood of being useful to a variety of users.
    It is usually estimated by some measure of
    popularity.
  • Web search engines rank documents by combining
    estimates of relevance and importance.

24
Graphical Methods
Document A provides information about document B
Document A refers to document B
25
Anchor Text
The source of Document A contains the marked-up text:
<a href="http://www.cis.cornell.edu/">The Faculty of Computing and Information Science</a>
The anchor text, "The Faculty of Computing and Information Science", can be
considered descriptive metadata about the document http://www.cis.cornell.edu/
26
Graphical Analysis of Hyperlinks on the Web
[Diagram: six numbered pages connected by links; a page that links to many other pages is a hub, and a page that many pages link to is an authority]
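The hub and authority concepts above can be made concrete with Kleinberg's HITS iteration; the slide only names the concepts, so this sketch (with an assumed toy link graph) is one standard formalization:

```python
# Sketch: HITS -- a page's authority score sums the hub scores of pages
# linking to it; its hub score sums the authority scores of pages it links to.

def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

# Pages 1 and 2 both link to page 3, so page 3 gets the top authority score.
links = {"1": ["3"], "2": ["3"], "3": ["4"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))   # 3
```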
27
Context Image Searching
HTML source
Captions and other adjacent text on the web page
From the Information Science web site
28
Distributed File Systems on Large-scale Clusters
of Commodity Computers
"Component failures are the norm rather than the
exception.... The quantity and quality of the
components virtually guarantee that some are not
functional at any given time and some will not
recover from their current failures. We have seen
problems caused by application bugs, operating
system bugs, human errors, and the failures of
disks, memory, connectors, networking, and power
supplies...." Ghemawat, et al.
29
Map/Reduce Cluster Implementation
[Diagram: M map tasks read input files (split 0 ... split 4) and write intermediate files; R reduce tasks read the intermediate files and write output files (Output 0, Output 1)]
Several map or reduce tasks can run on a single computer. Each intermediate
file is divided into R partitions by a partitioning function. Each reduce
task corresponds to one partition.
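The map, partition, and reduce stages above can be sketched in a single process; this assumed word-count example is the classic illustration (the cluster machinery of task scheduling and file I/O is omitted):

```python
# Sketch: the map/reduce dataflow, run in-process. Map tasks emit
# (key, value) pairs; a partition function routes each key to one of R
# partitions; reduce tasks aggregate the values within their partition.

from collections import defaultdict

R = 2  # number of reduce tasks / partitions

def map_task(split):
    """Map: emit (word, 1) for every word in the input split."""
    return [(word, 1) for word in split.split()]

def partition(key):
    """Route a key to one of the R partitions."""
    return hash(key) % R

def run(splits):
    # Intermediate data: one partition per reduce task.
    partitions = [defaultdict(list) for _ in range(R)]
    for split in splits:                       # the M map tasks
        for key, value in map_task(split):
            partitions[partition(key)][key].append(value)
    # Reduce: sum the values for each key within each partition.
    output = {}
    for part in partitions:                    # the R reduce tasks
        for key, values in part.items():
            output[key] = sum(values)
    return output

splits = ["the cat sat", "the dog sat on the cat"]
print(run(splits))   # word counts: the=3, cat=2, sat=2, dog=1, on=1
```

Because all occurrences of a key land in the same partition, each reduce task sees every value for its keys, which is what makes the aggregation correct.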
30
Metadata Surrogates for Non-textual Materials
Text-based methods of information retrieval can
search a surrogate for a photograph.
Document
Surrogate (catalog record)
See next page for a textual catalog record about
a non-textual item (photograph).
31
Library of Congress catalog record (part)
CREATED/PUBLISHED: between 1925 and 1930?
SUMMARY: U.S. President Calvin Coolidge sits at a desk and signs a
photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS: Coolidge, Calvin,--1872-1933. Presidents--United States--1920-1930.
Autographing--Colorado--Denver--1920-1930. Denver (Colo.)--1920-1930.
Photographic prints.
MEDIUM: 1 photoprint: 21 x 26 cm. (8 x 10 in.)
32
Using Metadata for Information Retrieval
The basic operation of information retrieval is to match the way that a
user describes an information requirement (a query) against the way that
items are described (an index). The success of conventional catalogs
(e.g., MARC with the Anglo-American Cataloguing Rules) or indexing services
(e.g., Medline) comes from the combination of: precise language to describe
items; trained and experienced users to formulate queries.
33
Cataloguing Online Materials Dublin Core
Dublin Core is an attempt to apply cataloguing methods to online materials,
notably the Web.
History: It was anticipated that the methods of full-text indexing used by
the early Web search engines, such as Lycos, would not scale up.
"... automated indexes are most useful in small collections within a given
domain. As the scope of their coverage expands, indexes succumb to problems
of large retrieval sets and problems of cross disciplinary semantic drift.
Richer records, created by content experts, are necessary to improve search
and retrieval." Weibel 1995
34
Standardization Function Versus Cost of
Acceptance
[Diagram: function plotted against cost of acceptance; standards with more function have a higher cost of acceptance and few adopters, while simpler standards have many adopters]
35
Example Textual Mark-up
[Diagram: textual mark-up standards on the same axes; ASCII has the least function and lowest cost of acceptance, followed by HTML, then XML, then SGML with the most function and highest cost of acceptance]
36
Effective Information Discovery With Homogeneous
Digital Information
Comprehensive metadata with Boolean retrieval: Can be excellent for
well-understood categories of material, but requires standardized metadata
and relatively homogeneous content (e.g., MARC catalog).
Full text indexing with ranked retrieval: Can be excellent, but the methods
were developed and validated for relatively homogeneous textual material
(e.g., TREC ad hoc track).
37
Standard Model of Information Retrieval
38
Human Factors Browsing
Users give queries of 2 to 4 words. Most users click only on the first few
results; few go beyond the fold on the first page. 80% of users use a
search engine to find sites: search to find the site, browse to find the
information. Amit Singhal, Google, 2004
39
Browsing in Information Space
[Diagram: browsing as step-by-step movement through information space from a starting point, past many intermediate items]
Effectiveness depends on: (a) Starting point (b) Effective feedback (c) Convenience
40
Dynamic Snippets with Pre-computed Summary
41
Evaluation Example Eye Tracking
42
(No Transcript)
43
Visualization within Documents Tilebars
The figure represents a set of hits from a text
search. Each large rectangle represents a
document or section of text. Each row represents
a search term or subquery. The density of each
small square indicates the frequency with which a
term appears in a section of a document.
Hearst 1995
44
Case Study Treemaps
Original design using TreeViz
45
Case Study Treemaps
Hughes satellite management system shows
hierarchy and available capacity
46
Case Study Treemaps
Squarified layout using Treemap 3.0 (University
of Maryland)
47
Case Study Treemaps
Voronoi Treemaps using arbitrary polygons
48
Visual thesaurus for geographic images
Methodology: Divide images into small regions. Create a similarity measure
based on properties of these images. Use cluster analysis tools to generate
clusters of similar images. Provide alternative representations of clusters.
Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of Visual Thesauri
for Browsing Large Collections of Geographic Images, May 1997.
http://ai.bpa.arizona.edu/mramsey/papers/visualThesaurus/visualThesaurus.html
49
(No Transcript)
50
The End
[Diagram: the human in the loop, as on slide 3 -- search index, scan results, return hits, browse content, return objects]