CSM06 Information Retrieval

About This Presentation

Title:

CSM06 Information Retrieval

Description:

CSM06 Information Retrieval, Lecture 4: Web IR part 1. Dr Andrew Salway, a.salway_at_surrey.ac.uk

Transcript and Presenter's Notes

1
CSM06 Information Retrieval

Lecture 4: Web IR part 1
Dr Andrew Salway a.salway_at_surrey.ac.uk

2
Lecture 4: OVERVIEW

Previously we looked at IR techniques that
indexed a document based on the words that occur
in the document
Some of these techniques are applied in web
search engines (but the vector space model, VSM,
may not be appropriate)
However, web IR can also exploit a distinctive
feature of information on the web: hypertext
link structure
Use of anchor text for indexing web pages
The PageRank algorithm, based on link structure
analysis
Other techniques for ranking web pages

3
Challenges for IR on the Web

High volume of information
Heterogeneous information (multimedia and
multilingual)
Diverse users - hence diverse information needs,
and many inexperienced users
Average query length: 2-5 words
Poorly structured and low quality information

4
Scale

Projection of worldwide Internet population in
2005: 1.07 billion users
(www.clickz.com/stats/web_worldwide/)
Early in 2005 Google claimed to index over 8
billion web pages; Yahoo recently claimed 19
billion; Google now claims to index 3 times more
than its nearest competitor
http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482
Given the low overlap in search engine results
for a given query, it is likely that the total
number of webpages is much greater than that
indexed by any single web search engine

5
Requirements of Web Search Engine Users?

Fast response time
Some relevant results on the first page; maybe
less concern with getting all relevant results
Good coverage of web, at least of important
sites
Up-to-date links
Simple and intuitive to use: making queries and
understanding results
NB. Some of these requirements contrast with
those of expert researchers using specialist
information retrieval systems

6
User Goals (Information Needs)

Queries are used to express a user's goal (or
information need), but note that the same query
might be used for quite different goals
(Rose and Levinson 2004)

7
User Goals: Rose and Levinson's classification
(2004)

Navigational: wanting a specific known website
Informational: my goal is to learn something by
reading or viewing web pages, e.g. closed and
open-ended questions, advice
Resource: my goal is to obtain a resource (not
information) available on web pages, e.g.
download music, interact with online shopping
service
NOTE: prior to the web, most IR was concerned
only with Informational queries

8
User Goals: Rose and Levinson's classification
(2004)

The more a search engine understands about a
user's goal, the better the results it can
provide
User goals may be deduced not only from the
query, but also from:
The results returned by the search engine
Results clicked on by the user
Further searches / actions by the user

9
Opportunity

Web search engines can exploit the fact that
information on the web is in the form of
hypertext

10
Hypertext

The web is, in some senses at least,
hypertextual, i.e. it can be viewed as networks
of nodes (e.g. pages) and links (between pages)

11
Hypertext

Links suggest relatedness of topic, and perhaps
also a recommendation
Topological information about the hypertext graph
gained by link structure analysis can be
exploited for ranking
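
As a rough illustration of what link structure analysis starts from, the sketch below stores a tiny web graph as a Python dictionary and counts the links into and out of each page. The page names and the `links` dictionary are invented for the example; a real crawler would build this structure from fetched pages.

```python
from collections import Counter

# Hypothetical toy web: each page maps to the list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def out_degree(graph):
    """Number of links leaving each page."""
    return {page: len(targets) for page, targets in graph.items()}


def in_degree(graph):
    """Number of links arriving at each page."""
    counts = Counter({page: 0 for page in graph})
    for targets in graph.values():
        counts.update(targets)
    return dict(counts)


print(out_degree(links))
print(in_degree(links))
```

In this toy graph page C has the most incoming links, which is the kind of topological signal a ranking method can exploit.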

12
Use of Anchor Text (Brin and Page 1998)

Words in the anchor text can be used to index the
webpage being linked to: the text in an anchor
may give a good description of the page it points
to, e.g.
<a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>
The words in the anchor text might be a better
indicator of what the webpage is about than the
words in the webpage
Anchor text is also good for resources like
images that cannot be analysed as keywords
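
A minimal sketch of indexing by anchor text, assuming pages are available as HTML strings. It uses Python's standard `html.parser` module; the class name `AnchorCollector`, the function `index_by_anchor_text`, and the lowercase whitespace tokenisation are choices made for this illustration, not a description of how any search engine does it.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.pairs.append((self._href, text))
            self._href = None


def index_by_anchor_text(pages):
    """Map each anchor-text word to the set of link targets described by it."""
    index = defaultdict(set)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.pairs:
            for word in text.lower().split():
                index[word].add(href)
    return index


pages = ['<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>']
index = index_by_anchor_text(pages)
print(sorted(index["beckham"]))
```

Note that the target page is indexed under "beckham" without its own content ever being read, which is why the same approach also works for images and other resources that contain no text.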

13
PageRank (Brin and Page 1998)

Google makes use of both link structure and
anchor text
The citation (link) graph of the web is an
important resource that has largely gone unused
in existing web search engines
PageRank is an objective measure of a web
page's citation importance that corresponds well
with people's subjective idea of importance

14
Calculating PageRank

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PR(A) = PageRank of webpage A
C(A) = the number of links out of webpage A
T1 ... Tn = the webpages that point to webpage A
d = a damping factor set between 0 and 1
In reality, the calculation of PageRank is
iterative
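
The iterative calculation can be sketched as follows. This is an illustration under stated assumptions, not Google's implementation: every page appears as a key in `links`, no page links to the same target twice, all pages start with a PageRank of 1.0, and the number of iterations is fixed at 50. The default d = 0.85 is the value Brin and Page mention as typical; the toy graph is invented.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)),
    where the sum runs over the pages T that link to A."""
    pr = {page: 1.0 for page in links}  # assumed starting values
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Each page T linking here passes on PR(T) / C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C ends up ranked highest because it receives all of B's PageRank and half of A's, while B receives only half of A's.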

15
Web-adjacency Analysis (a similar idea to
PageRank)

Kleinberg and colleagues proposed a method for
identifying authoritative web-pages
Identify set of relevant pages (as normal)
Identify those with a large in-degree, i.e. lots
of pages point to them (cf. impact)
Ensure that the authorities selected are referred
to by a number of the same hubs, i.e. those with
a large out-degree

16
Web-adjacency Analysis

Hubs and authorities exhibit what could be
called a mutually reinforcing relationship
(Kleinberg 1998)
Computing authority and hub values for web-pages
is an iterative process over a graph, where each
node is a web-page
Two weights are given to each node, relating to
in-degree and out-degree; total in-degree weights
and total out-degree weights are kept constant
Weights are modified each iteration depending on
weights of connected nodes
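
A minimal sketch of the iterative hub and authority computation described above. Two simplifications are assumptions of this example: the weights are rescaled so that each set sums to 1 (following the "kept constant" wording here; Kleinberg's paper normalises the sum of squares instead), and the loop runs for a fixed number of iterations. The function name and the toy graph are invented.

```python
def hubs_and_authorities(links, iterations=20):
    """Return (hub, authority) weights for every page in the link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages pointing to it.
        auth = {
            p: sum(hub[s] for s, targets in links.items() if p in targets)
            for p in pages
        }
        # Hub weight: sum of the authority weights of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale so the total of each kind of weight stays constant (1.0).
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {p: w / auth_total for p, w in auth.items()}
        hub = {p: w / hub_total for p, w in hub.items()}
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two candidate authorities.
links = {"H1": ["X", "Y"], "H2": ["X", "Y"], "H3": ["X"], "X": [], "Y": []}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))
```

The mutual reinforcement shows up in the result: X is the strongest authority because all three hubs point to it, and H1 and H2 are stronger hubs than H3 because they point to both authorities.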

17
Some other Factors used to rank Web Pages (Hock
2001)

Popularity of the Page: measured either by how
many other web-pages link to it, or by how many
people have clicked on it when they had the same
query
Frequency of search terms: need to consider
length of the document, and web-page authors'
attempts to affect ranking by deliberate
repetition
Number of query terms matched: but remember many
queries are only one or two words
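
Two of these factors can be sketched in a few lines. The functions below are illustrative only: dividing by document length is one simple way to account for length, and the `cap` on how often a single term may count is an invented defence against deliberate repetition, not something taken from Hock (2001). Documents are assumed to be lists of lowercase words.

```python
def term_frequency_score(query_terms, document_words, cap=3):
    """Query-term occurrences divided by document length.

    Each term counts at most `cap` times (a hypothetical anti-repetition rule).
    """
    if not document_words:
        return 0.0
    hits = sum(min(document_words.count(term), cap) for term in set(query_terms))
    return hits / len(document_words)


def terms_matched(query_terms, document_words):
    """Number of distinct query terms that occur in the document."""
    present = set(document_words)
    return sum(1 for term in set(query_terms) if term in present)


query = ["david", "beckham"]
honest = "david beckham was born in london".split()
stuffed = ["beckham"] * 20  # a page that just repeats one search term

print(terms_matched(query, honest), terms_matched(query, stuffed))
print(term_frequency_score(query, honest) > term_frequency_score(query, stuffed))
```

With the cap in place the keyword-stuffed page scores lower than the honest one, and it also matches only one of the two query terms.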

18
Other Factors (continued)

Rarity of terms: rank pages containing rare
search terms more highly (cf. TF-IDF)
Weighting by Field: give high ranking to pages
including search terms in important fields, e.g.
Title
Proximity of Terms: rank pages more highly if
search terms occur near one another
Order of Query Terms: give priority to pages
containing the search term entered first
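
The sketch below shows one simple way each of the first three factors could be measured. The document layout (a dictionary with `title` and `body` word lists), the function names, and the `title_weight` of 3.0 are all assumptions made for the example; real search engines combine such signals in ways that are not public.

```python
import math


def inverse_document_frequency(term, documents):
    """log(N / df): larger for terms that occur in fewer documents."""
    df = sum(1 for doc in documents if term in doc["title"] or term in doc["body"])
    return math.log(len(documents) / df) if df else 0.0


def field_weighted_count(term, doc, title_weight=3.0):
    """Occurrences of the term, with each title occurrence counted title_weight times."""
    return title_weight * doc["title"].count(term) + doc["body"].count(term)


def min_distance(term_a, term_b, words):
    """Smallest gap in word positions between the two terms, or None if either is absent."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


# Hypothetical three-document collection.
documents = [
    {"title": ["beckham", "biography"],
     "body": "david beckham was born in london".split()},
    {"title": ["london", "guide"],
     "body": "david lives in london and beckham visits london".split()},
    {"title": ["football", "news"],
     "body": "london football news".split()},
]

print(inverse_document_frequency("biography", documents))
print(field_weighted_count("beckham", documents[0]))
print(min_distance("david", "beckham", documents[0]["body"]))
```

In this collection "london" occurs in every document, so its rarity score is zero, while "biography" occurs in only one and scores highest. The first document also wins on field weighting ("beckham" is in its title) and on proximity ("david" and "beckham" are adjacent).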

19
Set Reading for Lecture 4

Brin and Page (1998), The Anatomy of a
Large-Scale Hypertextual Web Search Engine.
SECTIONS 1 and 2. Explains Google's use of
anchor text and PageRank.
www-db.stanford.edu/backrub/google.html
Hock (2001), The extreme searcher's guide to web
search engines, pages 25-31. Gives an overview
of some factors used by web search engines to
rank webpages. AVAILABLE in Main Library
collection and in Library Article Collection.

20
Exercise

Explore the idea of PageRank using an online
PageRank calculator, e.g.
www.markhorrell.com/seo/pagerank.shtml
OR
www.webworkshop.net/pagerank_calculator.php3

21
Further Reading

Rose and Levinson (2004), Understanding User
Goals in Web Search, 13th International WWW
Conference, 2004.
www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf
Page, Brin, Motwani and Winograd (1999), The
PageRank Citation Ranking: Bringing Order to the
Web. http://dbpubs.stanford.edu:8090/pub/1999-66
Belew (2000), Finding Out About, pages 195-199
for an overview of Kleinberg's work on
web-adjacency analysis and authorities and hubs.
Kleinberg (1998), Authoritative Sources in a
Hyperlinked Environment, Journal of the ACM.
http://citeseer.nj.nec.com/87928.html
Kobayashi and Takeda (2000), Information
Retrieval on the Web, ACM Computing Surveys
32(2), pp. 144-173. AVAILABLE IN LIBRARY /
ARTICLE COLLECTION. This comprehensive article
reviews a lot of the ideas covered so far in this
module and discusses them in the context of Web
IR. NOTE: it is already a little out of date in
places because of the rapid changes of the Web.

22
Lecture 4: LEARNING OUTCOMES

After this lecture you should be able to:
Explain how the challenges of web IR are
different from those facing the developers of
traditional IR systems
Explain how web search engines can exploit the
hypertext structure of the web to index and rank
web pages, e.g. using Anchor Text, and PageRank
Explain how PageRank is calculated
Discuss and critique a range of factors used by
web search engines to rank web pages

23
Reading ahead for LECTURE 5

If you want to read about next week's lecture
topics, see:
Dean and Henzinger (1999), Finding Related Pages
in the World Wide Web. Pages 1-10.
http://citeseer.ist.psu.edu/dean99finding.html
Agichtein, Lawrence and Gravano (2001), Learning
Search Engine Specific Query Transformations for
Question Answering, Procs. 10th International
WWW Conference. Section 1 and Section 3
www.cs.columbia.edu/eugene/papers/www10.pdf
Oppenheim, Morris and McKnight (2000), The
Evaluation of WWW Search Engines, Journal of
Documentation, 56(2). Pages 194-205. In Library
Article Collection.