Web Search

1
Web Search
  • CIS 430 November 6, 2008
  • Emily Pitler

2
What are queries like?
3
Question Answering
4
How do people actually search?
  • Named Entities
  • 1 or 2 words
  • Ambiguous meaning
  • Ambiguous intent

5
What are they searching for?
6
Searching follows temporal patterns
Mei and Church, WSDM 2008
7
Queries are short
  • Beitzel et al., SIGIR 2004
  • America Online, one week in December 2003
  • Popular queries: 1.7 words on average
  • Overall: 2.2 words on average

8
Most Queries are Rare
  • Lempel and Moran, WWW 2003
  • AltaVista query log, summer 2001
  • 7,175,151 queries
  • 2,657,410 distinct queries
  • 1,792,104 queries occurred only once (63.7%)
  • The most popular query occurred 31,546 times

9
What distribution does this look like?
Saraiva et al., SIGIR 2001
10
Zipfian Distribution of Popular Searches
Lempel and Moran, WWW 2003
11
What makes queries difficult?
12
Ambiguity makes queries difficult
American Airlines? Or Alcoholics Anonymous?
13
Query Clarity
  • High clarity score → low ambiguity
  • Cronen-Townsend et al., SIGIR 2002
  • Compare a language model
  • over the relevant documents for a query
  • over all possible documents
  • The more different these are, the clearer the query is
  • e.g., "programming perl" (clear) vs. "the" (unclear)

14
Language Models
  • Query Language Model
  • Collection Language Model (unigram)
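The formulas on this slide were images and did not survive the transcript. A rough reconstruction of the standard estimates in the style of Cronen-Townsend et al. (the smoothing weight λ is an assumption, not from the slides):

```latex
% Collection (unigram) model: relative frequency of w over the whole collection
\[
P_{\mathrm{coll}}(w) = \frac{\mathrm{count}(w)}{\sum_{w'} \mathrm{count}(w')}
\]
% Smoothed document model, and the query language model estimated from the
% documents R retrieved for query Q
\[
P(w \mid D) = \lambda\, P_{\mathrm{ml}}(w \mid D) + (1-\lambda)\, P_{\mathrm{coll}}(w),
\qquad
P(w \mid Q) = \sum_{D \in R} P(w \mid D)\, P(D \mid Q)
\]
```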

15
Kullback-Leibler Divergence
  • Relative entropy between the two distributions
  • Cost in bits of coding using Q when true
    distribution is P

16
Kullback-Leibler Divergence
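The formula on this slide was an image; the standard definition, in bits to match the "cost in bits" reading on the previous slide, is:

```latex
\[
D(P \,\|\, Q) \;=\; \sum_{x} P(x)\,\log_2 \frac{P(x)}{Q(x)}
\]
```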
17
Clarity score
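The clarity-score formula was also an image; reconstructed from the definitions above, it is the KL divergence of the query language model from the collection model:

```latex
\[
\mathrm{clarity}(Q) \;=\; \sum_{w \in V} P(w \mid Q)\,\log_2 \frac{P(w \mid Q)}{P_{\mathrm{coll}}(w)}
\]
```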
18
Clarity scores on TREC-7 collection
19
Query Types: Broder's Taxonomy
  • Navigational: greyhound bus, compaq
  • Informational: San Francisco, normocytic anemia
  • Transactional: britney spears lyrics, download adobe reader

Broder, SIGIR 2002
20
What results should the search engine show?
21
Important
22
How can you tell if a webpage is important?
  • The more webpages that point to you, the more
    important you are
  • The more important webpages point to you, the
    more important you are
  • These intuitions led to PageRank
  • PageRank led to Google

Page et al., 1998
23
Random surfer
Example pages: washingtonpost.com, mtv.com, cnn.com, vh1.com, nytimes.com
24
Random walk on the web
  • Assume our surfer is on a page
  • In the next time step she can:
  • Choose a link on the current page uniformly at random, or
  • Go somewhere else on the web uniformly at random
  • After a long time, what is the probability she is
    on a given page?

25
Simplified PageRank
Each page spreads its probability evenly over its outgoing links; the rank of a page v sums the contributions from the pages that point to v (see the formula below)
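The formula itself was an image; the standard simplified recurrence the slide describes is (outdeg(u) is the number of links leaving u):

```latex
\[
PR(v) \;=\; \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)}
\]
```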
26
PageRank on a Graph
27
PageRank
  • Could also get bored with probability d and
    jump somewhere else completely
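A minimal power-iteration sketch of PageRank with this "bored surfer" jump; the toy graph, the jump probability 0.15, and the function name pagerank are illustrative assumptions, not from the slides.

```python
def pagerank(links, d=0.15, iters=50):
    """Power iteration for PageRank.

    links: dict mapping each page to the pages it links to
           (every link target is assumed to also appear as a key).
    d: probability of getting bored and jumping to a random page.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start from the uniform distribution
    for _ in range(iters):
        new = {p: d / n for p in pages}       # mass from the random jump
        for u, outs in links.items():
            if outs:                          # spread u's rank over its out-links
                share = (1 - d) * pr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                             # dangling page: spread its rank uniformly
                for v in pages:
                    new[v] += (1 - d) * pr[u] / n
        pr = new
    return pr

# Toy graph in the spirit of the random-surfer slide (the links are invented).
web = {
    "cnn.com": ["nytimes.com", "washingtonpost.com"],
    "nytimes.com": ["cnn.com"],
    "washingtonpost.com": ["cnn.com", "nytimes.com"],
    "mtv.com": ["vh1.com"],
    "vh1.com": ["mtv.com", "cnn.com"],
}
print(pagerank(web))
```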

28
Compute PageRanks in O(log(n))
29
Applications of PageRank
  • Google, obviously
  • Given objects and links between them, measures
    importance
  • Summarization (Erkan and Radev, 2004): nodes are sentences, edges are thresholded cosine similarity
  • Research (Mimno and McCallum, 2007): nodes are people, edges are citations
  • Facebook?

30
Relevant
31
Find the query terms
  • Words on the page
  • Title
  • Domain
  • Anchor text: what other sites say when they link to that page

32
What information can we get from this page?
Title: Ani Nenkova - Home
Domain: www.cis.upenn.edu
33
Open Directory Project (dmoz)
  • Ontology of webpages
  • Over 4 million webpages are categorized
  • Like WordNet for webpages
  • Search engines use this
  • Where is www.cis.upenn.edu?
  • Computers > Computer Science > Academic Departments > North America > United States > Pennsylvania

34
Anchor Text
  • What OTHER webpages say about your webpage
  • Very good descriptions of what's on a page

Link to www.cis.upenn.edu/nenkova
"Ani Nenkova" is the anchor text for that page
35
Evaluation
36
Why not just use accuracy?
  • 10,000 documents
  • 10 of them are relevant
  • What happens if you decide to return absolutely
    nothing?
  • 99.9% accuracy

37
Precision and Recall
  • Standard metrics in Information Retrieval
  • Precision: of what you return, how many are relevant?
  • Recall: of what is relevant, how many do you return?
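A tiny sketch of these two definitions over sets of document ids (the ids are made up); it also shows the "return nothing" case from the previous slide, where accuracy looks great but recall is zero.

```python
def precision_recall(returned, relevant):
    """Precision: of what you return, how many are relevant?
    Recall: of what is relevant, how many do you return?"""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant_docs = range(10)                            # 10 relevant docs out of 10,000
print(precision_recall([], relevant_docs))           # return nothing: (0.0, 0.0)
print(precision_recall([1, 2, 99], relevant_docs))   # (0.667, 0.2)
```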

38
Problems with Precision and Recall
  • Relevance is not always a clear-cut binary classification (relevant vs. not relevant)
  • How do you measure recall over the whole web?
  • How many of the 2.7 billion results will get
    looked at? Which ones actually need to be good?

39
Normalized Discounted Cumulative Gain (NDCG)
  • Graded relevance: very relevant, somewhat relevant, not relevant
  • Want the most relevant documents to be ranked first
  • NDCG = DCG / DCG of the ideal ordering
  • Ranges from 0 to 1

40
NDCG
  • Proposed ordering: documents with relevance grades 4, 2, 0, 1
  • DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 6.5
  • Ideal ordering: 4, 2, 1, 0
  • IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) ≈ 6.63
  • NDCG = 6.5 / 6.63 ≈ 0.98
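A short sketch that reproduces the arithmetic above, using the slide's convention that rank 1 is not discounted and later ranks are discounted by log2 of the rank.

```python
import math

def dcg(relevances):
    """DCG: no discount at rank 1, log2(rank) discount afterwards."""
    return relevances[0] + sum(
        rel / math.log2(rank) for rank, rel in enumerate(relevances[1:], start=2)
    )

proposed = [4, 2, 0, 1]                    # relevance grades in the proposed ranking
ideal = sorted(proposed, reverse=True)     # [4, 2, 1, 0]
print(dcg(proposed), dcg(ideal))           # 6.5, ~6.63
print(dcg(proposed) / dcg(ideal))          # NDCG ~0.98
```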
41
Relevance Feedback
42
Queries and Documents: Apples and Oranges?
  • Documents: hundreds of words
  • Queries: 1 or 2 often-ambiguous words
  • It would be much easier to compare documents to documents
  • How can we turn a query into a document?
  • Just find ONE relevant document, then use that to find more

43
Reformulating the query (behind the user's back)
  • New Query = Original Query + Terms from Relevant Docs - Terms from Irrelevant Docs
  • Original query: train
  • Relevant: www.dog-obedience-training-review.com
  • Irrelevant: http://en.wikipedia.org/wiki/Caboose
  • New query: train + 0.3 dog - 0.2 railroad (a sketch of this update follows below)
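A minimal sketch of this kind of update (essentially Rocchio relevance feedback) over term-weight dictionaries; the weights 0.3 and 0.2 echo the slide's example, and the document vectors are invented.

```python
from collections import Counter

def reformulate(query, relevant_docs, irrelevant_docs, beta=0.3, gamma=0.2):
    """New query = original query + beta*avg(relevant) - gamma*avg(irrelevant)."""
    new_query = Counter(query)
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in irrelevant_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(irrelevant_docs)
    return {t: w for t, w in new_query.items() if w > 0}   # drop negative weights

original = {"train": 1.0}
relevant = [{"train": 1.0, "dog": 1.0, "obedience": 0.8}]       # dog-training page
irrelevant = [{"train": 1.0, "railroad": 1.0, "caboose": 0.7}]  # railroad page
print(reformulate(original, relevant, irrelevant))
# roughly {'train': 1.1, 'dog': 0.3, 'obedience': 0.24}
```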

44
How do you know what is relevant?
  • Explicit feedback
  • Ask the user to mark relevant versus irrelevant
  • Or, grade on a scale (like we saw for NDCG)
  • Implicit feedback
  • Users see list of top 10 results, click on a few
  • Assume clicked-on pages were relevant and the rest weren't
  • Pseudo-relevance feedback
  • Do search, assume top results are relevant,
    repeat

45
Query Suggestion
  • Have query logs for millions of users
  • "hybrid car" → "toyota prius" is more likely than "hybrid car" → "flights to LA"
  • Find statistically significant pairs of queries (Jones et al., WWW 2006) using ...

46
Graph-based query suggestion
  • Make a bipartite graph of queries and URLs
  • Cluster (Beeferman and Berger, KDD 2000)
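A much-simplified sketch of the idea, not Beeferman and Berger's actual agglomerative algorithm: build the bipartite query-URL click graph and treat queries that share clicked URLs as belonging together. The click log is invented.

```python
from collections import defaultdict

# Invented click log: (query, clicked URL) pairs.
clicks = [
    ("hybrid car", "toyota.com/prius"),
    ("toyota prius", "toyota.com/prius"),
    ("prius mpg", "toyota.com/prius"),
    ("flights to LA", "united.com"),
]

# Bipartite graph: queries on one side, URLs on the other.
urls_of = defaultdict(set)      # query -> URLs clicked for it
queries_of = defaultdict(set)   # URL -> queries that led to clicks on it
for q, u in clicks:
    urls_of[q].add(u)
    queries_of[u].add(q)

def suggestions(query):
    """Queries that share at least one clicked URL with `query`."""
    return {q for u in urls_of[query] for q in queries_of[u]} - {query}

print(suggestions("hybrid car"))   # {'toyota prius', 'prius mpg'}
```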

47
Clustering Queries
  • Suggest queries in the same cluster

48
Personalization
49
Personalization helps Ambiguity
  • A lot of ambiguity is removed by knowing who the
    searcher is
  • Lots of Fernando Pereiras
  • I (Emily Pitler) only know one of them
  • Location matters
  • "Thai restaurants" from me means "Thai restaurants Philadelphia, PA"

50
Entropy of Search
  • Mei and Church, WSDM 2008
  • H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74
  • H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26 = 1.17
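A small sketch of how such conditional entropies can be estimated from a click log via H(URL | Q) = H(URL, Q) - H(Q); the log entries and resulting numbers are invented, not the figures from the paper.

```python
import math
from collections import Counter

def entropy(outcomes):
    """Empirical entropy, in bits, of a list of outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented click log of (query, clicked URL) pairs.
log = [("aa", "aa.com"), ("aa", "alcoholics-anonymous.org"),
       ("prius", "toyota.com"), ("prius", "toyota.com")]

h_q = entropy([q for q, _ in log])      # H(Q)
h_joint = entropy(log)                  # H(URL, Q)
print(h_joint - h_q)                    # H(URL | Q)
```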

51
Conclusion
52
Problems in search
  • Powerset: trying to apply NLP to Wikipedia

53
Problems in search
  • Descriptive searches: "pictures of mountains"
  • I don't want a document that just contains the words "picture", "of", "mountains"
  • Link farms: trying to game PageRank
  • Spelling correction: a huge portion of queries are misspelled
  • Ambiguity

54
Web Search as Applied CIS 430
  • Text normalization, documents as vectors, document similarity, log likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning
  • Choosing relevant documents/content
  • Snippets: short summaries