Web Search

1
Web Search
  • CIS 430 November 6, 2008
  • Emily Pitler

2
What are queries like?
3
Question Answering
4
How do people actually search?
  • Named Entities
  • 1 or 2 words
  • Ambiguous meaning
  • Ambiguous intent

5
What are they searching for?
6
Searching follows temporal patterns
Mei and Church, WSDM 2008
7
Queries are short
  • Beitzel et al., SIGIR 2004
  • America Online, one week in December 2003
  • Popular queries: 1.7 words on average
  • Overall: 2.2 words on average

8
Most Queries are Rare
  • Lempel and Moran, WWW 2003
  • AltaVista query log, summer 2001
  • 7,175,151 queries
  • 2,657,410 distinct queries
  • 1,792,104 queries occurred only once (63.7%)
  • The most popular query occurred 31,546 times

9
What distribution does this look like?
Saraiva et al., SIGIR 2001
10
Zipfian Distribution of Popular Searches
Lempel and Moran, WWW 2003
11
What makes queries difficult?
12
Ambiguity makes queries difficult
American Airlines? Or Alcoholics Anonymous?
13
Query Clarity
  • High clarity score → low ambiguity
  • Cronen-Townsend et al., SIGIR 2002
  • Compare a language model
  • over the relevant documents for a query
  • over all possible documents
  • The more different these are, the clearer the query is
  • e.g., "programming perl" (clear) vs. "the" (unclear)

14
Language Models
  • Query Language Model
  • Collection Language Model (unigram)
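The formulas on this slide were images and did not survive the transcript. A rough reconstruction of the standard estimates in the style of Cronen-Townsend et al. (the smoothing weight λ is an assumption, not from the slides):

```latex
% Collection (unigram) model: relative frequency of w over the whole collection
\[
P_{\mathrm{coll}}(w) = \frac{\mathrm{count}(w)}{\sum_{w'} \mathrm{count}(w')}
\]
% Smoothed document model, and the query language model estimated from the
% documents R retrieved for query Q
\[
P(w \mid D) = \lambda\, P_{\mathrm{ml}}(w \mid D) + (1-\lambda)\, P_{\mathrm{coll}}(w),
\qquad
P(w \mid Q) = \sum_{D \in R} P(w \mid D)\, P(D \mid Q)
\]
```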

15
Kullback-Leibler Divergence
  • Relative entropy between the two distributions
  • Cost in bits of coding using Q when true
    distribution is P

16
Kullback-Leibler Divergence
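The formula on this slide was an image; the standard definition, in bits to match the "cost in bits" reading on the previous slide, is:

```latex
\[
D(P \,\|\, Q) \;=\; \sum_{x} P(x)\,\log_2 \frac{P(x)}{Q(x)}
\]
```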
17
Clarity score
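The clarity-score formula was also an image; reconstructed from the definitions above, it is the KL divergence of the query language model from the collection model:

```latex
\[
\mathrm{clarity}(Q) \;=\; \sum_{w \in V} P(w \mid Q)\,\log_2 \frac{P(w \mid Q)}{P_{\mathrm{coll}}(w)}
\]
```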
18
Clarity scores on TREC-7 collection
19
Query Types: Broder's Taxonomy
  • Navigational: greyhound bus, compaq
  • Informational: San Francisco, normocytic anemia
  • Transactional: britney spears lyrics, download adobe reader

Broder, SIGIR 2002
20
What results should the search engine show?
21
Important
22
How can you tell if a webpage is important?
  • The more webpages that point to you, the more
    important you are
  • The more important webpages point to you, the
    more important you are
  • These intuitions led to PageRank
  • PageRank led to Google

Page et al., 1998
23
Random surfer
Example pages: washingtonpost.com, mtv.com, cnn.com, vh1.com, nytimes.com
24
Random walk on the web
  • Assume our surfer is on a page
  • In the next time step she can:
  • Choose a link on the current page uniformly at random, or
  • Go somewhere else on the web uniformly at random
  • After a long time, what is the probability she is
    on a given page?

25
Simplified PageRank
Each page spreads its probability evenly over its outgoing links; the rank of a page v sums the contributions from the pages that point to v (see the formula below)
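The formula itself was an image; the standard simplified recurrence the slide describes is (outdeg(u) is the number of links leaving u):

```latex
\[
PR(v) \;=\; \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)}
\]
```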
26
PageRank on a Graph
27
PageRank
  • Could also get bored with probability d and
    jump somewhere else completely
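A minimal power-iteration sketch of PageRank with this "bored surfer" jump; the toy graph, the jump probability 0.15, and the function name pagerank are illustrative assumptions, not from the slides.

```python
def pagerank(links, d=0.15, iters=50):
    """Power iteration for PageRank.

    links: dict mapping each page to the pages it links to
           (every link target is assumed to also appear as a key).
    d: probability of getting bored and jumping to a random page.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start from the uniform distribution
    for _ in range(iters):
        new = {p: d / n for p in pages}       # mass from the random jump
        for u, outs in links.items():
            if outs:                          # spread u's rank over its out-links
                share = (1 - d) * pr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                             # dangling page: spread its rank uniformly
                for v in pages:
                    new[v] += (1 - d) * pr[u] / n
        pr = new
    return pr

# Toy graph in the spirit of the random-surfer slide (the links are invented).
web = {
    "cnn.com": ["nytimes.com", "washingtonpost.com"],
    "nytimes.com": ["cnn.com"],
    "washingtonpost.com": ["cnn.com", "nytimes.com"],
    "mtv.com": ["vh1.com"],
    "vh1.com": ["mtv.com", "cnn.com"],
}
print(pagerank(web))
```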

28
Compute PageRanks in O(log(n))
29
Applications of PageRank
  • Google, obviously
  • Given objects and links between them, measures
    importance
  • Summarization (Erkan and Radev, 2004): nodes are sentences, edges are thresholded cosine similarity
  • Research (Mimno and McCallum, 2007): nodes are people, edges are citations
  • Facebook?

30
Relevant
31
Find the query terms
  • Words on the page
  • Title
  • Domain
  • Anchor text: what other sites say when they link to that page

32
What information can we get from this page?
Title: Ani Nenkova - Home
Domain: www.cis.upenn.edu
33
Open Directory Project (dmoz)
  • Ontology of webpages
  • Over 4 million webpages are categorized
  • Like WordNet for webpages
  • Search engines use this
  • Where is www.cis.upenn.edu?
  • Computers > Computer Science > Academic Departments > North America > United States > Pennsylvania

34
Anchor Text
  • What OTHER webpages say about your webpage
  • Very good descriptions of what's on a page

Link to www.cis.upenn.edu/nenkova
"Ani Nenkova" is the anchor text for that page
35
Evaluation
36
Why not just use accuracy?
  • 10,000 documents
  • 10 of them are relevant
  • What happens if you decide to return absolutely
    nothing?
  • 99.9% accuracy

37
Precision and Recall
  • Standard metrics in Information Retrieval
  • Precision: of what you return, how many are relevant?
  • Recall: of what is relevant, how many do you return?
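A tiny sketch of these two definitions over sets of document ids (the ids are made up); it also shows the "return nothing" case from the previous slide, where accuracy looks great but recall is zero.

```python
def precision_recall(returned, relevant):
    """Precision: of what you return, how many are relevant?
    Recall: of what is relevant, how many do you return?"""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant_docs = range(10)                            # 10 relevant docs out of 10,000
print(precision_recall([], relevant_docs))           # return nothing: (0.0, 0.0)
print(precision_recall([1, 2, 99], relevant_docs))   # (0.667, 0.2)
```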

38
Problems with Precision and Recall
  • Relevance is not always a clear-cut binary classification (relevant vs. not relevant)
  • How do you measure recall over the whole web?
  • How many of the 2.7 billion results will get
    looked at? Which ones actually need to be good?

39
Normalized Discounted Cumulative Gain (NDCG)
  • Graded relevance: very relevant, somewhat relevant, not relevant
  • Want the most relevant documents to be ranked first
  • NDCG = DCG / DCG of the ideal ordering
  • Ranges from 0 to 1

40
NDCG
  • Proposed ordering: documents with relevance grades 4, 2, 0, 1
  • DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 6.5
  • Ideal ordering: 4, 2, 1, 0
  • IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) ≈ 6.63
  • NDCG = 6.5 / 6.63 ≈ 0.98
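A short sketch that reproduces the arithmetic above, using the slide's convention that rank 1 is not discounted and later ranks are discounted by log2 of the rank.

```python
import math

def dcg(relevances):
    """DCG: no discount at rank 1, log2(rank) discount afterwards."""
    return relevances[0] + sum(
        rel / math.log2(rank) for rank, rel in enumerate(relevances[1:], start=2)
    )

proposed = [4, 2, 0, 1]                    # relevance grades in the proposed ranking
ideal = sorted(proposed, reverse=True)     # [4, 2, 1, 0]
print(dcg(proposed), dcg(ideal))           # 6.5, ~6.63
print(dcg(proposed) / dcg(ideal))          # NDCG ~0.98
```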
41
Relevance Feedback
42
Queries and Documents: Apples and Oranges?
  • Documents: hundreds of words
  • Queries: 1 or 2 often-ambiguous words
  • It would be much easier to compare documents to documents
  • How can we turn a query into a document?
  • Just find ONE relevant document, then use that to find more

43
Reformulating the query (behind the user's back)
  • New Query = Original Query + Terms from Relevant Docs - Terms from Irrelevant Docs
  • Original query: train
  • Relevant: www.dog-obedience-training-review.com
  • Irrelevant: http://en.wikipedia.org/wiki/Caboose
  • New query: train + 0.3 dog - 0.2 railroad (a sketch of this update follows below)
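A minimal sketch of this kind of update (essentially Rocchio relevance feedback) over term-weight dictionaries; the weights 0.3 and 0.2 echo the slide's example, and the document vectors are invented.

```python
from collections import Counter

def reformulate(query, relevant_docs, irrelevant_docs, beta=0.3, gamma=0.2):
    """New query = original query + beta*avg(relevant) - gamma*avg(irrelevant)."""
    new_query = Counter(query)
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in irrelevant_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(irrelevant_docs)
    return {t: w for t, w in new_query.items() if w > 0}   # drop negative weights

original = {"train": 1.0}
relevant = [{"train": 1.0, "dog": 1.0, "obedience": 0.8}]       # dog-training page
irrelevant = [{"train": 1.0, "railroad": 1.0, "caboose": 0.7}]  # railroad page
print(reformulate(original, relevant, irrelevant))
# roughly {'train': 1.1, 'dog': 0.3, 'obedience': 0.24}
```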

44
How do you know what is relevant?
  • Explicit feedback
  • Ask the user to mark relevant versus irrelevant
  • Or, grade on a scale (like we saw for NDCG)
  • Implicit feedback
  • Users see list of top 10 results, click on a few
  • Assume clicked-on pages were relevant and the rest weren't
  • Pseudo-relevance feedback
  • Do search, assume top results are relevant,
    repeat

45
Query Suggestion
  • Have query logs for millions of users
  • "hybrid car" → "toyota prius" is more likely than "hybrid car" → "flights to LA"
  • Find statistically significant pairs of queries (Jones et al., WWW 2006) using ...

46
Graph-based query suggestion
  • Make a bipartite graph of queries and URLs
  • Cluster (Beeferman and Berger, KDD 2000)
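A much-simplified sketch of the idea, not Beeferman and Berger's actual agglomerative algorithm: build the bipartite query-URL click graph and treat queries that share clicked URLs as belonging together. The click log is invented.

```python
from collections import defaultdict

# Invented click log: (query, clicked URL) pairs.
clicks = [
    ("hybrid car", "toyota.com/prius"),
    ("toyota prius", "toyota.com/prius"),
    ("prius mpg", "toyota.com/prius"),
    ("flights to LA", "united.com"),
]

# Bipartite graph: queries on one side, URLs on the other.
urls_of = defaultdict(set)      # query -> URLs clicked for it
queries_of = defaultdict(set)   # URL -> queries that led to clicks on it
for q, u in clicks:
    urls_of[q].add(u)
    queries_of[u].add(q)

def suggestions(query):
    """Queries that share at least one clicked URL with `query`."""
    return {q for u in urls_of[query] for q in queries_of[u]} - {query}

print(suggestions("hybrid car"))   # {'toyota prius', 'prius mpg'}
```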

47
Clustering Queries
  • Suggest queries in the same cluster

48
Personalization
49
Personalization helps Ambiguity
  • A lot of ambiguity is removed by knowing who the
    searcher is
  • Lots of Fernando Pereiras
  • I (Emily Pitler) only know one of them
  • Location matters
  • "Thai restaurants" from me means "Thai restaurants Philadelphia, PA"

50
Entropy of Search
  • Mei and Church, WSDM 2008
  • H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74
  • H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26 = 1.17
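A small sketch of how such conditional entropies can be estimated from a click log via H(URL | Q) = H(URL, Q) - H(Q); the log entries and resulting numbers are invented, not the figures from the paper.

```python
import math
from collections import Counter

def entropy(outcomes):
    """Empirical entropy, in bits, of a list of outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented click log of (query, clicked URL) pairs.
log = [("aa", "aa.com"), ("aa", "alcoholics-anonymous.org"),
       ("prius", "toyota.com"), ("prius", "toyota.com")]

h_q = entropy([q for q, _ in log])      # H(Q)
h_joint = entropy(log)                  # H(URL, Q)
print(h_joint - h_q)                    # H(URL | Q)
```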

51
Conclusion
52
Problems in search
  • Powerset: trying to apply NLP to Wikipedia

53
Problems in search
  • Descriptive searches: "pictures of mountains"
  • I don't want a document that just contains the words "picture", "of", "mountains"
  • Link farms: trying to game PageRank
  • Spelling correction: a huge portion of queries are misspelled
  • Ambiguity

54
Web Search as Applied CIS 430
  • Text normalization, documents as vectors, document similarity, log likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning
  • Choosing relevant documents/content
  • Snippets: short summaries