Searching, being found - PowerPoint PPT Presentation

Provided by: zged

1
Searching, being found
  • How web search works

2
Outline
  • History
  • What can be found on the web
  • WWW vs. corporate web
  • Internals of search engine
  • What can I do? How to get found
  • Corporate search

3
History
  • The web happened (1992)
  • Mosaic/Netscape happened (1993-95)
  • Crawler happened (1994) M. Mauldin
  • SEs happened 1994-1996
  • InfoSeek, Lycos, Altavista, Excite, Inktomi,
  • Yahoo decided to go with a directory
  • Google happened 1996-98
  • Tried selling technology to other engines
  • SEs thought search was a commodity; portals were
    in
  • Microsoft said whatever

4
d
  • New Standard
  • Best of breed
  • Or is it?

5
Web Search Engine Characteristics
  • Unedited: anyone can enter content
  • Quality issues: spam
  • Scale
  • Hundreds of millions of searches/day; billions of
    docs
  • Varied information types
  • Phone book, brochures, catalogs, dissertations,
    news reports, weather, all in one place!
  • Different kinds of users
  • Web: every type of person with every type of goal
  • Online catalogs: scholars searching scholarly
    literature
  • Lexis-Nexis: paying, professional searchers

6
Web Search Queries
  • Web search queries are short
  • 2.4 words on average (Aug 2000)
  • Has increased, was 1.7 (1997)
  • User Expectations
  • Many say "The first item shown should be what I
    want to see!"
  • This works if the user has the most
    popular/common notion in mind, not otherwise.

7
Corporate web
  • You will not find this by Googling
  • Hidden behind firewalls
  • and passwords
  • Inside databases
  • Within applications

8
Directories vs. Search Engines
  • Directories
  • Hand-selected sites
  • Search over the contents of the descriptions of
    the pages
  • Organized in advance into categories
  • Search Engines
  • All pages in all sites
  • Search over the contents of the pages themselves
  • Organized in response to a query by relevance
    rankings or other scores

9
Standard Web Search Engine Architecture
  • Crawl the web
  • Check for duplicates, store the documents (DocIds)
  • Create an inverted index
  • Search engine servers answer the user query from
    the inverted index
  • Show results to user

10
Standard Web Search Engine Processes
  • Crawling
  • Follow links to find information
  • Indexing
  • Record what words appear where
  • Ranking
  • What information is a good match to a user
    query?
  • What information is inherently good?
  • Displaying
  • Find a good format for the information
  • Serving
  • Handle queries, find pages, display results

11
Crawling
  • How do the web search engines get all of the
    items they index?
  • Main idea
  • Start with known sites
  • Record information for these sites
  • Follow the links from each site
  • Record information found at new sites
  • Repeat

12
Web Crawling Algorithm
  • More precisely
  • Put a set of known sites on a queue
  • Repeat the following until the queue is empty
  • Take the first page off of the queue
  • If this page has not yet been processed
  • Record the information found on this page
  • Positions of words, links going out, etc
  • Add each link on the current page to the queue
  • Record that this page has been processed
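
The queue-based steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (TOY_WEB), and fetch_links stands in for real downloading and parsing, neither of which is in the original slides.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}


def toy_fetch(page):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(page, [])


def crawl(seeds, fetch_links):
    """Breadth-first crawl; returns pages in the order they were processed."""
    queue = deque(seeds)        # put a set of known sites on a queue
    processed = set()
    order = []
    while queue:                # repeat until the queue is empty
        page = queue.popleft()  # take the first page off the queue
        if page in processed:   # skip pages already handled
            continue
        links = fetch_links(page)  # "record the information" step
        order.append(page)
        queue.extend(links)     # add each link on the page to the queue
        processed.add(page)     # record that this page has been processed
    return order


crawl_order = crawl(["a.example/"], toy_fetch)
print(crawl_order)
```

The processed set is what keeps the loop from running forever when pages link back to each other.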

13
Web Crawling Issues
  • Keep out signs
  • A file called robots.txt tells the crawler
    which directories are off limits
  • Freshness
  • Figure out which pages change often
  • Recrawl these often
  • Duplicates, virtual hosts, etc
  • Convert page contents with a hash function
  • Compare new pages to the hash table
  • Lots of problems
  • Server unavailable
  • Incorrect html
  • Missing links
  • Infinite loops
  • Web crawling is difficult to do robustly!
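
Two of the issues above, keep-out signs and duplicates, can be sketched with the Python standard library. The robots rules, the agent name "toy-crawler" and the example URLs are made up for illustration. Hashing whole pages only catches exact copies, so a real crawler would likely need something more tolerant.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots file contents; a real crawler would download this
# from the site instead of using an inline string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="toy-crawler"):
    """True if the parsed rules let this agent fetch the URL."""
    return robots.can_fetch(agent, url)


seen_hashes = {}  # content digest -> first URL seen with that content


def is_duplicate(url, content):
    """Hash the page contents and compare against the table of known pages."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes[digest] = url
    return False
```

For example, allowed("http://www.example.com/private/x") is rejected by the rules above, and the second page with identical text is reported as a duplicate of the first.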

14
Indexing
  • Analysis
  • .pdf, flash, script, ...
  • Extract words from text
  • Whitespace
  • Long-distance
  • Manipulate or modify words
  • Stemming (Walking -> Walk)
  • Removal of frequent words (and, or, the)
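
The analysis steps above feed the inverted index. Below is a rough sketch, assuming a tiny stop-word list and a deliberately crude suffix-stripping function in place of a real stemmer. The function names and sample documents are invented for illustration.

```python
import re
from collections import defaultdict

STOP_WORDS = {"and", "or", "the"}  # frequent words to drop


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split into word tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}.

    Positions count only the terms kept after stop-word removal.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {1: "Walking the dog", 2: "The dog walks and walks"}
index = build_inverted_index(docs)
print(dict(index))
```

Looking up a query term is then a single dictionary access that returns every document containing it.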

15
Inverted index
16
Word match
  • TF: term frequency
  • IDF: inverse document frequency
  • Document size

Valentine Hallmark holiday
Valentine
Every February, across the country, candy,
flowers, and gifts are exchanged between loved
ones, all in the name of St. Valentine. But who
is this mysterious saint and why do we celebrate
this holiday? The history of Valentine's Day --
and its patron saint -- is shrouded in mystery.
But we do know that February has long been a
month of romance. St. Valentine's Day, as we know
it today, contains vestiges of both Christian and
ancient Roman tradition. So, who was Saint
Valentine and how did he become associated with
this ancient rite? Today, the Catholic Church
recognizes at least three different saints named
Valentine or Valentinus, all of whom were
martyred. One legend contends that Valentine was
a priest who served during the third century in
Rome. When Emperor Claudius II decided that
single men made better soldiers than those with
wives and families, he outlawed marriage for
young men -- his crop of potential soldiers.
Valentine, realizing the injustice of the decree,
defied Claudius and continued to perform
marriages for young lovers in secret. When
Valentine's actions were discovered, Claudius
ordered that he be put to death. Other stories
suggest that Valentine may have been killed for
attempting to help Christians escape harsh Roman
prisons where they were often beaten and
tortured. According to one legend, Valentine
actually sent the first 'valentine' greeting
himself. While in prison, it is believed that
Valentine fell in love with a young girl -- who
may have been his jailor's daughter -- who
visited him during his confinement. Before his
death, it is alleged that he wrote her a letter,
which he signed 'From your Valentine,' an
expression that is still in use today. Although
the truth behind the Valentine legends is murky,
the stories certainly emphasize his appeal as a
sympathetic, heroic, and, most importantly,
romantic figure. It's no surprise that by the
Middle Ages, Valentine was one of the most
popular saints in England and France
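
The three word-match signals on this slide (TF, IDF, document size) can be combined in many ways. One common textbook form, shown here only as an assumption about what the slide intends, is the sum over query terms of (term count / document length) * log(N / document frequency). The sample documents are invented.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum of tf * idf over the query terms.

    tf is normalized by document size; idf is log(N / df).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # document frequency: how many documents contain the term
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df == 0 or not tokens:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df)
            score += tf * idf
        scores[doc_id] = score
    return scores


docs = {
    "a": "valentine valentine saint rome",
    "b": "valentine cards and candy",
    "c": "weather report for rome",
}
scores = tf_idf_scores(["saint", "valentine"], docs)
print(scores)
```

A rare term such as "saint" contributes more than a common one, and a term that appears in every document contributes nothing.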
17
PageRank
  • Let A1, A2, ..., An be the pages that point to page
    A. Let C(P) be the number of links out of page P. The
    PageRank (PR) of page A is defined as

    PR(A) = (1 - d) + d * ( PR(A1)/C(A1) + ... + PR(An)/C(An) )

  • d is the damping factor: the probability that a
    surfer follows a link from the current page; with
    probability (1 - d) the surfer gets bored and jumps
    to a random page
  • PageRank is the principal eigenvector of the
    normalized link matrix of the web.
  • Can be computed as the fixpoint of the above
    equation.
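
The fixpoint can be approximated by repeatedly applying PR(A) = (1 - d) + d * (sum of PR(Ai)/C(Ai)) until the values stop changing. The sketch below assumes d = 0.85 and a made-up three-page graph, neither of which comes from the slides. It also assumes that every page has at least one outgoing link and that link lists contain no repeats.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ai) / C(Ai)) a fixed number of times.

    links maps each page to the list of pages it points to.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the shares passed on by every page that links here.
            incoming = sum(
                pr[src] / len(out)
                for src, out in links.items()
                if page in out
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical link graph: A -> B, C; B -> C; C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

In this toy graph C ends up ranked highest, since it receives links from both A and B.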
18
Ant trail
  • Web clicks
  • Users define links between pages
  • The pages that are most useful have the strongest
    pheromone trail

19
  • More money, higher rank on the list

goto.com
20
Display
  • Vivisimo

21
Federation
  • Multiple search engines
  • One display
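
One simple way to turn several result lists into one display is to interleave them by rank and drop repeats. This is just one possible merging policy, chosen for illustration; the engine result lists below are invented.

```python
from itertools import zip_longest


def federate(*result_lists):
    """Round-robin merge of ranked result lists, keeping the first copy of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):  # tier = the rank-k hit from each engine
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example"]
merged = federate(engine_a, engine_b)
print(merged)
```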

22
Serving (Google)
  • Sorted barrels: inverted index
  • PageRank computed from link structure, combined
    with IR rank
  • Billion documents
  • Hundred million queries a day
  • http://infolab.stanford.edu/backrub/google.html

23
What am I to do?
  • Crawling
  • Sitemaps to help search engines to find you
  • Indexing
  • Make sure the page/text is visible to search
    engines
  • Ranking
  • Get linked to by big sites

24
Sitemaps
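    <urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
            xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://somewebsite.com/page1</loc>
        <lastmod>2008-01-09T12:29:23-08:00</lastmod>
        <priority>0.5</priority>
      </url>
    </urlset>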
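
A sitemap like the one on this slide can be generated rather than written by hand. The sketch below uses Python's standard xml.etree.ElementTree. It assumes the three values on the slide are the loc, lastmod and priority fields of the sitemaps.org 0.9 format, and build_sitemap is a made-up helper name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap string from (loc, lastmod, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    [("http://somewebsite.com/page1", "2008-01-09T12:29:23-08:00", "0.5")]
)
print(sitemap_xml)
```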

25
What will the search engine see?
Welcome to By Design Furniture sales enter ENTER
http://www.bydesignfurniture.com/
26
But they see more
  • If you're part of our gadget developer
    community, perhaps hearing about interesting and
    unique ways people are using gadgets will help
    spark some creative ideas. But whether you are
    HTML-savvy or not, and you want to show your
    sweetie how much you care, it's very easy to be
    able to create gadgets. Just visit the Google
    Gadget Center or Gadget Maker and give it a try.

  • http://googleblog.blogspot.com/

...about/" id="t6i." title="Google Gadget Center">Google Gadget Center</a>
Who gets the credit?
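
What a crawler "sees" in a page like this is roughly the visible text plus each link's target and anchor text, and the anchor text is commonly credited to the page being linked to. The sketch below uses Python's standard html.parser. The sample HTML and its gadgets.example URL are made up, since the real link target is truncated on the slide.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect visible text chunks and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.links = []       # (href, anchor text) pairs
        self.text_parts = []  # visible text chunks, in document order
        self._href = None     # href of the <a> currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        chunk = data.strip()
        if chunk:
            self.text_parts.append(chunk)
            if self._href is not None:
                self._anchor.append(chunk)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None


# Hypothetical snippet modeled on the blog text above.
SAMPLE_HTML = (
    '<p>Just visit the <a href="http://gadgets.example/about/" '
    'title="Google Gadget Center">Google Gadget Center</a> '
    "and give it a try.</p>"
)
parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.links)
```

This sketch does not filter out script or style contents, which a real indexer would need to skip.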
27
What about searching my site?
  • Add search engine to your site
  • HtDig
  • Content Management Provider
  • Autonomy
  • Exalead
  • Use advanced search directives
  • search query site:www.mysite.com

28
Getting Googled some more
  • ...rch"
  • value="&nbsp; Enter Search Topic(s) here"
  • value="bdsra.org"
  • ...Site"
  • value="bdsra.org"

29
Web Search vs. IR
  • Web search differs from traditional IR systems
  • Different kind of collection
  • Different kinds of users/queries
  • Different economic motivations
  • Ranking combines many features in a
    difficult-to-specify manner
  • Link analysis and proximity of terms seem
    especially important
  • This is in contrast to the term-frequency
    orientation of standard search
  • Why?

30
The Standard Information Retrieval Interaction
Model
31
Web Search vs. IR (cont.)
  • Web search engine architecture
  • Similar in many ways to standard IR
  • Indexes usually duplicated across machines to
    handle many queries quickly
  • Web crawling
  • Used to create the collection
  • Can be guided by quality metrics
  • Is very difficult to do robustly

32
How do they do that?
  • How do you see just a part of the source code?

33
References
  • Most of this slide show content has been
    shamelessly ripped out of other presentations
  • http://www.sims.berkeley.edu/courses/is202/f00