Searching, being found - PowerPoint PPT Presentation

Slides: 34

Provided by: zged

Transcript and Presenter's Notes

1
Searching, being found

How web search works

2
Outline

History
What can be found on the web
WWW vs. corporate web
Internals of search engine
What can I do? How to get found
Corporate search

3
History

The web happened (1992)
Mosaic/Netscape happened (1993-95)
Crawler happened (1994) M. Mauldin
SEs happened 1994-1996
InfoSeek, Lycos, Altavista, Excite, Inktomi,
Yahoo decided to go with a directory
Google happened 1996-98
Tried selling technology to other engines
SEs thought search was a commodity; portals were in
Microsoft said whatever

4

New Standard
Best of breed
Or is it?

5
Web Search Engine Characteristics

Unedited: anyone can enter content
Quality issues: spam
Scale
Hundreds of millions of searches/day; billions of docs
Varied information types
Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
Different kinds of users
Web: every type of person with every type of goal
Online catalogs: scholars searching scholarly literature
Lexis-Nexis: paying, professional searchers

6
Web Search Queries

Web search queries are short
2.4 words on average (Aug 2000)
Has increased, was 1.7 (1997)
User Expectations
Many say: "The first item shown should be what I want to see!"
This works if the user has the most popular/common notion in mind, not otherwise.

7
Corporate web

You will not find this by Googling
Hidden behind firewalls
and passwords
Inside databases
Within applications

8
Directories vs. Search Engines

Directories
Hand-selected sites
Search over the contents of the descriptions of
the pages
Organized in advance into categories

Search Engines
All pages in all sites
Search over the contents of the pages themselves
Organized in response to a query by relevance
rankings or other scores

9
Standard Web Search Engine Architecture
[Figure: architecture diagram. Labels: crawl the web; check for duplicates, store the documents; DocIds; create an inverted index; inverted index; search engine servers; user query; show results to user]
10
Standard Web Search Engine Processes

Crawling
Follow links to find information
Indexing
Record what words appear where
Ranking
What information is a good match to a user
query?
What information is inherently good?
Displaying
Find a good format for the information
Serving
Handle queries, find pages, display results

11
Crawling

How do the web search engines get all of the
items they index?
Main idea
Start with known sites
Record information for these sites
Follow the links from each site
Record information found at new sites
Repeat

12
Web Crawling Algorithm

More precisely
Put a set of known sites on a queue
Repeat the following until the queue is empty
Take the first page off of the queue
If this page has not yet been processed
Record the information found on this page
Positions of words, links going out, etc
Add each link on the current page to the queue
Record that this page has been processed
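
The queue-based procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (TOY_WEB), and the names crawl and fetch are invented for the example rather than taken from any real crawler.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> (text, outgoing links).
TOY_WEB = {
    "a.example/": ("welcome to a", ["a.example/about", "b.example/"]),
    "a.example/about": ("about a", ["a.example/"]),
    "b.example/": ("welcome to b", ["a.example/", "c.example/"]),
    "c.example/": ("welcome to c", []),
}

def crawl(seeds, fetch):
    """Breadth-first crawl; fetch(url) returns (text, links) or None."""
    queue = deque(seeds)          # put a set of known sites on a queue
    processed = {}                # url -> recorded text (None if unavailable)
    while queue:                  # repeat until the queue is empty
        url = queue.popleft()     # take the first page off the queue
        if url in processed:      # skip pages that were already processed
            continue
        page = fetch(url)
        if page is None:          # server unavailable or unknown page
            processed[url] = None
            continue
        text, links = page
        processed[url] = text     # record the information found on the page
        queue.extend(links)       # add each outgoing link to the queue
    return processed

pages = crawl(["a.example/"], TOY_WEB.get)
print(sorted(pages))
```

Marking a page as processed before following its links is what keeps the two pages that link to each other from producing an infinite loop.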

13
Web Crawling Issues

Keep out signs
A file called robots.txt tells the crawler which directories are off limits
Freshness
Figure out which pages change often
Recrawl these often
Duplicates, virtual hosts, etc
Convert page contents with a hash function
Compare new pages to the hash table
Lots of problems
Server unavailable
Incorrect html
Missing links
Infinite loops
Web crawling is difficult to do robustly!
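
Two of these issues are easy to sketch in Python: honoring the "keep out" file and detecting duplicate content with a hash. The example uses the standard-library urllib.robotparser and hashlib modules. The robots.txt content, the agent name "toy-crawler" and the helper names allowed and is_duplicate are assumptions made up for the illustration.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())   # parse rules without any network access

def allowed(url, agent="toy-crawler"):
    """True if the parsed rules let this agent fetch the URL."""
    return robots.can_fetch(agent, url)

seen_hashes = set()                      # hashes of content stored so far

def is_duplicate(content):
    """True if byte-identical content was already stored."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/report.html"))
```

An exact hash only catches identical copies. Near-duplicates (the same page with a different timestamp, for instance) would need a fuzzier comparison, which this sketch does not attempt.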

14
Indexing

Analysis
.pdf, flash, script,
Extract words from text
Whitespace
Long-distance
Manipulate or modify words
Stemming (walking -> walk)
Removal of frequent words (and, or, the)
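
The word-level steps above might look roughly like this in Python. The stemmer here is a deliberately naive suffix stripper written for the example (real engines use something like the Porter stemmer), and the function names naive_stem and analyze are assumptions, not part of any library.

```python
import re

STOP_WORDS = {"and", "or", "the"}   # the frequent words named on the slide

def naive_stem(word):
    """Toy suffix stripper (not the Porter stemmer): walking -> walk."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, split on anything that is not a letter, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Walking and talking: the long-distance call"))
```

Note that splitting on non-letters turns "long-distance" into two tokens. Whether a hyphenated word should be one token or two is exactly the kind of decision the analysis step has to make.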

15
Inverted index
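
An inverted index records, for every word, which documents contain it and where. A minimal Python sketch, assuming whitespace tokenization and a tiny made-up document collection (DOCS). The names build_inverted_index and search are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: doc_id -> text. Returns word -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Doc ids that contain every query word (boolean AND)."""
    postings = [set(index.get(w, {})) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Made-up documents for illustration.
DOCS = {
    1: "valentine day history",
    2: "hallmark holiday cards",
    3: "valentine holiday",
}
index = build_inverted_index(DOCS)
print(search(index, "valentine holiday"))
```

Answering a query then means intersecting a few posting lists instead of scanning every document.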
16
Word match

TF: term frequency
IDF: inverse document frequency
Document size
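
One common way to combine these three signals is a TF-IDF score. The sketch below assumes a particular weighting (term count divided by document length, times log(N / df)). Many variants exist, and nothing here is claimed to be what any specific engine uses. The function name tf_idf_scores and the sample documents are made up.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each doc: sum over query terms of (tf / doc length) * log(N / df)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of docs containing the term.
            df = sum(1 for w in tokenized.values() if term in w)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)        # normalize by document size
            score += tf * math.log(n_docs / df)   # rare terms weigh more
        scores[doc_id] = score
    return scores

SAMPLE = {
    "a": "valentine valentine cards",
    "b": "valentine holiday",
    "c": "weather report",
}
scores = tf_idf_scores("valentine holiday", SAMPLE)
print(sorted(scores, key=scores.get, reverse=True))
```

Document "b" outranks "a" even though "a" mentions "valentine" twice, because "holiday" appears in only one document and therefore carries more weight.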

Valentine Hallmark holiday
Valentine
Every February, across the country, candy,
flowers, and gifts are exchanged between loved
ones, all in the name of St. Valentine. But who
is this mysterious saint and why do we celebrate
this holiday? The history of Valentine's Day --
and its patron saint -- is shrouded in mystery.
But we do know that February has long been a
month of romance. St. Valentine's Day, as we know
it today, contains vestiges of both Christian and
ancient Roman tradition. So, who was Saint
Valentine and how did he become associated with
this ancient rite? Today, the Catholic Church
recognizes at least three different saints named
Valentine or Valentinus, all of whom were
martyred. One legend contends that Valentine was
a priest who served during the third century in
Rome. When Emperor Claudius II decided that
single men made better soldiers than those with
wives and families, he outlawed marriage for
young men -- his crop of potential soldiers.
Valentine, realizing the injustice of the decree,
defied Claudius and continued to perform
marriages for young lovers in secret. When
Valentine's actions were discovered, Claudius
ordered that he be put to death. Other stories
suggest that Valentine may have been killed for
attempting to help Christians escape harsh Roman
prisons where they were often beaten and
tortured. According to one legend, Valentine
actually sent the first 'valentine' greeting
himself. While in prison, it is believed that
Valentine fell in love with a young girl -- who
may have been his jailor's daughter -- who
visited him during his confinement. Before his
death, it is alleged that he wrote her a letter,
which he signed 'From your Valentine,' an
expression that is still in use today. Although
the truth behind the Valentine legends is murky,
the stories certainly emphasize his appeal as a
sympathetic, heroic, and, most importantly,
romantic figure. It's no surprise that by the
Middle Ages, Valentine was one of the most
popular saints in England and France
17
PageRank

Let A1, A2, ..., An be the pages that point to page A. Let C(P) be the number of links out of page P. The PageRank (PR) of page A is defined as

PR(A) = (1-d) + d (PR(A1)/C(A1) + ... + PR(An)/C(An))

d is the damping factor: the probability that the surfer follows a link from the current page. With probability 1-d the surfer gets bored and jumps to a random page.
PageRank is the principal eigenvector of the link matrix of the web.
Can be computed as the fixpoint of the above equation.
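
The fixpoint can be approximated by simply applying the PageRank formula repeatedly. A small Python sketch, assuming d = 0.85 (a commonly quoted value), a fixed number of iterations, and a made-up three-page link graph. The function name pagerank is hypothetical.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: page -> list of pages it links to.

    Iterates PR(A) = (1 - d) + d * sum(PR(Ai) / C(Ai)) over the pages Ai
    that link to A, where C(Ai) is the number of links out of Ai.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {p: 1.0 for p in pages}               # initial guess
    for _ in range(iterations):
        new_ranks = {p: 1.0 - d for p in pages}   # the (1 - d) term
        for page, targets in links.items():
            if not targets:                       # page with no outgoing links
                continue
            share = d * ranks[page] / len(targets)
            for t in targets:
                new_ranks[t] += share
        ranks = new_ranks
    return ranks

# Made-up link graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this graph C ends up ranked highest because both other pages link to it, and A comes second because it receives all of C's rank.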
18
Ant trail

Web clicks
Users define links between pages
The pages that are most useful have the strongest pheromone trail

More money: higher rank on the list

goto.com
20
Display

Vivisimo

21
Federation

Multiple search engines
One display

22
Serving (Google)

Sorted barrels: inverted index
PageRank computed from link structure, combined with IR rank
Billion documents
Hundred million queries a day
http://infolab.stanford.edu/backrub/google.html

23
What am I to do?

Crawling
Sitemaps to help search engines to find you
Indexing
Make sure the page/text is visible to search
engines
Ranking
Get linked to by big sites

24
Sitemaps

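<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://somewebsite.com/page1</loc>
    <lastmod>2008-01-09T12:29:23-08:00</lastmod>
    <priority>0.5</priority>
  </url>
</urlset>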
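
A sitemap like this one can be generated rather than written by hand. A minimal Python sketch using the standard-library xml.etree.ElementTree module. The function name build_sitemap is made up, the sample entry reuses the URL from the slide, and the timestamp is the slide's value written in W3C datetime form, which is an assumption about what the slide originally showed.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of (loc, lastmod, priority) string tuples -> XML string."""
    # Setting "xmlns" as a plain attribute declares the default namespace.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap([
    ("http://somewebsite.com/page1", "2008-01-09T12:29:23-08:00", "0.5"),
])
print(xml_text)
```

The resulting file is usually placed at the site root so that crawlers can discover pages that are hard to reach by following links.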

25
What will a search engine see?
Welcome to By Design Furniture sales enter ENTER
http://www.bydesignfurniture.com/
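
To get a feel for how little text a splash page exposes, one can strip a page down to what a simple text-only crawler would keep. The Python sketch below uses the standard-library html.parser module. The class name VisibleText, the helper visible_text and the sample page SPLASH are invented for the illustration and do not describe how any real search engine parses pages.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text a simple crawler could index: text nodes and img alt text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0   # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical splash page: almost everything is inside an image or a script.
SPLASH = ('<html><body><script>play()</script>'
          '<img src="intro.gif" alt="ENTER"><a href="/home">enter</a>'
          '</body></html>')
print(visible_text(SPLASH))
```

If the words that describe the site live only inside images, Flash or scripts, there is almost nothing left for the index.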
26
But they see more

If you're part of our gadget developer
community, perhaps hearing about interesting and
unique ways people are using gadgets will help
spark some creative ideas. But whether you are
HTML-savvy or not, and you want to show your
sweetie how much you care, it's very easy to be
able to create gadgets. Just visit the Google
Gadget Center or Gadget Maker and give it a try.
http://googleblog.blogspot.com/

<a href="...about/" id="t6i." title="Google Gadget Center">Google Gadget Center</a>
Who gets the credit?
27
What about searching my site?