Title: Searching, being found
1Searching, being found
2Outline
- History
- What can be found on the web
- WWW vs. corporate web
- Internals of search engine
- What can I do? How to get found
- Corporate search
3History
- The web happened (1992)
- Mosaic/Netscape happened (1993-95)
- Crawler happened (1994) M. Mauldin
- SEs happened 1994-1996
- InfoSeek, Lycos, Altavista, Excite, Inktomi,
- Yahoo decided to go with a directory
- Google happened 1996-98
- Tried selling technology to other engines
- SEs thought search was a commodity, portals were in
- Microsoft said whatever
4 d
- New Standard
- Best of breed
- Or is it?
5Web Search Engine Characteristics
- Unedited: anyone can enter content
- Quality issues: spam
- Scale
- Hundreds of millions of searches/day, billions of docs
- Varied information types
- Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
- Different kinds of users
- Web: every type of person with every type of goal
- Online catalogs: scholars searching scholarly literature
- Lexis-Nexis: paying, professional searchers
6Web Search Queries
- Web search queries are short
- 2.4 words on average (Aug 2000)
- Has increased, was 1.7 (1997)
- User Expectations
- Many say "The first item shown should be what I want to see!"
- This works if the user has the most popular/common notion in mind, not otherwise.
7Corporate web
- You will not find this by Googling
- Hidden behind firewalls
- and passwords
- Inside databases
- Within applications
8Directories vs. Search Engines
- Directories
- Hand-selected sites
- Search over the contents of the descriptions of the pages
- Organized in advance into categories
- Search Engines
- All pages in all sites
- Search over the contents of the pages themselves
- Organized in response to a query by relevance
rankings or other scores
9Standard Web Search Engine Architecture
[Figure: crawl the web -> check for duplicates, store the documents -> DocIds -> create an inverted index -> inverted index -> search engine servers; user query -> search engine servers -> show results to user]
10Standard Web Search Engine Processes
- Crawling
- Follow links to find information
- Indexing
- Record what words appear where
- Ranking
- What information is a good match to a user query?
- What information is inherently good?
- Displaying
- Find a good format for the information
- Serving
- Handle queries, find pages, display results
11Crawling
- How do the web search engines get all of the items they index?
- Main idea
- Start with known sites
- Record information for these sites
- Follow the links from each site
- Record information found at new sites
- Repeat
12Web Crawling Algorithm
- More precisely
- Put a set of known sites on a queue
- Repeat the following until the queue is empty
- Take the first page off of the queue
- If this page has not yet been processed
- Record the information found on this page
- Positions of words, links going out, etc
- Add each link on the current page to the queue
- Record that this page has been processed
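The queue-based procedure above can be sketched in Python. To keep the sketch runnable offline, the "web" here is a hypothetical in-memory dictionary, TOY_WEB, mapping each URL to its text and outgoing links; a real crawler would fetch pages over HTTP instead. All names are illustrative assumptions.

```python
from collections import deque

# Hypothetical toy web: URL -> (page text, outgoing links). Not real sites.
TOY_WEB = {
    "a.example": ("welcome to a", ["b.example", "c.example"]),
    "b.example": ("b links back", ["a.example"]),
    "c.example": ("c is a leaf", ["d.example"]),
    "d.example": ("d links to c", ["c.example"]),
}

def crawl(seeds, web):
    """Breadth-first crawl; returns {url: recorded info} for every reachable page."""
    queue = deque(seeds)          # put the known sites on a queue
    processed = {}                # pages already handled, with what was recorded
    while queue:                  # repeat until the queue is empty
        url = queue.popleft()     # take the first page off the queue
        if url in processed or url not in web:
            continue              # skip pages seen before or unavailable
        text, links = web[url]
        # Record word positions and outgoing links for this page.
        positions = {}
        for pos, word in enumerate(text.split()):
            positions.setdefault(word, []).append(pos)
        processed[url] = {"positions": positions, "links": links}
        queue.extend(links)       # add each link on the page to the queue
    return processed

pages = crawl(["a.example"], TOY_WEB)
print(sorted(pages))
```

The "already processed" check is what keeps the loop between c.example and d.example from running forever.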
13Web Crawling Issues
- Keep out signs
- A file called robots.txt tells the crawler which directories are off limits
- Freshness
- Figure out which pages change often
- Recrawl these often
- Duplicates, virtual hosts, etc
- Convert page contents with a hash function
- Compare new pages to the hash table
- Lots of problems
- Server unavailable
- Incorrect html
- Missing links
- Infinite loops
- Web crawling is difficult to do robustly!
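Two of the issues above, keep-out signs and duplicates, can be sketched with the Python standard library. The robots.txt content, the URLs, and the agent name "toy-crawler" are hypothetical, and hashing the full page text only catches exact duplicates, not near-duplicates.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_robot_rules(robots_txt):
    """Parse robots.txt text into a rule object."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def content_fingerprint(page_text):
    """Hash the page contents so identical pages map to the same key."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

rules = make_robot_rules(ROBOTS_TXT)
seen_hashes = set()   # the "hash table" of pages stored so far

def should_store(url, page_text, agent="toy-crawler"):
    """True if the URL is allowed and the content is not an exact duplicate."""
    if not rules.can_fetch(agent, url):
        return False
    fingerprint = content_fingerprint(page_text)
    if fingerprint in seen_hashes:
        return False
    seen_hashes.add(fingerprint)
    return True
```

For example, the same text served from a mirror or a second virtual host is stored only once, because both copies hash to the same fingerprint.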
14Indexing
- Analysis
- .pdf, flash, script,
- Extract words from text
- Whitespace
- Long-distance
- Manipulate or modify words
- Stemming (Walking -> Walk)
- Removal of frequent words (and, or, the)
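The analysis steps above (extract words, stem, drop frequent words) might look like the following minimal sketch. The suffix-stripping function naive_stem is a deliberately crude assumption for illustration, not a real stemmer such as Porter's, and the stop-word list is just the three words named on the slide.

```python
import re

STOP_WORDS = {"and", "or", "the"}   # the frequent words named on the slide

def naive_stem(word):
    """Very rough suffix stripping; an assumption, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, split on non-letters, drop stop words, stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Walking and the long-distance walks"))
```

Splitting on non-letters turns "long-distance" into two tokens; whether a hyphenated word should be one token or two is exactly the kind of decision an indexer has to make.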
15Inverted index
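An inverted index records which words appear where, so a query can go straight from a word to the documents containing it. A minimal sketch, assuming a hypothetical three-document collection keyed by DocId:

```python
from collections import defaultdict

# Hypothetical three-document collection, keyed by DocId.
DOCS = {
    1: "valentine cards and candy",
    2: "history of saint valentine",
    3: "candy and flowers",
}

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]} so lookups go from word to documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search_all(index, query):
    """Return the sorted DocIds that contain every query word (Boolean AND)."""
    postings = [set(index.get(word, {})) for word in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(DOCS)
print(search_all(index, "valentine candy"))
```

Storing positions as well as DocIds is what later makes phrase and proximity matching possible.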
16Word match
- TF: term frequency
- IDF: inverse document frequency
- Document size
Valentine Hallmark holiday
Valentine
Every February, across the country, candy,
flowers, and gifts are exchanged between loved
ones, all in the name of St. Valentine. But who
is this mysterious saint and why do we celebrate
this holiday? The history of Valentine's Day --
and its patron saint -- is shrouded in mystery.
But we do know that February has long been a
month of romance. St. Valentine's Day, as we know
it today, contains vestiges of both Christian and
ancient Roman tradition. So, who was Saint
Valentine and how did he become associated with
this ancient rite? Today, the Catholic Church
recognizes at least three different saints named
Valentine or Valentinus, all of whom were
martyred. One legend contends that Valentine was
a priest who served during the third century in
Rome. When Emperor Claudius II decided that
single men made better soldiers than those with
wives and families, he outlawed marriage for
young men -- his crop of potential soldiers.
Valentine, realizing the injustice of the decree,
defied Claudius and continued to perform
marriages for young lovers in secret. When
Valentine's actions were discovered, Claudius
ordered that he be put to death. Other stories
suggest that Valentine may have been killed for
attempting to help Christians escape harsh Roman
prisons where they were often beaten and
tortured. According to one legend, Valentine
actually sent the first 'valentine' greeting
himself. While in prison, it is believed that
Valentine fell in love with a young girl -- who
may have been his jailor's daughter -- who
visited him during his confinement. Before his
death, it is alleged that he wrote her a letter,
which he signed 'From your Valentine,' an
expression that is still in use today. Although
the truth behind the Valentine legends is murky,
the stories certainly emphasize his appeal as a
sympathetic, heroic, and, most importantly,
romantic figure. It's no surprise that by the
Middle Ages, Valentine was one of the most
popular saints in England and France
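A word-match score built from TF, IDF, and document size can be sketched as below. This is one of many TF-IDF variants, chosen here as an assumption: term frequency is divided by document length and multiplied by the log of (number of documents / documents containing the term). The toy collection is hypothetical and only echoes words from the Valentine example.

```python
import math

# Hypothetical toy collection; the words echo the Valentine example above.
DOCS = {
    "history": "valentine saint valentine legend rome",
    "cards": "hallmark holiday cards valentine",
    "weather": "holiday weather report",
}

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of (tf / doc length) * idf."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        score = 0.0
        for term in query.split():
            tf = words.count(term) / len(words)           # normalize by document size
            df = sum(term in w for w in tokenized.values())
            if df:
                score += tf * math.log(n_docs / df)       # rare terms weigh more
        scores[name] = score
    return scores

scores = tf_idf_scores("valentine hallmark holiday", DOCS)
print(sorted(scores, key=scores.get, reverse=True))
```

The "cards" document ranks first because it matches all three query words, including the rare term "hallmark", even though "history" mentions "valentine" more often.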
17PageRank
- Let A1, A2, ..., An be the pages that point to page A. Let C(P) be the number of links out of page P. The PageRank (PR) of page A is defined as
PR(A) = (1 - d) + d ( PR(A1)/C(A1) + ... + PR(An)/C(An) )
- d is the damping factor: the probability that the surfer keeps following links; 1 - d is the probability of getting bored at a page and jumping elsewhere
- PageRank is the principal eigenvector of the link matrix of the web.
- Can be computed as the fixpoint of the above equation.
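The fixpoint can be found by simple iteration of PR(A) = (1 - d) + d ( PR(A1)/C(A1) + ... + PR(An)/C(An) ). The sketch below assumes a hypothetical three-page link graph and d = 0.85, a commonly cited value; neither comes from the slide.

```python
def pagerank(links, d=0.85, iterations=200, tol=1e-10):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ai) / C(Ai)) until it stops changing."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}       # arbitrary starting guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the contributions of every page that links to this one.
            incoming = sum(pr[src] / len(out)
                           for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * incoming
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:
            break
    return pr

# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
print(ranks)
```

In this graph C collects links from both A and B, so it ends up with the highest rank; B, which is linked only from A and shares A's vote with C, ends up lowest.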
18Ant trail
- Web clicks
- Users define links between pages
- The pages that are most useful have the strongest pheromone trail
19
- More money, higher rank on the list
- goto.com
20Display
21Federation
- Multiple search engines
- One display
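One simple way to put several engines behind one display is to interleave their ranked lists and drop repeats. The round-robin merge below is an assumption for illustration; the slide does not say how results are combined, and the engine result lists are hypothetical.

```python
from itertools import zip_longest

def federate(*result_lists):
    """Interleave ranked lists round-robin, keeping the first occurrence of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):   # tier = the i-th result from every engine
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example"]
print(federate(engine_a, engine_b))
```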
22Serving (Google)
- Sorted barrels inverted index
- PageRank computed from link structure, combined with IR rank
- Billion documents
- Hundred million queries a day
- http://infolab.stanford.edu/backrub/google.html
23What am I to do?
- Crawling
- Sitemaps to help search engines to find you
- Indexing
- Make sure the page/text is visible to search engines
- Ranking
- Get linked to by big sites
24Sitemaps
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://somewebsite.com/page1</loc>
    <lastmod>2008-01-09T12:29:23-08:00</lastmod>
    <priority>0.5</priority>
  </url>
</urlset>
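A sitemap in the sitemaps.org 0.9 format can be generated rather than written by hand. This is a minimal sketch using Python's standard xml.etree.ElementTree; the function name build_sitemap and the single example entry are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return sitemap XML text for (loc, lastmod, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc            # page address
        ET.SubElement(url, "lastmod").text = lastmod    # last modification time
        ET.SubElement(url, "priority").text = priority  # relative priority, 0.0 to 1.0
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap(
    [("http://somewebsite.com/page1", "2008-01-09T12:29:23-08:00", "0.5")]
)
print(xml_text)
```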
25What will a search engine see?
Welcome to By Design Furniture sales enter ENTER
http://www.bydesignfurniture.com/
26But they see more
- If you're part of our gadget developer
community, perhaps hearing about interesting and
unique ways people are using gadgets will help
spark some creative ideas. But whether you are
HTML-savvy or not, and you want to show your
sweetie how much you care, it's very easy to be
able to create gadgets. Just visit the Google
Gadget Center or Gadget Maker and give it a try.
- http://googleblog.blogspot.com/
- about/" id="t6i." title="Google Gadget Center">Google Gadget Center</a>
Who gets the credit?
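What an indexer "sees" is roughly the visible text plus the links and their anchor text. In the design described in the Google paper cited in this deck, anchor text is also associated with the page the link points to, which is one answer to who gets the credit. A minimal sketch with Python's standard html.parser; the page snippet and its URL are hypothetical placeholders.

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Collect visible text and (href, anchor text) pairs, roughly as an indexer might."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._href = None      # href of the <a> tag currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())
            if self._href is not None:
                self._anchor.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

# Hypothetical page; the URL is a placeholder, not the one from the slide.
PAGE = ('<p>Just visit the <a href="http://gadgets.example/about/" '
        'title="Google Gadget Center">Google Gadget Center</a> and give it a try.</p>')

view = CrawlerView()
view.feed(PAGE)
print(" ".join(view.text_parts))
print(view.links)
```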
27What about searching my site?
- Add search engine to your site
- HtDig
- Content Management Provider
- Autonomy
- Exalead
- Use advanced search directives
- search query site:www.mysite.com
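The site: directive can also be placed in a link or form on your own pages. A minimal sketch that builds such a search URL, assuming Google's public search endpoint and the placeholder site name from the slide:

```python
from urllib.parse import urlencode

def site_search_url(query, site):
    """Build a Google search URL restricted to one site with the site: operator."""
    params = {"q": f"{query} site:{site}"}
    return "https://www.google.com/search?" + urlencode(params)

print(site_search_url("search query", "www.mysite.com"))
```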
28Getting Googled some more
- rch"
- value="&nbsp; Enter Search Topic(s) here"
- value="bdsra.org"
- Site"
- value="bdsra.org"
29Web Search vs. IR
- Web search differs from traditional IR systems
- Different kind of collection
- Different kinds of users/queries
- Different economic motivations
- Ranking combines many features in a difficult-to-specify manner
- Link analysis and proximity of terms seem especially important
- This is in contrast to the term-frequency orientation of standard search
- Why?
30The Standard Information Retrieval Interaction
Model
31Web Search vs. IR (cont.)
- Web search engine architecture
- Similar in many ways to standard IR
- Indexes usually duplicated across machines to handle many queries quickly
- Web crawling
- Used to create the collection
- Can be guided by quality metrics
- Is very difficult to do robustly
32How do they do that?
- How do you see just a part of the source code?
33References
- Most of this slide show content has been shamelessly ripped out of other presentations
- http://www.sims.berkeley.edu/courses/is202/f00