What do Search Engines Consist of - PowerPoint PPT Presentation

Provided by: ccGa

1
What do Search Engines Consist of ?

A Database of Web Documents.
A Search Engine operating on that database.
A Series of programs that determine how search
results are displayed.
The mission of Google is "to organize the
world's information and make it universally
accessible and useful."

2
What Makes a Good Search Engine ?

Size of the database
(a) How many documents does the search engine
claim it has?
(b) How much of the total web are you able to
search?
Freshness
(a) How often is the database refreshed to
find new pages?
(b) How often do crawlers update their copies?
Speed and Consistency
(a) How fast is it?
(b) How consistent is it?

3
What Makes a Good Search Engine ?

Basic Search options and limitations
(a) Is AND automatically assumed between
words by default?
(b) Is there a way to allow for synonyms?
Advanced Search Options
(a) Can you restrict results to documents from a
certain domain?
(b) Can you limit by language or type of
document?
(c) Can you search within previous results?
Ranking
(a) Are results ranked by popularity or relevance?
Display

4
Features

Boolean Capabilities and Constraints
AND requires that both terms be found.
OR lets either term be found.
NOT means any record containing the second term
will be excluded.
( ) means the Boolean operators can be nested
using parentheses.
+ is equivalent to AND, requiring the term; the +
should be placed directly in front of the search
term.
- is equivalent to NOT and means to exclude the
term; the - should be placed directly in front of
the search term.
Proximity
Refers to the ability to specify how close within a
record multiple terms should be to each other.
The most commonly used proximity search is a
phrase search, which
requires terms to be in the exact order specified
within the phrase markings.
The default standard for identifying phrases uses
double quotes (" ") to surround the phrase.
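
The Boolean operators and phrase search described above can be sketched over a tiny in-memory collection. This is only an illustration: the documents and the helper names docs_with and phrase are hypothetical, and a real engine would consult an index rather than scan the text.

```python
# Toy collection; the documents and function names are illustrative assumptions.
DOCS = {
    1: "search engines index the web",
    2: "a web directory is not a search engine",
    3: "engines search for new pages",
}

def docs_with(term):
    """Return the set of document IDs whose text contains the term as a word."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}

def phrase(words):
    """Return IDs of documents containing the words adjacent and in order."""
    target = words.split()
    hits = set()
    for doc_id, text in DOCS.items():
        tokens = text.split()
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                hits.add(doc_id)
                break
    return hits

# Boolean operators map directly onto set operations.
and_result = docs_with("search") & docs_with("web")      # search AND web
or_result = docs_with("directory") | docs_with("pages")  # directory OR pages
not_result = docs_with("search") - docs_with("web")      # search NOT web
phrase_result = phrase("search engines")                 # "search engines"
```

Here AND is a set intersection, OR a union, and NOT a difference; the phrase search matches only document 1, because the other documents contain the two words in a different form or order.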

5
Features

Truncation and Stemming
Truncation
Refers to the ability to search on just a portion of
a word.
Stemming
Refers to the ability of a search engine to find
word variants such as plurals, singular forms,
past tense, present tense, etc.
Case Sensitivity
Field Searching
Allows the searcher to designate where a specific
search term will appear.
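
The difference between truncation and stemming can be sketched as follows. The vocabulary and the suffix rules are assumptions made for illustration; naive_stem is a deliberately crude stand-in, and production stemmers apply many more rules.

```python
# Hypothetical vocabulary used only for this illustration.
VOCABULARY = ["airline", "airlines", "airport", "search", "searched", "searching"]

def truncate_match(prefix, vocabulary):
    """Truncation: match every word that starts with the given portion."""
    return [word for word in vocabulary if word.startswith(prefix)]

def naive_stem(word):
    """Crude suffix stripping; an assumption, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_match(query, vocabulary):
    """Stemming: match every word that shares a stem with the query."""
    stem = naive_stem(query)
    return [word for word in vocabulary if naive_stem(word) == stem]
```

With this vocabulary, truncate_match("air", VOCABULARY) returns airline, airlines and airport, while stem_match("searching", VOCABULARY) returns search, searched and searching.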

6
Features

Limits
The ability to narrow search results by adding a
specific restriction to the search
File Types
Stop Words
Sorting
Typically, Internet search engines sort the
results by "relevance" determined by their
proprietary relevance ranking algorithms.
Other options are to arrange the results by date,
alphabetically by title, or by root URL or host
name.

7
Google

Strengths
Size and scope: it is one of the largest, and
includes PDF, DOC, PS, and many other file types
Relevance based on sites' linkages and authority
Cached archive of Web pages as they looked when
they were indexed
Additional databases: Google Groups, News,
Directory, Books, Scholar, etc.
Weaknesses
Limited search features: no nesting, no
truncation, does not support full Boolean
Link searches must be exact and are incomplete
Only indexes first 101 KB of a Web page and about
120 KB of PDF files
May search for plural/singular, synonyms, and
grammatical variants without telling you
Not as comprehensive as it is rumored to be

8
Yahoo

Strengths
A large, unique search engine database
Includes cached copies of pages
Includes links to the Yahoo! Directory
Supports full Boolean searching
Wild Card Word in Phrase
Weaknesses
Lack of some advanced search features such as
truncation
Only indexes first 500 KB of a Web page (still
more than Google's 101KB)
Link searches require the inclusion of the
http://
File type search uses originurlextension: rather
than filetype:
Includes some pay-for-inclusion sites

9
Ask

Strengths
Identifying metasites
Refine feature to focus on Web communities
Weaknesses
Smaller database
No free URL submission
No ability to uncluster results to easily see
more than two hits per site
No cached copies of pages

10
MSN Live Search

Strengths
Large, fresh, unique database
Query-building Advanced Search and full Boolean
searching
Cached copies of Web pages including date cached
Automatic local search options.
Weaknesses
No truncation, stemming, or wild card word in a
phrase
Limited to 10 words in a query
Advanced search not on front page, but available
after running a search

11
Gigablast

Summary
Debuted in beta July 21, 2002
Strengths
Date reporting (including date indexed and date
last modified)
Cached pages and links to the Wayback Machine
Includes PDF and other file types and cached HTML
versions of these other file types
Indexing and displaying of meta tags
Weaknesses
Small database and not refreshed as frequently as
others
Lacks truncation, proximity, and other advanced
search features.

12
Exalead

Summary: Exalead is a newer search engine that
arrived in October 2004. Hailing from France, it
offers a distinctive approach to
presenting results.
Strengths
Truncation, proximity, and many other advanced
operators not available from other search engines
Includes thumbnails of pages
Provides excellent narrowing options on right
side
Weaknesses
Smaller database than the major search engines
Few people know about or use it
May not be updated as frequently

13
Features by Search Engine
14
Search Engine Ratings
15
Search Engine Ratings
16
The Page Rank Algorithm

Assume a page A has pages T1, T2, ..., Tn pointing to
it.
Let d be a damping factor whose value is set to
0.85 (say).
Let C(A) be defined as the number of links going out
of page A.
The Page Rank PR(A) can then be defined as
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn))
with the sum of all web pages' Page Ranks being equal
to 1.
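
The formula can be evaluated by repeated substitution. The sketch below is a minimal illustration, not Google's implementation: the three-page link graph, the function name pagerank and the iteration count are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to. Every page is
    assumed to have at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Contribution from every page T that links to this page.
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

In this graph C ends up with the highest rank and B with the lowest. With the formula exactly as written, the values sum to the number of pages (3 here); the statement that the ranks sum to 1 holds for the normalized variant in which (1 - d) is divided by the number of pages.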

17
Architecture Overview
18
The basic operations of a search engine can be
divided into:

Crawling
Indexing
Sorting

19
Crawling

Crawling is the process of following links to
locate and read pages.
Crawling is the most fragile application since it
involves interacting with hundreds of thousands of web servers.
A single URL server serves lists of URLs to a
number of crawlers, implemented in Python.
At peak speeds the system could crawl more than 100
pages a second using 4 crawlers, as of 1998.
This amounts to roughly 600 KB per second of data.

20
Crawling

The Google crawler, known as Googlebot, crawls
all the URLs it knows every few weeks to keep its
information up to date.
Each crawler has a DNS cache so it does not need
to do a DNS lookup before crawling each
document.
Googlebot obeys the robots.txt directives,
avoiding the pages which the webmaster has
designated as off limits.
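
How a crawler might honor robots.txt can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the user agent name ExampleBot and the URLs are made up for illustration; a real crawler would fetch the file from each site before crawling it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without network access

def allowed(url, user_agent="ExampleBot"):
    """Return True if the robots.txt rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

Under these rules, allowed("http://example.com/index.html") is True and allowed("http://example.com/private/data.html") is False, so the crawler would skip the second URL.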

21
Indexing

After each document is parsed, it is encoded into
a number of barrels. Every word is converted to
a wordID using an in-memory hash table.
Once words are converted to wordIDs, their
occurrences in the current document are translated
into hit lists and are written into the forward
barrels.
In short, for every word the system keeps a list
of pages the word occurs in.
Google knows about 10 billion web documents.
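
The indexing steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: plain dictionaries stand in for the barrels, a hit is reduced to a word position, and the names word_id and build_indexes are hypothetical.

```python
from collections import defaultdict

word_ids = {}  # in-memory hash table: word -> wordID

def word_id(word):
    """Assign the next free wordID the first time a word is seen."""
    if word not in word_ids:
        word_ids[word] = len(word_ids)
    return word_ids[word]

def build_indexes(docs):
    """Return (forward, inverted) indexes for a {docID: text} mapping."""
    forward = {}                   # docID -> {wordID: [positions]} (hit lists)
    inverted = defaultdict(list)   # wordID -> [docIDs containing the word]
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            hits[word_id(word)].append(position)
        forward[doc_id] = dict(hits)
        for wid in hits:
            inverted[wid].append(doc_id)
    return forward, dict(inverted)

# Hypothetical two-document collection.
DOCS = {1: "web search engine", 2: "search the web and search again"}
forward, inverted = build_indexes(DOCS)
```

The forward index records, per document, where each wordID occurs; the inverted index is the "list of pages the word occurs in" and is what a query consults.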

22
Sorting

Once Google has matched a word in an index, it
wants to put the best document first. It chooses
the best document based on a number of techniques:
1. Text analysis.
2. Links and link text/anchor text.
3. Page Rank.

23
Google Query Evaluation

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doc list in the short
barrel for every word.
4. Scan through the doc lists until there is a
document that matches all of the search terms.
5. Compute the rank of that document for the
query.
6. Sort the documents that have matched by rank
and return the top K.
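
The six steps above can be sketched against a small in-memory index. Several simplifications are assumed: terms are used directly instead of wordIDs, a dictionary stands in for the barrels, and the rank is simply the sum of term frequencies, which is a placeholder rather than Google's ranking function.

```python
# Hypothetical index: term -> {docID: term frequency}.
INDEX = {
    "search": {1: 3, 2: 1, 3: 2},
    "engine": {1: 1, 3: 4},
    "web": {2: 5, 3: 1},
}

def evaluate(query, k=2):
    """Return the top-k docIDs that contain every term of the query."""
    terms = query.lower().split()                        # steps 1-2: parse
    doc_lists = [INDEX.get(term, {}) for term in terms]  # step 3: doc lists
    if not doc_lists:
        return []
    matches = set(doc_lists[0])
    for doc_list in doc_lists[1:]:
        matches &= set(doc_list)                         # step 4: all terms
    # Step 5: placeholder rank, the summed term frequency per document.
    scored = [(sum(dl[doc] for dl in doc_lists), doc) for doc in matches]
    # Step 6: sort by rank (highest first), break ties by docID, keep top k.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in scored[:k]]
```

For example, evaluate("search engine") returns [3, 1]: only documents 1 and 3 contain both terms, and document 3 has the higher combined frequency.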

24
Google v/s Inktomi

Google likes to include as many pages as it can
find. Inktomi would rather not clutter its index
with pages of little value. This makes Google
useful when conducting very specific searches -
such as researching an individual.
Inktomi's ranking algorithm has changed over the
past months, but compared to Google, Inktomi has
been very stable.
As long as Google places an inordinate amount of
weight on any single factor such as anchor text,
aggressive site promoters can play that factor to
their benefit.

25
Meta Search Engines

A meta search engine basically searches multiple
search engines simultaneously and displays the
results based on certain preferences.
A meta search engine doesn't have a database of
its own. It sends search terms to the databases
maintained by search engine companies.
It basically works on the principle that more heads
are better than one.
Smarter meta search engines come with options
like textual analysis that let one dig deeply
into search results.
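
One way a meta search engine could combine the ranked lists it gets back is reciprocal rank fusion. The slides do not say which merging method any particular meta search engine uses, so this technique is swapped in purely for illustration, and the engine result lists are invented.

```python
from collections import defaultdict

def fuse(result_lists, k=60):
    """Merge ranked result lists with reciprocal rank fusion.

    Each URL scores 1 / (k + rank) for every list it appears in; k = 60 is
    a commonly used constant and is an assumption here.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical results returned by three underlying engines for one query.
ENGINE_A = ["a.example", "b.example", "c.example"]
ENGINE_B = ["b.example", "a.example", "d.example"]
ENGINE_C = ["b.example", "c.example", "a.example"]
merged = fuse([ENGINE_A, ENGINE_B, ENGINE_C])
```

A page that several engines rank highly (b.example here) rises to the top of the merged list, while a page returned by only one engine (d.example) falls to the bottom.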

26
Some Good Meta Search Engines

www.clusty.com searches a number of free search
engine directories. Does not include Yahoo! and
Google.
www.dogpile.com searches Google,
Yahoo!, LookSmart, Ask Jeeves, and MSN Search.
www.copernic.com (Copernic Agent) lets you select from a
list of search engines by changing the properties
dialogue.

27
Things you can do on Google, Yahoo! and Ask

Phrase searching: by enclosing terms in double
quotes.
OR searching: with capitalized OR.
- excludes; + requires the exact form of the
word.
Limit results by using advanced search.
Things not supported on Google, Yahoo! and Ask
Truncation: use OR searches for variants (e.g.
Airlines OR Airline).
Case sensitivity: capitalization does not matter.

30
Trends On The Web

50% of Web users use search engines as a
starting point
The web is estimated to have billions of pages
New pages are created at a rate of 8% per week

31
What Do The Trends Indicate?

Search engines dictate a large amount of web
traffic
Crawlers must be able to scale to the rapidly
growing web
Query engines must be able to cope with the large
amount of data indexed by crawlers
Search engines have to give users relevant
information with less effort from the users

32
Search Engine Bias

Do search engines dictate what is popular and
what is not?
How much control does a search engine have over
content on the web?
Is page rank flawed?

33
Current Trends: Search Engine Bias

A "rich get richer, poor stay poor" phenomenon is
occurring
Research in this area is still new

34
Web Crawlers

Typically 3 Areas of Research
General Architecture
Page Selection
Page Update

35
Web Crawlers Continued

General Architecture
How should parallel crawlers be designed?
Page Selection
What should be the next page visited once a site
is harvested?
Page Updating
How often should a page be refreshed?

36
Database Indexing

Databases use a static index of all data
available
At query time, the index is dynamically traversed
The result represents all known data that
satisfies the query

37
Web Indexing

Typically search engines maintain a frequency
index and a positional index
Frequency index
For every term, stores the frequency of that term
for all web pages
Positional Index
For every term, stores all positions of that term
in all web pages
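
The two index types can be sketched together, since the frequency index can be derived from the positional one. The page names, the text and the function name build are assumptions made for illustration.

```python
from collections import defaultdict

def build(pages):
    """Return (frequency, positional) indexes for a {page: text} mapping."""
    positional = defaultdict(dict)  # term -> {page: [positions of the term]}
    for page, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            positional[term].setdefault(page, []).append(position)
    # The frequency of a term in a page is the number of positions recorded.
    frequency = {
        term: {page: len(positions) for page, positions in by_page.items()}
        for term, by_page in positional.items()
    }
    return frequency, dict(positional)

# Hypothetical two-page collection.
PAGES = {"p1": "to be or not to be", "p2": "to search is to find"}
frequency, positional = build(PAGES)
```

The frequency index is enough for ranking by term counts, whereas the positional index is what makes phrase and proximity searches possible, at the cost of storing one entry per occurrence.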

38
Improving Query Time For Search Engines

Disk size is a scalability issue
A static index of all web pages becomes somewhat
impractical
The size of a complete static index becomes a
bottleneck for query response time

39
Improving Query Time Continued

Research focuses on compression of indexes
Lossless and, more recently, lossy compression are
used
Locality Based Pruning Method (LBPM) is a lossy
compression technique
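
The lossless side of index compression can be sketched with gap encoding followed by variable-byte coding. This is a common textbook technique chosen here as an illustration; it is not the LBPM method named above (which is lossy), and the function names and posting list are assumptions.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers using 7 data bits per byte.

    The high bit marks the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # final byte, high bit set
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:           # final byte of the current number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

def compress(doc_ids):
    """Store a sorted docID list as gaps between neighbours, then encode."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress(data):
    """Rebuild the docID list by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

POSTINGS = [824, 829, 215406]  # hypothetical sorted posting list
packed = compress(POSTINGS)
```

Because neighbouring docIDs in a posting list tend to be close together, the gaps are small numbers that fit in one or two bytes each; the three docIDs here take 6 bytes, and decompress(packed) restores the list exactly.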

40
Personalized Search

The massive amount of data requires a more
personalized search paradigm
Searching with context personalizes searches
behind the scenes
Probabilistic Query Expansion is another avenue
to alleviate the size of the web

41
DEMO!!