What do Search Engines Consist of - PowerPoint PPT Presentation

Slides: 38
Provided by: ccGa


1
What Do Search Engines Consist Of?
  • A database of Web documents.
  • A search engine operating on that database.
  • A series of programs that determine how search
    results are displayed.
  • Google's stated mission is "to organize the
    world's information and make it universally
    accessible and useful."

2
What Makes a Good Search Engine?
  • Size of the database
  • (a) How many documents does the search engine
    claim to have?
  • (b) How much of the total Web are you able to
    search?
  • Freshness
  • (a) How often is the database refreshed to
    find new pages?
  • (b) How often do crawlers update their copies?
  • Speed and consistency
  • (a) How fast is it?
  • (b) How consistent is it?

3
What Makes a Good Search Engine?
  • Basic search options and limitations
  • (a) Is AND assumed by default between words?
  • (b) Is there a way to allow for synonyms?
  • Advanced search options
  • (a) Can you restrict results to documents from a
    certain domain?
  • (b) Can you limit by language or document type?
  • (c) Can you search within previous results?
  • Ranking
  • (a) Are results ranked by popularity or
    relevance?
  • Display

4
Features
  • Boolean Capabilities and Constraints
  • AND requires that both terms be found.
  • OR lets either term be found.
  • NOT means any record containing the second term
    will be excluded.
  • ( ) means the Boolean operators can be nested
    using parentheses.
  • + is equivalent to AND, requiring the term; the
    + should be placed directly in front of the
    search term.
  • - is equivalent to NOT and means to exclude the
    term; the - should be placed directly in front
    of the search term.
  • Proximity
  • Refers to the ability to specify how close within
    a record multiple terms should be to each other.
  • The most commonly used proximity search is a
    phrase search, which requires terms to be in the
    exact order specified within the phrase markings.
  • The default standard for identifying phrases uses
    double quotes (" ") to surround the phrase.
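
The Boolean and phrase operators on this slide map naturally onto set operations over an inverted index. The following is a minimal sketch under stated assumptions: the three documents, the helper names `docs_with` and `phrase`, and whitespace tokenization are all made up for illustration and do not describe how any particular search engine implements these features.

```python
# Toy corpus: document ID -> text (made-up data for illustration).
docs = {
    1: "web search engines crawl the web",
    2: "a search engine ranks web pages",
    3: "database engines answer queries",
}

# Inverted index: term -> set of IDs of the documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def docs_with(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())


def phrase(words):
    """Return documents where the words appear adjacently and in order."""
    wanted = words.split()
    hits = set()
    for doc_id, text in docs.items():
        tokens = text.split()
        for i in range(len(tokens) - len(wanted) + 1):
            if tokens[i:i + len(wanted)] == wanted:
                hits.add(doc_id)
                break
    return hits


and_result = docs_with("search") & docs_with("web")        # search AND web
or_result = docs_with("search") | docs_with("database")    # search OR database
not_result = docs_with("engines") - docs_with("web")       # engines NOT web
# Nesting with parentheses: (search OR database) AND engines
nested = (docs_with("search") | docs_with("database")) & docs_with("engines")
phrase_result = phrase("search engines")                   # "search engines"
```

Here AND is set intersection, OR is union, and NOT is set difference; the phrase search is the only operation that needs word order, which is why it scans token positions instead of using the sets alone.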

5
Features
  • Truncation and Stemming
  • Truncation
  • Refers to the ability to search on just a portion
    of a word.
  • Stemming
  • Refers to the ability of a search engine to find
    word variants such as plurals, singular forms,
    past tense, present tense, etc.
  • Case Sensitivity
  • Field Searching
  • Allows the searcher to designate where a specific
    search term will appear.
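
The difference between truncation and stemming can be sketched in a few lines. This is an illustration only: the vocabulary is made up, and `naive_stem` is a deliberately crude suffix stripper assumed for the example, not a production stemmer such as the ones real engines use.

```python
# Made-up vocabulary for illustration.
vocabulary = ["search", "searches", "searched", "searching", "seal", "seat"]


def truncate_match(prefix, words):
    """Truncation: match every word that starts with the given prefix."""
    return [w for w in words if w.startswith(prefix)]


def naive_stem(word):
    """Crude stemmer: strip one common English suffix, keeping >= 3 letters."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_match(query, words):
    """Stemming: match every word that reduces to the same stem as the query."""
    target = naive_stem(query)
    return [w for w in words if naive_stem(w) == target]


truncated = truncate_match("sea", vocabulary)    # like searching for sea*
stemmed = stem_match("searching", vocabulary)    # variants of "search" only
```

Truncation is purely a string operation, so `sea*` also pulls in unrelated words such as "seal", while stemming groups only grammatical variants of the same word.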

6
Features
  • Limits
  • The ability to narrow search results by adding a
    specific restriction to the search
  • File Types
  • Stop Words
  • Sorting
  • Typically, Internet search engines sort the
    results by "relevance" determined by their
    proprietary relevance ranking algorithms.
  • Other options are to arrange the results by date,
    alphabetically by title, or by root URL or host
    name.

7
Google
  • Strengths
  • Size and scope: it is one of the largest, and
    includes PDF, DOC, PS, and many other file types
  • Relevance based on sites' linkages and authority
  • Cached archive of Web pages as they looked when
    they were indexed
  • Additional databases: Google Groups, News,
    Directory, Books, Scholar, etc.
  • Weaknesses
  • Limited search features: no nesting, no
    truncation, does not support full Boolean
  • Link searches must be exact and are incomplete
  • Only indexes first 101 KB of a Web page and about
    120 KB of PDF files
  • May search for plural/singular, synonyms, and
    grammatical variants without telling you
  • Not as comprehensive as it is rumored to be

8
Yahoo
  • Strengths
  • A large, unique search engine database
  • Includes cached copies of pages
  • Includes links to the Yahoo! Directory
  • Supports full Boolean searching
  • Wildcard word in phrase
  • Weaknesses
  • Lack of some advanced search features such as
    truncation
  • Only indexes first 500 KB of a Web page (still
    more than Google's 101 KB)
  • Link searches require the inclusion of the
    http://
  • File type search uses originurlextension: rather
    than filetype:
  • Includes some pay for inclusion sites

9
Ask
  • Strengths
  • Identifying metasites
  • Refine feature to focus on Web communities
  • Weaknesses
  • Smaller database
  • No free URL submission
  • No ability to uncluster results to easily see
    more than two hits per site
  • No cached copies of pages

10
MSN Live Search
  • Strengths
  • Large, fresh, unique database
  • Query building: Advanced Search and full Boolean
    searching
  • Cached copies of Web pages including date cached
  • Automatic local search options.
  • Weaknesses
  • No truncation, stemming, or wild card word in a
    phrase
  • Limited to 10 words in a query
  • Advanced search not on front page, but available
    after running a search

11
Gigablast
  • Summary
  • Debuted in beta July 21, 2002
  • Strengths
  • Date reporting (including date indexed and date
    last modified)
  • Cached pages and links to the Wayback Machine
  • Includes PDF and other file types and cached HTML
    versions of these other file types
  • Indexing and displaying of meta tags
  • Weaknesses
  • Small database and not refreshed as frequently as
    others
  • Lacks truncation, proximity, and other advanced
    search features.

12
Exalead
  • Summary: Exalead is a newer search engine that
    arrived in October 2004. Hailing from France, it
    offers a unique and different approach to
    presenting results.
  • Strengths
  • Truncation, proximity, and many other advanced
    operators not available from other search engines
  • Includes thumbnails of pages
  • Provides excellent narrowing options on right
    side
  • Weaknesses
  • Smaller database than the major search engines
  • Few people know about or use it
  • May not be updated as frequently

13
Features by Search Engine
14
Search Engine Ratings
15
Search Engine Ratings
16
The PageRank Algorithm
  • Assume a page A has pages T1, T2, ..., Tn pointing
    to it.
  • Let d be a damping factor, set to 0.85, say.
  • Let C(A) be the number of links going out of
    page A.
  • The PageRank PR(A) can then be defined as:
  • PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))
  • The PageRanks of all web pages sum to 1.
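
The definition above can be computed by simple iteration. The sketch below is a hedged illustration, not the production algorithm: the three-page link graph is made up, every page is assumed to have at least one outgoing link, and the constant term is written as (1 - d) / N, where N is the number of pages. That normalization is an assumption added here so that the ranks sum to 1 as the slide states; with the (1 - d) term exactly as written on the slide, the ranks would sum to N instead.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank for a dict of page -> list of outlinks.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start from a uniform guess
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                ranks[t] / len(links[t]) for t in pages if page in links[t]
            )
            new_ranks[page] = (1 - d) / n + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this toy graph page C ends up ranked above page B, because C receives links from both A and B while B receives only half of A's vote.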

17
Architecture Overview
18
The basic operations of a search engine can be
divided into:
  • Crawling
  • Indexing
  • Sorting

19
Crawling
  • Crawling is the process of following links to
    locate and read pages.
  • Crawling is the most fragile application, since
    it involves hundreds of thousands of web servers.
  • A single URL server serves lists of URLs to a
    number of crawlers, which are implemented in
    Python.
  • As of 1998, at peak speeds the system could crawl
    more than 100 pages a second using 4 crawlers.
  • This amounts to roughly 600 KB per second of
    data.

20
Crawling
  • The Google crawler, known as Googlebot, crawls
    all the URLs it knows every few weeks to keep its
    information up to date.
  • Each crawler has a DNS cache, so it does not need
    to do a DNS lookup before crawling each
    document.
  • Googlebot obeys robots.txt directives, avoiding
    the pages that the webmaster has designated as
    off limits.
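
The crawl loop described on these slides (follow links, skip pages excluded by robots.txt) can be sketched as a breadth-first traversal. This is a minimal illustration under stated assumptions: the `WEB` dictionary stands in for real HTTP fetches, the URLs and the `examplebot` agent name are hypothetical, and a real crawler would also need politeness delays, DNS caching, and error handling. The robots.txt check uses Python's standard `urllib.robotparser` module.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> list of outgoing links.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

# Parse robots.txt rules from a list of lines instead of fetching them.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])


def crawl(seed, agent="examplebot"):
    """Breadth-first crawl from seed, skipping URLs disallowed by robots.txt."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisits
    visited = []
    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(agent, url):
            continue           # the webmaster marked this page off limits
        visited.append(url)
        for link in WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


visited = crawl("http://example.com/")
```

Because the disallowed page is never read, links that appear only on it (here the `/secret` page) are never discovered either.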

21
Indexing
  • After each document is parsed, it is encoded into
    a number of barrels. Every word is converted to
    a wordID using an in-memory hash table.
  • Once words are converted to wordIDs, their
    occurrences in the current document are
    translated into hit lists and written into the
    forward barrels.
  • In short, for every word the system keeps a list
    of the pages the word occurs in.
  • Google knows about 10 billion web documents.
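
The indexing steps above can be sketched with plain dictionaries. This is a simplified illustration, assuming a made-up two-document corpus: `lexicon` plays the role of the in-memory hash table, `forward` stands in for the forward barrels (hit lists reduced to word positions), and `inverted` is the per-word list of pages. The names and data layout are assumptions for the example, not the actual on-disk format.

```python
# Made-up corpus: docID -> text.
docs = {0: "search engines index the web", 1: "the web is large"}

lexicon = {}   # word -> wordID (stand-in for the in-memory hash table)
forward = {}   # docID -> {wordID: [positions]} (stand-in for forward barrels)

for doc_id, text in docs.items():
    hits = {}
    for position, word in enumerate(text.split()):
        # Assign the next free wordID the first time a word is seen.
        word_id = lexicon.setdefault(word, len(lexicon))
        hits.setdefault(word_id, []).append(position)
    forward[doc_id] = hits

# Invert the forward index: for every wordID, list the docIDs it occurs in.
inverted = {}
for doc_id in sorted(forward):
    for word_id in forward[doc_id]:
        inverted.setdefault(word_id, []).append(doc_id)

web_docs = inverted[lexicon["web"]]
```

The forward index answers "which words are in this document", while the inverted index answers "which documents contain this word", which is the question a query needs answered.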

22
Sorting
  • Once Google has matched a word in an index, it
    wants to put the best document first. It chooses
    the best document based on a number of
    techniques:
  • 1. Text analysis.
  • 2. Links and link text (anchor text).
  • 3. PageRank.

23
Google Query Evaluation
  • 1. Parse the query.
  • 2. Convert words into wordIDs.
  • 3. Seek to the start of the doc list in the short
    barrel for every word.
  • 4. Scan through the doc lists until there is a
    document that matches all of the search terms.
  • 5. Compute the rank of that document for the
    query.
  • 6. Sort the documents that have matched by rank
    and return the top K.
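
The six steps above can be sketched as a single function. This is a rough illustration under stated assumptions: the lexicon and doc lists are made-up data, and the rank is simply the sum of term counts, a placeholder chosen for the example in place of the real ranking signals (text analysis, anchor text, and PageRank).

```python
import heapq

# Made-up index: word -> wordID, and wordID -> {docID: term count}.
lexicon = {"search": 0, "engine": 1, "web": 2}
doc_lists = {
    0: {1: 3, 2: 1, 4: 2},
    1: {1: 1, 3: 5, 4: 4},
    2: {2: 2, 4: 1},
}


def evaluate(query, k=2):
    """Return the top-k docIDs that contain every word of the query."""
    words = query.lower().split()                            # 1. parse
    word_ids = [lexicon[w] for w in words if w in lexicon]   # 2. wordIDs
    if not words or len(word_ids) < len(words):
        return []                                            # unknown word
    # 3-4. Walk the doc lists and keep documents matching all terms.
    candidates = set(doc_lists[word_ids[0]])
    for word_id in word_ids[1:]:
        candidates &= set(doc_lists[word_id])
    # 5. Placeholder rank: total occurrences of the query terms.
    scored = [
        (sum(doc_lists[word_id][doc] for word_id in word_ids), doc)
        for doc in candidates
    ]
    # 6. Sort by rank and return the top k.
    return [doc for score, doc in heapq.nlargest(k, scored)]


top = evaluate("search engine")
```

Using `heapq.nlargest` avoids sorting every matching document when only the top K are needed.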

24
Google vs. Inktomi
  • Google likes to include as many pages as it can
    find. Inktomi would rather not clutter its index
    with pages of little value. This makes Google
    useful when conducting very specific searches,
    such as researching an individual.
  • Inktomi's ranking algorithm has changed over the
    past months, but compared to Google, Inktomi has
    been very stable.
  • As long as Google places an inordinate amount of
    weight on any single factor, such as anchor text,
    aggressive site promoters can play that factor to
    their benefit.

25
Meta Search Engines
  • A meta search engine searches multiple search
    engines simultaneously and displays the results
    based on certain preferences.
  • A meta search engine doesn't have a database of
    its own. It sends search terms to the databases
    maintained by search engine companies.
  • It works on the principle that more heads are
    better than one.
  • Smarter meta search engines come with options
    like textual analysis that let one dig deeply
    into search results.
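
A meta search engine has to merge several ranked result lists into one. The sketch below shows one possible way to do that, reciprocal rank fusion, which is an assumption chosen for illustration and not necessarily what any of the engines named in these slides use. The three `engine_*` functions are hypothetical stand-ins that return canned results in place of real network calls.

```python
def engine_a(query):
    """Hypothetical engine returning a canned ranked list of URLs."""
    return ["example.org/1", "example.org/2", "example.org/3"]


def engine_b(query):
    return ["example.org/2", "example.org/4", "example.org/1"]


def engine_c(query):
    return ["example.org/2", "example.org/3"]


def metasearch(query, engines, k=60):
    """Merge ranked lists: each engine adds 1 / (k + rank) to a URL's score."""
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


merged = metasearch("search engines", [engine_a, engine_b, engine_c])
```

A URL that several engines rank highly rises to the top, which matches the "more heads are better than one" principle; the constant `k` only dampens how much a single first place counts.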

26
Some Good Meta Search Engines
  • www.clusty.com: searches a number of free search
    engine directories. Doesn't include Yahoo! and
    Google.
  • www.dogpile.com: searches Google, Yahoo!,
    LookSmart, Ask Jeeves, and MSN Search.
  • www.copernic.com: Copernic Agent lets you select
    from a list of search engines by changing the
    properties dialog.

27
Things you can do on Google, Yahoo!, and Ask
  • Phrase searching: enclose terms in double
    quotes.
  • OR searching: use a capitalized OR.
  • - excludes a word; + requires the exact form of
    the word.
  • Limit results through advanced search.
  • Things not supported on Google, Yahoo!, and Ask:
  • Truncation: use OR searches for variants (e.g.,
    airlines OR airline).
  • Case sensitivity: capitalization does not matter.

30
Trends On The Web
  • 50% of Web users use search engines as a
    starting point
  • The Web is estimated to have billions of pages
  • New pages are created at a rate of 8% per week

31
What Do The Trends Indicate?
  • Search engines dictate a large amount of web
    traffic
  • Crawlers must be able to scale to the rapidly
    growing web
  • Query engines must be able to cope with the large
    amount of data indexed by crawlers
  • Search engines have to give users relevant
    information with less effort from the users

32
Search Engine Bias
  • Do search engines dictate what is popular and
    what is not?
  • How much control does a search engine have over
    content on the web?
  • Is page rank flawed?

33
Current Trends: Search Engine Bias
  • A "rich get richer, poor stay poor" phenomenon is
    occurring
  • Research in this area is still new

34
Web Crawlers
  • Typically 3 Areas of Research
  • General Architecture
  • Page Selection
  • Page Update

35
Web Crawlers Continued
  • General Architecture
  • How should parallel crawlers be designed?
  • Page Selection
  • What should be the next page visited once a site
    is harvested?
  • Page Updating
  • How often should a page be refreshed?

36
Database Indexing
  • Databases use a static index of all data
    available
  • At query time, the index is dynamically traversed
  • The result represents all known data that
    satisfies the query

37
Web Indexing
  • Typically search engines maintain a frequency
    index and a positional index
  • Frequency index
  • For every term, stores the frequency of that term
    for all web pages
  • Positional Index
  • For every term, stores all positions of that term
    in all web pages

38
Improving Query Time for Search Engines
  • Disk size is a scalability issue
  • A static index of all web pages becomes somewhat
    impractical
  • The size of a complete static index becomes a
    bottleneck for query response time

39
Improving Query Time Continued
  • Research focuses on compression of indexes
  • Lossless and more recently lossy compression are
    used
  • The Locality Based Pruning Method (LBPM) is a
    lossy compression technique
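
As a concrete example of lossless index compression, the sketch below stores a sorted list of document IDs as gaps between neighbors and then writes each gap with variable-byte encoding. This pairing is a common textbook approach chosen here as an assumption; the slides do not say which lossless scheme is meant, and this is not the lossy LBPM technique. The document IDs are made up.

```python
from itertools import accumulate


def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, high bit = more bytes follow."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # low 7 bits plus continuation flag
            n >>= 7
        out.append(n)                      # final byte has the high bit clear
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode back into a list of ints."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                     # more bytes belong to this number
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers


def compress_postings(doc_ids):
    """Store the first ID and then the gaps between consecutive sorted IDs."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the original IDs by summing the decoded gaps."""
    return list(accumulate(vbyte_decode(data)))


doc_ids = [3, 7, 150, 152, 100000]         # made-up sorted posting list
encoded = compress_postings(doc_ids)
decoded = decompress_postings(encoded)
```

Because gaps between neighboring document IDs are usually small, most of them fit in a single byte, and the original list is recovered exactly.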

40
Personalized Search
  • The massive amount of data requires a more
    personalized search paradigm
  • Searching with context personalizes searches
    behind the scenes
  • Probabilistic Query Expansion is another avenue
    to alleviate the size of the web

41
DEMO!!
  • http://mindset.research.yahoo.com