Searching the Web - PowerPoint PPT Presentation

Slides: 37
Provided by: has46

Transcript and Presenter's Notes

Title: Searching the Web


1
Searching the Web
King Saud University
College of Computer and Information Sciences, Information Systems Department
IS 531: Document Storage and Retrieval Systems
  • Prepared By
  • Hasan Ba-Abdullah. 425121603
  • Supervised By
  • Dr. Mourad Ykhlef

2
Agenda
1. Introduction
2. Challenges of Searching the Web
3. Measuring the Web
4. Search Engines (Google)
5. Web Directories
6. Metasearchers
7. Google Searching Guidelines
3
1. Introduction
  • The Web can be seen as a very large, unstructured
    but ubiquitous database.
  • So we need efficient tools to manage,
    retrieve and filter the information.
  • There are 3 different forms of searching the Web:
  • 1. Search engines, which index a portion of Web
    pages as a full-text database.
  • 2. Web directories, which classify selected Web
    documents by subject.
  • 3. Searching by hyperlink structure.

4
2. Challenges of Searching the Web
  • Problem with the data itself
  • Distributed data
  • High percentage of volatile data
  • It is estimated that 40% of the Web changes every
    month.
  • Unstructured and redundant data
  • No conceptual model, no organization, no
    constraints.
  • By some estimates, about 30% of the Web is
    redundant.
  • Quality of data
  • Data can be false, invalid, outdated, poorly
    written or with many errors.
  • Heterogeneous data
  • Multiple media types, multiple formats, languages
    and alphabets.
  • Problems regarding the user and his interaction
    with the retrieval system
  • How to specify a query
  • How to interpret the answer provided by the
    system.

5
3. Measuring the Web
  • Detailed domain counts and Internet statistics.
  • Source: http://www.whois.sc/internet-statistics/

6
3. Measuring the Web (Cont.)
Source: http://www.whois.sc/internet-statistics/country-ip-counts.html
on April 1st, 2006.
7
4. Search Engine
  • A search engine is a program designed to help
    find information stored on a computer system such
    as the World Wide Web, or a personal computer.
  • The search engine allows one to ask for content
    meeting specific criteria and retrieves a list of
    references that match those criteria.
  • Two main architectures:
  • 1. Centralized: using crawlers, information is
    gathered into a single site, where it is indexed;
    the site then processes all user queries.
  • 2. Distributed: searching is a coordinated effort
    of many information gatherers and brokers.

8
4.1 Centralized Architecture
  • Most search engines use a centralized
    crawler-indexer architecture.
  • Components: Crawlers, Index, Query Engine, and
    Interface.

9
4.2 Distributed Architecture
  • Harvest is an example of a distributed
    architecture.
  • Main drawback: it requires the coordination of
    several Web servers.
  • Components:
  • 1. Gatherers: extract information from the
    documents stored on one or more Web servers; can
    handle documents in many formats (HTML, PDF,
    PostScript, etc.).
  • 2. Broker: provides the indexing mechanism and
    query interface.
  • 3. Replicator: replicates servers.
  • 4. Object Cache: reduces network and server load.

10
4.2 Distributed Architecture (Cont.)
11
4.3 About Google
  • The name "Google" is a play on the word "googol",
    which refers to the number represented by 1
    followed by one hundred zeros.
  • Google receives over 200 million queries each day
    through its various services.
  • As of January 2006, Google has indexed 9.7
    billion web pages, 1.3 billion images, and over
    one billion Usenet messages; in total,
    approximately 12 billion items. It also caches
    much of the content that it indexes.

12
User Interfaces
13
Google Services and Tools
Source: http://en.wikipedia.org/wiki/List_of_Google_services_and_tools
14
How Google works
15
Google finds important pages
  • The idea is that the documents on the web have
    different degrees of "importance".
  • Google will show the most important pages first.
  • The idea is that more important pages are likely
    to be more relevant to any query than
    non-important pages.

16
Google Relevance Factors
  • Google considers over 100 factors, including:
  • 1. PageRank algorithm.
  • 2. Popularity of page.
  • 3. Position and size of the search terms within
    the page.
  • 4. Unique content.
  • 5. Term order.
  • 6. Page size and load time.
  • 7. Error-free websites.
  • 8. Important incoming links.
  • 9. Website optimization.
17
Google PageRank
  • Numeric value to measure how important a page is.
  • PageRank (PR) is the actual ranking of a page, as
    determined by Google.
  • A probability is expressed as a numeric value
    between 0 and 1.

18
Google System Features
  • PageRank: bringing order to the Web
  • PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))
  • PR(A) is the PageRank of page A.
  • PR(T1) is the PageRank of the page that links to
    our (A) page.
  • C(T1) is the number of links going out of page
    T1.
  • d is a damping factor, usually set to 0.85.
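
As a minimal sketch of the formula on this slide, the Python function below evaluates PR(A) for a single page, given the PageRank and outgoing-link count of each page that links to it. The function name and the list-of-pairs input are illustrative assumptions, not Google's implementation.

```python
def pagerank_of_page(inbound, d=0.85):
    """Evaluate PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)).

    inbound: list of (PR(Ti), C(Ti)) pairs, one per page Ti linking to A.
    d: damping factor (0.85 is the usual value quoted on the slide).
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)


# Hypothetical input: two pages link to A. The first has 2 outgoing links,
# the second has 1, and both currently have a PageRank of 1.0.
print(pagerank_of_page([(1.0, 2), (1.0, 1)]))
```

Each linking page passes on its own PageRank divided by its number of outgoing links, so a link from a page with few outgoing links counts for more.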

[Figure: pages T1, T2, ..., Tn linking to page A; diagram labels C1, C2, ..., Cm]
19
Example
  • PageRank calculation
  • PR(A) = 0.5 + 0.5 PR(C)
  • PR(B) = 0.5 + 0.5 (PR(A) / 2)
  • PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
  • These equations can easily be solved. We get the
    following PageRank values for the single pages:
  • PR(A) = 14/13 = 1.07692308
  • PR(B) = 10/13 = 0.76923077
  • PR(C) = 15/13 = 1.15384615
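
The three equations in this example can also be solved by repeated substitution. The sketch below assumes the link structure the equations imply (A links to B and C, B links to C, C links to A) and a damping factor of 0.5. The function name, the dictionary representation and the iteration count are illustrative choices.

```python
def solve_pagerank(links, d=0.5, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over all pages q linking to p.

    links: dict mapping each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(
                pr[src] / len(outs) for src, outs in links.items() if page in outs
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Link structure assumed from the example's equations.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = solve_pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 8))
```

With these inputs the iteration should settle on the same values as the exact solution: 14/13, 10/13 and 15/13.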

20
Indexing
  • For example, the word "civil" might occur in
    documents 3, 8, 22, 56, 68, and 92, while the
    word "war" might occur in documents 2, 8, 15, 22,
    68, and 77.
  • Suppose someone comes to Google and types in
    "civil war". In order to present and score the
    results, we need to do two things:
  • 1. Find the set of pages that contain the
    user's query somewhere.
  • 2. Rank the matching pages in order of
    relevance.
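
The first of the two steps can be sketched with a small inverted index. The posting lists below are the ones from this slide. The function name is an illustrative assumption, and ranking (the second step) is not shown.

```python
# Posting lists from the slide's example: word -> documents containing it.
index = {
    "civil": [3, 8, 22, 56, 68, 92],
    "war": [2, 8, 15, 22, 68, 77],
}


def find_matching_docs(query, index):
    """Return the sorted documents that contain every word of the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


matches = find_matching_docs("civil war", index)
print(matches)  # [8, 22, 68]
```

Only documents 8, 22 and 68 appear in both posting lists, so they are the candidates passed on to ranking.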

21
Web crawler
  • A web crawler (also known as a web spider) is a
    program which browses the World Wide Web in a
    methodical, automated manner.
  • Web crawlers are mainly used to create a copy of
    all the visited pages for later processing by a
    search engine, that will index the downloaded
    pages to provide fast searches.
  • It starts with a list of URLs to visit. As it
    visits these URLs, it identifies all the
    hyperlinks in the page and adds them to the list
    of URLs to visit, recursively browsing the Web
    according to a set of policies.
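
The crawl loop described on this slide can be sketched as follows. This is a minimal illustration, not a production crawler: it uses a queue instead of recursion, the `fetch` callback and the `max_pages` limit are assumptions standing in for real downloading and crawl policies, and the pages come from a made-up in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first from seed_urls and keep a copy of each.

    fetch(url) must return the page's HTML as a string, or None on failure.
    max_pages is a simple stand-in for the crawler's policies.
    """
    to_visit = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # copy kept for later indexing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return pages


# Hypothetical in-memory "web"; the URLs and contents are invented.
fake_web = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}
copies = crawl(["http://example.test/"], fake_web.get)
print(sorted(copies))
```

The `seen` set keeps the crawler from queuing the same URL twice, which matters because pages on the Web frequently link back to each other.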

22
Google Search Engine Architecture
  • URL Server - provides URLs to be fetched
  • Crawler - is distributed
  • Store Server - compresses and stores pages
  • Repository - holds pages for indexing
  • Indexer - parses documents, records words,
    positions, font size, capitalization
  • Lexicon - list of unique words found
  • Barrels hold
  • Anchors - keep information about links found in
    web pages
  • URL Resolver - converts relative URLs to absolute
  • Sorter - generates Doc Index
  • Doc Index - inverted index of all words in all
    documents (except stop words)
  • Links - stores info about links to each page
    (used for PageRank)
  • PageRank - computes a rank for each page

23
5. Web Directories
  • Web directory: a classification of Web pages by
    subject.
  • Principles:
  • Classification is by a hierarchical taxonomy.
  • Directory may be specific to a subject, a region,
    a language.
  • Pages are submitted and reviewed before they are
    included.
  • Automatic classification is not successful
    enough.
  • Advantage:
  • If found, the answer will be useful in most
    cases.
  • Disadvantages:
  • Classification is not specialized enough.
  • Not all Web pages are classified.

24
6. Metasearchers
  • Metasearcher: a Web server that sends a given
    query to several search engines and Web
    directories, collects the answers and unifies
    them.
  • Examples: Metacrawler, Savvysearch, MetaSearch,
    Mamma.
  • Advantages:
  • Combine the results of many sources.
  • Save users from the need to pose queries to
    multiple searchers.
  • Ability to sort the results by different
    attributes.
  • Pages retrieved by multiple searchers are more
    relevant.
  • Improve coverage: individual searchers cover a
    small fraction of the Web.
  • Issues
  • How to translate the given query to the specific
    language of each search engine?
  • How to rank the unified results?
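
One simple way to rank unified results, sketched below, follows the slide's observation that pages retrieved by multiple searchers are more relevant: count how many engines returned each page. This is only an illustrative heuristic. The function name, the tie-breaking rule and the sample result lists are assumptions, not how any of the named metasearchers work.

```python
def merge_results(result_lists):
    """Unify ranked URL lists from several engines into one ranked list.

    Pages returned by more engines come first; ties are broken by the
    best (smallest) position the page reached in any single list.
    """
    votes = {}
    best_position = {}
    for results in result_lists:
        # dict.fromkeys drops duplicates within one engine's list, keeping order.
        for position, url in enumerate(dict.fromkeys(results)):
            votes[url] = votes.get(url, 0) + 1
            best_position[url] = min(best_position.get(url, position), position)
    return sorted(votes, key=lambda url: (-votes[url], best_position[url], url))


# Hypothetical answers from three search engines for the same query.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4"]
engine_c = ["u2", "u1", "u5"]
merged = merge_results([engine_a, engine_b, engine_c])
print(merged)  # ['u2', 'u1', 'u4', 'u3', 'u5']
```

Here "u2" comes first because all three engines returned it, and "u1" comes second because two engines did.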

25
7. Google Searching Guidelines
  • Query modifiers
  • Use these commands in the search window.
  • intitle:test
  • allintitle:test results
  • inurl:test results
  • allinurl:test results personality
  • allintext:test results personality
  • allinanchor:test results personality
  • site:loc.gov
  • filetype:doc

26
intitle:test results
This search returns sites with the word test in
the title and results anywhere in the document.
27
inurl:test results
  • inurl:test results: only "test" must be found in
    the web address (URL)

28
allintext
  • Sometimes you get pages that do not have your
    search term/phrase in them.
  • Use allintext: to get only those pages that have
    your search terms in them.
  • Compare the searches in the next two slides.

29
Example: crash test results
30
allintext:crash test results
Different pages float to the top of your hit
list. And you get fewer pages than before.
31
site
  • Limit your search to a specific web site.
  • Enter search terms then qualifier.
  • EXAMPLES
  • students site:ksu.edu
  • Finds student(s) on the King Saud University site

32
filetype
  • You can specify a type of document to search.
  • EXAMPLES
  • pdf: Adobe readable files
  • doc: Microsoft Word documents
  • mdb: Microsoft Access databases
  • jpg, gif, tif: graphics, photos
  • ppt: Microsoft PowerPoint presentations
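
As a small sketch of how the site and filetype qualifiers combine with ordinary search terms, the helper below assembles a query string with the terms first and the qualifiers after them. The function name and its parameters are hypothetical, and the function only builds the text to type into the search window.

```python
def build_query(terms, site=None, filetype=None):
    """Build a query string: search terms first, then the qualifiers."""
    parts = [terms.strip()]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(part for part in parts if part)


print(build_query("students", site="ksu.edu"))            # students site:ksu.edu
print(build_query("crash test results", filetype="pdf"))  # crash test results filetype:pdf
```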

33
define
  • define: provides definitions of the words that
    follow it, gathered from various online sources.

34
Funny Google News
  • Google Bombing: "Miserable Failure"

35
Summary
  • Search engines are among the most important
    applications or services on the web.
  • The success of the Google search engine was
    mainly due to its simple, easy-to-use, no-ad
    interface, and its powerful PageRank algorithm.

36
  • Thanks
  • Any Questions?