Introductory Survey of Internet Search Services

Learn more at: https://people.hws.edu
1
Introductory Survey of Internet Search Services
Michael Hunter, Reference Librarian, Hobart and William Smith Colleges
for Rochester Regional Library Council Member Libraries Staff
Sponsored by the Rochester Regional Library Council
Supported by Library Services and Technology Act (LSTA) and/or Regional
Bibliographic Databases and Resources Sharing (RBDB) funds granted by the
New York State Library
2000
2
What is a search engine?
  • A searchable database of resources extracted from
    the Internet by computer-generated search and
    retrieval processes.
  • Updated frequently
  • Search features vary among engines
  • Results of searches are ranked for relevance as
    predicted by automated logical algorithms.

3
Search Engines and Subject Directories in 2000
Genres in Flux
  • What types are available today?
  • Automatic vs. human-compiled
  • Pure Crawler
  • Crawler Plus
  • Specialized Crawler
  • Peer-Reviewed

4
Search Engines in 2000
  • Pure crawler-based
  • Google
  • Fast
  • Crawler plus
  • Subject Directory (HotBot, Lycos, Excite, AltaVista,
    Infoseek, WebCrawler)
  • Special Collection (Northern Light)
  • Pre-programmed answers: Ask Jeeves (AltaVista)

5
Search Engines in 2000
  • Specialized (chiefly crawler-based)
  • SearchEdu.com
  • Specialized (crawler/human compiled)
  • Scicentral.com metasite
  • Peer reviewed
  • Hippias - Philosophy
    http://hippias.evansville.edu/
  • Argos - Classics and ancient history
    http://argos.evansville.edu/

11
How large is a search engine?
  • Typical personal computer - 64 MB RAM
  • General search engines - 4,000 MB RAM (and
    more)
  • Database - 1,000 GB of storage

12
[Diagram: crawlers (CR) visit many web servers (WS) and feed the DATABASE]
CR - Crawler; WS - Web Server
13
[Diagram: Users 1-7 send queries to the Search Engine, which draws on the DATABASE]
14
[Diagram: crawlers (CR) visit many web servers (WS) and feed the DATABASE]
CR - Crawler; WS - Web Server
15
Crawling the Web: The Big Picture ...
  • Crawlers
  • download a page
  • extract links to other web pages
  • index words from the page
  • crawl the extracted links and
  • continue the cycle
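
The cycle above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the TOY_WEB dictionary, the example.com-style URLs, and the crawl function are all hypothetical, and the "download" step is a dictionary lookup rather than a real HTTP request.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> page text. A real crawler would
# download each page over HTTP instead of reading a dictionary.
TOY_WEB = {
    "http://a.example/": "welcome see http://b.example/ and http://c.example/",
    "http://b.example/": "plate tectonics notes http://a.example/",
    "http://c.example/": "library hours",
}

LINK_RE = re.compile(r"http://\S+")


def crawl(start_url, web):
    """Download a page, extract links, index words, then follow the links."""
    queue = deque([start_url])
    seen = {start_url}
    index = {}  # word -> set of URLs whose text contains that word
    while queue:
        url = queue.popleft()
        page = web.get(url)  # "download" the page
        if page is None:
            continue
        links = LINK_RE.findall(page)  # extract links to other pages
        for word in LINK_RE.sub(" ", page).lower().split():  # index words
            index.setdefault(word, set()).add(url)
        for link in links:  # queue the extracted links and continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, seen
```

Calling crawl("http://a.example/", TOY_WEB) visits all three toy pages and returns a word-to-URL index alongside the set of URLs seen.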

16
Crawling the Web: The Detailed View ...
  • While downloading one page the crawlers
    simultaneously
  • check for the next page to download in the
    queue
  • check for any robots exclusion files that
    prohibit downloading of pages from a web server
  • download the whole page
  • extract all links from the page and add them to
    the queue
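
The robots exclusion check can be illustrated with Python's standard urllib.robotparser module. This is only a sketch: the ROBOTS_TXT content, the "ExampleCrawler" agent name, and the allowed_urls helper are assumptions made for the example, and the file is parsed from a string rather than fetched from a server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots exclusion file, as a web server might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, urls, agent="ExampleCrawler"):
    """Return only the URLs the exclusion file lets this crawler download."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]
```

A crawler would run a check like this before adding a URL to its download queue, skipping anything the site owner has excluded.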

17
Crawling the Web: The Detailed View ...
  • Index contents (extract all words and save them
    to a database associated with the page's URL;
    also save the order of the words to allow for
    phrase searching)
  • Optionally filter for adult content, language of
    document, other criteria
  • Save (or make) summary of the page
  • Record the date downloaded for future reference
    in scheduling re-visits to the site
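
Saving the order of the words is what makes phrase searching possible. The sketch below shows one way to do it, assuming a simple in-memory structure; the function names and the sample pages are hypothetical, and real engines use far more compact storage.

```python
def build_positional_index(pages):
    """Map each word to {url: [positions]} so word order is preserved."""
    index = {}
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(url, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return URLs where the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for url, positions in index.get(words[0], {}).items():
        for start in positions:
            # Every later word must sit exactly one position further along.
            if all(start + offset in index.get(word, {}).get(url, [])
                   for offset, word in enumerate(words)):
                hits.add(url)
                break
    return hits
```

With only a plain word list, a page containing "tectonics" and "plate" far apart would match the phrase "plate tectonics"; the stored positions rule that out.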

18
Scale?
  • One page at a time?
  • Covering the Internet would take several years
  • Instead
  • Thousands of pages are processed simultaneously
    by multiple crawlers (Google has ca. 4,000)

19
Performance?
  • What about maintenance down-time?
  • Services have duplicate machines so no
    interruptions occur during maintenance
  • Why are interface changes so rare?
  • Updating software on complex systems is expensive
  • Usually slows service down, or stops it completely

20
Performance?
  • If I execute the same search in the same engine
    several times in succession I get different
    results. Why?
  • Query is run against multiple machines in
    parallel
  • Ranking may be performed on a limited subset of
    the hits (i.e., those returned first) rather than
    the entire set of results.
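
A small sketch shows why this produces different results. It assumes, purely for illustration, that each hit arrives as a (URL, score) pair and that the engine ranks only the first few arrivals; the function name, scores, and limits are hypothetical.

```python
def rank_first_arrivals(arrivals, limit, top):
    """Rank only the first `limit` hits to arrive; return the best `top` URLs."""
    considered = arrivals[:limit]
    ranked = sorted(considered, key=lambda hit: hit[1], reverse=True)
    return [url for url, _score in ranked[:top]]


# The same four hits, arriving in a different order on two runs because
# the parallel machines answered at different speeds.
run_1 = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.8)]
run_2 = [("b", 0.4), ("d", 0.8), ("c", 0.7), ("a", 0.9)]
```

Ranking the first three arrivals of run_1 puts "a" first, while the same call on run_2 never sees "a" at all and puts "d" first.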

21
Why do search engines exist?
  • To make money!!!
  • Advertising
  • Banner ads
  • Allied services
  • Pay-for-placement in search results
  • Many other commercial endeavors

22
In pursuit of user loyalty . . .
  • Advertisers want stickiness, i.e., users that
    return often and at length
  • Stickiness drives design
  • Portalization: one-stop access for all your
    Internet needs
  • Speed
  • Freshness
  • Relevance of results
  • Value-added search features such as customization
    (My Yahoo, etc.)

23
How Search Engines Differ . . .
  • Content
  • Update frequency (freshness)
  • Ways you can search
  • Ways results are presented to you

24
Breadth of Content
  • How much of the geographic Internet is searched
    and to what degree?
  • What types of files are included?
  • Web sites
  • Usenet News
  • Software
  • Image/Video/Audio
  • Multimedia
  • FTP

25
Depth of Content
  • How much of a given site has been downloaded?
  • URL?
  • Title?
  • First heading?
  • First 200 words?
  • Full text?
  • Full text and some of the documents linked to?
  • Full text and all of the documents linked to?
  • Full text and documents that are linking to this
    one?

26
Update frequency
  • When was the content last refreshed or rebuilt
    from direct searching of the Internet?

27
Ways you can search
  • Boolean operators
  • Requiring, combining or excluding words or
    phrases
  • Searching for a phrase
  • Searching by word stem (truncation)
  • Searching by location in the document (field
    searching)
  • Searching by date
  • Searching by media
  • Searching by language
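
A few of these search features can be sketched against a toy document set. The search function, its parameter names, and the sample documents are hypothetical; each engine has its own syntax for the same ideas.

```python
def search(docs, require=(), exclude=(), stem=None):
    """Filter documents by required words, excluded words, and a word stem."""
    results = []
    for name, text in docs.items():
        words = text.lower().split()
        if not all(term in words for term in require):  # AND / requiring
            continue
        if any(term in words for term in exclude):  # NOT / excluding
            continue
        if stem is not None and not any(w.startswith(stem) for w in words):
            continue  # truncation: no word begins with the stem
        results.append(name)
    return sorted(results)


DOCS = {
    "d1": "library catalog search",
    "d2": "librarians search the web",
    "d3": "web catalog",
}
```

For example, requiring "search" while excluding "web" leaves only d1, and the stem "librar" matches both "library" and "librarians".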

28
Ways results are presented to you: Relevance Prediction
Based on:
TEXT ON THE PAGE
FACTORS EXTERNAL TO THE PAGE
29
Relevance Prediction: Text on the page
  • Based on
  • Word frequency profiles
  • More like this
  • Suggested similar sites
  • Relational clustering
  • Northern Light's Custom Folders
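
As a rough illustration of a word frequency measure, the sketch below scores a page by the share of its words that match the query. This is an assumed, simplified formula chosen for the example; actual engines combine many signals and do not publish their formulas.

```python
from collections import Counter


def frequency_score(text, query):
    """Fraction of the words on the page that are query terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    matches = sum(counts[term] for term in query.lower().split())
    return matches / len(words)


def rank_pages(pages, query):
    """Order page names from highest to lowest frequency score."""
    return sorted(pages, key=lambda name: frequency_score(pages[name], query),
                  reverse=True)
```

A page that mentions the query terms often, relative to its length, rises to the top, which is also why text-heavy pages tend to do well under this kind of ranking.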

30
Relevance Prediction: Problems with text-on-the-page ranking
  • Designed for text-heavy pages; design-heavy
    pages may be ranked lower as a result
  • No added weight possible for evaluated, rated or
    reviewed sites
  • Ill-suited for a web that grows so rapidly

31
Relevance Prediction: Factors external to the page
  • Link popularity
  • Sites with more links pointing to them ranked
    higher
  • Click popularity
  • Sites visited more often and longer ranked higher
    (Direct Hit's knowledge base of users' click
    paths)
  • Sector popularity
  • Tracking demographic or social groups' click paths
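
Link popularity in its simplest form is a count of inbound links. The sketch below assumes a hypothetical map of each site to the sites it links to; it is plain counting, not the weighted scheme any specific engine uses.

```python
from collections import Counter


def link_popularity(outlinks):
    """Count how many distinct other sites link to each site."""
    counts = Counter()
    for source, targets in outlinks.items():
        for target in set(targets):  # repeated links from one site count once
            if target != source:  # ignore a site linking to itself
                counts[target] += 1
    return counts


OUTLINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}
```

Here site "c" is linked from three other sites, so a ranking that favors link popularity would place it above "a" and "b".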

32
Relevance Prediction: Factors external to the page
  • Pre-packaged human-generated questions with
    answers (Ask Jeeves)
  • Business alliances among services
  • Editorial partnerships
  • Pay-for-placement options (GoTo)

33
Relevance Prediction: Factors external to the page
  • Advantages
  • Helps focus and limit results for popular, common
    queries
  • Human-generated criteria improve quality of
    results
  • Disadvantages
  • Increases the invisible layer between the
    searcher and the results
  • How did I get these results?
  • Who is controlling the search process?
  • Privacy issues around tracking users' click paths

34
What is a subject directory?
  • A human-generated listing of resources usually
    classified and hierarchically arranged by subject
    category, often containing descriptions of the
    resources included.

35
Subject Directories
  • Ways directories differ from search engines
  • Sites are examined and cataloged by a human being
  • Descriptions of the sites are often included
  • Generally fewer ways of searching
  • Generally not updated as frequently

36
Types of Subject Directories
  • My favorite links
  • Personal homepages
  • Subject-focused sites with related links
  • The Cervantes Home Page
  • Subject-focused metasites
  • Scicentral http://sciquest.com
  • Sections of the WWW Virtual Library
    http://wwwvl.org
  • General comprehensive directories
  • Yahoo, Snap, Excite

40
Important aspects of subject directories
  • Authorship/sponsorship
  • Intended audience
  • Update frequency

41
How are users faring? NPD User Study, April 2000
  • 40,000 respondents chosen randomly
  • October-November, 1999
  • Conducted by NPD New Media Services on behalf of
    13 major search services
  • Summary at
    http://searchenginewatch.com/reports/npd.html
  • See http://www.npd.com for more information

42
Search Engine or Subject Directory: Which one do I use?
  • Portalization has blurred the distinctions;
    however:
  • Use a search engine for
  • Narrowly defined topics (e.g., plate tectonics in
    northern California)
  • Up-to-date news and research
  • Occurrences of a name or phrase

43
Search Engine or Subject Directory: Which one do I use?
  • Use a subject directory for
  • Broadly defined topics (e.g., geophysical research)
  • Subject-specific gateways or vortals
  • websites
  • discussion groups
  • media files
  • A few good sites
  • General browsing

44
Improving search strategy
  • Little overlap in coverage among engines (Greg
    Notess at http://searchengineshowdown.com)
  • Even the largest ones cover no more than 20-25%
    of the Internet
  • Therefore use 2 or more engines you know and
    trust to ensure a wider range of results
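
The benefit of a second engine can be made concrete by comparing two result lists. The sketch below uses made-up result URLs and a hypothetical helper; it reports the share of results both engines returned and the combined list.

```python
def overlap_and_union(results_a, results_b):
    """Return (share of results found by both engines, combined result list)."""
    a, b = set(results_a), set(results_b)
    union = a | b
    shared = a & b
    overlap = len(shared) / len(union) if union else 0.0
    return overlap, sorted(union)


# Hypothetical result lists from two engines for the same query.
ENGINE_A = ["u1", "u2", "u3"]
ENGINE_B = ["u3", "u4"]
```

In this made-up case only one of four distinct results appears in both lists, so the second engine contributes a page the first one missed.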

45
Improving search strategy
  • Know the advanced features of your favorite
    engine(s) and use them.
  • Use unique identifiers or keywords
  • Use phrase searching when possible
  • Restrict search to title or other fields
  • Incorporate date searching when available
  • Use the "Find in page" function to locate your
    search term(s) quickly

46
What NO search engine covers . . .
  • Dynamic Web content
  • Created through user interaction
  • File extensions include .asp, .php, .jsp
  • PDF files (see Adobe's new engine for these at
    http://searchpdf.adobe.com)
  • Pages requiring a login
  • Wireless content
  • (WAP (Wireless Application Protocol) engine
    available at FAST http://alltheweb.com)

47
Once you have a list of hits ask yourself . . .
  • How might the domain type influence the content
    of this site?
  • Do I trust the author/creator? Why or why not?
  • How might the organization responsible influence
    the content?

48
Once you have a list of hits ask yourself . . .
  • Is the date of publication critical or important
    in this case?
  • Is the intended audience appropriate for this
    information need?

49
Search is . . .
  • Intriguing
  • Frustrating
  • Exciting
  • Maddening
  • Gratifying

50
The Internet is . . .
  • Vast
  • Constantly changing
  • Uncataloged
  • Of wildly varying quality

51
Search Services are . . .
  • Presently our best hope of locating the
    increasingly valuable resources found on the Net

52
How can I keep up???
  • Use monitoring services such as
  • http://searchenginewatch.com
  • http://searchengineshowdown.com
  • http://researchbuzz.com
  • Network with colleagues and other expert users
  • Try new services out (on your own, at first !!)
  • Learn how to evaluate new services on your own

53
Thank you and best of luck!!!
  • Michael Hunter
  • Reference Librarian
  • Warren Hunting Smith Library
  • Hobart and William Smith Colleges
  • Geneva, NY 14456
  • (315) 781-3552 hunter_at_hws.edu