Title: The Invisible Web
1. The Invisible Web
- David Boudinot
- Heather De Forest
- Joan Pries
- Lindsay Ure
- Created for LIBR 557 Advanced Information
Retrieval - Dr. Mary Sue Stephenson
- November 29, 2004
2. The Invisible Web
- "Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it."
- Deep Web White Paper, BrightPlanet.com (2000)
3. The Invisible Web
- Part I: The Invisible Web Explained
- What is the Invisible Web?
- Why can't I access the Invisible Web using regular search engines?
- How deep is the Invisible Web and what does it contain?
- Where do I start with searching the Invisible Web?
4. What is the Invisible Web? Background and definitions.
5. Background
- The phrase "Invisible Web" was first used in the mid-1990s to describe web content that is not indexed by regular search engines
- 2000: the Deep Web White Paper, published by the BrightPlanet Corporation, discusses the nature and scope of the Invisible Web
- 2001: publication of The Invisible Web: Uncovering Information Sources Search Engines Can't See, by Chris Sherman and Gary Price
6. The Visible Web
- The visible or "surface" web is the part of the Web that can be retrieved using standard search engines such as Google or AltaVista, or subject directories
- In order for search engines to find them, web pages must be static and either linked to other web pages or submitted for indexing by Webmasters
7. The Invisible Web
- Also known as the "deep web," "dark web," or "hidden web"
- The Invisible Web is what standard search tools either cannot or will not crawl and index
- A large part of the Invisible Web consists of authoritative and pertinent information
8. The Invisible Web
- The term "invisible" is somewhat misleading. It is possible to retrieve this content, but not using the same methods as for visible web content
- There are various reasons why certain web pages are not indexed by standard search engines or directories
- The Invisible Web is hard to define, and varying definitions of its precise nature exist
9. Definitions
- "The Deep Web is content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent."
- BrightPlanet Corporation, BrightPlanet.com
10. Definitions
- "Text pages, files, or other often high-quality or authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. Sometimes also referred to as the Deep Web or dark matter."
- Chris Sherman and Gary Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
11. Types of Invisibility
- Opaque Web
- Web pages that have not yet been crawled by search engines for various reasons, but could become part of the Visible Web at any time
- Private Web
- Sites that could be indexed but that Webmasters have chosen to exclude from search engines, or at least to restrict access to, via password protection, the robots exclusion protocol, or robots meta tags
- Proprietary Web
- Sites that are available only to those who have agreed to terms or conditions to access the content. May require free registration or paid subscription.
- Adapted from Sherman and Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
12. Types of Invisibility
- Truly Invisible Web
- Sites that search engines cannot, or will not, crawl for technical reasons:
- Certain file types
- Real-time information, such as flight arrivals and weather reports, that is relevant only for a very short time
- Pages that generate scripts, since these can trap spiders
- Dynamic pages that are created in response to a user query, namely the content of relational databases
- Adapted from Sherman and Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
13. Why can't I access content from the Invisible Web using regular search engines?
14. Disclaimer
- As technologies grow and search engines develop, parts of the Invisible Web are becoming visible, so what is invisible today might become visible tomorrow.
15. Why can't I access content from the Invisible Web using search engines?
- There are four main reasons you can't access Invisible Web content using search engines:
- Search engines were originally designed to index HTML pages
- The search engine can't find the content
- The search engine is blocked from the content
- The search engine purposely ignores the Invisible Web site
16. Search engines were originally designed to index HTML pages
- Anything outside of HTML (such as Flash, Shockwave, or MP3s) has traditionally remained invisible.
- If this type of content is described in meta tags within the HTML document, a web crawler can index it.
- Companies like Google have been developing technology to search non-HTML content on the Internet.
17. The search engine can't find the content
- Web crawlers work by following links on websites and reporting back home what was found. If a webpage is not linked from any other page, the web crawler will not be able to find it.
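The link-following behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not any real engine's crawler; the dict of pages stands in for the live Web so the sketch runs without network access, and all page names are invented.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(pages, start):
    """Breadth-first crawl over a {url: html} dict standing in for the Web.

    A page that no reachable page links to is never discovered --
    exactly why unlinked pages stay invisible to a crawler.
    """
    seen, queue = {start}, [start]
    while queue:
        url = queue.pop(0)
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Tiny hypothetical site: /orphan exists, but nothing links to it.
web = {
    "/index": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/index">Home</a>',
    "/news": "",
    "/orphan": "Never linked, so never found.",
}
print(sorted(crawl(web, "/index")))  # /orphan is absent from the result
```

The crawl finds only the three linked pages; the unlinked `/orphan` page stays invisible, mirroring the point above.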
18. Adapted from Chris Sherman and Gary Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
19. The search engine is blocked from the content
- Problem: You don't want a search engine to index parts of your website.
- Solution: Include the Robots Exclusion Protocol or a Robots META tag in your website.
20. How the Robots Exclusion Protocol works
- When a web crawler visits a website, it first checks for a robots.txt file, which tells the crawler what parts of the site it is allowed to index.
- For the SLAIS site, this would be found at:
- http://www.slais.ubc.ca/robots.txt
21. How the Robots Exclusion Protocol works, Part II
- Code in a simple text document tells the crawler what to do. For example:
- To exclude all crawlers from part of the server:
- User-agent: *
- Disallow: /cgi-bin/
- Disallow: /tmp/
- Disallow: /private/
- To exclude a single crawler:
- User-agent: BadBot
- Disallow: /
- Source: Web Server Administrator's Guide to the Robots Exclusion Protocol
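A crawler's side of this exchange can be tried out with Python's standard urllib.robotparser module. A minimal sketch, parsing the wildcard rules shown above from a string rather than fetching a live robots.txt (the example.com URLs are placeholders):

```python
import urllib.robotparser

# The "exclude all crawlers" rules from the slide above, as text lines.
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler asks before fetching each URL.
print(rp.can_fetch("MyBot", "http://www.example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "http://www.example.com/index.html"))         # True
```

A disallowed path is refused for every user agent, while unlisted paths remain fair game, which is all the protocol promises: polite crawlers obey it, rogue ones ignore it.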
22. How the Robots META tag works
- The Robots META tag is inserted into an individual HTML document to inform web crawlers to buzz off.
- Unfortunately, some crawlers ignore this tag and index your webpage anyway.
23. How the Robots META tag works, Part II
- Here are some examples of what Robots META tag code looks like:
- <META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
- <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
- <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
- INDEX or NOINDEX tells the crawler to index the page or not.
- FOLLOW or NOFOLLOW instructs the crawler to follow (or not follow) the links on the page.
24. The search engine purposely ignores the Invisible Web site
- Due to budget constraints or technical issues, some search engines choose not to index non-HTML files.
- Spammers tend to use script commands to trap web crawlers. Some search engines opt out of indexing sites with any script commands.
- Web crawlers are not programmed to understand database structures; therefore, information in relational databases remains invisible.
25. Databases and HTML
- Online databases generate web pages dynamically and respond to commands issued from an HTML form.
- Some databases are proprietary.
- In many instances web crawlers and databases are incompatible.
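The "dynamic page" problem can be made concrete: a database-backed site builds a results page only after a form submission produces a query string. A hypothetical sketch (the catalogue host and parameter names are invented for illustration):

```python
from urllib.parse import urlencode

# A form submission becomes a query string the database answers on the fly.
# No static page exists at this address until someone asks the question,
# so a link-following crawler has nothing to discover.
base = "http://catalogue.example.edu/search"
params = {"title": "invisible web", "year": "2004"}
query_url = base + "?" + urlencode(params)
print(query_url)
# http://catalogue.example.edu/search?title=invisible+web&year=2004
```

Each distinct query yields a distinct URL, and the number of possible queries is effectively unbounded, which is why database content cannot simply be enumerated by a crawler.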
26. How it all works
27. How deep is the Invisible Web and what does it contain?
28. How deep is the Invisible Web?
- BrightPlanet's study of the Deep Web (2000)
- estimated approximately 400-550 times more information than in the surface Web (or World Wide Web)
- Sherman and Price (2001) refute this claim
- estimate the IW is somewhere between 2 and 50 times larger, since much of the information is ephemeral data (such as weather)
29. How fast is the Deep Web growing?
- "Significantly faster than the visible Web" (Sherman and Price)
- "The Deep Web is the fastest growing category of new information on the Internet. All signs point to the Deep Web as the dominant paradigm for the next-generation Internet." (BrightPlanet)
30. Quality and Content: Invisible Web vs. surface web
- Many IW sites are first-rate content sites
- They tend to be narrower in focus, with more content in their subject area
- Often use a variety of media and file types, many of which are not easily indexed
- The largest part of the IW is information contained in databases
- More than half of the content resides in subject-specific databases
- Mostly human indexed
31. Content
- Invisible Web sources are critical because they provide users with specific, targeted information, not just static text or HTML pages
- However, general search engines are becoming much more sophisticated and capable
- E.g., Google's new Google Scholar for scholarly resources opens up the Invisible Web by allowing access to some material that wouldn't ordinarily be available to search spiders (Search Engine Watch, November 18, 2004)
- What is invisible today may be visible tomorrow
32. Content
- At the time Sherman and Price's book was first written in 2001, PDF and Microsoft Office documents were among those which could not be indexed by general search engines
- Google became the first to index PDF and Office documents, a search capability that is now widely adopted
33. Content
- A number of other file formats are still not searched well by most search engines:
- PostScript
- Flash
- Shockwave
- Executables (programs)
- Compressed files (.zip, .tar, etc.)
34. Why aren't these formats searched?
- Although the above formats can be indexed, they often are not, because it is expensive to index non-HTML pages
- In other words, "the major web engines are not in business to meet every need of information professionals and researchers." (Sherman and Price, 2003)
35.
- These difficult file types are becoming more prevalent, especially in some kinds of high-quality, authoritative information
- E.g., official government documents or scholarly papers stored on the Web as PostScript or compressed PostScript files
- (PostScript is a page description language first used by Adobe in 1985. It is a programming language optimized for printing graphics and text.)
36. What's NOT on the Web
- Proprietary databases and information services
- Dialog, LexisNexis, etc.
- Government and public records
- Some coverage of government docs, but too much information to ever have complete coverage
- Privacy issues come into play for public records
- Scholarly journals
- Publishers have tight control
- There are a few scholarly free e-journals, usually found via library websites
- Full text of newspapers and magazines
- Some limited content of archives, but the information is often still valuable, so publishers want to retain control of it
- Authors' rights are also a concern; many retain re-use rights
- Millions of documents that will never be available on the Web
- Libraries are still important!
37. Why use the Invisible Web?
- There are thousands of databases with high-quality information accessible via the Web, many from libraries, universities, businesses, government agencies, etc.
- Previously, this type of information was available only in proprietary information systems
- Although these databases may be accessible through the Web, they may not be on the Web
38. Why use the Invisible Web?
- More comprehensive results
- Resources are more subject specific
- More control
- More specialized tools for searching, thus easier retrieval of subject-specific information
- Increased precision and recall
- Smaller databases: better recall
- Subject-specific resources: better precision
- Authoritative
- High-quality content from reputable institutions or organizations
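The precision and recall claims above can be grounded in their standard definitions. A minimal sketch, with invented hit counts purely for illustration:

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant

# Hypothetical comparison: a focused subject database returns fewer
# off-topic hits (better precision) and, because it aims to cover its
# niche completely, misses fewer relevant items (better recall).
print(precision(18, 20))   # 0.9 -- subject database: 18 of 20 hits relevant
print(precision(30, 100))  # 0.3 -- general engine: 30 of 100 hits relevant
print(recall(18, 25))      # 0.72 -- database holds 18 of 25 relevant docs
```

The numbers are illustrative only; the point is that a smaller, subject-specific collection shifts both ratios in the searcher's favour.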
39. WHERE can I find Invisible Web resources?
40. Q: How do I search the Invisible Web?
41. A: You already do!
42. Top 25 types of content on the Invisible Web
- 1. Public company filings
- 2. Telephone numbers
- 3. Customized maps and driving directions
- 4. Clinical trials
- 5. Patents
- 6. Out-of-print books
- 7. Library catalogues
- 8. Authoritative dictionaries
- 9. Environmental information
- 10. Historical stock quotes
- 11. Historical documents and images
- 12. Company directories
- 13. Searchable subject bibliographies
- 14. Economic information
- 15. Award winners
- 16. Job postings
- 17. Philanthropy and grant information
- 18. Translation tools
- 19. Postal codes
- 20. Basic demographic information
- 21. Interactive school finders
- 22. Campaign financing information
- 23. Weather data
- 24. Product catalogues
- 25. Art gallery holdings
43. Attitude Shift
- Remember that the Invisible Web is there.
- Change your expectations of what you'll find. Look for entryways to the Invisible Web, not for the content itself.
- Develop a toolkit now for later consultation
44. Searching the Invisible Web
- 1. Adopt the mindset of a hunter
- Tools (weapons) are important
- Reading the environment and looking for clues is more important
- Adapted from Price, G., & Sherman, C. (2001). Exploring the invisible web: Seven essential strategies. Online, 25(4), 32-35
45. Searching the Invisible Web
- 2. Use search engines
- Use a general-purpose engine like Teoma to search for your term combined with "database" or "interactive tool"
46. Searching the Invisible Web
- 3. Use site maps and site searches
- Big sites like the Library of Congress and Library and Archives Canada are often hybrids: part visible, part invisible.
- Use the site map, search for "database", and see what you get!
47. Searching the Invisible Web
- 4. Rely on "Baker Street Irregulars"
- Sherlock Holmes had key informants; you can too.
- Early warning systems:
- The "Search Stuff from Susie" list
- Search Engine Watch newsletters and blog feeds
- Gary Price's www.resourceshelf.com
48. Searching the Invisible Web
- 5. Use Invisible Web directories
- Directories like the Librarians' Index to the Internet and the Invisible Web Directory have the advantage of presenting resources that have been hand-selected.
49. Searching the Invisible Web
- 6. Use offline finding aids
- Handbooks:
- The Invisible Web, Sherman & Price
- Best of the Web: Geography, Leftley
- Website reviews
50. Searching the Invisible Web
- 7. Create your own monitoring service
- Some specialized search engines, like InfoMine and ProFusion, have alert services that will let you know when new resources have been added.
51. Searching the Invisible Web
- What about these so-called Invisible Web search engines?
- E.g., ProFusion, IncyWincy, Complete Planet
52. The Invisible Web
- Part II: Demonstrations of Invisible Web Search Tools
- ProFusion
- Complete Planet
- The Invisible Web Directory
53. ProFusion
- Claims:
- "ProFusion is very dynamic and an extremely exciting search site that makes it easy to intelligently search and find information from the very deep and invisible parts of the web."
- Press release from Intelliseek: http://www.intelliseek.com/releases2.asp?id=41
54. ProFusion
- Advantages:
- Vertical search fields
- Clean interface
- May highlight resources that you haven't seen before
- Can retrieve some items which are inaccessible through Google
55. ProFusion
- Disadvantages:
- Can't log in
- No real help section
- Categories/resources are "mystery meat"
- Not very effective
56. Complete Planet
- Strengths:
- A lot of good information about the site as well as the Invisible Web
- Help/FAQ link very useful
- Good categories to choose from for searching
- Advanced search provides date limiters and allows for either natural language or Boolean searching
57. Complete Planet
- Weaknesses:
- Searches have to be quite broad
- No results if search is too specific
- Necessitates searching through individual databases
- Results not always relevant
- Advanced search is not as useful as it appears
- Using the Basic Search and then individual databases gives better results
58. The Invisible Web Directory
- A companion to the book by Sherman and Price
- Directory of Invisible Web resources arranged into broad categories and subcategories covering a wide range of topics
- Browse only
- Focus on free sites
- Emphasis on quality over quantity: authoritative resources that all contain some invisible content
59. The Invisible Web Directory
- Strengths:
- High-quality, authoritative information
- Resources contain Invisible Web content
- Simple interface
- Provides annotations
- Weaknesses:
- Small number of resources (Sherman & Price argue it is intended as a starting point)
- Browse only; cannot search by keyword
- Must know which broad category your search fits into
- Not good for the general searcher; more useful for those who have read the book
- Several broken links
- No information about frequency of updates
60. Invisible Web
- What?
- How?
- Why?
- Where?
- Who?
61.
- For more resources, look at our website:
- http://www.slais.ubc.ca/boudinot/links.htm