Title: The Invisible Web
1. The Invisible Web
- David Boudinot
- Heather De Forest
- Joan Pries
- Lindsay Ure
- Created for LIBR 557 Advanced Information
Retrieval - Dr. Mary Sue Stephenson
- November 29, 2004
2. The Invisible Web
- "Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. The reason is simple: Most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it."
- Deep Web White Paper, BrightPlanet.com (2000)
3. The Invisible Web
- Part I: The Invisible Web Explained
- What is the Invisible Web?
- Why can't I access the Invisible Web using regular search engines?
- How deep is the Invisible Web and what does it contain?
- Where do I start with searching the Invisible Web?
4. What is the Invisible Web? Background and definitions.
5. Background
- The phrase "Invisible Web" was first used in the mid-1990s to describe web content that is not indexed by regular search engines
- 2000: the Deep Web White Paper, published by the BrightPlanet Corporation, discusses the nature and scope of the Invisible Web
- 2001: publication of The Invisible Web: Uncovering Information Sources Search Engines Can't See, by Chris Sherman and Gary Price
6. The Visible Web
- The visible or "surface" web is the part of the Web that can be retrieved using standard search engines such as Google or AltaVista, or subject directories
- In order for search engines to find them, web pages must be static and either linked to other web pages or submitted for indexing by Webmasters
7. The Invisible Web
- Also known as the "deep web," "dark web," or "hidden web"
- The Invisible Web is what standard search tools either cannot or will not crawl and index
- A large part of the Invisible Web consists of authoritative and pertinent information
8. The Invisible Web
- The term "invisible" is somewhat misleading. It is possible to retrieve this content, but not using the same methods as for visible web content
- There are various reasons why certain web pages are not indexed by standard search engines or directories
- The Invisible Web is hard to define, and varying definitions of its precise nature exist
9. Definitions
- "The Deep Web is content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent."
- BrightPlanet Corporation, BrightPlanet.com
10. Definitions
- "Text pages, files, or other often high-quality or authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. Sometimes also referred to as the Deep Web or dark matter."
- Chris Sherman and Gary Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
11. Types of Invisibility
- Opaque Web
- Web pages that have not yet been crawled by search engines for various reasons, but could become part of the Visible Web at any time
- Private Web
- Sites that could be indexed but that Webmasters have chosen to exclude from search engines, or at least to restrict access to, via password protection, the robots exclusion protocol, or robots meta tags
- Proprietary Web
- Sites that are available only to those who have agreed to terms or conditions to access the content. May require free registration or paid subscription.
- Adapted from Sherman and Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
12. Types of Invisibility
- Truly Invisible Web
- Sites that search engines cannot, or will not, crawl for technical reasons:
- Certain file types
- Real-time information, such as flight arrivals and weather reports, that is relevant only for a very short time
- Pages that generate scripts, since these can trap spiders
- Dynamic pages that are created in response to a user query, namely the content of relational databases
- Adapted from Sherman and Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
13. Why can't I access content from the Invisible Web using regular search engines?
14. Disclaimer
- As technologies grow and search engines develop, parts of the Invisible Web are becoming visible, so what is invisible today might become visible tomorrow.
15. Why can't I access content from the Invisible Web using search engines?
- There are four main reasons you can't access Invisible Web content using search engines:
- Search engines were originally designed to index HTML pages
- The search engine can't find the content
- The search engine is blocked from the content
- The search engine purposely ignores the Invisible Web site
16. Search engines were originally designed to index HTML pages
- Anything outside of HTML (such as Flash, Shockwave, or MP3s) has traditionally remained invisible.
- If this type of content is described in meta tags within the HTML document, a web crawler can index it.
- Companies like Google have been developing technology to search non-HTML content on the Internet.
17. The search engine can't find the content
- Web crawlers work by following links on websites and reporting back home what was found. If a webpage is not linked from any other page, the web crawler will not be able to find it.
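The link-following behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not any real engine's crawler; the dict of pages stands in for the live Web so the sketch runs without network access, and all page names are invented.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(pages, start):
    """Breadth-first crawl over a {url: html} dict standing in for the Web.

    A page that no reachable page links to is never discovered --
    exactly why unlinked pages stay invisible to a crawler.
    """
    seen, queue = {start}, [start]
    while queue:
        url = queue.pop(0)
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Tiny hypothetical site: /orphan exists, but nothing links to it.
web = {
    "/index": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/index">Home</a>',
    "/news": "",
    "/orphan": "Never linked, so never found.",
}
print(sorted(crawl(web, "/index")))  # /orphan is absent from the result
```

The crawl finds only the three linked pages; the unlinked `/orphan` page stays invisible, mirroring the point above.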
18. Adapted from Chris Sherman and Gary Price, The Invisible Web: Uncovering Information Sources Search Engines Can't See (2001)
19. The search engine is blocked from the content
- Problem: You don't want a search engine to index parts of your website.
- Solution: Include the Robots Exclusion Protocol or a Robots META tag in your website.
20. How the Robots Exclusion Protocol works
- When a web crawler visits a website, it first checks for a robots.txt file, which tells the crawler what parts of the site it is allowed to index.
- For the SLAIS site, this would be found at:
- http://www.slais.ubc.ca/robots.txt
21. How the Robots Exclusion Protocol works, Part II
- Code in a simple text document tells the crawler what to do. For example:
- To exclude all crawlers from part of the server:
- User-agent: *
- Disallow: /cgi-bin/
- Disallow: /tmp/
- Disallow: /private/
- To exclude a single crawler:
- User-agent: BadBot
- Disallow: /
- Source: Web Server Administrator's Guide to the Robots Exclusion Protocol
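A crawler's side of this exchange can be tried out with Python's standard urllib.robotparser module. A minimal sketch, parsing the wildcard rules shown above from a string rather than fetching a live robots.txt (the example.com URLs are placeholders):

```python
import urllib.robotparser

# The "exclude all crawlers" rules from the slide above, as text lines.
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler asks before fetching each URL.
print(rp.can_fetch("MyBot", "http://www.example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "http://www.example.com/index.html"))         # True
```

A disallowed path is refused for every user agent, while unlisted paths remain fair game, which is all the protocol promises: polite crawlers obey it, rogue ones ignore it.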
22. How the Robots META tag works
- The Robots META tag is inserted into an individual HTML document to inform web crawlers to buzz off.
- Unfortunately, some crawlers ignore this tag and index your webpage anyway.
23. How the Robots META tag works, Part II
- Here are some examples of what Robots META tag code looks like:
- <META NAME="ROBOTS" CONTENT="INDEX, FOLLOW">
- <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
- <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
- INDEX or NOINDEX tells the crawler to index the page or not.
- FOLLOW or NOFOLLOW instructs the crawler to follow (or not follow) the links on the page.
24. The search engine purposely ignores the Invisible Web site
- Due to budget constraints or technical issues, some search engines choose not to index non-HTML files.
- Spammers tend to use script commands to trap web crawlers. Some search engines opt out of indexing sites with any script commands.
- Web crawlers are not programmed to understand database structures; therefore, information in relational databases remains invisible.
25. Databases and HTML
- Online databases generate web pages dynamically and respond to commands issued from an HTML form.
- Some databases are proprietary.
- In many instances web crawlers and databases are incompatible.
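The "dynamic page" problem can be made concrete: a database-backed site builds a results page only after a form submission produces a query string. A hypothetical sketch (the catalogue host and parameter names are invented for illustration):

```python
from urllib.parse import urlencode

# A form submission becomes a query string the database answers on the fly.
# No static page exists at this address until someone asks the question,
# so a link-following crawler has nothing to discover.
base = "http://catalogue.example.edu/search"
params = {"title": "invisible web", "year": "2004"}
query_url = base + "?" + urlencode(params)
print(query_url)
# http://catalogue.example.edu/search?title=invisible+web&year=2004
```

Each distinct query yields a distinct URL, and the number of possible queries is effectively unbounded, which is why database content cannot simply be enumerated by a crawler.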
26. How it all works
27. How deep is the Invisible Web and what does it contain?
28. How deep is the Invisible Web?
- BrightPlanet's study of the Deep Web (2000)
- estimated approximately 400-550 times more information than in the surface Web (or World Wide Web)
- Sherman and Price (2001) refute this claim
- estimate the IW is somewhere between 2 and 50 times larger, since much of the information is ephemeral data (such as weather)
29. How fast is the Deep Web growing?
- "Significantly faster than the visible Web" (Sherman and Price)
- "The Deep Web is the fastest growing category of new information on the Internet. All signs point to the Deep Web as the dominant paradigm for the next-generation Internet." (BrightPlanet)
30. Quality and Content: Invisible Web vs. surface web
- Many IW sites are first-rate content sites
- They tend to be narrower in focus, with more content in their subject area
- Often use a variety of media and file types, many of which are not easily indexed
- The largest part of the IW is information contained in databases
- More than half of the content resides in subject-specific databases
- Mostly human indexed
31. Content
- Invisible Web sources are critical because they provide users with specific, targeted information, not just static text or HTML pages
- However, general search engines are becoming much more sophisticated and capable
- E.g., Google's new Google Scholar for scholarly resources opens up the Invisible Web by allowing access to some material that wouldn't ordinarily be available to search spiders (Search Engine Watch, November 18, 2004)
- What is invisible today may be visible tomorrow
32. Content
- At the time Sherman and Price's book was first written in 2001, PDF and Microsoft Office documents were among those which could not be indexed by general search engines
- Google became the first to index PDF and Office documents, a search capability that is now widely adopted
33. Content
- A number of other file formats are still not searched well by most search engines:
- PostScript
- Flash
- Shockwave
- Executables (programs)
- Compressed files (.zip, .tar, etc.)
34. Why aren't these formats searched?
- Although the above formats can be indexed, they often are not, because it is expensive to index non-HTML pages
- In other words, "the major web engines are not in business to meet every need of information professionals and researchers." (Sherman and Price, 2003)
35.
- These difficult file types are becoming more prevalent, especially in some kinds of high-quality, authoritative information
- E.g., official government documents or scholarly papers stored on the Web as PostScript or compressed PostScript files
- (PostScript is a page description language first used by Adobe in 1985. It is a programming language optimized for printing graphics and text.)
36. What's NOT on the Web
- Proprietary databases and information services
- Dialog, LexisNexis, etc.
- Government and public records
- Some coverage of government docs, but too much information to ever have complete coverage
- Privacy issues come into play for public records
- Scholarly journals
- Publishers have tight control
- There are a few scholarly free e-journals, usually found via library websites
- Full text of newspapers and magazines
- Some limited content of archives, but the information is often still valuable, so publishers want to retain control of it
- Authors' rights are also a concern; many retain re-use rights
- Millions of documents that will never be available on the Web
- Libraries are still important!
37. Why use the Invisible Web?
- There are thousands of databases with high-quality information accessible via the Web, many from libraries, universities, businesses, government agencies, etc.
- Previously, this type of information was available only in proprietary information systems
- Although these databases may be accessible through the Web, they may not be on the Web
38. Why use the Invisible Web?
- More comprehensive results
- Resources are more subject specific
- More control
- More specialized tools for searching, thus easier retrieval of subject-specific information
- Increased precision and recall
- Smaller databases: better recall
- Subject-specific resources: better precision
- Authoritative
- High-quality content from reputable institutions or organizations
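The precision and recall claims above can be grounded in their standard definitions. A minimal sketch, with invented hit counts purely for illustration:

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant

# Hypothetical comparison: a focused subject database returns fewer
# off-topic hits (better precision) and, because it aims to cover its
# niche completely, misses fewer relevant items (better recall).
print(precision(18, 20))   # 0.9 -- subject database: 18 of 20 hits relevant
print(precision(30, 100))  # 0.3 -- general engine: 30 of 100 hits relevant
print(recall(18, 25))      # 0.72 -- database holds 18 of 25 relevant docs
```

The numbers are illustrative only; the point is that a smaller, subject-specific collection shifts both ratios in the searcher's favour.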
39. WHERE can I find Invisible Web resources?
40. Q: How do I search the Invisible Web?
41. A: You already do!
42. Top 25 types of content on the Invisible Web
- 1. Public company filings
- 2. Telephone numbers
- 3. Customized maps and driving directions
- 4. Clinical trials
- 5. Patents
- 6. Out-of-print books
- 7. Library catalogues
- 8. Authoritative dictionaries
- 9. Environmental information
- 10. Historical stock quotes
- 11. Historical documents and images
- 12. Company directories
- 13. Searchable subject bibliographies
- 14. Economic information
- 15. Award winners
- 16. Job postings
- 17. Philanthropy and grant information
- 18. Translation tools
- 19. Postal codes
- 20. Basic demographic information
- 21. Interactive school finders
- 22. Campaign financing information
- 23. Weather data
- 24. Product catalogues
- 25. Art gallery holdings
43. Attitude Shift
- Remember that the Invisible Web is there.
- Change your expectations of what you'll find. Look for entryways to the Invisible Web, not for the content itself.
- Develop a toolkit now for later consultation
44. Searching the Invisible Web
- 1. Adopt the mindset of a hunter
- Tools (weapons) are important
- Reading the environment and looking for clues is more important
- Adapted from Price, G., & Sherman, C. (2001). Exploring the invisible web: Seven essential strategies. Online, 25(4), 32-35
45. Searching the Invisible Web
- 2. Use search engines
- Use a general-purpose engine like Teoma to search for your term combined with "database" or "interactive tool"
46. Searching the Invisible Web
- 3. Use site maps and site searches
- Big sites like the Library of Congress and Library and Archives Canada are often hybrids: part visible, part invisible.
- Use the site map, search for "database", and see what you get!
47. Searching the Invisible Web
- 4. Rely on "Baker Street Irregulars"
- Sherlock Holmes had key informants; you can too.
- Early warning systems:
- The "Search Stuff from Susie" list
- Search Engine Watch newsletters and blog feeds
- Gary Price's www.resourceshelf.com
48. Searching the Invisible Web
- 5. Use Invisible Web directories
- Directories like the Librarians' Index to the Internet and the Invisible Web Directory have the advantage of presenting resources that have been hand-selected.
49. Searching the Invisible Web
- 6. Use offline finding aids
- Handbooks:
- The Invisible Web, Sherman & Price
- Best of the Web: Geography, Leftley
- Website reviews
50. Searching the Invisible Web
- 7. Create your own monitoring service
- Some specialized search engines, like InfoMine and ProFusion, have alert services that will let you know when new resources have been added.
51. Searching the Invisible Web
- What about these so-called Invisible Web search engines?
- E.g., ProFusion, IncyWincy, Complete Planet
52. The Invisible Web
- Part II: Demonstrations of Invisible Web Search Tools
- ProFusion
- Complete Planet
- The Invisible Web Directory
53. ProFusion
- Claims:
- "ProFusion is very dynamic and an extremely exciting search site that makes it easy to intelligently search and find information from the very deep and invisible parts of the web."
- Press release from Intelliseek: http://www.intelliseek.com/releases2.asp?id=41
54. ProFusion
- Advantages:
- Vertical search fields
- Clean interface
- May highlight resources that you haven't seen before
- Can retrieve some items which are inaccessible through Google
55. ProFusion
- Disadvantages:
- Can't log in
- No real help section
- Categories/resources are "mystery meat"
- Not very effective
56. Complete Planet
- Strengths:
- A lot of good information about the site as well as the Invisible Web
- Help/FAQ link very useful
- Good categories to choose from for searching
- Advanced search provides date limiters and allows for either natural language or Boolean searching
57. Complete Planet
- Weaknesses:
- Searches have to be quite broad
- No results if search is too specific
- Necessitates searching through individual databases
- Results not always relevant
- Advanced search is not as useful as it appears
- Using the Basic Search and then individual databases gives better results
58. The Invisible Web Directory
- A companion to the book by Sherman and Price
- Directory of Invisible Web resources arranged into broad categories and subcategories covering a wide range of topics
- Browse only
- Focus on free sites
- Emphasis on quality over quantity: authoritative resources that all contain some invisible content
59. The Invisible Web Directory
- Strengths:
- High-quality, authoritative information
- Resources contain Invisible Web content
- Simple interface
- Provides annotations
- Weaknesses:
- Small number of resources (Sherman & Price argue it is intended as a starting point)
- Browse only; cannot search by keyword
- Must know which broad category your search fits into
- Not good for the general searcher; more useful for those who have read the book
- Several broken links
- No information about frequency of updates
60. Invisible Web
- What?
- How?
- Why?
- Where?
- Who?
61.
- For more resources, look at our website:
- http://www.slais.ubc.ca/boudinot/links.htm