Title: Presented by: Allen Brown IS/SE
1. Searching the Web, or: If there's so much out there, why can't I find it?
- Presented by: Allen Brown, IS/SE
- Date: 2003-05-12
?
2. Outline - Searching the Web
- Information Cartography
- Visible and Invisible Web Information
- Information Finding Strategies
- Reference Tools, Pathfinders, Specialized Information Repositories, Subject Directories, and Search Engines
- Information Search Strategies
- Information Evaluation Strategies
- Information Finding Summary
- Search Engines and their Characteristics
?
3. Information Cartography
- Imagine a physical map of an ocean basin
- identifiable areas of the sea floor
- large abyssal plain
- many undulating hills above the plain
- occasional higher elevations or plateaus
- sparse atolls and seamounts
- Imagine the Web
- some information content identifiable by subject
- vast amounts of very low value information
- some good stuff distributed across many sites
- occasional high quality site with quality and quantity
- sparse stunningly useful sites (to die for)
?
4. Information Cartography - 2
Information issues: quality, completeness, location!
- In searching for information we need to adjust the
  - breadth of search, to find all that is relevant in an ocean of information
  - quality level, to find only atolls of information quality
- so as to find everything that is important and useful
?
5. Visible and Invisible Information
- Visible: indexed by a search engine
- Invisible: not indexed but accessible
[Figure: search engines (engine 1-4), sites (site 3, 5, 7), and databases (db 1, 2, 4, 6)]
?
6. Search Engines Won't Do It All!
- According to a recent study reported in Nature (1), no search engine indexes more than 16% of the Web. Even though search engine databases are enormous, they cover very little of what's actually available on the Web.
- (1) Steve Lawrence and C. Lee Giles. (July 8, 1999). Accessibility of Information on the Web. Nature, 400, 107-109.
?
7. Information Finding Strategies
- Identify Starting Points based on your question
- What type of information do you need?
  - Facts, statistics, government document, scholarly articles, popular opinion, music, picture, multimedia, news, ...
- What form do you want the information in?
  - Dictionary definition, encyclopedia entry, journal article, elementary school project, video file, audio file, ...
- What type of site would offer this information?
  - Academic, commercial, government, non-government organization
- How much information do you need?
  - Introduction, in-depth, references, ...
?
8. Information Finding
- Reference Materials (Often invisible)
  - dictionaries, thesauri, encyclopedia, newspapers
- Information Pathfinders (Sometimes invisible) / Portals / Vortals
  - subject specific, highly relevant, sometimes bizarre
  - usually high quality
  - managed by dedicated enthusiasts, possibly amateur
  - e.g., Web design, Perl, micro cars, Curta calculators, ...
- Specialized Information Repositories (Often invisible) / Portals
  - institution-based, sometimes obscure
  - usually high quality
  - managed by information professionals
  - e.g., government documents, archives, ...
?
9. Information Finding - 2
- Subject Indices (Often invisible but this is changing)
  - subject-based
  - e.g., Yahoo
- Search Engines and Search Brokers (Visible web)
  - e.g., Google, Alta Vista, Hot Bot, Lycos, Vivisimo, dogpile
?
10. Reference Tools - Dictionaries
http://www.yourdictionary.com/
?
11. Reference Tools - Thesauri
http://www.visualthesaurus.com/index.jsp
?
12. Reference Tools - Encyclopedia
http://www.britannica.com/
?
13. Pathfinders
A pathfinder site provides an information map of
what is available within a fairly narrow area of
interest, usually compiled by domain experts.
These sites are often called vortals (vertical
portals).
?
14. Specialized Information Repositories - National Library of Canada
A specialized information repository often
collects and catalogues relatively specific
information, usually compiled by information
experts. Some are considered to be vortals.
?
15. Subject Directories - www.yahoo.com
Subject directories are lists compiled by people.
They are organized in a hierarchy where each
subject includes a list of sub-topics. These
sites are often called portals - a one-site
starting location for general information
seeking.
?
16. Subject Directories
Subject lists are usually evaluated, but sites
are not presented in order of relevancy. In other
words, the best sites on a topic are not
necessarily listed first. Sites are compiled
through submission of URLs by site creators and
human evaluation and selection. One advantage
of subject directories is their browsability,
although this feature is only suitable for fairly
general topics. A disadvantage is their relatively
small size.
Other examples of subject directories:
- Infomine: http://infomine.ucr.edu
- Scout Report Signpost: http://www.signpost.org/signpost
?
17. Invisible Web Directories
Look at http://www.invisible-web.net/
?
18. Search Engines
Search engines use computer programs that
automatically collect web sites using "spiders"
or "robots". The sites are indexed and stored in
an index database. To query a search engine,
type topic keywords and Boolean connectors into a
search "box." The search engine scans its index
and returns links to websites containing the
specified keyword relationships. Size matters -
an advantage of using search engines is their
coverage (though size is relative), but this can
also be a disadvantage if relevance ranking is
poor.
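
The index-and-query idea on this slide can be illustrated with a minimal sketch. It assumes a tiny hypothetical set of pages (the URLs and text below are invented for illustration) and builds an inverted index, then answers Boolean and / or / not queries against it. Real engines are far more elaborate; this only shows the principle.

```python
from collections import defaultdict

# Hypothetical miniature "web": URL -> page text (invented for illustration).
SAMPLE_PAGES = {
    "a.example/dogs": "large male dog for adoption",
    "b.example/cats": "large cat and small dog",
    "c.example/birds": "small bird song",
}

def build_index(pages):
    """Map each lowercase word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search_and(index, *words):
    """URLs containing every word (Boolean AND narrows the result)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, *words):
    """URLs containing at least one word (Boolean OR widens the result)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.union(*sets) if sets else set()

def search_not(index, all_urls, word):
    """URLs that do not contain the word (Boolean NOT)."""
    return set(all_urls) - index.get(word.lower(), set())

INDEX = build_index(SAMPLE_PAGES)
print(search_and(INDEX, "large", "male", "dog"))
print(search_or(INDEX, "bird", "cat"))
print(search_not(INDEX, SAMPLE_PAGES, "dog"))
```

Note that the query never touches the pages themselves, only the index, which is why anything the spider did not collect cannot be found.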
?
19. Search Engines - Operational Concepts
[Figure: a query goes through query parsing, index lookup, and results ranking and management to produce query results; crawling, page contents extraction, and indexing build the index]
?
20. Search Engines - Does Size Matter?
?
21. Size
If you are looking for unusual or hard-to-find
information, you should try one or more of the
search engines with a large index to check more web
content. This improves the likelihood of finding
what you seek. However, for general searches or
when looking for information about popular
topics, a large index does not necessarily equal
better results. Also, large indexes may have
longer re-visit intervals.
?
22. Search Engines - Search Scoping and Ranking / Results Management
- It is essential to learn and apply each engine's specialized search formats to narrow results and to filter and push the most relevant pages to the top of the results list. Use Boolean operators, proximity connectors, stems, wild cards, sounds-like, media-type and metadata filters.
- Result relevancy ranking also depends on the size of the search index and how the search engine interprets and uses your query.
- Each engine determines result relevancy ranking in unique ways. Consult the help file of each engine to learn about these.
- Some engines offer search refinement and conceptual clustering for better focus (tighter hit cluster) or greater accuracy / validity (centred on the right stuff).
?
23. Search Engines - Search Scoping
Legend: [+] expands the scope, [-] reduces the scope
- Exact phrase [-] - quotes, e.g., "We hold these things to be self-evident"
- Boolean operators - and [-] (default), or [+] (caution!), not [-] (extreme caution!), e.g., large male dog; large or male or dog; not cat
- Proximity connectors - near [-] (depends on engine), e.g., spring near flower
- Stemming and wildcards [+] - e.g., swim -> swim, swimming, swimmer, swimmers, swimmingly, ...
- Sounds-like [+] - e.g., table -> cable, able, fable, ...
- Media type [-] - e.g., image, audio file, ...
- Concept-based [+] - e.g., synonym -> thesaurus, antonym, homonym, ...
- Metadata-based [-] - in some systems
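
To make the effect of these scoping techniques concrete, here is a small sketch under stated assumptions: the documents are invented, the wildcard syntax is Python's shell-style `fnmatch` pattern (each engine has its own syntax), and the `window` parameter for the proximity check is a hypothetical choice. It shows how a wildcard widens a search while an exact phrase or a proximity connector narrows it.

```python
import fnmatch

# Invented sample documents, already lowercased.
DOCS = [
    "she likes to swim",
    "swimming lessons for every swimmer",
    "the swimsuit sale",
    "spring flower show",
    "flower shop opens late in the spring",
]

def wildcard_match(docs, pattern):
    """Docs with at least one word matching a shell-style pattern such as 'swim*'."""
    return [d for d in docs
            if any(fnmatch.fnmatchcase(w, pattern) for w in d.split())]

def phrase_match(docs, phrase):
    """Docs containing the exact word sequence of the phrase."""
    target = phrase.split()
    n = len(target)
    hits = []
    for d in docs:
        words = d.split()
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            hits.append(d)
    return hits

def near_match(docs, a, b, window=3):
    """Docs where words a and b occur within `window` positions of each other."""
    hits = []
    for d in docs:
        words = d.split()
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            hits.append(d)
    return hits

print(len(wildcard_match(DOCS, "swim")), "hit without wildcard")
print(len(wildcard_match(DOCS, "swim*")), "hits with wildcard")
print(phrase_match(DOCS, "spring flower"))
print(near_match(DOCS, "spring", "flower"))
```

The wildcard also pulls in "swimsuit", which illustrates why scope-expanding operators call for caution.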
?
24. Search Engines - Ranking
- Result relevancy ranking (usefulness) can be done according to two techniques (or some combination)
  - Conventional - using intra-page information
  - Relative - using extra-page information
?
25. Search Engines - Conventional Ranking
- Conventional (intra-page)
- frequency of words (number and density)
- phrases (exact word sequences)
- hierarchy (e.g., closer to the top of the document)
- adjacency (proximity of words)
- metadata (keywords provided by content owners)
- font size and style (relative intra-page)
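
As a rough illustration of intra-page ranking, the sketch below scores a page using only two of the signals listed above: word frequency (density) and hierarchy (how close to the top the word first appears). The pages, the function names, and the 0.5 weighting are all assumptions made for the example; no real engine's formula is implied.

```python
# Invented sample pages: name -> page text.
RANK_PAGES = {
    "curta-repair": "curta calculator repair and curta cleaning",
    "watch-repair": "vintage watch and clock repair",
    "gardening": "how to grow tomatoes",
}

def conventional_score(text, query_words):
    """Score a page from its own content only (intra-page information)."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = 0.0
    for q in query_words:
        positions = [i for i, w in enumerate(words) if w == q.lower()]
        if not positions:
            continue
        density = len(positions) / len(words)        # frequency of the word
        earliness = 1.0 - positions[0] / len(words)  # closer to the top scores higher
        score += density + 0.5 * earliness           # 0.5 is an arbitrary weight
    return score

def rank(pages, query_words):
    """Return page names with a non-zero score, best first (ties by name)."""
    scored = [(conventional_score(text, query_words), name)
              for name, text in pages.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]

print(rank(RANK_PAGES, ["curta", "repair"]))
```

Because every signal comes from the page itself, a page author can inflate such a score by repeating keywords, which is one motivation for the extra-page techniques.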
?
Jack Christensen repairs CURTA calculators. I've known Jack for many years and can highly recommend him. Here are a few questions I asked Jack:

What do you charge to clean a Curta?
Typically 65 to 95, depending on the work involved. More often than not, the upper carriage needs a complete disassembly, whereas the main body can be cleaned without a complete disassembly. If the main body needs to be completely disassembled, something is usually bent, out of adjustment, or broken.

What do you charge when repairing a Curta?
I charge 20 per hour of my time. It seems my hours are about 90 minutes long, however, because I rarely finish in the time I originally quoted. Extended repair time is absorbed by me.

What spare parts do you have? Are they expensive?
I actually have many hundreds of new original Curta parts. Most are for inside the instrument, though. I use them when I do general cleaning and repairs. Outer body pieces, replacement canisters, and external parts that are easily damaged or broken due to abuse are not generally available, although I do occasionally locate some of these items. Sometimes I have to fabricate a part, or repair an item as best I can. Obviously, this takes time, and the cost is high. Parts costs are charged as the traffic will bear. I usually try to be blunt about this to the Curta owner, often telling them that a severely damaged unit is best sold as a "parts Curta". Unfortunately, I've sometimes had to tell this to someone who wanted to repair a Curta looked upon as an heirloom. What to them appears to be a minor issue actually turns out to be a major problem (e.g., a crank handle tilted downward is due to a broken main shaft). I think the most I ever charged for a repair was about 375. There were many severe problems with the unit. Generally, when the price gets to be above 175 most people simply decide to keep the damaged Curta as a memento.

Can you replace a clearing ring? What costs are involved?
The plastic clearing rings are easy to install. I have several new ones, but I typically do not sell them separately as a spare part. Rather, I install them during a general cleaning and repair. Metal rings are more difficult to replace. As with the plastic clearing rings, I will only install a metal clearing ring during a general cleaning and repair. It takes a special tool to properly swage the rivet in place.
(Editor's note: Very old Type I clearing rings were held on with a screw and nut. The nut was also crimped to the screw threads.)
I used all the new metal clearing rings I had about five years ago, but I do have a few used ones that were removed from other damaged Curtas. I have these for both the Type I and
?
26. Search Engines - Relative Ranking
- Relative (extra-page)
- popularity (page visits - from the search engine)
- citation (links pointing to the item)
- relevance of the pages containing the links
pointing to the item (!)
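
The last bullet, where a link counts for more when the linking page is itself well regarded, can be sketched as a simplified PageRank-style iteration. The link graph, the function name `link_rank`, and the damping value of 0.85 are assumptions for illustration; this is not presented as the formula any particular engine uses.

```python
# Invented link graph: page -> pages it links to.
LINKS = {
    "hub": ["guide", "blog"],
    "blog": ["guide"],
    "guide": ["hub"],
    "orphan": ["blog"],   # links out, but nothing links to it
}

def link_rank(links, damping=0.85, iterations=50):
    """Iteratively score pages so that links from high-scoring pages count more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

scores = link_rank(LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph "guide" comes out on top because it is cited by both "hub" and "blog", while "orphan" scores lowest because no page cites it.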
?
27. Search Engines - Keys to Success
- Size -> large index and / or several engines
- Scoped query -> wide net but appropriate sieve, carefully constructed for your needs
- Ranked and manageable results -> query construction and search engine features
?
28. Meta Search Engines
- "Meta" search tools are able to search the index databases of multiple engines simultaneously, via a single interface.
- Meta search tools don't really search metadata. They are simply brokers that reformulate a query and hand it off to a set of search engines, then combine the results.
- Meta engines are very fast but they do not offer the same level of control over the relationship between keywords as do individual search engines.
- Also, meta search engines may produce poor ranking of combined results.
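
The broker behaviour described above can be sketched as follows. The two "engines" are hypothetical stand-in functions returning canned result lists, and the merge rule (summing reciprocal ranks) is just one simple assumption about how results might be combined; actual meta engines may merge quite differently.

```python
from collections import defaultdict

# Hypothetical stand-ins for real engines: each returns a ranked list of URLs.
def engine_a(query):
    return ["x.example", "y.example", "z.example"]

def engine_b(query):
    return ["y.example", "w.example"]

def merge_results(result_lists):
    """Combine ranked lists: each URL scores the sum of 1/position over all lists."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))

def meta_search(query, engines):
    """Hand the same query to every engine, then merge what comes back."""
    return merge_results([engine(query) for engine in engines])

print(meta_search("curta repair", [engine_a, engine_b]))
```

The merge step only sees positions, not each engine's internal relevance scores, which hints at why combined rankings can be poor.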
?
29. Search Engines
Examples of popular search engines include:
- Google: http://www.google.com
- Alta Vista: http://www.altavista.com
- All the Web: http://www.alltheweb.com
- Northern Light: http://www.northernlight.com
Also see the KartOO clustering visual engine: http://www.kartoo.com/
For meta engines, try Vivisimo at http://vivisimo.com/
?
30. Information Search Strategies
- Think hard about what you are looking for!
- Use a Reference Tool, if appropriate
- Use a Pathfinder, if you know one
- Use a Specialized Information Repository, if appropriate
- Use Subject Indexes, if it is a common topic
- Use several Search Engines, if needed, especially for the obscure or academic topic, but learn how they work
  - Use keywords - be narrow, and specific (and technical)
  - Use phrases - try synonyms or related concepts
  - Use Boolean connectors - but find out if / how the engine uses them
  - Use stemming and wildcards - but find out if / how the engine uses them
  - Use media-type filters or metadata, if appropriate
?
31. Information Search Tools - Use
(Diagram axes: depth vs. breadth; easy to use vs. hard to use)
- Pathfinder: focused content; pre-selected by domain experts
- Search Engines and Meta-engines: obscure or academic; caveat emptor!
- Subject Indexes: popular or common; pre-selected by interested people
- Specialized Information Repository: related or themed; pre-selected by professionals; contains invisible content
- Reference Tool: generic simple lookup; created by professionals; contains invisible content
?
32. Information Evaluation Strategies - CARS
- CARS checklist
- http://library.queensu.ca./inforef/guides/evalchart.htm
- Credibility
  - author credentials stated with email contact
  - evidence of quality control (site location)
- Accuracy
  - timeliness
  - comprehensiveness
  - audience purpose
- Reasonableness
  - fairness
  - objectivity
  - consistency
  - world view
- Support
  - source documentation or bibliography
?
33. Summary
- There is much information on the Web, but it's not
  - all there
  - all good (or all bad)
  - always easy to locate
- Use an information search strategy that
  - matches the information sought
  - uses the appropriate tools
  - uses them in the correct ways
- Use an information evaluation strategy, e.g., CARS methodology.
- Choose and use search engines wisely, knowing their strengths, features, and their limitations.
?
34. How Do Search Engines Work?
- Three Activities Occur
- 1. Crawling
  - fetch pages
  - compile URL list (a db)
  - re-visit pages
- 2. Page harvesting
  - parse page
  - add to index db and establish ranking
- 3. Responding to search requests
  - parse query
  - apply to index
  - present and rank results
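
The three activities above can be strung together in a small end-to-end sketch. Everything here is an assumption for illustration: the "web" is an in-memory dictionary rather than real HTTP fetches, the page names and text are invented, and ranking is simply the number of query words a page contains.

```python
from collections import defaultdict, deque

# Invented in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "home": ("welcome to the curta pages", ["repair", "history"]),
    "repair": ("curta repair and cleaning", ["home"]),
    "history": ("history of the curta calculator", ["home", "repair"]),
    "hidden": ("unlinked curta database", []),  # nothing links here
}

def crawl(web, seed):
    """1. Crawling: follow links from a seed and compile the URL list."""
    seen, queue = [seed], deque([seed])
    while queue:
        url = queue.popleft()
        _text, links = web[url]
        for link in links:
            if link in web and link not in seen:
                seen.append(link)
                queue.append(link)
    return seen

def harvest(web, urls):
    """2. Page harvesting: parse each page and add its words to the index."""
    index = defaultdict(set)
    for url in urls:
        for word in web[url][0].lower().split():
            index[word].add(url)
    return index

def respond(index, query):
    """3. Responding: parse the query, apply it to the index, rank the results."""
    counts = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            counts[url] += 1
    return sorted(counts, key=lambda u: (-counts[u], u))

urls = crawl(TOY_WEB, "home")
index = harvest(TOY_WEB, urls)
print(urls)
print(respond(index, "curta repair"))
```

The "hidden" page mentions "curta" but never appears in the results, because no crawled page links to it: a small picture of the invisible web.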
?
35. Search Engines - Operation
[Figure: search engine operation. Components: Crawler Robot (fetch, re-visit), URL data base, Harvester Robot (fetch), Index data base, Query Processor (query, query results)]
?
36. Search Engine - Hardware
(not really)
?
37. How Do Search Engines Work?
- See "The Anatomy of a Large-Scale Hypertextual Web Search Engine" at http://www-db.stanford.edu/~backrub/google.html
?
38. References
- Information Search Strategies
  - <http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html>
- Information Evaluation Strategies
  - <http://www.vuw.ac.nz/agsmith/evaln/evaln.htm>
- Search Engines
  - <http://www.library.arizona.edu/search.htm>
  - <http://www.brightplanet.com/deepcontent/tutorials/search/index.asp>
  - <http://www.searchenginewatch.com/>
  - Susan Maze, David Moxley, Donna Smith, Authoritative Guide to Web Search Engines, Neal Schuman Pub, 1997, ISBN 1555703054
?