Transcript and Presenter's Notes

Title: Indexing your web server(s)


1
Indexing your web server(s)
  • Helen Varley Sargan

2
Why create an index?
  • Helps users (and webmasters) to find things
  • but isn't a substitute for good navigation
  • Gives cohesion to a group of unrelated servers
  • Observation of logs gives information on what
    people are looking for - and what they are having
    trouble finding
  • You are already being part-indexed by many search
    engines, unless you have taken specific action
    against it

3
Current situation
  Based on a UKOLN survey of the search engines used in 160 UK
  HEIs, carried out in July/Aug 1999. Report to be published in
  Ariadne issue 21. See <http://www.ariadne.ac.uk/>.

  Name        Total
  ht://Dig       25
  Excite         19
  Microsoft      12
  Harvest         8
  Ultraseek       7
  SWISH           5
  Webinator       4
  Netscape        3
  wwwwais         3
  FreeFind        2
  Other          13
  None           59
4
Current situation questions
  • Is the version of Muscat used by Surrey the free
    version that was available for a time (but is no
    longer)?
  • Are the users of Excite happy with its security
    record, and with the fact that development seems
    to have ceased?
  • Are users of local search engines that don't use
    robots.txt happy with what other search engines
    can index on their sites (you have got a
    robots.txt file, haven't you?)

5
Types of tool
  • External services are robots
  • Tools you install yourself fall into two main
    categories (some will work both ways)
  • direct indexes of the local and/or networked file
    structure
  • robot- or spider-based, following the instructions
    in the robots.txt file on each web server indexed
    (a minimal sketch of this approach follows this
    list)
  • The programs either come in a form you have to
    compile yourself or precompiled for your OS, or
    they are written in Perl or Java and so need a
    Perl or Java runtime to function.
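As a rough sketch of the robot-based approach (not part of the
original presentation - the server address and robot name are
invented), a local indexer can use Python's standard
urllib.robotparser module to honour a server's robots.txt before
fetching anything:

    from urllib.robotparser import RobotFileParser
    from urllib.request import urlopen

    SERVER = "http://www.example.ac.uk"   # hypothetical server to index
    ROBOT_NAME = "LocalIndexer"           # hypothetical robot user-agent

    # Read the server's robots.txt once, then consult it for every URL.
    robots = RobotFileParser(SERVER + "/robots.txt")
    robots.read()

    def fetch_for_indexing(url):
        """Fetch a page only if robots.txt allows this robot to visit it."""
        if not robots.can_fetch(ROBOT_NAME, url):
            return None                   # excluded by robots.txt - skip
        with urlopen(url) as response:
            return response.read()        # hand the HTML to the indexer

    page = fetch_for_indexing(SERVER + "/prospectus/index.html")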

6
Controlling robot access 1
  • All of our web servers are being part-indexed by
    external robots
  • External robots and a local robot-based indexer
    are controlled by the same routes
  • a robots.txt file giving access information for
    the whole server
  • a robots meta tag in each HTML file, permitting or
    excluding indexing and link-following
  • meta tags in each HTML file giving a description
    and keywords
  • The first two controls are observed by all the
    major search engines. Some search engines do not
    observe description and keyword meta tags.
    (Examples of all three are sketched after this
    list.)
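As an illustration (the paths and wording below are invented, not
taken from the presentation), the three controls might look like
this:

    # robots.txt - placed at the root of each web server
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /drafts/

    <!-- per-page robot control, in the <head> of each HTML file -->
    <meta name="robots" content="noindex, nofollow">

    <!-- per-page description and keywords, read by some engines only -->
    <meta name="description" content="Annual report of the Computing Service">
    <meta name="keywords" content="computing service, annual report, statistics">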

7
Controlling robot access 2
  • Some patchy support for Dublin Core metadata (an
    example is sketched after this list)
  • Access to branches of the server can be limited
    by the server software - by combining access
    control with metadata you can give limited
    information to some users and more to others.
  • If you don't want people to read files, either
    password-protect that section of the server or
    remove them. Listing a directory in robots.txt to
    keep robots out can make nosy users flock to look
    at what's inside.
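For reference (an illustrative fragment, not from the presentation;
the subject line is invented and the element names follow the common
DC.-prefixed embedding convention), Dublin Core metadata is usually
carried in meta tags like these:

    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
    <meta name="DC.title"   content="Indexing your web server(s)">
    <meta name="DC.creator" content="Helen Varley Sargan">
    <meta name="DC.subject" content="web indexing; search engines">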

8
Security
  • There has been a security problem with indexing
    software (Excite free version in 1998)
  • Remember the security of the OS that the indexing
    software runs under - keep all machines
    up-to-date with security patches, whether they are
    causing trouble or not.
  • Seek help with security if you are not an expert
    in the OS, particularly with Unix or Windows NT

9
What tool to use? 1
  • Find out if any money, hardware and/or staff are
    available for the project first
  • Make a shopping list of your requirements and
    conditions
  • hosting the index (where)?
  • platform (available and desirable)?
  • how many servers (and/or pages) will I index?
  • is the indexed data very dynamic?
  • what types of files do I want indexed?
  • what kind of search (keyword, phrase, natural
    language, constrained)?
  • Are you concerned about how you are indexed by others?

10
What tool to use? 2
  • Equipped with the answers to the previous
    questions, you will be able to select a suitable
    category of tool
  • If you are concerned about how others index your
    site, install a local robot- or spider-based
    indexer and review your indexer control measures
  • Free externally hosted services for very small
    needs
  • Free tools (mainly Unix-based) for the
    technically literate or built-in to some server
    software
  • Commercial tools cover a range of platforms and
    pocket-depths but vary enormously in features

11
Free externally hosted services
  • Will be limited in the number of pages indexed and
    possibly the number of times the index is accessed,
    and may be deleted if not used for a certain
    number of days (5-7)
  • Very useful for small sites and/or those with
    little technical experience or resources
  • Access is prey to Internet traffic (most services
    are in the US) and server availability, and for UK
    users incoming transatlantic traffic will be
    charged for
  • You may have to have advertising on your search
    page as a condition of use

12
Free tools - built in
  • Microsoft, Netscape, WebStar, WebTen and WebSite
    Pro all come with built-in indexers (others may
    too)
  • With any or all of these there may be problems
    indexing some other servers, since they all use
    vendor-specific APIs (they may receive responses
    from other servers that they can't interpret).
    Problems become more likely as the number and
    variety of server types being indexed grows.

13
Free tools - installed
  • Most active current development on SWISH (both
    SWISH-E and SWISH++), Webglimpse, ht://Dig and
    Alkaline
  • Alkaline is a new product; all the others have
    been through long periods of inactivity and all
    are dependent on volunteer effort
  • All of these are now robot-based but may have
    other means of looking at directories as well
  • Alkaline is available on Windows NT, but all the
    others are Unix-only. Some need to be compiled
    from source (a typical build sequence is sketched
    below).
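For tools distributed as source, the build usually follows the
familiar configure/make pattern; the sketch below shows the general
shape only (the install location is an example, and each tool
documents its own options):

    # unpack the source archive, then from inside the source directory:
    ./configure --prefix=/usr/local/indexer   # example install location
    make
    make install                              # usually run as root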

14
Commercial tools
  • Most have specialisms - sort out your
    requirements very carefully before you select a
    shortlist
  • Prices vary from around US$250 to $10,000
    (possibly with additional yearly maintenance),
    depending on the product
  • The cost of most will be on a sliding scale
    depending on the size of the index being used
  • Bear in mind that Java-based tools will require
    the user to be running a Java-enabled browser

15
Case Study 1 - Essex
  • Platform: Windows NT
  • Number of servers searched: 16
  • Number of entries: approx 11,500
  • File types indexed: Office files, HTML and txt.
    Filters available for other formats
  • Index updating: configured with the Windows task
    scheduler. Incremental updates possible.
  • Constrained searches possible: Yes
  • Configuration: follows robots.txt but can take a
    'back door' route as well. Obeys the robots meta tag
  • Logs and reports: creates reports on crawling
    progress. Log analysis not included but can be
    written as add-ons (ASP scripts)
  • Pros: free of charge with Windows NT.
  • Cons: needs a high level of Windows NT expertise to
    set up and run it effectively. May run into
    problems indexing servers running diverse server
    software. Not compatible with Microsoft Index
    Server (a single-server product). Creates several
    catalog files, which may create network problems
    when indexing many servers.

16
Case Study 2 - Oxford
  • Platform: Unix
  • Number of servers searched: 131
  • Number of entries: approx 43,500 (specifically, a
    maximum of 9 levels down on any server)
  • File types indexed: Office files, HTML and txt.
    Filters available for other formats
  • Index updating: configured to reindex after a set
    time period. Incremental updates possible.
  • Constrained searches possible: Yes, but they need
    to be configured on the ht://Dig server
  • Configuration: follows robots.txt but can take a
    'back door' route as well. (An illustrative
    configuration sketch follows this list.)
  • Logs and reports: none generated in an obvious
    manner, but probably available somehow.
  • Pros: free of charge. Wide number of
    configuration options available.
  • Cons: needs a high level of Unix expertise to set
    up and run it effectively. Index files are very
    large.
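As a rough illustration only (the host names and values are
invented, and the attribute names are given as recalled from the
ht://Dig 3.x documentation - check the current documentation before
use), an ht://Dig configuration limiting depth in the way described
above might look something like this:

    # htdig.conf - minimal sketch
    database_dir:  /opt/htdig/db
    start_url:     http://www.example.ac.uk/
    limit_urls_to: http://www.example.ac.uk/
    exclude_urls:  /cgi-bin/ .cgi
    # index no more than 9 levels down from the start URL
    max_hop_count: 9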

17
Case Study 3 - Cambridge
  • Platform: Unix
  • Number of servers searched: 232
  • Number of entries: approx 188,000
  • File types indexed: many formats, including PDF,
    HTML and txt.
  • Index updating: intelligent incremental
    reindexing dependent on the frequency of file
    updates - can be given a permitted schedule. Manual
    incremental updates easily done.
  • Constrained searches possible: Yes; easily
    configured by users and can also be added to the
    configuration as a known constrained search.
  • Configuration: follows robots.txt and meta tags.
    Configurable weighting given to terms in the title
    and meta tags. Thesaurus add-on available to give
    user-controlled alternatives
  • Logs and reports: logs and reports available for
    every aspect of use - search terms, number of
    terms, servers searched, etc.
  • Pros: very easy to install and maintain. Gives
    extremely good results in a problematic
    environment. Technical support excellent.
  • Cons: relatively expensive.

18
Recommendations
  • Choosing an appropriate search engine is wholly
    dependent on your particular needs and
    circumstances
  • Sort out all your robot-based indexing controls
    when you install your local indexer
  • Do review your indexing software regularly - even
    if it's trouble-free it still needs maintaining