Title: CMP 788 Distributed IR
1. CMP 788 Distributed IR
- Part 2, Lecture 8
- Harvest, Part 3
- Harvest Replication and Object Caching
- Fall 05
- Department of Mathematics and Computer Science
- Lehman College, CUNY
 
2. Broker
3. SOIF Example
4. Replicator and Object Cache
- The Harvest Replicator can be used to replicate servers, to enhance user-base scalability.
- For example, the HSR (Harvest Server Registry) will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts.
- The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.
- The Harvest Object Cache reduces network load, server load, and response latency when accessing located information objects.
- An LFU (Least Frequently Used) cache replacement strategy is often used.
- Hierarchical caching: if the target object is not in the local cache, use the subnet's cache (often provided by firewall software) or a parent cache (a larger cache stored on a server shared by many machines).
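The LFU replacement strategy mentioned above can be sketched in a few lines. This is only an illustration, not Harvest's implementation; the class name `LFUCache` and the `capacity` parameter are assumptions made for the example.

```python
class LFUCache:
    """Tiny LFU cache: evicts the entry with the fewest accesses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}    # key -> cached object
        self.counts = {}  # key -> number of accesses

    def get(self, key):
        if key not in self.data:
            return None  # miss: a caller would ask a subnet or parent cache next
        self.counts[key] += 1
        return self.data[key]

    def put(self, key, value):
        if self.capacity <= 0:
            return
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the least frequently used key (ties: oldest inserted).
            victim = min(self.counts, key=self.counts.get)
            del self.data[victim]
            del self.counts[victim]
        self.data[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1


cache = LFUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" now has 2 accesses, "b" has 1
cache.put("c", 3)  # cache is full, so the least used key ("b") is evicted
```

A real cache would also bound memory by object size rather than by entry count; the count-based capacity here just keeps the sketch short.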
5. Hierarchical Cache Arrangement
6. Caching Subsystem
7. Caching Resolution Protocol
- Each cache in the hierarchy independently decides whether to fetch the reference from the object's home site or from its parent or sibling caches, using a simple resolution protocol.
- If the URL contains any of a configurable list of substrings, then the object is fetched directly from the object's home, rather than through the cache hierarchy.
- This feature is used to resolve non-cacheable URLs (e.g., cgi-bin, password-protected objects).
- If the URL's domain name matches a configurable list of substrings, then the object is resolved through the particular parent bound to that domain.
- If a cache receives a request for a URL that misses, it performs a remote procedure call to all of its siblings and parents, checking if the URL hits any sibling or parent.
- The cache retrieves the object from the site with the lowest measured latency.
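The decision rules on this slide can be sketched as a single function that runs after a local cache miss. All names here (`choose_source`, `direct_substrings`, `domain_parents`, `neighbor_probe`) are hypothetical; `neighbor_probe` stands in for the remote procedure calls to siblings and parents, and falling back to the home site when no neighbor reports a hit is a simplification of this sketch.

```python
def choose_source(url, host, direct_substrings, domain_parents, neighbor_probe):
    """Pick where to fetch `url` from after a local miss.

    neighbor_probe: dict mapping neighbor name -> (hit, latency_seconds).
    """
    # Rule 1: URLs that look non-cacheable bypass the hierarchy.
    if any(s in url for s in direct_substrings):
        return "home"
    # Rule 2: a domain bound to a particular parent goes to that parent.
    for pattern, parent in domain_parents.items():
        if pattern in host:
            return parent
    # Rule 3: among neighbors reporting a hit, take the lowest latency.
    hits = {name: lat for name, (hit, lat) in neighbor_probe.items() if hit}
    if hits:
        return min(hits, key=hits.get)
    return "home"  # simplification: nobody has it, so go to the home site


probe = {
    "sibling1": (True, 0.030),
    "parent1": (True, 0.012),
    "sibling2": (False, 0.005),  # fastest reply, but a miss
}
direct = ["cgi-bin"]
bound = {".edu": "edu-parent"}

a = choose_source("http://www.example.com/cgi-bin/q", "www.example.com", direct, bound, probe)
b = choose_source("http://cs.example.edu/x.html", "cs.example.edu", direct, bound, probe)
c = choose_source("http://www.example.com/x.html", "www.example.com", direct, bound, probe)
```

In the three calls, `a` bypasses the hierarchy, `b` goes to the parent bound to the domain, and `c` is served by the hitting neighbor with the lowest latency.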
8. Caching Resolution Protocol (cont'd)
- Hierarchies as deep as three caches add little noticeable access latency.
- The only case where the cache adds noticeable latency is when one of its parents fails, but the child cache has not yet detected it.
- In this case, references to the object are delayed by two seconds, the parent-to-child cache timeout.
- As the hierarchy deepens, the root caches become responsible for more and more clients.
- To keep root cache servers from becoming overloaded, the Harvest hierarchy terminates at the first place in the regional or backbone network where bandwidth is plentiful.
9. Caching Resolution Protocol (cont'd)
- Additionally, a cache option can be enabled that tricks the referenced URL's home site into taking part in the resolution protocol.
- This option allows the cache to retrieve the object from the home site if it happens to be closer than any of the sibling or parent caches.
- Can be based on estimating object access latency using a network ping or echo.
- A cache resolves a reference through the first sibling, parent, or home site to return a UDP Hit packet (the home site answers through its echo port).
- If all caches miss and no Hit arrives within two seconds, the reference is resolved through the first parent to return a UDP Miss message.
10. Caching Resolution Protocol (cont'd)
- The cache will not wait for a home machine to time out; it will begin transmitting as soon as all of the parent and sibling caches have responded.
- The resolution protocol's goal is for a cache to resolve an object through the source (cache or home) that can provide it most efficiently.
- This protocol is really a simple heuristic:
  - A fast response to a ping indicates low latency.
  - But bandwidth is more important for large objects.
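One reading of the Hit/Miss rules from the last two slides can be simulated offline, given the arrival times of the UDP replies. This is a sketch under assumptions: the function name `pick_source`, the reply tuples, and the `caches` mapping are invented for illustration, and no real network traffic is involved.

```python
TIMEOUT = 2.0  # seconds, the timeout quoted in the slides


def pick_source(replies, caches):
    """Choose the source to fetch through.

    replies: list of (arrival_time, name, kind), kind is "hit" or "miss";
             name is a cache name or "home".
    caches:  dict mapping cache name -> "sibling" or "parent".
    """
    pending = set(caches)     # caches that have not answered yet
    first_parent_miss = None
    for t, name, kind in sorted(replies):
        if t > TIMEOUT:
            break             # anything later than the timeout is ignored
        if kind == "hit":
            return name       # first Hit wins, whoever sent it
        pending.discard(name)
        if caches.get(name) == "parent" and first_parent_miss is None:
            first_parent_miss = name
        if not pending:
            break             # all caches answered: do not wait for home
    return first_parent_miss or "home"


caches = {"s1": "sibling", "p1": "parent", "p2": "parent"}
sibling_hit = pick_source(
    [(0.05, "p1", "miss"), (0.02, "s1", "hit"), (0.01, "p2", "miss")], caches)
all_miss = pick_source(
    [(0.05, "p1", "miss"), (0.02, "s1", "miss"), (0.01, "p2", "miss"),
     (0.50, "home", "hit")], caches)
```

In the second call every cache has answered Miss by 0.05 s, so the sketch stops there and goes through the first parent that answered, without waiting for the home site's later Hit.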
 
11. Non-cacheable Objects/Security
- The wide variety of Internet information systems leads to a number of cases where objects should not be cached.
- Objects that are password protected are not cached. Rather, the cache acts as an application gateway and discards the retrieved object as soon as it has been delivered. This helps avoid security and privacy problems.
- Output of cgi-bin programs (server-side scripts) is not cached.
- The cache may limit the size of the largest cacheable object, so that a few large FTP objects do not purge ten thousand smaller objects from the cache.
- The caching subsystem does not prevent servers from encrypting or applying digital signatures to their documents.
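The cacheability rules on this slide amount to a short predicate. The sketch below is an assumption-laden illustration: the function name `is_cacheable`, the substring list, and the 4 MB size limit are made up for the example and are not Harvest defaults.

```python
def is_cacheable(url, size_bytes, password_protected,
                 max_object_bytes=4 * 1024 * 1024,        # assumed limit
                 uncacheable_substrings=("cgi-bin",)):    # assumed list
    """Return True if the object may be stored in the cache."""
    if password_protected:
        return False  # deliver it, then discard it (gateway behaviour)
    if any(s in url for s in uncacheable_substrings):
        return False  # output of server-side scripts
    if size_bytes > max_object_bytes:
        return False  # would push out many smaller objects
    return True


ok = is_cacheable("http://www.example.com/paper.ps", 200_000, False)
script = is_cacheable("http://www.example.com/cgi-bin/search", 2_000, False)
huge = is_cacheable("ftp://ftp.example.com/big.tar", 50 * 1024 * 1024, False)
private = is_cacheable("http://www.example.com/grades.html", 2_000, True)
```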
12. Cache Updates Problem
- Problems with caching
  - It is difficult to know whether a cached object has been updated before its next use without checking (at least with a HEAD request).
  - The Web has no integrated mechanism for a remotely forced cache flush.
  - Expiration can be controlled by an object header (e.g., Expires: 0 or Expires: Thu, 16 May 2001 14:40:30 GMT).
  - This mechanism only supports predictive expiration (it says in advance how long a copy may be used).
  - But what if the object changes unexpectedly before expiration, or remains unchanged after the specified time?
- Cache updates
  - Based on data access efficiency: use logged usage statistics (LRU, Least Recently Used), triggered by a cache server.
  - A cache consistency problem may occur when several cache servers each maintain a copy of the same object.
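Predictive expiration can be illustrated with a small freshness check on the Expires header value. This is a sketch, not part of Harvest: the function name `is_fresh` and the example date are assumptions, and unparsable values such as `0` are treated as already expired.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """Return True if a cached copy may still be used at time `now`."""
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False  # e.g. "0": treat unparsable values as already expired
    if expires.tzinfo is None:
        expires = expires.replace(tzinfo=timezone.utc)
    return now < expires


header = "Fri, 01 Jun 2001 12:00:00 GMT"  # example value chosen for this sketch
before = is_fresh(header, datetime(2001, 5, 31, tzinfo=timezone.utc))
after = is_fresh(header, datetime(2001, 6, 2, tzinfo=timezone.utc))
zero = is_fresh("0", datetime(2001, 5, 31, tzinfo=timezone.utc))
```

The check shows the limitation raised above: it only compares clocks, so it cannot notice an object that changed before its expiration time.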
13. The Economy of Cache Updates
14. Negative Caching
- To reduce the costs of repeated failures, negative caching is used.
- When a DNS lookup failure occurs, Harvest caches the negative result for five minutes (chosen because transient Internet conditions are typically resolved this quickly).
- When an object retrieval failure occurs, Harvest caches the negative result for a parameterized period of time, with a default of five minutes.
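Negative caching as described above can be sketched as a table of recent failures with a time-to-live. The class and method names are hypothetical; only the five-minute default comes from the slide. Times are passed in explicitly so the example does not depend on the wall clock.

```python
class NegativeCache:
    """Remembers recent failures so they are not retried immediately."""

    def __init__(self, ttl_seconds=300):  # five minutes, as in the slide
        self.ttl = ttl_seconds
        self.failed_at = {}               # key -> time of the failure

    def record_failure(self, key, now):
        self.failed_at[key] = now

    def is_known_bad(self, key, now):
        t = self.failed_at.get(key)
        if t is None:
            return False
        if now - t >= self.ttl:
            del self.failed_at[key]       # entry expired: allow a retry
            return False
        return True


neg = NegativeCache()
neg.record_failure("no-such-host.example.com", now=1000.0)
soon = neg.is_known_bad("no-such-host.example.com", now=1100.0)   # within 5 min
later = neg.is_known_bad("no-such-host.example.com", now=1300.0)  # 5 min passed
```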
15. Replication Subsystem
- Motivations
  - We would like to have (complete) regional copies, with a mechanism to ensure active consistency updates.
  - mirror-d (replication tool for Harvest using ftp mirror)
- Figure legend (sites site1, site2, site3): thin black = mirror; thick gray = locally maintained master copies.
16. Replication Subsystem (cont'd)
- Active consistent updates
  - If a server changes its master copy, it notifies mirror sites.
- Harvest supports replication domains
  - Mirroring within a domain is carefully coordinated/synchronized.
  - Mirroring/replication between domains involves gradual propagation of changes (between the sites responsible for inter-domain communication).
17. Replication Subsystem (cont'd)
- mirror-d replication tool: weakly consistent replicated tree of files
- Motivation: multiple copies for future access (e.g., Europe, North America) -> replication domain
- Problem: maintaining data consistency
- Logical topology
  - Replication subgroups coordinate consistency and share updates internally within the subgroup domain.
- Physical issues (network bandwidth/usage)
  - Help determine how a replication domain propagates (floods) updates among its neighbors -> flooding must be handled (cf. flooding issues in peer-to-peer networks).
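The flooding of updates among neighbors can be illustrated with a breadth-first walk over the logical topology. This is a generic flooding sketch, not mirror-d's actual algorithm; the function name `flood` and the four-site topology are assumptions for the example.

```python
from collections import deque


def flood(neighbors, origin):
    """Return the order in which sites first receive an update from origin."""
    seen = {origin}
    order = [origin]
    queue = deque([origin])
    while queue:
        site = queue.popleft()
        for peer in neighbors.get(site, ()):
            if peer not in seen:  # drop duplicates so the flood terminates
                seen.add(peer)
                order.append(peer)
                queue.append(peer)
    return order


topology = {
    "site1": ["site2", "site3"],
    "site2": ["site1", "site3"],
    "site3": ["site1", "site2", "site4"],
    "site4": ["site3"],
}
order = flood(topology, "site1")
```

Without the `seen` set, site2 and site3 would keep forwarding the same update to each other, which is the flooding problem the slide alludes to.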
18. Replication Group
19. Replication Group (cont'd)
- Although replication domain members are stable, pathways for inter-domain communication may change based on dynamic properties of server load and bandwidth.
20. Physical vs. Logical Topology
- The logical inter-domain network topology is a subset of the full physical topology (and is dynamically re-configurable based on network load and bandwidth).
21. Replication in Broker World
22. Index Subsystem
- Please refer to the papers for an understanding of the indexing subsystem.
- GLIMPSE
  - Uses Essence SOIF objects to create inverted index entries.
- NEBULA
  - Supports a hierarchical classification scheme -> automatic Yahoo-style classification.
  - Provides views (pre-computed query responses) -> basically vector clusters w.r.t. a given query.
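The idea of turning attribute-value summaries into inverted index entries can be sketched generically. This is not GLIMPSE's algorithm or the SOIF syntax: plain dictionaries stand in for SOIF objects, and the function name `build_index` and the sample records are assumptions.

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased word to the set of URLs whose summary contains it.

    records: dict mapping URL -> dict of attribute name -> text value.
    """
    index = defaultdict(set)
    for url, attrs in records.items():
        for value in attrs.values():
            for word in value.lower().split():
                index[word].add(url)
    return index


records = {
    "http://example.org/a": {"title": "Harvest cache", "author": "Smith"},
    "http://example.org/b": {"title": "Harvest replication"},
}
index = build_index(records)
matches = index["harvest"]  # both URLs mention "harvest"
```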
23. Harvest NG Architecture
- User agents: goal-directed extraction, analysis, even dialog
- Meta Brokers: meta search, collection/query fusion
- Brokers (Index, Search)
- Gatherers: gathering and extracting (SOIF)
- Finders (Spiders): locate pages
- Content Provider (Web content pages)