Title: New Technologies in the JANET Web Cache Service Martin Hamilton
1 New Technologies in the JANET Web Cache Service
Martin Hamilton, George Neisser
http://wwwcache.ja.net/
support_at_wwwcache.ja.net
2 What is the JANET Web Cache Service?
- National caching service for the UK education and research community.
- Funded by JISC.
- Awarded by competitive tender to Loughborough University Computing Services and Manchester Computing.
- Largest "site" on JANET.
- 155 Megabits/second aggregate traffic.
- 70-80 million transactions/day.
- 700-800 Gigabytes transferred/day.
- Being used by some 170 institutions.
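As a rough sanity check on the figures above, the sketch below derives an average transfer size and average rates from the midpoints of the quoted ranges. The midpoints and the decimal definition of a Gigabyte are assumptions made for illustration. The results (about 10 KB per transaction, about 69 Megabits/second averaged over a day, about 870 requests/second) are daily means. The slide does not say whether its 155 Megabits/second figure is a peak or an average, so the two are not directly comparable.

```python
# Midpoints of the ranges quoted above; illustrative, not measured values.
BYTES_PER_DAY = 750e9         # 700-800 Gigabytes/day, assuming 1 GB = 10**9 bytes
TRANSACTIONS_PER_DAY = 75e6   # 70-80 million transactions/day
SECONDS_PER_DAY = 24 * 60 * 60


def mean_object_size_kb(bytes_per_day, transactions_per_day):
    """Average bytes moved per transaction, in kilobytes (10**3 bytes)."""
    return bytes_per_day / transactions_per_day / 1e3


def mean_rate_mbit(bytes_per_day):
    """Traffic averaged over 24 hours, in Megabits/second."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def mean_requests_per_second(transactions_per_day):
    """Transactions averaged over 24 hours."""
    return transactions_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{mean_object_size_kb(BYTES_PER_DAY, TRANSACTIONS_PER_DAY):.1f} KB/transaction")
    print(f"{mean_rate_mbit(BYTES_PER_DAY):.1f} Mbit/s mean")
    print(f"{mean_requests_per_second(TRANSACTIONS_PER_DAY):.0f} requests/s mean")
```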
3 What is Web Caching?
- Caches keep copies of popular Internet content.
- First site to fetch a URL causes it to be cached.
- Subsequent visits get the cached copy.
- Exceptions for things like secure (SSL) content, cookies, and dynamic content (CGI).
- Web caching seen as essential by most ISPs and large Internet sites.
- Caches can also be used for content filtering - e.g. legal requirement for FE sites.
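The fetch-once, serve-many behaviour described above can be sketched in a few lines. This is a toy model under stated assumptions, not how Squid works internally. The `is_cacheable` heuristics, the `TinyCache` class and the `fetch` callable are all hypothetical names introduced here, and expiry, validation and size limits are left out.

```python
from urllib.parse import urlsplit


def is_cacheable(method, url, request_headers, response_headers):
    """Rough test mirroring the exceptions on the slide.

    Header dictionaries are expected to use lower-case keys.
    """
    parts = urlsplit(url)
    if method != "GET":
        return False
    if parts.scheme == "https":                      # secure (SSL) content
        return False
    if "cookie" in request_headers or "set-cookie" in response_headers:
        return False
    if parts.query or "/cgi-bin/" in parts.path:     # dynamic content (CGI) heuristic
        return False
    return True


class TinyCache:
    """In-memory cache keyed by URL (hypothetical, for illustration only)."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable(url) -> (headers dict, body bytes)
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url, request_headers=None):
        req = {k.lower(): v for k, v in (request_headers or {}).items()}
        if url in self._store:                       # subsequent visit: cached copy
            self.hits += 1
            return self._store[url]
        self.misses += 1                             # first visit: go to the origin
        headers, body = self._fetch(url)
        resp = {k.lower(): v for k, v in headers.items()}
        if is_cacheable("GET", url, req, resp):
            self._store[url] = body
        return body
```

With a `fetch` function that records its calls, two requests for the same static URL reach the origin once, while two requests for a `/cgi-bin/` URL reach it twice.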
4 Service configuration
- Cache machines (34 of these) are typically Pentium II or III processor, 512 Megabytes memory, dual 100 Megabit/second Ethernet, and 6 or 12 Ultra2 SCSI disks for cached objects.
- Small number (currently 3) of load balancers to distribute requests between caches.
- Caches and load balancers all running Linux and the Squid Web Cache server.
- Some 1.5TB of pooled cache disk.
5 New technologies covered today...
- Automation of service monitoring and availability.
- Automating operations, so that a small number of people can run a huge service.
- "Glue" needed to link monitoring and management tools with email/paging/WAP.
- Incident and change logging/reporting.
- Management of machines at remote sites.
- Identify useful info for other service operators.
6 Problems encountered
- A server goes down, e.g. crashes or locks up.
- The service (e.g. Squid cache) goes down, but the server is still up.
- The machine or service is slow/overloaded.
- Time taken for machines to recover after a crash - Unix fsck process.
- Knowing who changed what, and when.
- Capturing long-term stats for profiling.
7 Problem: Machine goes down
- Spotting the problem - can get away with using ping for this. Many other tools available to automate this basic testing.
- Fixing may require local action (e.g. push the reset button), but most Unix systems support serial console access. Linux also has serial access to the LILO boot loader.
- Serial console useful for remotely managed kit, and also remote (off-site) access to local kit in an emergency.
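A minimal sketch of the ping-based check mentioned above, assuming the Linux `ping` command with its `-c` (count) and `-W` (reply timeout in seconds) options. The function names and the injectable `run` parameter are hypothetical, added so the logic can be exercised without touching the network. A real monitor would also retry before declaring a machine dead.

```python
import subprocess


def is_alive(host, timeout_s=2, run=subprocess.run):
    """Return True if `host` answers a single ICMP echo request."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,   # hard limit in case ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False                 # no reply in time, or ping not installed
    return result.returncode == 0    # ping exits 0 when a reply was received


def down_hosts(hosts, check=is_alive):
    """Return the hosts that failed the check, in the order given."""
    return [h for h in hosts if not check(h)]
```

Run from cron every few minutes, `down_hosts([...])` gives the list to feed into whatever alerting is in place.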
8 Solution: Linux Virtual Server
9 Linux Virtual Server explained
- Layer 4 switch in software. High service availability through redundancy.
- Load balances traffic across multiple "real servers" using a virtual IP address and per-server weightings.
- Real server death only affects current users - traffic routes around dead servers.
- Now fully deployed on the JANET caches.
- Useful for other services too, e.g. Websites. Note that e-mail and DNS have automatic fallback already.
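To illustrate per-server weightings and routing around dead servers, here is a small weighted round-robin sketch. It is not LVS code and does not reproduce any particular LVS scheduler. It uses the "smooth" weighted round-robin technique, and the `WeightedPool` class and server names are hypothetical.

```python
class WeightedPool:
    """Smooth weighted round robin over the real servers currently alive."""

    def __init__(self, weights):
        self.weights = dict(weights)                  # server name -> integer weight
        self.alive = set(self.weights)
        self._current = {name: 0 for name in self.weights}

    def mark_down(self, name):
        self.alive.discard(name)

    def mark_up(self, name):
        if name in self.weights:
            self.alive.add(name)

    def pick(self):
        """Choose the real server for the next new connection."""
        candidates = [n for n in self.weights
                      if n in self.alive and self.weights[n] > 0]
        if not candidates:
            raise RuntimeError("no real servers available")
        total = sum(self.weights[n] for n in candidates)
        for n in candidates:
            self._current[n] += self.weights[n]       # every live server gains its weight
        best = max(candidates, key=lambda n: self._current[n])
        self._current[best] -= total                  # the chosen server pays the total
        return best
```

With weights of 3 and 1, the first server receives three of every four new connections. After `mark_down` on it, all new connections go to the second, which matches the point that only users with connections already on the dead server are affected.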
10 Problem: Service goes down
- e.g. Squid dies when disks fill up.
- Older Squids used to lose track of disk consumption and fill disks up after a time.
- Can spot if Squid is running OK by SNMP.
- LVS monitor uses SNMP for service up/down status and performance checks.
- What constitutes your service? Can you measure its availability automatically?
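One way to answer the availability question is to turn periodic check results into a single figure. The sketch below assumes each check yields a `(timestamp, is_up)` pair and that a state holds until the next sample. Both the function and that sampling model are illustrative assumptions rather than a description of the LVS monitor.

```python
def availability(samples):
    """Fraction of the observed period during which the service was up.

    samples: list of (timestamp_seconds, is_up) pairs in time order.
    Each sample's state is assumed to last until the next sample.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = samples[-1][0] - samples[0][0]
    if total <= 0:
        raise ValueError("samples must span a positive time interval")
    up_time = 0.0
    for (start, is_up), (end, _next_state) in zip(samples, samples[1:]):
        if is_up:
            up_time += end - start
    return up_time / total
```

For example, five checks at one-minute intervals with a single failed check in the middle give an availability of 0.75 over the four minutes observed.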
11 Problem: Overloading
- Performance metrics available via SNMP already, plus addons like df and top.
- Can also try to use the service, e.g. fetch via proxy HTTP and measure performance.
- Fetch a test URL via each cache at intervals.
- Consider what you want to do with the info, e.g. tune LVS weightings, make case to management for more funding :-)
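Fetching a test URL through each cache and timing it can be sketched with the Python standard library. The proxy addresses and test URL are placeholders, the `opener_factory` parameter is a hypothetical hook for testing, and a real probe would want a URL whose origin server is known to be fast so that the cache is what gets measured.

```python
import http.client
import time
import urllib.request


def timed_fetch(url, proxy, timeout_s=10,
                opener_factory=urllib.request.build_opener):
    """Fetch `url` through the HTTP proxy `proxy`; return (ok, seconds)."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = opener_factory(handler)
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout_s) as response:
            response.read()
            ok = response.status == 200
    except (OSError, http.client.HTTPException):
        ok = False          # refused, timed out, HTTP error, or malformed reply
    return ok, time.monotonic() - start


def check_caches(url, proxies, fetch=timed_fetch):
    """Probe each cache in turn; returns {proxy: (ok, seconds)}."""
    return {proxy: fetch(url, proxy) for proxy in proxies}
```

A call such as `check_caches("http://www.example.org/test.html", ["http://cache1.example:3128"])` would be run at intervals from cron; 3128 is Squid's default port, and the host name is made up.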
12 Solution: SNMP network monitoring
13 Solution: SNMP service monitoring
14 Problem: Filestore check (fsck)
- Bugbear of traditional Unix systems.
- After a crash, 6 x 9GB disks can take over half an hour to check :-(
- Possible solution - trialling Linux journalling filesystem ReiserFS, which is also a lot faster than the conventional ext2 filesystem.
- Generally useful for server and workstation applications. Can be a work-around for other problems, e.g. recovery of remote systems much less painful after a crash.
15 Tracking changes - manually
- Web form - who, what, when?
- Search/browse interface for analysis/reporting.
- Only requires Unix, HTTP server, Perl.
- Nightly summary mailshot for management.
- Also being used by EMMAN and several groups at L'boro.
- Easier to use than paper record and more readily available. Structure allows for sensible queries.
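The value of a structured "who, what, when" record is that it can be queried. The sketch below shows the idea with Python and SQLite, swapped in purely for illustration: the system described on the slide is built on Unix, an HTTP server and Perl, and its actual schema is not given. The table layout and function names here are assumptions.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the change log database."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS changes (
               id        INTEGER PRIMARY KEY,
               who       TEXT NOT NULL,
               machine   TEXT NOT NULL,
               what      TEXT NOT NULL,
               logged_at TEXT NOT NULL   -- ISO 8601, so text order is time order
           )"""
    )
    return db


def log_change(db, who, machine, what, when):
    """Record one change: who did what to which machine, and when."""
    db.execute(
        "INSERT INTO changes (who, machine, what, logged_at) VALUES (?, ?, ?, ?)",
        (who, machine, what, when),
    )
    db.commit()


def changes_since(db, since, machine=None):
    """Changes at or after `since`, optionally for one machine, oldest first."""
    sql = "SELECT logged_at, who, machine, what FROM changes WHERE logged_at >= ?"
    args = [since]
    if machine is not None:
        sql += " AND machine = ?"
        args.append(machine)
    sql += " ORDER BY logged_at"
    return db.execute(sql, args).fetchall()
```

A nightly summary for management is then a single `changes_since` call covering the previous 24 hours, formatted and mailed out.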
16 Solution: Change logging system
17 Tracking changes - automatically
- Mail from service monitoring script.
- Urgent warnings (e.g. machine down) gatewayed to cellphone using sms_client and a modem.
- LVS monitor logs incidents with timestamp, machine name, and type of problem.
- Mobile phone (SMS) message size very limited. Must be careful not to send too many messages, and to provide positive feedback - i.e. that the service/machine recovered.
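The two SMS concerns above, keeping the message count down and reporting recovery, can both be met by sending only on a change of state. The sketch below is an assumption about how that might look, not the team's script. `Notifier` and its `send` callable are hypothetical, and 160 characters is the size of a single standard GSM text message.

```python
SMS_LIMIT = 160  # characters in one standard GSM text message


class Notifier:
    """Send one message when a machine goes down and one when it recovers."""

    def __init__(self, send):
        self._send = send     # callable(text); e.g. a wrapper around an SMS gateway
        self._up = {}         # machine name -> last known state

    def report(self, machine, is_up, problem="", when=""):
        """Feed in every check result; returns the text sent, or None."""
        was_up = self._up.get(machine, True)   # unseen machines assumed healthy
        self._up[machine] = is_up
        if is_up == was_up:
            return None                        # no change: stay quiet
        if is_up:
            text = f"{when} {machine} RECOVERED"
        else:
            text = f"{when} {machine} DOWN: {problem}"
        text = text.strip()[:SMS_LIMIT]        # keep it to a single message
        self._send(text)
        return text
```

A machine that fails ten consecutive checks therefore produces one DOWN message, followed by one RECOVERED message when it comes back.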
18 Long term stats
- Daily log file analysis overnight (Calamaris, squidtimes, squidclients and our own code).
- Log file summaries - possible to usefully summarise 1GB down to 5MB!
- Dynamic monitoring of Ethernet traffic levels and Squid performance metrics via SNMP and MRTG/rrdtool. Stats can hang around forever.
- 30GB disk 200! Figure out what to monitor and keep historical stats. You won't regret it.
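As an illustration of how a large log reduces to a small summary, the sketch below condenses Squid's native access.log format into a handful of counters. It is not Calamaris or any of the team's scripts. It assumes the default native field order (timestamp, elapsed time, client, result code/status, bytes, method, URL, ...) and treats any result code containing "HIT" as a cache hit, which is an approximation.

```python
from collections import Counter


def summarise(lines, top_n=10):
    """Reduce Squid native access.log lines to a few summary figures."""
    requests = hits = total_bytes = hit_bytes = 0
    by_client = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue                              # skip truncated or malformed lines
        client, result, size = fields[2], fields[3], fields[4]
        try:
            nbytes = int(size)
        except ValueError:
            continue
        requests += 1
        total_bytes += nbytes
        by_client[client] += 1
        if "HIT" in result.split("/")[0]:         # TCP_HIT, TCP_MEM_HIT, ...
            hits += 1
            hit_bytes += nbytes
    return {
        "requests": requests,
        "bytes": total_bytes,
        "hit_rate": hits / requests if requests else 0.0,
        "byte_hit_rate": hit_bytes / total_bytes if total_bytes else 0.0,
        "top_clients": by_client.most_common(top_n),
    }
```

Passing an open file object as `lines` processes the log one line at a time, so memory use stays small however large the log is.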
19 WAP - Tomorrow Today :-)
- Phones buggy - easy to crash, which can require a trip to the service centre.
- Different vendors support different features, e.g. Nokia doesn't do tables.
- Screens far too small for detailed info.
- Space on "cards" very limited on some phones, e.g. Nokia is 1397 characters.
- But... very easy to create content for!
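20 WAP example - LVS stats
(WML source for an LVS status deck; most markup was lost in extraction, surviving fragments follow)
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">
Wed May 31 19:25:02 2000
babylon
kai
wilbur
... more cards ...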
21 WAP in practice - 1
22 WAP in practice - 2
23 WAP redux
- Phones use Wireless Markup Language (WML) instead of HTML. WML is very simple by comparison.
- One line tweak to Web server config required for serving WML documents.
- Easy to create WML automatically from monitoring scripts.
- Watch out for bugs and incompatibilities! Use Internet emulators to save on phone bill.
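Generating WML from a monitoring script can be as simple as filling in a template. The sketch below builds a one-card WML 1.1 deck listing server states. The card layout, function name and server names are assumptions rather than the deck actually served, and the 1397-character figure quoted earlier for Nokia phones is applied here to the whole generated document as a conservative limit.

```python
from xml.sax.saxutils import escape

WML_HEADER = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" '
    '"http://www.wapforum.org/DTD/wml_1.1.xml">\n'
)
DECK_LIMIT = 1397  # characters; assumed here to apply to the whole deck


def stats_deck(timestamp, servers, limit=DECK_LIMIT):
    """Build a single-card WML deck from (name, status) pairs.

    Rows that would push the deck past `limit` characters are dropped.
    """
    head = (
        WML_HEADER
        + '<wml>\n<card id="lvs" title="LVS stats">\n<p>\n'
        + escape(timestamp) + "<br/>\n"
    )
    tail = "</p>\n</card>\n</wml>\n"
    used = len(head) + len(tail)
    rows = []
    for name, status in servers:
        row = f"{escape(name)}: {escape(status)}<br/>"
        if used + len(row) + 1 > limit:        # +1 for the newline after the row
            break
        rows.append(row)
        used += len(row) + 1
    return head + "".join(row + "\n" for row in rows) + tail
```

The output should be served with the MIME type `text/vnd.wap.wml`. On Apache, for example, the one-line configuration change is typically `AddType text/vnd.wap.wml .wml`.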
24 Current and future developments
- Two way WAP control for common jobs, e.g. restart Squid, take a faulty disk out of service, reboot a machine.
- Failover of load balancers, so cluster survives death of primary load balancer.
- Mirror service integration, so that caches automatically find mirrored resources - e.g. from the UK Mirror Service.
- "Cluster digests", to give sites an accurate impression of JANET cache hit rates.
25 Closing thoughts...
- Much of this technology is truly new - didn't exist in 1997 when we started the JANET Web Cache Service.
- Perl and cron used extensively to glue other tools together.
- Most of the software used existed already, so it wasn't necessary to develop it from scratch.
- Don't be afraid to lead from the front - JANET cache team members have been very active in Web caching development internationally.
26 Useful links
- LVS - http://www.linuxvirtualserver.org/
- MRTG - http://www.mrtg.org/
- Perl - http://www.perl.org/
- ReiserFS - http://devlinux.com/namesys/
- L'boro change logging system - http://lanlord.lboro.ac.uk/martin/change/
- sms_client - http://www.styx.demon.co.uk/
- WAP emulator - http://www.gelon.net/