Title: Life of a Cell
1. Life of a Cell
2. The Conundrum
- Distribute -- on-line -- millions of pages of aircraft maintenance documentation in a system that the FAA requires to be foolproof
- No downtime
- All data identical for every mechanic worldwide. Always.
3. Business Risks
- An airplane cannot leave the gate if maintenance documentation is unavailable.
- An airplane stuck at the gate causes the airline to lose lots of money (system wide)
- Hasn't been done before
4. Business Drivers
- Faster access to documentation translates to millions of dollars a year in recovered revenue
- No such thing as "I did that yesterday, I'll just wing it" -- documents change daily
- New document is printed and carried aboard the aircraft (or you're busted)
- Search times and print times must be low
5. Business Drivers
- Consistency of documentation eliminates flip-flop maintenance costs
- I use procedure A and perform X
- Downline, old documents ... "Hey, who did that? But uh oh, I can fix it." Procedure B
- Downline, new documents, Procedure A ...
6. Business Drivers
- Safety
- An incident involving a fatality drops ticket sales by 50% for two weeks.
- If the incident cannot be explained, ticket sales remain off until it is
- US Airways 737 (1994?), Pittsburgh, almost put the airline out of business
- Airline people really do care about the people they're responsible for
7. The Plan
- Be the first airline to gain competitive advantage by going to 100% online documentation
- Retire microfilm/microfiche completely
- Don't lose shirt
8. The Technologies
- Excalibur Technologies EFS (Electronic File System)
- Transarc AFS 3.3
- HP servers
- Bunch o' stuff to convert manuals to TIF
- Windows 3.1 target user platform
9. The Process
- Scan microfiche/film manual pages to TIF
- EFS: OCR the TIFs
- AFS: store the TIF pages
- EFS: index the TIFs (OCR output), keyword indexes
- AFS: store the index
- AFS: replicate to strategically placed fileservers (sketched below)
- Mechanics and engineers:
- Click on the index icon (file cabinet)
- Keyword search
- EFS client on the Windows 3.1 desktop requests data from an EFS server running on an AFS fileserver
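A minimal sketch of the AFS side of that pipeline. The cell name (airline.com), server names (afs01, afs-ord, afs-nrt), partition, and volume naming convention are invented for illustration; the EFS scanning, OCR, and indexing steps are product-specific and omitted.

    # Create an RW volume for one manual chapter and mount it in the tree
    vos create afs01 /vicepa docs.737.amm.ch21
    fs mkmount /afs/airline.com/docs/737/amm/ch21 docs.737.amm.ch21

    # Drop the scanned TIF pages (and later the EFS index) into the volume
    cp ch21/*.tif /afs/airline.com/docs/737/amm/ch21/

    # Define RO replica sites on the strategically placed fileservers ...
    vos addsite afs01 /vicepa docs.737.amm.ch21
    vos addsite afs-ord /vicepa docs.737.amm.ch21
    vos addsite afs-nrt /vicepa docs.737.amm.ch21

    # ... and push identical copies to every site
    vos release docs.737.amm.ch21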
10. World-wide airline, world-wide cell
- Fileserver locations decided by:
- Location on the corporate backbone
- Connectivity from other linestations (smaller airports)
- Number of linestations that can be served from the location
- Paranoia (over-designed by 2x)
11. Domestic Fileserver Locations
12. End User Workstations
- Every hangar -- many per dock
- Every gate 2x, independent LANs
- Every engineering department
- Facilities for support of in-air aircraft
- (World wide)
13. AFS Client Locations
- Minimal
- No supported Windows 3.1 AFS client
- EFS client requests data from AFS client
14. Number of users
- 40,000 human users
- "I forgot my password" puts the airline out of business
- 1,500 workstations -- the workstation hostname is the "user" and is written on the front of the workstation
15. Woes and Wins
- Network: shoving data into your LAN
- Replication management
- Who is authorized
- You want me to release how many volumes?
- vos release times
- FAA: the system will not go down! All replicas will be identical
- Let's use a really big cache for Seattle!
16. Woe: Network
- How to get 300-600 GB of data to the fileservers for the initial load of the ROs
- Slow links to small airports
- Slow links to international server locations
- Fast links heavily trafficked
- vos release can beat the daylights out of a network
- An airline is always in operation -- no magic window of opportunity
17. Win: Network
- Can't use vos release
- Hey, we have lots of those airplane things
- Load a local (SFO) fileserver array with disks, set up the vice partitions
- vos addsite to the fileserver/array, vos release
- vgexport -- the OS says bye to the volume groups
- vos remsite, remove the drives
- Fly to wherever: vgimport, vos addsite / vos release. Rio, anyone? (sketched below)
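A rough outline of the disk-shipping trick, assuming HP-UX LVM and hypothetical server, volume-group, and device names; exact vgexport/vgimport arguments (and whether a vos syncvldb was also needed) varied with the OS and AFS release.

    # At SFO: seed the portable array through a local fileserver, all on a fast LAN
    vos addsite sfo-fs1 /vicepd docs.737.amm.ch21
    vos release docs.737.amm.ch21

    # Detach the array: drop the RO site, deactivate and export the volume group
    vos remsite sfo-fs1 /vicepd docs.737.amm.ch21
    umount /vicepd
    vgchange -a n /dev/vg_vicepd
    vgexport -m /tmp/vg_vicepd.map /dev/vg_vicepd     # OS says bye to the VG

    # Pull the drives and fly them to the destination (Rio, anyone?)

    # At the remote fileserver: import and activate the volume group
    mkdir /dev/vg_vicepd && mknod /dev/vg_vicepd/group c 64 0x020000   # minor number illustrative
    vgimport -m /tmp/vg_vicepd.map /dev/vg_vicepd /dev/dsk/c2t0d0
    vgchange -a y /dev/vg_vicepd
    mount /dev/vg_vicepd/lvol1 /vicepd

    # Re-create the RO site on the remote server and release; the bulk data is
    # already on the shipped partition, so 300-600 GB never crosses the WAN
    vos addsite rio-fs1 /vicepd docs.737.amm.ch21
    vos release docs.737.amm.ch21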
18. Woes: Replication Management
- 15,000 RW volumes, all replicated
- Who's authorized to issue vos release?
- Which volumes to release? EFS randomly places data ...
- How many volumes did you say to release?
19. Win: Replication Management
- Authorization/automation
- Per-fleet, per-manual "vosrel" PTS group
- PTS group on every relevant volume root node
- User interface writes a record to a work queue, a file in /afs
- Requester, manual/index, priority
- Fileserver cron job compares the requester with the vosrel PTS group, figures out the volume list, performs vos release -localauth (sketched below)
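A sketch of what that fileserver cron job might look like. The queue path, the "requester manual priority" record format, the vosrel.<manual> group naming, and the docs.<manual>.* volume naming are all assumptions made up for illustration.

    #!/bin/sh
    # Hypothetical release manager, run from cron on a fileserver.
    QUEUE=/afs/.airline.com/admin/vosrel/queue   # work-queue file written by the UI

    while read requester manual priority; do     # priority handling omitted here
        group="vosrel.$manual"                   # per-fleet, per-manual PTS group

        # Authorization: the requester must belong to the manual's vosrel group
        pts membership "$group" 2>/dev/null | grep -w "$requester" >/dev/null || continue

        # Well-known naming convention: every volume for this manual matches docs.<manual>.*
        vos listvldb 2>/dev/null | awk '{print $1}' | grep "^docs\.$manual\." |
        while read vol; do
            vos release "$vol" -localauth        # server key; no user tokens needed
        done
    done < "$QUEUE"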
20. Woe: Replication Management
- Which volumes to release?
- Well-known volume tree and consistent naming conventions
- Release all volumes for the requested manual
- Who cares, really? How many can there be?
- Sometimes 4,000 volumes per night
- vos release is slowish -- it doesn't check to see whether a volume is unchanged, it looks at contents
- Release cycle > 24 hours, queue issue. OW!
21. Win: Replication Management
- Filter the release requests (sketched below)
- Compare RO dates with RW dates -- if the RW has not changed and all ROs have the same date, skip it
- Filter: 3 seconds
- vos release no-op: 30 seconds
- A small fraction of the volumes for a given manual are actually changed
- Sometimes 0 changed, sometimes < 1%, usually a small fraction of the total
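A sketch of the date filter, assuming the volume list for a manual has already been computed as above. It compares the "Last Update" line that vos examine reports for the RW volume against the one reported for its .readonly; the real filter also checked every RO site, which is omitted here for brevity.

    #!/bin/sh
    # Skip vos release for volumes whose RW copy has not changed since
    # the RO replicas were last released.
    needs_release() {
        rw=`vos examine "$1" | grep 'Last Update'`
        ro=`vos examine "$1.readonly" | grep 'Last Update'`
        [ "$rw" != "$ro" ]
    }

    while read vol; do
        if needs_release "$vol"; then
            vos release "$vol" -localauth    # ~30 s even as a no-op, hence the filter
        fi
    done < /tmp/manual.volumes               # hypothetical list from the naming convention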
22. Woe: FAA -- the system will not fail!!
- FAA requires 100% uptime, else it won't approve the system and the airline can go fish
- Yeah, right!
23. Win: FAA -- the system will not fail!!
- Data outage vs. system outage
- Replication, of course
- Multiple configurations for EFS client
- Crude failover
- No data outage for six years and counting
- Well, there were a couple of times when ... but we fixed that ...
24. Woe: FAA -- replicas will be identical
- Several million RW files x 5 replicas
- Have to prove that all files are identical across the 5 ROs for a given volume
25. Win: FAA -- replicas will be identical
- Tree crawler! (sketched below)
- A little cheesy: ls -l and cksum each directory in the volume and compare results
- Known bad case looked for 6x per day
- Key: fs setserverprefs -- "I prefer you, now you, now you, now you"
- Dedicated client, no mounted .backups
26. Woe: Let's use a really big cache
- It seemed like a really good idea
- 20 files changed per quarter -- < 2/week
- Average file size 10K
- Oops, the indexes are monolithic and 300 MB ... but they don't change often
- Let's try a 12 GB cache!
- "Hello? I've got twenty minutes to turn the shuttle. It takes fifteen minutes to ..."
27. Win: Let's not use a really big cache
- AFS client (still, I believe?) chokes on a large cache
- 12 GB = 1,200,000 cache V-files
- At garbage-collection time, the cache purge looks for LRU entries
- Gee, that takes a long time. Is the machine dead?
- Let's try a 3 GB cache! (see the sketch below)
- (Worked indefinitely from 3.3 through 3.6)
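For reference, the Unix client's cache size lives in the cacheinfo file (or can be overridden on the afsd command line). A sketch using the 3 GB figure; the -files and -stat values are illustrative only.

    # /usr/vice/etc/cacheinfo  --  <AFS mount point>:<cache directory>:<cache size in 1 KB blocks>
    /afs:/usr/vice/cache:3000000

    # Roughly equivalent afsd invocation; capping -files keeps the number of
    # cache V-files (and therefore garbage-collection time) manageable
    afsd -blocks 3000000 -files 50000 -stat 3000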
28. Other smidgeons
- vos release manager (sketched after this list)
- Does volume need to be released?
- Are all the relevant fileservers available?
- Is there a sync site for the VLDB?
- Do it
- Did it?
- Check VLDB entry
- Compare dates
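Sketches of the pre-flight and post-release checks, with hypothetical server and volume names; the "does it need to be released" test is the date filter shown earlier.

    #!/bin/sh
    # Pre-flight: every replica fileserver must answer, and the VLDB
    # servers must have a working quorum, before issuing the release.
    for s in afs01 afs-ord afs-nrt; do
        rxdebug $s 7000 -version >/dev/null 2>&1 || { echo "$s not answering"; exit 1; }
    done
    udebug afs-db1 7003        # inspect ubik state; a vlserver sync site must exist

    vos release docs.737.amm.ch21 -localauth

    # Did it?  The VLDB entry should show every RO site as current
    # (no "Old release" / "Not released" flags), and the dates should agree.
    vos listvldb -name docs.737.amm.ch21
    vos examine docs.737.amm.ch21.readonly | grep 'Last Update'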
29. Other smidgeons
- Data reasonableness checks (sketched below)
- Do the files pointed to by an index actually exist?
- If not, do not vos release the index
- Avoids the data outage of an empty index, for example (bad day)
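The index format was EFS-internal, so this shows only the shape of the check, assuming a manifest of referenced page paths can be extracted from the new index; the manifest path and index volume name are invented.

    #!/bin/sh
    # Refuse to release the index volume if any page it references is missing.
    MANIFEST=/tmp/index.pages      # hypothetical: page paths extracted from the new index

    missing=0
    while read page; do
        [ -f "$page" ] || { echo "missing: $page"; missing=1; }
    done < "$MANIFEST"

    if [ $missing -eq 0 ]; then
        vos release docs.737.amm.index -localauth
    else
        echo "index references missing pages; release withheld" >&2
        exit 1
    fi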
30. Other smidgeons
- popcache (sketched below)
- Index files are monolithic and large
- Fileservers overseas, slow networks
- Initial search of a newly released index could take many minutes
- cat the indexes to /dev/null every five minutes
- If an index is unchanged, the locally cached copy is used
- If an index changed, it is pulled from the fileserver and the user doesn't pay the penalty for the first search
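A sketch of popcache, run from cron every five minutes on the relevant client machines; the path pattern is hypothetical.

    #!/bin/sh
    # popcache: keep the large, monolithic index files warm in the AFS cache.
    # If an index is unchanged, the reads come from the local cache; if it has
    # just been released, this pass pays the slow WAN fetch so the first
    # user search does not.
    for idx in /afs/airline.com/docs/*/index/*; do
        cat "$idx" > /dev/null 2>&1
    done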
31. Other smidgeons
- Anyone here ever have these?
- "AFS is complaining about the network, so AFS broke the network"
- AFS is the network's canary in a cage
- "We could do the whole thing with NFS!"
- "AFS isn't POSIX compliant. Yay DFS!"
- "A file lock resides on disk. A file in an RO volume can't be locked." (Oh yes it can.)
- HP T500 goes to sleep?
- "We could do the whole thing on a Kenmore!"
32. Outcome: AFS Rules
- The airline became the first airline (and may still be the only one) to place 100% of its aircraft maintenance documentation on line
- The system has run reliably for 5 years
- So of course it's time to replace it
- There are three server locations in the US, and one each in Europe, Hong Kong, Narita, Sydney, Montevideo, and Rio de Janeiro
- Mechanics no longer mash the microfilm reader
- This system was enabled by AFS