Life of a Cell
Woes and Wins
Dexter "Kim" Kimball ([email protected])
The Conundrum
Distribute -- on-line -- millions of pages of aircraft maintenance documentation in a system that the FAA requires to be foolproof:
– No downtime
– All data identical for every mechanic worldwide. “Always”
Business Risks
An airplane cannot leave the gate if maintenance documentation is unavailable.
An airplane stuck at the gate causes the airline to lose a lot of money (system-wide)
Hasn’t been done before
Business Drivers
Faster access to documentation translates to millions of dollars a year in recovered revenue
– No such thing as “I did that yesterday, I’ll just wing it” – documents change daily
– New document is printed and carried aboard the aircraft (or you’re busted)
– Search times and print times must be low
Business Drivers
Consistency of documentation eliminates “flip-flop” maintenance costs
– I use procedure A and perform X
– Downline – old documents ... “Hey, who did that? But uh oh, I can fix it.” Procedure B
– Downline – new documents, Procedure A ....
Business Drivers
• Safety
– An incident involving a fatality drops ticket sales by 50% for two weeks.
– If the incident cannot be explained, ticket sales remain off until it is.
– US Airways 737 (1994?), Pittsburgh, almost put the airline out of business
– Airline people really do care about the people they’re responsible for
The Plan
Be the first airline to gain competitive advantage by going to 100% online documentation
Retire microfilm/microfiche completely
Don’t lose shirt
The Technologies
• Excalibur Technologies “EFS” (Electronic File System)
• Transarc AFS 3.3
• HP Servers
• Bunch’o’stuff to convert manuals to TIF
• Windows 3.1 target user platform
The Process
Scan microfiche/film manual pages to TIF
• EFS: OCR the TIFs
• AFS: Store TIF pages
• EFS: Index TIFs (OCR output), keyword indexes
• AFS: Store the index
• AFS: Replicate to strategically placed fileservers
• Mechanics and engineers:
– Click on index icon (file cabinet)
– Keyword search
– EFS client on Windows 3.1 desktop requests data from EFS server running on AFS fileserver
Worldwide airline, worldwide cell
• Fileserver locations decided by
– Location on corporate backbone
– Connectivity from other linestations (smaller airports)
– Number of linestations that can be served from the location
– Paranoia (over-designed by 2x)
Domestic Fileserver Locations
BOI
PIT
IAD
BWI
MIA
IAH
IND.181130
189
96
75
373
nLarge location (> 50 workstations);
Fileserver location. n is totalnumber of workstations in region.
Medium location (8-workstations); AFSclient only. No local fileservers.
Small location (< 8 workstations);AFS client only. No local fileservers.
Basic U.S. map with airport codes courtesy of Roger Blundell
AFS Fileserver Locations and their FileserviceRegions
End User Workstations
• Every hangar -- many per “dock”
• Every gate – 2x, independent LANs
• Every engineering department
• Facilities for support of in-air aircraft
(World wide)
AFS Client Locations
• Minimal
– No supported Windows 3.1 AFS client
– EFS client requests data from AFS client
Number of users
• 40,000 human users
– “I forgot my password” puts the airline out of business
• 1,500 workstations – the workstation hostname is the “user” and is written on the front of the workstation
Woes and Wins
• Network – shoving data into your LAN
• Replication management
– Who is authorized?
– You want me to release how many volumes?
– vos release times
• FAA – the system will not go down! All replicas will be identical
• Let’s use a really big cache for Seattle!
Woe: Network
How to get 300–600 GB of data to the fileservers for the initial load of ROs
– Slow links to small airports
– Slow links to international server locations
– Fast links heavily trafficked
– vos release can beat the * out of a network
– An airline is always in operation – no magic window of opportunity
Win: Network
• Can’t use vos release
• Hey, we have lots of those airplane things
– Load local (SFO) fileserver array with disks, set up vicep’s
– vos addsite to fileserver/array; vos release
– vgexport – OS says bye to volume groups
– vos remsite; remove drives
– Fly to wherever; vgimport, vos addsite / vos release. Rio, anyone?
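The disk-shipping sequence above can be sketched as a dry-run command plan. This is an illustrative reconstruction, not the airline's script: the server names, partition, and volume-group name are invented, and the function only returns the commands in the slide's order rather than executing them.

```python
def sneakernet_plan(volume, array_server, dest_server,
                    partition="/vicepa", vg="vg_afs"):
    """Ordered commands for seeding a remote RO site by shipping disks.

    Follows the slide's sequence: stage the replica on a local array
    over the fast LAN, export the volume group so HP-UX releases the
    drives, fly them out, then import and re-add the site remotely.
    """
    return [
        f"vos addsite {array_server} {partition} {volume}",  # temporary RO site on the array
        f"vos release {volume}",                             # fill it over the local LAN
        f"vgexport {vg}",                                    # OS says bye to the volume group
        f"vos remsite {array_server} {partition} {volume}",  # drop staging site; pull drives
        f"vgimport {vg}",                                    # at the destination, after the flight
        f"vos addsite {dest_server} {partition} {volume}",   # register the real replica site
        f"vos release {volume}",                             # catch-up release, not a full copy
    ]
```

For example, `sneakernet_plan("manuals.737", "sfo-fs1", "rio-fs1")` yields the seven commands from staging at SFO through the final release in Rio.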
Woes: Replication Management
15000 RW volumes, all replicated
• Who’s authorized to issue vos release?
• Which volumes to release? EFS randomly places data ...
• How many volumes did you say to release?
Win: Replication Management
• Authorization/automation
– Per-fleet, per-manual vosrel PTS group
– PTS group on every relevant volume root node
– User interface writes a record to a work queue, a file in /afs
• Requester; manual/index; priority
– Fileserver cron job compares the requester with the vosrel PTS group, figures out the volume list, performs vos release -localauth
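The authorization step of that cron job can be sketched as a pure function. The group-naming scheme (`vosrel.<manual>`) and the record layout are assumptions for illustration; the real job would consult `pts membership` output and then run `vos release -localauth`.

```python
def authorized_releases(queue_records, pts_members):
    """Filter queued release requests against per-manual vosrel PTS groups.

    queue_records: (requester, manual, priority) tuples read from the
        work-queue file in /afs; lower priority number = more urgent.
    pts_members: mapping like {"vosrel.737": {"kim", ...}} -- the kind
        of data `pts membership` reports on the fileserver.
    Returns the manuals to release, most urgent first; requests whose
    requester is not in the matching group are silently dropped.
    """
    ok = [(priority, manual)
          for requester, manual, priority in queue_records
          if requester in pts_members.get(f"vosrel.{manual}", set())]
    return [manual for _, manual in sorted(ok)]
```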
Woe: Replication Management
• Which volumes to release?
– Well-known volume tree and consistent naming conventions
– Release all volumes for the requested manual
– Who cares, really? How many can there be?
• Sometimes 4000+ volumes per night
• vos release is slowish – doesn’t check to see if the volume is unchanged; looks at contents
• Release cycle > 24 hours, queue issue. OW!
Win: Replication Management
• Filter release requests
– Compare RO dates, RW dates – if RW not changed and all ROs have the same date, skip it
• Filter: 3 seconds
• vos release “no op”: 30 seconds
– A small fraction of the volumes for a given manual are actually changed
• Sometimes 0 changed; sometimes < 1%; usually a small fraction of the total
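The filter's decision reduces to a comparison over volume update dates (the kind `vos examine` reports). This is a reconstruction of the rule stated above, not the original script:

```python
def needs_release(rw_date, ro_dates):
    """Decide whether a volume is worth a vos release.

    rw_date:  last-update time of the RW volume.
    ro_dates: last-update times of its RO replicas.
    Skip the release only when the RW is unchanged and every RO
    carries the same date; anything else gets released.
    """
    if not ro_dates:
        return True                       # never replicated yet
    if any(d != ro_dates[0] for d in ro_dates):
        return True                       # replicas out of step
    return rw_date > ro_dates[0]          # RW changed since last release
```

A 3-second check like this, against a 30-second no-op `vos release`, is what pulled the nightly cycle back under 24 hours.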
Woe: FAA – the system will not fail!!
• FAA requires 100% uptime, else won’t approve system and airline can go fish
• Yeah, right!
Win: FAA – the system will not fail!!
• Data outage vs. system outage
• Replication, of course
• Multiple configurations for EFS client
– Crude failover
• No data outage for six years and counting
– Well, there were a couple of times when ... but we fixed that ...
Woe: FAA – replicas will be identical
• Several million RW files X 5 replicas
• Have to prove that all files are identical across the 5 ROs for a given volume
Win: FAA – replicas will be identical
• Tree crawler!
• A little cheesy – “ls -l | cksum” each directory in the volume and compare results
• Known “bad case” looked for 6x per day
• Key: “fs setserverprefs” – I prefer you, now you, now you, now you
• Dedicated client, no mounted .backups
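A sketch of the crawler's comparison step. The checksum collection itself (the `ls -l | cksum` walk, repeated against each server by adjusting `fs setserverprefs` between passes) is abstracted into an injected function so only the comparison logic is shown; all names are illustrative.

```python
def replicas_identical(servers, cksum_of):
    """Compare per-directory checksums across a volume's RO sites.

    cksum_of: server -> {directory: checksum}, i.e. the result of one
        `ls -l | cksum` crawl read through that server.
    Returns the directories whose checksums disagree between any two
    replicas; an empty list means the replicas look identical.
    """
    per_server = {s: cksum_of(s) for s in servers}
    baseline = per_server[servers[0]]
    suspect = set()
    for s in servers[1:]:
        other = per_server[s]
        # A mismatch is a differing checksum, or a directory present
        # on one replica and missing from another.
        for d in set(baseline) | set(other):
            if baseline.get(d) != other.get(d):
                suspect.add(d)
    return sorted(suspect)
```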
Woe: Let’s use a really big cache
• It seemed like a really good idea
– 20% of files changed per quarter – < 2% per week
– Average file size 10K
– Oops, the indexes are monolithic and 300 MB ... but don’t change often
– Let’s try a 12 GB cache!
• “Hello? I’ve got twenty minutes to turn the shuttle. It takes fifteen minutes to ...”
Win: Let’s not use a really big cache
• AFS client (still, I believe?) chokes on a large cache
– 12 GB ≈ 1,200,000 cache “Vfiles”
– At garbage-collection time, the cache purge looks for LRU entries
– Gee, that takes a long time. Is the machine dead?
– Let’s try a 3 GB cache!
• (Worked indefinitely from 3.3 through 3.6)
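The 12 GB ≈ 1.2 million V-file figure follows from the ~10 KB average file size quoted on the cache slide; here is a one-liner to check the arithmetic (treating one V-file per ~10 KB file is an assumption drawn from that average):

```python
def cache_vfiles(cache_bytes, avg_file_bytes=10 * 1024):
    """Roughly how many cache V-files the garbage collector must scan,
    assuming one V-file per ~10 KB cached file on average."""
    return cache_bytes // avg_file_bytes
```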
Other smidgeons
• vos release manager
– Does the volume need to be released?
– Are all the relevant fileservers available?
– Is there a sync site for the VLDB?
– Do it
– Did it?
• Check VLDB entry
• Compare dates
Other smidgeons
• Data reasonableness checks
– Do the files pointed to by an index actually exist?
– If not, do not vos rel the index
– Avoids the data outage of an “empty index” – for example *(bad day)*
Other smidgeons
• popcache
– Index files: monolithic and large
– Fileservers: overseas, slow networks
– Initial search of a newly released index could take many minutes
– Cat indexes to /dev/null every five minutes
• If the index is unchanged, the local cached copy is used
• If the index changed, it is pulled from the fileserver and the user doesn’t pay the penalty for the first search
Other smidgeons
• Anyone here ever have these?
– AFS is complaining about the network, so AFS broke the network
• AFS is the network’s canary in a cage
– We could do the whole thing with NFS!
– AFS isn’t POSIX compliant. Yay DFS!
– A file lock resides on disk. A file in an RO volume can’t be locked. (Oh yes it can.)
– HP T500 goes to sleep?
– We could do the whole thing on a Kenmore!
Outcome: AFS Rules
• The airline became the first (and may still be the only) airline to place 100% of its aircraft maintenance documentation on line
• The system has run reliably for 5+ years
• So of course it’s time to replace it
• There are three server locations in the US, and one each in Europe, Hong Kong, Narita, Sydney, Montevideo, and Rio de Janeiro
• Mechanics no longer mash the microfilm reader
This system was enabled by AFS