Cloud Computing and Architecture
Architectural Tactics
(Tonight’s guest star: Availability)
Quality framework (Bass et al.)
• Central quality attributes
– Availability
– Interoperability
– Modifiability
– Performance
– Security
– Testability
– Usability
• Other qualities
– Portability
– Scalability
– Variability
– Flexibility
– Cost
– Time to market
– …
Strongly recommended reading!
A Writing Template
· Source of stimulus. This is some entity (a human, a computer system, or any other actuator) that generated the stimulus.
· Stimulus. The stimulus is a condition that needs to be considered when it arrives at a system.
· Environment. The stimulus occurs within certain conditions. The system may be in an overload condition or may be running when the stimulus occurs, or some other condition may be true.
· Artifact. Some artifact is stimulated. This may be the whole system or some pieces of it.
· Response. The response is the activity undertaken after the arrival of the stimulus.
· Response measure. When the response occurs, it should be measurable in some fashion so that the requirement can be tested.
Example: World of Warcraft
CS@AU Henrik Bærbak Christensen 4
Example: SkyCave
Quality attribute: Availability
Source: Internal to the system
Stimulus: A crash
Artifact: Database server
Environment: Normal operation
Response: Detects the event, records it in the log, continues in normal operation
Response measure: Within 3 seconds
Quality attribute: Performance
Source: 1000 independent clients
Stimulus: Generate on average 2 character events per second
Artifact: SkyCave App server
Environment: Normal operation
Response: Events are processed, cave state is updated
Response measure: With a maximum latency of 5 seconds
Tactic
• Tactic
– A design decision that influences the achievement of a quality attribute response
• Example of modifiability tactic:
– Encapsulate: Introduce explicit interface to module
CloudArch Core Focus
Discussion
• If a system is not available, what is the point of all other QAs?
• Security?
– Equals slowness
• System quality attributes
– Availability
– Modifiability
– Performance
– Security
– Testability
– Usability
– Interoperability
– Scalability
Availability
Definition(s)
• Availability (1): Property of software that it is there and ready to carry out its task when you need it to be
• Availability (2): Ability of a system to mask or repair faults such that the cumulative service outage period does not exceed a required value over a specified time interval
Nygard Stability (resilience, longevity): Ability to keep processing for a long time even when there are transient impulses, persistent stresses, or component failures
Measurements
• MTBF: Mean time between failures
• MTTR: Mean time to repair
• But often we talk in percentages!
– 99%: 3d 15h downtime per year
– 99.9%: 8h 45m
– 99.99%: 52m
– 99.9999%: 32 seconds (!)
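The percentages and the MTBF/MTTR figures are related by simple arithmetic. A commonly used steady-state formula is availability = MTBF / (MTBF + MTTR), and the yearly downtime budget is (1 - availability) times one year. The sketch below assumes a 365-day year; the function names are illustrative and not taken from the slides.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: a non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year(availability_fraction: float) -> timedelta:
    """Cumulative outage allowed per year at the given availability."""
    return YEAR * (1.0 - availability_fraction)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.9999):
        print(f"{pct}% -> {downtime_per_year(pct / 100)}")
    # A system that fails on average every 999 hours and takes 1 hour to repair:
    print(f"availability = {availability(999, 1):.3%}")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the step from 99.99% to 99.9999% leaves only about half a minute per year.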
Tactics
• Lots of techs!
Tactics
• Categories
– Fault detection
– Recovery
  • Preparation + Repair
  • Reintroduction
– Prevention
Detection
• Ping-echo
• Monitor: Nagios, Zabbix, …
• Exceptions
– Time out
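Ping-echo and time-out detection share one idea: a component is presumed failed when no sign of life has arrived within a deadline. The following is a minimal in-process sketch of that idea, not how Nagios or Zabbix are implemented. The class `HeartbeatMonitor`, its methods, and the injectable clock are all hypothetical names chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Flags a component as failed when no echo arrives within the timeout."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_echo = {}         # component name -> time of last echo

    def record_echo(self, component: str) -> None:
        """Call whenever a component answers a ping (or sends a heartbeat)."""
        self._last_echo[component] = self._clock()

    def failed_components(self) -> list[str]:
        """Return the components whose last echo is older than the timeout."""
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_echo.items()
            if now - seen > self.timeout_seconds
        )
```

A monitor created with `HeartbeatMonitor(timeout_seconds=3.0)` would correspond to a detection deadline like the "within 3 seconds" response measure in the SkyCave availability scenario; the timeout chosen bounds how quickly a fault can be noticed.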
Recover: Prep and Repair
• Active redundancy (hot standby)
– All receive and process all events
  • Millisecond failover
• Passive redundancy (warm standby)
– Master-slave
  • Minute failover
• Spare (cold standby)
– "I think we have an extra machine in the cellar"
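To illustrate why active redundancy fails over so quickly, here is a toy sketch in which every live replica processes every event, so any survivor already holds the current state. The classes `Replica` and `ActiveRedundancy` are hypothetical, and the "state" is just a running sum; a real system would also need fault detection and consistent event ordering.

```python
class Replica:
    """A toy replica whose state is the sum of all events it has processed."""

    def __init__(self, name: str):
        self.name = name
        self.state = 0
        self.alive = True

    def process(self, event: int) -> None:
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.state += event


class ActiveRedundancy:
    """Hot standby: all live replicas receive and process all events."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, event: int) -> None:
        for replica in self.replicas:
            if replica.alive:
                replica.process(event)

    def read(self) -> int:
        # Failover is just "ask the next live replica"; no catch-up is needed.
        for replica in self.replicas:
            if replica.alive:
                return replica.state
        raise RuntimeError("no live replica")
```

With passive redundancy or a cold spare, the standby would first have to load or replay state before it could answer, which is where the longer failover times come from.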
Recover: Prep and Repair
• Exceptions
• Rollback
– Used in DB and [exercise: where else?]
– Checkpointing
• Retry
• Degradation
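Retry and degradation are often combined: try the operation a few times in case the fault is transient, and if it keeps failing, return a reduced but still useful answer. A minimal sketch follows; the helper name `call_with_retry` and its parameters are assumptions for illustration, and catching every `Exception` is a simplification that real code would narrow to the failures it expects.

```python
import time


def call_with_retry(operation, fallback, attempts=3, delay_seconds=0.1):
    """Call operation up to `attempts` times; if all fail, degrade to fallback()."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # simplification: treat any exception as a transient fault
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # brief pause before the next try
    # All attempts failed: degrade gracefully instead of propagating the error.
    return fallback()
```

For example, `call_with_retry(fetch_live_data, lambda: cached_data)` would serve possibly stale cached data when the live source stays down, where `fetch_live_data` and `cached_data` stand in for whatever the application provides.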
Which Nygard patterns?
Recover: Reintroduction
• Shadow
– Run in shadow mode until ‘up-to-speed’
• State resync
– Typical DB behaviour
  • Cold slaves must catch up with primary
– EcoSense DB war story: stale DB
Preventing
• Removal from service
– ‘Scrubbing’
– It used to be that the Tomcat server would respawn every 12 hours
  • Easiest way to fix the numerous memory leaks!
• Transactions
– ACID guarantees
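As a small illustration of the atomicity part of ACID, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The `account` table and the helper names `make_db`, `transfer`, and `balance` are made up for this example. A transfer that would overdraw the source account violates a CHECK constraint, and the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:  # commit the initial rows
        conn.executemany(
            "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)]
        )
    return conn


def balance(conn: sqlite3.Connection, name: str) -> int:
    row = conn.execute(
        "SELECT balance FROM account WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move amount from src to dst atomically; return False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(  # fails the CHECK constraint if src would go negative
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The credit is deliberately applied before the debit so that a failed transfer shows the rollback undoing work that had already succeeded; the database never exposes the half-finished state.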
Summary
• All things bad can and will happen to real systems having real users operating in the real world!
• Your systems should strive for high availability and graceful degradation
– If you want to keep your customers!
• The architectural tool box is big!