Modern Distributed Systems Design
– Security and High Availability
1. Measuring Availability
2. Highly Available Data Management
3. Redundant System Design
Measuring Availability
• How are resiliency and high availability interconnected?
• What is downtime, and what causes it?
• How to measure availability?
Measuring Availability
Percentage Uptime   Percentage Downtime   Downtime per year   Downtime per week
98%                 2%                    7.3 days            3h22m
99%                 1%                    3.65 days           1h41m
99.8%               0.2%                  17h30m              20m10s
99.9%               0.1%                  8h45m               10m5s
99.99%              0.01%                 52.5m               1m
99.999%             0.001%                5.25m               6s
99.9999%            0.0001%               31.5s               0.6s
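The figures in the table follow from simple arithmetic: downtime = period × (100% − uptime%). A minimal Python sketch, assuming a 365-day year; the function name `downtime` and the constants `YEAR` and `WEEK` are illustrative, not part of the original material:

```python
from datetime import timedelta

def downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed within `period` at the given uptime percentage."""
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return timedelta(seconds=period.total_seconds() * downtime_fraction)

YEAR = timedelta(days=365)  # assumption: non-leap year
WEEK = timedelta(days=7)

for pct in (98, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime: {downtime(pct, YEAR)} per year, {downtime(pct, WEEK)} per week")
```

The results can differ from the table by a few seconds or minutes, because the table rounds its values.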
Define Downtime
• Downtime can be defined as follows: “If a user cannot get his job done on time, the system is down”
What causes downtime?
• Planned – the easiest to reduce; includes scheduled system maintenance, hot-swappable hard drives, cluster upgrades and even failovers. Usually 30% of all downtime;
• People (the human factor) – careless mistakes; increasingly complex IT equipment, software and protocols also demand greater knowledge from engineers. Usually 15% of all downtime;
• Software failures – caused by software bugs and viruses. Usually 40% of all downtime.
How to measure availability?
Availability = MTBF / (MTBF + MTTR), where
MTBF – “mean time between failures” and MTTR – “mean time to repair”
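The ratio of MTBF to MTBF plus MTTR can be tried out in a few lines of Python. This is only a sketch: the function name `availability` and the figures of 1000 hours MTBF and 1 hour MTTR are hypothetical, chosen for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical system: fails on average every 1000 hours, takes 1 hour to repair.
baseline = availability(mtbf_hours=1000, mttr_hours=1)
print(f"{baseline:.4%}")  # prints 99.9001%

# Halving the repair time raises availability without touching MTBF.
faster_repair = availability(mtbf_hours=1000, mttr_hours=0.5)
print(faster_repair > baseline)  # prints True
```

The second call illustrates why repair time matters: availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR).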
What can go wrong?
• Hardware
• Environmental and Physical Failures
• Network Failures
• Database System Failures
• Web Server Failures
• File and Print Server Failures
The Cost of Downtime
Industry         Business Operation               Average Downtime Cost per Hour
Financial        Brokerage Operation              $6.45M
Financial        Credit Card/Sales Authorization  $2.6M
Media            Pay per view TV                  $150K
Retail           Catalog sales                    $90K-$115K
Transportation   Airlines                         $89.5K
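Combining an hourly cost with an uptime percentage gives a rough yearly exposure. The sketch below is illustrative arithmetic only: it assumes a 365-day year and assumes downtime costs the same at any hour, and the function name `annual_downtime_cost` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def annual_downtime_cost(uptime_percent: float, cost_per_hour: float) -> float:
    """Expected yearly downtime cost at a given uptime percentage."""
    downtime_hours = HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0
    return downtime_hours * cost_per_hour

# Credit card/sales authorization at $2.6M per hour:
for pct in (99.0, 99.9, 99.99):
    cost = annual_downtime_cost(pct, 2_600_000)
    print(f"{pct}% uptime: about ${cost:,.0f} per year")
```

Under these assumptions, each additional "nine" of uptime cuts the yearly cost by a factor of ten.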
Levels of Availability:
1. Regular Availability
2. Increased Availability
3. High Availability
4. Disaster Recovery
5. Fault-Tolerant System
Highly Available Data Management
• Data management is the most sensitive area of modern distributed systems.
• Quick overview of existing data topologies
Redundant System Design
• Redundant storage (RAID, multi-hosting, multi-pathing, disk arrays, JBOD, etc.)
• Failover Configurations and Management
• Introduction to SAN and Fibre Channel protocol
• Security aspects of data management in Storage Area Networks
Redundant storage
Redundant Storage (RAID 5)
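RAID 5 protects a stripe with XOR parity: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. A minimal single-stripe sketch in Python; the helper name `xor_blocks` and the sample data are illustrative, and a real array also rotates the parity block across the disks from stripe to stripe:

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR a list of equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe spread over three data disks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # prints True
```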
Failover Configurations and Management
Failover must meet the following requirements:
• Transparent to the client;
• Quick (no more than 5 min, ideally 0-2 min);
• Minimal manual intervention, guaranteed data access.
Failover components:
• Two servers: one primary, the other a takeover server;
• Two network connections; a third is highly recommended;
• All disks on a failover pair should have some sort of redundancy;
• Application portability;
• No single point of failure.
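One common way to decide when the takeover server should step in is a heartbeat timeout. The Python sketch below shows only that decision logic; the class name `FailoverPair`, the 1-second interval and the limit of 3 missed heartbeats are assumptions for illustration, not values from this material. It leaves out the harder parts, such as moving disks and network identities and preventing both servers from being active at once.

```python
from dataclasses import dataclass

@dataclass
class FailoverPair:
    heartbeat_interval: float = 1.0  # seconds between expected heartbeats (assumed)
    missed_limit: int = 3            # missed heartbeats tolerated before takeover (assumed)
    active: str = "primary"
    last_heartbeat: float = 0.0

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the primary at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Promote the takeover server if the primary has been silent too long."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.heartbeat_interval * self.missed_limit:
            self.active = "takeover"
        return self.active

pair = FailoverPair()
pair.heartbeat(now=1.0)
print(pair.check(now=2.0))  # prints primary (silent for 1 s)
print(pair.check(now=5.0))  # prints takeover (silent for 4 s, limit is 3 s)
```

The timeout is a trade-off: a short one meets the "quick" requirement but risks a false takeover when the primary is merely slow.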
Symmetric Failover
Asymmetric Failover
Fibre Channel, SAN, IP Storage
Security in IP Storage Networks
• Security in Fibre Channel SANs
• Security Options for IP Storage Networks
Fibre Channel SAN Security
• Port or hard zoning
• WWN Zoning
• LUN Masking
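LUN masking can be pictured as a lookup table from an initiator's WWN to the LUNs it is allowed to see. A minimal Python sketch; the WWNs, the table `LUN_MASKS` and the function `can_access` are hypothetical, and real arrays and switches enforce this in firmware rather than in application code:

```python
# Hypothetical masking table: initiator WWN -> set of visible LUN numbers.
LUN_MASKS: dict[str, set[int]] = {
    "10:00:00:00:c9:aa:bb:01": {0, 1},
    "10:00:00:00:c9:aa:bb:02": {2},
}

def can_access(initiator_wwn: str, lun: int) -> bool:
    """Return True only if the LUN is explicitly unmasked for this initiator."""
    return lun in LUN_MASKS.get(initiator_wwn.lower(), set())

print(can_access("10:00:00:00:C9:AA:BB:01", 1))  # prints True
print(can_access("10:00:00:00:c9:aa:bb:02", 1))  # prints False
print(can_access("10:00:00:00:c9:aa:bb:99", 0))  # prints False (unknown initiator)
```

Unknown initiators get no access by default, which is the safer design choice for a masking table.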
Security Options for IP Storage Networks
• iSNS (Internet Storage Name Service)
• LUN masking (as in Fibre Channel) and VLAN tagging
• IP Security (IPsec)
• Access control lists (ACLs)