
Modern Distributed Systems Design

– Security and High Availability

1. Measuring Availability

2. Highly Available Data Management

3. Redundant System Design

Measuring Availability

• How are resiliency and high availability interconnected?

• Define downtime and what causes it.

• How to measure availability?

Measuring Availability

Percentage Uptime   Percentage Downtime   Downtime per Year   Downtime per Week
98%                 2%                    7.3 days            3h22m
99%                 1%                    3.65 days           1h41m
99.8%               0.2%                  17h30m              20m10s
99.9%               0.1%                  8h45m               10m5s
99.99%              0.01%                 52.5m               1m
99.999%             0.001%                5.25m               6s
99.9999%            0.0001%               31.5s               0.6s
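
The downtime figures in the table follow directly from the uptime percentage. The sketch below shows the arithmetic, assuming a 365-day year; the function names `downtime_seconds` and `format_duration` are illustrative and not part of the original slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(uptime_percent: float, period_seconds: int) -> float:
    """Seconds of downtime allowed in a period at a given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * period_seconds


def format_duration(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value readable."""
    if seconds < 60:
        return f"{seconds:.1f}s"
    minutes = seconds / 60
    if minutes < 60:
        return f"{minutes:.1f}m"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f}h"
    return f"{hours / 24:.2f} days"


if __name__ == "__main__":
    # Uptime levels taken from the table above.
    for uptime in (98.0, 99.0, 99.8, 99.9, 99.99, 99.999, 99.9999):
        per_year = format_duration(downtime_seconds(uptime, SECONDS_PER_YEAR))
        per_week = format_duration(downtime_seconds(uptime, SECONDS_PER_WEEK))
        print(f"{uptime}% uptime: {per_year} per year, {per_week} per week")
```

Small differences from the table are expected, since the slide values are rounded.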

Define Downtime

• Downtime can be defined as follows: “If a user cannot get his job done on time, the system is down”

What causes downtime?

• Planned – the easiest kind to reduce. It includes scheduled system maintenance, hot-swappable hard drives, cluster upgrades and even failovers. Usually 30% of all downtime;

• People, or the human factor – simple mistakes, plus increasingly complex IT equipment, software and protocols that demand greater knowledge from engineers. Usually 15% of all downtime;

• Software failures – caused by software bugs and viruses. Usually 40% of all downtime.

How to measure availability?

Availability = MTBF / (MTBF + MTTR), where

MTBF is the “mean time between failures” and MTTR is the “mean time to repair”.
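
As a worked example of the formula, the sketch below computes availability from MTBF and MTTR. The function name `availability` and the sample figures (999 hours between failures, 1 hour to repair) are hypothetical values chosen for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction in [0, 1]."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    if mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system: fails every 999 hours on average, 1 hour to repair.
    base = availability(999, 1)
    # Two ways to improve it: repair twice as fast, or fail half as often.
    faster_repair = availability(999, 0.5)
    fewer_failures = availability(1998, 1)
    print(f"baseline:       {base:.5%}")
    print(f"halved MTTR:    {faster_repair:.5%}")
    print(f"doubled MTBF:   {fewer_failures:.5%}")
```

Under these assumed numbers, halving MTTR and doubling MTBF give almost the same availability, which is why reducing repair time is often the cheaper lever.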

What can go wrong?

• Hardware

• Environmental and Physical Failures

• Network Failures

• Database System Failures

• Web Server Failures

• File and Print Server Failures

The Cost of Downtime.

Industry         Business Operation                 Average Downtime Cost per Hour
Financial        Brokerage operations               $6.45M
Financial        Credit card/sales authorization    $2.6M
Media            Pay-per-view TV                    $150K
Retail           Catalog sales                      $90K-$115K
Transportation   Airlines                           $89.5K

Levels of Availability:

1. Regular Availability

2. Increased Availability

3. High Availability

4. Disaster Recovery

5. Fault-Tolerant System

Highly Available Data Management

• Data management is the most sensitive area of modern distributed systems.

• Quick overview of existing data topologies

Redundant System Design

• Redundant storage (RAID, multi-hosting, multipathing, disk arrays, JBOD, etc.)

• Failover Configurations and Management

• Introduction to SAN and Fibre Channel protocol

• Security aspects of data management in Storage Area Networks

Redundant storage

Redundant Storage (RAID 5)

Failover Configurations and Management

Failover must meet the following requirements:

• Transparent to the client;

• Quick (no more than 5 min, ideally 0-2 min);

• Minimal manual intervention, guaranteed data access.

Failover components:

• Two servers: one primary, the other a takeover server;

• Two network connections; a third is highly recommended;

• All disks on a failover pair should have some sort of redundancy

• Application portability

• No single point of failure.
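
One reason for the extra network connections is that the takeover server should not confuse a failed link with a failed primary. The sketch below is a minimal simulation of that idea, not a real cluster manager: the class names, the 10-second timeout and the link names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeartbeatLink:
    """One network path over which the primary sends heartbeats."""
    name: str
    last_seen: float  # time (in seconds) of the last heartbeat on this link


@dataclass
class TakeoverServer:
    links: List[HeartbeatLink]
    timeout_seconds: float = 10.0  # hypothetical heartbeat timeout
    active: bool = False           # True once this server has taken over

    def record_heartbeat(self, link_name: str, now: float) -> None:
        for link in self.links:
            if link.name == link_name:
                link.last_seen = now

    def primary_looks_dead(self, now: float) -> bool:
        # Declare the primary dead only if EVERY link has gone silent.
        # A single silent link is more likely a network fault.
        return all(now - link.last_seen > self.timeout_seconds
                   for link in self.links)

    def check(self, now: float) -> bool:
        if not self.active and self.primary_looks_dead(now):
            # A real system would now claim the service IP address,
            # import the shared disks and start the application.
            self.active = True
        return self.active


if __name__ == "__main__":
    standby = TakeoverServer([HeartbeatLink("public", 0.0),
                              HeartbeatLink("private", 0.0)])
    standby.record_heartbeat("private", 20.0)  # public link is silent
    print(standby.check(25.0))  # one link still alive, no takeover
    print(standby.check(40.0))  # both links silent, takeover happens
```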

Symmetric Failover

Asymmetric Failover

Fibre Channel, SAN, IP Storage

Security in IP Storage Networks

• Security in Fibre Channel SANs

• Security Options for IP Storage Networks

Fibre Channel SAN Security

• Port or hard zoning

• WWN Zoning

• LUN Masking
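
WWN zoning and LUN masking can be read as two successive access checks: zoning decides which devices may talk to each other at all, and LUN masking decides which logical units a given host may see on a target. The sketch below models that with plain dictionaries; the zone names, WWN strings and LUN numbers are made up, and real switches and arrays enforce these rules in their own configuration rather than in application code.

```python
from typing import Dict, Set

# WWN zoning: zone name -> set of member WWNs (hypothetical values).
Zones = Dict[str, Set[str]]
# LUN masking: target WWN -> initiator WWN -> LUNs that initiator may see.
LunMasks = Dict[str, Dict[str, Set[int]]]


def in_same_zone(zones: Zones, wwn_a: str, wwn_b: str) -> bool:
    """Two devices may communicate only if some zone contains both."""
    return any(wwn_a in members and wwn_b in members
               for members in zones.values())


def can_access(zones: Zones, masks: LunMasks,
               initiator: str, target: str, lun: int) -> bool:
    """Apply zoning first, then LUN masking on the target."""
    if not in_same_zone(zones, initiator, target):
        return False
    return lun in masks.get(target, {}).get(initiator, set())


if __name__ == "__main__":
    zones = {
        "zone_a": {"wwn_host1", "wwn_array"},
        "zone_b": {"wwn_host2", "wwn_array"},
    }
    masks = {"wwn_array": {"wwn_host1": {0, 1}, "wwn_host2": {2}}}
    print(can_access(zones, masks, "wwn_host1", "wwn_array", 0))
    print(can_access(zones, masks, "wwn_host1", "wwn_array", 2))
```

Port (hard) zoning works the same way conceptually, except that zone membership is keyed by switch port instead of by WWN.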

Security Options for IP Storage Networks

• iSNS (Internet Storage Name Service)

• LUN masking (as in Fibre Channel) and VLAN tagging

• IP Security (IPsec)

• Access control lists (ACLs)
