things you should know about scalability!
Post on 17-Jan-2015
1.807 Views
Preview:
DESCRIPTION
TRANSCRIPT
Copyright © 2011 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
Things you should know about Scalability! WJAX 2011, 08.11.2011, Munich
Robert Mederer
Copyright © 2011 Accenture. All rights reserved. 2
Delivering architecture@internet-scale has several challenges to be solved to be ready for extreme scalable architectures. This session is about the art of scale, scalability, and scaling of web architectures. It will give an overview of challenges, good practices and solutions to achieve high scalability for web-based systems.
Things you should know about Scalability!
Abstract
Copyright © 2011 Accenture. All rights reserved. 3
Who am I?
Robert Mederer Lead Architecture & Execution
Anni-Albers-Straße 11 80807 München Mobil: +49-175-57-68012 robert.mederer@accenture.com
2000 - 2005: Technology Architect and Software Engineer in several projects
2006: Technical Architecture Lead, Integration and Execution Architecture for Location-Based Service Provider
2009: Technical Architecture Lead, Frontend and Execution Architecture for a Government Agency
2009/2010: Technical Architecture and front-office integration build lead, Integration and Execution Architecture Financial Services Agency
2011: Architect and QA for Location Based Services Platform
Experience
Copyright © 2011 Accenture. All rights reserved. 4
• Global management consulting, technology services and outsourcing company
• 236.000 employees
• Rank 47 among the “Best Global Brands 2008”
• Top 100 Employer
• 28 of the DAX-30-Companies
• 96 of the Fortune-Global-100
• More than three-quarters of the Fortune-Global-500
• 87 of our Top 100-clients have been with us for 10 or more years
Company Profile
Accenture High performance achieved
Worldwide Revenues $25.5 billion (in US$ billion, as of August 31, 2011)
Communications & High Tech
Public Service
Products
Financial Services
Resources
Copyright © 2011 Accenture. All rights reserved. 5
• Austria • Switzerland • Germany
Employees • >6000 • We are hiring!
Exciting Technology work • Large scale projects
(100+ people / multiple years) • Most challenging requirements
– Stock Exchange / Banking / Trading Systems
– AEMS Mobility Platform – Large Scale Web Applications
(> 1M page views / day) – Batch Architectures
Geographic unit
Local Accenture … ???
Frankfurt
Munich
Düsseldorf Berlin
Vienna
Zurich
Erlangen/ Nürnberg
Copyright © 2011 Accenture. All rights reserved. 6
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
Agenda
Copyright © 2011 Accenture. All rights reserved. 7
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
Agenda
Copyright © 2011 Accenture. All rights reserved. 8
High Scalability / Overload
Introduction
Source: Ezprezzo
Copyright © 2011 Accenture. All rights reserved. 9
Who are You?
– Developers,
– Architects,
– IT Manager
How large are your application (QPS)?
– 10-100?
– 100-1000?
– 1000-10000?
– 10000+?
How large is your total database?
– < 10 GB?
– 10 GB-100 GB?
– 100GB-1TB?
– 1TB-10 TB?
– 10TB+?
Audience?
Introduction | Question
Copyright © 2011 Accenture. All rights reserved. 10
If your system is
slow for a single user
How do I know if I have a performance problem?
Introduction | What is Performance?
Copyright © 2011 Accenture. All rights reserved. 11
If your system is
fast for a individual user but slow under high load
How do I know if I have a scalability problem?
Introduction | What is Scalability?
Copyright © 2011 Accenture. All rights reserved. 12
• Performance testing is defined as the technical investigation to determine or validate the speed, scalability, and/or stability characteristics of the web based system under test.
• Performance-related activities, such as testing and tuning, are concerned with achieving response times, throughput, and resource-utilization levels that meet the performance objectives for the application (SLA) under test.
Key Types of Performance Testing:
Non-Functional Testing Performance Testing of Web based systems
Performance Testing
Load Testing Stress Testing Capacity Testing
“Will it be fast enough?“
"Will it support all of my clients?“
"What happens if something goes wrong?"
"What do I need to plan for when I get more customers?“
Introduction | What is Performance?
Source: Thomas Werft, Performance Engineer at Accenture
Definition
Copyright © 2011 Accenture. All rights reserved. 13
Key performance indicators:
Non-Functional Testing Performance Testing of Web based systems
Criteria KPI Description
Response Time (first / last byte in ms)
Average An average is a value found by adding all of the numbers in a set together and then dividing them by the quantity of numbers in the set
Percentile (Target 98%)
A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.
Median A median is simply the middle value in a data set when sequenced from lowest to highest.
Throughput (QPS) Requests per Second; Transaction per Second
Throughput is the number of units of work that can be handled per unit t of time; for instance, requests per second, calls per day, hits per second, reports per year, etc.
Resource Utilization Processor; Memory; Disk I/O; Network I/O
Resource utilization is the cost of the project in terms of system resources. Utilization is the percentage of time that a resource is busy servicing user requests. The remaining percentage of time is considered idle time.
Introduction | What is Performance?
Source: Thomas Werft, Performance Engineer at Accenture
Results are used for Performance Engineering, Performance Tuning
Copyright © 2011 Accenture. All rights reserved. 14
A system’s capacity to uphold the same performance under heavier volumes.
Scalability
Introduction | What is Scalability?
Definition
Source: Patterns for Performance and Operability: Building and Testing Enterprise Software, Chris Ford et. al., 2008
Copyright © 2011 Accenture. All rights reserved. 15
• CPU,
• Memory,
• Bandwidth, …
Simple Process • Application is generally not affected by
those changes
Classical Example are Super Computers like • HP Integrity Superdome
• IBM Mainframe
Is achieved by increasing the capacity of a single node
Vertical Scalability
Introduction | What is Scalability?
Source: Hewlett-Packard
Copyright © 2011 Accenture. All rights reserved. 16
• Nodes can be added to scale out Produces overhead - Keep cluster consistent
- Node error detection and handling
- Communication between nodes
• May be used to increase reliability and availability
• Distributed Systems and Programs like – SETI@Home
– World Wide Web
– Domain Name Service
• Application is spread on a cluster with several nodes
Horizontal Scalability
Source: Space Sciences Laboratory, U.C. Berkeley
Introduction | What is Scalability?
Copyright © 2011 Accenture. All rights reserved. 17
Consistency
Availability Partition Tolerance
• Consistency – all clients see the same data at the same time
• Availability – all clients can find all data even in presence of failure
• Partition Tolerance – system works even when one node failed
CAP Theorem (Brewer‘s Theorem)
Impossible
Source: PODC-keynote, Towards Robust Distributed Systems, Dr. Eric A. Brewer, 2000
Introduction | Scalability Trade-Offs | Availability vs. Consistency
Copyright © 2011 Accenture. All rights reserved. 18
Consistency + Availability • High data integrity
• Single site, cluster database, LDAP, etc.
• 2-phase commit, data replication, etc.
Consistency + Partition • Distributed database, distributed locking, etc.
• Pessimistic locking, etc.
Availability + Partition • High scalability
• Distributed cache, DNS, etc.
• Optimistic locking, expiration/leases (timeout), etc.
Normally, two of these properties for any shared-data system
CAP Theorem
C
A P
C
A P
C
A P
Introduction | Scalability Trade-Offs | Availability vs. Consistency
Source: “Architecting Cloudy Applications”, David Chou
Copyright © 2011 Accenture. All rights reserved. 19
Data and Scalability
Source: Visual Guide to NoSQL Systems, http://blog.nahurst.com/tag/cap
Data Models Key:
Relational (comparison) Key-Value Column-Oriented Document-Oriented
• RDBMSs (MySQL, Postgres, etc.)
• Greenplum • Vertica
Consistent & Available • Cassandra • SimpleDB • CouchDB • Riak • Dynamo • Voldemort • Tokyo
Cabinet • KAI
Available & Partition Tolerant Distributed Non-Relational data store solutions must relax guarantees around consistency, partition tolerance and availability, resulting in systems optimized for different combinations of properties.
• BigTable • HyperTable • Hbase • MongoDB • Terrastore
• Scalaris • BerkeleyDB • MemcacheDB • Redis
Consistent & Partition Tolerant
Consistency
Availability Partition Tolerance
Introduction | Scalability Trade-Offs | Availability vs. Consistency
Copyright © 2011 Accenture. All rights reserved. 20
Analysis and Classification
Data and Scalability
Introduction | Scalability Trade-Offs | Availability vs. Consistency
Copyright © 2011 Accenture. All rights reserved. 21
ACID - Do I really need it?
Data and Scalability
Introduction | Scalability Trade-Offs | Availability vs. Consistency
In order to guarantee transactional integrity, the traditional relational database management system (RDBMS) was architected to guarantee four core properties: Atomicity, Consistency, Isolation and Durability (ACID).
Atomicity A database is said to be atomic if when one
part of the transaction fails, the entire transaction fails and database state is left
unchanged.
Consistency if the database remains in a consistent state
after any transaction. Therefore, if a transaction violates the consistency of the
database (e.g. the value is not the right type) then the transaction should be rolled back.
Durability A database is said to be durable if it recovers
all of the committed transactions in the system even after system failure.
Isolation A database is said to be isolated if transactions
can’t have access to data currently being modified by another transaction.
Relational databases were originally designed for transactional data processing – reliably processing and maintaining data integrity – on different HW architectures.
Copyright © 2011 Accenture. All rights reserved. 22
• Basically Available • Soft-state (or scalable)
• Eventually consistent
Modern Internet systems: focused on BASE
BASE
Introduction | Scalability Trade-Offs | Availability vs. Consistency
Example: Amazon outage in April 2010 brought thousand of customers down, including Pfizer, Netflix, Quora, Foursquare, Reddit, … • The Amazon.com 2010 Shareholder Letter Focusses on Technology
• http://www.allthingsdistributed.com/2011/04/the_amazoncom_2010_shareholder.html • http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html • http://www.nytimes.com/2011/04/23/technology/23cloud.html
• http://www.allthingsdistributed.com/2007/12/eventually_consistent.html Dec. 2007
Copyright © 2011 Accenture. All rights reserved. 23
ACID BASE
• Strong consistency for transactions highest priority
• Availability less important
• Pessimistic
• Complex mechanisms
• Availability and scaling highest priorities
• Weak consistency
• Optimistic
• Simple and fast
ACID vs. BASE
Introduction | Scalability Trade-Offs | Availability vs. Consistency
Copyright © 2011 Accenture. All rights reserved. 24
Network protocols has an inherent throughput bottleneck that becomes more severe with increased packet loss and latency
Network Latency vs. Throughput
Introduction | Scalability Trade-Offs - Latency vs. Throughput
Source: http://www.asperasoft.com/en/technology/shortcomings_of_TCP_2/the_shortcomings_of_TCP_file_transfer_2
Copyright © 2011 Accenture. All rights reserved. 25
• Processing load is distributed
• Closer to the user
• Decreases latency
• Lower cost of hardware
• Increases service levels
• Greater flexibility in responding to service requests
• Seasonal spikes in demand can be off-loaded to other edge servers
Transferring data or services from a centralized point to the edge of the network
Edge Computing
Introduction | Scalability and Edge Computing
Copyright © 2011 Accenture. All rights reserved. 26
• Store objects for the application to be reused
• Cache data from database or generated by application
• E.g. ehCache, memcached, etc.
Application Cache • Speed up performance or minimize resources used
• Proxy caching / Reverse proxy caching
• E.g. Squid, Varnish, etc
Content Delivery Network (CDN) • Faster response time and fewer requests on the origin servers
• Push content closer to end user
• E.g. Akamai, Savvis, Mirror Image Internet, Netscaler, Amazon CloudFoundry, etc
Object cache
Caching and Types of Caches
Introduction | Caching
Copyright © 2011 Accenture. All rights reserved. 27
Abstract architecture of a Content Delivery Network (CDN)
CDN
Source:Content Delivery Network (CDN) Research Directory, http://ww2.cs.mu.oz.au/~apathan/CDNs.html
Introduction | Caching
Copyright © 2011 Accenture. All rights reserved. 28
Basic interaction flows in a CDN environment
CDN
Introduction | Caching
Source: Basic interaction flows in a CDN environment, http://ww2.cs.mu.oz.au/~apathan/CDNs.html
Copyright © 2011 Accenture. All rights reserved. 29
• Methodology to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources, to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload
• Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy. The load balancing service is usually provided by dedicated software or hardware, such as a multilayer switch or a Domain Name System server.
Definition:
Basics Load Balancing
Introduction Introduction
Copyright © 2011 Accenture. All rights reserved. 30
Load Balancing (Major) Usage
• Distributing the load across multiple servers • Target is to scale beyond the capacity of one server, and to tolerate a
server failure. Server LB
• Directing users to different data center sites consisting of server farms • Target is to provide users with fast response time and to tolerate a
complete data center failure (availability, business continuity, disaster recovery, geographic routing)
Global Server LB
• Distribute the load across multiple firewalls • Target is to scale beyond the capacity of one firewall, and tolerate a
firewall failure. Firewall LB
• Transparently directs traffic to caches to accelerate the response time for clients
• Or improve the performance of web servers by offloading the static content to caches.
Transparent Cache Switching
Source: Load Balancing Servers, Firewalls, and Caches by Chandra Kopparapu; John Wiley & Sons © 2002
Introduction
Copyright © 2011 Accenture. All rights reserved. 31
• Pros: Simple to implement.
• Cons: Can lead to overloading of one server while under-utilization of others.
Round-Robin Allocation • Pros: Better than random allocation because the requests are equally
divided among the available servers in an orderly fashion.
• Cons: Round robin algorithm is not enough for load balancing based on processing overhead required and if the server specifications are not identical to each other in the server group.
Weighted Round-Robin Allocation • Pros: Takes care of the capacity of the servers in the group.
• Cons: Does not consider the advanced load balancing requirements such as processing times for each individual request.
Random Allocation
Basics Load Balancing Algorithm’s
Introduction
Copyright © 2011 Accenture. All rights reserved. 32
• Hardware
– Barracuda Networks
– Cisco Systems
– Citrix Systems
– F5 Networks (BigIp)
– Etc.
• Software
– HAProxy
– Apache HTTP Server with mod_proxy for Tomcat
– …
Basics Server Load Balancing
Introduction
Problem: • No real load balancing due to TTL of DNS • No health check for service availability
Simple Load Balancing over DNS (List of IP‘s with round robin)
Does that work?
Copyright © 2011 Accenture. All rights reserved. 33
• Functionality
– DNS based routing
– Based on IP GEO database (Geographic routing)
– Assumption: Local DNS for client
• Provider
– F5 Networks (Global Load Balancing Solutions)
– UltraDNS (Traffic Controller Service)
– Level3 (Traffic Manager, BCDR Solution)
Global Server Load Balancing
Introduction | Load Balancing
Copyright © 2011 Accenture. All rights reserved. 34
• Increase application availability in event of entire site failure or overload (Business Continuity, Disaster Recovery)
• Scale application performance by load balancing traffic across multiple sites (Edge Computing (together with CDN))
• Need for more granularity and control in directing Web traffic
• More flexibility in building and managing Internet infrastructures
– E.g. Site based downtime management during release upgrade
• Cons: Not always working! Due to assumption of a local DNS (Public DNS usage, DNS over VPN could fail to get the nearest server location) – (see: http://www.royans.net/arch/fixing-gslb-global-server-load-balancing/)
• Fix: Google proposed a DNS enhancement to not use the DNS resolver IP further more the client / end-user IP (see: DNS resolver, http://googlecode.blogspot.com/2010/01/proposal-to-extend-dns-protocol.html )
Characteristics / Usage
Global Server Load Balancing
Introduction | Load Balancing
Copyright © 2011 Accenture. All rights reserved. 35
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
Agenda
Copyright © 2011 Accenture. All rights reserved. 36
Case Study – Non-Functional Requirements
User groups: • Web Browser users • Mobile users
USA: 50 Mil. EU: 30 Mil.
ASIA: 15 Mil.
AU: 2 Mil.
Case Study – Internet Scale Web Services
Availability: 99,99 %
Copyright © 2011 Accenture. All rights reserved. 37
AU: 1 data center: • Sydney Peak: 3.000 QPS
USA: 2 data center: • New York • San Francisco Peak: 20.000 QPS
EU: 2 data center: • Frankfurt • London Peak: 10.000 QPS
ASIA: 1 data center: • Singapore Peak: 5.000 QPS
Case Study – Non-Functional Requirements
Case Study – Internet Scale Web Services
Copyright © 2011 Accenture. All rights reserved. 38
AU / ASIA: Failover Singapore ↔ Sydney: 8.000 QPS
Case Study – Non-Functional Requirements Performance in Case of Failure
USA: Failover New York ↔ San Francisco: 40.000 QPS
EU: Failover Frankfurt ↔ London: 20.000 QPS
Case Study – Internet Scale Web Services
Copyright © 2011 Accenture. All rights reserved. 39
Case Study – Non-Functional Requirements Response Times
RESTful Web Services: • Calculate service: 100 ms (50ms latency) • Binary service: 60 ms (50 ms latency) • Search service: 50 ms (50 ms latency)
Case Study – Internet Scale Web Services
Copyright © 2011 Accenture. All rights reserved. 40
Case Study – Non-Functional Requirements Data
100 TByte on each geography - Binary (video, image, …) - Index data
Case Study – Internet Scale Web Services
Copyright © 2011 Accenture. All rights reserved. 41
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
Agenda
Copyright © 2011 Accenture. All rights reserved. 42
Case Study – Solution
Copyright © 2011 Accenture. All rights reserved. 43
Case Study – Solution
Copyright © 2011 Accenture. All rights reserved. 44
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
Agenda
Copyright © 2011 Accenture. All rights reserved. 45
• Organization
– People, Process and Tools
– Governance (Lifecycle management)
• Where I do I find the truth in a highly scaled and distributed architecture?
– Logging
• Log Analytics (e.g. Scribe (not really), Splunk)
– End-to-end data visualization
Furhter Topics
Copyright © 2011 Accenture. All rights reserved. 46
• Introduction
• Case Study
• Solution and Good Practice
• Further Topics
• Conclusion
Agenda
Copyright © 2011 Accenture. All rights reserved. 47
Reverse proxy Caching
• Fast and Scales well
• Dealing with invalidation is tricky
• Direct cache invalidation scales badly
• Instead, change URLs of modified resources
• Old ones will drop out of cache naturally
CDN – Content Delivery Network
• Faster response time and fewer requests on the origin servers
• No 100% control of caching. Based on internal statistics (Akamai).
• Operated by 3rd parties. Already in place. Not for Free
• Once something is cached on CDN, assume that it never changes
• Sometimes does load balancing as well
Content Caching
Conclusion
Copyright © 2011 Accenture. All rights reserved. 48
Common Concepts of Scalable Architecture
idempotent operations
shared nothing loosely coupled
parallelization
7 Habits of Good
Distributed Systems
asynchronous
partitioned data
fault-tolerance
optimistic concurrency
Source: "Architecting Cloudy Applications", David Chou Source: highscalability.com
Conclusion
Copyright © 2011 Accenture. All rights reserved. 49
• Is there a need to scale my application?
– Vertical scaling is more easy to achieve (Cost)
– Use horizontal scaling only when required (Complexity)
• Is there a plan to proof your designed solution?
– Plan to do a lot of realistic Proof-of-Concepts
• Is there a one size fits all solution?
– NO!
• How important is ACID?
– Is BASE enough?
– Can a NoSQL solution be used?
Questionnaire
Conclusion
Copyright © 2011 Accenture. All rights reserved. 50
References
The Art of Scalability: Scalable Web Architecture, Processes and Organizations for the Modern Enterprise; Michael T. Fisher, Martin L. Abbott; Addison-Wesley Professional; 1 edition
Scalability Rules: 50 Principles for Scaling Web Sites; Martin L. Abbott, Michael T. Fisher Addison-Wesley Professional; 1 edition (May 15, 2011)
Scalable Internet Architectures; Theo Schlossnagle; Sams; 1 edition (July 31, 2006)
Building Scalable Web Sites; Henderson; Oreilly
Websites: HighScalability.com, infoQ.com, Qcon.com, …
Copyright © 2011 Accenture. All rights reserved. 51
Contribution and Review:
Bukowski, Markus; Conradt, Steffen; Jacobs, Mareike; Krogemann, Markus; Peuker, Jan; Van Isacker, Pieter; Wagenknecht, Dominik; Wagner, Hubert; Werft, Thomas; Zakotnik, Jure
Thank You!
top related