
    OpenStack Architecture Design Guide
    current (2015-06-03)
    Copyright © 2014, 2015 OpenStack Foundation. Some rights reserved.

    To reap the benefits of OpenStack, you should plan, design, and architect your cloud properly, taking users' needs into account and understanding the use cases.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    Except where otherwise noted, this document is licensed under the Creative Commons Attribution ShareAlike 3.0 License. http://creativecommons.org/licenses/by-sa/3.0/legalcode


    Table of Contents

    Preface
        Conventions
        Document change history
    1. Introduction
        Intended audience
        How this book is organized
        Why and how we wrote this book
        Methodology
    2. General purpose
        User requirements
        Technical considerations
        Operational considerations
        Architecture
        Prescriptive example
    3. Compute focused
        User requirements
        Technical considerations
        Operational considerations
        Architecture
        Prescriptive examples
    4. Storage focused
        User requirements
        Technical considerations
        Operational considerations
        Architecture
        Prescriptive examples
    5. Network focused
        User requirements
        Technical considerations
        Operational considerations
        Architecture
        Prescriptive examples
    6. Multi-site
        User requirements
        Technical considerations
        Operational considerations
        Architecture
        Prescriptive examples
    7. Hybrid
        User requirements
        Technical considerations
        Operational considerations
        Architecture
        Prescriptive examples
    8. Massively scalable
        User requirements
        Technical considerations
        Operational considerations
    9. Specialized cases
        Multi-hypervisor example
        Specialized networking example
        Software-defined networking
        Desktop-as-a-Service
        OpenStack on OpenStack
        Specialized hardware
    10. References
    A. Community support
        Documentation
        ask.openstack.org
        OpenStack mailing lists
        The OpenStack wiki
        The Launchpad Bugs area
        The OpenStack IRC channel
        Documentation feedback
        OpenStack distribution packages
    Glossary


    Preface

    Conventions

    The OpenStack documentation uses several typesetting conventions.

    Notices

    Notices take these forms:

    Note

    A handy tip or reminder.

    Important

    Something you must be aware of before proceeding.

    Warning

    Critical information about the risk of data loss or security issues.

    Command prompts

    $ prompt    Any user, including the root user, can run commands that are prefixed with the $ prompt.

    # prompt    The root user must run commands that are prefixed with the # prompt. You can also prefix these commands with the sudo command, if available, to run them.

    Document change history

    This version of the guide replaces and obsoletes all earlier versions.

    The following table describes the most recent changes:

    Revision Date       Summary of Changes
    October 15, 2014    Incorporate edits to follow OpenStack style.
    July 21, 2014       Initial release.


    1. Introduction

    Table of Contents

    Intended audience
    How this book is organized
    Why and how we wrote this book
    Methodology

    OpenStack is a leader in the cloud technology gold rush, as organizations of all stripes discover the increased flexibility and speed to market that self-service cloud and Infrastructure-as-a-Service (IaaS) provides. However, in order to reap those benefits, the cloud must be designed and architected properly.

    A well-architected cloud provides a stable IT environment that offers easy access to needed resources, usage-based expenses, extra capacity on demand, disaster recovery, and a secure environment. A well-architected cloud does not magically build itself. It requires careful consideration of a multitude of factors, both technical and non-technical.

    There is no single architecture that is "right" for an OpenStack cloud deployment. OpenStack can be used for any number of different purposes, each with its own particular requirements and architectural peculiarities.

    This book is designed to examine some of the most common uses for OpenStack clouds (and some less common uses) and to provide knowledge and advice to help explain the issues that require consideration. These examples, coupled with a wealth of knowledge and advice, will help an organization design and build a well-architected OpenStack cloud to fit its unique requirements.

    Intended audience

    This book has been written for architects and designers of OpenStack clouds. This book is not intended for people who are deploying OpenStack. For a guide on deploying and operating OpenStack, please refer to the OpenStack Operations Guide (http://docs.openstack.org/openstack-ops).


    The reader should have prior knowledge of cloud architecture and principles, experience in enterprise system design, Linux and virtualization experience, and a basic understanding of networking principles and protocols.

    How this book is organized

    This book has been organized into various chapters that help define the use cases associated with making architectural choices related to an OpenStack cloud installation. Each chapter is intended to stand alone to encourage individual chapter readability; however, each chapter is designed to contain useful information that may be applicable in situations covered by other chapters. Cloud architects may use this book as a comprehensive guide by reading all of the use cases, but it is also possible to review only the chapters which pertain to a specific use case. When choosing to read specific use cases, note that it may be necessary to read more than one section of the guide to formulate a complete design for the cloud. The use cases covered in this guide include:

    General purpose: A cloud built with common components that should address 80% of common use cases.

    Compute focused: A cloud designed to address compute intensive workloads such as high performance computing (HPC).

    Storage focused: A cloud focused on storage intensive workloads such as data analytics with parallel file systems.

    Network focused: A cloud depending on high performance and reliable networking, such as a content delivery network (CDN).

    Multi-site: A cloud built with multiple sites available for application deployments for geographical, reliability or data locality reasons.

    Hybrid cloud: An architecture where multiple disparate clouds are connected either for failover, hybrid cloud bursting, or availability.

    Massively scalable: An architecture that is intended for cloud service providers or other extremely large installations.

    A chapter titled Specialized cases provides information on architectures that have not previously been covered in the defined use cases.

    Each chapter in the guide is then further broken down into the following sections:


    Introduction: Provides an overview of the architectural use case.

    User requirements: Defines the set of user considerations that typically come into play for that use case.

    Technical considerations: Covers the technical issues that must be accounted for when dealing with this use case.

    Operational considerations: Covers the ongoing operational tasks associated with this use case and architecture.

    Architecture: Covers the overall architecture associated with the use case.

    Prescriptive examples: Presents one or more scenarios where this architecture could be deployed.

    A glossary covers the terms used in the book.

    Why and how we wrote this book

    The velocity at which OpenStack environments are moving from proof-of-concepts to production deployments is leading to increasing questions and issues related to architecture design considerations. By and large these considerations are not addressed in the existing documentation, which typically focuses on the specifics of deployment and configuration options or operational considerations, rather than the bigger picture.

    We wrote this book to guide readers in designing an OpenStack architecture that meets the needs of their organization. This guide concentrates on identifying important design considerations for common cloud use cases and provides examples based on these design guidelines. This guide does not aim to provide explicit instructions for installing and configuring the cloud, but rather focuses on design principles as they relate to user requirements as well as technical and operational considerations. For specific guidance with installation and configuration there are a number of resources already available in the OpenStack documentation that help in that area.

    This book was written in a book sprint format, which is a facilitated, rapid development production method for books. For more information, see the Book Sprints website (www.booksprints.net).

    This book was written in five days during July 2014 while exhausting the M&M, Mountain Dew and healthy options supply, complete with juggling entertainment during lunches at VMware's headquarters in Palo Alto. The event was also documented on Twitter using the #OpenStackDesign hashtag. The Book Sprint was facilitated by Faith Bosworth and Adam Hyde.

    We would like to thank VMware for their generous hospitality, as well as our employers, Cisco, Cloudscaling, Comcast, EMC, Mirantis, Rackspace, Red Hat, Verizon, and VMware, for enabling us to contribute our time. We would especially like to thank Anne Gentle and Kenneth Hui for all of their shepherding and organization in making this happen.

    The author team includes:

    Kenneth Hui (EMC) @hui_kenneth

    Alexandra Settle (Rackspace) @dewsday

    Anthony Veiga (Comcast) @daaelar

    Beth Cohen (Verizon) @bfcohen

    Kevin Jackson (Rackspace) @itarchitectkev

    Maish Saidel-Keesing (Cisco) @maishsk

    Nick Chase (Mirantis) @NickChase

    Scott Lowe (VMware) @scott_lowe

    Sean Collins (Comcast) @sc68cal

    Sean Winn (Cloudscaling) @seanmwinn

    Sebastian Gutierrez (Red Hat) @gutseb

    Stephen Gordon (Red Hat) @xsgordon

    Vinny Valdez (Red Hat) @VinnyValdez

    Methodology

    The magic of the cloud is that it can do anything. It is both robust and flexible, the best of both worlds. Yes, the cloud is highly flexible and it can do almost anything, but to get the most out of a cloud investment, it is important to define how the cloud will be used by creating and testing use cases.

    This is the chapter that describes the thought process behind how to design a cloud architecture that best suits the intended use.

    The diagram shows at a very abstract level the process for capturing requirements and building use cases. Once a set of use cases has been defined, it can then be used to design the cloud architecture.

    Use case planning can seem counter-intuitive. After all, it takes about five minutes to sign up for a server with Amazon. Amazon does not know in advance what any given user is planning on doing with it, right? Wrong. Amazon's product management department spends plenty of time figuring out exactly what would be attractive to their typical customer and honing the service to deliver it. For the enterprise, the planning process is no different, but instead of planning for an external paying customer, for example, the use could be for internal application developers or a web portal. The following is a list of the high level objectives that need to be incorporated into the thinking about creating a use case.

    Overall business objectives

    Develop clear definition of business goals and requirements

    Increase project support and engagement with business, customers and end users.

    Technology


    Coordinate the OpenStack architecture across the project and leverage OpenStack community efforts more effectively.

    Architect for automation as much as possible to speed development and deployment.

    Use the appropriate tools for the development effort.

    Create better and more test metrics and test harnesses to support continuous and integrated development, test processes and automation.

    Organization

    Better messaging of management support of team efforts

    Develop better cultural understanding of Open Source, cloud architectures, Agile methodologies, continuous development, test and integration, overall development concepts in general

    As an example of how this works, consider a business goal of using the cloud for the company's E-commerce website. This goal means planning for applications that will support thousands of sessions per second, variable workloads, and lots of complex and changing data. By identifying the key metrics, such as number of concurrent transactions per second, size of database, and so on, it is possible to then build a method for testing the assumptions.
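
    To make those assumptions testable, it can help to turn the key metrics into a rough capacity model. The following is a minimal sketch in Python, assuming hypothetical figures for peak transactions per second, per-instance throughput, and headroom; none of these numbers come from the guide itself.

        # Rough capacity model for the e-commerce example (illustrative numbers only).
        import math

        def required_instances(peak_tps, tps_per_instance, headroom=0.25):
            """Instances needed to serve peak_tps with some spare headroom."""
            raw = peak_tps / tps_per_instance          # instances at exactly peak load
            return math.ceil(raw * (1 + headroom))     # extra margin for bursts and failures

        print(required_instances(peak_tps=2000, tps_per_instance=150))   # -> 17

    Changing an assumption (a higher peak, a slower instance) simply changes the output, which is the point: the requirements become parameters rather than a redesign.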

    Develop functional user scenarios. Develop functional user scenarios that can be used to develop test cases that can be used to measure overall project trajectory. If the organization is not ready to commit to an application or applications that can be used to develop user requirements, it needs to create requirements to build valid test harnesses and develop usable metrics. Once the metrics are established, as requirements change, it is easier to respond to the changes quickly without having to worry overly much about setting the exact requirements in advance. Think of this as creating ways to configure the system, rather than redesigning it every time there is a requirements change.

    Limit cloud feature set. Create requirements that address the pain points, but do not recreate the entire OpenStack tool suite. The requirement to build OpenStack, only better, is self-defeating. It is important to limit scope creep by concentrating on developing a platform that will address tool limitations for the requirements, but not recreating the entire suite of tools. Work with technical product owners to establish critical features that are needed for a successful cloud deployment.


    Application cloud readiness

    Although the cloud is designed to make things easier, it is important to realize that "using cloud" is more than just firing up an instance and dropping an application on it. The "lift and shift" approach works in certain situations, but there is a fundamental difference between clouds and traditional bare-metal-based environments, or even traditional virtualized environments.

    In traditional environments, with traditional enterprise applications, the applications and the servers that run them are "pets". They're lovingly crafted and cared for, the servers have names like Gandalf or Tardis, and if they get sick, someone nurses them back to health. All of this is designed so that the application does not experience an outage.

    In cloud environments, on the other hand, servers are more like cattle. There are thousands of them, they get names like NY-1138-Q, and if they get sick, they get put down and a sysadmin installs another one. Traditional applications that are unprepared for this kind of environment will naturally suffer outages, lost data, or worse.

    There are other reasons to design applications with cloud in mind. Some are defensive, such as the fact that applications cannot be certain of exactly where or on what hardware they will be launched, so they need to be flexible, or at least adaptable. Others are proactive. For example, one of the advantages of using the cloud is scalability, so applications need to be designed in such a way that they can take advantage of those and other opportunities.

    Determining whether an application is cloud-ready

    There are several factors to take into consideration when looking at whether an application is a good fit for the cloud.

    Structure: A large, monolithic, single-tiered legacy application typically isn't a good fit for the cloud. Efficiencies are gained when load can be spread over several instances, so that a failure in one part of the system can be mitigated without affecting other parts of the system, or so that scaling can take place where the app needs it.

    Dependencies: Applications that depend on specific hardware, such as a particular chip set, or an external device such as a fingerprint reader, might not be a good fit for the cloud, unless those dependencies are specifically addressed. Similarly, if an application depends on an operating system or set of libraries that cannot be used in the cloud, or cannot be virtualized, that is a problem.

    Connectivity: Self-contained applications, or those that depend on resources that are not reachable by the cloud in question, will not run. In some situations, you can work around these issues with a custom network setup, but how well this works depends on the chosen cloud environment.

    Durability and resilience: Despite the existence of SLAs, things break: servers go down, network connections are disrupted, or too many tenants on a server make a server unusable. An application must be sturdy enough to contend with these issues.

    Designing for the cloud

    Here are some guidelines to keep in mind when designing an application for the cloud:

    Be a pessimist: Assume everything fails and design backwards. Love your chaos monkey.

    Put your eggs in multiple baskets: Leverage multiple providers, geographic regions and availability zones to accommodate for local availability issues. Design for portability.

    Think efficiency: Inefficient designs will not scale. Efficient designs become cheaper as they scale. Kill off unneeded components or capacity.

    Be paranoid: Design for defense in depth and zero tolerance by building in security at every level and between every component. Trust no one.

    But not too paranoid: Not every application needs the platinum solution. Architect for different SLAs, service tiers and security levels.

    Manage the data: Data is usually the most inflexible and complex area of a cloud and cloud integration architecture. Don't short change the effort in analyzing and addressing data needs.

    Hands off: Leverage automation to increase consistency and quality and reduce response times.

    Divide and conquer: Pursue partitioning and parallel layering wherever possible. Make components as small and portable as possible. Use load balancing between layers.

    Think elasticity: Increasing resources should result in a proportional increase in performance and scalability. Decreasing resources should have the opposite effect.

    Be dynamic: Enable dynamic configuration changes such as auto scaling, failure recovery and resource discovery to adapt to changing environments, faults and workload volumes.

    Stay close: Reduce latency by moving highly interactive components and data near each other.

    Keep it loose: Loose coupling, service interfaces, separation of concerns, abstraction and well-defined APIs deliver flexibility.

    Be cost aware: Autoscaling, data transmission, virtual software licenses, reserved instances, and so on can rapidly increase monthly usage charges. Monitor usage closely.


    2. General purpose

    Table of Contents

    User requirements
    Technical considerations
    Operational considerations
    Architecture
    Prescriptive example

    An OpenStack general purpose cloud is often considered a starting point for building a cloud deployment. General purpose clouds are designed to balance the components and do not emphasize any particular aspect of the overall computing environment. Cloud design must give equal weight to the compute, network, and storage components. General purpose clouds are found in private, public, and hybrid environments, lending themselves to many different use cases.

    Note

    General purpose clouds are homogeneous deployments and are not suited to specialized environments or edge case situations.

    Common uses of a general purpose cloud include:

    Providing a simple database

    A web application runtime environment

    A shared application development platform

    Lab test bed

    Use cases that benefit from scale-out rather than scale-up approaches are good candidates for general purpose cloud architecture.

    A general purpose cloud is designed to have a range of potential uses or functions; not specialized for specific use cases. General purpose architecture is designed to address 80% of potential use cases available. The infrastructure, in itself, is a specific use case, enabling it to be used as a base model for the design process. General purpose clouds are designed to be platforms that are suited for general purpose applications.


    General purpose clouds are limited to the most basic components, but they can include additional resources such as:

    Virtual-machine disk image library

    Raw block storage

    File or object storage

    Firewalls

    Load balancers

    IP addresses

    Network overlays or virtual local area networks (VLANs)

    Software bundles

    User requirements

    When building a general purpose cloud, you should follow the Infrastructure-as-a-Service (IaaS) model; a platform best suited for use cases with simple requirements. General purpose cloud user requirements are not complex. However, it is important to capture them even if the project has minimum business and technical requirements, such as a proof of concept (PoC), or a small lab platform.

    Note

    The following user considerations are written from the perspective of the cloud builder, not from the perspective of the end user.

    Cost: Financial factors are a primary concern for any organization. Cost is an important criterion as general purpose clouds are considered the baseline from which all other cloud architecture environments derive. General purpose clouds do not always provide the most cost-effective environment for specialized applications or situations. Unless razor-thin margins and costs have been mandated as a critical factor, cost should not be the sole consideration when choosing or designing a general purpose architecture.

    Time to market: The ability to deliver services or products within a flexible time frame is a common business factor when building a general purpose cloud. In today's high-speed business world, the ability to deliver a product in six months instead of two years is a driving force behind the decision to build general purpose clouds. General purpose clouds allow users to self-provision and gain access to compute, network, and storage resources on-demand, thus decreasing time to market.

    Revenue opportunity: Revenue opportunities for a cloud will vary greatly based on the intended use case of that particular cloud. Some general purpose clouds are built for commercial customer-facing products, but there are alternatives that might make the general purpose cloud the right choice. For example, a small cloud service provider (CSP) might want to build a general purpose cloud rather than a massively scalable cloud because they do not have the deep financial resources needed, or because they do not, or will not, know in advance the purposes for which their customers are going to use the cloud. For some users, the advantages the cloud itself offers mean an enhancement of revenue opportunity. For others, the fact that a general purpose cloud provides only baseline functionality will be a disincentive for use, leading to a potential stagnation of potential revenue opportunities.

    Legal requirements

    Many jurisdictions have legislative and regulatory requirements governing the storage and management of data in cloud environments. Common areas of regulation include:

    Data retention policies ensuring storage of persistent data and records management to meet data archival requirements.

    Data ownership policies governing the possession and responsibility for data.

    Data sovereignty policies governing the storage of data in foreign countries or otherwise separate jurisdictions.

    Data compliance policies governing certain types of information needing to reside in certain locations due to regulatory issues, and more importantly, cannot reside in other locations for the same reason.

    Examples of such legal frameworks include the data protection framework of the European Union and the requirements of the Financial Industry Regulatory Authority in the United States. Consult a local regulatory body for more information.

    Technical requirements

    Technical cloud architecture requirements should be weighted against the business requirements.

    Performance: As a baseline product, general purpose clouds do not provide optimized performance for any particular function. While a general purpose cloud should provide enough performance to satisfy average user considerations, performance is not a general purpose cloud customer driver.

    No predefined usage model: The lack of a predefined usage model enables the user to run a wide variety of applications without having to know the application requirements in advance. This provides a degree of independence and flexibility that no other cloud scenarios are able to provide.

    On-demand and self-service application: By definition, a cloud provides end users with the ability to self-provision computing power, storage, networks, and software in a simple and flexible way. The user must be able to scale their resources up to a substantial level without disrupting the underlying host operations. One of the benefits of using a general purpose cloud architecture is the ability to start with limited resources and increase them over time as the user demand grows.

    Public cloud: For a company interested in building a commercial public cloud offering based on OpenStack, the general purpose architecture model might be the best choice. Designers are not always going to know the purposes or workloads for which the end users will use the cloud.

    Internal consumption (private) cloud: Organizations need to determine if it is logical to create their own clouds internally. Using a private cloud, organizations are able to maintain complete control over architectural and cloud components.

    Note

    Users will want to combine using the internal cloud with access to an external cloud. If that case is likely, it might be worth exploring the possibility of taking a multi-cloud approach with regard to at least some of the architectural elements.

    Designs that incorporate the use of multiple clouds, such as a private cloud and a public cloud offering, are described in the "Multi-Cloud" scenario; see Chapter 6, Multi-site.

    Security: Security should be implemented according to asset, threat, and vulnerability risk assessment matrices. For cloud domains that require increased computer security, network security, or information security, a general purpose cloud is not considered an appropriate choice.

    Technical considerations

    General purpose clouds are often expected to include these base services:

    Compute

    Network

    Storage

    Each of these services has different resource requirements. As a result, you must make design decisions relating directly to the service, as well as provide a balanced infrastructure for all services.

    Consider the unique aspects of each service that requires design since individual characteristics and service mass can impact the hardware selection process. Hardware designs are generated for each type of the following resource pools:

    Compute

    Network

    Storage

    Hardware decisions are also made in relation to network architecture and facilities planning. These factors play heavily into the overall architecture of an OpenStack cloud.

    Designing compute resources

    When designing compute resource pools, a number of factors can impact your design decisions. For example, decisions related to processors, memory, and storage within each hypervisor are just one element of designing compute resources. In addition, decide whether to provide compute resources in a single pool or in multiple pools. We recommend the compute design allocates multiple pools of resources to be addressed on-demand.

    A compute design that allocates multiple pools of resources makes best use of application resources running in the cloud. Each independent resource pool should be designed to provide service for specific flavors of instances or groupings of flavors. Designing multiple resource pools helps to ensure that, as instances are scheduled onto compute hypervisors, each independent node's resources will be allocated to make the most efficient use of available hardware. This is commonly referred to as bin packing.

    Using a consistent hardware design among the nodes that are placed within a resource pool also helps support bin packing. Hardware nodes selected for being a part of a compute resource pool should share a common processor, memory, and storage layout. By choosing a common hardware design, it becomes easier to deploy, support and maintain those nodes throughout their life cycle in the cloud.

    An overcommit ratio is the ratio of available virtual resources compared to the available physical resources. OpenStack is able to configure the overcommit ratio for CPU and memory. The default CPU overcommit ratio is 16:1 and the default memory overcommit ratio is 1.5:1. Determining the tuning of the overcommit ratios for both of these options during the design phase is important as it has a direct impact on the hardware layout of your compute nodes.

    For example, consider that an m1.small instance uses 1 vCPU, 20 GB of ephemeral storage, and 2,048 MB of RAM. When designing a hardware node as a compute resource pool to service instances, take into consideration the number of processor cores available on the node as well as the required disk and memory to service instances running at capacity. For a server with 2 CPUs of 10 cores each, with hyperthreading turned on, the default CPU overcommit ratio of 16:1 would allow for 640 (2 × 10 × 2 × 16) total m1.small instances. By the same reasoning, using the default memory overcommit ratio of 1.5:1 you can determine that the server will need at least 853 GB (640 × 2,048 MB / 1.5) of RAM. When sizing nodes for memory, it is also important to consider the additional memory required to service operating system and service needs.
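
    The same arithmetic can be generalized. The following is a minimal sketch of the sizing calculation, using the example figures from the paragraph above (2 sockets of 10 hyperthreaded cores, the default 16:1 CPU and 1.5:1 memory overcommit ratios, and an m1.small-style flavor); it only illustrates the logic and is not a tool shipped with OpenStack.

        # Rough compute-node sizing from overcommit ratios (illustrative only).

        def max_instances(sockets, cores_per_socket, threads_per_core,
                          cpu_overcommit, vcpus_per_instance):
            """Instances that fit on one node, considering vCPUs alone."""
            physical_threads = sockets * cores_per_socket * threads_per_core
            return (physical_threads * cpu_overcommit) // vcpus_per_instance

        def required_ram_gb(instances, ram_mb_per_instance, ram_overcommit):
            """Physical RAM (GB) needed to back that many instances."""
            return instances * ram_mb_per_instance / ram_overcommit / 1024

        n = max_instances(sockets=2, cores_per_socket=10, threads_per_core=2,
                          cpu_overcommit=16, vcpus_per_instance=1)
        print(n)                                      # 640
        print(round(required_ram_gb(n, 2048, 1.5)))   # 853, before OS and service overhead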

    Processor selection is an extremely important consideration in hardware design, especially when comparing the features and performance characteristics of different processors. Processors can include features specific to virtualized compute hosts, including hardware-assisted virtualization and technology related to memory paging (also known as EPT shadowing). These types of features can have a significant impact on the performance of your virtual machine running in the cloud.

    It is also important to consider the compute requirements of resource nodes within the cloud. Resource nodes refer to non-hypervisor nodes providing the following in the cloud:

    Controller


    Object storage

    Block storage

    Networking services

    The number of processor cores and threads has a direct correlation to the number of worker threads which can be run on a resource node. As a result, you must make design decisions relating directly to the service, as well as provide a balanced infrastructure for all services.
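
    As a rough illustration of that correlation, the sketch below estimates how many worker processes a controller-class node might host if each API service were configured with one worker per physical core. The service list and the one-worker-per-core rule are assumptions made for the example, not defaults stated in this guide.

        # Estimate worker processes on a resource node (assumed services and policy).

        def estimated_workers(cores, services, workers_per_core=1):
            """Workers per service and in total, if each service scales with core count."""
            per_service = cores * workers_per_core
            return per_service, per_service * len(services)

        services = ["nova-api", "glance-api", "keystone", "neutron-server", "cinder-api"]
        per_service, total = estimated_workers(cores=20, services=services)
        print(per_service)   # 20 workers for each service
        print(total)         # 100 worker processes sharing the same 20 cores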

    Workload profiles are unpredictable in a general purpose cloud. Additional compute resource pools can be added to the cloud later, reducing the stress of unpredictability. In some cases, the demand on certain instance types or flavors may not justify individual hardware design. In either of these cases, initiate the design by allocating hardware designs that are capable of servicing the most common instance requests. If you are looking to add additional hardware designs to the overall architecture, this can be done at a later time.

    Designing network resources

    OpenStack clouds traditionally have multiple network segments, each of which provides access to resources within the cloud to both operators and tenants. The network services themselves also require network communication paths which should be separated from the other networks. When designing network services for a general purpose cloud, we recommend planning for a physical or logical separation of network segments that will be used by operators and tenants. We further suggest the creation of an additional network segment for access to internal services such as the message bus and database used by the various cloud services. Segregating these services onto separate networks helps to protect sensitive data and protects against unauthorized access to services.

    Based on the requirements of instances being serviced in the cloud, the choice of network service will be the next decision that affects your design architecture.

    The choice between legacy networking (nova-network), as a part of OpenStack Compute, and OpenStack Networking (neutron), has a huge impact on the architecture and design of the cloud network infrastructure.

    Legacy networking (nova-network): The legacy networking (nova-network) service is primarily a layer-2 networking service that functions in two modes.


    In legacy networking, the two modes differ in their use of VLANs. When using legacy networking in a flat network mode, all network hardware nodes and devices throughout the cloud are connected to a single layer-2 network segment that provides access to application data.

    When the network devices in the cloud support segmentation using VLANs, legacy networking can operate in the second mode. In this design model, each tenant within the cloud is assigned a network subnet which is mapped to a VLAN on the physical network. It is especially important to remember the maximum number of 4096 VLANs which can be used within a spanning tree domain. These limitations place hard limits on the amount of growth possible within the data center. When designing a general purpose cloud intended to support multiple tenants, we recommend the use of legacy networking with VLANs, and not in flat network mode.
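
    To put that ceiling in concrete terms, the short sketch below counts how many VLAN-backed tenant networks fit in one layer-2 domain when each tenant network consumes one VLAN ID. The reserved-ID handling follows the usual IEEE 802.1Q convention (IDs 0 and 4095 are unusable) and is included for illustration only.

        # Upper bound on VLAN-backed tenant networks in a single layer-2 domain.

        VLAN_ID_BITS = 12                 # 802.1Q VLAN IDs are 12 bits wide
        TOTAL_IDS = 2 ** VLAN_ID_BITS     # 4096 possible IDs
        RESERVED_IDS = 2                  # IDs 0 and 4095 cannot be assigned

        print(TOTAL_IDS - RESERVED_IDS)   # at most 4094 tenant networks per domain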

    Another consideration regarding network is the fact that legacy networking is entirely managed by the cloud operator; tenants do not have control over network resources. If tenants require the ability to manage and create network resources such as network segments and subnets, it will be necessary to install the OpenStack Networking service to provide network access to instances.

    OpenStack Networking (neutron): OpenStack Networking (neutron) is a first class networking service that gives full control over creation of virtual network resources to tenants. This is often accomplished in the form of tunneling protocols which will establish encapsulated communication paths over existing network infrastructure in order to segment tenant traffic. These methods vary depending on the specific implementation, but some of the more common methods include tunneling over GRE, encapsulating with VXLAN, and VLAN tags.

    Initially, it is suggested to design at least three network segments, the first of which will be used for access to the cloud's REST APIs by tenants and operators. This is referred to as a public network. In most cases, the controller nodes and swift proxies within the cloud will be the only devices necessary to connect to this network segment. In some cases, this network might also be serviced by hardware load balancers and other network devices.

    The next segment is used by cloud administrators to manage hardware resources and is also used by configuration management tools when deploying software and services onto new hardware. In some cases, this network segment might also be used for internal services, including the message bus and database services, to communicate with each other. Due to the highly secure nature of this network segment, it may be desirable to secure this network from unauthorized access. This network will likely need to communicate with every hardware node within the cloud.

    The last network segment is used by applications and consumers to provide access to the physical network and also for users accessing applications running within the cloud. This network is generally segregated from the one used to access the cloud APIs and is not capable of communicating directly with the hardware resources in the cloud. Compute resource nodes will need to communicate on this network segment, as will any network gateway services which allow application data to access the physical network outside of the cloud.
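
    One way to capture this three-segment split during planning is as simple structured data that can later feed automation. The sketch below only restates the layout described above; the segment names, CIDR ranges, and role lists are placeholders chosen for the example, not values prescribed by this guide.

        # Planning sketch of the three suggested network segments (placeholder values).

        network_segments = {
            "public": {        # tenant and operator access to the REST APIs
                "example_cidr": "203.0.113.0/24",
                "attached_roles": ["controller", "swift-proxy", "load-balancer"],
            },
            "management": {    # admin access, config management, message bus, database
                "example_cidr": "192.0.2.0/24",
                "attached_roles": ["all-hardware-nodes"],
            },
            "tenant-data": {   # application traffic entering and leaving instances
                "example_cidr": "198.51.100.0/24",
                "attached_roles": ["compute", "network-gateway"],
            },
        }

        for name, segment in network_segments.items():
            print(name, segment["example_cidr"], ", ".join(segment["attached_roles"]))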

    Designing storage resources

    OpenStack has two independent storage services to consider, each with its own specific design requirements and goals. In addition to services which provide storage as their primary function, there are additional design considerations with regard to compute and controller nodes which will affect the overall cloud architecture.

    Designing OpenStack Object Storage

    When designing hardware resources for OpenStack Object Storage, the primary goal is to maximize the amount of storage in each resource node while also ensuring that the cost per terabyte is kept to a minimum. This often involves utilizing servers which can hold a large number of spinning disks. Whether choosing to use 2U server form factors with directly attached storage or an external chassis that holds a larger number of drives, the main goal is to maximize the storage available in each node.

    We do not recommend investing in enterprise class drives for an OpenStack Object Storage cluster. The consistency and partition tolerance characteristics of OpenStack Object Storage will ensure that data stays up to date and survives hardware faults without the use of any specialized data replication devices.

    One of the benefits of OpenStack Object Storage is the ability to mix and match drives by making use of weighting within the swift ring. When designing your swift storage cluster, we recommend making use of the most cost effective storage solution available at the time. Many server chassis on the market can hold 60 or more drives in 4U of rack space, therefore we recommend maximizing the amount of storage per rack unit at the best cost per terabyte. Furthermore, we do not recommend the use of RAID controllers in an object storage node.

    To achieve durability and availability of data stored as objects it is important to design object storage resource pools to ensure they can provide the suggested availability. Considering rack-level and zone-level designs to accommodate the number of replicas configured to be stored in the Object Storage service (the default number of replicas is three) is important when designing beyond the hardware node level. Each replica of data should exist in its own availability zone with its own power, cooling, and network resources available to service that specific zone.
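
    Because every object is stored once per replica, raw disk translates into far less usable capacity. The following is a minimal sketch of that calculation, assuming the default of three replicas and an illustrative raw capacity; the figures are examples only, not sizing guidance from this guide.

        # Usable Object Storage capacity for a given replica count (illustrative).

        def usable_capacity_tb(raw_tb, replicas=3):
            """Raw cluster capacity divided by the number of object replicas."""
            return raw_tb / replicas

        raw_tb = 4 * 60 * 6                  # e.g. 4 zones x 60-drive chassis x 6 TB drives
        print(usable_capacity_tb(raw_tb))    # 480.0 TB usable, before operational headroom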

    Object storage nodes should be designed so that the number of requests does not hinder the performance of the cluster. The object storage service is a chatty protocol; therefore, making use of multiple processors that have higher core counts will ensure the IO requests do not inundate the server.

    Designing OpenStack Block Storage

    When designing OpenStack Block Storage resource nodes, it is helpful to understand the workloads and requirements that will drive the use of block storage in the cloud. We recommend designing block storage pools so that tenants can choose appropriate storage solutions for their applications. By creating multiple storage pools of different types, in conjunction with configuring an advanced storage scheduler for the block storage service, it is possible to provide tenants with a large catalog of storage services with a variety of performance levels and redundancy options.


    Block storage also takes advantage of a number of enterprise storage solutions. These are addressed via a plug-in driver developed by the hardware vendor. A large number of enterprise storage plug-in drivers ship out-of-the-box with OpenStack Block Storage (and many more are available via third-party channels). General purpose clouds are more likely to use directly attached storage in the majority of block storage nodes, deeming it necessary to provide additional levels of service to tenants which can only be provided by enterprise class storage solutions.

    Redundancy and availability requirements impact the decision to use a RAID controller card in block storage nodes. The input-output per second (IOPS) demand of your application will influence whether or not you should use a RAID controller, and which level of RAID is required. Making use of higher performing RAID volumes is suggested when considering performance. However, where redundancy of block storage volumes is more important, we recommend making use of a redundant RAID configuration such as RAID 5 or RAID 6. Some specialized features, such as automated replication of block storage volumes, may require the use of third-party plug-ins and enterprise block storage solutions in order to provide the high demand on storage. Furthermore, where extreme performance is a requirement it may also be necessary to make use of high speed SSD disk drives and high performing flash storage solutions.

    Software selection

    The software selection process plays a large role in the architecture of a general purpose cloud. The following have a large impact on the design of the cloud:

    Choice of operating system

    Selection of OpenStack software components

    Choice of hypervisor

    Selection of supplemental software

    Operating system (OS) selection plays a large role in the design and architecture of a cloud. There are a number of OSes which have native support for OpenStack including:

    Ubuntu

    Red Hat Enterprise Linux (RHEL)


    CentOS

    SUSE Linux Enterprise Server (SLES)

    Note

    Native support is not a constraint on the choice of OS; users are free to choose just about any Linux distribution (or even Microsoft Windows) and install OpenStack directly from source (or compile their own packages). However, many organizations will prefer to install OpenStack from distribution-supplied packages or repositories (although using the distribution vendor's OpenStack packages might be a requirement for support).

    OS selection also directly influences hypervisor selection. A cloud architect who selects Ubuntu, RHEL, or SLES has some flexibility in hypervisor; KVM, Xen, and LXC are supported virtualization methods available under OpenStack Compute (nova) on these Linux distributions. However, a cloud architect who selects Hyper-V is limited to Windows Servers. Similarly, a cloud architect who selects XenServer is limited to the CentOS-based dom0 operating system provided with XenServer.

    The primary factors that play into OS-hypervisor selection include:

    User requirements: The selection of OS-hypervisor combination first and foremost needs to support the user requirements.

    Support: The selected OS-hypervisor combination needs to be supported by OpenStack.

    Interoperability: The OS-hypervisor needs to be interoperable with other features and services in the OpenStack design in order to meet the user requirements.

    Hypervisor

    OpenStack supports a wide variety of hypervisors, one or more of which can be used in a single cloud. These hypervisors include:

    KVM (and QEMU)

    XCP/XenServer


    vSphere (vCenter and ESXi)

    Hyper-V

    LXC

    Docker

    Bare-metal

    A complete list of supported hypervisors and their capabilities can be found at the OpenStack Hypervisor Support Matrix.

    We recommend general purpose clouds use hypervisors that support the most general purpose use cases, such as KVM and Xen. More specific hypervisors should be chosen to account for specific functionality or a supported feature requirement. In some cases, there may also be a mandated requirement to run software on a certified hypervisor including solutions from VMware, Microsoft, and Citrix.

    The features offered through the OpenStack cloud platform determine the best choice of a hypervisor. As an example, for a general purpose cloud that predominantly supports a Microsoft-based migration, or is managed by staff that has a particular skill for managing certain hypervisors and operating systems, Hyper-V would be the best available choice. While the decision to use Hyper-V does not limit the ability to run alternative operating systems, be mindful of those that are deemed supported. Each different hypervisor also has its own hardware requirements which may affect the decisions around designing a general purpose cloud. For example, utilizing the live migration feature of VMware, vMotion, requires an installation of vCenter/vSphere and the use of the ESXi hypervisor, which increases the infrastructure requirements.

    In a mixed hypervisor environment, specific aggregates of compute resources, each with defined capabilities, enable workloads to utilize software and hardware specific to their particular requirements. This functionality can be exposed explicitly to the end user, or accessed through defined metadata within a particular flavor of an instance.
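
    The matching between compute aggregates and flavors works on key-value metadata: an aggregate advertises capabilities and a flavor requests them. The snippet below is a simplified illustration of that matching idea only, not the actual OpenStack scheduler code, and the metadata keys and values are invented for the example.

        # Simplified illustration of matching flavor requirements to aggregate metadata.

        aggregates = {
            "kvm-general": {"hypervisor": "kvm"},
            "hyperv-win":  {"hypervisor": "hyperv", "os_family": "windows"},
        }

        def candidate_aggregates(flavor_extra_specs, aggregates):
            """Aggregates whose metadata satisfies every key the flavor requires."""
            return [name for name, meta in aggregates.items()
                    if all(meta.get(k) == v for k, v in flavor_extra_specs.items())]

        flavor = {"hypervisor": "hyperv"}                 # a flavor that asks for Hyper-V hosts
        print(candidate_aggregates(flavor, aggregates))   # ['hyperv-win']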

    OpenStack components

    A general purpose OpenStack cloud design should incorporate the core OpenStack services to provide a wide range of services to end-users. The OpenStack core services recommended in a general purpose cloud are:

    OpenStack Compute (nova)


    OpenStack Networking (neutron)

    OpenStack Image service (glance)

    OpenStack Identity (keystone)

    OpenStack dashboard (horizon)

    Telemetry module (ceilometer)

    A general purpose cloud may also include OpenStack Object Storage (swift) and OpenStack Block Storage (cinder). These may be selected to provide storage to applications and instances.

    Note

    However, depending on the use case, these could be optional.

    Supplemental software

    A general purpose OpenStack deployment consists of more than just OpenStack-specific components. A typical deployment involves services that provide supporting functionality, including databases and message queues, and may also involve software to provide high availability of the OpenStack environment. Design decisions around the underlying message queue might affect the required number of controller services, as well as the technology to provide highly resilient database functionality, such as MariaDB with Galera. In such a scenario, replication of services relies on quorum. Therefore, the underlying database nodes, for example, should consist of at least 3 nodes to account for the recovery of a failed Galera node. When increasing the number of nodes to support a feature of the software, consideration of rack space and switch port density becomes important.

    Where many general purpose deployments use hardware load balancers to provide highly available API access and SSL termination, software solutions, for example HAProxy, can also be considered. It is vital to ensure that such software implementations are also made highly available. High availability can be achieved by using software such as Keepalived or Pacemaker with Corosync. Pacemaker and Corosync can provide active-active or active-passive highly available configuration depending on the specific service in the OpenStack environment. Using this software can affect the design as it assumes at least a 2-node controller infrastructure where one of those nodes may be running certain services in standby mode.


    Memcached is a distributed memory object caching system, and Redis is a key-value store. Both are deployed on general purpose clouds to assist in alleviating load to the Identity service. The memcached service caches tokens, and due to its distributed nature it can help alleviate some bottlenecks to the underlying authentication system. Using memcached or Redis does not affect the overall design of your architecture as they tend to be deployed onto the infrastructure nodes providing the OpenStack services.
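
    The benefit comes from the cache-aside pattern: look a token up in the cache first and only fall back to the authoritative backend on a miss. The sketch below illustrates that pattern with the python-memcached client; the key format, TTL, and the lookup_token_in_backend function are placeholders invented for the example and do not describe the Identity service's real implementation.

        # Cache-aside pattern for token validation (illustrative, not Keystone's code).
        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])    # assumes a local memcached instance

        def lookup_token_in_backend(token_id):
            """Placeholder for the expensive authoritative lookup."""
            return {"token": token_id, "roles": ["member"]}

        def validate_token(token_id, ttl=300):
            cached = mc.get("token:" + token_id)      # fast path: served from memory
            if cached is not None:
                return cached
            data = lookup_token_in_backend(token_id)  # slow path: authoritative backend
            mc.set("token:" + token_id, data, time=ttl)
            return data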

    PerformancePerformance of an OpenStack deployment is dependent on a number offactors related to the infrastructure and controller services. The user re-quirements can be split into general network performance, performanceof compute resources, and performance of storage systems.

    Controller infrastructure

    The Controller infrastructure nodes provide management services to the end user as well as providing services internally for the operating of the cloud. The Controllers run message queuing services that carry system messages between each service. Performance issues related to the message bus would lead to delays in sending that message to where it needs to go. The result of this condition would be delays in operation functions such as spinning up and deleting instances, provisioning new storage volumes and managing network resources. Such delays could adversely affect an application's ability to react to certain conditions, especially when using auto-scaling features. It is important to properly design the hardware used to run the controller infrastructure as outlined above in the Hardware Selection section.

    Performance of the controller services is not limited to processing power; restrictions may also emerge in serving concurrent users. Ensure that the APIs and Horizon services are load tested so that you are able to serve your customers. Particular attention should be paid to the OpenStack Identity service (keystone), which provides authentication and authorization for all services, both internally to OpenStack itself and to end users. This service can lead to a degradation of overall performance if it is not sized appropriately.

    Network performance

    In a general purpose OpenStack cloud, the requirements of the network help determine performance capabilities. For example, small deployments may employ 1 Gigabit Ethernet (GbE) networking, whereas larger installations serving multiple departments or many users would be better architected with 10 GbE networking. The performance of the running instances will be limited by these speeds. It is possible to design OpenStack environments that run a mix of networking capabilities. By utilizing the different interface speeds, the users of the OpenStack environment can choose networks that are fit for their purpose.

    For example, web application instances may run on a public network presented through OpenStack Networking that has 1 GbE capability, whereas the back-end database uses an OpenStack Networking network that has 10 GbE capability to replicate its data. In some cases, the design may incorporate link aggregation for greater throughput.

    Network performance can be boosted considerably by implementing hardware load balancers to provide front-end service to the cloud APIs. The hardware load balancers also perform SSL termination if that is a requirement of your environment. When implementing SSL offloading, it is important to understand the SSL offloading capabilities of the devices selected.

    Compute host

    The choice of hardware specifications used in compute nodes including CPU, memory and disk type directly affects the performance of the instances. Other factors which can directly affect performance include tunable parameters within the OpenStack services, for example the overcommit ratio applied to resources. The defaults in OpenStack Compute set a 16:1 over-commit of the CPU and 1.5 over-commit of the memory. Running at such high ratios leads to an increase in "noisy-neighbor" activity. Care must be taken when sizing your Compute environment to avoid this scenario. For running general purpose OpenStack environments it is possible to keep to the defaults, but make sure to monitor your environment as usage increases.
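
    The arithmetic below shows what those default ratios mean for a single, illustrative compute host; the host specification and flavor size are assumptions chosen for the example, not recommendations.

        # Applying the default Compute overcommit ratios (16:1 CPU, 1.5:1 memory)
        # to an illustrative 24-core, 128 GB compute host.
        def schedulable_capacity(physical_cores, physical_ram_mb,
                                 cpu_ratio=16.0, ram_ratio=1.5):
            """Return the vCPUs and RAM the scheduler treats as available."""
            return physical_cores * cpu_ratio, physical_ram_mb * ram_ratio

        vcpus, ram_mb = schedulable_capacity(physical_cores=24,
                                             physical_ram_mb=128 * 1024)

        # How many 2 vCPU / 4096 MB instances fit before either resource runs out?
        # The flavor size is an assumption for illustration.
        instances = int(min(vcpus // 2, ram_mb // 4096))
        print(f"Schedulable: {vcpus:.0f} vCPUs, {ram_mb:.0f} MB RAM "
              f"-> roughly {instances} instances")
        # On paper this host is memory-bound, but the 16:1 CPU ratio means the
        # physical cores are heavily shared, which is the source of noisy-neighbor
        # contention.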

    Storage performance

    When considering performance of OpenStack Block Storage, hardware and architecture choice is important. Block Storage can use enterprise back-end systems such as NetApp or EMC, scale-out storage such as GlusterFS and Ceph, or simply use the capabilities of directly attached storage in the nodes themselves. Block Storage may be deployed so that traffic traverses the host network, which could affect, and be adversely affected by, the front-side API traffic performance. As such, consider using a dedicated data storage network with dedicated interfaces on the Controller and Compute hosts.

    When considering performance of OpenStack Object Storage, a number of design choices will affect performance. A user's access to the Object Storage is through the proxy services, which sit behind hardware load balancers. By the very nature of a highly resilient storage system, replication of the data will affect performance of the overall system. In this case, 10 GbE (or better) networking is recommended throughout the storage network architecture.

    Availability

    In OpenStack, the infrastructure is integral to providing services and should always be available, especially when operating with SLAs. Ensuring network availability is accomplished by designing the network architecture so that no single point of failure exists. A consideration of the number of switches, routes and redundancies of power should be factored into core infrastructure, as well as the associated bonding of networks to provide diverse routes to your highly available switch infrastructure.

    The OpenStack services themselves should be deployed across multiple servers that do not represent a single point of failure. Ensuring API availability can be achieved by placing these services behind highly available load balancers that have multiple OpenStack servers as members.

    OpenStack lends itself to deployment in a highly available manner where it is expected that at least two servers be utilized. These can run all the services involved, from the message queuing service, for example RabbitMQ or QPID, to an appropriately deployed database service such as MySQL or MariaDB. As services in the cloud are scaled out, back-end services will need to scale too. Monitoring and reporting on server utilization and response times, as well as load testing your systems, will help determine scale-out decisions.

    Care must be taken when deciding on network functionality. Currently, OpenStack supports both the legacy networking (nova-network) system and the newer, extensible OpenStack Networking (neutron). Both have their pros and cons when it comes to providing highly available access. Legacy networking, which provides networking access maintained in the OpenStack Compute code, provides a feature that removes a single point of failure when it comes to routing, and this feature is currently missing in OpenStack Networking. The effect of legacy networking's multi-host functionality restricts failure domains to the host running that instance.


    When using OpenStack Networking, the OpenStack controller servers or separate Networking hosts handle routing. For a deployment that requires features available only in Networking, it is possible to remove this restriction by using third party software that helps maintain highly available L3 routes. Doing so allows for common APIs to control network hardware, or to provide complex multi-tier web applications in a secure manner. It is also possible to completely remove routing from Networking, and instead rely on hardware routing capabilities. In this case, the switching infrastructure must support L3 routing.

    OpenStack Networking and legacy networking both have their advantages and disadvantages. They are both valid and supported options that fit different network deployment models described in the OpenStack Operations Guide.

    Ensure your deployment has adequate back-up capabilities. As an example, in a deployment that has two infrastructure controller nodes, the design should include controller availability: if a single controller is lost, cloud services continue to run from the remaining controller. Where the design has higher availability requirements, it is important to meet those requirements by designing the proper redundancy and availability of controller nodes.

    Application design must also be factored into the capabilities of the underlying cloud infrastructure. If the compute hosts do not provide a seamless live migration capability, then it must be expected that when a compute host fails, the instances on that host and any data local to them will be deleted. Conversely, when providing an expectation to users that instances have a high level of uptime guarantees, the infrastructure must be deployed in a way that eliminates any single point of failure when a compute host disappears. This may include utilizing shared file systems on enterprise storage or OpenStack Block Storage to provide a level of guarantee to match service features.

    For more information on high availability in OpenStack, see the OpenStack High Availability Guide.

    Security

    A security domain comprises users, applications, servers or networks that share common trust requirements and expectations within a system. Typically they have the same authentication and authorization requirements and users.

    These security domains are:


    Public

    Guest

    Management

    Data

    These security domains can be mapped to an OpenStack deployment individually, or combined. For example, some deployment topologies combine both guest and data domains onto one physical network, whereas in other cases these networks are physically separated. In each case, the cloud operator should be aware of the appropriate security concerns. Security domains should be mapped out against your specific OpenStack deployment topology. The domains and their trust requirements depend upon whether the cloud instance is public, private, or hybrid.

    The public security domain is an entirely untrusted area of the cloud infrastructure. It can refer to the Internet as a whole or simply to networks over which you have no authority. This domain should always be considered untrusted.

    Typically used for compute instance-to-instance traffic, the guest security domain handles compute data generated by instances on the cloud but not services that support the operation of the cloud, such as API calls. Public cloud providers and private cloud providers who do not have stringent controls on instance use or who allow unrestricted Internet access to instances should consider this domain to be untrusted. Private cloud providers may want to consider this network as internal and therefore trusted only if they have controls in place to assert that they trust instances and all their tenants.

    The management security domain is where services interact. Sometimes referred to as the "control plane", the networks in this domain transport confidential data such as configuration parameters, user names, and passwords. In most deployments this domain is considered trusted.

    The data security domain is concerned primarily with information pertaining to the storage services within OpenStack. Much of the data that crosses this network has high integrity and confidentiality requirements and, depending on the type of deployment, may also have strong availability requirements. The trust level of this network is heavily dependent on other deployment decisions.

    When deploying OpenStack in an enterprise as a private cloud, it is usually behind the firewall and within the trusted network alongside existing systems. Users of the cloud are, traditionally, employees bound by the security requirements set forth by the company. This tends to push most of the security domains towards a more trusted model. However, when deploying OpenStack in a public-facing role, no assumptions can be made and the attack vectors significantly increase. For example, the API endpoints, along with the software behind them, become vulnerable to bad actors wanting to gain unauthorized access or prevent access to services, which could lead to loss of data, functionality, and reputation. These services must be protected through auditing and appropriate filtering.

    Consideration must be taken when managing the users of the system for both public and private clouds. The Identity service allows for LDAP to be part of the authentication process. Including such systems in an OpenStack deployment may ease user management if integrating into existing systems.

    It's important to understand that user authentication requests include sensitive information including user names, passwords and authentication tokens. For this reason, placing the API services behind hardware that performs SSL termination is strongly recommended.

    For more information on OpenStack security, see the OpenStack Security Guide.

    Operational considerations

    In the planning and design phases of the build out, it is important to include the operations function. Operational factors affect the design choices for a general purpose cloud, and operations staff are often tasked with the maintenance of cloud environments for larger installations.

    Knowing when and where to implement redundancy and high availability is directly affected by expectations set by the terms of the Service Level Agreements (SLAs). SLAs are contractual obligations that provide assurances for service availability. They define the levels of availability that drive the technical design, often with penalties for not meeting contractual obligations.
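
    As a quick illustration of how availability targets translate into design pressure, the sketch below converts a few example SLA levels (the percentages are arbitrary examples) into the downtime they allow per year.

        # Converting SLA availability levels into allowed downtime per year.
        MINUTES_PER_YEAR = 365.25 * 24 * 60

        for availability in (0.999, 0.9995, 0.9999):
            allowed = MINUTES_PER_YEAR * (1 - availability)
            print(f"{availability:.2%} availability -> about {allowed:,.0f} "
                  f"minutes of downtime per year")
        # Tighter targets leave little room for maintenance windows, which is what
        # pushes designs towards redundant controllers and load-balanced APIs.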

    SLA terms that will affect the design include:

    API availability guarantees implying multiple infrastructure services and highly available load balancers.


    Network uptime guarantees affecting switch design, which might require redundant switching and power.

    Network security policy requirements that need to be factored into deployments.

    Support and maintainability

    To be able to support and maintain an installation, OpenStack cloud management requires operations staff to understand the design architecture content. The skill level of the operations and engineering staff, and their level of separation, are dependent on the size and purpose of the installation. Large cloud service providers, or telecom providers, are more likely to be managed by a specially trained, dedicated operations organization. Smaller implementations are more likely to rely on support staff that need to take on combined engineering, design and operations functions.

    Maintaining OpenStack installations requires a variety of technical skills. For example, if you are to incorporate features into an architecture and design that reduce the operations burden, it is advised to automate the operations functions. It may, however, be beneficial to use third party management companies with special expertise in managing OpenStack deployments.

    Monitoring

    OpenStack clouds require appropriate monitoring platforms to ensure errors are caught and managed appropriately. Specific metrics that are critically important to monitor include:

    Image disk utilization

    Response time to the Compute API

    Leveraging existing monitoring systems is an effective check to ensure OpenStack environments can be monitored; a simple probe of the Compute API response time is sketched below.
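
    The following sketch measures Compute API response time with a plain HTTP request; the endpoint URL, token, and alert threshold are placeholders that would come from your deployment and monitoring system.

        # A minimal Compute API response-time probe. The endpoint, token and
        # threshold below are placeholders, not values from a real deployment.
        import time

        import requests

        NOVA_ENDPOINT = "http://controller:8774/v2.1/servers"   # placeholder URL
        HEADERS = {"X-Auth-Token": "REPLACE_WITH_A_VALID_TOKEN"}
        THRESHOLD_SECONDS = 1.0                                  # tune to your SLA

        start = time.monotonic()
        response = requests.get(NOVA_ENDPOINT, headers=HEADERS, timeout=10)
        elapsed = time.monotonic() - start

        print(f"GET /servers -> {response.status_code} in {elapsed:.3f}s")
        if elapsed > THRESHOLD_SECONDS:
            print("WARNING: Compute API response time above threshold")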

    Downtime

    To effectively run cloud installations, initial downtime planning includes creating processes and architectures that support the following:

    Planned (maintenance)

    Unplanned (system faults)


    Resiliency of the overall system and of individual components is dictated by the requirements of the SLA, meaning that designing for high availability (HA) can have cost ramifications.

    For example, if a compute host fails, this becomes an operational consideration, requiring the restoration of instances from a snapshot or respawning an instance. The overall application design is impacted: general purpose clouds should not need to provide the ability to migrate instances from one host to another. Additional considerations need to be made around supporting instance migration if the expectation is that the application will be designed to tolerate failure. Extra support services, including shared storage attached to compute hosts, might need to be deployed in this example.

    Capacity planning

    Capacity constraints for a general purpose cloud environment include:

    Compute limits

    Storage limits

    A relationship exists between the size of the compute environment and the supporting OpenStack infrastructure controller nodes required to support it.

    Increasing the size of the supporting compute environment increases the network traffic and messages, adding load to the controller or networking nodes. Effective monitoring of the environment will help with capacity decisions on scaling.

    Compute nodes automatically attach to OpenStack clouds, resulting in a horizontally scaling process when adding extra compute capacity to an OpenStack cloud. Additional processes are required to place nodes into appropriate availability zones and host aggregates. When adding additional compute nodes to environments, ensure identical or functionally compatible CPUs are used, otherwise live migration features will break. It is necessary to add rack capacity or network switches as scaling out compute hosts directly affects network and datacenter resources.

    Assessing the average workloads and increasing the number of instances that can run within the compute environment by adjusting the overcommit ratio is another option. It is important to remember that changing the CPU overcommit ratio can have a detrimental effect and cause a potential increase in noisy neighbor activity. The additional risk of increasing the overcommit ratio is that more instances fail when a compute host fails.

    Compute host components can also be upgraded to account for increases in demand; this is known as vertical scaling. Upgrading CPUs with more cores, or increasing the overall server memory, can add extra needed capacity depending on whether the running applications are more CPU intensive or memory intensive.

    Insufficient disk capacity could also have a negative effect on overall performance, including CPU and memory usage. Depending on the back-end architecture of the OpenStack Block Storage layer, adding capacity may mean adding disk shelves to enterprise storage systems or installing additional block storage nodes. Upgrading directly attached storage installed in compute hosts, and adding capacity to the shared storage for additional ephemeral storage to instances, may be necessary.

    For a deeper discussion on many of these topics, refer to the OpenStack Operations Guide.

    Architecture

    Hardware selection involves three key areas:

    Compute

    Network

    Storage

    Selecting hardware for a general purpose OpenStack cloud should reflect a cloud with no pre-defined usage model. General purpose clouds are designed to run a wide variety of applications with varying resource usage requirements. These applications include any of the following:

    RAM-intensive

    CPU-intensive

    Storage-intensive

    Choosing hardware for a general purpose OpenStack cloud must provide balanced access to all major resources.

    Certain hardware form factors may better suit a general purpose OpenStack cloud due to the requirement for equal (or nearly equal) balance of resources. Server hardware must provide the following:


    Equal (or nearly equal) balance of compute capacity (RAM and CPU)

    Network capacity (number and speed of links)

    Storage capacity (gigabytes or terabytes, as well as Input/Output Operations Per Second (IOPS))

    Server hardware is evaluated around four conflicting dimensions.

    Server density: A measure of how many servers can fit into a given measure of physical space, such as a rack unit [U].

    Resource capacity: The number of CPU cores, how much RAM, or how much storage a given server will deliver.

    Expandability: The number of additional resources that can be added to a server before it has reached its limit.

    Cost: The relative purchase price of the hardware weighted against the level of design effort needed to build the system.

    Increasing server density means sacrificing resource capacity or expandability; however, increasing resource capacity and expandability increases cost and decreases server density. As a result, determining the best server hardware for a general purpose OpenStack architecture means understanding how the choice of form factor will impact the rest of the design. The following list outlines the form factors to choose from:

    Blade servers typically support dual-socket multi-core CPUs, which is the configuration generally considered to be the "sweet spot" for a general purpose cloud deployment. Blades also offer outstanding density. As an example, both HP BladeSystem and Dell PowerEdge M1000e support up to 16 servers in only 10 rack units. However, the blade servers themselves often have limited storage and networking capacity. Additionally, the expandability of many blade servers can be limited.

    1U rack-mounted servers occupy only a single rack unit. Their benefits include high density, support for dual-socket multi-core CPUs, and support for reasonable RAM amounts. This form factor offers limited storage capacity, limited network capacity, and limited expandability.

    2U rack-mounted servers offer the expanded storage and networking capacity that 1U servers tend to lack, but with a corresponding decrease in server density (half the density offered by 1U rack-mounted servers).


    Larger rack-mounted servers, such as 4U servers, tend to offer even greater CPU capacity, often supporting four or even eight CPU sockets. These servers often have much greater expandability, so they provide the best option for upgradability. This means, however, that the servers have a much lower server density and a much greater hardware cost.

    "Sled servers" are rack-mounted servers that support multiple indepen-dent servers in a single 2U or 3U enclosure. This form factor offers in-creased density over typical 1U-2U rack-mounted servers but tends tosuffer from limitations in the amount of storage or network capacityeach individual server supports.

    The best form factor for server hardware supporting a general purpose OpenStack cloud is driven by outside business and cost factors. No single reference architecture will apply to all implementations; the decision must flow from user requirements, technical considerations, and operational considerations. Here are some of the key factors that influence the selection of server hardware:

    Instance density: Sizing is an important consideration for a general purpose OpenStack cloud. The expected or anticipated number of instances that each hypervisor can host is a common metric used in sizing the deployment. The selected server hardware needs to support the expected or anticipated instance density.

    Host density: Physical data centers have limited physical space, power, and cooling. The number of hosts (or hypervisors) that can be fitted into a given metric (rack, rack unit, or floor tile) is another important method of sizing. Floor weight is an often overlooked consideration: the data center floor must be able to support the weight of the proposed number of hosts within a rack or set of racks. These factors need to be applied as part of the host density calculation and server hardware selection.

    Power density: Data centers have a specified amount of power fed to a given rack or set of racks. Older data centers may have a power density as low as 20 amps per rack, while more recent data centers can be architected to support power densities as high as 120 amps per rack. The selected server hardware must take power density into account; a rough per-rack calculation is sketched after this list.

    Network connectivity: The selected server hardware must have the appropriate number of network connections, as well as the right type of network connections, in order to support the proposed architecture. Ensure that, at a minimum, there are at least two diverse network connections coming into each rack. For architectures requiring even more redundancy, it might be necessary to confirm that the network connections are from diverse telecom providers. Many data centers have that capacity available.
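
    A back-of-the-envelope way to connect power density to host density is to divide the usable rack power by the per-host draw. The feed voltage, per-host wattage, and derating factor below are illustrative assumptions, not measurements.

        # Rough per-rack host count derived from the rack's power feed.
        def hosts_per_rack(rack_amps, feed_volts=208, watts_per_host=650,
                           derating=0.8):
            """Estimate how many hosts a rack's power feed can support.

            The derating factor models the practice of not loading a circuit
            to 100% of its rating.
            """
            usable_watts = rack_amps * feed_volts * derating
            return int(usable_watts // watts_per_host)

        for amps in (20, 120):
            print(f"{amps} A rack feed -> roughly {hosts_per_rack(amps)} hosts")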

    The selection of form factors or architectures affects the selection of server hardware. For example, if the design is a scale-out storage architecture, then the server hardware selection will require careful consideration when matching the requirements set to the commercial solution.

    Ensure that the selected server hardware is configured to support enough storage capacity (or storage expandability) to match the requirements of the selected scale-out storage solution. For example, if a centralized storage solution is required, such as a centralized storage array from a storage vendor that has InfiniBand or FDDI connections, the server hardware will need to have appropriate network adapters installed to be compatible with the storage array vendor's specifications.

    Similarly, the network architecture will have an impact on the server hardware selection and vice versa. For example, make sure that the server is configured with enough additional network ports and expansion cards to support all of the networks required. There is variability in network expansion cards, so it is important to be aware of potential impacts or interoperability issues with other components in the architecture.

    Selecting storage hardware

    Storage hardware architecture is largely determined by the selected storage architecture. The selection of storage architecture, as well as the corresponding storage hardware, is determined by evaluating possible solutions against the critical factors, the user requirements, technical considerations, and operational considerations. Factors that need to be incorporated into the storage architecture include:

    Cost: Storage can be a significant portion of the overall system cost. For an organization that is concerned with vendor support, a commercial storage solution is advisable, although it comes with a higher price tag. If initial capital expenditure requires minimization, designing a system based on commodity hardware would apply. The trade-off is potentially higher support costs and a greater risk of incompatibility and interoperability issues.

    Scalability: Scalability, along with expandability, is a major consideration in a general purpose OpenStack cloud. It might be difficult to predict the final intended size of the implementation as there are no established usage patterns for a general purpose cloud. It might become necessary to expand the initial deployment in order to accommodate growth and user demand.

    Expandability: Expandability is a major architecture factor for storage solutions with a general purpose OpenStack cloud. A storage solution that expands to 50 PB is considered more expandable than a solution that only scales to 10 PB. This metric is related to, but different from, scalability, which is a measure of the solution's performance as it expands. For example, the storage architecture for a cloud that is intended for a development platform may not have the same expandability and scalability requirements as a cloud that is intended for a commercial product.

    Using a scale-out storage solution with direct-attached storage (DAS) in the servers is well suited for a general purpose OpenStack cloud. For example, it is possible to populate storage either in the compute hosts, similar to a grid computing solution, or in hosts dedicated to providing block storage exclusively. When deploying storage in the compute hosts, appropriate hardware that can support both the storage and compute services on the same hardware will be required.

    Understanding the requirements of the cloud services helps determine which scale-out solution should be used, and whether a single, highly expandable, vertically scalable, centralized storage array should be included in the design. Once an approach has been determined, the storage hardware needs to be selected based on these criteria.

    This list expands upon the potential impacts of including a particular storage architecture (and corresponding storage hardware) in the design for a general purpose OpenStack cloud:

    Connectivity: Ensure that, if storage protocols other than Ethernet are part of the storage solution, the appropriate hardware has been selected. If a centralized storage array is selected, ensure that the hypervisor will be able to connect to that storage array for image storage.

    Usage: How the particular storage architecture will be used is critical for determining the architecture. Some of the configurations that will influence the architecture include whether it will be used by the hypervisors for ephemeral instance storage or if OpenStack Object Storage will use it for object storage.

    Instance and image locations: Where instances and images will be stored will influence the architecture.

    Server hardware: If the solution is a scale-out storage architecture that includes DAS, it will affect the server hardware selection. This could ripple into the decisions that affect host density, instance density, power density, OS-hypervisor, management tools and others.

    A general purpose OpenStack cloud has multiple storage options. The key factors that will influence the selection of storage hardware for a general purpose OpenStack cloud are as follows:

    Capacity: Hardware resources selected for the resource nodes should be capable of supporting enough storage for the cloud services. Defining the initial requirements and ensuring the design can support adding capacity is important. Hardware nodes selected for object storage should be capable of supporting a large number of inexpensive disks with no reliance on RAID controller cards. Hardware nodes selected for block storage should be capable of supporting high speed storage solutions and RAID controller cards to provide performance and redundancy to storage at a hardware level. Selecting hardware RAID controllers that automatically repair damaged arrays will assist with the replacement and repair of degraded or destroyed storage devices.

    Performance: Disks selected for object storage services do not need to be fast performing disks. We recommend that object storage nodes take advantage of the best cost per terabyte available for storage (a rough comparison is sketched after this list). In contrast, disks chosen for block storage services should take advantage of performance boosting features that may entail the use of SSDs or flash storage to provide high performance block storage pools. Storage performance of ephemeral disks used for instances should also be taken into consideration. If compute pools are expected to have a high utilization of ephemeral storage, or require very high performance, it would be advantageous to deploy similar hardware solutions to block storage.

    Fault tolerance: Object storage resource nodes have no requirements for hardware fault tolerance or RAID controllers. It is not necessary to plan for fault tolerance within the object storage hardware because the object storage service provides replication between zones as a feature of the service. Block storage nodes, compute nodes and cloud controllers should all have fault tolerance built in at the hardware level by making use of hardware RAID controllers and varying levels of RAID configuration. The level of RAID chosen should be consistent with the performance and availability requirements of the cloud.
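
    To make the cost-per-terabyte point concrete, the sketch below compares two hypothetical drive options; the capacities and prices are placeholders, not vendor quotes.

        # Illustrative cost-per-terabyte comparison for storage node sizing.
        drives = {
            "nearline HDD (object storage)": {"capacity_tb": 8.0, "price_usd": 220.0},
            "enterprise SSD (block storage)": {"capacity_tb": 1.92, "price_usd": 400.0},
        }

        for name, spec in drives.items():
            per_tb = spec["price_usd"] / spec["capacity_tb"]
            print(f"{name}: ${per_tb:,.0f} per TB")
        # Object storage favours the lowest cost per terabyte; block storage pays
        # the SSD premium for IOPS and latency.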

    Selecting networking hardware

    Selecting the network architecture determines which network hardware will be used. Networking software is determined by the selected networking hardware. For example, selecting networking hardware that only supports Gigabit Ethernet (GbE) will impact the overall design. Similarly, deciding to use 10 Gigabit Ethernet (10 GbE) will have a number of impacts on various areas of the overall design.

    There are more subtle design impacts that need to be considered. The selection of certain networking hardware (and the networking software) affects the management tools that can be used. There are exceptions to this; the rise of "open" networking software that supports a range of networking hardware means that there are instances where the relationship between networking hardware and networking software is not as tightly defined. An example of this type of software is Cumulus Linux, which is capable of running on a number of switch vendors' hardware solutions.

    Some of the key considerations that should be included in the selection of networking hardware include:

    Port count T