openstack at ebsco

OpenStack at Ebsco Nate Baechtold, IT Architect Ebsco Information Services August 23, 2016

Upload: tesora

Post on 13-Apr-2017


TRANSCRIPT

Page 1: OpenStack at EBSCO

OpenStack at Ebsco

Nate Baechtold, IT Architect
Ebsco Information Services
August 23, 2016

Page 2: OpenStack at EBSCO



• The leading discovery service provider for libraries worldwide with more than 10,000 discovery customers in over 100 countries.

• Preeminent provider of online research content for libraries, including hundreds of research databases, historical archives, point-of-care medical reference, and corporate learning tools serving millions of end users at tens of thousands of institutions.

• Leading provider of electronic journals & books for libraries, with more than 360,000 serials, including more than 57,000 e-journals, as well as online access to more than 800,000 e-books.

Page 3: OpenStack at EBSCO


What did we need?

• Self-service infrastructure for all development teams.

• Full-stack automation for all environments.

• Increased agility and productivity of operations and development teams.

• Lower costs by leveraging open source solutions.

• A solution that integrates well with other products and allows other products and tools to easily integrate with it.

Page 4: OpenStack at EBSCO


Why OpenStack?

• Easy to consume API that commoditizes infrastructure with the same methodology used by public clouds.

• Abstraction of underlying infrastructure allowing for configuration or hardware differences to not propagate to consumers and automation.

• Standardized interface for compute, network and storage

• When software supports OpenStack it tends to “just work”

• Allows us to build an IaaS platform fit for live services and safely hand out access to diverse teams through built in project isolation.

• Prefer to tell consumers that “if you break it then it is our fault” rather than giving them a long list of things that they should never do.

Page 5: OpenStack at EBSCO


Current Scale

• 3 OpenStack clouds

• Approximately 1,100 running instances

• Almost 500,000 instances created and destroyed since general availability

• 68% of workloads concentrated in development environments

• Around 1/3 of all virtualized workloads currently on OpenStack

[Pie chart: Distribution By Running Instance. Segments: DevQa, Live DC 1, Live DC 2. Values shown: 68%, 10%, 22%.]
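As a rough back-of-the-envelope reading of the figures on this slide, the sketch below derives a few estimates. The inputs are the slide's approximate numbers. The derived values are not figures from the talk, and the calculation assumes that "workloads" can be counted as running instances.

```python
# Approximate inputs taken from the slide.
running_instances = 1100      # "approximately 1100 running instances"
devqa_share = 0.68            # share of running instances in DevQa
openstack_fraction = 1 / 3    # "around 1/3 of all virtualized workloads"

# Derived estimates (assumption: one workload == one running instance).
devqa_instances = round(running_instances * devqa_share)
live_instances = running_instances - devqa_instances
total_virtualized = round(running_instances / openstack_fraction)

print(f"DevQa instances: ~{devqa_instances}")
print(f"Live instances (both datacenters): ~{live_instances}")
print(f"All virtualized workloads: ~{total_virtualized}")
```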

Page 6: OpenStack at EBSCO


Design Philosophy

• Build a platform to run production applications.

• Multi-tenant at its core

• Should be able to safely support development and operations teams sharing the same cloud.

• All tools needed to build a highly available production application need to be available

• Good enough for development but not production is not an acceptable permanent state.

• Build general purpose solutions. Customize as little as possible.

• Provide an easy menu of infrastructure offerings

• Easy to use solution with safeguards to encourage experimentation

• Development is easier when you don’t need to worry about breaking the environment

Page 7: OpenStack at EBSCO


Current Architecture

[Diagram: Ebsco Private Cloud Platform. Components shown: OpenStack Cloud (Nova, Neutron, Cinder, Glance, Keystone, Heat, Ceilometer, Horizon), Load Balancing, Monitoring, Operations, Dashboards.]

Page 8: OpenStack at EBSCO

What we learned…

Page 9: OpenStack at EBSCO


Problems to Solve:

• Skills and training

• Selection of vendors and integrations

• Deployment

• Adoption

• Productionization

Page 10: OpenStack at EBSCO


Skills and training:

• OpenStack skills are VERY hard to hire

• Administration requires good Linux experience

• Inexperienced administrators can cause huge amounts of damage

Our Experiences

• Internally develop a core group of OpenStack SMEs before progressing too far.

• Do not waste learning opportunities by relying too much on professional services.

• Look for candidates with strong Linux, networking, virtualization and Python skills rather than OpenStack experience.

• Give your team the time and opportunity to experiment and learn how OpenStack works.

• Vendor support lowers the amount of expertise you need to go to production.

Page 11: OpenStack at EBSCO


Vendors and integrations:

• Tons of vendor integrations with varying degrees of quality

• Many established vendors

• Users need access to everything that they need to deploy and manage a highly available production application

Our Experiences

• Prefer products that align with OpenStack’s multi-tenancy model whenever possible.

• Focus on vendors building for cloud rather than trying to integrate it afterwards.

• Look at areas to improve everywhere in the stack. Re-evaluate your product decisions. There is high value when an integration is done right.

• You will not know how good a vendor’s integration is until you try it. There can be many hidden landmines with missing capabilities or API support.

Page 12: OpenStack at EBSCO


Case Study – Existing Load Balancing

• Existing vendor had limited OpenStack knowledge and bare bones integration at the time.

• Actual quote from support after a bug was discovered (vendor specific lines edited)

• “For now, to avoid a failover, I would recommend to program the OpenStack not to delete IPs.”

• LBaaS v1 was extremely limited. Would not have covered all production use cases.

• Product did not support safe multi-tenancy. There were shared resources that were a point of failure.

• Prolonged evaluation period of 6-8 months resulting in rejection.

Page 13: OpenStack at EBSCO


Case Study – Cloud Load Balancer (AVI)

• Installation involves providing OpenStack credentials and it handles the rest.

• Allowed us to make production-grade load balancing generally available in development within a week and in production within a month.

• Multi-tenancy model aligns with OpenStack projects and with Keystone

• Nobody had to ask for access. If you had access to OpenStack then you had access to a load balancing service.

• No fighting with permissions or concerns with preventing untrained users from damaging the environment.

Page 14: OpenStack at EBSCO


Problems to Solve:

• Deployment

• Deployments take a long time and are complex

• Some OpenStack functionality is not ready for production

Our Experiences

• Align resources for storage, networking and datacenter teams and make sure that someone on each team will make troubleshooting installation issues a top priority.

• OpenStack requires tight integration with all of these elements. A slow troubleshooting feedback loop will have a very negative effect on the deployment.

• Understand which deployment choices are difficult to change afterwards and make sure that you get them right.

• Assume multiple tries to get a production-ready configuration.

Page 15: OpenStack at EBSCO


Problems to Solve:

• Adoption

• Adoption is one of the most critical elements to success.

Our Experiences

• Have a close relationship with your early adopters. They will help you increase the resiliency of your deployment.

• Regularly speak with them in person to help them understand OpenStack and to let them tell you about issues before they become a problem.

• Get deployments into your users’ hands as soon as possible.

• Do not stall getting to production. Teams will not want to code to an API that they cannot use in production.

• Adoption will be limited until you can get production availability.

• Solving problems “just for development environments” is the wrong mentality.

• Early feedback is critical.

Page 16: OpenStack at EBSCO


Problems to Solve:

• Productionization

• OpenStack provides building blocks but some assembly is required to build a product out of it.

• Monitoring and common operational tasks are not solved out of the box.

Our Experiences

• Monitor OpenStack by actually using OpenStack. Build instances and use OpenStack functionality to detect failures.

• OpenStack is very complex and understanding the effect of a failure can be difficult.

• If you monitor by using OpenStack you will catch most failures before your users do and know what functionality is impacted.

• Automate common operational and maintenance tasks.

• OpenStack HA is complex but needed for all environments.
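The idea of monitoring OpenStack by using it can be sketched as a synthetic lifecycle check: run the same steps a user would, record the first step that fails, and always clean up. This is a minimal sketch and not the framework described in the talk. The step names and the stub functions are hypothetical. In a real check each step would call the OpenStack API, for example to boot an instance.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    ok: bool
    failed_step: Optional[str]  # first step that failed, if any
    seconds: float              # wall-clock duration of the whole check


def run_lifecycle_check(steps: List[Tuple[str, Callable[[], None]]],
                        cleanup: Callable[[], None]) -> CheckResult:
    """Run steps in order, stop at the first failure, then clean up."""
    start = time.monotonic()
    failed: Optional[str] = None
    for name, action in steps:
        try:
            action()
        except Exception:
            failed = name  # tells the operator which functionality is impacted
            break
    try:
        cleanup()  # remove test resources even when a step failed
    except Exception:
        failed = failed or "cleanup"
    return CheckResult(failed is None, failed, time.monotonic() - start)


# Hypothetical stubs standing in for real OpenStack API calls.
def _ok() -> None:
    pass


def _broken_network() -> None:
    raise RuntimeError("port never became ACTIVE")


cleaned = []
result = run_lifecycle_check(
    steps=[
        ("create_volume", _ok),
        ("boot_instance", _ok),
        ("attach_network", _broken_network),
        ("ping_instance", _ok),
    ],
    cleanup=lambda: cleaned.append(True),
)
print(result.ok, result.failed_step)
```

Reporting the failed step rather than a plain pass or fail is what lets the check say which functionality is impacted.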

Page 17: OpenStack at EBSCO

What we did…

Page 18: OpenStack at EBSCO


Phased Environments…

Prototype

• Single-machine all-in-one deployment

• Learn basics

• Validate direction

• Disposable environment

Interim

• Break apart compute and control

• Limited release to early adopters

• Get feedback and determine desired configuration

DevQa

• Highly available environment

• Treated like production

• General availability for development workloads

• Determine productionization tasks needed

Production

• Implement productionization tasks

• Deploy production clouds

Page 19: OpenStack at EBSCO


What wound up happening…

[Timeline figure: Prototype, Interim, DevQa, Production]

Page 20: OpenStack at EBSCO


Took too long to get to production…

• Critical team member left

• Took too long finding a replacement due to focus on hiring OpenStack skillset.

• Additional work on monitoring and operations automation was required before we were confident hosting production workloads.

• This work required skillsets that were not a part of the OpenStack team, and it required focused manpower.

Page 21: OpenStack at EBSCO


Solution: Create a focus squad

• Kicked off a 6-week effort with a cross-functional team that had all required skills.

• This team would focus 100% on getting OpenStack to live.

• OpenStack tasks must be top priority for all team members.

• Director quote: “Set your email to out of office if you have to”

• The focused effort was incredibly efficient.

• Feedback loops for troubleshooting were massively reduced.

• Reduction of blocked tasks created a higher quality implementation.

Page 22: OpenStack at EBSCO


What did the focus squad do?

• Created a reliable monitoring solution based on Zabbix and a Python framework for executing OpenStack checks.

• Created automated recovery for problems discovered in DevQa.

• Automated compute node evacuation

• Automated failed OpenStack service recovery

• Increased visibility into the environment with Zabbix and Grafana.

• Automated common operational tasks to push-button jobs in Rundeck.

• Taking a compute or control node out of service

• Restarting OpenStack services

• Deployed all production OpenStack, Zabbix and Rundeck infrastructure.
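As an illustration of what automating compute node evacuation involves, the sketch below does a capacity pre-check: it greedily places the largest instances first onto whichever remaining host has the most free vCPUs. This is not the automation built by the focus squad. The data model, host names and instance names are hypothetical, and a real deployment would normally leave the final placement to the Nova scheduler.

```python
from typing import Dict, List, Tuple


def plan_evacuation(instances: List[Tuple[str, int]],
                    free_vcpus: Dict[str, int]) -> Dict[str, str]:
    """Map each (instance name, vCPUs) on a failed host to a target host.

    Greedy: largest instance first, onto the host with the most free vCPUs.
    Raises RuntimeError if the remaining hosts cannot absorb the load.
    """
    if not free_vcpus:
        raise RuntimeError("no target hosts available")
    remaining = dict(free_vcpus)  # copy so the caller's data is untouched
    plan: Dict[str, str] = {}
    for name, vcpus in sorted(instances, key=lambda i: i[1], reverse=True):
        target = max(remaining, key=lambda host: remaining[host])
        if remaining[target] < vcpus:
            raise RuntimeError(f"no host can fit {name} ({vcpus} vCPUs)")
        plan[name] = target
        remaining[target] -= vcpus
    return plan


# Hypothetical example: one compute node has failed and hosted three instances.
plan = plan_evacuation(
    instances=[("web-1", 2), ("db-1", 4), ("cache-1", 1)],
    free_vcpus={"compute-02": 4, "compute-03": 6},
)
print(plan)
```

Running the plan before taking a node out of service turns a capacity shortfall into an early error rather than a failed evacuation.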

Page 23: OpenStack at EBSCO


Tracking Success…

• Critical to getting continued commitment but hard to determine.

• We track the following metrics:

• Instance count and resource usage

• Number of teams and products leveraging OpenStack

• The number of instances created and deleted

• This can be a good indicator as to whether OpenStack was the right fit for your organization. Indicates people using automation as opposed to manual usage.
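The metrics listed above can be derived from a simple log of instance lifecycle events. The sketch below is one way to compute them. The `Event` record, the project names and the sample data are hypothetical, and the ratio of instances created to instances running is an illustrative stand-in for "automation as opposed to manual usage", not a metric named in the talk.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional, Union


class Event(NamedTuple):
    project: str      # team or product that owns the instance
    instance_id: str
    action: str       # "create" or "delete"


def summarize(events: List[Event]) -> Dict[str, Union[int, Optional[float]]]:
    """Aggregate lifecycle events into the success metrics from the slide."""
    counts = Counter(e.action for e in events)
    running = set()
    for e in events:
        if e.action == "create":
            running.add(e.instance_id)
        elif e.action == "delete":
            running.discard(e.instance_id)
    created = counts["create"]
    return {
        "created": created,
        "deleted": counts["delete"],
        "running": len(running),
        "projects": len({e.project for e in events}),
        # High churn suggests instances are built and torn down by automation.
        "created_per_running": created / len(running) if running else None,
    }


# Hypothetical sample data.
events = [
    Event("search", "i-1", "create"),
    Event("search", "i-1", "delete"),
    Event("search", "i-2", "create"),
    Event("billing", "i-3", "create"),
    Event("billing", "i-3", "delete"),
    Event("billing", "i-4", "create"),
    Event("search", "i-5", "create"),
    Event("search", "i-5", "delete"),
]
metrics = summarize(events)
print(metrics)
```

For comparison, the figures quoted earlier in the talk (almost 500,000 instances created against roughly 1,100 running) correspond to a very high ratio of this kind.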

Page 24: OpenStack at EBSCO

Thank You

Questions?