IT shouldn’t be a cost-center with Mesos

Imran Shaikh, Lead/Architect

Blog: http://elasticcompute.io
@imranshaikh

LinuxCon 2016

About Me
• Lead/Architect
• Leading the Mesos & Containers initiative at YP
• Manages an infrastructure of thousands of servers across multiple data centers
• Has presented on containers & solutions at various conferences
  – USENIX LISA, 2015
  – MesosCon, 2015 and 2016
  – SCALE (Southern California Linux Expo), 2016

Agenda
• Ops as a cost-center
• Common Ops problems
  – Static provisioning
  – Wasted capacity
  – Maintenance window
  – Silo’ed teams
• DevOps as a solution
• DevOps in practice: Mesos
• How does Mesos solve these problems?
• How can you benefit?
• Future Ahead
• Q/A

Ops as cost-center
• Doesn't produce direct profit for the company
• Removing or scaling down Ops has a detrimental effect on the profit margin
• Typically cost-centers are given autonomy
• Typically Ops bosses are responsible for
  – managing financial performance
  – keeping it under budget
  – accounting for expenditures
• Over time, Ops is streamlined, processes are improved, etc.
  – thereby reducing the overall cost
• All the C-suite bosses are happy to write the fat checks
• In the end, the C-suite boss decides which cost-center grows and which one gets slashed
• This is basically what is happening at companies everywhere

Ops as cost-center
• Why is it hard to change?
  – The mindset is to keep the lights on
  – 70 to 80% of the budget goes to maintaining existing infrastructure and applications
  – That leaves very little room for the Ops org to pursue a new direction
• Since it cannot be changed (and there is a constant urge to improve):
  – Cloud computing, outsourcing and offshoring have become common
  – They made maintenance costs more predictable and easier to measure and manage
• The argument is that by doing so (Hiner, J. 2014), Ops can focus more on
  – exploring new vendors
  – upgrading software
  – looking for new solutions that could save the company money, leapfrog competitors, or break into new markets

Ops as cost-center
• I have a genuine problem with the word “cost-center” because I don’t consider myself a burden
• I consider myself as:
  – Providing value to you
  – Serving you
  – Wanting to improve your product
  – If I can’t replace you, I want to at least augment you
• If I can live by that motto, why can’t the dept. I work in do the same?

Common Ops problems

Problem 1: Static provisioning - resources

Static provisioning - resources

Fig: (Angry gorilla, n.d.).

Consider your app as the hero and the footprint it requires to run as the villain. Initially, your villain is small & timid.

As your product grows, so does your villain. You need to scale your resources vertically.

And once your product becomes mature enough, you have grown your villain so much that it becomes invincible.

And if you have more products, you have more villains.

And at some point, you will be running datacenters full of these villains.

So what do you do next? You hire people like me to manage these villains.

Problem 2: Static provisioning - people

Static provisioning - people
• SysAdmins
• DBAs
• SREs
• Network Engrs
• Tools Engineer

Problem 3: Wasted capacity

Wasted capacity
• When you run apps on dedicated hosts, only approx. 20% of resources (CPU, memory, etc.) get utilized
• All of the remainder goes to waste
• The reason is that there are no good isolation techniques for running multiple apps on a single host
• Making multi-tenant apps behave on the same node is a difficult challenge
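To put the 20% figure in perspective, here is a back-of-the-envelope sketch in Python. The fleet size (100 hosts) and the 80% target are made-up numbers for illustration only, not measurements from YP, and the function name `hosts_needed` is hypothetical.

```python
import math


def hosts_needed(dedicated_hosts: int, current_util: float, target_util: float) -> int:
    """Hosts required to carry the same load at a higher utilization.

    Assumes identical hosts and perfectly divisible load, which real
    clusters never quite achieve, so treat the result as a lower bound.
    """
    if not 0 < current_util <= target_util <= 1:
        raise ValueError("expected 0 < current_util <= target_util <= 1")
    busy_capacity = dedicated_hosts * current_util  # load in "whole host" units
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(busy_capacity / target_util, 9))


# Hypothetical fleet: 100 dedicated hosts at the ~20% utilization quoted above.
before = 100
after = hosts_needed(before, current_util=0.20, target_util=0.80)
print(f"idle capacity today: {1 - 0.20:.0%}")
print(f"hosts needed at 80% utilization: {after} (down from {before})")
```

The point is only the ratio: the same load that occupies a fleet at 20% utilization fits on roughly a quarter of the hosts if they can be safely shared.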

Problem 4: Maintenance Window

Maintenance window
• Maintenance of infrastructure requires days, if not weeks, of planning
• A thorough launch plan is designed
• All the stakeholders are crammed into a war room to handle their respective parts
• A whole army of Ops people gets involved if something goes wrong
  – SysAdmins
  – DBAs
  – Network Engineers
  – Storage Admins
  – SREs
  – Operation Center
  – Developers
  – QAs
  – PMs
• Longer than expected downtime
  – Frustrations
  – Tons of overtime pay
  – Less than favorable work/personal life balance

Problem 5: Silo’ed teams

Silo’ed teams
• The app that gets built by a developer is completely different from what runs in prod
• Dev & Ops are completely isolated worlds
• The Ops team massages it, adds configurations, custom deployment tools, etc.
• Ops has designed checks, UIs, policies and processes to monitor, scale or view the performance of those apps

Problem 6: Ops Rigidity

Ops Rigidity
• Devs have no window into it and have no idea what happens to their app in prod
• Devs are completely unaware of production. There is no feedback mechanism.
• That ends up producing poorly written apps
• Running apps shouldn’t be Ops’ forte
  – Devs know more about their apps
• Help Devs run and manage their apps. Ops should focus on securing and managing the underlying infrastructure and systems.
• Empower them. Don’t handicap them.

How to make Ops a profit-center?

• The answer is DevOps
• I know it is such a cliché and a management buzzword
• There are lots of theories and best practices floating around about how to go about it
• None of the DevOps best practices yields immediate results

DevOps theories
• Dev & Ops collaboration
• Treating “Infrastructure as Code” (Riley, C. 2014)
• Using automation
• Culture
• Using Sprints, Agile or Kanban for Ops work
• Using tools like Jenkins, Chef/Puppet, Vagrant, Docker, etcd, etc.

How does DevOps look in practice?

• All these theories sound great, but what is the practical solution?
• Tell me which tool or suite of tools encompasses all the DevOps best practices?
• The answer is Mesos

static partitioning - resources
[Figure labels: Product 1, Product 2, Product 3; Host, Host]

!static partitioning - resources
[Figure labels: Apps, Message Queues, Build pipeline jobs, Map reduce jobs, Batch processing, NoSQL db]

static partitioning - people
[Figure labels: Servers, Storage, Applications, Databases, Network; Systems Administrators, Storage Admins, Network Engrs, Developers]

!static partitioning - people
[Figure labels: Servers, Storage, Applications, Databases, Network; Systems Administrators, Storage Admins, Network Engrs, Developers]

• More secure
• HA, scalable, fault tolerant
• Manage the environment

• Self-serve storage
• DFS, NFS or block storage solutions

• Visibility for devs
• Service discovery solutions

• Persistent storage
• Solutions that use persistent storage for

• Layer 3 virtual networking
• Overlay networks
• Solutions that provide IPs to containers

• More ops aware
• Write better apps that perform in production
• Auto-scaling

Maintenance Window
[Figure labels: Product 1, Product 2, Product 3; Host, Host]
• Notifications sent out
• A whole army of Ops and Dev teams is huddled up
• Traffic is shifted
• Apps bounced
• Revenue loss
• And a whole lot of passing the buck & post-mortem analysis

!Maintenance window
[Figure labels: Apps, Message Queues, Build pipeline jobs, Map reduce jobs, Batch processing jobs, NoSQL db, RDBMS]

!Ops Rigidity
• In Mesos, everything gets opened up
• There is no separate Ops world
• Use Marathon to see what is running in production and the # of instances, and to scale
• Use Chronos to submit batch or cron jobs
• Run the build pipeline on Jenkins
• Run message queue brokers and scale them with the pool of resources you have
• Employ the ELK stack to view logs in real time
• Employ metrics solutions to see performance metrics of your apps (containers)
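As a rough illustration of what "use Marathon" and "use Chronos" look like for a developer, the Python sketch below builds the JSON bodies one would send to their REST APIs. It only prints the payloads and makes no network calls. The app id, commands, owner address and schedule are made-up values; the field names follow the Marathon app definition and Chronos job format as commonly documented, but check them against the versions you run.

```python
import json


def marathon_app(app_id: str, cmd: str, cpus: float, mem_mb: int, instances: int) -> dict:
    """Minimal Marathon app definition (request body for POST /v2/apps)."""
    return {"id": app_id, "cmd": cmd, "cpus": cpus, "mem": mem_mb, "instances": instances}


def marathon_scale(instances: int) -> dict:
    """Body for PUT /v2/apps/<app id> that changes only the instance count."""
    return {"instances": instances}


def chronos_job(name: str, command: str, schedule: str, owner: str) -> dict:
    """Minimal scheduled Chronos job.

    `schedule` is an ISO 8601 repeating interval: R<count>/<start>/<period>.
    An empty count ("R/") means repeat indefinitely.
    """
    return {"name": name, "command": command, "schedule": schedule, "owner": owner}


# Hypothetical values throughout; nothing here is taken from YP's setup.
app = marathon_app("/web/frontend", "python3 -m http.server 8080",
                   cpus=0.25, mem_mb=128, instances=2)
job = chronos_job("nightly-report", "/opt/reports/run.sh",
                  schedule="R/2016-08-22T02:00:00Z/P1D", owner="dev@example.com")

print("POST /v2/apps             ", json.dumps(app))
print("PUT  /v2/apps/web/frontend", json.dumps(marathon_scale(5)))
print("Chronos job               ", json.dumps(job))
```

The design point matches the slide: a developer describes the resources and instance count in a small declarative document and can change the scale with a one-field update, rather than filing a ticket with Ops.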

!Wasted Capacity
• With Mesos, you can reclaim most of that wasted capacity
• You can technically run 10s or 100s of apps on a single machine
• To isolate each app, containerize it
• With containers, you can rate-limit or meter CPU usage, memory usage, disk I/O, network bandwidth, disk usage, etc.
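Once every app declares a CPU and memory cap, packing many of them onto one machine becomes simple bookkeeping. The Python sketch below uses a plain first-fit placement to show the idea. It is not Mesos's actual allocation algorithm, and the host sizes and per-app limits are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    cpus: float          # free CPUs remaining
    mem_mb: int          # free memory remaining, in MB
    apps: list = field(default_factory=list)

    def fits(self, cpus: float, mem_mb: int) -> bool:
        return cpus <= self.cpus and mem_mb <= self.mem_mb

    def place(self, name: str, cpus: float, mem_mb: int) -> None:
        self.cpus -= cpus
        self.mem_mb -= mem_mb
        self.apps.append(name)


def first_fit(apps, hosts):
    """Place each (name, cpus, mem_mb) on the first host with room.

    Returns the names of apps that did not fit anywhere.
    """
    unplaced = []
    for name, cpus, mem_mb in apps:
        for host in hosts:
            if host.fits(cpus, mem_mb):
                host.place(name, cpus, mem_mb)
                break
        else:
            unplaced.append(name)
    return unplaced


# Hypothetical workload: 40 small containers, each capped at 0.5 CPU and 512 MB.
apps = [(f"app-{i}", 0.5, 512) for i in range(40)]
hosts = [Host(cpus=16, mem_mb=65536), Host(cpus=16, mem_mb=65536)]
leftover = first_fit(apps, hosts)
for i, h in enumerate(hosts):
    print(f"host {i}: {len(h.apps)} apps, {h.cpus} CPUs and {h.mem_mb} MB free")
print("unplaced:", leftover)
```

Without enforced per-container limits, the `fits` check would be meaningless, which is why isolation is the prerequisite for sharing hosts.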

!Silo’ed teams
• Now that everybody uses the same infrastructure and processes, there are no silo’ed teams
• Ops and Dev also convert their existing apps to run on this unified cluster
• They get a unified system to view or manage their apps
  – Logging
  – Metrics
  – Service discovery
  – App config store
  – Same security model to secure their apps
  – Same isolation techniques to run multi-tenant apps

How can you benefit?

• Now the next question you have is:
  – Does Mesos provide all these things out of the box? No
  – But there are enterprise solutions, such as Mesosphere’s DCOS, which will help you jump-start
  – Or if you have a heterogeneous environment like YP, wherein you run all kinds of apps, develop this solution in-house

What are we doing at YP Engineering?

• Our engineering team has drunk the DevOps Kool-Aid
• In the beginning, it tastes weird, like Dr. Pepper, but trust me, the taste grows on you quickly
• We are doing all the crazy stuff you saw earlier
  – Centralized logging
  – Performance metrics
  – Application secrets
  – App config store
  – Service discovery
  – Persistent storage
  – Real-time analytics
• Running this DevOps’y infrastructure for more than a year
• Open source contributions: www.github.com/yp-engineering

Future Ahead
• The things we saw can be intimidating
• After all, we are talking about changing things we have been doing all these years
• But this is the future
  – Datacenter operating system is big
  – Containers are big
  – Isolation is big
  – DevOps is big
• If we can do that, our Ops will no longer be a COST-CENTER. It will become a PROFIT-CENTER, indeed!!!

REFERENCE LIST
• Hiner, J. (2014, October 1). IT as profit center versus cost center: State of the argument. Retrieved from http://www.zdnet.com/article/it-as-profit-center-versus-cost-center-state-of-the-argument/
• [Angry gorilla]. (n.d.). Retrieved from http://onedaylate.com/images/angry_gorilla.png
• Riley, C. (2014, May 5). Meet Infrastructure as Code. Retrieved from http://devops.com/2014/05/05/meet-infrastructure-code/
• Docker: http://www.docker.com
• Mesos: http://mesos.apache.org
• Mesosphere DCOS: https://mesosphere.com/product/
• Marathon: https://mesosphere.github.io/marathon/
• Chronos: https://mesos.github.io/chronos/
• Jenkins: https://jenkins.io/
• Chef: https://www.chef.io/chef/
• Puppet: https://puppet.com/
• Vagrant: https://www.vagrantup.com/
• Etcd: https://github.com/coreos/etcd

Thank you for listening!

Q/A

Imran Shaikh, Lead/Architect
Blog: http://elasticcompute.io
@imranshaikh
imran@elasticcompute.io
