

Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices

Yu Gan, Cornell University, yg397@cornell.edu
Yanqi Zhang, Cornell University, yz2297@cornell.edu
Kelvin Hu, Cornell University, sh2442@cornell.edu
Dailun Cheng, Cornell University, dc924@cornell.edu
Yuan He, Cornell University, yh772@cornell.edu
Meghna Pancholi, Cornell University, mp832@cornell.edu
Christina Delimitrou, Cornell University, delimitrou@cornell.edu

Abstract
Performance unpredictability is a major roadblock towards cloud adoption, and has performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. Detecting QoS violations after they occur in systems with microservices results in long recovery times, as hotspots propagate and amplify across dependent services.

We present Seer, an online cloud performance debugging system that leverages deep learning and the massive amount of tracing data cloud systems collect to learn spatial and temporal patterns that translate to QoS violations. Seer combines lightweight distributed RPC-level tracing with detailed low-level hardware monitoring to signal an upcoming QoS violation, and to diagnose the source of unpredictable performance. Once an imminent QoS violation is detected, Seer notifies the cluster manager to take action to avoid performance degradation altogether. We evaluate Seer both in local clusters, and in large-scale deployments of end-to-end applications built with microservices with hundreds of users. We show that Seer correctly anticipates QoS violations 91% of the time, and avoids the QoS violation to begin with in 84% of cases. Finally, we show that Seer can identify application-level design bugs, and provide insights on how to better architect microservices to achieve predictable performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASPLOS '19, April 13-17, 2019, Providence, RI, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6240-5/19/04. $15.00
https://doi.org/10.1145/3297858.3304004

CCS Concepts: • Computer systems organization → Cloud computing; Availability; • Computing methodologies → Neural networks; • Software and its engineering → Scheduling.

Keywords: cloud computing, datacenter, performance debugging, QoS, deep learning, data mining, tracing, monitoring, microservices, resource management

ACM Reference Format:
Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19), April 13-17, 2019, Providence, RI, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3297858.3304004

1 Introduction
Cloud computing services are governed by strict quality of service (QoS) constraints in terms of throughput, and more critically tail latency [14, 28, 31, 34]. Violating these requirements worsens the end user experience, leads to loss of availability and reliability, and has severe revenue implications [13, 14, 28, 32, 35, 36]. In an effort to meet these performance constraints and facilitate frequent application updates, cloud services have recently undergone a major shift from complex monolithic designs, which encompass the entire functionality in a single binary, to graphs of hundreds of loosely-coupled, single-concerned microservices [9, 45]. Microservices are appealing for several reasons, including accelerating development and deployment, simplifying correctness debugging, as errors can be isolated in specific tiers, and enabling a rich software ecosystem, as each microservice is written in the language or programming framework that best suits its needs.

At the same time, microservices signal a fundamental departure from the way traditional cloud applications were designed, and bring with them several system challenges.

Specifically, even though the quality-of-service (QoS) requirements of the end-to-end application are similar for microservices and monoliths, the tail latency required for each individual microservice is much stricter than for traditional cloud applications [34, 43-45, 61, 62, 64, 67, 70, 77]. This puts increased pressure on delivering predictable performance, as dependencies between microservices mean that a single misbehaving microservice can cause cascading QoS violations across the system.

Figure 1: Microservices graphs in three large cloud providers (Netflix, Twitter, Amazon) [2, 9, 11] and our Social Network service.

Fig. 1 shows three instances of real large-scale production deployments of microservices [2, 9, 11]. The perimeter of the circle (or sphere surface) shows the different microservices, and edges show dependencies between them. We also show these dependencies for Social Network, one of the large-scale services used in the evaluation of this work (see Sec. 3). Unfortunately, the complexity of modern cloud services means that manually determining the impact of each pair-wise dependency on end-to-end QoS, or relying on the user to provide this information, is impractical.

Apart from software heterogeneity, datacenter hardware is also becoming increasingly heterogeneous, as special-purpose architectures [20-22, 39, 55] and FPGAs are used to accelerate critical operations [19, 25, 40, 75]. This adds to the existing server heterogeneity in the cloud, where servers are progressively replaced and upgraded over the datacenter's provisioned lifetime [31, 33, 65, 68, 95], and further complicates the effort to guarantee predictable performance.

The need for performance predictability has prompted a long line of work on performance tracing, monitoring, and debugging systems [24, 42, 46, 78, 85, 93, 97]. Systems like Dapper and GWP, for example, rely on distributed tracing (often at RPC level) and low-level hardware event monitoring, respectively, to detect performance abnormalities, while the Mystery Machine [24] leverages the large amount of logged data to extract the causal relationships between requests.

Even though such tracing systems help cloud providers detect QoS violations and apply corrective actions to restore performance, until those actions take effect, performance suffers. For monolithic services this primarily affects the service experiencing the QoS violation itself, and potentially services it is sharing physical resources with. With microservices, however, a posteriori QoS violation detection is more impactful, as hotspots propagate and amplify across dependent services, forcing the system to operate in a degraded state for longer, until all oversubscribed tiers have been relieved and all accumulated queues have drained.

Figure 2: The performance impact of a posteriori performance diagnostics for a monolith and for microservices. (Panels show the latency percentile meeting QoS over time for a posteriori detection versus Seer, and per-microservice CPU utilization and latency increase over time, from back-end to front-end instances.)

Fig. 2a shows the impact of reacting to a QoS violation after it occurs for the Social Network application with several hundred users, running on 20 two-socket, high-end servers. Even though the scheduler scales out all oversubscribed tiers once the violation occurs, it takes several seconds for the service to return to nominal operation. There are two reasons for this: first, by the time one tier has been upsized, its neighboring tiers have built up request backlogs, which cause them to saturate in turn. Second, utilization is not always a good proxy for tail latency and/or QoS violations [14, 28, 61, 62, 73]. Fig. 2b shows the utilization of all microservices, ordered from the back-end to the front-end, over time, and Fig. 2c shows their corresponding 99th percentile latencies, normalized to nominal operation. Although there are cases where high utilization and high latency match, the effect of hotspots propagating through the service is much more pronounced when looking at latencies, with the back-end tiers progressively saturating the service's logic and front-end microservices. In contrast, there are highly-utilized microservices that do not experience increases in their tail latency. A common way to address such QoS violations is rate limiting [86], which constrains the incoming load until hotspots dissipate. This restores performance, but degrades the end user's experience, as a fraction of input requests is dropped.

We present Seer, a proactive cloud performance debugging system that leverages practical deep learning techniques to diagnose upcoming QoS violations in a scalable and online manner. First, Seer is proactive, to avoid the long recovery periods of a posteriori QoS violation detection. Second, it uses the massive amount of tracing data cloud systems collect over time to learn spatial and temporal patterns that lead to QoS violations early enough to avoid them altogether. Seer includes a lightweight, distributed RPC-level tracing system, based on Apache Thrift's timing interface [1], to collect end-to-end traces of request execution and track per-microservice outstanding requests. Seer uses these traces to train a deep neural network to recognize imminent QoS violations, and to identify the microservice(s) that initiated the performance degradation. Once Seer identifies the culprit of a QoS violation that will occur over the next few 100s of milliseconds, it uses detailed per-node hardware monitoring to determine the reason behind the degraded performance, and provides the cluster scheduler with recommendations on the actions required to avoid it.

We evaluate Seer both in our local cluster of 20 two-socket servers, and on large-scale clusters on Google Compute Engine (GCE), with a set of end-to-end interactive applications built with microservices, including the Social Network above. In our local cluster, Seer correctly identifies upcoming QoS violations in 93% of cases, and correctly pinpoints the microservice initiating the violation 89% of the time. To combat long inference times as clusters scale, we offload the DNN training and inference to Google's Tensor Processing Units (TPUs) when running on GCE [55]. We additionally experiment with using FPGAs in Seer via Project Brainwave [25] when running on Windows Azure, and show that both types of acceleration speed up Seer by 200-235x, with the TPU helping the most during training, and vice versa for inference. Accuracy is consistent with the small cluster results.

Finally, we deploy Seer in a large-scale installation of the Social Network service with several hundred users, and show that it not only correctly identifies 90.6% of upcoming QoS violations and avoids 84% of them, but that detecting patterns that create hotspots helps the application's developers improve the service design, resulting in a decreasing number of QoS violations over time. As cloud application and hardware complexity continues to grow, data-driven systems like Seer can offer practical solutions for systems whose scale makes empirical approaches intractable.

2 Related Work
Performance unpredictability is a well-studied problem in public clouds that stems from platform heterogeneity, resource interference, software bugs, and load variation [23, 32, 34, 35, 38, 53, 58, 61-63, 72, 76, 81]. We now review related work on reducing performance unpredictability in cloud systems, including through scheduling and cluster management, or through online tracing systems.
Cloud management: The prevalence of cloud computing has motivated several cluster management designs. Systems like Mesos [51], Torque [87], Tarcil [38], and Omega [82] all target the problem of resource allocation in large, multi-tenant clusters. Mesos is a two-level scheduler. It has a central coordinator that makes resource offers to application frameworks, and each framework has an individual scheduler that handles its assigned resources. Omega, on the other hand, follows a shared-state approach, where multiple concurrent schedulers can view the whole cluster state, with conflicts being resolved through a transactional mechanism [82]. Tarcil leverages information on the type of resources applications need to employ a sampling-based distributed scheduler that returns high quality resources within a few milliseconds [38].

Dejavu identifies a few workload classes and reuses previous allocations for each class, to minimize reallocation overheads [90]. CloudScale [83], PRESS [48], AGILE [70], and the work by Gmach et al. [47] predict future resource needs online, often without a priori knowledge. Finally, auto-scaling systems, such as Rightscale [79], automatically scale the number of physical or virtual instances used by webserving workloads to accommodate changes in user load.

A second line of work tries to identify resources that will allow a new, potentially-unknown application to meet its performance (throughput or tail latency) requirements [29, 31, 32, 34, 66, 68, 95]. Paragon uses classification to determine the impact of platform heterogeneity and workload interference on an unknown, incoming workload [30, 31]. It then uses this information to achieve predictable performance and high cluster utilization. Paragon assumes that the cluster manager has full control over all resources, which is often not the case in public clouds. Quasar extends the use of data mining in cluster management by additionally determining the appropriate amount of resources for a new application. Nathuji et al. developed a feedback-based scheme that tunes resource assignments to mitigate memory interference [69]. Yang et al. developed an online scheme that detects memory pressure and finds colocations that avoid interference on latency-sensitive workloads [95]. Similarly, DeepDive detects and manages interference between co-scheduled workloads in a VM environment [71].

Finally, CPI2 [98] throttles low-priority workloads that introduce destructive interference to important, latency-critical services, using low-level metrics of performance collected through Google-Wide Profiling (GWP). In terms of managing platform heterogeneity, Nathuji et al. [68] and Mars et al. [64] quantified its impact on conventional benchmarks and Google services, and designed schemes to predict the most appropriate servers for a workload.
Cloud tracing & diagnostics: There is extensive related work on monitoring systems that has shown that execution traces can help diagnose performance, efficiency, and even security problems in large-scale systems [12, 24, 26, 42, 54, 77, 85, 93, 97]. For example, X-Trace is a tracing framework that provides a comprehensive view of the behavior of services running on large-scale, potentially shared clusters. X-Trace supports several protocols and software systems, and has been deployed in several real-world scenarios, including DNS resolution and a photo-hosting site [42]. The Mystery Machine, on the other hand, leverages the massive amount of monitoring data cloud systems collect to determine the causal relationship between different requests [24]. Cloudseer serves a similar purpose, building an automaton for the workflow of each task based on normal execution, and then comparing against this automaton at runtime to determine if the workflow has diverged from its expected behavior [97].

Table 1: Characteristics and code composition of each end-to-end microservices-based application.

Service | Communication Protocol | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network | RPC | 36 | 34% C, 23% C++, 18% Java, 7% node.js, 6% Python, 5% Scala, 3% PHP, 2% JS, 2% Go
Media Service | RPC | 38 | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% node.js, 3% Python, 3% JS
E-commerce Site | REST | 41 | 21% Java, 16% C++, 15% C, 14% Go, 10% JS, 7% node.js, 5% Scala, 4% HTML, 3% Ruby
Banking System | RPC | 28 | 29% C, 25% Javascript, 16% Java, 16% node.js, 11% C++, 3% Python
Hotel Reservations [3] | RPC | 15 | 89% Go, 7% HTML, 4% Python

Figure 3: Dependency graph between the microservices of the end-to-end Social Network application.

Figure 4: Architecture of the hotel reservation site using Go-microservices [3].

Finally, there are several production systems, including Dapper [85], GWP [78], and Zipkin [8], which provide the tracing infrastructure for large-scale production services at Google and Twitter, respectively. Dapper and Zipkin trace distributed user requests at RPC granularity, while GWP focuses on low-level hardware monitoring diagnostics.

Root cause analysis of performance abnormalities in the cloud has also gained increased attention over the past few years, as the number of interactive, latency-critical services hosted in cloud systems has increased. Jayathilaka et al. [54], for example, developed Roots, a system that automatically identifies the root cause of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots tracks events within the PaaS cloud using a combination of metadata injection and platform-level instrumentation. Weng et al. [91] similarly explore the cloud provider's ability to diagnose the root cause of performance abnormalities in multi-tier applications. Finally, Ouyang et al. [74] focus on the root cause analysis of straggler tasks in distributed programming frameworks like MapReduce and Spark.

Even though this work does not specifically target interactive, latency-critical microservices or applications of similar granularity, such examples provide promising evidence that data-driven performance diagnostics can improve a large-scale system's ability to identify performance anomalies and address them to meet its performance guarantees.

3 End-to-End Applications with Microservices

We motivate and evaluate Seer with a set of new end-to-end, interactive services built with microservices. Even though there are open-source microservices that can serve as components of a larger application, such as nginx [6], memcached [41], MongoDB [5], Xapian [57], and RabbitMQ [4], there are currently no publicly-available end-to-end microservices applications, with the exception of a few simple architectures, like Go-microservices [3] and Sockshop [7]. We design four end-to-end services implementing a Social Network, a Media Service, an E-commerce Site, and a Banking System. Starting from the Go-microservices architecture [3], we also develop an end-to-end Hotel Reservation system. Services are designed to be representative of frameworks used in production systems, modular, and easily reconfigurable. The end-to-end applications and tracing infrastructure are described in more detail and open-sourced in [45].

Table 1 briefly shows the characteristics of each end-to-end application, including its communication protocol, the number of unique microservices it includes, and its breakdown by programming language and framework. Unless otherwise noted, all microservices are deployed in Docker containers. Below, we briefly describe the scope and functionality of each service.

Figure 5: The architecture of the end-to-end E-commerce application implementing an online clothing store.

3.1 Social Network
Scope: The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.
Functionality: Fig. 3 shows the architecture of the end-to-end service. Users (client) send requests over http, which first reach a load balancer implemented with nginx, which then selects a specific webserver, also in nginx. Users can create posts embedded with text, media, links, and tags to other users, which are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [15, 16, 59, 92], a search service using Xapian [57], and microservices that allow users to follow, unfollow, or block other accounts. Inter-microservice messages use Apache Thrift RPCs [1]. The service's backend uses memcached for caching, and MongoDB for persistently storing posts, user profiles, media, and user recommendations. This service is broadly deployed at Cornell and elsewhere, and currently has several hundred users. We use this installation to test the effectiveness and scalability of Seer in Section 6.

3.2 Media Service
Scope: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [9, 11].
Functionality: As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. The front-end is similar to Social Network, and users can search and browse information about movies, including the plot, photos, videos, and review information, as well as insert a review for a specific movie by logging in to their account. Users can also select to rent a movie, which involves a payment authentication module to verify the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. Movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while reviews are held in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL DB. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which students can browse and review.

3.3 E-Commerce Service
Scope: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [7].
Functionality: The application front-end here is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine (C++) and microservices for creating wishlists (Java).

3.4 Banking System
Scope: The service implements a secure banking system supporting payments, loans, and credit card management.
Functionality: Users interface with a node.js front-end, similar to E-commerce, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, request a loan, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases are memcached and MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.5 Hotel Reservation Site
Scope: The service is an online hotel reservation site for browsing information about hotels and making reservations.
Functionality: The service is based on the Go-microservices open-source project [3], augmented with backend databases and machine learning widgets for advertisement and hotel recommendations. A client request is first directed to one of the front-end webservers in node.js by a load balancer. The front-end then interfaces with the search engine, which allows users to explore hotel availability in a given region, and place a reservation. The service back-end consists of memcached and MongoDB instances.

Figure 6: Overview of Seer's operation. A tracing module in each microservice sends traces through a tracing coordinator to the TraceDB (Cassandra); Seer consumes the traces and the microservices dependency graph, and notifies the cluster manager to adjust resource allocations.

4 Seer Design
4.1 Overview
Fig. 6 shows the high-level architecture of the system. Seer is an online performance debugging system for cloud systems hosting interactive, latency-critical services. Even though we are focusing our analysis on microservices, where the impact of QoS violations is more severe, Seer is also applicable to general cloud services and traditional multi-tier or Service-Oriented Architecture (SOA) workloads. Seer uses two levels of tracing, shown in Fig. 7.

Figure 7: The two levels of tracing used in Seer: RPC-level tracing (signal QoS violation, identify culprit microservice) and node-level diagnostics via performance counters and microbenchmarks (identify culprit resource, notify cluster manager).

First, it uses a lightweight, distributed RPC-level tracing system, described in Sec. 4.2, which collects end-to-end execution traces for each user request, including per-tier latency and outstanding requests, associates RPCs belonging to the same end-to-end request, and aggregates them into a centralized Cassandra database (TraceDB). From there, traces are used to train Seer to recognize patterns in space (between microservices) and time that lead to QoS violations. At runtime, Seer consumes real-time streaming traces to infer whether there is an imminent QoS violation.
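For illustration, the sketch below shows what a single per-RPC trace record and its aggregation into a Cassandra-backed TraceDB might look like. The schema, field names, table, keyspace, and host are hypothetical; the paper does not specify the exact trace format.

```python
# A minimal sketch of a per-RPC trace record and its insertion into a
# Cassandra-backed TraceDB. All names below are illustrative assumptions.
from dataclasses import dataclass
from cassandra.cluster import Cluster  # DataStax Python driver

@dataclass
class RPCSpan:
    trace_id: str       # identifies the end-to-end user request
    microservice: str   # e.g., "composePost"
    parent_rpc: str     # caller microservice, to associate RPCs of one request
    start_us: int       # timestamp when the request was seen (post-RPC processing)
    end_us: int         # timestamp when the response was transmitted
    queue_depth: int    # outstanding requests at this microservice

def store_span(session, span: RPCSpan) -> None:
    # Traces from all microservices are aggregated in a central database.
    session.execute(
        "INSERT INTO rpc_spans (trace_id, microservice, parent_rpc, "
        "start_us, end_us, queue_depth) VALUES (%s, %s, %s, %s, %s, %s)",
        (span.trace_id, span.microservice, span.parent_rpc,
         span.start_us, span.end_us, span.queue_depth),
    )

if __name__ == "__main__":
    cluster = Cluster(["tracedb.internal"])    # hypothetical TraceDB host
    session = cluster.connect("seer_traces")   # hypothetical keyspace
    store_span(session, RPCSpan("req-42", "composePost", "nginx",
                                1_000_000, 1_003_500, 7))
```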

When a QoS violation is expected to occur and a culprit microservice has been located, Seer uses its lower tracing level, which consists of detailed, per-node, low-level hardware monitoring primitives, such as performance counters, to identify the reason behind the QoS violation. It also uses this information to provide the cluster manager with recommendations on how to avoid the performance degradation altogether. When Seer runs on a public cloud where performance counters are disabled, it uses a set of tunable microbenchmarks to determine the source of unpredictable performance (see Sec. 4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing is lightweight enough to track all active requests and services in the system, and that detailed low-level hardware tracing is only used on demand, for microservices likely to cause performance disruptions.

Figure 8: Distributed tracing and instrumentation in Seer, showing the queues instrumented in memcached (NIC, packet filtering, TCP/IP RX/TX, epoll/libevent, socket read, request processing), with traces aggregated by the tracing coordinator into the TraceDB (Cassandra) and exposed through a query engine and web UI as per-microservice latency Gantt charts.

In the following sections, we describe the design of the tracing system, the learning techniques in Seer, its low-level diagnostic framework, and the system insights we can draw from Seer's decisions to improve cloud application design.

4.2 Distributed Tracing
A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. We have developed a distributed tracing system for Seer, similar in design to Dapper [85] and Zipkin [8], that records per-microservice latencies using the Thrift timing interface, as shown in Fig. 8. We additionally track the number of requests queued in each microservice (outstanding requests), since queue lengths are highly correlated with performance and QoS violations [46, 49, 56, 96]. In all cases, the overhead from tracing without request sampling is negligible, less than 0.1% on end-to-end latency, and less than 0.15% on throughput (QPS), which is tolerable for such systems [24, 78, 85]. Traces from all microservices are aggregated in a centralized database [18].
Instrumentation: The tracing system requires two types of application instrumentation. First, to distinguish between the time spent processing network requests and the time that goes towards application computation, we instrument the application to report the time it sees a new request (post-RPC processing). We similarly instrument the transmit side of the application. Second, systems have multiple sources of queueing, in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues.

Table 2: The different neural network configurations we explored for Seer.

Name | LoC | Layers (FC / Conv / Vect / Total) | Nonlinear Function | Weights | Batch Size
CNN | 1456 | - / 8 / - / 8 | ReLU | 30K | 4
LSTM | 944 | 12 / - / 6 / 18 | sigmoid, tanh | 52K | 32
Seer | 2882 | 10 / 7 / 5 / 22 | ReLU, sigmoid, tanh | 80K | 32

Fig. 8 shows an example of Seer's instrumentation for memcached. Memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the <key, value> pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.
Limited instrumentation: As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases, Seer relies on the requests queued exclusively in the NIC to signal upcoming QoS violations. In Section 5, we compare the accuracy of the full versus limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations that are caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.
Inferring queue lengths: Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer, when the default level of instrumentation is not available.
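As a concrete illustration of the per-microservice queue accounting described above, the sketch below sums hardware and software queue contributions into a single outstanding-request count. All helper names (`nic_queue_len`, `read_probe`) and the queue list are hypothetical; Seer's actual probes live inside the instrumented applications rather than in a standalone script.

```python
# Minimal sketch: combine hardware (NIC) and software queue depths into a
# per-microservice outstanding-request count. Helpers are placeholders.

SOFTWARE_QUEUES = ["epoll", "socket_read", "proc"]  # memcached's software stages

def nic_queue_len(server: str, microservice: str, shared: bool) -> int:
    """Packets queued in the NIC for this microservice's destination. If
    hardware queues are shared, the aggregate queue length is charged, since
    it determines how fast this microservice's packets are processed."""
    return 0  # placeholder: would read filtered NIC queue statistics

def read_probe(microservice: str, queue: str) -> int:
    """Software-queue probe inserted in the application."""
    return 0  # placeholder: would read a counter exported by the probe

def outstanding_requests(server: str, microservice: str,
                         shared_nic: bool = True) -> int:
    depth = nic_queue_len(server, microservice, shared_nic)
    for q in SOFTWARE_QUEUES:
        depth += read_probe(microservice, q)
    return depth
```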

4.3 Deep Learning in Performance Debugging
A popular way to model performance in cloud systems, especially when there are dependencies between tasks, is queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations and collected over time to train a deep neural network to signal upcoming QoS violations. Below, we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.
Using deep learning: Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.
Configuring the DNN: The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the microservices causing them.

Figure 9: The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers, with dropout and a final SoftMax. Each input and output neuron corresponds to a microservice, ordered topologically (in dependency order); the output is the probability of each microservice causing a QoS violation.

We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.
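To make the encoding concrete, the sketch below builds one training example from a window of queue-depth snapshots, with microservices ordered topologically. The helper names, service list, and window length are illustrative assumptions, not part of Seer's published implementation.

```python
import numpy as np

# Hypothetical example: microservices ordered topologically (back-end first),
# and T consecutive queue-depth snapshots sampled from the tracing system.
TOPO_ORDER = ["mongoDB", "memcached", "postsStorage", "composePost", "nginx"]
T = 10  # snapshots per training example (illustrative)

def encode_example(snapshots, culprit=None):
    """snapshots: list of T dicts mapping microservice -> queue depth.
    culprit: microservice that initiated a QoS violation, or None."""
    # Input tensor: T timesteps x N microservices, values are queue depths.
    x = np.array([[snap.get(m, 0) for m in TOPO_ORDER] for snap in snapshots],
                 dtype=np.float32)
    # Output label: one-hot over microservices (which one initiates the
    # upcoming QoS violation); all zeros if no violation was observed.
    y = np.zeros(len(TOPO_ORDER), dtype=np.float32)
    if culprit is not None:
        y[TOPO_ORDER.index(culprit)] = 1.0
    return x, y

# Example usage with synthetic snapshots:
snaps = [{"mongoDB": 3, "memcached": 1, "postsStorage": 12,
          "composePost": 40, "nginx": 5} for _ in range(T)]
x, y = encode_example(snaps, culprit="composePost")
print(x.shape, y)   # (10, 5) [0. 0. 0. 1. 0.]
```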

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence, we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations, and discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected on a different week, after the servers had been patched and the OS had been upgraded.
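A minimal sketch of such a hybrid CNN+LSTM network, written with the Keras API, is shown below; the layer sizes and hyperparameters are illustrative and do not reproduce the exact configuration in Table 2.

```python
from tensorflow.keras import layers, models

N, T = 128, 10  # illustrative: N microservices, T queue-depth snapshots

# Input: T timesteps of per-microservice queue depths, topologically ordered.
inputs = layers.Input(shape=(T, N, 1))
# CNN stage: reduce dimensionality and capture spatial (dependency) patterns
# within each snapshot.
x = layers.TimeDistributed(
        layers.Conv1D(32, 3, padding="same", activation="relu"))(inputs)
x = layers.TimeDistributed(layers.MaxPooling1D(2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)
# LSTM stage: capture temporal patterns across consecutive snapshots.
x = layers.LSTM(64, return_sequences=True)(x)
x = layers.LSTM(64)(x)
x = layers.Dropout(0.5)(x)
# SoftMax output: probability that each microservice initiates a QoS violation.
outputs = layers.Dense(N, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```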

Figure 10: Comparison of DNN architectures, showing QoS violation detection accuracy (CNN 56.1%, LSTM 78.79%, Seer (CNN+LSTM) 93.45%) against inference time in ms.

The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests which are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to LSTM. Given that most resource partitioning decisions take effect after a few 100ms, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes and a network switch failure, which caused high packet drops.

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network, first because it has stricter QoS constraints than, e.g., E-commerce, and second, due to a synchronous and cyclic communication between three neighboring services that caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.

Retraining Seer: By default, training happens once, and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored on disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real-time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5, we evaluate Seer's ability to adjust its estimations to application changes.
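A possible realization of this retraining loop, reusing the Keras model from the earlier sketch, is shown below; the checkpoint path and the decision of when a change counts as "major" are assumptions for illustration.

```python
import os

CKPT = "seer_weights.h5"  # hypothetical checkpoint path

def incremental_retrain(model, new_x, new_y, epochs=2):
    """Continue training from the last stored weights when new traces arrive,
    instead of retraining from scratch."""
    if os.path.exists(CKPT):
        model.load_weights(CKPT)       # resume from previous training rounds
    model.fit(new_x, new_y, epochs=epochs, batch_size=32, verbose=0)
    model.save_weights(CKPT)           # persist weights for the next round
    return model

def full_retrain_in_background(build_model_fn, all_x, all_y):
    """For major application changes (e.g., critical-path microservices change),
    train a fresh network; the incrementally-trained interim model keeps
    serving predictions until this one is ready."""
    fresh = build_model_fn()
    fresh.fit(all_x, all_y, epochs=10, batch_size=32, verbose=0)
    fresh.save_weights(CKPT)
    return fresh
```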

4.4 Hardware Monitoring
Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.
Private cluster: When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.
Public cluster: When Seer does not have access to performance counters, it instead uses a set of 10 tunable, contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
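The public-cloud diagnosis procedure sketched above can be summarized as the following loop; the microbenchmark names, their ordering, and the 10% sensitivity threshold are illustrative assumptions, with the actual contentious kernels being those of [30].

```python
# Sketch of microbenchmark-based diagnosis on a public cloud where performance
# counters are unavailable. Helper functions and threshold are placeholders.

# Ordered from core resources outward (illustrative ordering).
MICROBENCHMARKS = ["cpu", "l1i", "l1d", "l2", "llc",
                   "memory_capacity", "memory_bandwidth",
                   "network_bandwidth", "storage_capacity", "storage_bandwidth"]

def run_microbenchmark(node, resource, intensity):
    """Inject the contentious kernel targeting `resource` at the given
    intensity on `node`, next to the culprit microservice (placeholder)."""
    pass

def victim_latency(microservice):
    """Tail latency of the culprit microservice, read from the tracing
    system (placeholder)."""
    return 1.0

def diagnose(node, culprit, threshold=0.10):
    baseline = victim_latency(culprit)
    for resource in MICROBENCHMARKS:
        for intensity in (0.25, 0.5, 0.75, 1.0):   # progressively tune up
            run_microbenchmark(node, resource, intensity)
            if victim_latency(culprit) > baseline * (1 + threshold):
                return resource   # saturated resource: notify cluster manager
    return None                   # no resource saturation detected
```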

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.
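For instance, such adjustments could be issued as shown below; the container name, interface, class IDs, and rates are placeholders, and the exact knobs a deployment exposes will differ.

```python
import subprocess

def scale_container_cpu(container: str, cpu_quota_us: int) -> None:
    # Resize a Docker container's CPU allocation (per 100000us default period).
    subprocess.run(["docker", "update",
                    f"--cpu-quota={cpu_quota_us}", container], check=True)

def set_network_rate(iface: str, classid: str, rate: str) -> None:
    # Adjust an existing HTB class used for per-microservice bandwidth
    # partitioning (assumes an HTB qdisc and class were set up beforehand).
    subprocess.run(["tc", "class", "change", "dev", iface,
                    "parent", "1:", "classid", classid,
                    "htb", "rate", rate], check=True)

# Example (placeholder values):
# scale_container_cpu("socialnet_composePost_1", 200000)  # roughly 2 cores
# set_network_rate("eth0", "1:10", "500mbit")
```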

4.5 System Insights from Seer
Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance and improving the application design. This has resulted both in progressively fewer QoS violations over time, and a better understanding of the design challenges of microservices.

4.6 Implementation
Seer is implemented in 12KLOC of C/C++ and Python. It runs on Linux and OSX, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.
Security concerns: Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation
5.1 Methodology
Server clusters: First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each.

Figure 11: Seer's sensitivity to (a) the size of training datasets (100MB to 1TB; QoS violation detection accuracy versus training time in hours) and (b) the tracing interval (detection accuracy versus interval in seconds).

Each server is connected to a 40Gbps ToR switch over 10Gbe NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.
Applications: We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6, we study Seer in a real large-scale deployment of the Social Network; in that case, the input load is driven by real user traffic.

5.2 Evaluation
Sensitivity to training data: Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point, accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.
Sensitivity to tracing frequency: By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work, we use 100ms as the interval for measuring queue depths across microservices.
False negatives & false positives: Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window.

Figure 12: (a) The false negatives and false positives in Seer as we vary the inference window (10ms to 2s); (b) breakdown of valid QoS violations by cause (CPU, cache, memory, network, disk, application) and comparison of utilization-based detection, App-only and Net-only instrumentation, Seer, and the ground truth.

When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and, more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.
Comparison of debugging systems: Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

level inefficiencies including correctness bugs unnecessarysynchronization andor blocking behavior between microser-vices (including two cases of deadlocks) and misconfigurediptables rules which caused packet drops An equally largefraction of QoS violations are due to compute contentionfollowed by contention in the network cache and memory

Figure 13: Seer retraining incrementally after each time the Social Network service is updated. The top plot shows detection accuracy over time, annotated with QoS violations, impactful and innocuous service updates, incremental retraining, and signaled QoS violations; the bottom plot shows normalized tail latency against QoS for Social Network, Movie Reviewing, E-commerce, and Go-microservices.

Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.
Retraining: Fig. 13 shows the detection accuracy for Seer, and the tail latency for each end-to-end service, over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad, using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time, and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red crosses denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background, and starts recovering its accuracy until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.
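As a rough illustration of incremental retraining of this kind, the sketch below continues training a previously saved Keras model on only the most recently collected, labeled traces rather than rebuilding it from scratch; the file path, batch size, epoch count, and tensor shapes are assumptions for illustration, not Seer's actual configuration.

import tensorflow as tf

def incremental_retrain(model_path, recent_traces, recent_labels,
                        epochs=2, batch_size=32):
    """Continue training an existing model on freshly collected traces.

    `recent_traces` has shape (samples, time_steps, num_microservices) and
    `recent_labels` has shape (samples, num_microservices); both are
    illustrative shapes for a per-microservice queue-depth input.
    """
    model = tf.keras.models.load_model(model_path)
    # Warm-start from the current weights instead of re-initializing,
    # so accuracy recovers shortly after an application update.
    model.fit(recent_traces, recent_labels,
              epochs=epochs, batch_size=batch_size, verbose=0)
    model.save(model_path)
    return model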

6 Large-Scale Cloud Study

6.1 Seer Scalability
We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE), and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager, based on Seer's feedback.

Figure 14. Seer training and inference time with hardware acceleration (native vs. TPU vs. Brainwave; time in seconds, log scale).

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and

Figure 15. QoS violations each microservice in Social Network is responsible for (nginx, text, image, userTag, urlShorten, video, recomm, search, login, readPost, compPost, writeGraph, memcached, mongodb).

inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems and quantify the impact on training and inference time and detection accuracy. (Before running on TPUs, we reimplemented our DNN in TensorFlow; we similarly adjusted the DNN to the currently-supported designs in Brainwave.) Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper, we run Seer on TPUs and host the Social Network service on GCE.
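For reference, a minimal sketch of how a TensorFlow Keras model can be placed on a Cloud TPU today uses a TPU distribution strategy; the `make_model` helper, the empty resolver address, and the single training epoch are assumptions for illustration, and this TF2-style code is not the paper's original implementation.

import tensorflow as tf

def build_and_train_on_tpu(make_model, train_dataset, tpu_address=""):
    """Train a user-supplied, compiled Keras model under a TPU strategy.

    `make_model` returns a compiled Keras model; `tpu_address` is the TPU
    worker address (an empty string resolves an attached TPU on GCE/Colab).
    """
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        model = make_model()   # variables are created on the TPU cores
    model.fit(train_dataset, epochs=1)
    return model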

6.2 Source of QoS Violations
We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute, and to a lesser degree cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding; if processing for any incoming request is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design
Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them. Furthermore, by detecting recurring patterns

Figure 16. QoS violations per day in Social Network over the two-month deployment (x-axis: day; y-axis: QoS violations).

that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.
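One of the recurring anti-patterns mentioned above, cyclic dependencies between microservices, can be flagged with a simple graph check over the RPC dependency graph; the depth-first-search sketch below is an illustrative standalone helper, not part of Seer itself.

def find_cycle(dependencies):
    """Return one cycle (list of microservice names) in an RPC dependency
    graph, or None if the graph is acyclic.

    `dependencies` maps each microservice to the microservices it calls,
    e.g. {"nginx": ["composePost"], "composePost": ["memcached"]}.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {svc: WHITE for svc in dependencies}
    stack = []

    def dfs(svc):
        color[svc] = GRAY
        stack.append(svc)
        for callee in dependencies.get(svc, []):
            if color.get(callee, WHITE) == GRAY:       # back edge => cycle
                return stack[stack.index(callee):] + [callee]
            if color.get(callee, WHITE) == WHITE:
                cycle = dfs(callee)
                if cycle:
                    return cycle
        stack.pop()
        color[svc] = BLACK
        return None

    for svc in dependencies:
        if color[svc] == WHITE:
            cycle = dfs(svc)
            if cycle:
                return cycle
    return None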

7 Conclusions
Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements
We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] [n. d.]. Apache Thrift. https://thrift.apache.org
[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture
[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services
[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com
[5] [n. d.]. MongoDB. https://www.mongodb.com
[6] [n. d.]. NGINX. https://nginx.org/en
[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application
[8] [n. d.]. Zipkin. http://zipkin.io
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.

[11] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. [n. d.]. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. [n. d.]. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, San Jose, June 2011.
[14] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[16] Leon Bottou. [n. d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), Paris, France, 2010.
[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO
[18] Cassandra. [n. d.]. Apache Cassandra. http://cassandra.apache.org
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.
[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th intl. conf. on Architectural Support for Programming Languages and Operating Systems.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 217–231.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.

[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. 2016. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 378–383.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. [n. d.]. Cloud-based DDoS attacks and defenses. In Proc. of i-Society, Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. [n. d.]. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. [n. d.]. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference of Autonomic Computing (ICAC), San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n. d.]. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. [n. d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4, December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro, Special Issue on Top Picks from the Computer Architecture Conferences, May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS, Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[36] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/2749469.2750389

[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66.
[41] Brad Fitzpatrick. [n. d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-Trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI '07). USENIX Association, Berkeley, CA, USA, 20–20.
[43] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. 2018. Seer: Leveraging Big Data to Navigate the Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. [n. d.]. Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization. In Proc. of NSDI, 2018.
[47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. [n. d.]. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Proceedings of IISWC, Boston, MA, 2007. https://doi.org/10.1109/IISWC.2007.4362193
[48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. [n. d.]. PRESS: PRedictive Elastic ReSource Scaling for cloud systems. In Proceedings of CNSM, Niagara Falls, ON, 2010.
[49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. [n. d.]. Fundamentals of Queueing Theory. In Wiley Series in Probability and Statistics, Book 627, 2011.
[50] Sanchika Gupta and Padam Kumar. [n. d.]. VM Profile Based Optimized Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. In Proc. of SSCC, Mysore, India, 2013.
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n. d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI, Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. [n. d.]. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data, Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. [n. d.]. On the Performance Variability of Production Cloud Services. In Proceedings of CCGRID, Newport Beach, CA, 2011.
[54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2017. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications. In Proceedings of WWW. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 469–478.
[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.

[56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX).
[57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC.
[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. [n. d.]. Exploring the performance fluctuations of HPC workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[59] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1), pp. 1-25, 2001.
[60] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys, 2014.
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.
[63] Dave Mangot. [n. d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed
[64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in "homogeneous" warehouse-scale computers. In Proceedings of ISCA, Tel-Aviv, Israel, 2013.
[65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4.
[66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. [n. d.]. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of MICRO, Porto Alegre, Brazil, 2011.
[67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th Annual International Symposium on Computer Architecture, 319–330.
[68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC, Jacksonville, FL, 2007.
[69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys, Paris, France, 2010.
[70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. [n. d.]. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service. In Proceedings of ICAC, San Jose, CA, 2013.
[71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, and Ricardo Bianchini. [n. d.]. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. In Proceedings of ATC, San Jose, CA, 2013.
[72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. [n. d.]. A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing. In Lecture Notes on Cloud Computing, Volume 34, p. 115-131, 2010.
[73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP, Farminton, PA, 2013.
[74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. [n. d.]. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016.
[75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.
[76] Suhail Rehman and Majd Sakr. [n. d.]. Initial Findings for provisioning variation in cloud computing. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. [n. d.]. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of SOCC, 2012.
[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68
[79] rightscale. [n. d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale

[80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.
[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proceedings VLDB Endow. 3, 1-2 (Sept. 2010), 460–471.
[82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys, Prague, 2013.
[83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. [n. d.]. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of SOCC, Cascais, Portugal, 2011.
[84] David Shue, Michael J. Freedman, and Anees Shaikh. [n. d.]. Performance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. of OSDI, Hollywood, CA, 2012.
[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report, Google, Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf
[86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. [n. d.]. Distributed Resource Management Across Process Boundaries. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Santa Clara, CA, 2017.
[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque
[88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. [n. d.]. Scheduler-based Defenses against Cross-VM Side-channels. In Proc. of the 23rd USENIX Security Symposium, San Diego, CA, 2014.
[89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. [n. d.]. A Placement Vulnerability Study in Multi-Tenant Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX Security), Washington, DC, 2015.
[90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Ricardo Bianchini. [n. d.]. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.
[91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. [n. d.]. Root cause analysis of anomalies of multitier services in public clouds. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.
[92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: Practical Machine Learning Tools and Techniques. 3rd Edition.
[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 619–634.
[94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. [n. d.]. An Exploration of L2 Cache Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW), Chicago, IL, 2011.
[95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA, 2013.
[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. [n. d.]. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.
[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS. ACM, New York, NY, USA, 489–502.
[98] Yinqian Zhang and Michael K. Reiter. [n. d.]. Duppel: retrofitting commodity operating systems to mitigate cache side channels in the cloud. In Proc. of CCS, Berlin, Germany, 2013.
[99] Mark Zhao and G. Edward Suh. [n. d.]. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.


Specifically, even though the quality-of-service (QoS) requirements of the end-to-end application are similar for microservices and monoliths, the tail latency required for each individual microservice is much stricter than for traditional cloud applications [34, 43-45, 61, 62, 64, 67, 70, 77]. This puts increased pressure on delivering predictable performance, as dependencies between microservices mean that a single misbehaving microservice can cause cascading QoS violations across the system.

Figure 1. Microservices graphs in three large cloud providers (Netflix, Twitter, Amazon) [2, 9, 11], and our Social Network service.

Fig. 1 shows three instances of real large-scale production deployments of microservices [2, 9, 11]. The perimeter of the circle (or sphere surface) shows the different microservices, and edges show dependencies between them. We also show these dependencies for Social Network, one of the large-scale services used in the evaluation of this work (see Sec. 3). Unfortunately, the complexity of modern cloud services means that manually determining the impact of each pair-wise dependency on end-to-end QoS, or relying on the user to provide this information, is impractical.

Apart from software heterogeneity, datacenter hardware is also becoming increasingly heterogeneous, as special-purpose architectures [20-22, 39, 55] and FPGAs are used to accelerate critical operations [19, 25, 40, 75]. This adds to the existing server heterogeneity in the cloud, where servers are progressively replaced and upgraded over the datacenter's provisioned lifetime [31, 33, 65, 68, 95], and further complicates the effort to guarantee predictable performance.

The need for performance predictability has prompted a long line of work on performance tracing, monitoring, and debugging systems [24, 42, 46, 78, 85, 93, 97]. Systems like Dapper and GWP, for example, rely on distributed tracing (often at RPC level) and low-level hardware event monitoring, respectively, to detect performance abnormalities, while the Mystery Machine [24] leverages the large amount of logged data to extract the causal relationships between requests.

Even though such tracing systems help cloud providers detect QoS violations and apply corrective actions to restore performance, until those actions take effect, performance suffers. For monolithic services this primarily affects the service experiencing the QoS violation itself, and potentially services it is sharing physical resources with. With microservices, however, a posteriori QoS violation detection is more impactful, as hotspots propagate and amplify across dependent

Figure 2. The performance impact of a posteriori performance diagnostics for a monolith and for microservices. (a) Fraction of requests meeting QoS over time, with the violation detected and QoS restored, for a posteriori detection vs. Seer; (b) CPU utilization (%) of all microservice instances over time, ordered from the back-end to the front-end; (c) the corresponding per-microservice tail latency increase (%).

services, forcing the system to operate in a degraded state for longer, until all oversubscribed tiers have been relieved and all accumulated queues have drained. Fig. 2a shows the impact of reacting to a QoS violation after it occurs for the Social Network application, with several hundred users, running on 20 two-socket, high-end servers. Even though the scheduler scales out all oversubscribed tiers once the violation occurs, it takes several seconds for the service to return to nominal operation. There are two reasons for this: first, by the time one tier has been upsized, its neighboring tiers have built up request backlogs, which cause them to saturate in turn. Second, utilization is not always a good proxy for tail latency and/or QoS violations [14, 28, 61, 62, 73]. Fig. 2b shows the utilization of all microservices, ordered from the back-end to the front-end, over time, and Fig. 2c shows their corresponding 99th percentile latencies, normalized to nominal operation. Although there are cases where high utilization and high latency match, the effect of hotspots propagating through the service is much more pronounced when looking at latencies, with the back-end tiers progressively saturating the service's logic and front-end microservices. In contrast, there are highly-utilized microservices that do not experience increases in their tail latency. A common way to address such QoS violations is rate limiting [86], which constrains the incoming load until hotspots dissipate. This restores performance, but degrades the end user's experience, as a fraction of input requests is dropped.

We present Seer, a proactive cloud performance debugging system that leverages practical deep learning techniques to diagnose upcoming QoS violations in a scalable and online manner. First, Seer is proactive, to avoid the long recovery periods of a posteriori QoS violation detection. Second, it uses the massive amount of tracing data cloud systems collect over time to learn spatial and temporal patterns that lead to QoS violations early enough to avoid them altogether. Seer includes a lightweight, distributed RPC-level tracing system, based on Apache Thrift's timing interface [1], to collect end-to-end traces of request execution, and track per-microservice outstanding requests. Seer uses these traces to train a deep neural network to recognize imminent QoS violations, and identify the microservice(s) that initiated the performance degradation. Once Seer identifies the culprit of a QoS violation that will occur over the next few 100s of milliseconds, it uses detailed per-node hardware monitoring to determine the reason behind the degraded performance, and provide the cluster scheduler with recommendations on actions required to avoid it.

We evaluate Seer both in our local cluster of 20 two-socket servers, and on large-scale clusters on Google Compute Engine (GCE) with a set of end-to-end interactive applications built with microservices, including the Social Network above. In our local cluster, Seer correctly identifies upcoming QoS violations in 93% of cases, and correctly pinpoints the microservice initiating the violation 89% of the time. To combat long inference times as clusters scale, we offload the DNN training and inference to Google's Tensor Processing Units (TPUs) when running on GCE [55]. We additionally experiment with using FPGAs in Seer, via Project Brainwave [25], when running on Windows Azure, and show that both types of acceleration speed up Seer by 200-235x, with the TPU helping the most during training, and vice versa for inference. Accuracy is consistent with the small cluster results.

Finally, we deploy Seer in a large-scale installation of the Social Network service with several hundred users, and show that it not only correctly identifies 90.6% of upcoming QoS violations and avoids 84% of them, but that detecting patterns that create hotspots helps the application's developers improve the service design, resulting in a decreasing number of QoS violations over time. As cloud application and hardware complexity continues to grow, data-driven systems like Seer can offer practical solutions for systems whose scale makes empirical approaches intractable.

2 Related Work
Performance unpredictability is a well-studied problem in public clouds, which stems from platform heterogeneity, resource interference, software bugs, and load variation [23, 32, 34, 35, 38, 53, 58, 61-63, 72, 76, 81]. We now review related work on reducing performance unpredictability in cloud systems, including through scheduling and cluster management, or through online tracing systems.

Cloud management. The prevalence of cloud computing has motivated several cluster management designs. Systems like Mesos [51], Torque [87], Tarcil [38], and Omega [82] all target the problem of resource allocation in large, multi-tenant clusters. Mesos is a two-level scheduler: it has a central coordinator that makes resource offers to application frameworks, and each framework has an individual scheduler that handles its assigned resources. Omega, on the other hand, follows a shared-state approach, where multiple concurrent schedulers can view the whole cluster state, with conflicts being resolved through a transactional mechanism [82]. Tarcil leverages information on the type of resources applications need, to employ a sampling-based distributed scheduler that returns high-quality resources within a few milliseconds [38].

Dejavu identifies a few workload classes and reuses previous allocations for each class, to minimize reallocation overheads [90]. CloudScale [83], PRESS [48], AGILE [70], and the work by Gmach et al. [47] predict future resource needs online, often without a priori knowledge. Finally, auto-scaling systems, such as Rightscale [79], automatically scale the number of physical or virtual instances used by webserving workloads, to accommodate changes in user load.

A second line of work tries to identify resources that will allow a new, potentially-unknown application to meet its performance (throughput or tail latency) requirements [29, 31, 32, 34, 66, 68, 95]. Paragon uses classification to determine the impact of platform heterogeneity and workload interference on an unknown, incoming workload [30, 31]. It then uses this information to achieve predictable performance and high cluster utilization. Paragon assumes that the cluster manager has full control over all resources, which is often not the case in public clouds. Quasar extends the use of data mining in cluster management by additionally determining the appropriate amount of resources for a new application. Nathuji et al. developed a feedback-based scheme that tunes resource assignments to mitigate memory interference [69]. Yang et al. developed an online scheme that detects memory pressure and finds colocations that avoid interference on latency-sensitive workloads [95]. Similarly, DeepDive detects and manages interference between co-scheduled workloads in a VM environment [71].

Finally, CPI2 [98] throttles low-priority workloads that introduce destructive interference to important, latency-critical services, using low-level metrics of performance collected through Google-Wide Profiling (GWP). In terms of managing platform heterogeneity, Nathuji et al. [68] and Mars et al. [64] quantified its impact on conventional benchmarks and Google services, and designed schemes to predict the most appropriate servers for a workload.

Cloud tracing & diagnostics. There is extensive related work on monitoring systems that has shown that execution traces can help diagnose performance, efficiency, and even security problems in large-scale systems [12, 24, 26, 42, 54, 77, 85, 93, 97]. For example, X-Trace is a tracing framework that provides a comprehensive view of the behavior of services running on large-scale, potentially shared clusters. X-Trace supports several protocols and software systems, and has been deployed in several real-world scenarios, including DNS resolution and a photo-hosting site [42]. The Mystery Machine, on the other hand, leverages the massive amount of monitoring data cloud systems collect to determine the causal relationship between different requests [24]. CloudSeer serves a similar purpose, building an automaton for the workflow of each task based on normal execution, and then comparing against this automaton at runtime to determine if the workflow has diverged from its expected behavior [97]. Finally, there are several production systems,

Service                | Communication Protocol | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network         | RPC  | 36 | 34% C, 23% C++, 18% Java, 7% node, 6% Python, 5% Scala, 3% PHP, 2% JS, 2% Go
Media Service          | RPC  | 38 | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% node, 3% Python, 3% JS
E-commerce Site        | REST | 41 | 21% Java, 16% C++, 15% C, 14% Go, 10% JS, 7% node, 5% Scala, 4% HTML, 3% Ruby
Banking System         | RPC  | 28 | 29% C, 25% Javascript, 16% Java, 16% node.js, 11% C++, 3% Python
Hotel Reservations [3] | RPC  | 15 | 89% Go, 7% HTML, 4% Python
Table 1. Characteristics and code composition of each end-to-end microservices-based application.

Figure 3. Dependency graph between the microservices of the end-to-end Social Network application. Client requests arrive over http at an nginx load balancer and front-end (php-fpm over fastcgi), and fan out to logic microservices such as composePost, uniqueID, urlShorten, text, image, video, userTag, writeTimeline, writeGraph, postsStorage, readPost, readTimeline, blockedUsers, login, userInfo, favorite, followUser, search (index0...indexn), ads, and recommender, backed by memcached and mongoDB instances.

Figure 4. Architecture of the hotel reservation site using Go-microservices [3]. Client requests go over http through a load balancer to a node.js front-end, which calls the search, geo, rate, recommender, userProfile, and ads microservices, backed by memcached and mongoDB instances.

including Dapper [85], GWP [78], and Zipkin [8], which provide the tracing infrastructure for large-scale production services at Google and Twitter, respectively. Dapper and Zipkin trace distributed user requests at RPC granularity, while GWP focuses on low-level hardware monitoring diagnostics.

Root cause analysis of performance abnormalities in the cloud has also gained increased attention over the past few years, as the number of interactive, latency-critical services hosted in cloud systems has increased. Jayathilaka et al. [54], for example, developed Roots, a system that automatically identifies the root cause of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots tracks events within the PaaS cloud using a combination of metadata injection and platform-level instrumentation. Weng et al. [91] similarly explore the cloud provider's ability to diagnose the root cause of performance abnormalities in multi-tier applications. Finally, Ouyang et al. [74] focus on the root cause analysis of straggler tasks in distributed programming frameworks like MapReduce and Spark.

Even though this work does not specifically target interactive, latency-critical microservices or applications of similar granularity, such examples provide promising evidence that data-driven performance diagnostics can improve a large-scale system's ability to identify performance anomalies and address them to meet its performance guarantees.

3 End-to-End Applications with Microservices
We motivate and evaluate Seer with a set of new end-to-end, interactive services built with microservices. Even though there are open-source microservices that can serve as components of a larger application, such as nginx [6], memcached [41], MongoDB [5], Xapian [57], and RabbitMQ [4], there are currently no publicly-available end-to-end microservices applications, with the exception of a few simple architectures, like Go-microservices [3] and Sockshop [7]. We design four end-to-end services implementing a Social Network, a Media Service, an E-commerce Site, and a Banking System. Starting from the Go-microservices architecture [3], we also develop an end-to-end Hotel Reservation system. Services are designed to be representative of frameworks used in production systems, modular, and easily reconfigurable. The end-to-end applications and tracing infrastructure are described in more detail and open-sourced in [45].

Table 1 briefly shows the characteristics of each end-to-end application, including its communication protocol, the number of unique microservices it includes, and its breakdown by programming language and framework. Unless otherwise noted, all microservices are deployed in Docker containers. Below, we briefly describe the scope and functionality of each service.

Figure 5. The architecture of the end-to-end E-commerce application implementing an online clothing store. A node.js front-end receives client requests over http through a load balancer and interfaces with microservices for login, search (index0...indexn), catalogue, cart, wishlist, orders, shipping, payment, authorization, transactionID, invoicing, queueMaster/orderQueue, recommender, media, discounts, ads, accountInfo, and socialNet, backed by memcached and mongoDB instances.

3.1 Social Network
Scope. The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.
Functionality. Fig. 3 shows the architecture of the end-to-end service. Users (client) send requests over http, which first reach a load balancer, implemented with nginx, which then selects a specific webserver, also in nginx. Users can create posts embedded with text, media, links, and tags to other users, which are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly, or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [15, 16, 59, 92], a search service using Xapian [57], and microservices that allow users to follow, unfollow, or block other accounts. Inter-microservice messages use Apache Thrift RPCs [1]. The service's backend uses memcached for caching, and MongoDB for persistently storing posts, user profiles, media, and user recommendations. This service is broadly deployed at Cornell and elsewhere, and currently has several hundred users. We use this installation to test the effectiveness and scalability of Seer in Section 6.

3.2 Media Service
Scope. The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [9, 11].
Functionality. As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. The front-end is similar to Social Network, and users can search and browse information about movies, including the plot, photos, videos, and review information, as well as insert a review for a specific movie by logging in to their account. Users can also select to rent a movie, which involves a payment authentication module to verify the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. Movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while reviews are held in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL DB. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which students can browse and review.

3.3 E-Commerce Service
Scope. The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [7].
Functionality. The application front-end here is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine (C++), and microservices for creating wishlists (Java).

3.4 Banking System
Scope. The service implements a secure banking system supporting payments, loans, and credit card management.
Functionality. Users interface with a node.js front-end, similar to E-commerce, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, request a loan, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases are memcached and MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.5 Hotel Reservation Site
Scope. The service is an online hotel reservation site for browsing information about hotels and making reservations.
Functionality. The service is based on the Go-microservices open-source project [3], augmented with backend databases and machine learning widgets for advertisement and hotel recommendations. A client request is first directed to one of the front-end webservers, in node.js, by a load balancer. The front-end then interfaces with the search engine, which allows users to explore hotel availability in a given region and place a reservation. The service back-end consists of memcached and MongoDB instances.

Figure 6. Overview of Seer's operation. Client requests go through a load balancer to the instrumented microservices; per-microservice tracing modules report to a tracing coordinator, which aggregates traces in TraceDB (Cassandra). Seer consumes the traces and the microservices dependency graph, and instructs the cluster manager to adjust resource allocations.

4 Seer Design
4.1 Overview
Fig. 6 shows the high-level architecture of the system. Seer is an online performance debugging system for cloud systems hosting interactive, latency-critical services. Even though we are focusing our analysis on microservices, where the impact of QoS violations is more severe, Seer is also applicable to general cloud services, and traditional multi-tier or Service-Oriented Architecture (SOA) workloads. Seer uses two levels of tracing, shown in Fig. 7.

Figure 7. The two levels of tracing used in Seer: RPC-level tracing signals a QoS violation and identifies the culprit microservice; node-level diagnostics (performance counters and microbenchmarks) identify the culprit resource and notify the cluster manager.

First, it uses a lightweight, distributed RPC-level tracing system, described in Sec. 4.2, which collects end-to-end execution traces for each user request, including per-tier latency and outstanding requests, associates RPCs belonging to the same end-to-end request, and aggregates them in a centralized Cassandra database (TraceDB). From there, traces are used to train Seer to recognize patterns in space (between microservices) and time that lead to QoS violations. At runtime, Seer consumes real-time streaming traces to infer whether there is an imminent QoS violation.

When a QoS violation is expected to occur and a culprit microservice has been located, Seer uses its lower tracing level, which consists of detailed, per-node, low-level hardware monitoring primitives, such as performance counters, to identify the reason behind the QoS violation. It also uses this information to provide the cluster manager with recommendations on how to avoid the performance degradation altogether. When Seer runs on a public cloud where performance counters are disabled, it uses a set of tunable microbenchmarks to determine the source of unpredictable performance (see Sec. 4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing is lightweight enough to

Figure 8. Distributed tracing and instrumentation in Seer. Each microservice (here memcached, between microservices m-1 and m+1) is instrumented at the NIC, TCP/IP RX/TX and packet filtering, epoll (libevent), socket read, and request processing stages; traces are aggregated by the tracing coordinator into TraceDB (Cassandra), and a QueryEngine and WebUI expose per-microservice latency and Gantt charts to the user.

track all active requests and services in the system, and that detailed low-level hardware tracing is only used on-demand, for microservices likely to cause performance disruptions. In the following sections, we describe the design of the tracing system, the learning techniques in Seer, its low-level diagnostic framework, and the system insights we can draw from Seer's decisions to improve cloud application design.
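The end-to-end flow of Figs. 6 and 7 amounts to a monitoring loop; the sketch below is a simplified, hypothetical rendering of that loop, where the callables stand in for TraceDB, the DNN, and the node-level diagnostics plus cluster-manager notification, and are not Seer's actual interfaces.

import time
from typing import Callable, Dict

def seer_loop(predict: Callable[[dict], Dict[str, float]],
              fetch_recent_traces: Callable[[], dict],
              diagnose_and_fix: Callable[[str], None],
              qos_threshold: float = 0.5,
              period_s: float = 0.02,
              iterations: int = 1) -> None:
    """Simplified two-level debugging loop (illustrative only).

    `fetch_recent_traces` returns the latest per-microservice queue depths,
    `predict` maps them to a per-microservice violation probability, and
    `diagnose_and_fix` runs node-level diagnostics (performance counters or
    microbenchmarks) and notifies the cluster manager for one culprit.
    """
    for _ in range(iterations):
        window = fetch_recent_traces()
        scores = predict(window)
        for service, prob in scores.items():
            if prob > qos_threshold:      # imminent violation predicted
                diagnose_and_fix(service)
        time.sleep(period_s)

# Example with dummy callables:
seer_loop(predict=lambda w: {"memcached": 0.9, "nginx": 0.1},
          fetch_recent_traces=lambda: {"memcached": [42], "nginx": [3]},
          diagnose_and_fix=lambda svc: print("diagnosing", svc))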

4.2 Distributed Tracing
A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. We have developed a distributed tracing system for Seer, similar in design to Dapper [85] and Zipkin [8], that records per-microservice latencies using the Thrift timing interface, as shown in Fig. 8. We additionally track the number of requests queued in each microservice (outstanding requests), since queue lengths are highly correlated with performance and QoS violations [46, 49, 56, 96]. In all cases, the overhead from tracing without request sampling is negligible, less than 0.1% on end-to-end latency and less than 0.15% on throughput (QPS), which is tolerable for such systems [24, 78, 85]. Traces from all microservices are aggregated in a centralized database [18].

Instrumentation. The tracing system requires two types of application instrumentation. First, to distinguish between the time spent processing network requests and the time that goes towards application computation, we instrument the application to report the time it sees a new request (post-RPC processing). We similarly instrument the transmit side of the application. Second, systems have multiple sources of queueing, in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues. Fig. 8 shows an example of

Name | LoC  | Layers (FC / Conv / Vect / Total) | Nonlinear Function   | Weights | Batch Size
CNN  | 1456 | - / 8 / - / 8                     | ReLU                 | 30K     | 4
LSTM | 944  | 12 / - / 6 / 18                   | sigmoid, tanh        | 52K     | 32
Seer | 2882 | 10 / 7 / 5 / 22                   | ReLU, sigmoid, tanh  | 80K     | 32
Table 2. The different neural network configurations we explored for Seer.

Seer's instrumentation for memcached. Memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the <k,v> pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.

Limited instrumentation. As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases, Seer relies on the requests queued exclusively in the NIC to signal upcoming QoS violations. In Section 5, we compare the accuracy of the full versus limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations that are caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.

Inferring queue lengths. Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer when the default level of instrumentation is not available.
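As a concrete illustration of the kind of application-level probe described above, the sketch below wraps a request handler to record per-request latency and the number of outstanding requests, buffering records for a trace collector to drain; the record fields and class name are assumptions for illustration, not the actual Thrift-based instrumentation.

import threading
import time
from collections import deque

class RpcTracer:
    """Minimal per-microservice tracer: records latency and queue depth
    (outstanding requests) for each handled RPC. Illustrative sketch only."""

    def __init__(self, service_name, buffer_size=10000):
        self.service = service_name
        self.outstanding = 0                      # software queue depth proxy
        self.lock = threading.Lock()
        self.records = deque(maxlen=buffer_size)  # drained by a trace collector

    def traced(self, handler):
        def wrapper(request_id, *args, **kwargs):
            with self.lock:
                self.outstanding += 1
                depth = self.outstanding
            start = time.monotonic()
            try:
                return handler(request_id, *args, **kwargs)
            finally:
                latency_ms = (time.monotonic() - start) * 1e3
                with self.lock:
                    self.outstanding -= 1
                self.records.append(
                    {"service": self.service, "request": request_id,
                     "queue_depth": depth, "latency_ms": latency_ms})
        return wrapper

# Usage: tracer = RpcTracer("composePost"); handler = tracer.traced(handler)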

4.3 Deep Learning in Performance Debugging

A popular way to model performance in cloud systems, especially when there are dependencies between tasks, is queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations and collected over time to train a deep neural network to signal upcoming QoS violations. Below we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.

Using deep learning. Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.

Configuring the DNN. The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the microservices causing them.

Figure 9. The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers. Each input and output neuron corresponds to a microservice, ordered in topological order, from back-end microservices at the top to front-end microservices at the bottom.

We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.
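For illustration, a training example could be assembled from the traces roughly as follows; the helper below and its argument names are hypothetical, but it reflects the encoding described above (queue depths per microservice in topological order as inputs, a per-microservice violation label as output).

```python
# Illustrative sketch only: assembling one training example from traces.
import numpy as np

def build_example(trace_window, topo_order, culprit=None):
    """trace_window: list of {microservice: queue_depth} snapshots over time.
    topo_order: microservice names sorted topologically (back-end to front-end).
    culprit: microservice that initiated a QoS violation in this window, or None."""
    x = np.array([[snap.get(m, 0) for m in topo_order] for snap in trace_window],
                 dtype=np.float32)                   # shape: (timesteps, microservices)
    y = np.zeros(len(topo_order), dtype=np.float32)  # per-microservice violation label
    if culprit is not None:
        y[topo_order.index(culprit)] = 1.0
    return x, y
```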

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations, and discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.
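A minimal sketch of such a hybrid network, written with TensorFlow's Keras API, is shown below; the layer counts and widths are illustrative placeholders rather than the tuned configuration of Table 2.

```python
# Illustrative sketch only: a hybrid CNN+LSTM with a softmax head, in the spirit of
# Fig. 9. Layer counts and widths are placeholders, not the tuned values of Table 2.
import tensorflow as tf

def build_hybrid_model(num_microservices: int, timesteps: int) -> tf.keras.Model:
    # Input: queue depths per microservice (topologically ordered) over time.
    inputs = tf.keras.Input(shape=(timesteps, num_microservices))
    # Convolutions compress the input and filter out non-critical signal.
    x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv1D(32, kernel_size=3, padding="same", activation="relu")(x)
    # LSTMs capture the temporal patterns that precede QoS violations.
    x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(64)(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    # Softmax over microservices: probability each one initiates the violation.
    outputs = tf.keras.layers.Dense(num_microservices, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_hybrid_model(num_microservices=36, timesteps=10)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```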

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected on a different week, after the servers had been patched and the OS had been upgraded.

Figure 10. Comparison of DNN architectures: QoS violation detection accuracy (CNN: 56.1%, LSTM: 78.79%, Seer (CNN+LSTM): 93.45%) versus inference time (ms).

The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests which are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to the LSTM. Given that most resource partitioning decisions take effect after a few hundred milliseconds, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes, and a network switch failure which caused high packet drops.

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network, first because it has stricter QoS constraints than, e.g., E-commerce, and second due to a synchronous and cyclic communication between three neighboring services that caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.

Retraining Seer. By default, training happens once, and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) change significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored in disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude; a minimal sketch of this flow is shown below. Even though this approach allows Seer to handle application changes almost in real-time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5 we evaluate Seer's ability to adjust its estimations to application changes.
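The sketch below only captures the checkpoint-and-continue idea under our own assumptions; Seer's actual retraining follows the partial-network-sharing transfer learning scheme of [80], and the checkpoint path and epoch count here are hypothetical.

```python
# Illustrative sketch only: checkpoint-and-continue retraining. Seer's actual scheme
# follows the partial-network-sharing approach of [80]; the path and epoch count
# here are hypothetical.
import os
import tensorflow as tf

CKPT = "seer_weights.h5"   # weights from previous training rounds, stored on disk

def retrain_incrementally(model: tf.keras.Model, new_x, new_y):
    if os.path.exists(CKPT):
        model.load_weights(CKPT)   # continue from the previous training round
    model.fit(new_x, new_y, epochs=2, batch_size=32)  # short pass over new traces only
    model.save_weights(CKPT)
    return model
```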

4.4 Hardware Monitoring

Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.

Private cluster. When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.

Public cluster. When Seer does not have access to performance counters, it instead uses a set of 10 tunable, contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark; a sketch of this search is shown below. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
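The search described above can be summarized with the following sketch; run_microbenchmark() and measure_latency() are placeholder hooks into iBench-style microbenchmarks [30] and the tracing system, and the intensity steps and degradation threshold are illustrative.

```python
# Illustrative sketch only: injecting contentious microbenchmarks of increasing
# intensity, core resources first, and reporting the first resource whose injection
# measurably degrades the culprit microservice. run_microbenchmark() and
# measure_latency() are placeholder hooks; steps and threshold are illustrative.
RESOURCES = ["cpu", "l1i", "l1d", "llc", "memory_bw", "network_bw", "storage_io"]

def find_saturated_resource(culprit, run_microbenchmark, measure_latency,
                            degradation_threshold=1.2):
    baseline = measure_latency(culprit)
    for resource in RESOURCES:                     # from core resources outwards
        for intensity in (0.2, 0.4, 0.6, 0.8, 1.0):
            run_microbenchmark(resource, intensity, duration_ms=10)
            if measure_latency(culprit) > degradation_threshold * baseline:
                return resource                    # contention here hurts the culprit
    return None                                    # no resource clearly saturated
```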

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning. The sketch below illustrates the style of these actions.
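For illustration only, actions of this kind could be issued by shelling out to standard tooling as sketched below; the exact flags, device names, and rates are deployment-specific assumptions, and CAT-based LLC partitioning (e.g., via Intel's pqos utility) is omitted.

```python
# Illustrative sketch only: the style of actions described above, issued via
# standard tools. Flags, device names, and rates are deployment-specific
# assumptions; CAT-based LLC partitioning (e.g., via Intel's pqos tool) is omitted.
import subprocess

def resize_container(container: str, cpus: float, mem: str) -> None:
    # Resize a Docker container's CPU and memory allocation in place.
    subprocess.run(["docker", "update", f"--cpus={cpus}", f"--memory={mem}", container],
                   check=True)

def cap_network_bandwidth(dev: str, rate: str) -> None:
    # Linux traffic control: HTB qdisc with a single rate-limited class.
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root",
                    "handle", "1:", "htb", "default", "10"], check=True)
    subprocess.run(["tc", "class", "replace", "dev", dev, "parent", "1:",
                    "classid", "1:10", "htb", "rate", rate], check=True)
```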

4.5 System Insights from Seer

Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance, and improving the application design. This has resulted both in progressively fewer QoS violations over time, and a better understanding of the design challenges of microservices.

4.6 Implementation

Seer is implemented in 12KLOC of C, C++, and Python. It runs on Linux and OS X, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.

Security concerns. Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation

5.1 Methodology

Server clusters. First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each.

Figure 11. Seer's sensitivity to (a) the size of training datasets (100MB-1TB) and (b) the tracing interval.

Each server is connected to a 40Gbps ToR switch over 10GbE NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.

Applications. We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

5.2 Evaluation

Sensitivity to training data. Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.

Sensitivity to tracing frequency. By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work we use 100ms as the interval for measuring queue depths across microservices.

Figure 12. (a) The false negatives and false positives in Seer as we vary the inference window (10ms-2s). (b) Breakdown to causes of QoS violations (CPU, cache, memory, network, disk, application), and comparison with utilization-based detection and systems with limited instrumentation (App-only, Net-only), against the ground truth.

False negatives & false positives. Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.

Comparison of debugging systems. Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC, and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk.

Figure 13. Seer retraining incrementally after each time the Social Network service is updated (top: detection accuracy over time, with impactful and innocuous service updates, incremental retraining, and signaled QoS violations annotated; bottom: tail latency normalized to QoS for Social Network, Movie Reviewing, E-commerce, and Go-microservices).

Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.

Retraining. Fig. 13 shows the detection accuracy for Seer, and the tail latency for each end-to-end service, over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad, using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red x's denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background, and starts recovering its accuracy until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.

6 Large-Scale Cloud Study

6.1 Seer Scalability

We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE), and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager, based on Seer's feedback.

Figure 14. Seer training and inference time with hardware acceleration (native, TPU, Brainwave).

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25].

Figure 15. QoS violations each microservice in Social Network is responsible for.

We offload Seer's DNN logic to both systems, and quantify the impact on training and inference time, and detection accuracy.¹ Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs, and host the Social Network service on GCE.

6.2 Source of QoS Violations

We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute, and to a lesser degree cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming requests is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design

Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them.

¹ Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjust the DNN to the currently-supported designs in Brainwave.

Figure 16. QoS violations per day over the two-month deployment of the Social Network service.

Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.

7 Conclusions

Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements

We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] [n. d.]. Apache Thrift. https://thrift.apache.org
[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture
[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services
[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com
[5] [n. d.]. MongoDB. https://www.mongodb.com
[6] [n. d.]. NGINX. https://nginx.org/en
[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application
[8] [n. d.]. Zipkin. http://zipkin.io
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.
[11] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference

[12] Hyunwook Baek Abhinav Srivastava and Jacobus Van der Merwe[n d] CloudSight A tenant-oriented transparency framework forcross-layer cloud troubleshooting In Proceedings of the 17th IEEEACMInternational Symposium on Cluster Cloud and Grid Computing 2017

[13] Luiz Barroso [n d] Warehouse-Scale Computing Entering theTeenage Decade ISCA Keynote SJ June 2011 ([n d])

[14] Luiz Barroso and Urs Hoelzle 2009 The Datacenter as a Computer AnIntroduction to the Design ofWarehouse-Scale Machines MC Publishers

[15] Robert Bell Yehuda Koren and Chris Volinsky 2007 The BellKor 2008Solution to the Netflix Prize Technical Report

[16] Leon Bottou [n d] Large-Scale Machine Learning with StochasticGradient Descent In Proceedings of the International Conference onComputational Statistics (COMPSTAT) Paris France 2010

[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO
[18] Cassandra. [n. d.]. Apache Cassandra. http://cassandra.apache.org
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat,

Jeremy Fowers Michael Haselman Stephen Heil Matt HumphreyPuneet Kaur Joo-Young Kim Daniel Lo Todd Massengill KalinOvtcharov Michael Papamichael Lisa Woods Sitaram Lanka DerekChiou and Doug Burger 2016 A Cloud-scale Acceleration Architec-ture In MICRO IEEE Press Piscataway NJ USA Article 7 13 pages

[20] Tianshi Chen Zidong Du Ninghui Sun Jia Wang Chengyong WuYunji Chen and Olivier Temam 2014 DianNao a small-footprint high-throughput accelerator for ubiquitous machine-learning In Proc ofthe 19th intl conf on Architectural Support for Programming Languagesand Operating Systems

[21] Yunji Chen Tianshi Chen Zhiwei Xu Ninghui Sun andOlivier Temam2016 DianNao Family Energy-efficient Hardware Accelerators forMachine Learning Commun ACM 59 11 (Oct 2016) 105ndash112 httpsdoiorg1011452996864

[22] Yunji Chen Tao Luo Shaoli Liu Shijin Zhang Liqiang He Jia WangLing Li Tianshi Chen Zhiwei Xu Ninghui Sun and Olivier Temam2014 DaDianNao A Machine-Learning Supercomputer In Proceedingsof the 47th Annual IEEEACM International Symposium on Microarchi-tecture (MICRO-47) IEEE Computer Society Washington DC USA

609ndash622[23] Ludmila Cherkasova Diwaker Gupta and Amin Vahdat 2007 Com-

parison of the Three CPU Schedulers in Xen SIGMETRICS PerformEval Rev 35 2 (Sept 2007) 42ndash51

[24] Michael Chow David Meisner Jason Flinn Daniel Peek and Thomas FWenisch 2014 The Mystery Machine End-to-end Performance Anal-ysis of Large-scale Internet Services In Proceedings of the 11th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo14)USENIX Association Berkeley CA USA 217ndash231

[25] Eric S Chung Jeremy Fowers Kalin Ovtcharov Michael PapamichaelAdrian M Caulfield Todd Massengill Ming Liu Daniel Lo ShlomiAlkalay Michael Haselman Maleen Abeydeera Logan Adams HariAngepat Christian Boehn Derek Chiou Oren Firestein AlessandroForin Kang Su Gatlin Mahdi Ghandi Stephen Heil Kyle HolohanAhmad El Husseini Tamaacutes Juhaacutesz Kara Kagi Ratna Kovvuri SitaramLanka Friedel van Megen Dima Mukhortov Prerak Patel BrandonPerez Amanda Rapsang Steven K Reinhardt Bita Rouhani AdamSapek Raja Seera Sangeetha Shekar Balaji Sridharan Gabriel WeiszLisaWoods Phillip Yi Xiao Dan Zhang Ritchie Zhao and Doug Burger2018 Serving DNNs in Real Time at Datacenter Scale with ProjectBrainwave IEEE Micro 38 2 (2018) 8ndash20

[26] Guilherme Da Cunha Rodrigues Rodrigo N Calheiros Vini-cius Tavares Guimaraes Glederson Lessa dos Santos Maacutercio Barbosade Carvalho Lisandro Zambenedetti Granville Liane Margarida Rock-enbach Tarouco and Rajkumar Buyya 2016 Monitoring of CloudComputing Environments Concepts Solutions Trends and Future Di-rections In Proceedings of the 31st Annual ACM Symposium on AppliedComputing (SAC rsquo16) ACM New York NY USA 378ndash383

[27] Marwan Darwish Abdelkader Ouda and Luiz Fernando Capretz [nd] Cloud-based DDoS attacks and defenses In Proc of i-SocietyToronto ON 2013

[28] Jeffrey Dean and Luiz Andre Barroso [n d] The Tail at Scale InCACM Vol 56 No 2

[29] Christina Delimitrou Nick Bambos and Christos Kozyrakis [n d]QoS-Aware Admission Control in Heterogeneous Datacenters In Pro-ceedings of the International Conference of Autonomic Computing (ICAC)San Jose CA USA 2013

[30] Christina Delimitrou and Christos Kozyrakis [n d] iBench Quanti-fying Interference for Datacenter Workloads In Proceedings of the 2013IEEE International Symposium on Workload Characterization (IISWC)Portland OR September 2013

[31] Christina Delimitrou and Christos Kozyrakis [n d] Paragon QoS-Aware Scheduling for Heterogeneous Datacenters In Proceedings ofthe Eighteenth International Conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS) HoustonTX USA 2013

[32] Christina Delimitrou and Christos Kozyrakis [n d] QoS-Aware Sched-uling in Heterogeneous Datacenters with Paragon In ACM Transac-tions on Computer Systems (TOCS) Vol 31 Issue 4 December 2013

[33] Christina Delimitrou and Christos Kozyrakis [n d] Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters withParagon In IEEE Micro Special Issue on Top Picks from the ComputerArchitecture Conferences MayJune 2014

[34] Christina Delimitrou and Christos Kozyrakis [n d] Quasar Resource-Efficient and QoS-Aware Cluster Management In Proc of ASPLOS SaltLake City 2014

[35] Christina Delimitrou and Christos Kozyrakis 2016 HCloud Resource-Efficient Provisioning in Shared Cloud Systems In Proceedings of theTwenty First International Conference on Architectural Support for Pro-gramming Languages and Operating Systems (ASPLOS)

[36] Christina Delimitrou and Christos Kozyrakis 2017 Bolt I KnowWhatYou Did Last Summer In The Cloud In Proc of the Twenty SecondInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS)

[37] Christina Delimitrou and Christos Kozyrakis 2018 Amdahlrsquos Law forTail Latency In Communications of the ACM (CACM)

[38] Christina Delimitrou Daniel Sanchez and Christos Kozyrakis 2015Tarcil Reconciling Scheduling Speed and Quality in Large Shared Clus-ters In Proceedings of the Sixth ACM Symposium on Cloud Computing(SOCC)

[39] Zidong Du Robert Fasthuber Tianshi Chen Paolo Ienne Ling Li TaoLuo Xiaobing Feng Yunji Chen and Olivier Temam 2015 ShiDi-anNao Shifting Vision Processing Closer to the Sensor In Proceed-ings of the 42Nd Annual International Symposium on Computer Ar-chitecture (ISCA rsquo15) ACM New York NY USA 92ndash104 httpsdoiorg10114527494692750389

[40] Daniel Firestone Andrew Putnam SambhramaMundkur Derek ChiouAlireza Dabagh Mike Andrewartha Hari Angepat Vivek BhanuAdrian Caulfield Eric Chung Harish Kumar Chandrappa SomeshChaturmohta Matt Humphrey Jack Lavier Norman Lam FengfenLiu Kalin Ovtcharov Jitu Padhye Gautham Popuri Shachar RaindelTejas Sapre Mark Shaw Gabriel Silva Madhan Sivakumar NisheethSrivastava Anshuman Verma Qasim Zuhair Deepak Bansal DougBurger Kushagra Vaid David A Maltz and Albert Greenberg 2018Azure Accelerated Networking SmartNICs in the Public Cloud In 15thUSENIX Symposium on Networked Systems Design and Implementation(NSDI 18) USENIX Association Renton WA 51ndash66

[41] Brad Fitzpatrick [n d] Distributed caching with memcached InLinux Journal Volume 2004 Issue 124 2004

[42] Rodrigo Fonseca George Porter Randy H Katz Scott Shenker andIon Stoica 2007 X-trace A Pervasive Network Tracing Framework InProceedings of the 4th USENIX Conference on Networked Systems Designamp Implementation (NSDIrsquo07) USENIX Association Berkeley CA USA20ndash20

[43] Yu Gan and Christina Delimitrou 2018 The Architectural Implicationsof Cloud Microservices In Computer Architecture Letters (CAL) vol17iss 2

[44] Yu Gan Meghna Pancholi Dailun Cheng Siyuan Hu Yuan He andChristina Delimitrou 2018 Seer Leveraging Big Data to Navigate theComplexity of Cloud Debugging In Proceedings of the Tenth USENIXWorkshop on Hot Topics in Cloud Computing (HotCloud)

[45] Yu Gan Yanqi Zhang Dailun Cheng Ankitha Shetty Priyal RathiNayan Katarki Ariana Bruno Justin Hu Brian Ritchken BrendonJackson Kelvin Hu Meghna Pancholi Yuan He Brett Clancy ChrisColen Fukang Wen Catherine Leung Siyuan Wang Leon ZaruvinskyMateo Espinosa Rick Lin Zhongling Liu Jake Padilla and ChristinaDelimitrou 2019 An Open-Source Benchmark Suite for Microser-vices and Their Hardware-Software Implications for Cloud and EdgeSystems In Proceedings of the Twenty Fourth International Conferenceon Architectural Support for Programming Languages and OperatingSystems (ASPLOS)

[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P Jouppi Cliff Young Nishant Patil David Patterson Gau-rav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Bo-den Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao ChrisClark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben GelbTara Vazir Ghaemmaghami Rajendra Gottipati William GullandRobert Hagmann C Richard Ho Doug Hogberg John Hu RobertHundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexan-der Kaplan Harshit Khaitan Daniel Killebrew Andy Koch NaveenKumar Steve Lacy James Laudon James Law Diemthu Le ChrisLeary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adri-ana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan RaviNarayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omer-nick Narayana Penukonda Andy Phelps Jonathan Ross Matt RossAmir Salek Emad Samadiani Chris Severn Gregory Sizikov MatthewSnelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gre-gory Thorson Bo Tian Horia Toma Erick Tuttle Vijay VasudevanRichard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon 2017In-Datacenter Performance Analysis of a Tensor Processing Unit InProceedings of the 44th Annual International Symposium on ComputerArchitecture (ISCA rsquo17) ACM New York NY USA 1ndash12

[56] Harshad Kasture and Daniel Sanchez 2014 Ubik Efficient CacheSharing with Strict QoS for Latency-Critical Workloads In Proceed-ings of the 19th international conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS-XIX)

[57] Harshad Kasture and Daniel Sanchez 2016 TailBench A BenchmarkSuite and Evaluation Methodology for Latency-Critical ApplicationsIn Proc of IISWC

[58] Yaakoub El Khamra Hyunjoo Kim Shantenu Jha andManish Parashar[n d] Exploring the performance fluctuations of hpc workloads onclouds In Proceedings of CloudCom Indianapolis IN 2010

[59] Krzysztof C Kiwiel [n d] Convergence and efficiency of subgradientmethods for quasiconvex minimization InMathematical Programming(Series A) (Berlin Heidelberg Springer) 90 (1) pp 1-25 2001

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz Andreacute Barroso andChristos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st AnnualInternational Symposium on Computer Architecuture (ISCA) Minneapo-lis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot. [n. d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars Lingjia Tang and Robert Hundt 2011 Heterogeneity inldquoHomogeneousrdquo Warehouse-Scale Computers A Performance Oppor-tunity IEEE Comput Archit Lett 10 2 (July 2011) 4

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz Andreacute Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power managementof online data-intensive services In Proceedings of the 38th annualinternational symposium on Computer architecture 319ndash330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65-79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68

[79] rightscale. [n. d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale

[80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.

[81] Joumlrg Schad Jens Dittrich and Jorge-Arnulfo Quianeacute-Ruiz 2010 Run-time Measurements in the Cloud Observing Analyzing and ReducingVariance Proceedings VLDB Endow 3 1-2 (Sept 2010) 460ndash471

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report, Google Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu Xinxin Jin Peng Huang Yuanyuan Zhou Shan Lu LongJin and Shankar Pasupathy 2016 Early Detection of ConfigurationErrors to Reduce Failure Damage In Proceedings of the 12th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 619ndash634

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu Albert Greenberg Dave Maltz Jennifer Rexford LihuaYuan Srikanth Kandula and Changhoon Kim [n d] Profiling Net-work Performance for Multi-tier Data Center Applications In Proceed-ings of NSDI Boston MA 2011 14

[97] Xiao Yu Pallavi Joshi Jianwu Xu Guoliang Jin Hui Zhang and GuofeiJiang 2016 CloudSeer Workflow Monitoring of Cloud Infrastructuresvia Interleaved Logs In Proceedings of APSLOS ACM New York NYUSA 489ndash502

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G Edward Suh [n d] FPGA-Based Remote PowerSide-Channel Attacks In Proceedings of the IEEE Symposium on Securityand Privacy May 2018


of a QoS violation that will occur over the next few 100s of milliseconds, it uses detailed per-node hardware monitoring to determine the reason behind the degraded performance, and provide the cluster scheduler with recommendations on actions required to avoid it.

We evaluate Seer both in our local cluster of 20 two-socket servers, and on large-scale clusters on Google Compute Engine (GCE), with a set of end-to-end interactive applications built with microservices, including the Social Network above. In our local cluster, Seer correctly identifies upcoming QoS violations in 93% of cases, and correctly pinpoints the microservice initiating the violation 89% of the time. To combat long inference times as clusters scale, we offload the DNN training and inference to Google's Tensor Processing Units (TPUs) when running on GCE [55]. We additionally experiment with using FPGAs in Seer, via Project Brainwave [25], when running on Windows Azure, and show that both types of acceleration speed up Seer by 200-235x, with the TPU helping the most during training, and vice versa for inference. Accuracy is consistent with the small cluster results.

Finally, we deploy Seer in a large-scale installation of the Social Network service with several hundred users, and show that it not only correctly identifies 90.6% of upcoming QoS violations and avoids 84% of them, but that detecting patterns that create hotspots helps the application's developers improve the service design, resulting in a decreasing number of QoS violations over time. As cloud application and hardware complexity continues to grow, data-driven systems like Seer can offer practical solutions for systems whose scale make empirical approaches intractable.

2 Related Work

Performance unpredictability is a well-studied problem in public clouds, that stems from platform heterogeneity, resource interference, software bugs, and load variation [23, 32, 34, 35, 38, 53, 58, 61-63, 72, 76, 81]. We now review related work on reducing performance unpredictability in cloud systems, including through scheduling and cluster management, or through online tracing systems.

Cloud management. The prevalence of cloud computing has motivated several cluster management designs. Systems like Mesos [51], Torque [87], Tarcil [38], and Omega [82] all target the problem of resource allocation in large, multi-tenant clusters. Mesos is a two-level scheduler. It has a central coordinator that makes resource offers to application frameworks, and each framework has an individual scheduler that handles its assigned resources. Omega, on the other hand, follows a shared-state approach, where multiple concurrent schedulers can view the whole cluster state, with conflicts being resolved through a transactional mechanism [82]. Tarcil leverages information on the type of resources applications need to employ a sampling-based distributed scheduler that returns high quality resources within a few milliseconds [38].

Dejavu identifies a few workload classes and reuses previous allocations for each class, to minimize reallocation overheads [90]. CloudScale [83], PRESS [48], AGILE [70], and the work by Gmach et al. [47] predict future resource needs online, often without a priori knowledge. Finally, auto-scaling systems, such as Rightscale [79], automatically scale the number of physical or virtual instances used by webserving workloads, to accommodate changes in user load.

A second line of work tries to identify resources that will allow a new, potentially-unknown application to meet its performance (throughput or tail latency) requirements [29, 31, 32, 34, 66, 68, 95]. Paragon uses classification to determine the impact of platform heterogeneity and workload interference on an unknown, incoming workload [30, 31]. It then uses this information to achieve predictable performance and high cluster utilization. Paragon assumes that the cluster manager has full control over all resources, which is often not the case in public clouds. Quasar extends the use of data mining in cluster management by additionally determining the appropriate amount of resources for a new application. Nathuji et al. developed a feedback-based scheme that tunes resource assignments to mitigate memory interference [69]. Yang et al. developed an online scheme that detects memory pressure and finds colocations that avoid interference on latency-sensitive workloads [95]. Similarly, DeepDive detects and manages interference between co-scheduled workloads in a VM environment [71].

Finally, CPI2 [98] throttles low-priority workloads that introduce destructive interference to important, latency-critical services, using low-level metrics of performance collected through Google-Wide Profiling (GWP). In terms of managing platform heterogeneity, Nathuji et al. [68] and Mars et al. [64] quantified its impact on conventional benchmarks and Google services, and designed schemes to predict the most appropriate servers for a workload.

Cloud tracing & diagnostics. There is extensive related work on monitoring systems that has shown that execution traces can help diagnose performance, efficiency, and even security problems in large-scale systems [12, 24, 26, 42, 54, 77, 85, 93, 97]. For example, X-Trace is a tracing framework that provides a comprehensive view of the behavior of services running on large-scale, potentially shared clusters. X-Trace supports several protocols and software systems, and has been deployed in several real-world scenarios, including DNS resolution and a photo-hosting site [42]. The Mystery Machine, on the other hand, leverages the massive amount of monitoring data cloud systems collect to determine the causal relationship between different requests [24]. Cloudseer serves a similar purpose, building an automaton for the workflow of each task based on normal execution, and then compares against this automaton at runtime to determine if the workflow has diverged from its expected behavior [97].

Service | Communication Protocol | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network | RPC | 36 | 34% C, 23% C++, 18% Java, 7% node, 6% Python, 5% Scala, 3% PHP, 2% JS, 2% Go
Media Service | RPC | 38 | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% node, 3% Python, 3% JS
E-commerce Site | REST | 41 | 21% Java, 16% C++, 15% C, 14% Go, 10% JS, 7% node, 5% Scala, 4% HTML, 3% Ruby
Banking System | RPC | 28 | 29% C, 25% Javascript, 16% Java, 16% nodejs, 11% C++, 3% Python
Hotel Reservations [3] | RPC | 15 | 89% Go, 7% HTML, 4% Python
Table 1. Characteristics and code composition of each end-to-end microservices-based application.

Figure 3. Dependency graph between the microservices of the end-to-end Social Network application.

Figure 4. Architecture of the hotel reservation site using Go-microservices [3].

Finally, there are several production systems, including Dapper [85], GWP [78], and Zipkin [8], which provide the tracing infrastructure for large-scale production services at Google and Twitter, respectively. Dapper and Zipkin trace distributed user requests at RPC granularity, while GWP focuses on low-level hardware monitoring diagnostics.

Root cause analysis of performance abnormalities in the cloud has also gained increased attention over the past few years, as the number of interactive, latency-critical services hosted in cloud systems has increased. Jayathilaka et al. [54], for example, developed Roots, a system that automatically identifies the root cause of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots tracks events within the PaaS cloud using a combination of metadata injection and platform-level instrumentation. Weng et al. [91] similarly explore the cloud provider's ability to diagnose the root cause of performance abnormalities in multi-tier applications. Finally, Ouyang et al. [74] focus on the root cause analysis of straggler tasks in distributed programming frameworks like MapReduce and Spark.

Even though this work does not specifically target interactive, latency-critical microservices or applications of similar granularity, such examples provide promising evidence that data-driven performance diagnostics can improve a large-scale system's ability to identify performance anomalies and address them to meet its performance guarantees.

3 End-to-End Applications with Microservices

We motivate and evaluate Seer with a set of new end-to-end, interactive services built with microservices. Even though there are open-source microservices that can serve as components of a larger application, such as nginx [6], memcached [41], MongoDB [5], Xapian [57], and RabbitMQ [4], there are currently no publicly-available end-to-end microservices applications, with the exception of a few simple architectures, like Go-microservices [3] and Sockshop [7]. We design four end-to-end services implementing a Social Network, a Media Service, an E-commerce Site, and a Banking System. Starting from the Go-microservices architecture [3], we also develop an end-to-end Hotel Reservation system. Services are designed to be representative of frameworks used in production systems, modular, and easily reconfigurable. The end-to-end applications and tracing infrastructure are described in more detail and open-sourced in [45].

Table 1 briefly shows the characteristics of each end-to-end application, including its communication protocol, the number of unique microservices it includes, and its breakdown by programming language and framework. Unless otherwise noted, all microservices are deployed in Docker containers. Below, we briefly describe the scope and functionality of each service.

Figure 5. The architecture of the end-to-end E-commerce application, implementing an online clothing store.

3.1 Social Network
Scope. The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.
Functionality. Fig. 3 shows the architecture of the end-to-end service. Users (clients) send requests over http, which first reach a load balancer, implemented with nginx, which then selects a specific webserver, also in nginx. Users can create posts embedded with text, media, links, and tags to other users, which are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [15, 16, 59, 92], a search service using Xapian [57], and microservices that allow users to follow, unfollow, or block other accounts. Inter-microservice messages use Apache Thrift RPCs [1]. The service's backend uses memcached for caching, and MongoDB for persistently storing posts, user profiles, media, and user recommendations. This service is broadly deployed at Cornell and elsewhere, and currently has several hundred users. We use this installation to test the effectiveness and scalability of Seer in Section 6.

3.2 Media Service
Scope. The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [9, 11].
Functionality. As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. The front-end is similar to Social Network, and users can search and browse information about movies, including the plot, photos, videos, and review information, as well as insert a review for a specific movie by logging in to their account. Users can also select to rent a movie, which involves a payment authentication module to verify the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. Movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while reviews are held in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL DB. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which students can browse and review.

3.3 E-Commerce Service
Scope. The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [7].
Functionality. The application front-end here is a nodejs service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine (C++), and microservices for creating wishlists (Java).

3.4 Banking System
Scope. The service implements a secure banking system, supporting payments, loans, and credit card management.
Functionality. Users interface with a nodejs front-end, similar to E-commerce, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, request a loan, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases are memcached and MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.5 Hotel Reservation Site
Scope. The service is an online hotel reservation site for browsing information about hotels and making reservations.
Functionality. The service is based on the Go-microservices open-source project [3], augmented with backend databases and machine learning widgets for advertisement and hotel recommendations. A client request is first directed to one of the front-end webservers, in nodejs, by a load balancer. The front-end then interfaces with the search engine, which allows users to explore hotel availability in a given region, and place a reservation. The service back-end consists of memcached and MongoDB instances.

Figure 6. Overview of Seer's operation.

4 Seer Design
4.1 Overview
Fig. 6 shows the high-level architecture of the system. Seer is an online performance debugging system for cloud systems hosting interactive, latency-critical services. Even though we are focusing our analysis on microservices, where the impact of QoS violations is more severe, Seer is also applicable to general cloud services and traditional multi-tier or Service-Oriented Architecture (SOA) workloads. Seer uses two levels of tracing, shown in Fig. 7.

Figure 7. The two levels of tracing used in Seer.

First, it uses a lightweight, distributed RPC-level tracing system, described in Sec. 4.2, which collects end-to-end execution traces for each user request, including per-tier latency and outstanding requests, associates RPCs belonging to the same end-to-end request, and aggregates them to a centralized Cassandra database (TraceDB). From there, traces are used to train Seer to recognize patterns in space (between microservices) and time that lead to QoS violations. At runtime, Seer consumes real-time streaming traces to infer whether there is an imminent QoS violation.
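To make the kind of record TraceDB stores more concrete, the sketch below creates a minimal Cassandra table for RPC-level spans using the standard Python driver. The keyspace, table, and column names are our own illustrative assumptions, not Seer's actual schema.

```python
# Minimal sketch of an RPC-level trace table in Cassandra (TraceDB).
# Keyspace, table, and column names are illustrative assumptions.
from cassandra.cluster import Cluster

session = Cluster(["tracedb-host"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS seer_traces
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS seer_traces.rpc_spans (
        request_id uuid,          -- end-to-end user request this RPC belongs to
        microservice text,        -- tier that served the RPC
        start_us bigint,          -- receive timestamp in microseconds
        latency_us bigint,        -- per-tier processing latency
        outstanding int,          -- queued (outstanding) requests at this tier
        PRIMARY KEY (request_id, microservice, start_us)
    )
""")
```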

When a QoS violation is expected to occur and a culprit microservice has been located, Seer uses its lower tracing level, which consists of detailed, per-node, low-level hardware monitoring primitives, such as performance counters, to identify the reason behind the QoS violation. It also uses this information to provide the cluster manager with recommendations on how to avoid the performance degradation altogether. When Seer runs on a public cloud where performance counters are disabled, it uses a set of tunable microbenchmarks to determine the source of unpredictable performance (see Sec. 4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing is lightweight enough to track all active requests and services in the system, and that detailed low-level hardware tracing is only used on demand, for microservices likely to cause performance disruptions. In the following sections we describe the design of the tracing system, the learning techniques in Seer and its low-level diagnostic framework, and the system insights we can draw from Seer's decisions to improve cloud application design.

Figure 8. Distributed tracing and instrumentation in Seer.

4.2 Distributed Tracing
A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. We have developed a distributed tracing system for Seer, similar in design to Dapper [85] and Zipkin [8], that records per-microservice latencies using the Thrift timing interface, as shown in Fig. 8. We additionally track the number of requests queued in each microservice (outstanding requests), since queue lengths are highly correlated with performance and QoS violations [46, 49, 56, 96]. In all cases the overhead from tracing without request sampling is negligible, less than 0.1% on end-to-end latency and less than 0.15% on throughput (QPS), which is tolerable for such systems [24, 78, 85]. Traces from all microservices are aggregated in a centralized database [18].
Instrumentation. The tracing system requires two types of application instrumentation. First, to distinguish between the time spent processing network requests and the time that goes towards application computation, we instrument the application to report the time it sees a new request (post-RPC processing). We similarly instrument the transmit side of the application. Second, systems have multiple sources of queueing in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues. Fig. 8 shows an example of Seer's instrumentation for memcached.

Name | LoC | Layers: FC | Conv | Vect | Total | Nonlinear Function | Weights | Batch Size
CNN | 1456 | – | 8 | – | 8 | ReLU | 30K | 4
LSTM | 944 | 12 | – | 6 | 18 | sigmoid, tanh | 52K | 32
Seer | 2882 | 10 | 7 | 5 | 22 | ReLU, sigmoid, tanh | 80K | 32
Table 2. The different neural network configurations we explored for Seer.

Memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the <k,v> pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.
Limited instrumentation. As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases, Seer relies on the requests queued exclusively in the NIC to signal upcoming QoS violations. In Section 5 we compare the accuracy of the full versus limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations that are caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.
Inferring queue lengths. Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer, when the default level of instrumentation is not available.
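To make the software-queue probes concrete, the sketch below shows one way a microservice could export an outstanding-request counter that the tracing module samples periodically. The class and method names are hypothetical illustrations, not taken from Seer's instrumentation patches.

```python
# Hedged sketch of a software queue-depth probe: the service increments the
# counter when a request is admitted and decrements it when the response is
# sent, so the tracing module can sample outstanding requests at any time.
# Names (QueueDepthProbe, sample) are illustrative, not Seer's actual API.
import threading

class QueueDepthProbe:
    def __init__(self):
        self._outstanding = 0
        self._lock = threading.Lock()

    def on_request_received(self):
        with self._lock:
            self._outstanding += 1

    def on_response_sent(self):
        with self._lock:
            self._outstanding -= 1

    def sample(self):
        """Called by the tracing module (e.g., every 100ms) to read queue depth."""
        with self._lock:
            return self._outstanding
```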

4.3 Deep Learning in Performance Debugging
A popular way to model performance in cloud systems, especially when there are dependencies between tasks, is queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations and collected over time to train a deep neural network to signal upcoming QoS violations. Below we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.
Using deep learning. Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.
Configuring the DNN. The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the microservices causing them.

Figure 9. The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers. Each input and output neuron corresponds to a microservice, ordered in topological order, from back-end microservices at the top to front-end microservices at the bottom.

We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence, we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations and discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected on a different week, after the servers had been patched and the OS had been upgraded.
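As a concrete illustration of the hybrid design, the following is a minimal tf.keras sketch of a CNN-then-LSTM network over per-microservice queue depths. The layer sizes, window length, and number of services are our own assumptions for illustration and do not reproduce the exact configuration of Table 2.

```python
# Minimal sketch of a hybrid CNN+LSTM over queue-depth snapshots.
# NUM_SERVICES and WINDOW, as well as all layer sizes, are assumptions.
import tensorflow as tf

NUM_SERVICES = 36   # active microservices, ordered topologically (assumption)
WINDOW = 10         # consecutive queue-depth snapshots per input (assumption)

inputs = tf.keras.Input(shape=(WINDOW, NUM_SERVICES, 1))
# CNN applied per snapshot over the microservice dimension, to reduce
# dimensionality and filter out services off the critical path.
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"))(inputs)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling1D(pool_size=2))(x)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
# LSTM layers capture temporal patterns across consecutive snapshots.
x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
x = tf.keras.layers.LSTM(128)(x)
x = tf.keras.layers.Dropout(0.2)(x)
# SoftMax output: per-microservice probability of initiating a QoS violation.
outputs = tf.keras.layers.Dense(NUM_SERVICES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```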

Figure 10. Comparison of DNN architectures: QoS violation detection accuracy (56.1% for CNN, 78.79% for LSTM, 93.45% for Seer (CNN+LSTM)) versus inference time (ms).

The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests which are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to LSTM. Given that most resource partitioning decisions take effect after a few 100ms, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes, and a network switch failure which caused high packet drops.

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network, first because it has stricter QoS constraints than, e.g., E-commerce, and second, due to a synchronous and cyclic communication between three neighboring services that caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.

Retraining Seer. By default, training happens once, and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored in disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real-time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5 we evaluate Seer's ability to adjust its estimations to application changes.
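A minimal sketch of the checkpoint-and-continue idea follows, assuming the tf.keras model from the earlier sketch; the checkpoint path and epoch count are arbitrary and not Seer's actual settings.

```python
# Hedged sketch of incremental retraining: weights from previous rounds are
# persisted on disk, and training resumes from them when new traces arrive,
# instead of restarting from random initialization.
import os

CKPT = "seer_weights.h5"   # illustrative path, not Seer's actual layout

def incremental_retrain(model, new_traces, new_labels):
    if os.path.exists(CKPT):
        model.load_weights(CKPT)           # continue from where training left off
    model.fit(new_traces, new_labels, epochs=2, batch_size=32, verbose=0)
    model.save_weights(CKPT)               # checkpoint for the next update
    return model
```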

4.4 Hardware Monitoring
Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.
Private cluster. When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.
Public cluster. When Seer does not have access to performance counters, it instead uses a set of 10 tunable, contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system, and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
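The sketch below outlines this injection loop under our own assumptions: run_microbenchmark and measure_tail_latency are hypothetical callables standing in for the tunable contentious benchmarks of [30] and the tracing system's latency measurements, and the resource list and thresholds are illustrative.

```python
# Hedged sketch of microbenchmark-based diagnosis on a public cloud.
# run_microbenchmark(resource, intensity) and measure_tail_latency(service)
# are hypothetical callables, not part of Seer's actual code.

# Ordered from core resources outward, as described above (list is illustrative).
RESOURCES = ["cpu", "l1_cache", "l2_cache", "llc", "memory_bw",
             "memory_capacity", "network_bw", "disk_io"]

def find_saturated_resource(culprit, slo_us, run_microbenchmark, measure_tail_latency):
    baseline_us = measure_tail_latency(culprit)
    for resource in RESOURCES:
        for intensity in (10, 30, 50, 70, 90):       # percent of max pressure
            run_microbenchmark(resource, intensity)   # each run takes ~10ms
            latency_us = measure_tail_latency(culprit)
            # A substantial change in the culprit's tail latency suggests this
            # resource is the saturated one.
            if latency_us > 1.1 * baseline_us or latency_us > slo_us:
                return resource
    return None                                       # no single resource implicated
```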

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.
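For the network-bandwidth case, the sketch below shows how a cluster manager could apply an HTB rate limit with the standard Linux tc tool from Python. The device name, class ids, and rates are illustrative assumptions, and this is a sketch of the general mechanism, not Seer's or the cluster manager's actual code.

```python
# Hedged sketch: throttling egress bandwidth with the Linux tc HTB qdisc.
# Device, class ids, and rates are illustrative; requires root privileges.
import subprocess

def limit_egress(dev="eth0", rate="500mbit", ceil="1gbit", dst_ip=None):
    # Root qdisc with a default class (idempotence and error handling omitted).
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "root",
                    "handle", "1:", "htb", "default", "10"], check=False)
    # HTB class carrying the rate limit.
    subprocess.run(["tc", "class", "add", "dev", dev, "parent", "1:",
                    "classid", "1:10", "htb", "rate", rate, "ceil", ceil],
                   check=True)
    if dst_ip:
        # Steer traffic to the throttled class based on destination IP.
        subprocess.run(["tc", "filter", "add", "dev", dev, "protocol", "ip",
                        "parent", "1:0", "prio", "1", "u32",
                        "match", "ip", "dst", dst_ip, "flowid", "1:10"],
                       check=True)
```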

4.5 System Insights from Seer
Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance and improving the application design. This has resulted both in progressively fewer QoS violations over time, and a better understanding of the design challenges of microservices.

4.6 Implementation
Seer is implemented in 12KLOC of C/C++ and Python. It runs on Linux and OSX, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.
Security concerns. Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation
5.1 Methodology
Server clusters. First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each.

Figure 11. Seer's sensitivity to (a) the size of training datasets and (b) the tracing interval.

Each server is connected to a 40Gbps ToR switch over 10GbE NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers, to study the scalability of Seer.
Applications. We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

5.2 Evaluation
Sensitivity to training data. Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point, accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.
Sensitivity to tracing frequency. By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work we use 100ms as the interval for measuring queue depths across microservices.
False negatives & false positives. Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window.

Figure 12. (a) The false negatives and false positives in Seer as we vary the inference window. (b) Breakdown to causes of QoS violations, and comparison with utilization-based detection and systems with limited instrumentation.

When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.
Comparison of debugging systems. Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC, and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.
A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk.

Figure 13. Seer retraining incrementally after each time the Social Network service is updated.

Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.
Retraining. Fig. 13 shows the detection accuracy for Seer, and the tail latency for each end-to-end service, over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the nodejs front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time, and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red × markers denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background, and starts recovering its accuracy until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.

6 Large-Scale Cloud Study
6.1 Seer Scalability
We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE), and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager, based on Seer's feedback.

Figure 14. Seer training and inference with hardware acceleration.

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems and quantify the impact on training and inference time, and detection accuracy.¹ Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs, and host the Social Network service on GCE.

Figure 15. QoS violations each microservice in Social Network is responsible for.

6.2 Source of QoS Violations
We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute, and to a lesser degree cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming requests is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design
Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them.
¹Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjust the DNN to the currently-supported designs in Brainwave.

Figure 16. QoS violations per day over the two-month deployment of the Social Network service.

Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.

7 Conclusions
Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements
We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMWare Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] [n.d.]. Apache Thrift. https://thrift.apache.org.
[2] [n.d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture.
[3] [n.d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services.
[4] [n.d.]. Messaging that just works. https://www.rabbitmq.com.
[5] [n.d.]. MongoDB. https://www.mongodb.com.
[6] [n.d.]. NGINX. https://nginx.org/en.
[7] [n.d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application.
[8] [n.d.]. Zipkin. http://zipkin.io.
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference.
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n.d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.

[11] Adrian Cockroft. [n.d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference.
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. [n.d.]. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. [n.d.]. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, SJ, June 2011.
[14] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[16] Leon Bottou. [n.d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), Paris, France, 2010.
[17] Martin A. Brown. [n.d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO.
[18] Cassandra. [n.d.]. Apache Cassandra. http://cassandra.apache.org.
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.

[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th intl. conf. on Architectural Support for Programming Languages and Operating Systems.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864.
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 217–231.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.

[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. 2016. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 378–383.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. [n.d.]. Cloud-based DDoS attacks and defenses. In Proc. of i-Society, Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. [n.d.]. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. [n.d.]. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference of Autonomic Computing (ICAC), San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n.d.]. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n.d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. [n.d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4, December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. [n.d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences, May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. [n.d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS, Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[36] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).

[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/2749469.2750389.
[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66.
[41] Brad Fitzpatrick. [n.d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-Trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI '07). USENIX Association, Berkeley, CA, USA, 20–20.
[43] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. 2018. Seer: Leveraging Big Data to Navigate the Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. [n.d.]. Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization. In Proc. of NSDI, 2018.
[47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. [n.d.]. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Proceedings of IISWC, Boston, MA, 2007. https://doi.org/10.1109/IISWC.2007.4362193.

[48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. [n.d.]. PRESS: PRedictive Elastic ReSource Scaling for cloud systems. In Proceedings of CNSM, Niagara Falls, ON, 2010.
[49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. [n.d.]. Fundamentals of Queueing Theory. In Wiley Series in Probability and Statistics, Book 627, 2011.
[50] Sanchika Gupta and Padam Kumar. [n.d.]. VM Profile Based Optimized Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. In Proc. of SSCC, Mysore, India, 2013.
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n.d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI, Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. [n.d.]. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data, Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. [n.d.]. On the Performance Variability of Production Cloud Services. In Proceedings of CCGRID, Newport Beach, CA, 2011.
[54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2017. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications. In Proceedings of WWW. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 469–478.
[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.
[56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX).
[57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC.
[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. [n.d.]. Exploring the performance fluctuations of HPC workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[59] Krzysztof C. Kiwiel. [n.d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1), pp. 1-25, 2001.

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz Andreacute Barroso andChristos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st AnnualInternational Symposium on Computer Architecuture (ISCA) Minneapo-lis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot [n d] EC2 variability The numbers re-vealed httptechmangotcomrollerdaveentryec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars Lingjia Tang and Robert Hundt 2011 Heterogeneity inldquoHomogeneousrdquo Warehouse-Scale Computers A Performance Oppor-tunity IEEE Comput Archit Lett 10 2 (July 2011) 4

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz Andreacute Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power managementof online data-intensive services In Proceedings of the 38th annualinternational symposium on Computer architecture 319ndash330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren Eric Tune Tipp Moseley Yixin Shi Silvius Rus andRobert Hundt 2010 Google-Wide Profiling A Continuous Pro-filing Infrastructure for Data Centers IEEE Micro (2010) 65ndash79httpwwwcomputerorgportalwebcsdldoi101109MM201068

[79] rightscale [n d] Rightscale httpsawsamazoncomsolution-providersisvrightscale

[80] S Sarwar A Ankit and K Roy [n d] Incremental Learning in DeepConvolutional Neural Networks Using Partial Network Sharing InarXiv preprint arXiv171202719

[81] Joumlrg Schad Jens Dittrich and Jorge-Arnulfo Quianeacute-Ruiz 2010 Run-time Measurements in the Cloud Observing Analyzing and ReducingVariance Proceedings VLDB Endow 3 1-2 (Sept 2010) 460ndash471

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H Sigelman Luiz Andreacute Barroso Mike Burrows PatStephenson Manoj Plakal Donald Beaver Saul Jaspan and Chan-dan Shanbhag 2010 Dapper a Large-Scale Distributed Systems TracingInfrastructure Technical Report Google Inc httpsresearchgooglecomarchivepapersdapper-2010-1pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque [n d] Torque Resource Manager httpwwwadaptivecomputingcomproductsopen-sourcetorque

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu Xinxin Jin Peng Huang Yuanyuan Zhou Shan Lu LongJin and Shankar Pasupathy 2016 Early Detection of ConfigurationErrors to Reduce Failure Damage In Proceedings of the 12th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 619ndash634

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu Albert Greenberg Dave Maltz Jennifer Rexford LihuaYuan Srikanth Kandula and Changhoon Kim [n d] Profiling Net-work Performance for Multi-tier Data Center Applications In Proceed-ings of NSDI Boston MA 2011 14

[97] Xiao Yu Pallavi Joshi Jianwu Xu Guoliang Jin Hui Zhang and GuofeiJiang 2016 CloudSeer Workflow Monitoring of Cloud Infrastructuresvia Interleaved Logs In Proceedings of APSLOS ACM New York NYUSA 489ndash502

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G Edward Suh [n d] FPGA-Based Remote PowerSide-Channel Attacks In Proceedings of the IEEE Symposium on Securityand Privacy May 2018


Service             | Communication Protocol | Unique Microservices | Per-language LoC breakdown (end-to-end service)
Social Network      | RPC                    | 36                   | 34% C, 23% C++, 18% Java, 7% node, 6% Python, 5% Scala, 3% PHP, 2% JS, 2% Go
Media Service       | RPC                    | 38                   | 30% C, 21% C++, 20% Java, 10% PHP, 8% Scala, 5% node, 3% Python, 3% JS
E-commerce Site     | REST                   | 41                   | 21% Java, 16% C++, 15% C, 14% Go, 10% JS, 7% node, 5% Scala, 4% HTML, 3% Ruby
Banking System      | RPC                    | 28                   | 29% C, 25% Javascript, 16% Java, 16% node.js, 11% C++, 3% Python
Hotel Reservations [3] | RPC                 | 15                   | 89% Go, 7% HTML, 4% Python

Table 1: Characteristics and code composition of each end-to-end microservices-based application.

[Figure 3: Dependency graph between the microservices of the end-to-end Social Network application. Client requests arrive over http at an nginx load balancer and front-end, and fan out to microservices such as composePost, readPost, readTimeline, writeTimeline, writeGraph, login, search, recommender, ads, urlShorten, favorite, followUser, and uniqueID, backed by memcached and mongoDB instances.]

[Figure 4: Architecture of the hotel reservation site using Go-microservices [3]. An http load balancer and node.js front-end interface with search, geo, rate, recommender, userProfile, and ads microservices, backed by memcached and mongoDB instances.]

including Dapper [85], GWP [78], and Zipkin [8], which provide the tracing infrastructure for large-scale production services at Google and Twitter, respectively. Dapper and Zipkin trace distributed user requests at RPC granularity, while GWP focuses on low-level hardware monitoring diagnostics.

Root cause analysis of performance abnormalities in the cloud has also gained increased attention over the past few years, as the number of interactive, latency-critical services hosted in cloud systems has increased. Jayathilaka et al. [54], for example, developed Roots, a system that automatically identifies the root cause of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots tracks events within the PaaS cloud using a combination of metadata injection and platform-level instrumentation. Weng et al. [91] similarly explore the cloud provider's ability to diagnose the root cause of performance abnormalities in multi-tier applications. Finally, Ouyang et al. [74] focus on the root cause analysis of straggler tasks in distributed programming frameworks like MapReduce and Spark.

Even though this work does not specifically target interactive, latency-critical microservices or applications of similar granularity, such examples provide promising evidence that data-driven performance diagnostics can improve a large-scale system's ability to identify performance anomalies and address them to meet its performance guarantees.

3 End-to-End Applications with Microservices

We motivate and evaluate Seer with a set of new end-to-end, interactive services built with microservices. Even though there are open-source microservices that can serve as components of a larger application, such as nginx [6], memcached [41], MongoDB [5], Xapian [57], and RabbitMQ [4], there are currently no publicly-available end-to-end microservices applications, with the exception of a few simple architectures like Go-microservices [3] and Sockshop [7]. We design four end-to-end services implementing a Social Network, a Media Service, an E-commerce Site, and a Banking System. Starting from the Go-microservices architecture [3], we also develop an end-to-end Hotel Reservation system. Services are designed to be representative of frameworks used in production systems, modular, and easily reconfigurable. The end-to-end applications and tracing infrastructure are described in more detail and open-sourced in [45].

Table 1 briefly shows the characteristics of each end-to-end application, including its communication protocol, the number of unique microservices it includes, and its breakdown by programming language and framework. Unless otherwise noted, all microservices are deployed in Docker containers. Below, we briefly describe the scope and functionality of each service.

[Figure 5: The architecture of the end-to-end E-commerce application implementing an online clothing store. An http load balancer and node.js front-end interface with microservices such as search, orders, catalogue, cart, wishlist, shipping, payment, authorization, transactionID, invoicing, login, ads, discounts, recommender, socialNet, queueMaster, and orderQueue, backed by memcached and mongoDB instances.]

3.1 Social Network
Scope: The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.
Functionality: Fig. 3 shows the architecture of the end-to-end service. Users (clients) send requests over http, which first reach a load balancer, implemented with nginx, which then selects a specific webserver, also implemented in nginx. Users can create posts embedded with text, media, links, and tags to other users, which are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [15, 16, 59, 92], a search service using Xapian [57], and microservices that allow users to follow, unfollow, or block other accounts. Inter-microservice messages use Apache Thrift RPCs [1]. The service's backend uses memcached for caching and MongoDB for persistently storing posts, user profiles, media, and user recommendations. This service is broadly deployed at Cornell and elsewhere, and currently has several hundred users. We use this installation to test the effectiveness and scalability of Seer in Section 6.

3.2 Media Service
Scope: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [9, 11].
Functionality: As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. The front-end is similar to Social Network, and users can search and browse information about movies, including the plot, photos, videos, and review information, as well as insert a review for a specific movie by logging in to their account. Users can also select to rent a movie, which involves a payment authentication module to verify the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. Movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while reviews are held in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL DB. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which students can browse and review.

3.3 E-Commerce Service
Scope: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [7].
Functionality: The application front-end here is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine (C++) and microservices for creating wishlists (Java).

3.4 Banking System
Scope: The service implements a secure banking system supporting payments, loans, and credit card management.
Functionality: Users interface with a node.js front-end, similar to E-commerce, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, request a loan, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases are memcached and MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.5 Hotel Reservation Site
Scope: The service is an online hotel reservation site for browsing information about hotels and making reservations.
Functionality: The service is based on the Go-microservices open-source project [3], augmented with backend databases and machine learning widgets for advertisement and hotel recommendations. A client request is first directed to one of the front-end webservers, implemented in node.js, by a load balancer. The front-end then interfaces with the search engine, which allows users to explore hotel availability in a given region and place a reservation. The service back-end consists of memcached and MongoDB instances.

[Figure 6: Overview of Seer's operation. A tracing coordinator and per-microservice tracing modules stream traces from behind the load balancer into a trace database (TraceDB, Cassandra); Seer consumes the traces and the microservices dependency graph, and notifies the cluster manager to adjust resource allocations.]

4 Seer Design

4.1 Overview
Fig. 6 shows the high-level architecture of the system. Seer is an online performance debugging system for cloud systems hosting interactive, latency-critical services. Even though we are focusing our analysis on microservices, where the impact of QoS violations is more severe, Seer is also applicable to general cloud services and traditional multi-tier or Service-Oriented Architecture (SOA) workloads. Seer uses two levels of tracing, shown in Fig. 7.

[Figure 7: The two levels of tracing used in Seer: RPC-level tracing (signal the QoS violation and identify the culprit microservice) and node-level diagnostics via performance counters and microbenchmarks (identify the culprit resource and notify the cluster manager).]

First, it uses a lightweight, distributed RPC-level tracing system, described in Sec. 4.2, which collects end-to-end execution traces for each user request, including per-tier latency and outstanding requests, associates RPCs belonging to the same end-to-end request, and aggregates them into a centralized Cassandra database (TraceDB). From there, traces are used to train Seer to recognize patterns in space (between microservices) and time that lead to QoS violations. At runtime, Seer consumes real-time streaming traces to infer whether there is an imminent QoS violation.

When a QoS violation is expected to occur and a culprit microservice has been located, Seer uses its lower tracing level, which consists of detailed, per-node, low-level hardware monitoring primitives, such as performance counters, to identify the reason behind the QoS violation. It also uses this information to provide the cluster manager with recommendations on how to avoid the performance degradation altogether. When Seer runs on a public cloud where performance counters are disabled, it uses a set of tunable microbenchmarks to determine the source of unpredictable performance (see Sec. 4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing is lightweight enough to

[Figure 8: Distributed tracing and instrumentation in Seer. Per-microservice queues (NIC, TCP/IP RX/TX, packet filtering, epoll/libevent, socket read, memcached processing) are monitored, and traces are aggregated by the tracing coordinator into TraceDB (Cassandra), exposed through a query engine and web UI with per-microservice latency Gantt charts.]

track all active requests and services in the system, and that detailed low-level hardware tracing is only used on demand, for microservices likely to cause performance disruptions. In the following sections we describe the design of the tracing system, the learning techniques in Seer, its low-level diagnostic framework, and the system insights we can draw from Seer's decisions to improve cloud application design.
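To make the two-level flow above concrete, the following is a minimal Python sketch of the runtime loop the overview describes: consume streaming queue-depth snapshots, flag microservices whose predicted violation probability crosses a threshold, run node-level diagnostics, and notify the cluster manager. The helper callables, the 0.9 probability threshold, and the snapshot format are illustrative assumptions, not Seer's actual interfaces.

```python
# Minimal sketch of Seer's online loop as described in Sec. 4.1. The helper
# callables, the 0.9 threshold, and the snapshot format are assumptions.
from typing import Callable, Dict, Iterable

QOS_THRESHOLD = 0.9  # assumed cutoff for flagging a culprit microservice

def seer_runtime_loop(
    predict: Callable[[Dict[str, int]], Dict[str, float]],   # DNN inference
    trace_stream: Iterable[Dict[str, int]],                  # queue depths per interval
    diagnose: Callable[[str], str],                          # perf counters / microbenchmarks
    notify: Callable[[str, str], None],                      # cluster manager hook
) -> None:
    """Consume streaming traces, predict imminent QoS violations, and react."""
    for snapshot in trace_stream:
        probs = predict(snapshot)            # per-microservice violation probability
        for svc, prob in probs.items():
            if prob > QOS_THRESHOLD:
                resource = diagnose(svc)     # second, node-level tracing tier
                notify(svc, resource)        # e.g., resize container, repartition LLC

if __name__ == "__main__":
    # Toy run with stubbed components.
    stream = [{"nginx": 3, "memcached": 40}, {"nginx": 2, "memcached": 5}]
    seer_runtime_loop(
        predict=lambda s: {k: min(1.0, v / 32) for k, v in s.items()},
        trace_stream=stream,
        diagnose=lambda svc: "memory bandwidth",
        notify=lambda svc, res: print(f"adjust {res} for {svc}"),
    )
```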

4.2 Distributed Tracing
A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. We have developed a distributed tracing system for Seer, similar in design to Dapper [85] and Zipkin [8], that records per-microservice latencies using the Thrift timing interface, as shown in Fig. 8. We additionally track the number of requests queued in each microservice (outstanding requests), since queue lengths are highly correlated with performance and QoS violations [46, 49, 56, 96]. In all cases the overhead from tracing without request sampling is negligible, less than 0.1% on end-to-end latency and less than 0.15% on throughput (QPS), which is tolerable for such systems [24, 78, 85]. Traces from all microservices are aggregated in a centralized database [18].
Instrumentation: The tracing system requires two types of application instrumentation. First, to distinguish between the time spent processing network requests and the time that goes towards application computation, we instrument the application to report the time it sees a new request (post-RPC processing). We similarly instrument the transmit side of the application. Second, systems have multiple sources of queueing, in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues. Fig. 8 shows an example of

Name | LoC  | Layers (FC / Conv / Vect / Total) | Nonlinear Function   | Weights | Batch Size
CNN  | 1456 | – / 8 / – / 8                     | ReLU                 | 30K     | 4
LSTM | 944  | 12 / – / 6 / 18                   | sigmoid, tanh        | 52K     | 32
Seer | 2882 | 10 / 7 / 5 / 22                   | ReLU, sigmoid, tanh  | 80K     | 32

Table 2: The different neural network configurations we explored for Seer.

Seer's instrumentation for memcached. Memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the <k,v> pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.
Limited instrumentation: As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases, Seer relies on the requests queued exclusively in the NIC to signal upcoming QoS violations. In Section 5 we compare the accuracy of the full versus limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations that are caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.
Inferring queue lengths: Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer when the default level of instrumentation is not available.
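As a concrete illustration of the trace records described above, the sketch below shows one possible per-RPC record with per-stage queue depths for a memcached microservice. The stage names follow the example in the text, but the schema itself is an assumed layout, not Seer's actual TraceDB format.

```python
# Illustrative sketch of a per-RPC trace record with per-stage queue depths,
# following the memcached example above (NIC RX, epoll, socket read,
# processing, NIC TX). The schema is an assumption, not Seer's Cassandra layout.
from dataclasses import dataclass, field
from typing import Dict

STAGES = ("nic_rx", "epoll", "socket_read", "proc", "nic_tx")

@dataclass
class SpanRecord:
    request_id: str                 # associates RPCs of one end-to-end request
    microservice: str
    latency_us: int                 # per-tier latency from the RPC timing interface
    queued: Dict[str, int] = field(default_factory=dict)  # outstanding requests per stage

    def total_queue_depth(self) -> int:
        """Aggregate hardware and software queues for this microservice."""
        return sum(self.queued.get(stage, 0) for stage in STAGES)

# Example: a memcached span with requests pending in the NIC and processing queues.
span = SpanRecord("req-42", "memcached", latency_us=180,
                  queued={"nic_rx": 12, "epoll": 3, "proc": 7})
print(span.total_queue_depth())     # -> 22
```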

4.3 Deep Learning in Performance Debugging
A popular way to model performance in cloud systems, especially when there are dependencies between tasks, is queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations and collected over time to train a deep neural network to signal upcoming QoS violations. Below we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.
Using deep learning: Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.
Configuring the DNN: The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the

[Figure 9: The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers, with Dropout and a final SoftMax. Each input and output neuron corresponds to a microservice, ordered topologically, from back-end microservices at the top to front-end microservices at the bottom; the outputs give the probability for each microservice to initiate a QoS violation.]

microservices causing them. We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations while discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected on a different week, after the servers had been patched and the OS had been upgraded.
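For concreteness, here is a minimal Keras sketch of the hybrid architecture in Fig. 9: a convolutional stage applied across the (topologically ordered) per-microservice queue depths at each timestep, followed by LSTM layers over time and a SoftMax output per microservice. The layer counts, filter sizes, and dimensions below are placeholders; the actual configuration Seer uses is the hyperparameter-tuned one in Table 2.

```python
# Minimal Keras sketch of the hybrid CNN+LSTM network in Fig. 9. Layer counts,
# filter sizes, and dimensions are placeholders, not the tuned values of Table 2.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SERVICES = 36   # e.g., the Social Network's microservice count
TIMESTEPS = 10      # queue-depth snapshots, one per 100ms tracing interval

inputs = layers.Input(shape=(TIMESTEPS, NUM_SERVICES, 1))
# CNN stage: capture spatial patterns across neighboring (dependent)
# microservices and reduce dimensionality before the temporal stage.
x = layers.TimeDistributed(
    layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"))(inputs)
x = layers.TimeDistributed(layers.MaxPooling1D(pool_size=2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)
# LSTM stage: capture temporal patterns that precede QoS violations.
x = layers.LSTM(64, return_sequences=True)(x)
x = layers.LSTM(64)(x)
x = layers.Dropout(0.2)(x)
# SoftMax over microservices: probability each one initiates the QoS violation.
outputs = layers.Dense(NUM_SERVICES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```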

[Figure 10: Comparison of DNN architectures: QoS violation detection accuracy versus inference time (ms) for the CNN (56.1%), the LSTM (78.79%), and Seer's hybrid CNN+LSTM (93.45%).]

The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests that are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to the LSTM. Given that most resource partitioning decisions take effect after a few 100ms, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes and a network switch failure, which caused high packet drops.

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network, first because it has stricter QoS constraints than, e.g., E-commerce, and second due to a synchronous and cyclic communication pattern between three neighboring services that caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.

Retraining Seer: By default, training happens once, and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored on disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real-time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., when microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5 we evaluate Seer's ability to adjust its estimations to application changes.
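The retraining flow above can be sketched as a simple warm-start loop: weights from the previous round are kept on disk, and training resumes from them when new traces arrive. This is a simplified stand-in for the partial-network-sharing approach of [80]; the checkpoint path and epoch count are illustrative.

```python
# Simplified warm-start sketch of incremental retraining: resume from the last
# on-disk weights when new traces arrive, then persist the updated weights.
# This stands in for the partial-network-sharing technique of [80].
import os
import tensorflow as tf

CHECKPOINT = "seer_weights.h5"   # assumed location of the previous round's weights

def incremental_retrain(model: tf.keras.Model, new_x, new_y, epochs: int = 2):
    """Continue training from the last checkpoint on newly collected traces."""
    if os.path.exists(CHECKPOINT):
        model.load_weights(CHECKPOINT)        # pick up where training left off
    model.fit(new_x, new_y, epochs=epochs, verbose=0)
    model.save_weights(CHECKPOINT)            # persist for the next update
    return model
```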

4.4 Hardware Monitoring
Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.
Private cluster: When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.
Public cluster: When Seer does not have access to performance counters, it instead uses a set of 10 tunable, contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
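A sketch of this diagnostic sweep on a public cloud follows: microbenchmarks are injected one resource at a time, core resources first, ramping intensity until the culprit's tail latency degrades noticeably. The helper callables, the exact resource ordering, and the 10% degradation threshold are assumptions for illustration, not Seer's implementation.

```python
# Sketch of the public-cloud diagnostic sweep: inject contentious microbenchmarks
# one resource at a time, core resources first, ramping intensity until the
# culprit microservice's tail latency degrades noticeably. The helpers, resource
# ordering, and 10% threshold are illustrative assumptions.
from typing import Callable, Sequence

RESOURCES: Sequence[str] = (
    "cpu", "l1i_cache", "l1d_cache", "llc", "memory_bw", "network_bw", "disk_io")

def find_saturated_resource(
    run_microbenchmark: Callable[[str, float], None],  # (resource, intensity), ~10ms each
    measure_tail_latency: Callable[[], float],         # culprit's current tail latency
    degradation_threshold: float = 1.10,
) -> str:
    baseline = measure_tail_latency()
    for resource in RESOURCES:                 # move progressively away from the core
        for intensity in (0.2, 0.4, 0.6, 0.8, 1.0):
            run_microbenchmark(resource, intensity)
            if measure_tail_latency() > degradation_threshold * baseline:
                return resource                # contention on this resource hurts the service
    return "unknown"
```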

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.

4.5 System Insights from Seer
Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance and improving the application design. This has resulted both in progressively fewer QoS violations over time, and in a better understanding of the design challenges of microservices.

4.6 Implementation
Seer is implemented in 12KLOC of C, C++, and Python. It runs on Linux and OSX, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.
Security concerns: Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation

5.1 Methodology
Server clusters: First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each. Each

[Figure 11: Seer's sensitivity to (a) the size of the training dataset (100MB to 1TB; QoS violation detection accuracy versus training time in hours) and (b) the tracing interval (detection accuracy versus interval, 10ms to 10s).]

server is connected to a 40Gbps ToR switch over 10Gbe NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.
Applications: We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

5.2 Evaluation
Sensitivity to training data: Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.
Sensitivity to tracing frequency: By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work we use 100ms as the interval for measuring queue depths across microservices.
False negatives & false positives: Fig. 12a shows the percentage of false negatives and false positives as we vary the

[Figure 12: (a) The false negatives and false positives in Seer as the inference window varies from 10ms to 2s; (b) breakdown of valid QoS violations by cause (CPU, cache, memory, network, disk, application) for the utilization-based system, App-only, Net-only, Seer, and the ground truth.]

prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference were instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and, more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.
Comparison of debugging systems: Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory

[Figure 13: Seer retraining incrementally after each time the Social Network service is updated: detection accuracy over time (top), annotated with impactful and innocuous service updates, incremental retraining, and signaled QoS violations, and normalized tail latency against QoS for Social Network, Movie Reviewing, E-commerce, and Go-microservices (bottom).]

contention, and finally disk. Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.
Retraining: Fig. 13 shows the detection accuracy for Seer and the tail latency for each end-to-end service over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad, using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time, and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red ×'s denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background, and starts recovering its accuracy until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.

6 Large-Scale Cloud Study

6.1 Seer Scalability
We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE), and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager based on Seer's feedback.

[Figure 14: Seer training and inference time (seconds, log scale) natively and with TPU and Brainwave hardware acceleration.]

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and

[Figure 15: Number of QoS violations each microservice in Social Network is responsible for (nginx, text, image, userTag, urlShorten, video, recommender, search, login, readPost, composePost, writeGraph, memcached, mongoDB).]

inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems, and quantify the impact on training and inference time and detection accuracy.¹ Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs, and host the Social Network service on GCE.

6.2 Source of QoS Violations
We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute, and, to a lesser degree, cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming requests is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design
Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them. Furthermore, by detecting recurring patterns

¹ Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjust the DNN to the currently-supported designs in Brainwave.

[Figure 16: QoS violations signaled per day in Social Network over the two-month deployment (x-axis: day; y-axis: QoS violations).]

that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.

7 Conclusions
Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements
We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMWare Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] [n. d.]. Apache Thrift. https://thrift.apache.org
[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture
[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services
[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com
[5] [n. d.]. MongoDB. https://www.mongodb.com
[6] [n. d.]. NGINX. https://nginx.org/en
[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application
[8] [n. d.]. Zipkin. http://zipkin.io
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.
[11] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. [n. d.]. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. [n. d.]. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, SJ, June 2011.
[14] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[16] Leon Bottou. [n. d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), Paris, France, 2010.
[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO
[18] Cassandra. [n. d.]. Apache Cassandra. http://cassandra.apache.org
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.
[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 217–231.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. 2016. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 378–383.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. [n. d.]. Cloud-based DDoS attacks and defenses. In Proc. of i-Society, Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. [n. d.]. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. [n. d.]. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference of Autonomic Computing (ICAC), San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n. d.]. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. [n. d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4, December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences, May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS, Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[36] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/2749469.2750389
[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66.
[41] Brad Fitzpatrick. [n. d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-Trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI '07). USENIX Association, Berkeley, CA, USA, 20–20.
[43] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. 2018. Seer: Leveraging Big Data to Navigate the Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. [n. d.]. Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization. In Proc. of NSDI, 2018.
[47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. [n. d.]. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Proceedings of IISWC, Boston, MA, 2007. https://doi.org/10.1109/IISWC.2007.4362193
[48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. [n. d.]. PRESS: PRedictive Elastic ReSource Scaling for cloud systems. In Proceedings of CNSM, Niagara Falls, ON, 2010.
[49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. [n. d.]. Fundamentals of Queueing Theory. In Wiley Series in Probability and Statistics, Book 627, 2011.
[50] Sanchika Gupta and Padam Kumar. [n. d.]. VM Profile Based Optimized Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. In Proc. of SSCC, Mysore, India, 2013.
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n. d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI, Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. [n. d.]. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data, Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. [n. d.]. On the Performance Variability of Production Cloud Services. In Proceedings of CCGRID, Newport Beach, CA, 2011.
[54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2017. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications. In Proceedings of WWW. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 469–478.
[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.
[56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX).
[57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC.
[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. [n. d.]. Exploring the performance fluctuations of HPC workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[59] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1), pp. 1–25, 2001.
[60] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys, 2014.
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.
[63] Dave Mangot. [n. d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed
[64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in "homogeneous" warehouse-scale computers. In Proceedings of ISCA, Tel-Aviv, Israel, 2013.
[65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4.
[66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. [n. d.]. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of MICRO, Porto Alegre, Brazil, 2011.
[67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th Annual International Symposium on Computer Architecture, 319–330.
[68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC, Jacksonville, FL, 2007.
[69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys, Paris, France, 2010.
[70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. [n. d.]. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service. In Proceedings of ICAC, San Jose, CA, 2013.
[71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, and Ricardo Bianchini. [n. d.]. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. In Proceedings of ATC, San Jose, CA, 2013.
[72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. [n. d.]. A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing. In Lecture Notes on Cloud Computing, Volume 34, p. 115–131, 2010.
[73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP, Farminton, PA, 2013.
[74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. [n. d.]. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016.
[75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.
[76] Suhail Rehman and Majd Sakr. [n. d.]. Initial Findings for provisioning variation in cloud computing. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. [n. d.]. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of SOCC, 2012.
[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68
[79] rightscale. [n. d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale
[80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.
[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proceedings VLDB Endow. 3, 1-2 (Sept. 2010), 460–471.
[82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys, Prague, 2013.
[83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. [n. d.]. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of SOCC, Cascais, Portugal, 2011.
[84] David Shue, Michael J. Freedman, and Anees Shaikh. [n. d.]. Performance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. of OSDI, Hollywood, CA, 2012.
[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report, Google, Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf
[86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. [n. d.]. Distributed Resource Management Across Process Boundaries. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Santa Clara, CA, 2017.
[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque
[88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. [n. d.]. Scheduler-based Defenses against Cross-VM Side-channels. In Proc. of the 23rd USENIX Security Symposium, San Diego, CA, 2014.
[89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. [n. d.]. A Placement Vulnerability Study in Multi-Tenant Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX Security), Washington, DC, 2015.
[90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Ricardo Bianchini. [n. d.]. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.
[91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. [n. d.]. Root cause analysis of anomalies of multitier services in public clouds. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.
[92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition.
[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 619–634.
[94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. [n. d.]. An Exploration of L2 Cache Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW), Chicago, IL, 2011.
[95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. Bubble-Flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA, 2013.
[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. [n. d.]. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.
[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS. ACM, New York, NY, USA, 489–502.
[98] Yinqian Zhang and Michael K. Reiter. [n. d.]. Duppel: retrofitting commodity operating systems to mitigate cache side channels in the cloud. In Proc. of CCS, Berlin, Germany, 2013.
[99] Mark Zhao and G. Edward Suh. [n. d.]. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.


Figure 5. The architecture of the end-to-end E-commerce application, implementing an online clothing store.

3.1 Social Network
Scope: The end-to-end service implements a broadcast-style social network with uni-directional follow relationships.
Functionality: Fig. 3 shows the architecture of the end-to-end service. Users (clients) send requests over http, which first reach a load balancer implemented with nginx, which selects one of many webservers, also implemented in nginx. Users can create posts embedded with text, media, links, and tags to other users, which are then broadcasted to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly or send a direct message to another user. The application also includes machine learning plugins, such as ads and user recommender engines [15, 16, 59, 92], a search service using Xapian [57], and microservices that allow users to follow, unfollow, or block other accounts. Inter-microservice messages use Apache Thrift RPCs [1]. The service's back-end uses memcached for caching, and MongoDB for persistently storing posts, user profiles, media, and user recommendations. This service is broadly deployed at Cornell and elsewhere, and currently has several hundred users. We use this installation to test the effectiveness and scalability of Seer in Section 6.

3.2 Media Service
Scope: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies [9, 11].
Functionality: As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. The front-end is similar to Social Network, and users can search and browse information about movies, including the plot, photos, videos, and review information, as well as insert a review for a specific movie by logging in to their account. Users can also select to rent a movie, which involves a payment authentication module to verify the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. Movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from non-relational databases, while reviews are held in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL DB. We are similarly deploying Media Service as a hosting site for project demos at Cornell, which students can browse and review.

3.3 E-Commerce Service
Scope: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application [7].
Functionality: The application front-end here is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Orders are serialized and committed using QueueMaster (Go). Finally, the service includes a recommender engine (C++), and microservices for creating wishlists (Java).

3.4 Banking System
Scope: The service implements a secure banking system supporting payments, loans, and credit card management.
Functionality: Users interface with a node.js front-end, similar to E-commerce, to login to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment from their account, pay their credit card or request a new one, request a loan, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases are memcached and MongoDB instances. The service also has a relational database (BankInfoDB) that includes information about the bank, its services, and representatives.

3.5 Hotel Reservation Site
Scope: The service is an online hotel reservation site for browsing information about hotels and making reservations.
Functionality: The service is based on the Go-microservices open-source project [3], augmented with back-end databases and machine learning widgets for advertisement and hotel recommendations. A client request is first directed to one of the front-end webservers, implemented in node.js, by a load balancer. The front-end then interfaces with the search engine, which allows users to explore hotel availability in a given region and place a reservation. The service back-end consists of memcached and MongoDB instances.

Figure 6. Overview of Seer's operation: each microservice runs a tracing module that reports to a tracing coordinator and a trace database, TraceDB (Cassandra); Seer consumes the traces and the microservices dependency graph, and notifies the cluster manager to adjust resource allocations.

4 Seer Design
4.1 Overview
Fig. 6 shows the high-level architecture of the system. Seer is an online performance debugging system for cloud systems hosting interactive, latency-critical services. Even though we are focusing our analysis on microservices, where the impact of QoS violations is more severe, Seer is also applicable to general cloud services, and traditional multi-tier or Service-Oriented Architecture (SOA) workloads. Seer uses two levels of tracing, shown in Fig. 7.

Figure 7. The two levels of tracing used in Seer: RPC-level tracing signals an upcoming QoS violation and identifies the culprit microservice; node-level diagnostics (performance counters or microbenchmarks) identify the culprit resource and notify the cluster manager.

First, it uses a lightweight, distributed RPC-level tracing system, described in Sec. 4.2, which collects end-to-end execution traces for each user request, including per-tier latency and outstanding requests, associates RPCs belonging to the same end-to-end request, and aggregates them to a centralized Cassandra database (TraceDB). From there, traces are used to train Seer to recognize patterns in space (between microservices) and time that lead to QoS violations. At runtime, Seer consumes real-time streaming traces to infer whether there is an imminent QoS violation.

When a QoS violation is expected to occur and a culprit microservice has been located, Seer uses its lower tracing level, which consists of detailed, per-node, low-level hardware monitoring primitives, such as performance counters, to identify the reason behind the QoS violation. It also uses this information to provide the cluster manager with recommendations on how to avoid the performance degradation altogether. When Seer runs on a public cloud where performance counters are disabled, it uses a set of tunable microbenchmarks to determine the source of unpredictable performance (see Sec. 4.4). Using two specialized tracing levels, instead of collecting detailed low-level traces for all active microservices, ensures that the distributed tracing is lightweight enough to track all active requests and services in the system, and that detailed low-level hardware tracing is only used on demand, for microservices likely to cause performance disruptions. In the following sections we describe the design of the tracing system, the learning techniques in Seer, its low-level diagnostic framework, and the system insights we can draw from Seer's decisions to improve cloud application design.

Figure 8. Distributed tracing and instrumentation in Seer.
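To make the aggregation step concrete, the following is a minimal sketch (not Seer's actual code) of how a tracing module might report per-request spans to a centralized TraceDB in Cassandra, using the Python cassandra-driver. The keyspace, table, and column names (seer, rpc_traces, trace_id, and so on) are illustrative assumptions.

import time
import uuid

from cassandra.cluster import Cluster  # pip install cassandra-driver


class TraceReporter:
    def __init__(self, hosts, keyspace="seer"):
        self.session = Cluster(hosts).connect(keyspace)
        self.insert = self.session.prepare(
            "INSERT INTO rpc_traces "
            "(trace_id, span_id, microservice, start_us, latency_us, outstanding) "
            "VALUES (?, ?, ?, ?, ?, ?)"
        )

    def report(self, trace_id, microservice, start_us, latency_us, outstanding):
        # One row per RPC: trace_id ties together all spans of the same
        # end-to-end request; 'outstanding' is the queue depth observed at
        # this microservice when the RPC was admitted.
        self.session.execute(
            self.insert,
            (trace_id, uuid.uuid4(), microservice, start_us, latency_us, outstanding),
        )


# Example: record one front-end span.
reporter = TraceReporter(["tracedb-0.internal"])
reporter.report(uuid.uuid4(), "nginx-frontend",
                int(time.time() * 1e6), latency_us=850, outstanding=12)

Batching or asynchronous writes would likely be needed in practice to keep the reported tracing overheads this low; the sketch only illustrates the schema-level idea of one span row per RPC.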

4.2 Distributed Tracing
A major challenge with microservices is that one cannot simply rely on the client to report performance, as with traditional client-server applications. We have developed a distributed tracing system for Seer, similar in design to Dapper [85] and Zipkin [8], that records per-microservice latencies using the Thrift timing interface, as shown in Fig. 8. We additionally track the number of requests queued in each microservice (outstanding requests), since queue lengths are highly correlated with performance and QoS violations [46, 49, 56, 96]. In all cases, the overhead from tracing without request sampling is negligible: less than 0.1% on end-to-end latency, and less than 0.15% on throughput (QPS), which is tolerable for such systems [24, 78, 85]. Traces from all microservices are aggregated in a centralized database [18].

Instrumentation: The tracing system requires two types of application instrumentation. First, to distinguish between the time spent processing network requests and the time that goes towards application computation, we instrument the application to report the time it sees a new request (post-RPC processing). We similarly instrument the transmit side of the application. Second, systems have multiple sources of queueing in both hardware and software. To obtain accurate measurements of queue lengths per microservice, we need to account for these different queues. Fig. 8 shows an example of Seer's instrumentation for memcached. Memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the <k,v> pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.

Limited instrumentation: As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases, Seer relies on the requests queued exclusively in the NIC to signal upcoming QoS violations. In Section 5 we compare the accuracy of the full versus limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations that are caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.

Inferring queue lengths: Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer when the default level of instrumentation is not available.

Table 2. The different neural network configurations we explored for Seer.
  Name | LoC  | Layers (FC / Conv / Vect / Total) | Nonlinear Function   | Weights | Batch Size
  CNN  | 1456 |  - /  8 /  - /  8                 | ReLU                 | 30K     | 4
  LSTM |  944 | 12 /  - /  6 / 18                 | sigmoid, tanh        | 52K     | 32
  Seer | 2882 | 10 /  7 /  5 / 22                 | ReLU, sigmoid, tanh  | 80K     | 32
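As an illustration of the per-microservice queue accounting described above, the sketch below combines the hardware and software queues of a memcached instance into a single queue-depth sample. The nic_probe and app_probe helpers are hypothetical stand-ins for Seer's NIC filter and in-application probes, not real interfaces.

from dataclasses import dataclass


@dataclass
class QueueSample:
    microservice: str
    nic: int          # packets destined to this microservice (aggregate if the HW queue is shared)
    epoll: int        # events pending in epoll/libevent
    socket_read: int  # requests read off the socket but not yet processed
    processing: int   # requests in the processing stage

    @property
    def outstanding(self) -> int:
        # The per-microservice queue depth fed to Seer's DNN is the sum of
        # all hardware and software queues on the request path.
        return self.nic + self.epoll + self.socket_read + self.processing


def sample_memcached(nic_probe, app_probe) -> QueueSample:
    return QueueSample(
        microservice="memcached",
        nic=nic_probe.queued_packets("memcached"),
        epoll=app_probe.read_counter("epoll_pending"),
        socket_read=app_probe.read_counter("socket_read_pending"),
        processing=app_probe.read_counter("proc_pending"),
    )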

4.3 Deep Learning in Performance Debugging
A popular way to model performance in cloud systems, especially when there are dependencies between tasks, is queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations and collected over time to train a deep neural network to signal upcoming QoS violations. Below we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.

Using deep learning: Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.

Configuring the DNN: The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the microservices causing them. We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.

Figure 9. The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers. Each input and output neuron corresponds to a microservice, ordered in topological order, from back-end microservices at the top to front-end microservices at the bottom.
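For illustration, a topological ordering of the input vector along the lines described above could be built as in the following sketch. The example dependency graph and queue depths are made up; Seer derives the real graph from its traces, and this is not Seer's actual code.

from graphlib import TopologicalSorter  # Python 3.9+


def ordered_queue_vector(dependency_graph, queue_depths):
    """dependency_graph maps each microservice to the set of services it calls."""
    # static_order() emits callees (back-end tiers) before callers (front-end
    # tiers), which matches the neuron ordering described above.
    # (Cyclic dependencies would need to be broken before ordering.)
    order = list(TopologicalSorter(dependency_graph).static_order())
    return order, [queue_depths[svc] for svc in order]


services = {
    "nginx": {"composePost", "readPost"},
    "composePost": {"writeGraph", "memcached"},
    "readPost": {"memcached", "mongodb"},
    "writeGraph": {"mongodb"},
    "memcached": set(),
    "mongodb": set(),
}
depths = {"nginx": 3, "composePost": 11, "readPost": 7,
          "writeGraph": 2, "memcached": 25, "mongodb": 4}

order, x = ordered_queue_vector(services, depths)
print(order)  # back-end services (memcached, mongodb, ...) come first
print(x)      # queue-depth input vector for one timestep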

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence, we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations, and discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected on a different week, after the servers had been patched and the OS had been upgraded.
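The sketch below expresses the hybrid CNN+LSTM design described above in tf.keras: per-timestep convolutions over the topologically-ordered microservices, LSTM layers over the time window, and a SoftMax output per microservice. Layer widths, filter sizes, and the window length are illustrative choices, not Seer's tuned hyperparameters from Table 2.

import tensorflow as tf

N_SERVICES = 386   # input/output neurons = number of active microservices
WINDOW = 10        # timesteps of queue-depth snapshots fed per inference


def build_hybrid_model(n_services=N_SERVICES, window=WINDOW):
    # Input: a window of per-microservice queue depths, topologically ordered.
    inputs = tf.keras.Input(shape=(window, n_services, 1))

    # Convolutions applied per timestep capture spatial patterns across
    # neighboring (dependent) microservices and reduce dimensionality.
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv1D(32, kernel_size=3, padding="same",
                               activation="relu"))(inputs)
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.MaxPooling1D(pool_size=2))(x)
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)

    # LSTM layers capture temporal patterns across the window of snapshots.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(64)(x)
    x = tf.keras.layers.Dropout(0.5)(x)

    # SoftMax output: probability of each microservice initiating a violation.
    outputs = tf.keras.layers.Dense(n_services, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model


model = build_hybrid_model()
model.summary()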

Figure 10. Comparison of DNN architectures: QoS violation detection accuracy (56.1% for the CNN, 78.79% for the LSTM, and 93.45% for Seer's hybrid CNN+LSTM) versus inference time (ms).

The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests that are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to the LSTM. Given that most resource partitioning decisions take effect after a few hundred milliseconds, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes and a network switch failure, which caused high packet drops.

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network, first because it has stricter QoS constraints than, e.g., E-commerce, and second due to a synchronous and cyclic communication between three neighboring services that caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.

Retraining Seer: By default, training happens once, and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored on disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2–3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real-time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5 we evaluate Seer's ability to adjust its estimations to application changes.
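A minimal sketch of this incremental-retraining flow (warm-starting from checkpointed weights while a from-scratch model trains in the background) is shown below. The checkpoint path and the build_hybrid_model() helper from the earlier sketch are illustrative, not Seer's actual interfaces.

import os

CKPT = "/var/seer/model_weights.h5"  # assumed checkpoint location


def incremental_update(new_traces_x, new_traces_y, epochs=2):
    """Continue training from the last checkpoint on newly collected traces."""
    model = build_hybrid_model()
    if os.path.exists(CKPT):
        model.load_weights(CKPT)          # warm start: reuse previous weights
    model.fit(new_traces_x, new_traces_y, epochs=epochs, batch_size=32)
    model.save_weights(CKPT)              # persist for the next round
    return model                          # interim model used for detection


def full_retrain(all_traces_x, all_traces_y):
    """Retrain from scratch, e.g., after a major application change."""
    fresh = build_hybrid_model()          # randomly initialized weights
    fresh.fit(all_traces_x, all_traces_y, epochs=20, batch_size=32)
    fresh.save_weights(CKPT)              # swap in once training completes
    return fresh

In a deployment, full_retrain() would run asynchronously while incremental_update() keeps serving detections, matching the interim-model behavior described above.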

4.4 Hardware Monitoring
Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.

Private cluster: When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.

Public cluster: When Seer does not have access to performance counters, it instead uses a set of 10 tunable, contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
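The public-cloud diagnosis loop described above can be sketched as follows. The resource list, intensities, degradation threshold, and the run_microbenchmark() and measure_tail_latency() helpers are illustrative assumptions rather than Seer's implementation.

RESOURCES = ["cpu", "l1i", "l1d", "llc", "memory_bw", "memory_cap",
             "network_bw", "disk_bw"]           # ordered core -> uncore -> I/O
DEGRADATION_THRESHOLD = 1.2                      # e.g., 20% tail-latency increase


def diagnose(culprit_service, run_microbenchmark, measure_tail_latency):
    baseline = measure_tail_latency(culprit_service)
    for resource in RESOURCES:
        for intensity in (0.25, 0.5, 0.75, 1.0):
            # Each injection runs briefly (~10ms) to avoid prolonged degradation.
            run_microbenchmark(resource, intensity, duration_ms=10)
            latency = measure_tail_latency(culprit_service)
            if latency > DEGRADATION_THRESHOLD * baseline:
                # The culprit is sensitive to contention on this resource,
                # so this is the resource the cluster manager should adjust.
                return resource
    return None  # no single saturated resource identified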

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.
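As an example of the kinds of actions named above, the sketch below issues container resizing and HTB rate changes through the standard docker and tc command-line tools from Python. Container names, device names, class IDs, and rates are placeholders; LLC partitioning with Intel CAT (e.g., via the pqos tool or resctrl) is omitted, and this is not the cluster manager's actual interface.

import subprocess


def resize_container(container, cpus, mem):
    # Grow the culprit container's CPU and memory allocation.
    subprocess.run(["docker", "update", f"--cpus={cpus}", f"--memory={mem}",
                    container], check=True)


def set_network_rate(iface, classid, rate):
    # Adjust an existing HTB class's rate to repartition network bandwidth.
    subprocess.run(["tc", "class", "change", "dev", iface, "parent", "1:",
                    "classid", classid, "htb", "rate", rate], check=True)


resize_container("socialnet_memcached_1", cpus=4, mem="8g")
set_network_rate("eth0", "1:10", "2gbit")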

4.5 System Insights from Seer
Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance and improving the application design. This has resulted both in progressively fewer QoS violations over time, and in a better understanding of the design challenges of microservices.

4.6 Implementation
Seer is implemented in 12KLOC of C/C++ and Python. It runs on Linux and OSX, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.

Security concerns: Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation
5.1 Methodology
Server clusters: First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each. Each server is connected to a 40Gbps ToR switch over 10GbE NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers, to study the scalability of Seer.

Applications: We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

Figure 11. Seer's sensitivity to (a) the size of training datasets (detection accuracy versus training time, for datasets from 100MB to 1TB) and (b) the tracing interval (detection accuracy versus tracing interval in seconds).

5.2 Evaluation
Sensitivity to training data: Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point, accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.

Sensitivity to tracing frequency: By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work, we use 100ms as the interval for measuring queue depths across microservices.

False negatives & false positives: Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and, more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.

Comparison of debugging systems: Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC, and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk. Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Figure 12. (a) The false negatives and false positives in Seer as we vary the inference window. (b) Breakdown to causes of QoS violations (CPU, cache, memory, network, disk, application), and comparison with utilization-based detection and systems with limited instrumentation (App-only, Network-only) against ground truth.

Figure 13. Seer retraining incrementally after each time the Social Network service is updated (detection accuracy and normalized tail latency over time for Social Network, Movie Reviewing, E-commerce, and Go-microservices).

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.

Retraining: Fig. 13 shows the detection accuracy for Seer, and the tail latency for each end-to-end service, over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad, using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time, and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red crosses (x) denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background, and starts recovering its accuracy until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.

6 Large-Scale Cloud Study
6.1 Seer Scalability
We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE), and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager, based on Seer's feedback.

Figure 14. Seer training and inference time (s) with hardware acceleration (native, TPU, and Brainwave).

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems, and quantify the impact on training and inference time, and detection accuracy (before running on TPUs we reimplemented our DNN in TensorFlow, and we similarly adjusted the DNN to the currently-supported designs in Brainwave). Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs, and host the Social Network service on GCE.

Figure 15. QoS violations each microservice in Social Network is responsible for.

6.2 Source of QoS Violations

We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute, and to a lesser degree cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming requests is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.
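A breakdown like the one in Fig. 15 can be produced by counting, for each signaled violation, the microservice Seer attributed it to; a minimal sketch, assuming a hypothetical list of violation records pulled from the trace database:

from collections import Counter

def violations_per_culprit(violation_records):
    # violation_records: iterable of dicts such as
    # {"timestamp": ..., "culprit": "memcached", "resource": "cpu"} (assumed schema)
    return Counter(rec["culprit"] for rec in violation_records)

# Sorting the result, e.g. violations_per_culprit(records).most_common(),
# yields the ranking of culprit microservices plotted in Fig. 15.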

6.3 Seer's Long-Term Impact on Application Design

Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them. Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. On days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.

Figure 16: QoS violations per day in Social Network over the two-month deployment period.
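One of the recurring design problems above, microservices forming cyclic dependencies, can be flagged directly from the RPC-level dependency graph that Seer's traces expose; the sketch below is a simple illustration, assuming a hypothetical caller-to-callees adjacency map rather than Seer's actual tooling.

def find_cyclic_services(call_graph):
    # call_graph: dict mapping a microservice to the set of microservices it calls,
    # e.g. {"composePost": {"writeGraph", "memcached"}, ...} (assumed format).
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    in_cycle = set()

    def dfs(svc, path):
        color[svc] = GRAY
        path.append(svc)
        for callee in call_graph.get(svc, ()):
            if color.get(callee, WHITE) == GRAY:
                # Back edge: every service from the first occurrence of the callee
                # on the current path onward participates in a cycle.
                in_cycle.update(path[path.index(callee):])
            elif color.get(callee, WHITE) == WHITE:
                dfs(callee, path)
        path.pop()
        color[svc] = BLACK

    for svc in call_graph:
        if color.get(svc, WHITE) == WHITE:
            dfs(svc, [])
    return in_cycle

# Example: find_cyclic_services({"a": {"b"}, "b": {"c"}, "c": {"a"}}) returns {"a", "b", "c"}.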

7 Conclusions

Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques and the massive amount of tracing data cloud systems collect to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements

We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMWare Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References

[1] [n d] Apache Thrift https://thrift.apache.org

[2] [n d] Decomposing Twitter Adventures in Service-Oriented Architecture https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture

[3] [n d] Golang Microservices Example https://github.com/harlow/go-micro-services

[4] [n d] Messaging that just works https://www.rabbitmq.com

[5] [n d] MongoDB https://www.mongodb.com

[6] [n d] NGINX https://nginx.org/en/

[7] [n d] SockShop A Microservices Demo Application https://www.weave.works/blog/sock-shop-microservices-demo-application

[8] [n d] Zipkin http://zipkin.io

[9] 2016 The Evolution of Microservices https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference

[10] Martín Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dan Mané Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Viégas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng [n d] TensorFlow Large-Scale Machine Learning on Heterogeneous Distributed Systems In Proceedings of OSDI 2016

[11] Adrian Cockroft [n d] Microservices Workshop Why whatand how to get there httpwwwslidesharenetadriancockcroftmicroservices-workshop-craft-conference

[12] Hyunwook Baek Abhinav Srivastava and Jacobus Van der Merwe[n d] CloudSight A tenant-oriented transparency framework forcross-layer cloud troubleshooting In Proceedings of the 17th IEEEACMInternational Symposium on Cluster Cloud and Grid Computing 2017

[13] Luiz Barroso [n d] Warehouse-Scale Computing Entering theTeenage Decade ISCA Keynote SJ June 2011 ([n d])

[14] Luiz Barroso and Urs Hoelzle 2009 The Datacenter as a Computer AnIntroduction to the Design ofWarehouse-Scale Machines MC Publishers

[15] Robert Bell Yehuda Koren and Chris Volinsky 2007 The BellKor 2008Solution to the Netflix Prize Technical Report

[16] Leon Bottou [n d] Large-Scale Machine Learning with StochasticGradient Descent In Proceedings of the International Conference onComputational Statistics (COMPSTAT) Paris France 2010

[17] Martin A Brown [n d] Traffic Control HOWTO httplinux-ipnetarticlesTraffic-Control-HOWTO

[18] Cassandra [n d] Apache Cassandra httpcassandraapacheorg[19] Adrian M Caulfield Eric S Chung Andrew Putnam Hari Angepat

Jeremy Fowers Michael Haselman Stephen Heil Matt HumphreyPuneet Kaur Joo-Young Kim Daniel Lo Todd Massengill KalinOvtcharov Michael Papamichael Lisa Woods Sitaram Lanka DerekChiou and Doug Burger 2016 A Cloud-scale Acceleration Architec-ture In MICRO IEEE Press Piscataway NJ USA Article 7 13 pages

[20] Tianshi Chen Zidong Du Ninghui Sun Jia Wang Chengyong WuYunji Chen and Olivier Temam 2014 DianNao a small-footprint high-throughput accelerator for ubiquitous machine-learning In Proc ofthe 19th intl conf on Architectural Support for Programming Languagesand Operating Systems

[21] Yunji Chen Tianshi Chen Zhiwei Xu Ninghui Sun and Olivier Temam 2016 DianNao Family Energy-efficient Hardware Accelerators for Machine Learning Commun ACM 59 11 (Oct 2016) 105–112 https://doi.org/10.1145/2996864

[22] Yunji Chen Tao Luo Shaoli Liu Shijin Zhang Liqiang He Jia WangLing Li Tianshi Chen Zhiwei Xu Ninghui Sun and Olivier Temam2014 DaDianNao A Machine-Learning Supercomputer In Proceedingsof the 47th Annual IEEEACM International Symposium on Microarchi-tecture (MICRO-47) IEEE Computer Society Washington DC USA

609–622

[23] Ludmila Cherkasova Diwaker Gupta and Amin Vahdat 2007 Comparison of the Three CPU Schedulers in Xen SIGMETRICS Perform Eval Rev 35 2 (Sept 2007) 42–51

[24] Michael Chow David Meisner Jason Flinn Daniel Peek and Thomas FWenisch 2014 The Mystery Machine End-to-end Performance Anal-ysis of Large-scale Internet Services In Proceedings of the 11th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo14)USENIX Association Berkeley CA USA 217ndash231

[25] Eric S Chung Jeremy Fowers Kalin Ovtcharov Michael Papamichael Adrian M Caulfield Todd Massengill Ming Liu Daniel Lo Shlomi Alkalay Michael Haselman Maleen Abeydeera Logan Adams Hari Angepat Christian Boehn Derek Chiou Oren Firestein Alessandro Forin Kang Su Gatlin Mahdi Ghandi Stephen Heil Kyle Holohan Ahmad El Husseini Tamás Juhász Kara Kagi Ratna Kovvuri Sitaram Lanka Friedel van Megen Dima Mukhortov Prerak Patel Brandon Perez Amanda Rapsang Steven K Reinhardt Bita Rouhani Adam Sapek Raja Seera Sangeetha Shekar Balaji Sridharan Gabriel Weisz Lisa Woods Phillip Yi Xiao Dan Zhang Ritchie Zhao and Doug Burger 2018 Serving DNNs in Real Time at Datacenter Scale with Project Brainwave IEEE Micro 38 2 (2018) 8–20

[26] Guilherme Da Cunha Rodrigues Rodrigo N Calheiros Vinicius Tavares Guimaraes Glederson Lessa dos Santos Márcio Barbosa de Carvalho Lisandro Zambenedetti Granville Liane Margarida Rockenbach Tarouco and Rajkumar Buyya 2016 Monitoring of Cloud Computing Environments Concepts Solutions Trends and Future Directions In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16) ACM New York NY USA 378–383

[27] Marwan Darwish Abdelkader Ouda and Luiz Fernando Capretz [nd] Cloud-based DDoS attacks and defenses In Proc of i-SocietyToronto ON 2013

[28] Jeffrey Dean and Luiz Andre Barroso [n d] The Tail at Scale InCACM Vol 56 No 2

[29] Christina Delimitrou Nick Bambos and Christos Kozyrakis [n d]QoS-Aware Admission Control in Heterogeneous Datacenters In Pro-ceedings of the International Conference of Autonomic Computing (ICAC)San Jose CA USA 2013

[30] Christina Delimitrou and Christos Kozyrakis [n d] iBench Quanti-fying Interference for Datacenter Workloads In Proceedings of the 2013IEEE International Symposium on Workload Characterization (IISWC)Portland OR September 2013

[31] Christina Delimitrou and Christos Kozyrakis [n d] Paragon QoS-Aware Scheduling for Heterogeneous Datacenters In Proceedings ofthe Eighteenth International Conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS) HoustonTX USA 2013

[32] Christina Delimitrou and Christos Kozyrakis [n d] QoS-Aware Sched-uling in Heterogeneous Datacenters with Paragon In ACM Transac-tions on Computer Systems (TOCS) Vol 31 Issue 4 December 2013

[33] Christina Delimitrou and Christos Kozyrakis [n d] Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters withParagon In IEEE Micro Special Issue on Top Picks from the ComputerArchitecture Conferences MayJune 2014

[34] Christina Delimitrou and Christos Kozyrakis [n d] Quasar Resource-Efficient and QoS-Aware Cluster Management In Proc of ASPLOS SaltLake City 2014

[35] Christina Delimitrou and Christos Kozyrakis 2016 HCloud Resource-Efficient Provisioning in Shared Cloud Systems In Proceedings of theTwenty First International Conference on Architectural Support for Pro-gramming Languages and Operating Systems (ASPLOS)

[36] Christina Delimitrou and Christos Kozyrakis 2017 Bolt I KnowWhatYou Did Last Summer In The Cloud In Proc of the Twenty SecondInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS)

[37] Christina Delimitrou and Christos Kozyrakis 2018 Amdahl's Law for Tail Latency In Communications of the ACM (CACM)

[38] Christina Delimitrou Daniel Sanchez and Christos Kozyrakis 2015Tarcil Reconciling Scheduling Speed and Quality in Large Shared Clus-ters In Proceedings of the Sixth ACM Symposium on Cloud Computing(SOCC)

[39] Zidong Du Robert Fasthuber Tianshi Chen Paolo Ienne Ling Li Tao Luo Xiaobing Feng Yunji Chen and Olivier Temam 2015 ShiDianNao Shifting Vision Processing Closer to the Sensor In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15) ACM New York NY USA 92–104 https://doi.org/10.1145/2749469.2750389

[40] Daniel Firestone Andrew Putnam SambhramaMundkur Derek ChiouAlireza Dabagh Mike Andrewartha Hari Angepat Vivek BhanuAdrian Caulfield Eric Chung Harish Kumar Chandrappa SomeshChaturmohta Matt Humphrey Jack Lavier Norman Lam FengfenLiu Kalin Ovtcharov Jitu Padhye Gautham Popuri Shachar RaindelTejas Sapre Mark Shaw Gabriel Silva Madhan Sivakumar NisheethSrivastava Anshuman Verma Qasim Zuhair Deepak Bansal DougBurger Kushagra Vaid David A Maltz and Albert Greenberg 2018Azure Accelerated Networking SmartNICs in the Public Cloud In 15thUSENIX Symposium on Networked Systems Design and Implementation(NSDI 18) USENIX Association Renton WA 51ndash66

[41] Brad Fitzpatrick [n d] Distributed caching with memcached InLinux Journal Volume 2004 Issue 124 2004

[42] Rodrigo Fonseca George Porter Randy H Katz Scott Shenker andIon Stoica 2007 X-trace A Pervasive Network Tracing Framework InProceedings of the 4th USENIX Conference on Networked Systems Designamp Implementation (NSDIrsquo07) USENIX Association Berkeley CA USA20ndash20

[43] Yu Gan and Christina Delimitrou 2018 The Architectural Implicationsof Cloud Microservices In Computer Architecture Letters (CAL) vol17iss 2

[44] Yu Gan Meghna Pancholi Dailun Cheng Siyuan Hu Yuan He andChristina Delimitrou 2018 Seer Leveraging Big Data to Navigate theComplexity of Cloud Debugging In Proceedings of the Tenth USENIXWorkshop on Hot Topics in Cloud Computing (HotCloud)

[45] Yu Gan Yanqi Zhang Dailun Cheng Ankitha Shetty Priyal RathiNayan Katarki Ariana Bruno Justin Hu Brian Ritchken BrendonJackson Kelvin Hu Meghna Pancholi Yuan He Brett Clancy ChrisColen Fukang Wen Catherine Leung Siyuan Wang Leon ZaruvinskyMateo Espinosa Rick Lin Zhongling Liu Jake Padilla and ChristinaDelimitrou 2019 An Open-Source Benchmark Suite for Microser-vices and Their Hardware-Software Implications for Cloud and EdgeSystems In Proceedings of the Twenty Fourth International Conferenceon Architectural Support for Programming Languages and OperatingSystems (ASPLOS)

[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P Jouppi Cliff Young Nishant Patil David Patterson Gaurav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Boden Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao Chris Clark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben Gelb Tara Vazir Ghaemmaghami Rajendra Gottipati William Gulland Robert Hagmann C Richard Ho Doug Hogberg John Hu Robert Hundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexander Kaplan Harshit Khaitan Daniel Killebrew Andy Koch Naveen Kumar Steve Lacy James Laudon James Law Diemthu Le Chris Leary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adriana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan Ravi Narayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omernick Narayana Penukonda Andy Phelps Jonathan Ross Matt Ross Amir Salek Emad Samadiani Chris Severn Gregory Sizikov Matthew Snelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gregory Thorson Bo Tian Horia Toma Erick Tuttle Vijay Vasudevan Richard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon 2017 In-Datacenter Performance Analysis of a Tensor Processing Unit In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17) ACM New York NY USA 1–12

[56] Harshad Kasture and Daniel Sanchez 2014 Ubik Efficient CacheSharing with Strict QoS for Latency-Critical Workloads In Proceed-ings of the 19th international conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS-XIX)

[57] Harshad Kasture and Daniel Sanchez 2016 TailBench A BenchmarkSuite and Evaluation Methodology for Latency-Critical ApplicationsIn Proc of IISWC

[58] Yaakoub El Khamra Hyunjoo Kim Shantenu Jha andManish Parashar[n d] Exploring the performance fluctuations of hpc workloads onclouds In Proceedings of CloudCom Indianapolis IN 2010

[59] Krzysztof C Kiwiel [n d] Convergence and efficiency of subgradientmethods for quasiconvex minimization InMathematical Programming(Series A) (Berlin Heidelberg Springer) 90 (1) pp 1-25 2001

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz André Barroso and Christos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA) Minneapolis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot [n d] EC2 variability The numbers re-vealed httptechmangotcomrollerdaveentryec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars Lingjia Tang and Robert Hundt 2011 Heterogeneity in "Homogeneous" Warehouse-Scale Computers A Performance Opportunity IEEE Comput Archit Lett 10 2 (July 2011) 4

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz André Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power management of online data-intensive services In Proceedings of the 38th annual international symposium on Computer architecture 319–330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren Eric Tune Tipp Moseley Yixin Shi Silvius Rus andRobert Hundt 2010 Google-Wide Profiling A Continuous Pro-filing Infrastructure for Data Centers IEEE Micro (2010) 65ndash79httpwwwcomputerorgportalwebcsdldoi101109MM201068

[79] rightscale [n d] Rightscale httpsawsamazoncomsolution-providersisvrightscale

[80] S Sarwar A Ankit and K Roy [n d] Incremental Learning in DeepConvolutional Neural Networks Using Partial Network Sharing InarXiv preprint arXiv171202719

[81] Jörg Schad Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz 2010 Runtime Measurements in the Cloud Observing Analyzing and Reducing Variance Proceedings VLDB Endow 3 1-2 (Sept 2010) 460–471

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H Sigelman Luiz André Barroso Mike Burrows Pat Stephenson Manoj Plakal Donald Beaver Saul Jaspan and Chandan Shanbhag 2010 Dapper a Large-Scale Distributed Systems Tracing Infrastructure Technical Report Google Inc https://research.google.com/archive/papers/dapper-2010-1.pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque [n d] Torque Resource Manager httpwwwadaptivecomputingcomproductsopen-sourcetorque

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu Xinxin Jin Peng Huang Yuanyuan Zhou Shan Lu LongJin and Shankar Pasupathy 2016 Early Detection of ConfigurationErrors to Reduce Failure Damage In Proceedings of the 12th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 619ndash634

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu Albert Greenberg Dave Maltz Jennifer Rexford LihuaYuan Srikanth Kandula and Changhoon Kim [n d] Profiling Net-work Performance for Multi-tier Data Center Applications In Proceed-ings of NSDI Boston MA 2011 14

[97] Xiao Yu Pallavi Joshi Jianwu Xu Guoliang Jin Hui Zhang and GuofeiJiang 2016 CloudSeer Workflow Monitoring of Cloud Infrastructuresvia Interleaved Logs In Proceedings of APSLOS ACM New York NYUSA 489ndash502

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G Edward Suh [n d] FPGA-Based Remote PowerSide-Channel Attacks In Proceedings of the IEEE Symposium on Securityand Privacy May 2018

cluster manager early enough to avoid 84 of them The QoSviolations that were not avoided correspond to application-level bugs which cannot be easily corrected online Sincethis is a private cluster Seer uses utilization metrics andperformance counters to identify problematic resourcesRetraining Fig 13 shows the detection accuracy for Seerand the tail latency for each end-to-end service over a periodof time during which Social Network is getting frequently andsubstantially updated This includes new microservices be-ing added to the service such as the ability to place an orderfrom an ad using the orders microservice of E-commerceor the back-end of the service changing from MongoDB to

Cassandra and the front-end switching from nginx to thenodejs front-end of E-commerce These are changes that fun-damentally affect the applicationrsquos behavior throughput la-tency and bottlenecks The other services remain unchangedduring this period (Banking was not active during this timeand is omitted from the graph) Blue dots denote correctly-signaled upcoming QoS violations and red times denote QoSviolations that were not detected by Seer All unidentifiedQoS violations coincide with the application being updatedShortly after the update Seer incrementally retrains in thebackground and starts recovering its accuracy until anothermajor update occurs Some of the updates have no impact oneither performance or Seerrsquos accuracy either because theyinvolve microservices off the critical path or because theyare insensitive to resource contention

The bottom figure shows that unidentified QoS violationsindeed result in performance degradation for Social Networkand in some cases for the other end-to-end services if theyare sharing physical resources with Social Network on anoversubscribed server Once retraining completes the per-formance of the service(s) recovers The longer Seer trainson an evolving application the more likely it is to correctlyanticipate its future QoS violations

6 Large-Scale Cloud Study61 Seer ScalabilityWe now deploy our Social Network service on a 100-serverdedicated cluster on Google Compute Engine (GCE) anduse it to service real user traffic The application has 582registered users with 165 daily active users and has beendeployed for a two-month period The cluster on averagehosts 386 single-concerned containers (one microservice percontainer) subject to some resource scaling actions by thecluster manager based on Seerrsquos feedback

[Figure 14: Seer training and inference time (seconds, log scale) with hardware acceleration, comparing the native implementation, the TPU, and Brainwave.]

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

[Figure 15: Number of QoS violations each microservice in Social Network is responsible for (nginx, text, image, userTag, urlShorten, video, recomm, search, login, readPost, compPost, writeGraph, memcached, mongodb).]

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems, and quantify the impact on training and inference time, and detection accuracy.¹ Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper, we run Seer on TPUs, and host the Social Network service on GCE.

¹ Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjust the DNN to the currently-supported designs in Brainwave.
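For reference, the sketch below shows what dispatching a Keras model to a Cloud TPU can look like with current TensorFlow 2.x APIs; the TPU name, input shape, and layer sizes are placeholders, and this is not the exact mechanism used for the measurements above, which relied on the TensorFlow reimplementation described in the footnote.

```python
# Illustrative sketch: placing training of a small sequence model on a Cloud
# TPU with TF 2.x distribution strategies. "seer-tpu" and the model shape are
# placeholders, not the configuration used for the experiments in this paper.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="seer-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():                       # variables are created on the TPU
    inputs = tf.keras.Input(shape=(10, 30))  # placeholder: time window x microservices
    x = tf.keras.layers.LSTM(32)(inputs)
    outputs = tf.keras.layers.Dense(30, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(...) then executes the training steps on the TPU cores.
```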

6.2 Source of QoS Violations

We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute, and to a lesser degree cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding; if processing for any incoming request is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design

Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them. Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but also to help users better understand the design challenges of microservices, as more services transition to this application model.

[Figure 16: QoS violations per day over the 60-day deployment of Social Network.]

7 Conclusions

Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements

We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] [n. d.]. Apache Thrift. https://thrift.apache.org.
[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture.
[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services.
[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com.
[5] [n. d.]. MongoDB. https://www.mongodb.com.
[6] [n. d.]. NGINX. https://nginx.org/en/.
[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application.
[8] [n. d.]. Zipkin. http://zipkin.io.
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference.
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.

[11] Adrian Cockroft [n d] Microservices Workshop Why whatand how to get there httpwwwslidesharenetadriancockcroftmicroservices-workshop-craft-conference

[12] Hyunwook Baek Abhinav Srivastava and Jacobus Van der Merwe[n d] CloudSight A tenant-oriented transparency framework forcross-layer cloud troubleshooting In Proceedings of the 17th IEEEACMInternational Symposium on Cluster Cloud and Grid Computing 2017

[13] Luiz Barroso [n d] Warehouse-Scale Computing Entering theTeenage Decade ISCA Keynote SJ June 2011 ([n d])

[14] Luiz Barroso and Urs Hoelzle 2009 The Datacenter as a Computer AnIntroduction to the Design ofWarehouse-Scale Machines MC Publishers

[15] Robert Bell Yehuda Koren and Chris Volinsky 2007 The BellKor 2008Solution to the Netflix Prize Technical Report

[16] Leon Bottou [n d] Large-Scale Machine Learning with StochasticGradient Descent In Proceedings of the International Conference onComputational Statistics (COMPSTAT) Paris France 2010

[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO/.
[18] Cassandra. [n. d.]. Apache Cassandra. http://cassandra.apache.org.
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat,

Jeremy Fowers Michael Haselman Stephen Heil Matt HumphreyPuneet Kaur Joo-Young Kim Daniel Lo Todd Massengill KalinOvtcharov Michael Papamichael Lisa Woods Sitaram Lanka DerekChiou and Doug Burger 2016 A Cloud-scale Acceleration Architec-ture In MICRO IEEE Press Piscataway NJ USA Article 7 13 pages

[20] Tianshi Chen Zidong Du Ninghui Sun Jia Wang Chengyong WuYunji Chen and Olivier Temam 2014 DianNao a small-footprint high-throughput accelerator for ubiquitous machine-learning In Proc ofthe 19th intl conf on Architectural Support for Programming Languagesand Operating Systems

[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864

[22] Yunji Chen Tao Luo Shaoli Liu Shijin Zhang Liqiang He Jia WangLing Li Tianshi Chen Zhiwei Xu Ninghui Sun and Olivier Temam2014 DaDianNao A Machine-Learning Supercomputer In Proceedingsof the 47th Annual IEEEACM International Symposium on Microarchi-tecture (MICRO-47) IEEE Computer Society Washington DC USA

609ndash622[23] Ludmila Cherkasova Diwaker Gupta and Amin Vahdat 2007 Com-

parison of the Three CPU Schedulers in Xen SIGMETRICS PerformEval Rev 35 2 (Sept 2007) 42ndash51

[24] Michael Chow David Meisner Jason Flinn Daniel Peek and Thomas FWenisch 2014 The Mystery Machine End-to-end Performance Anal-ysis of Large-scale Internet Services In Proceedings of the 11th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo14)USENIX Association Berkeley CA USA 217ndash231

[25] Eric S Chung Jeremy Fowers Kalin Ovtcharov Michael PapamichaelAdrian M Caulfield Todd Massengill Ming Liu Daniel Lo ShlomiAlkalay Michael Haselman Maleen Abeydeera Logan Adams HariAngepat Christian Boehn Derek Chiou Oren Firestein AlessandroForin Kang Su Gatlin Mahdi Ghandi Stephen Heil Kyle HolohanAhmad El Husseini Tamaacutes Juhaacutesz Kara Kagi Ratna Kovvuri SitaramLanka Friedel van Megen Dima Mukhortov Prerak Patel BrandonPerez Amanda Rapsang Steven K Reinhardt Bita Rouhani AdamSapek Raja Seera Sangeetha Shekar Balaji Sridharan Gabriel WeiszLisaWoods Phillip Yi Xiao Dan Zhang Ritchie Zhao and Doug Burger2018 Serving DNNs in Real Time at Datacenter Scale with ProjectBrainwave IEEE Micro 38 2 (2018) 8ndash20

[26] Guilherme Da Cunha Rodrigues Rodrigo N Calheiros Vini-cius Tavares Guimaraes Glederson Lessa dos Santos Maacutercio Barbosade Carvalho Lisandro Zambenedetti Granville Liane Margarida Rock-enbach Tarouco and Rajkumar Buyya 2016 Monitoring of CloudComputing Environments Concepts Solutions Trends and Future Di-rections In Proceedings of the 31st Annual ACM Symposium on AppliedComputing (SAC rsquo16) ACM New York NY USA 378ndash383

[27] Marwan Darwish Abdelkader Ouda and Luiz Fernando Capretz [nd] Cloud-based DDoS attacks and defenses In Proc of i-SocietyToronto ON 2013

[28] Jeffrey Dean and Luiz Andre Barroso [n d] The Tail at Scale InCACM Vol 56 No 2

[29] Christina Delimitrou Nick Bambos and Christos Kozyrakis [n d]QoS-Aware Admission Control in Heterogeneous Datacenters In Pro-ceedings of the International Conference of Autonomic Computing (ICAC)San Jose CA USA 2013

[30] Christina Delimitrou and Christos Kozyrakis [n d] iBench Quanti-fying Interference for Datacenter Workloads In Proceedings of the 2013IEEE International Symposium on Workload Characterization (IISWC)Portland OR September 2013

[31] Christina Delimitrou and Christos Kozyrakis [n d] Paragon QoS-Aware Scheduling for Heterogeneous Datacenters In Proceedings ofthe Eighteenth International Conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS) HoustonTX USA 2013

[32] Christina Delimitrou and Christos Kozyrakis [n d] QoS-Aware Sched-uling in Heterogeneous Datacenters with Paragon In ACM Transac-tions on Computer Systems (TOCS) Vol 31 Issue 4 December 2013

[33] Christina Delimitrou and Christos Kozyrakis [n d] Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters withParagon In IEEE Micro Special Issue on Top Picks from the ComputerArchitecture Conferences MayJune 2014

[34] Christina Delimitrou and Christos Kozyrakis [n d] Quasar Resource-Efficient and QoS-Aware Cluster Management In Proc of ASPLOS SaltLake City 2014

[35] Christina Delimitrou and Christos Kozyrakis 2016 HCloud Resource-Efficient Provisioning in Shared Cloud Systems In Proceedings of theTwenty First International Conference on Architectural Support for Pro-gramming Languages and Operating Systems (ASPLOS)

[36] Christina Delimitrou and Christos Kozyrakis 2017 Bolt I KnowWhatYou Did Last Summer In The Cloud In Proc of the Twenty SecondInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS)

[37] Christina Delimitrou and Christos Kozyrakis 2018 Amdahlrsquos Law forTail Latency In Communications of the ACM (CACM)

[38] Christina Delimitrou Daniel Sanchez and Christos Kozyrakis 2015Tarcil Reconciling Scheduling Speed and Quality in Large Shared Clus-ters In Proceedings of the Sixth ACM Symposium on Cloud Computing(SOCC)

[39] Zidong Du Robert Fasthuber Tianshi Chen Paolo Ienne Ling Li TaoLuo Xiaobing Feng Yunji Chen and Olivier Temam 2015 ShiDi-anNao Shifting Vision Processing Closer to the Sensor In Proceed-ings of the 42Nd Annual International Symposium on Computer Ar-chitecture (ISCA rsquo15) ACM New York NY USA 92ndash104 httpsdoiorg10114527494692750389

[40] Daniel Firestone Andrew Putnam SambhramaMundkur Derek ChiouAlireza Dabagh Mike Andrewartha Hari Angepat Vivek BhanuAdrian Caulfield Eric Chung Harish Kumar Chandrappa SomeshChaturmohta Matt Humphrey Jack Lavier Norman Lam FengfenLiu Kalin Ovtcharov Jitu Padhye Gautham Popuri Shachar RaindelTejas Sapre Mark Shaw Gabriel Silva Madhan Sivakumar NisheethSrivastava Anshuman Verma Qasim Zuhair Deepak Bansal DougBurger Kushagra Vaid David A Maltz and Albert Greenberg 2018Azure Accelerated Networking SmartNICs in the Public Cloud In 15thUSENIX Symposium on Networked Systems Design and Implementation(NSDI 18) USENIX Association Renton WA 51ndash66

[41] Brad Fitzpatrick [n d] Distributed caching with memcached InLinux Journal Volume 2004 Issue 124 2004

[42] Rodrigo Fonseca George Porter Randy H Katz Scott Shenker andIon Stoica 2007 X-trace A Pervasive Network Tracing Framework InProceedings of the 4th USENIX Conference on Networked Systems Designamp Implementation (NSDIrsquo07) USENIX Association Berkeley CA USA20ndash20

[43] Yu Gan and Christina Delimitrou 2018 The Architectural Implicationsof Cloud Microservices In Computer Architecture Letters (CAL) vol17iss 2

[44] Yu Gan Meghna Pancholi Dailun Cheng Siyuan Hu Yuan He andChristina Delimitrou 2018 Seer Leveraging Big Data to Navigate theComplexity of Cloud Debugging In Proceedings of the Tenth USENIXWorkshop on Hot Topics in Cloud Computing (HotCloud)

[45] Yu Gan Yanqi Zhang Dailun Cheng Ankitha Shetty Priyal RathiNayan Katarki Ariana Bruno Justin Hu Brian Ritchken BrendonJackson Kelvin Hu Meghna Pancholi Yuan He Brett Clancy ChrisColen Fukang Wen Catherine Leung Siyuan Wang Leon ZaruvinskyMateo Espinosa Rick Lin Zhongling Liu Jake Padilla and ChristinaDelimitrou 2019 An Open-Source Benchmark Suite for Microser-vices and Their Hardware-Software Implications for Cloud and EdgeSystems In Proceedings of the Twenty Fourth International Conferenceon Architectural Support for Programming Languages and OperatingSystems (ASPLOS)

[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P Jouppi Cliff Young Nishant Patil David Patterson Gau-rav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Bo-den Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao ChrisClark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben GelbTara Vazir Ghaemmaghami Rajendra Gottipati William GullandRobert Hagmann C Richard Ho Doug Hogberg John Hu RobertHundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexan-der Kaplan Harshit Khaitan Daniel Killebrew Andy Koch NaveenKumar Steve Lacy James Laudon James Law Diemthu Le ChrisLeary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adri-ana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan RaviNarayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omer-nick Narayana Penukonda Andy Phelps Jonathan Ross Matt RossAmir Salek Emad Samadiani Chris Severn Gregory Sizikov MatthewSnelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gre-gory Thorson Bo Tian Horia Toma Erick Tuttle Vijay VasudevanRichard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon 2017In-Datacenter Performance Analysis of a Tensor Processing Unit InProceedings of the 44th Annual International Symposium on ComputerArchitecture (ISCA rsquo17) ACM New York NY USA 1ndash12

[56] Harshad Kasture and Daniel Sanchez 2014 Ubik Efficient CacheSharing with Strict QoS for Latency-Critical Workloads In Proceed-ings of the 19th international conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS-XIX)

[57] Harshad Kasture and Daniel Sanchez 2016 TailBench A BenchmarkSuite and Evaluation Methodology for Latency-Critical ApplicationsIn Proc of IISWC

[58] Yaakoub El Khamra Hyunjoo Kim Shantenu Jha andManish Parashar[n d] Exploring the performance fluctuations of hpc workloads onclouds In Proceedings of CloudCom Indianapolis IN 2010

[59] Krzysztof C Kiwiel [n d] Convergence and efficiency of subgradientmethods for quasiconvex minimization InMathematical Programming(Series A) (Berlin Heidelberg Springer) 90 (1) pp 1-25 2001

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz Andreacute Barroso andChristos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st AnnualInternational Symposium on Computer Architecuture (ISCA) Minneapo-lis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot [n d] EC2 variability The numbers re-vealed httptechmangotcomrollerdaveentryec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars Lingjia Tang and Robert Hundt 2011 Heterogeneity inldquoHomogeneousrdquo Warehouse-Scale Computers A Performance Oppor-tunity IEEE Comput Archit Lett 10 2 (July 2011) 4

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz Andreacute Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power managementof online data-intensive services In Proceedings of the 38th annualinternational symposium on Computer architecture 319ndash330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren Eric Tune Tipp Moseley Yixin Shi Silvius Rus andRobert Hundt 2010 Google-Wide Profiling A Continuous Pro-filing Infrastructure for Data Centers IEEE Micro (2010) 65ndash79httpwwwcomputerorgportalwebcsdldoi101109MM201068

[79] rightscale [n d] Rightscale httpsawsamazoncomsolution-providersisvrightscale

[80] S Sarwar A Ankit and K Roy [n d] Incremental Learning in DeepConvolutional Neural Networks Using Partial Network Sharing InarXiv preprint arXiv171202719

[81] Joumlrg Schad Jens Dittrich and Jorge-Arnulfo Quianeacute-Ruiz 2010 Run-time Measurements in the Cloud Observing Analyzing and ReducingVariance Proceedings VLDB Endow 3 1-2 (Sept 2010) 460ndash471

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque.

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu Xinxin Jin Peng Huang Yuanyuan Zhou Shan Lu LongJin and Shankar Pasupathy 2016 Early Detection of ConfigurationErrors to Reduce Failure Damage In Proceedings of the 12th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 619ndash634

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu Albert Greenberg Dave Maltz Jennifer Rexford LihuaYuan Srikanth Kandula and Changhoon Kim [n d] Profiling Net-work Performance for Multi-tier Data Center Applications In Proceed-ings of NSDI Boston MA 2011 14

[97] Xiao Yu Pallavi Joshi Jianwu Xu Guoliang Jin Hui Zhang and GuofeiJiang 2016 CloudSeer Workflow Monitoring of Cloud Infrastructuresvia Interleaved Logs In Proceedings of APSLOS ACM New York NYUSA 489ndash502

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G Edward Suh [n d] FPGA-Based Remote PowerSide-Channel Attacks In Proceedings of the IEEE Symposium on Securityand Privacy May 2018


Name   LoC    Layers (FC / Conv / Vect / Total)   Nonlinear Function     Weights   Batch Size
CNN    1456   -- / 8  / -- / 8                    ReLU                   30K       4
LSTM   944    12 / -- / 6  / 18                   sigmoid, tanh          52K       32
Seer   2882   10 / 7  / 5  / 22                   ReLU, sigmoid, tanh    80K       32

Table 2: The different neural network configurations we explored for Seer.

Seer's instrumentation for memcached: memcached includes five main stages [60]: TCP/IP receive, epoll/libevent, reading the request from the socket, processing the request, and responding over TCP/IP, either with the <k,v> pair for a read, or with an ack for a write. Each of these stages includes a hardware (NIC) or software (epoll, socket read, memcached proc) queue. For the NIC queues, Seer filters packets based on the destination microservice, but accounts for the aggregate queue length if hardware queues are shared, since that will impact how fast a microservice's packets get processed. For the software queues, Seer inserts probes in the application to read the number of queued requests in each case.

Limited instrumentation: As seen above, accounting for all sources of queueing in a complex system requires non-trivial instrumentation. This can become cumbersome if users leverage third-party applications in their services, or in the case of public cloud providers, which do not have access to the source code of externally-submitted applications for instrumentation. In these cases, Seer relies on the requests queued exclusively in the NIC to signal upcoming QoS violations. In Section 5 we compare the accuracy of the full versus limited instrumentation, and see that using network queue depths alone is enough to signal a large fraction of QoS violations, although smaller than when the full instrumentation is available. Exclusively polling NIC queues identifies hotspots caused by routing, incast, failures, and resource saturation, but misses QoS violations that are caused by performance and efficiency bugs in the application implementation, such as blocking behavior between microservices. Signaling such bugs helps developers better understand the microservices model, and results in better application design.

Inferring queue lengths: Additionally, there has been recent work on using deep learning to reverse engineer the number of queued requests in switches across a large network topology [46], when tracing information is incomplete. Such techniques are also applicable and beneficial for Seer, when the default level of instrumentation is not available.
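To make the software-queue probes concrete, the sketch below shows one shape such a probe could take; the class name, the reporting callback, and the 100ms interval are illustrative assumptions matching the sampling interval used later in the paper, not Seer's actual probe code.

```python
# Illustrative sketch of an application-level queue-depth probe (names are
# hypothetical, not Seer's code). A microservice increments the counter when a
# request is enqueued, decrements it when processing starts, and a background
# thread periodically reports the depth to the tracing system.
import threading
import time


class QueueDepthProbe:
    def __init__(self, service_name, report_fn, interval_s=0.1):
        self.service_name = service_name
        self.report_fn = report_fn      # e.g., writes a record to the trace database
        self.interval_s = interval_s    # 100ms matches Seer's queue sampling interval
        self._depth = 0
        self._lock = threading.Lock()
        threading.Thread(target=self._report_loop, daemon=True).start()

    def enqueue(self):
        with self._lock:
            self._depth += 1

    def dequeue(self):
        with self._lock:
            self._depth -= 1

    def _report_loop(self):
        while True:
            with self._lock:
                depth = self._depth
            self.report_fn(self.service_name, depth, time.time())
            time.sleep(self.interval_s)
```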

4.3 Deep Learning in Performance Debugging

A popular way to model performance in cloud systems, especially when there are dependencies between tasks, are queueing networks [49]. Although queueing networks are a valuable tool to model how bottlenecks propagate through the system, they require in-depth knowledge of application semantics and structure, and can become overly complex as applications and systems scale. They additionally cannot easily capture all sources of contention, such as the OS and network stack.

Instead, in Seer we take a data-driven, application-agnostic approach that assumes no information about the structure and bottlenecks of a service, making it robust to unknown and changing applications, and relying instead on practical learning techniques to infer patterns that lead to QoS violations. This includes both spatial patterns, such as dependencies between microservices, and temporal patterns, such as input load and resource contention. The key idea in Seer is that conditions that led to QoS violations in the past can be used to anticipate unpredictable performance in the near future. Seer uses execution traces annotated with QoS violations, and collected over time, to train a deep neural network to signal upcoming QoS violations. Below we describe the structure of the neural network, why deep learning is well-suited for this problem, and how Seer adapts to changes in application structure online.

Using deep learning: Although deep learning is not the only approach that can be used for proactive QoS violation detection, there are several reasons why it is preferable in this case. First, the problem Seer must solve is a pattern matching problem of recognizing conditions that result in QoS violations, where the patterns are not always known in advance or easy to annotate. This is a more complicated task than simply signaling a microservice with many enqueued requests, for which simpler classification, regression, or sequence labeling techniques would suffice [15, 16, 92]. Second, the DNN in Seer assumes no a priori knowledge about dependencies between individual microservices, making it applicable to frequently-updated services, where describing changes is cumbersome, or even difficult for the user to know. Third, deep learning has been shown to be especially effective in pattern recognition problems with massive datasets, e.g., in image or text recognition [10]. Finally, as we show in the validation section (Sec. 5), deep learning allows Seer to recognize QoS violations with high accuracy in practice, and within the opportunity window the cluster manager has to apply corrective actions.

Configuring the DNN: The input used in the network is essential for its accuracy. We have experimented with resource utilization, latency, and queue depths as input metrics. Consistent with prior work, utilization is not a good proxy for performance [31, 35, 56, 61]. Latency similarly leads to many false positives, or to incorrectly pinpointing computationally-intensive microservices as QoS violation culprits. Again consistent with queueing theory [49] and prior work [34, 37, 38, 46, 56], per-microservice queue depths accurately capture performance bottlenecks and pinpoint the microservices causing them. We compare against utilization-based approaches in Section 5. The number of input and output neurons is equal to the number of active microservices in the cluster, with the input value corresponding to queue depths, and the output value to the probability for a given microservice to initiate a QoS violation. Input neurons are ordered according to the topological application structure, with dependent microservices corresponding to consecutive neurons, to capture hotspot patterns in space. Input traces are annotated with the microservices that caused QoS violations in the past.

[Figure 9: The deep neural network design in Seer, consisting of a set of convolution layers followed by a set of long short-term memory layers. Each input and output neuron corresponds to a microservice, ordered in topological order, from back-end microservices at the top to front-end microservices at the bottom.]
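As an illustration of how such an input could be assembled (a sketch under assumed data structures, not Seer's actual pipeline), the snippet below orders microservices topologically and builds a window of per-microservice queue depths, with one output probability expected per microservice:

```python
# Hypothetical sketch: build a DNN input window of per-microservice queue
# depths, with microservices ordered topologically so that dependent services
# map to adjacent input neurons. `dependency_graph` and `samples` are assumed
# inputs, not Seer's real data structures. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter
import numpy as np


def build_input_window(dependency_graph, samples, window=10):
    """dependency_graph: {service: set of upstream services}
       samples: list of {service: queue_depth} dicts, oldest first."""
    order = list(TopologicalSorter(dependency_graph).static_order())
    window_samples = samples[-window:]
    # Shape: (time steps, microservices); one column per topologically ordered service.
    x = np.array([[s.get(svc, 0) for svc in order] for s in window_samples],
                 dtype=np.float32)
    return order, x   # the DNN emits one violation probability per column
```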

The choice of DNN architecture is also instrumental to its accuracy. There are three main DNN designs that are popular today: fully connected networks (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN), especially their Long Short-Term Memory (LSTM) class. For Seer to be effective in improving performance predictability, inference needs to occur with enough slack for the cluster manager's action to take effect. Hence, we focus on the more computationally-efficient CNN and LSTM networks. CNNs are especially effective at reducing the dimensionality of large datasets and finding patterns in space, e.g., in image recognition. LSTMs, on the other hand, are particularly effective at finding patterns in time, e.g., predicting tomorrow's weather based on today's measurements. Signaling QoS violations in a large cluster requires both spatial recognition, namely identifying problematic clusters of microservices whose dependencies cause QoS violations and discarding noisy but non-critical microservices, and temporal recognition, namely using past QoS violations to anticipate future ones. We compare three network designs: a CNN, an LSTM, and a hybrid network that combines the two, using the CNN first to reduce the dimensionality and filter out microservices that do not affect end-to-end performance, and then an LSTM with a SoftMax final layer to infer the probability for each microservice to initiate a QoS violation. The architecture of the hybrid network is shown in Fig. 9. Each network is configured using hyperparameter tuning to avoid overfitting, and the final parameters are shown in Table 2.

We train each network on a week's worth of trace data collected on a 20-server cluster running all end-to-end services (for methodology details see Sec. 5), and test it on traces collected on a different week, after the servers had been patched and the OS had been upgraded.
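For concreteness, below is a minimal Keras-style sketch of a hybrid CNN+LSTM classifier of this shape, assuming N microservices and a window of T queue-depth samples; the layer counts and sizes are illustrative placeholders, not the tuned parameters from Table 2.

```python
# Minimal sketch (assumed shapes and layer sizes, not Seer's tuned network):
# a 1D-convolutional front end compresses each timestep's microservice vector,
# LSTM layers capture temporal patterns, and a SoftMax head emits a
# per-microservice probability of initiating a QoS violation.
import tensorflow as tf


def build_hybrid_model(num_services, window=10):
    inputs = tf.keras.Input(shape=(window, num_services))   # queue depths
    x = tf.keras.layers.Conv1D(64, kernel_size=3, padding="same",
                               activation="relu")(inputs)
    x = tf.keras.layers.Conv1D(32, kernel_size=3, padding="same",
                               activation="relu")(x)
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(64)(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(num_services, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_hybrid_model(num_services=30)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```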

[Figure 10: Comparison of DNN architectures. QoS violation detection accuracy: CNN 56.1%, LSTM 78.79%, Seer (CNN+LSTM) 93.45%; the y-axis shows inference time (ms).]

The quantitative comparison of the three networks is shown in Fig. 10. The CNN is by far the fastest, but also the worst performing, since it is not designed to recognize patterns in time that lead to QoS violations. The LSTM, on the other hand, is especially effective at capturing load patterns over time, but is less effective at reducing the dimensionality of the original dataset, which makes it prone to false positives due to microservices with many outstanding requests which are off the critical path. It also incurs higher overheads for inference than the CNN. Finally, Seer correctly anticipates 93.45% of violations, outperforming both networks, for a small increase in inference time compared to the LSTM. Given that most resource partitioning decisions take effect after a few 100ms, the inference time for Seer is within the window of opportunity the cluster manager has to take action. More importantly, it attributes the QoS violation to the correct microservice, simplifying the cluster manager's task. QoS violations missed by Seer included four random load spikes and a network switch failure, which caused high packet drops.

Out of the five end-to-end services, the one most prone to QoS violations initially was Social Network, first because it has stricter QoS constraints than, e.g., E-commerce, and second due to a synchronous and cyclic communication between three neighboring services, which caused them to enter a positive feedback loop until saturation. We reimplemented the communication protocol between them post-detection. On the other hand, the service for which QoS violations were hardest to detect was Media Service, because of a faulty memory bandwidth partitioning mechanism in one of our servers, which resulted in widely inconsistent memory bandwidth allocations during movie streaming. Since the QoS violation only occurred when the specific streaming microservice was scheduled on the faulty node, it was hard for Seer to collect enough datapoints to signal the violation.

Retraining Seer: By default, training happens once, and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored on disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5 we evaluate Seer's ability to adjust its estimations to application changes.
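A minimal sketch of this checkpoint-based incremental retraining is shown below, assuming the Keras model sketched earlier; the checkpoint path and epoch count are illustrative, and this is not the transfer-learning scheme of [80] itself, only the continue-from-stored-weights idea.

```python
# Illustrative sketch of incremental (checkpoint-based) retraining in the
# background; the path and retrain trigger are assumptions, not Seer's code.
import os
import tensorflow as tf

CHECKPOINT = "seer_weights.h5"   # hypothetical location of stored weights


def incremental_retrain(model, new_x, new_y, epochs=2):
    # Continue from the previously stored weights instead of training from scratch.
    if os.path.exists(CHECKPOINT):
        model = tf.keras.models.load_model(CHECKPOINT)
    model.fit(new_x, new_y, epochs=epochs, verbose=0)
    model.save(CHECKPOINT)
    return model   # interim model used for detection while a full retrain runs
```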

4.4 Hardware Monitoring

Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.

Private cluster: When Seer has access to hardware events, such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.

Public cluster: When Seer does not have access to performance counters, it instead uses a set of 10 tunable, contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
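The sketch below illustrates this progressive, core-outward search; `RESOURCE_ORDER`, `inject_microbenchmark`, and `measure_latency` are hypothetical helpers standing in for the contentious kernels of [30] and the tracing system, not Seer's API.

```python
# Hypothetical sketch of the progressive contention search on a public cloud.
# The helpers are placeholders: `inject_microbenchmark` runs one contentious
# kernel at a given intensity, `measure_latency` samples the victim's latency.
RESOURCE_ORDER = ["cpu", "l1i", "l1d", "llc", "memory_bw", "network_bw", "disk_io"]


def find_saturated_resource(victim, inject_microbenchmark, measure_latency,
                            threshold=1.2):
    baseline = measure_latency(victim)
    for resource in RESOURCE_ORDER:               # core resources first
        for intensity in (0.2, 0.4, 0.6, 0.8):    # tune up the benchmark
            with inject_microbenchmark(resource, intensity):   # ~10ms each
                if measure_latency(victim) > threshold * baseline:
                    return resource   # contention on this resource hurts the victim
    return None
```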

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.
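As a hedged illustration of the kinds of knobs such a cluster manager could turn, the sketch below shells out to standard tools (docker, pqos from intel-cmt-cat, and tc); the container name, core list, LLC mask, rates, and device name are placeholders, it assumes the HTB qdisc and class already exist, and exact flag syntax can vary across tool versions. It is not Seer's cluster-manager code.

```python
# Illustrative resource-allocation actions for a noisy neighbor; all values
# and device names are placeholders, and the HTB class 1:10 is assumed to
# have been created beforehand.
import subprocess


def throttle_neighbor(container, cores="8-11"):
    cmds = [
        # Shrink the noisy container's CPU quota.
        ["docker", "update", "--cpu-quota", "100000", "--cpu-period", "100000",
         container],
        # Give class-of-service 1 a small LLC partition with Intel CAT (pqos),
        # then associate the container's cores with that class.
        ["pqos", "-e", "llc:1=0x000f"],
        ["pqos", "-a", f"llc:1={cores}"],
        # Cap its network bandwidth via the pre-existing HTB class under tc.
        ["tc", "class", "change", "dev", "eth0", "parent", "1:",
         "classid", "1:10", "htb", "rate", "2gbit"],
    ]
    for cmd in cmds:
        subprocess.run(cmd, check=False)
```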

4.5 System Insights from Seer

Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance and improving the application design. This has resulted both in progressively fewer QoS violations over time, and a better understanding of the design challenges of microservices.

4.6 Implementation

Seer is implemented in 12KLOC of C, C++, and Python. It runs on Linux and OSX, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.

Security concerns: Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation

5.1 Methodology

Server clusters: First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each. Each server is connected to a 40Gbps ToR switch over 10Gbe NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.

Applications: We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

5.2 Evaluation

[Figure 11: Seer's sensitivity to (a) the size of the training dataset (100MB to 1TB; detection accuracy vs. training time in hours) and (b) the tracing interval (10ms to 10s; detection accuracy).]

Sensitivity to training data: Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point, accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.

Sensitivity to tracing frequency: By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work, we use 100ms as the interval for measuring queue depths across microservices.

[Figure 12: (a) The false negatives and false positives in Seer as we vary the inference window (10ms to 2s); (b) breakdown of the causes of QoS violations (CPU, cache, memory, network, disk, application) and comparison of utilization-based detection, App-only, Network-only, and Seer against the ground truth.]

False negatives & false positives: Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.

Comparison of debugging systems: Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC, and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk. Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

[Figure 13: Seer retraining incrementally after each time the Social Network service is updated (top: detection accuracy over a 12-hour window, with impactful and innocuous service updates, incremental retraining, and signaled QoS violations marked; bottom: normalized tail latency for Social Network, Movie Reviewing, E-commerce, and Go-microservices relative to QoS).]

Seer accurately follows this breakdown for the most partonly missing a few QoS violations due to random load spikesincluding one caused by a switch failure The App-only sys-tem correctly identifies application-level sources of unpre-dictable performance but misses the majority of system-related issues especially in uncore resources On the otherhand Network-only correctly identifies the vast majorityof network-related issues as well as most of the core- anduncore-driven QoS violations but misses several application-level issues The difference between Network-only and Seeris small suggesting that one could omit the application-levelinstrumentation in favor of a simpler design While this sys-tem is still effective in capturing QoS violations it is lessuseful in providing feedback to application developers onhow to improve their design to avoid QoS violations in the fu-ture Finally the utilization-based system behaves the worstmissing most violations not caused by CPU saturationOut of the 89 QoS violations Seer detects it notifies the

cluster manager early enough to avoid 84 of them The QoSviolations that were not avoided correspond to application-level bugs which cannot be easily corrected online Sincethis is a private cluster Seer uses utilization metrics andperformance counters to identify problematic resourcesRetraining Fig 13 shows the detection accuracy for Seerand the tail latency for each end-to-end service over a periodof time during which Social Network is getting frequently andsubstantially updated This includes new microservices be-ing added to the service such as the ability to place an orderfrom an ad using the orders microservice of E-commerceor the back-end of the service changing from MongoDB to

Cassandra and the front-end switching from nginx to thenodejs front-end of E-commerce These are changes that fun-damentally affect the applicationrsquos behavior throughput la-tency and bottlenecks The other services remain unchangedduring this period (Banking was not active during this timeand is omitted from the graph) Blue dots denote correctly-signaled upcoming QoS violations and red times denote QoSviolations that were not detected by Seer All unidentifiedQoS violations coincide with the application being updatedShortly after the update Seer incrementally retrains in thebackground and starts recovering its accuracy until anothermajor update occurs Some of the updates have no impact oneither performance or Seerrsquos accuracy either because theyinvolve microservices off the critical path or because theyare insensitive to resource contention

The bottom figure shows that unidentified QoS violationsindeed result in performance degradation for Social Networkand in some cases for the other end-to-end services if theyare sharing physical resources with Social Network on anoversubscribed server Once retraining completes the per-formance of the service(s) recovers The longer Seer trainson an evolving application the more likely it is to correctlyanticipate its future QoS violations

6 Large-Scale Cloud Study61 Seer ScalabilityWe now deploy our Social Network service on a 100-serverdedicated cluster on Google Compute Engine (GCE) anduse it to service real user traffic The application has 582registered users with 165 daily active users and has beendeployed for a two-month period The cluster on averagehosts 386 single-concerned containers (one microservice percontainer) subject to some resource scaling actions by thecluster manager based on Seerrsquos feedback

Training Inference10-310-210-1100101102103104105

Tim

e (

s)

NativeTPUBrainwave

Figure 14 Seer training and infer-ence with hardware acceleration

Accuracy remainshigh for Seer con-sistent with thesmall-scale exper-iments Inferencetime however in-creases substantiallyfrom 114ms forthe 20-server clus-ter to 54ms for the100-server GCE set-ting Even thoughthis is still suffi-cient for many re-source allocation decisions as the application scales furtherSeerrsquos ability to anticipate a QoS violation within the clustermanagerrsquos window of opportunity diminishes

Over the past year multiple public cloud providers have ex-posed hardware acceleration offerings for DNN training and

nginx

text

imag

e

user

Tag

urlS

horte

n

vide

o

reco

mm

sear

chlogin

read

Post

com

pPos

t

writ

eGra

ph

mem

cach

ed

mon

godb

0

20

40

60

80

100

120

QoS

Vio

lations

Figure 15 QoS violations each microservice in Social Net-work is responsible for

inference either using a special-purpose design like the Ten-sor Processing Unit (TPU) from Google [55] or using recon-figurable FPGAs like Project Brainwave from Microsoft [25]We offload Seerrsquos DNN logic to both systems and quantifythe impact on training and inference time and detection ac-curacy 1 Fig 14 shows this comparison for a 200GB trainingdataset Both the TPU and Project Brainwave dramaticallyoutperform our local implementation by up to two orders ofmagnitude Between the two accelerators the TPU is moreeffective in training consistent with its design objective [55]while Project Brainwave achieves faster inference For theremainder of the paper we run Seer on TPUs and host theSocial Network service on GCE

62 Source of QoS ViolationsWe now examine which microservice is the most commonculprit for a QoS violation Fig 15 shows the number of QoSviolations caused by each service over the two-month periodThe most frequent culprits by far are the in-memory cachingtiers in memcached and Thrift services with high requestfanout such as composePost readPost and login memcachedis a justified source of QoS violations since it is on the criti-cal path for almost all query types and it is additionally verysensitive to resource contention in compute and to a lesserdegree cache and memory Microservices with high fanoutare also expected to initiate QoS violations as they have tosynchronize multiple inbound requests before proceeding Ifprocessing for any incoming requests is delayed end-to-endperformance is likely to suffer Among these QoS violationsmost of memcachedrsquos violations were caused by resource con-tention while violations in Thrift services were caused bylong synchronization times

63 Seerrsquos Long-Term Impact on Application DesignSeer has now been deployed in the Social Network clusterfor over two months and in this time it has detected 536upcoming QoS violations (906 accuracy) and avoided 495(84) of them Furthermore by detecting recurring patterns1Before running on TPUs we reimplemented our DNN in Tensorflow Wesimilarly adjust the DNN to the currently-supported designs in Brainwave

[Figure 16: QoS violations per day over the two-month deployment of Social Network (days 0-60).]

Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.
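One of the anti-patterns mentioned above, cyclic dependencies between microservices, can also be flagged directly from the caller-callee RPC graph that the tracing data already encodes. The sketch below is only an illustration under that assumption (an edge list of (caller, callee) pairs extracted from spans) and is not part of Seer; the composePost/writeGraph cycle in the example is hypothetical.

```python
# Illustrative sketch (not part of Seer): flag cyclic dependencies in a
# caller -> callee RPC graph extracted from tracing data.
from collections import defaultdict

def find_cycles(rpc_edges):
    """rpc_edges: iterable of (caller, callee) pairs, e.g. from span parent/child IDs."""
    graph = defaultdict(set)
    for caller, callee in rpc_edges:
        graph[caller].add(callee)

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current DFS path / finished
    color = defaultdict(int)
    cycles = []

    def dfs(node, path):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back edge: a cycle through nxt
                cycles.append(path[path.index(nxt):] + [nxt])
            elif color[nxt] == WHITE:
                dfs(nxt, path)
        path.pop()
        color[node] = BLACK

    for node in list(graph):
        if color[node] == WHITE:
            dfs(node, [])
    return cycles

# Hypothetical example: two services calling each other would be reported.
print(find_cycles([("nginx", "composePost"),
                   ("composePost", "writeGraph"),
                   ("writeGraph", "composePost")]))
```

Reviewing the reported cycles against the traced critical path is one simple way developers can act on the design insights Seer surfaces.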

7 Conclusions

Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements

We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMWare Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References

[1] [n. d.]. Apache Thrift. https://thrift.apache.org

[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture

[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services

[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com

[5] [n. d.]. MongoDB. https://www.mongodb.com

[6] [n. d.]. NGINX. https://nginx.org/en

[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application

[8] [n. d.]. Zipkin. http://zipkin.io

[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference

[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.

[11] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference

[12] Hyunwook Baek Abhinav Srivastava and Jacobus Van der Merwe[n d] CloudSight A tenant-oriented transparency framework forcross-layer cloud troubleshooting In Proceedings of the 17th IEEEACMInternational Symposium on Cluster Cloud and Grid Computing 2017

[13] Luiz Barroso [n d] Warehouse-Scale Computing Entering theTeenage Decade ISCA Keynote SJ June 2011 ([n d])

[14] Luiz Barroso and Urs Hoelzle 2009 The Datacenter as a Computer AnIntroduction to the Design ofWarehouse-Scale Machines MC Publishers

[15] Robert Bell Yehuda Koren and Chris Volinsky 2007 The BellKor 2008Solution to the Netflix Prize Technical Report

[16] Leon Bottou [n d] Large-Scale Machine Learning with StochasticGradient Descent In Proceedings of the International Conference onComputational Statistics (COMPSTAT) Paris France 2010

[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO

[18] Cassandra. [n. d.]. Apache Cassandra. http://cassandra.apache.org

[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.

[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th intl. conf. on Architectural Support for Programming Languages and Operating Systems.

[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864

[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.

[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.

[24] Michael Chow David Meisner Jason Flinn Daniel Peek and Thomas FWenisch 2014 The Mystery Machine End-to-end Performance Anal-ysis of Large-scale Internet Services In Proceedings of the 11th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo14)USENIX Association Berkeley CA USA 217ndash231

[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.

[26] Guilherme Da Cunha Rodrigues Rodrigo N Calheiros Vini-cius Tavares Guimaraes Glederson Lessa dos Santos Maacutercio Barbosade Carvalho Lisandro Zambenedetti Granville Liane Margarida Rock-enbach Tarouco and Rajkumar Buyya 2016 Monitoring of CloudComputing Environments Concepts Solutions Trends and Future Di-rections In Proceedings of the 31st Annual ACM Symposium on AppliedComputing (SAC rsquo16) ACM New York NY USA 378ndash383

[27] Marwan Darwish Abdelkader Ouda and Luiz Fernando Capretz [nd] Cloud-based DDoS attacks and defenses In Proc of i-SocietyToronto ON 2013

[28] Jeffrey Dean and Luiz Andre Barroso [n d] The Tail at Scale InCACM Vol 56 No 2

[29] Christina Delimitrou Nick Bambos and Christos Kozyrakis [n d]QoS-Aware Admission Control in Heterogeneous Datacenters In Pro-ceedings of the International Conference of Autonomic Computing (ICAC)San Jose CA USA 2013

[30] Christina Delimitrou and Christos Kozyrakis [n d] iBench Quanti-fying Interference for Datacenter Workloads In Proceedings of the 2013IEEE International Symposium on Workload Characterization (IISWC)Portland OR September 2013

[31] Christina Delimitrou and Christos Kozyrakis [n d] Paragon QoS-Aware Scheduling for Heterogeneous Datacenters In Proceedings ofthe Eighteenth International Conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS) HoustonTX USA 2013

[32] Christina Delimitrou and Christos Kozyrakis [n d] QoS-Aware Sched-uling in Heterogeneous Datacenters with Paragon In ACM Transac-tions on Computer Systems (TOCS) Vol 31 Issue 4 December 2013

[33] Christina Delimitrou and Christos Kozyrakis [n d] Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters withParagon In IEEE Micro Special Issue on Top Picks from the ComputerArchitecture Conferences MayJune 2014

[34] Christina Delimitrou and Christos Kozyrakis [n d] Quasar Resource-Efficient and QoS-Aware Cluster Management In Proc of ASPLOS SaltLake City 2014

[35] Christina Delimitrou and Christos Kozyrakis 2016 HCloud Resource-Efficient Provisioning in Shared Cloud Systems In Proceedings of theTwenty First International Conference on Architectural Support for Pro-gramming Languages and Operating Systems (ASPLOS)

[36] Christina Delimitrou and Christos Kozyrakis 2017 Bolt I KnowWhatYou Did Last Summer In The Cloud In Proc of the Twenty SecondInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS)

[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).

[38] Christina Delimitrou Daniel Sanchez and Christos Kozyrakis 2015Tarcil Reconciling Scheduling Speed and Quality in Large Shared Clus-ters In Proceedings of the Sixth ACM Symposium on Cloud Computing(SOCC)

[39] Zidong Du Robert Fasthuber Tianshi Chen Paolo Ienne Ling Li TaoLuo Xiaobing Feng Yunji Chen and Olivier Temam 2015 ShiDi-anNao Shifting Vision Processing Closer to the Sensor In Proceed-ings of the 42Nd Annual International Symposium on Computer Ar-chitecture (ISCA rsquo15) ACM New York NY USA 92ndash104 httpsdoiorg10114527494692750389

[40] Daniel Firestone Andrew Putnam SambhramaMundkur Derek ChiouAlireza Dabagh Mike Andrewartha Hari Angepat Vivek BhanuAdrian Caulfield Eric Chung Harish Kumar Chandrappa SomeshChaturmohta Matt Humphrey Jack Lavier Norman Lam FengfenLiu Kalin Ovtcharov Jitu Padhye Gautham Popuri Shachar RaindelTejas Sapre Mark Shaw Gabriel Silva Madhan Sivakumar NisheethSrivastava Anshuman Verma Qasim Zuhair Deepak Bansal DougBurger Kushagra Vaid David A Maltz and Albert Greenberg 2018Azure Accelerated Networking SmartNICs in the Public Cloud In 15thUSENIX Symposium on Networked Systems Design and Implementation(NSDI 18) USENIX Association Renton WA 51ndash66

[41] Brad Fitzpatrick [n d] Distributed caching with memcached InLinux Journal Volume 2004 Issue 124 2004

[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI'07). USENIX Association, Berkeley, CA, USA, 20–20.

[43] Yu Gan and Christina Delimitrou 2018 The Architectural Implicationsof Cloud Microservices In Computer Architecture Letters (CAL) vol17iss 2

[44] Yu Gan Meghna Pancholi Dailun Cheng Siyuan Hu Yuan He andChristina Delimitrou 2018 Seer Leveraging Big Data to Navigate theComplexity of Cloud Debugging In Proceedings of the Tenth USENIXWorkshop on Hot Topics in Cloud Computing (HotCloud)

[45] Yu Gan Yanqi Zhang Dailun Cheng Ankitha Shetty Priyal RathiNayan Katarki Ariana Bruno Justin Hu Brian Ritchken BrendonJackson Kelvin Hu Meghna Pancholi Yuan He Brett Clancy ChrisColen Fukang Wen Catherine Leung Siyuan Wang Leon ZaruvinskyMateo Espinosa Rick Lin Zhongling Liu Jake Padilla and ChristinaDelimitrou 2019 An Open-Source Benchmark Suite for Microser-vices and Their Hardware-Software Implications for Cloud and EdgeSystems In Proceedings of the Twenty Fourth International Conferenceon Architectural Support for Programming Languages and OperatingSystems (ASPLOS)

[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.

[56] Harshad Kasture and Daniel Sanchez 2014 Ubik Efficient CacheSharing with Strict QoS for Latency-Critical Workloads In Proceed-ings of the 19th international conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS-XIX)

[57] Harshad Kasture and Daniel Sanchez 2016 TailBench A BenchmarkSuite and Evaluation Methodology for Latency-Critical ApplicationsIn Proc of IISWC

[58] Yaakoub El Khamra Hyunjoo Kim Shantenu Jha andManish Parashar[n d] Exploring the performance fluctuations of hpc workloads onclouds In Proceedings of CloudCom Indianapolis IN 2010

[59] Krzysztof C Kiwiel [n d] Convergence and efficiency of subgradientmethods for quasiconvex minimization InMathematical Programming(Series A) (Berlin Heidelberg Springer) 90 (1) pp 1-25 2001

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz Andreacute Barroso andChristos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st AnnualInternational Symposium on Computer Architecuture (ISCA) Minneapo-lis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot. [n. d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4.

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz Andreacute Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power managementof online data-intensive services In Proceedings of the 38th annualinternational symposium on Computer architecture 319ndash330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68

[79] rightscale. [n. d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale

[80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.

[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proceedings VLDB Endow. 3, 1-2 (Sept. 2010), 460–471.

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 619–634.

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. [n. d.]. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.

[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS. ACM, New York, NY, USA, 489–502.

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G. Edward Suh. [n. d.]. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.


[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P Jouppi Cliff Young Nishant Patil David Patterson Gau-rav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Bo-den Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao ChrisClark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben GelbTara Vazir Ghaemmaghami Rajendra Gottipati William GullandRobert Hagmann C Richard Ho Doug Hogberg John Hu RobertHundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexan-der Kaplan Harshit Khaitan Daniel Killebrew Andy Koch NaveenKumar Steve Lacy James Laudon James Law Diemthu Le ChrisLeary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adri-ana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan RaviNarayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omer-nick Narayana Penukonda Andy Phelps Jonathan Ross Matt RossAmir Salek Emad Samadiani Chris Severn Gregory Sizikov MatthewSnelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gre-gory Thorson Bo Tian Horia Toma Erick Tuttle Vijay VasudevanRichard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon 2017In-Datacenter Performance Analysis of a Tensor Processing Unit InProceedings of the 44th Annual International Symposium on ComputerArchitecture (ISCA rsquo17) ACM New York NY USA 1ndash12

[56] Harshad Kasture and Daniel Sanchez 2014 Ubik Efficient CacheSharing with Strict QoS for Latency-Critical Workloads In Proceed-ings of the 19th international conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS-XIX)

[57] Harshad Kasture and Daniel Sanchez 2016 TailBench A BenchmarkSuite and Evaluation Methodology for Latency-Critical ApplicationsIn Proc of IISWC

[58] Yaakoub El Khamra Hyunjoo Kim Shantenu Jha andManish Parashar[n d] Exploring the performance fluctuations of hpc workloads onclouds In Proceedings of CloudCom Indianapolis IN 2010

[59] Krzysztof C Kiwiel [n d] Convergence and efficiency of subgradientmethods for quasiconvex minimization InMathematical Programming(Series A) (Berlin Heidelberg Springer) 90 (1) pp 1-25 2001

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz Andreacute Barroso andChristos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st AnnualInternational Symposium on Computer Architecuture (ISCA) Minneapo-lis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot [n d] EC2 variability The numbers re-vealed httptechmangotcomrollerdaveentryec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars Lingjia Tang and Robert Hundt 2011 Heterogeneity inldquoHomogeneousrdquo Warehouse-Scale Computers A Performance Oppor-tunity IEEE Comput Archit Lett 10 2 (July 2011) 4

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz Andreacute Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power managementof online data-intensive services In Proceedings of the 38th annualinternational symposium on Computer architecture 319ndash330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren Eric Tune Tipp Moseley Yixin Shi Silvius Rus andRobert Hundt 2010 Google-Wide Profiling A Continuous Pro-filing Infrastructure for Data Centers IEEE Micro (2010) 65ndash79httpwwwcomputerorgportalwebcsdldoi101109MM201068

[79] rightscale [n d] Rightscale httpsawsamazoncomsolution-providersisvrightscale

[80] S Sarwar A Ankit and K Roy [n d] Incremental Learning in DeepConvolutional Neural Networks Using Partial Network Sharing InarXiv preprint arXiv171202719

[81] Joumlrg Schad Jens Dittrich and Jorge-Arnulfo Quianeacute-Ruiz 2010 Run-time Measurements in the Cloud Observing Analyzing and ReducingVariance Proceedings VLDB Endow 3 1-2 (Sept 2010) 460ndash471

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H Sigelman Luiz Andreacute Barroso Mike Burrows PatStephenson Manoj Plakal Donald Beaver Saul Jaspan and Chan-dan Shanbhag 2010 Dapper a Large-Scale Distributed Systems TracingInfrastructure Technical Report Google Inc httpsresearchgooglecomarchivepapersdapper-2010-1pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque [n d] Torque Resource Manager httpwwwadaptivecomputingcomproductsopen-sourcetorque

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu Xinxin Jin Peng Huang Yuanyuan Zhou Shan Lu LongJin and Shankar Pasupathy 2016 Early Detection of ConfigurationErrors to Reduce Failure Damage In Proceedings of the 12th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 619ndash634

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu Albert Greenberg Dave Maltz Jennifer Rexford LihuaYuan Srikanth Kandula and Changhoon Kim [n d] Profiling Net-work Performance for Multi-tier Data Center Applications In Proceed-ings of NSDI Boston MA 2011 14

[97] Xiao Yu Pallavi Joshi Jianwu Xu Guoliang Jin Hui Zhang and GuofeiJiang 2016 CloudSeer Workflow Monitoring of Cloud Infrastructuresvia Interleaved Logs In Proceedings of APSLOS ACM New York NYUSA 489ndash502

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G Edward Suh [n d] FPGA-Based Remote PowerSide-Channel Attacks In Proceedings of the IEEE Symposium on Securityand Privacy May 2018

Retraining Seer. By default, training happens once and can be time consuming, taking several hours up to a day for week-long traces collected on our 20-server cluster (Sec. 5 includes a detailed sensitivity study for training time). However, one of the main advantages of microservices is that they simplify frequent application updates, with old microservices often swapped out and replaced by newer modules, or large services progressively broken down to microservices. If the application (or underlying hardware) changes significantly, Seer's detection accuracy can be impacted. To adjust to changes in the execution environment, Seer retrains incrementally in the background, using the transfer learning-based approach in [80]. Weights from previous training rounds are stored on disk, allowing the model to continue training from where it last left off when new data arrive, reducing the training time by 2-3 orders of magnitude. Even though this approach allows Seer to handle application changes almost in real-time, it is not a long-term solution, since new weights are still polluted by the previous application architecture. When the application changes in a major way, e.g., microservices on the critical path change, Seer also retrains from scratch in the background. While the new network trains, QoS violation detection happens with the incrementally-trained interim model. In Section 5 we evaluate Seer's ability to adjust its estimations to application changes.
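To make the incremental-retraining flow concrete, the sketch below shows how a model could resume from checkpointed weights when new traces arrive, and fall back to training from scratch after a major redesign. This is a minimal illustration in TensorFlow/Keras; the layer choices, checkpoint path, and hyperparameters are assumptions for the example, not Seer's actual network or code.

```python
# Minimal sketch of incremental retraining on new traces (illustrative only).
import os
import numpy as np
import tensorflow as tf

WEIGHTS_PATH = "seer.weights.h5"   # hypothetical checkpoint location

def build_model(num_services: int, window: int) -> tf.keras.Model:
    """Stand-in network over per-microservice queue-depth snapshots."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window, num_services)),
        tf.keras.layers.Conv1D(64, 3, activation="relu"),
        tf.keras.layers.LSTM(128),
        # One QoS-violation probability per microservice.
        tf.keras.layers.Dense(num_services, activation="sigmoid"),
    ])

def retrain(model: tf.keras.Model, new_x: np.ndarray, new_y: np.ndarray,
            incremental: bool = True) -> tf.keras.Model:
    """Continue training from stored weights instead of starting from scratch."""
    if incremental and os.path.exists(WEIGHTS_PATH):
        model.load_weights(WEIGHTS_PATH)        # resume from the previous round
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(new_x, new_y, epochs=2, batch_size=128, verbose=0)
    model.save_weights(WEIGHTS_PATH)            # checkpoint for the next update
    return model
```

A full retrain after a major application change would simply call retrain(build_model(...), x, y, incremental=False), leaving the interim model to serve detections while it runs.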

4.4 Hardware Monitoring
Once a QoS violation is signaled and a culprit microservice is pinpointed, Seer uses low-level monitoring to identify the reason behind the QoS violation. The exact process depends on whether Seer has access to performance counters.
Private cluster. When Seer has access to hardware events such as performance counters, it uses them to determine the utilization of different shared resources. Note that even though utilization is a problematic metric for anticipating QoS violations in a large-scale service, once a culprit microservice has been identified, examining the utilization of different resources can provide useful hints to the cluster manager on suitable decisions to avoid degraded performance. Seer specifically examines CPU, memory capacity and bandwidth, network bandwidth, cache contention, and storage I/O bandwidth when prioritizing a resource to adjust. Once the saturated resource is identified, Seer notifies the cluster manager to take action.
Public cluster. When Seer does not have access to performance counters, it instead uses a set of 10 tunable contentious microbenchmarks, each of them targeting a different shared resource [30], to determine resource saturation. For example, if Seer injects the memory bandwidth microbenchmark in the system and tunes up its intensity without an impact on the co-scheduled microservice's performance, memory bandwidth is most likely not the resource that needs to be adjusted. Seer starts from microbenchmarks corresponding to core resources, and progressively moves to resources further away from the core, until it sees a substantial change in performance when running the microbenchmark. Each microbenchmark takes approximately 10ms to complete, avoiding prolonged degraded performance.
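As an illustration of this public-cloud path, the sketch below cycles through contentious microbenchmarks from core resources outward, ramping their intensity until the culprit microservice's tail latency reacts. The resource list, intensity steps, and the two callables are hypothetical placeholders rather than Seer's actual interfaces.

```python
# Illustrative probing loop for the counter-less (public cloud) case.
from typing import Callable, Optional, Sequence

# Ordered from core resources outward to uncore and I/O (assumed ordering).
RESOURCES: Sequence[str] = ("cpu", "l1d", "l2", "llc", "memory_bw",
                            "memory_cap", "network_bw", "disk_bw")

def find_saturated_resource(inject: Callable[[str, float], None],
                            tail_latency_ms: Callable[[], float],
                            baseline_ms: float,
                            threshold: float = 1.2) -> Optional[str]:
    """Return the first resource whose injected contention noticeably
    degrades the culprit microservice's tail latency."""
    for resource in RESOURCES:
        for intensity in (0.25, 0.5, 0.75, 1.0):
            inject(resource, intensity)          # run the ~10ms microbenchmark
            if tail_latency_ms() > threshold * baseline_ms:
                # The service is sensitive to added pressure on this resource,
                # so this is likely the resource that needs to be adjusted.
                return resource
    return None
```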

Upon identifying the problematic resource(s), Seer notifies the cluster manager, which takes one of several resource allocation actions: resizing the Docker container, or using mechanisms like Intel's Cache Allocation Technology (CAT) for last level cache (LLC) partitioning, and the Linux traffic control's hierarchical token bucket (HTB) queueing discipline in qdisc [17, 62] for network bandwidth partitioning.
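The snippet below sketches how a cluster-manager hook might apply these three kinds of actions by shelling out to standard tools (docker update, Intel's pqos utility for CAT, and tc for HTB). The specific flags, class-of-service IDs, bitmasks, and device names are illustrative examples that would need adapting to a given deployment; this is not Seer's actual interface to the cluster manager.

```python
# Hedged sketch of the corrective actions named above (illustrative commands).
import subprocess

def resize_container_cpu(container: str, cpus: float) -> None:
    # Resize the Docker container's CPU allocation.
    subprocess.run(["docker", "update", f"--cpus={cpus}", container], check=True)

def partition_llc(cos_id: int, mask: str, cores: str) -> None:
    # Intel CAT via pqos: define a class-of-service cache mask, then pin cores to it.
    subprocess.run(["pqos", "-e", f"llc:{cos_id}={mask}"], check=True)
    subprocess.run(["pqos", "-a", f"llc:{cos_id}={cores}"], check=True)

def cap_network_bandwidth(dev: str, rate: str) -> None:
    # Linux tc hierarchical token bucket (HTB) to partition NIC bandwidth.
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root",
                    "handle", "1:", "htb", "default", "10"], check=True)
    subprocess.run(["tc", "class", "replace", "dev", dev, "parent", "1:",
                    "classid", "1:10", "htb", "rate", rate], check=True)

# Example (hypothetical values): resize_container_cpu("memcached-3", 2.0)
#                                partition_llc(1, "0x00ff", "4-7")
#                                cap_network_bandwidth("eth0", "5gbit")
```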

4.5 System Insights from Seer
Using learning-based, data-driven approaches in systems is most useful when these techniques are used to gain insight into system problems, instead of treating them as black boxes. Section 5 includes an analysis of the causes behind QoS violations signaled by Seer, including application bugs, poor resource provisioning decisions, and hardware failures. Furthermore, we have deployed Seer in a large installation of the Social Network service over the past few months, and its output has been instrumental not only in guaranteeing QoS, but in understanding sources of unpredictable performance and improving the application design. This has resulted both in progressively fewer QoS violations over time, and a better understanding of the design challenges of microservices.

4.6 Implementation
Seer is implemented in 12KLOC of C/C++ and Python. It runs on Linux and OS X, and supports applications in various languages, including all frameworks the end-to-end services are designed in. Furthermore, we provide automated patches for the instrumentation probes for many popular microservices, including NGINX, memcached, MongoDB, Xapian, and all Sockshop and Go-microservices applications, to minimize the development effort from the user's perspective.
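To give a flavor of what an RPC-level instrumentation probe can record, the decorator below timestamps a handler, measures its latency, and samples the service's queue depth into a trace record. The field names, the queue-depth callable, and the emit_trace sink are assumptions for illustration, not the probes Seer actually patches into NGINX, memcached, or MongoDB.

```python
# Illustrative RPC-level tracing probe (not Seer's actual instrumentation).
import functools
import time
import uuid
from typing import Any, Callable, Dict

def emit_trace(record: Dict[str, Any]) -> None:
    # Stand-in sink; a real deployment would write to the trace store (e.g., Cassandra).
    print(record)

def traced(service: str, queue_depth_fn: Callable[[], int]):
    """Wrap an RPC handler to record per-request latency and queue depth."""
    def decorator(handler: Callable) -> Callable:
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                emit_trace({
                    "trace_id": str(uuid.uuid4()),
                    "service": service,
                    "timestamp": time.time(),
                    "latency_us": (time.perf_counter() - start) * 1e6,
                    "queue_depth": queue_depth_fn(),  # sampled at request completion
                })
        return wrapper
    return decorator
```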

Seer is a centralized system; we use master-slave mirroring to improve fault tolerance, with two hot stand-by masters that can take over if the primary system fails. Similarly, the trace database is also replicated in the background.
Security concerns. Trace data is stored and processed unencrypted in Cassandra. Previous work has shown that the sensitivity applications have to different resources can leak information about their nature and characteristics, making them vulnerable to malicious security attacks [27, 36, 50, 52, 84, 88, 89, 94, 99]. Similar attacks are possible using the data and techniques in Seer, and are deferred to future work.

5 Seer Analysis and Validation
5.1 Methodology
Server clusters. First, we use a dedicated local cluster with 20 2-socket, 40-core servers with 128GB of RAM each. Each server is connected to a 40Gbps ToR switch over 10GbE NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.
Applications. We use all five end-to-end services of Table 1. Services for now are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.

Figure 11: Seer's sensitivity to (a) the size of training datasets and (b) the tracing interval. [Plots: (a) QoS violation detection accuracy (%) vs. training time (hours) for dataset sizes from 100MB to 1TB; (b) detection accuracy (%) vs. tracing interval (s).]

5.2 Evaluation
Sensitivity to training data. Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset. The size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.
Sensitivity to tracing frequency. By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting for a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences, and runs the risk of increasing the tracing overhead. For the remainder of this work we use 100ms as the interval for measuring queue depths across microservices.
False negatives & false positives. Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference was instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds, and, more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defy the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives, and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.

Figure 12: (a) The false negatives and false positives in Seer as we vary the inference window (10ms to 2s); (b) breakdown of the causes of QoS violations (CPU, cache, memory, network, disk, application), as a percentage of valid QoS violations, comparing utilization-based detection, App-only, Net-only, Seer, and the ground truth.

Comparison of debugging systems. Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.
A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk. Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Figure 13: Seer retraining incrementally after each time the Social Network service is updated. [Plots over a 12-hour window: detection accuracy (%) and normalized tail latency for Social Network, Movie Reviewing, E-commerce, and Go-microservices, annotated with impactful and innocuous service updates, incremental retraining, and signaled QoS violations.]

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.
Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.
Retraining. Fig. 13 shows the detection accuracy for Seer, and the tail latency for each end-to-end service, over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad using the orders microservice of E-commerce, or the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from NGINX to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red ×'s denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background and starts recovering its accuracy, until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path, or because they are insensitive to resource contention.
The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.

6 Large-Scale Cloud Study
6.1 Seer Scalability
We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE) and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager based on Seer's feedback.

Figure 14: Seer training and inference with hardware acceleration. [Bar plot: time (s, log scale from 10^-3 to 10^5) for training and inference, comparing the native implementation, TPU, and Brainwave.]

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.
Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems and quantify the impact on training and inference time, and detection accuracy. (Before running on TPUs, we reimplemented our DNN in TensorFlow; we similarly adjust the DNN to the currently-supported designs in Brainwave.) Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs and host the Social Network service on GCE.

Figure 15: QoS violations each microservice in Social Network is responsible for. [Bar plot, y-axis: QoS violations; x-axis microservices: nginx, text, image, userTag, urlShorten, video, recomm, search, login, readPost, compPost, writeGraph, memcached, mongodb.]
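For readers unfamiliar with how a TensorFlow model is offloaded to a Cloud TPU, the sketch below shows the modern TF2 distribution-strategy flow. Seer's original port predates this API and used its own DNN, so the endpoint name, stand-in model, and batch size here are purely placeholders.

```python
# Hedged sketch of offloading training to a Cloud TPU with TensorFlow 2
# distribution strategies (illustrative only; not Seer's actual port).
import tensorflow as tf

def connect_tpu(tpu_address: str = "local") -> tf.distribute.TPUStrategy:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)

def train_on_tpu(x, y, tpu_address: str = "local"):
    strategy = connect_tpu(tpu_address)
    with strategy.scope():                      # variables are placed on the TPU
        model = tf.keras.Sequential([
            tf.keras.Input(shape=x.shape[1:]),
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(y.shape[1], activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(x, y, epochs=2, batch_size=1024)  # large batches amortize transfer cost
    return model
```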

6.2 Source of QoS Violations
We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute and, to a lesser degree, cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming requests is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.

6.3 Seer's Long-Term Impact on Application Design
Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them. Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but to help users better understand the design challenges of microservices, as more services transition to this application model.

Figure 16: QoS violations over the two-month Social Network deployment. [Plot, x-axis: day (0-60); y-axis: QoS violations (0-25).]

7 Conclusions
Cloud services increasingly move away from complex monolithic designs, and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques, and the massive amount of tracing data cloud systems collect, to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements
We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMWare Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] [n.d.]. Apache Thrift. https://thrift.apache.org
[2] [n.d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture
[3] [n.d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services
[4] [n.d.]. Messaging that just works. https://www.rabbitmq.com
[5] [n.d.]. MongoDB. https://www.mongodb.com
[6] [n.d.]. NGINX. https://nginx.org/en
[7] [n.d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application
[8] [n.d.]. Zipkin. http://zipkin.io
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n.d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.
[11] Adrian Cockroft. [n.d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. [n.d.]. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. [n.d.]. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, SJ, June 2011.
[14] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[16] Leon Bottou. [n.d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), Paris, France, 2010.
[17] Martin A. Brown. [n.d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO
[18] Cassandra. [n.d.]. Apache Cassandra. http://cassandra.apache.org
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.
[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th intl. conf. on Architectural Support for Programming Languages and Operating Systems.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 217–231.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. 2016. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 378–383.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. [n.d.]. Cloud-based DDoS attacks and defenses. In Proc. of i-Society, Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. [n.d.]. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. [n.d.]. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference of Autonomic Computing (ICAC), San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n.d.]. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n.d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. [n.d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4, December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. [n.d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences, May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. [n.d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS, Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[36] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/2749469.2750389
[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66.
[41] Brad Fitzpatrick. [n.d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI '07). USENIX Association, Berkeley, CA, USA, 20–20.
[43] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. 2018. Seer: Leveraging Big Data to Navigate the Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. [n.d.]. Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization. In Proc. of NSDI, 2018.
[47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. [n.d.]. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Proceedings of IISWC, Boston, MA, 2007. https://doi.org/10.1109/IISWC.2007.4362193
[48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. [n.d.]. PRESS: PRedictive Elastic ReSource Scaling for cloud systems. In Proceedings of CNSM, Niagara Falls, ON, 2010.
[49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. [n.d.]. Fundamentals of Queueing Theory. In Wiley Series in Probability and Statistics, Book 627, 2011.
[50] Sanchika Gupta and Padam Kumar. [n.d.]. VM Profile Based Optimized Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. In Proc. of SSCC, Mysore, India, 2013.
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n.d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI, Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. [n.d.]. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data, Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. [n.d.]. On the Performance Variability of Production Cloud Services. In Proceedings of CCGRID, Newport Beach, CA, 2011.
[54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2017. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications. In Proceedings of WWW. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 469–478.
[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.
[56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX).
[57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC.
[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. [n.d.]. Exploring the performance fluctuations of hpc workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[59] Krzysztof C. Kiwiel. [n.d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1), pp. 1–25, 2001.
[60] Jacob Leverich and Christos Kozyrakis. [n.d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys, 2014.
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n.d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n.d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.
[63] Dave Mangot. [n.d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed
[64] Jason Mars and Lingjia Tang. [n.d.]. Whare-map: heterogeneity in homogeneous warehouse-scale computers. In Proceedings of ISCA, Tel-Aviv, Israel, 2013.
[65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4.
[66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. [n.d.]. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of MICRO, Porto Alegre, Brazil, 2011.
[67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th annual international symposium on Computer architecture, 319–330.
[68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n.d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC, Jacksonville, FL, 2007.
[69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n.d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys, Paris, France, 2010.
[70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. [n.d.]. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service. In Proceedings of ICAC, San Jose, CA, 2013.
[71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, and Ricardo Bianchini. [n.d.]. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. In Proceedings of ATC, San Jose, CA, 2013.
[72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. [n.d.]. A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing. In Lecture Notes on Cloud Computing, Volume 34, p. 115–131, 2010.
[73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n.d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP, Farmington, PA, 2013.
[74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. [n.d.]. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016.
[75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.
[76] Suhail Rehman and Majd Sakr. [n.d.]. Initial Findings for provisioning variation in cloud computing. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. [n.d.]. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of SOCC, 2012.
[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. https://doi.org/10.1109/MM.2010.68
[79] rightscale. [n.d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale
[80] S. Sarwar, A. Ankit, and K. Roy. [n.d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.
[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proceedings VLDB Endow. 3, 1-2 (Sept. 2010), 460–471.
[82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n.d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys, Prague, 2013.
[83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. [n.d.]. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of SOCC, Cascais, Portugal, 2011.
[84] David Shue, Michael J. Freedman, and Anees Shaikh. [n.d.]. Performance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. of OSDI, Hollywood, CA, 2012.
[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report, Google Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf
[86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. [n.d.]. Distributed Resource Management Across Process Boundaries. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Santa Clara, CA, 2017.
[87] torque. [n.d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque
[88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. [n.d.]. Scheduler-based Defenses against Cross-VM Side-channels. In Proc. of the 23rd Usenix Security Symposium, San Diego, CA, 2014.
[89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. [n.d.]. A Placement Vulnerability Study in Multi-Tenant Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX Security), Washington, DC, 2015.
[90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Ricardo Bianchini. [n.d.]. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.
[91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. [n.d.]. Root cause analysis of anomalies of multitier services in public clouds. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.
[92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n.d.]. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition.
[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 619–634.
[94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. [n.d.]. An Exploration of L2 Cache Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW), Chicago, IL, 2011.
[95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n.d.]. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA, 2013.
[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. [n.d.]. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.
[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS. ACM, New York, NY, USA, 489–502.
[98] Yinqian Zhang and Michael K. Reiter. [n.d.]. Duppel: retrofitting commodity operating systems to mitigate cache side channels in the cloud. In Proc. of CCS, Berlin, Germany, 2013.
[99] Mark Zhao and G. Edward Suh. [n.d.]. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.

Figure 11: Seer's sensitivity to (a) the size of the training dataset (QoS violation detection accuracy (%) versus training time in hours, for datasets from 100MB to 1TB) and (b) the tracing interval (detection accuracy (%) versus queue-sampling interval in seconds).

server is connected to a 40Gbps ToR switch over 10GbE NICs. Second, we deploy the Social Network service to Google Compute Engine (GCE) and Windows Azure clusters with hundreds of servers to study the scalability of Seer.

Applications: We use all five end-to-end services of Table 1. For now, services are driven by open-loop workload generators, and the input load varies from constant, to diurnal, to load with spikes in user demand. In Section 6 we study Seer in a real large-scale deployment of the Social Network; in that case the input load is driven by real user traffic.
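To make the open-loop setup concrete, the following is a minimal sketch of such a generator, not the tool used in the paper: requests are issued on a schedule that never waits for earlier responses, so service-side queueing is not masked by a closed feedback loop. The endpoint, base rate, and diurnal shape are illustrative assumptions.

```python
# Minimal sketch of an open-loop load generator with a diurnal rate pattern.
import asyncio
import math
import time

import aiohttp  # any asynchronous HTTP client would work equally well

TARGET = "http://frontend:8080/composePost"  # hypothetical front-end endpoint

def rate_at(t_sec: float, base_rps: float = 100.0) -> float:
    """Diurnal load: base rate modulated by a 24-hour sinusoid (illustrative)."""
    return base_rps * (1.0 + 0.5 * math.sin(2.0 * math.pi * t_sec / 86400.0))

async def fire(session: aiohttp.ClientSession, latencies: list) -> None:
    start = time.perf_counter()
    try:
        async with session.get(TARGET) as resp:
            await resp.read()
    finally:
        latencies.append(time.perf_counter() - start)

async def generate(duration_sec: float = 60.0) -> list:
    latencies, tasks = [], []
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        while (now := time.perf_counter() - t0) < duration_sec:
            # Open loop: schedule the request and immediately move on.
            tasks.append(asyncio.create_task(fire(session, latencies)))
            await asyncio.sleep(1.0 / rate_at(now))
        await asyncio.gather(*tasks, return_exceptions=True)
    return latencies

# latencies = asyncio.run(generate())
```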

5.2 Evaluation

Sensitivity to training data: Fig. 11a shows the detection accuracy and training time for Seer as we increase the size of the training dataset; the size of the dots is a function of the dataset size. Training data is collected from the 20-server cluster described above, across different load levels, placement strategies, time intervals, and request types. The smallest training set size (100MB) is collected over ten minutes of the cluster operating at high utilization, while the largest dataset (1TB) is collected over almost two months of continuous deployment. As datasets grow, Seer's accuracy increases, leveling off at 100-200GB. Beyond that point accuracy does not further increase, while the time needed for training grows significantly. Unless otherwise specified, we use the 100GB training dataset.

Sensitivity to tracing frequency: By default, the distributed tracing system instantaneously collects the latency of every single user request. Collecting queue depth statistics, on the other hand, is a per-microservice iterative process. Fig. 11b shows how Seer's accuracy changes as we vary the frequency with which we collect queue depth statistics. Waiting a long time before sampling queues, e.g., > 1s, can result in undetected QoS violations before Seer gets a chance to process the incoming traces. In contrast, sampling queues very frequently results in unnecessarily many inferences and runs the risk of increasing the tracing overhead. For the remainder of this work, we use 100ms as the interval for measuring queue depths across microservices.
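As a concrete illustration of the 100ms sampling interval discussed above, the following is a minimal sketch of a periodic queue-depth sampler. The probe function and service names are placeholders for whatever per-microservice instrumentation actually exposes queue lengths; this is not Seer's implementation.

```python
# Sketch of periodic per-microservice queue-depth sampling at a configurable
# interval (100 ms by default, matching the interval used in the evaluation).
import threading
import time
from collections import defaultdict

def probe_queue_depth(service: str) -> int:
    # Placeholder: in practice this would read a counter exported by the
    # microservice (or its NIC) rather than return a constant.
    return 0

class QueueDepthSampler:
    def __init__(self, services, interval_s: float = 0.1):
        self.services = list(services)
        self.interval_s = interval_s          # 100 ms by default
        self.samples = defaultdict(list)      # service -> [(timestamp, depth)]
        self._stop = threading.Event()

    def _loop(self) -> None:
        while not self._stop.is_set():
            now = time.time()
            for svc in self.services:
                self.samples[svc].append((now, probe_queue_depth(svc)))
            self._stop.wait(self.interval_s)  # sleep until the next sample

    def start(self) -> None:
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

# sampler = QueueDepthSampler(["nginx", "memcached", "composePost"])
# sampler.start()
```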

Figure 12: (a) The false negatives and false positives in Seer as we vary the inference window (10ms to 2s), as a percentage (%). (b) Breakdown of valid QoS violations by cause (CPU, cache, memory, network, disk, application), comparing Seer against utilization-based detection, App-only, Net-only, and the ground truth.

False negatives & false positives: Fig. 12a shows the percentage of false negatives and false positives as we vary the prediction window. When Seer tries to anticipate QoS violations that will occur in the next 10-100ms, both false positives and false negatives are low, since Seer uses a very recent snapshot of the cluster state to anticipate performance unpredictability. If inference were instantaneous, very short prediction windows would always be better. However, given that inference takes several milliseconds and, more importantly, applying corrective actions to avoid QoS violations takes 10-100s of milliseconds to take effect, such short windows defeat the point of proactive QoS violation detection. At the other end, predicting far into the future results in significant false negatives and especially false positives. This is because many QoS violations are caused by very short, bursty events that do not have an impact on queue lengths until a few milliseconds before the violation occurs. Therefore, requiring Seer to predict one or more seconds into the future means that normal queue depths are annotated as QoS violations, resulting in many false positives. Unless otherwise specified, we use a 100ms prediction window.

Comparison of debugging systems: Fig. 12b compares Seer with a utilization-based performance debugging system that uses resource saturation as the trigger to signal a QoS violation, and two systems that only use a fraction of Seer's instrumentation. App-only exclusively uses queues measured via application instrumentation (not network queues), while Network-only uses queues in the NIC and ignores application-level queueing. We also show the ground truth for the total number of upcoming QoS violations (96 over a two-week period), and break it down by the reason that led to unpredictable performance.

A large fraction of QoS violations are due to application-level inefficiencies, including correctness bugs, unnecessary synchronization and/or blocking behavior between microservices (including two cases of deadlocks), and misconfigured iptables rules, which caused packet drops. An equally large fraction of QoS violations are due to compute contention, followed by contention in the network, cache, and memory, and finally disk.
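To make the prediction-window labeling described above concrete, the following sketch shows one way a cluster-state snapshot could be labeled as an upcoming QoS violation if a violation occurs within the window; the snapshot and violation timestamps are hypothetical inputs, not Seer's actual data format.

```python
# Sketch of labeling snapshots with a prediction window: a snapshot at time t
# is marked positive if any QoS violation occurs in (t, t + window].
import bisect

def label_snapshots(snapshot_times, violation_times, window_s=0.1):
    """Return one 0/1 label per snapshot timestamp."""
    violation_times = sorted(violation_times)
    labels = []
    for t in snapshot_times:
        # index of the first violation strictly after t
        i = bisect.bisect_right(violation_times, t)
        upcoming = i < len(violation_times) and violation_times[i] <= t + window_s
        labels.append(1 if upcoming else 0)
    return labels

# Example: with a 100 ms window, only the snapshot at t=1.00 s is positive,
# because a violation occurs at t=1.05 s.
print(label_snapshots([0.5, 1.0, 2.0], [1.05, 3.0], window_s=0.1))  # [0, 1, 0]
```

A longer window makes more snapshots positive, which is exactly why predicting seconds ahead annotates normal queue depths as violations and inflates false positives.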

Figure 13: Seer retraining incrementally after each time the Social Network service is updated. Top: detection accuracy (%) over 12 hours, annotated with impactful and innocuous service updates, incremental retraining, and signaled versus missed QoS violations. Bottom: tail latency (normalized to QoS) over the same period for Social Network, Movie Reviewing, E-commerce, and Go-microservices.

Since the only persistent microservices are the back-end databases, it is reasonable that disk accounts for a small fraction of overall QoS violations.

Seer accurately follows this breakdown for the most part, only missing a few QoS violations due to random load spikes, including one caused by a switch failure. The App-only system correctly identifies application-level sources of unpredictable performance, but misses the majority of system-related issues, especially in uncore resources. On the other hand, Network-only correctly identifies the vast majority of network-related issues, as well as most of the core- and uncore-driven QoS violations, but misses several application-level issues. The difference between Network-only and Seer is small, suggesting that one could omit the application-level instrumentation in favor of a simpler design. While this system is still effective in capturing QoS violations, it is less useful in providing feedback to application developers on how to improve their design to avoid QoS violations in the future. Finally, the utilization-based system behaves the worst, missing most violations not caused by CPU saturation.

Out of the 89 QoS violations Seer detects, it notifies the cluster manager early enough to avoid 84 of them. The QoS violations that were not avoided correspond to application-level bugs, which cannot be easily corrected online. Since this is a private cluster, Seer uses utilization metrics and performance counters to identify problematic resources.

Retraining: Fig. 13 shows the detection accuracy for Seer and the tail latency for each end-to-end service over a period of time during which Social Network is getting frequently and substantially updated. This includes new microservices being added to the service, such as the ability to place an order from an ad using the orders microservice of E-commerce, the back-end of the service changing from MongoDB to Cassandra, and the front-end switching from nginx to the node.js front-end of E-commerce. These are changes that fundamentally affect the application's behavior, throughput, latency, and bottlenecks. The other services remain unchanged during this period (Banking was not active during this time and is omitted from the graph). Blue dots denote correctly-signaled upcoming QoS violations, and red crosses denote QoS violations that were not detected by Seer. All unidentified QoS violations coincide with the application being updated. Shortly after the update, Seer incrementally retrains in the background and starts recovering its accuracy, until another major update occurs. Some of the updates have no impact on either performance or Seer's accuracy, either because they involve microservices off the critical path or because they are insensitive to resource contention.

The bottom figure shows that unidentified QoS violations indeed result in performance degradation for Social Network, and in some cases for the other end-to-end services, if they are sharing physical resources with Social Network on an oversubscribed server. Once retraining completes, the performance of the service(s) recovers. The longer Seer trains on an evolving application, the more likely it is to correctly anticipate its future QoS violations.
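As a minimal sketch of what incremental (warm-start) retraining looks like in practice: after a service update, the existing model continues fitting on traces collected since the update rather than being trained from scratch. The model architecture, feature shapes, and data here are assumptions for illustration, not the paper's actual DNN.

```python
# Sketch of background incremental retraining: keep the existing weights and
# continue fitting on newly collected traces after a service update.
import numpy as np
import tensorflow as tf

N_MICROSERVICES, N_FEATURES = 30, 8   # hypothetical sizes

def build_model() -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.Input(shape=(N_MICROSERVICES * N_FEATURES,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        # one output per microservice: probability it causes an upcoming violation
        tf.keras.layers.Dense(N_MICROSERVICES, activation="sigmoid"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="binary_crossentropy")

def incremental_retrain(model, new_x, new_y, epochs=2):
    """Warm start: weights are preserved, so only the new behavior must be learned."""
    model.fit(new_x, new_y, epochs=epochs, batch_size=128, verbose=0)
    return model

# e.g., after an impactful service update, using freshly collected traces:
new_x = np.random.rand(1024, N_MICROSERVICES * N_FEATURES).astype("float32")
new_y = (np.random.rand(1024, N_MICROSERVICES) > 0.95).astype("float32")
incremental_retrain(model, new_x, new_y)
```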

6 Large-Scale Cloud Study

6.1 Seer Scalability

We now deploy our Social Network service on a 100-server dedicated cluster on Google Compute Engine (GCE) and use it to service real user traffic. The application has 582 registered users, with 165 daily active users, and has been deployed for a two-month period. The cluster on average hosts 386 single-concerned containers (one microservice per container), subject to some resource scaling actions by the cluster manager based on Seer's feedback.

Figure 14: Seer training and inference time (seconds, log scale) with hardware acceleration, comparing the native implementation against the TPU and Brainwave.

Accuracy remains high for Seer, consistent with the small-scale experiments. Inference time, however, increases substantially, from 11.4ms for the 20-server cluster to 54ms for the 100-server GCE setting. Even though this is still sufficient for many resource allocation decisions, as the application scales further, Seer's ability to anticipate a QoS violation within the cluster manager's window of opportunity diminishes.

Figure 15: QoS violations each microservice in Social Network is responsible for (per-microservice violation counts across the front-end, logic, caching, and database tiers, from nginx through memcached and mongodb).

Over the past year, multiple public cloud providers have exposed hardware acceleration offerings for DNN training and inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems and quantify the impact on training and inference time and on detection accuracy.¹ Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs and host the Social Network service on GCE.

¹Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjust the DNN to the currently-supported designs in Brainwave.
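For context, the following is a hedged sketch of how a Keras model like the one sketched earlier could be placed on a Cloud TPU using TensorFlow's distribution strategy. It uses current TensorFlow 2.x APIs, which differ from the TensorFlow version available at the time of the original port; the TPU name is a deployment-specific assumption, and the Brainwave toolchain (FPGA-specific) is not shown.

```python
# Sketch of offloading training to a Cloud TPU via TensorFlow's TPUStrategy.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # hypothetical TPU name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created inside the strategy scope are placed on the TPU.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(240,)),                       # assumed feature size
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(30, activation="sigmoid"),    # per-microservice outputs
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# model.fit(train_dataset, epochs=...)  # train_dataset: a tf.data.Dataset of traces
```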

6.2 Source of QoS Violations

We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute and, to a lesser degree, cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding. If processing for any incoming request is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.
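The sensitivity of high-fanout services can be illustrated with a small sketch: a service that must wait on several fanned-out requests before proceeding is gated by the slowest of them, so a single delayed dependency stalls the whole request. The service names and delay distribution below are hypothetical, chosen only to mimic an occasional straggler.

```python
# Illustration of why high-fanout microservices are common culprits: the
# response cannot be assembled until the slowest of the fanned-out calls
# returns, so one delayed dependency inflates end-to-end latency.
import asyncio
import random

async def call_downstream(name: str) -> float:
    delay = random.expovariate(1 / 0.005)   # ~5 ms typical service time
    if random.random() < 0.02:              # occasional contention
        delay += 0.050                      # +50 ms straggler
    await asyncio.sleep(delay)
    return delay

async def compose_post() -> float:
    deps = ["user", "text", "media", "urlShorten", "userTag"]  # hypothetical fanout
    delays = await asyncio.gather(*(call_downstream(d) for d in deps))
    return max(delays)   # end-to-end time is gated by the slowest dependency

# print(asyncio.run(compose_post()))
```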

6.3 Seer's Long-Term Impact on Application Design

Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them.

Figure 16: QoS violations per day in Social Network over the two-month deployment (days 0-60).

Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, or microservices forming cyclic dependencies, or using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. On days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but also to help users better understand the design challenges of microservices, as more services transition to this application model.

7 Conclusions

Cloud services increasingly move away from complex monolithic designs and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques and the massive amount of tracing data cloud systems collect to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios, Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements

We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References

[1] [n. d.]. Apache Thrift. https://thrift.apache.org
[2] [n. d.]. Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture
[3] [n. d.]. Golang Microservices Example. https://github.com/harlow/go-micro-services
[4] [n. d.]. Messaging that just works. https://www.rabbitmq.com
[5] [n. d.]. MongoDB. https://www.mongodb.com
[6] [n. d.]. NGINX. https://nginx.org/en
[7] [n. d.]. SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application
[8] [n. d.]. Zipkin. http://zipkin.io
[9] 2016. The Evolution of Microservices. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. [n. d.]. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.

[11] Adrian Cockroft. [n. d.]. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. [n. d.]. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. [n. d.]. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, San Jose, June 2011.
[14] Luiz Barroso and Urs Hoelzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. The BellKor 2008 Solution to the Netflix Prize. Technical Report.
[16] Leon Bottou. [n. d.]. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), Paris, France, 2010.
[17] Martin A. Brown. [n. d.]. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO
[18] Cassandra. [n. d.]. Apache Cassandra. http://cassandra.apache.org
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-scale Acceleration Architecture. In MICRO. IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages.
[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th intl. conf. on Architectural Support for Programming Languages and Operating Systems.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105–112. https://doi.org/10.1145/2996864
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 609–622.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. 2007. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42–51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 217–231.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.

[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. 2016. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 378–383.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. [n. d.]. Cloud-based DDoS attacks and defenses. In Proc. of i-Society, Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. [n. d.]. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. [n. d.]. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference of Autonomic Computing (ICAC), San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. [n. d.]. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. [n. d.]. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4, December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences, May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. [n. d.]. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS, Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. 2016. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[36] Christina Delimitrou and Christos Kozyrakis. 2017. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[37] Christina Delimitrou and Christos Kozyrakis. 2018. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM).
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. 2015. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC).
[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/2749469.2750389
[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). USENIX Association, Renton, WA, 51–66.
[41] Brad Fitzpatrick. [n. d.]. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. 2007. X-trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI '07). USENIX Association, Berkeley, CA, USA, 20–20.
[43] Yu Gan and Christina Delimitrou. 2018. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2.
[44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. 2018. Seer: Leveraging Big Data to Navigate the Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

[46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. [n. d.]. Exploiting a Natural Network Effect for Scalable Fine-grained Clock Synchronization. In Proc. of NSDI, 2018.
[47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. [n. d.]. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Proceedings of IISWC, Boston, MA, 2007. https://doi.org/10.1109/IISWC.2007.4362193
[48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. [n. d.]. PRESS: PRedictive Elastic ReSource Scaling for cloud systems. In Proceedings of CNSM, Niagara Falls, ON, 2010.
[49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. [n. d.]. Fundamentals of Queueing Theory. In Wiley Series in Probability and Statistics, Book 627, 2011.
[50] Sanchika Gupta and Padam Kumar. [n. d.]. VM Profile Based Optimized Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. In Proc. of SSCC, Mysore, India, 2013.
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. [n. d.]. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI, Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. [n. d.]. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data, Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. [n. d.]. On the Performance Variability of Production Cloud Services. In Proceedings of CCGRID, Newport Beach, CA, 2011.
[54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2017. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications. In Proceedings of WWW. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 469–478.
[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1–12.
[56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX).
[57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC.
[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. [n. d.]. Exploring the performance fluctuations of HPC workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[59] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90 (1), pp. 1–25, 2001.
[60] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys, 2014.
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.
[63] Dave Mangot. [n. d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed
[64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in homogeneous warehouse-scale computers. In Proceedings of ISCA, Tel-Aviv, Israel, 2013.
[65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4.
[66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. [n. d.]. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of MICRO, Porto Alegre, Brazil, 2011.
[67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th annual international symposium on Computer architecture, 319–330.
[68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC, Jacksonville, FL, 2007.
[69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys, Paris, France, 2010.

[70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. [n. d.]. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service. In Proceedings of ICAC, San Jose, CA, 2013.
[71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, and Ricardo Bianchini. [n. d.]. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. In Proceedings of ATC, San Jose, CA, 2013.
[72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. [n. d.]. A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing. In Lecture Notes on Cloud Computing, Volume 34, p. 115–131, 2010.
[73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP, Farminton, PA, 2013.
[74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. [n. d.]. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016.
[75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.
[76] Suhail Rehman and Majd Sakr. [n. d.]. Initial Findings for provisioning variation in cloud computing. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. [n. d.]. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of SOCC, 2012.
[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68
[79] rightscale. [n. d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale
[80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.
[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proceedings VLDB Endow. 3, 1-2 (Sept. 2010), 460–471.
[82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys, Prague, 2013.
[83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. [n. d.]. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of SOCC, Cascais, Portugal, 2011.
[84] David Shue, Michael J. Freedman, and Anees Shaikh. [n. d.]. Performance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. of OSDI, Hollywood, CA, 2012.
[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf
[86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. [n. d.]. Distributed Resource Management Across Process Boundaries. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Santa Clara, CA, 2017.
[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque
[88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. [n. d.]. Scheduler-based Defenses against Cross-VM Side-channels. In Proc. of the 23rd Usenix Security Symposium, San Diego, CA, 2014.
[89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. [n. d.]. A Placement Vulnerability Study in Multi-Tenant Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX Security), Washington, DC, 2015.
[90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Ricardo Bianchini. [n. d.]. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.
[91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. [n. d.]. Root cause analysis of anomalies of multitier services in public clouds. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.
[92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition.
[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 619–634.
[94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. [n. d.]. An Exploration of L2 Cache Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW), Chicago, IL, 2011.
[95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA, 2013.
[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. [n. d.]. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.
[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS. ACM, New York, NY, USA, 489–502.
[98] Yinqian Zhang and Michael K. Reiter. [n. d.]. Duppel: retrofitting commodity operating systems to mitigate cache side channels in the cloud. In Proc. of CCS, Berlin, Germany, 2013.
[99] Mark Zhao and G. Edward Suh. [n. d.]. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.


[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P Jouppi Cliff Young Nishant Patil David Patterson Gau-rav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Bo-den Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao ChrisClark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben GelbTara Vazir Ghaemmaghami Rajendra Gottipati William GullandRobert Hagmann C Richard Ho Doug Hogberg John Hu RobertHundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexan-der Kaplan Harshit Khaitan Daniel Killebrew Andy Koch NaveenKumar Steve Lacy James Laudon James Law Diemthu Le ChrisLeary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adri-ana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan RaviNarayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omer-nick Narayana Penukonda Andy Phelps Jonathan Ross Matt RossAmir Salek Emad Samadiani Chris Severn Gregory Sizikov MatthewSnelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gre-gory Thorson Bo Tian Horia Toma Erick Tuttle Vijay VasudevanRichard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon 2017In-Datacenter Performance Analysis of a Tensor Processing Unit InProceedings of the 44th Annual International Symposium on ComputerArchitecture (ISCA rsquo17) ACM New York NY USA 1ndash12

[56] Harshad Kasture and Daniel Sanchez 2014 Ubik Efficient CacheSharing with Strict QoS for Latency-Critical Workloads In Proceed-ings of the 19th international conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS-XIX)

[57] Harshad Kasture and Daniel Sanchez 2016 TailBench A BenchmarkSuite and Evaluation Methodology for Latency-Critical ApplicationsIn Proc of IISWC

[58] Yaakoub El Khamra Hyunjoo Kim Shantenu Jha andManish Parashar[n d] Exploring the performance fluctuations of hpc workloads onclouds In Proceedings of CloudCom Indianapolis IN 2010

[59] Krzysztof C Kiwiel [n d] Convergence and efficiency of subgradientmethods for quasiconvex minimization InMathematical Programming(Series A) (Berlin Heidelberg Springer) 90 (1) pp 1-25 2001

[60] Jacob Leverich and Christos Kozyrakis [n d] Reconciling High ServerUtilization and Sub-millisecond Quality-of-Service In Proc of EuroSys2014

[61] David Lo Liqun Cheng Rama Govindaraju Luiz Andreacute Barroso andChristos Kozyrakis [n d] Towards Energy Proportionality for Large-scale Latency-critical Workloads In Proceedings of the 41st AnnualInternational Symposium on Computer Architecuture (ISCA) Minneapo-lis MN 2014

[62] David Lo Liqun Cheng Rama Govindaraju Parthasarathy Ran-ganathan and Christos Kozyrakis [n d] Heracles Improving Re-source Efficiency at Scale In Proc of the 42Nd Annual InternationalSymposium on Computer Architecture (ISCA) Portland OR 2015

[63] Dave Mangot [n d] EC2 variability The numbers re-vealed httptechmangotcomrollerdaveentryec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang [n d] Whare-map heterogeneity inhomogeneous warehouse-scale computers In Proceedings of ISCATel-Aviv Israel 2013

[65] Jason Mars Lingjia Tang and Robert Hundt 2011 Heterogeneity inldquoHomogeneousrdquo Warehouse-Scale Computers A Performance Oppor-tunity IEEE Comput Archit Lett 10 2 (July 2011) 4

[66] Jason Mars Lingjia Tang Robert Hundt Kevin Skadron and Mary LouSoffa [n d] Bubble-Up increasing utilization in modern warehousescale computers via sensible co-locations In Proceedings of MICROPorto Alegre Brazil 2011

[67] David Meisner Christopher M Sadler Luiz Andreacute Barroso Wolf-Dietrich Weber and Thomas F Wenisch 2011 Power managementof online data-intensive services In Proceedings of the 38th annualinternational symposium on Computer architecture 319ndash330

[68] Ripal Nathuji Canturk Isci and Eugene Gorbatov [n d] Exploitingplatform heterogeneity for power efficient data centers In Proceedingsof ICAC Jacksonville FL 2007

[69] Ripal Nathuji Aman Kansal and Alireza Ghaffarkhah [n d] Q-Clouds Managing Performance Interference Effects for QoS-AwareClouds In Proceedings of EuroSys ParisFrance 2010

[70] Hiep Nguyen Zhiming Shen Xiaohui Gu Sethuraman Subbiah andJohn Wilkes [n d] AGILE Elastic Distributed Resource Scaling forInfrastructure-as-a-Service In Proceedings of ICAC San Jose CA 2013

[71] Dejan Novakovic Nedeljko Vasic Stanko Novakovic Dejan Kosticand Ricardo Bianchini [n d] DeepDive Transparently Identifyingand Managing Performance Interference in Virtualized EnvironmentsIn Proceedings of ATC San Jose CA 2013

[72] Simon Ostermann Alexandru Iosup Nezih Yigitbasi Radu ProdanThomas Fahringer and Dick Epema [n d] A Performance Analysisof EC2 Cloud Computing Services for Scientific Computing In LectureNotes on Cloud Computing Volume 34 p115-131 2010

[73] Kay Ousterhout Patrick Wendell Matei Zaharia and Ion Stoica [nd] Sparrow Distributed Low Latency Scheduling In Proceedings ofSOSP Farminton PA 2013

[74] Xue Ouyang Peter Garraghan Renyu Yang Paul Townend and Jie Xu[n d] Reducing Late-Timing Failure at Scale Straggler Root-CauseAnalysis in Cloud Datacenters In Proceedings of 46th Annual IEEEIFIPInternational Conference on Dependable Systems and Networks 2016

[75] Andrew Putnam Adrian M Caulfield Eric S Chung Derek ChiouKypros Constantinides John Demme Hadi Esmaeilzadeh Jeremy Fow-ers Gopi Prashanth Gopal Jan Gray Michael Haselman Scott HauckStephen Heil Amir Hormati Joo-Young Kim Sitaram Lanka JamesLarus Eric Peterson Simon Pope Aaron Smith Jason Thong Phillip YiXiao and Doug Burger 2014 A Reconfigurable Fabric for Accelerat-ing Large-Scale Datacenter Services In Proc of the 41st Intl Symp onComputer Architecture

[76] Suhail Rehman and Majd Sakr [n d] Initial Findings for provisioningvariation in cloud computing In Proceedings of CloudCom IndianapolisIN 2010

[77] Charles Reiss Alexey Tumanov Gregory Ganger Randy Katz andMichael Kozych [n d] Heterogeneity and Dynamicity of Clouds atScale Google Trace Analysis In Proceedings of SOCC 2012

[78] Gang Ren Eric Tune Tipp Moseley Yixin Shi Silvius Rus andRobert Hundt 2010 Google-Wide Profiling A Continuous Pro-filing Infrastructure for Data Centers IEEE Micro (2010) 65ndash79httpwwwcomputerorgportalwebcsdldoi101109MM201068

[79] rightscale [n d] Rightscale httpsawsamazoncomsolution-providersisvrightscale

[80] S Sarwar A Ankit and K Roy [n d] Incremental Learning in DeepConvolutional Neural Networks Using Partial Network Sharing InarXiv preprint arXiv171202719

[81] Joumlrg Schad Jens Dittrich and Jorge-Arnulfo Quianeacute-Ruiz 2010 Run-time Measurements in the Cloud Observing Analyzing and ReducingVariance Proceedings VLDB Endow 3 1-2 (Sept 2010) 460ndash471

[82] Malte Schwarzkopf Andy Konwinski Michael Abd-El-Malek andJohn Wilkes [n d] Omega flexible scalable schedulers for largecompute clusters In Proceedings of EuroSys Prague 2013

[83] Zhiming Shen Sethuraman Subbiah Xiaohui Gu and John Wilkes [nd] CloudScale elastic resource scaling for multi-tenant cloud systemsIn Proceedings of SOCC Cascais Portugal 2011

[84] David Shue Michael J Freedman and Anees Shaikh [n d] Perfor-mance Isolation and Fairness for Multi-tenant Cloud Storage In Procof OSDI Hollywood CA 2012

[85] Benjamin H Sigelman Luiz Andreacute Barroso Mike Burrows PatStephenson Manoj Plakal Donald Beaver Saul Jaspan and Chan-dan Shanbhag 2010 Dapper a Large-Scale Distributed Systems TracingInfrastructure Technical Report Google Inc httpsresearchgooglecomarchivepapersdapper-2010-1pdf

[86] Lalith Suresh Peter Bodik Ishai Menache Marco Canini and FlorinCiucu [n d] Distributed Resource Management Across ProcessBoundaries In Proceedings of the ACM Symposium on Cloud Computing(SOCC) Santa Clara CA 2017

[87] torque [n d] Torque Resource Manager httpwwwadaptivecomputingcomproductsopen-sourcetorque

[88] Venkatanathan Varadarajan Thomas Ristenpart and Michael Swift[n d] Scheduler-based Defenses against Cross-VM Side-channels InProc of the 23rd Usenix Security Symposium San Diego CA 2014

[89] Venkatanathan Varadarajan Yinqian Zhang Thomas Ristenpart andMichael Swift [n d] A Placement Vulnerability Study inMulti-TenantPublic Clouds In Proc of the 24th USENIX Security Symposium (USENIXSecurity) Washington DC 2015

[90] Nedeljko Vasić Dejan Novaković Svetozar Miučin Dejan Kostić andRicardo Bianchini [n d] DejaVu accelerating resource allocationin virtualized environments In Proceedings of the Seventeenth Interna-tional Conference on Architectural Support for Programming Languagesand Operating Systems (ASPLOS) London UK 2012

[91] Jianping Weng Jessie Hui Wang Jiahai Yang and Yang Yang [n d]Root cause analysis of anomalies of multitier services in public cloudsIn Proceedings of the IEEEACM 25th International Symposium on Qual-ity of Service (IWQoS) 2017

[92] Ian H Witten Eibe Frank and Geoffrey Holmes [n d] Data MiningPractical Machine Learning Tools and Techniques 3rd Edition

[93] Tianyin Xu Xinxin Jin Peng Huang Yuanyuan Zhou Shan Lu LongJin and Shankar Pasupathy 2016 Early Detection of ConfigurationErrors to Reduce Failure Damage In Proceedings of the 12th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo16)USENIX Association Berkeley CA USA 619ndash634

[94] Yunjing Xu Michael Bailey Farnam Jahanian Kaustubh Joshi MattiHiltunen and Richard Schlichting [n d] An Exploration of L2 CacheCovert Channels in Virtualized Environments In Proc of the 3rd ACMWorkshop on Cloud Computing Security Workshop (CCSW) ChicagoIL 2011

[95] Hailong Yang Alex Breslow Jason Mars and Lingjia Tang [n d]Bubble-flux precise online QoS management for increased utilizationin warehouse scale computers In Proceedings of ISCA 2013

[96] Minlan Yu Albert Greenberg Dave Maltz Jennifer Rexford LihuaYuan Srikanth Kandula and Changhoon Kim [n d] Profiling Net-work Performance for Multi-tier Data Center Applications In Proceed-ings of NSDI Boston MA 2011 14

[97] Xiao Yu Pallavi Joshi Jianwu Xu Guoliang Jin Hui Zhang and GuofeiJiang 2016 CloudSeer Workflow Monitoring of Cloud Infrastructuresvia Interleaved Logs In Proceedings of APSLOS ACM New York NYUSA 489ndash502

[98] Yinqian Zhang and Michael K Reiter [n d] Duppel retrofittingcommodity operating systems to mitigate cache side channels in thecloud In Proc of CCS Berlin Germany 2013

[99] Mark Zhao and G Edward Suh [n d] FPGA-Based Remote PowerSide-Channel Attacks In Proceedings of the IEEE Symposium on Securityand Privacy May 2018

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 End-to-End Applications with Microservices
    • 31 Social Network
    • 32 Media Service
    • 33 E-Commerce Service
    • 34 Banking System
    • 35 Hotel Reservation Site
      • 4 Seer Design
        • 41 Overview
        • 42 Distributed Tracing
        • 43 Deep Learning in Performance Debugging
        • 44 Hardware Monitoring
        • 45 System Insights from Seer
        • 46 Implementation
          • 5 Seer Analysis and Validation
            • 51 Methodology
            • 52 Evaluation
              • 6 Large-Scale Cloud Study
                • 61 Seer Scalability
                • 62 Source of QoS Violations
                • 63 Seers Long-Term Impact on Application Design
                  • 7 Conclusions
                  • Acknowledgements
                  • References
Page 12: Seer: Leveraging Big Data to Navigate the Complexity of ... · data to extract the causal relationships between requests. Even though such tracing systems help cloud providers detect

Figure 15. QoS violations each microservice in Social Network is responsible for (x-axis: microservice, including nginx, text, image, userTag, urlShorten, video, recomm, search, login, readPost, compPost, writeGraph, memcached, and mongodb; y-axis: number of QoS violations).

inference, either using a special-purpose design like the Tensor Processing Unit (TPU) from Google [55], or using reconfigurable FPGAs, like Project Brainwave from Microsoft [25]. We offload Seer's DNN logic to both systems and quantify the impact on training and inference time, and on detection accuracy.¹ Fig. 14 shows this comparison for a 200GB training dataset. Both the TPU and Project Brainwave dramatically outperform our local implementation, by up to two orders of magnitude. Between the two accelerators, the TPU is more effective in training, consistent with its design objective [55], while Project Brainwave achieves faster inference. For the remainder of the paper we run Seer on TPUs, and host the Social Network service on GCE.

¹Before running on TPUs, we reimplemented our DNN in TensorFlow. We similarly adjust the DNN to the currently-supported designs in Brainwave.
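To make the offload concrete, the sketch below shows one way a TensorFlow reimplementation of a model like Seer's could be placed on a Cloud TPU through a distribution strategy. The resolver setting, layer sizes, and input shape are illustrative assumptions for this sketch, not the paper's actual code or model dimensions.

    import tensorflow as tf

    # Illustrative sketch: retarget a Keras model to an attached Cloud TPU.
    # The empty tpu argument auto-detects the TPU on a GCE TPU worker.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        # Stand-in network: a window of per-microservice tracing features in,
        # a per-microservice QoS-violation score out (shapes are assumptions).
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(30, 128)),            # (timesteps, features)
            tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
            tf.keras.layers.LSTM(64),
            tf.keras.layers.Dense(128, activation="sigmoid"),  # one score per microservice
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
    # model.fit(...) then executes training steps on the TPU cores rather than locally.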

6.2 Source of QoS Violations

We now examine which microservice is the most common culprit for a QoS violation. Fig. 15 shows the number of QoS violations caused by each service over the two-month period. The most frequent culprits by far are the in-memory caching tiers in memcached, and Thrift services with high request fanout, such as composePost, readPost, and login. memcached is a justified source of QoS violations, since it is on the critical path for almost all query types, and it is additionally very sensitive to resource contention in compute and, to a lesser degree, cache and memory. Microservices with high fanout are also expected to initiate QoS violations, as they have to synchronize multiple inbound requests before proceeding; if processing for any incoming request is delayed, end-to-end performance is likely to suffer. Among these QoS violations, most of memcached's violations were caused by resource contention, while violations in Thrift services were caused by long synchronization times.
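The fan-out effect can be illustrated with a back-of-the-envelope calculation that is not from the paper: if a tier must synchronize N downstream requests, and each request independently exceeds its latency target with probability p, the tier misses its own target whenever any one of them does.

    # Illustrative only: probability that a fan-out of N parallel requests is slow,
    # assuming each leg independently misses its latency target with probability p.
    def p_end_to_end_slow(p: float, fanout: int) -> float:
        return 1.0 - (1.0 - p) ** fanout

    for fanout in (1, 5, 20, 50):
        print(f"fanout={fanout:2d}  P(slow) = {p_end_to_end_slow(0.01, fanout):.1%}")
    # fanout= 1  P(slow) = 1.0%
    # fanout=50  P(slow) = 39.5%
    # A dependency that is slow only 1% of the time becomes a frequent source
    # of end-to-end QoS violations once fan-out is high.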

6.3 Seer's Long-Term Impact on Application Design

Seer has now been deployed in the Social Network cluster for over two months, and in this time it has detected 536 upcoming QoS violations (90.6% accuracy) and avoided 495 (84%) of them.

Figure 16. Number of QoS violations per day over the two-month deployment of Seer with the Social Network service (x-axis: day, 0-60; y-axis: QoS violations, 0-25).

Furthermore, by detecting recurring patterns that lead to QoS violations, Seer has helped the application developers better understand bugs and design decisions that lead to hotspots, such as microservices with a lot of back-and-forth communication between them, microservices forming cyclic dependencies, or microservices using blocking primitives. This has led to a decreasing number of QoS violations over the two-month period (seen in Fig. 16), as the application progressively improves. In days 22 and 23 there was a cluster outage, which is why the reported violations are zero. Systems like Seer can be used not only to improve performance predictability in complex cloud systems, but also to help users better understand the design challenges of microservices, as more services transition to this application model.
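As one example of the kind of design check this feedback loop enables, the sketch below flags cyclic dependencies in an RPC call graph reconstructed from traces. The graph, service names, and DFS-based check are illustrative assumptions, not functionality the paper attributes to Seer itself.

    from typing import Dict, List

    def find_cycle(graph: Dict[str, List[str]]) -> List[str]:
        """Return one cycle of services if the call graph contains one, else []."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {s: WHITE for s in graph}
        stack: List[str] = []

        def dfs(s: str) -> List[str]:
            color[s] = GRAY
            stack.append(s)
            for t in graph.get(s, []):
                if color.get(t, WHITE) == GRAY:          # back edge closes a cycle
                    return stack[stack.index(t):] + [t]
                if color.get(t, WHITE) == WHITE:
                    cycle = dfs(t)
                    if cycle:
                        return cycle
            color[s] = BLACK
            stack.pop()
            return []

        for s in list(graph):
            if color[s] == WHITE:
                cycle = dfs(s)
                if cycle:
                    return cycle
        return []

    # Toy call graph with a deliberate cycle between composePost and writeGraph.
    calls = {"nginx": ["composePost"],
             "composePost": ["writeGraph"],
             "writeGraph": ["memcached", "composePost"]}
    print(find_cycle(calls))   # ['composePost', 'writeGraph', 'composePost']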

7 Conclusions

Cloud services increasingly move away from complex monolithic designs and adopt the model of specialized, loosely-coupled microservices. We presented Seer, a data-driven cloud performance debugging system that leverages practical learning techniques and the massive amount of tracing data cloud systems collect to proactively detect and avoid QoS violations. We have validated Seer's accuracy in controlled environments, and evaluated its scalability on large-scale clusters on public clouds. We have also deployed the system in a cluster hosting a social network with hundreds of users. In all scenarios Seer accurately detects upcoming QoS violations, improving responsiveness and performance predictability. As more services transition to the microservices model, systems like Seer provide practical solutions that can navigate the increasing complexity of the cloud.

Acknowledgements

We sincerely thank Christos Kozyrakis, Balaji Prabhakar, Mendel Rosenblum, Daniel Sanchez, Ravi Soundararajan, the academic and industrial users of the benchmark suite, and the anonymous reviewers for their feedback on earlier versions of this manuscript. This work was supported by NSF CAREER Award CCF-1846046, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and donations from GCE, Azure, and AWS.

References
[1] Apache Thrift. https://thrift.apache.org
[2] Decomposing Twitter: Adventures in Service-Oriented Architecture. https://www.slideshare.net/InfoQ/decomposing-twitter-adventures-in-serviceoriented-architecture
[3] Golang Microservices Example. https://github.com/harlow/go-micro-services
[4] Messaging that just works. https://www.rabbitmq.com
[5] MongoDB. https://www.mongodb.com
[6] NGINX. https://nginx.org/en
[7] SockShop: A Microservices Demo Application. https://www.weave.works/blog/sock-shop-microservices-demo-application
[8] Zipkin. http://zipkin.io
[9] The Evolution of Microservices, 2016. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. In Proceedings of OSDI, 2016.
[11] Adrian Cockroft. Microservices Workshop: Why, what, and how to get there. http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference
[12] Hyunwook Baek, Abhinav Srivastava, and Jacobus Van der Merwe. CloudSight: A tenant-oriented transparency framework for cross-layer cloud troubleshooting. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
[13] Luiz Barroso. Warehouse-Scale Computing: Entering the Teenage Decade. ISCA Keynote, SJ, June 2011.
[14] Luiz Barroso and Urs Hoelzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers, 2009.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. The BellKor 2008 Solution to the Netflix Prize. Technical Report, 2007.
[16] Leon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), Paris, France, 2010.
[17] Martin A. Brown. Traffic Control HOWTO. http://linux-ip.net/articles/Traffic-Control-HOWTO
[18] Apache Cassandra. http://cassandra.apache.org
[19] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A Cloud-scale Acceleration Architecture. In MICRO, IEEE Press, Piscataway, NJ, USA, Article 7, 13 pages, 2016.
[20] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2014.
[21] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. DianNao Family: Energy-efficient Hardware Accelerators for Machine Learning. Commun. ACM 59, 11 (Oct. 2016), 105-112. https://doi.org/10.1145/2996864
[22] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), IEEE Computer Society, Washington, DC, USA, 609-622, 2014.
[23] Ludmila Cherkasova, Diwaker Gupta, and Amin Vahdat. Comparison of the Three CPU Schedulers in Xen. SIGMETRICS Perform. Eval. Rev. 35, 2 (Sept. 2007), 42-51.
[24] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14), USENIX Association, Berkeley, CA, USA, 217-231, 2014.
[25] Eric S. Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian M. Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamás Juhász, Kara Kagi, Ratna Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven K. Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8-20.
[26] Guilherme Da Cunha Rodrigues, Rodrigo N. Calheiros, Vinicius Tavares Guimaraes, Glederson Lessa dos Santos, Márcio Barbosa de Carvalho, Lisandro Zambenedetti Granville, Liane Margarida Rockenbach Tarouco, and Rajkumar Buyya. Monitoring of Cloud Computing Environments: Concepts, Solutions, Trends, and Future Directions. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16), ACM, New York, NY, USA, 378-383, 2016.
[27] Marwan Darwish, Abdelkader Ouda, and Luiz Fernando Capretz. Cloud-based DDoS attacks and defenses. In Proc. of i-Society, Toronto, ON, 2013.
[28] Jeffrey Dean and Luiz Andre Barroso. The Tail at Scale. In CACM, Vol. 56, No. 2.
[29] Christina Delimitrou, Nick Bambos, and Christos Kozyrakis. QoS-Aware Admission Control in Heterogeneous Datacenters. In Proceedings of the International Conference of Autonomic Computing (ICAC), San Jose, CA, USA, 2013.
[30] Christina Delimitrou and Christos Kozyrakis. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Portland, OR, September 2013.
[31] Christina Delimitrou and Christos Kozyrakis. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Houston, TX, USA, 2013.
[32] Christina Delimitrou and Christos Kozyrakis. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4, December 2013.
[33] Christina Delimitrou and Christos Kozyrakis. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences, May/June 2014.
[34] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. of ASPLOS, Salt Lake City, 2014.
[35] Christina Delimitrou and Christos Kozyrakis. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[36] Christina Delimitrou and Christos Kozyrakis. Bolt: I Know What You Did Last Summer In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[37] Christina Delimitrou and Christos Kozyrakis. Amdahl's Law for Tail Latency. In Communications of the ACM (CACM), 2018.
[38] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC), 2015.
[39] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15), ACM, New York, NY, USA, 92-104, 2015. https://doi.org/10.1145/2749469.2750389
[40] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), USENIX Association, Renton, WA, 51-66, 2018.
[41] Brad Fitzpatrick. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[42] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI '07), USENIX Association, Berkeley, CA, USA, 2007.
[43] Yu Gan and Christina Delimitrou. The Architectural Implications of Cloud Microservices. In Computer Architecture Letters (CAL), vol. 17, iss. 2, 2018.
[44] Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. Seer: Leveraging Big Data to Navigate the Complexity of Cloud Debugging. In Proceedings of the Tenth USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2018.
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[46] Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization. In Proc. of NSDI, 2018.
[47] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. Workload Analysis and Demand Prediction of Enterprise Data Center Applications. In Proceedings of IISWC, Boston, MA, 2007. https://doi.org/10.1109/IISWC.2007.4362193
[48] Zhenhuan Gong, Xiaohui Gu, and John Wilkes. PRESS: PRedictive Elastic ReSource Scaling for cloud systems. In Proceedings of CNSM, Niagara Falls, ON, 2010.
[49] Donald Gross, John F. Shortle, James M. Thompson, and Carl M. Harris. Fundamentals of Queueing Theory. In Wiley Series in Probability and Statistics, Book 627, 2011.
[50] Sanchika Gupta and Padam Kumar. VM Profile Based Optimized Network Attack Pattern Detection Scheme for DDOS Attacks in Cloud. In Proc. of SSCC, Mysore, India, 2013.
[51] Ben Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of NSDI, Boston, MA, 2011.
[52] Jingwei Huang, David M. Nicol, and Roy H. Campbell. Denial-of-Service Threat to Hadoop/YARN Clusters with Multi-Tenancy. In Proc. of the IEEE International Congress on Big Data, Washington, DC, 2014.
[53] Alexandru Iosup, Nezih Yigitbasi, and Dick Epema. On the Performance Variability of Production Cloud Services. In Proceedings of CCGRID, Newport Beach, CA, 2011.
[54] Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications. In Proceedings of WWW, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 469-478, 2017.
[55] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17), ACM, New York, NY, USA, 1-12, 2017.
[56] Harshad Kasture and Daniel Sanchez. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX), 2014.
[57] Harshad Kasture and Daniel Sanchez. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC, 2016.
[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. Exploring the performance fluctuations of HPC workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[59] Krzysztof C. Kiwiel. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A), 90(1), pp. 1-25, Springer, Berlin/Heidelberg, 2001.
[60] Jacob Leverich and Christos Kozyrakis. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys, 2014.
[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), Minneapolis, MN, 2014.
[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.
[63] Dave Mangot. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed
[64] Jason Mars and Lingjia Tang. Whare-map: heterogeneity in homogeneous warehouse-scale computers. In Proceedings of ISCA, Tel-Aviv, Israel, 2013.
[65] Jason Mars, Lingjia Tang, and Robert Hundt. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011).
[66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of MICRO, Porto Alegre, Brazil, 2011.
[67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. Power management of online data-intensive services. In Proceedings of the 38th Annual International Symposium on Computer Architecture, 319-330, 2011.
[68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC, Jacksonville, FL, 2007.
[69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys, Paris, France, 2010.
[70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service. In Proceedings of ICAC, San Jose, CA, 2013.
[71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, and Ricardo Bianchini. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. In Proceedings of ATC, San Jose, CA, 2013.
[72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing. In Lecture Notes on Cloud Computing, Volume 34, pp. 115-131, 2010.
[73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP, Farmington, PA, 2013.
[74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016.
[75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture, 2014.
[76] Suhail Rehman and Majd Sakr. Initial Findings for Provisioning Variation in Cloud Computing. In Proceedings of CloudCom, Indianapolis, IN, 2010.
[77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of SOCC, 2012.
[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65-79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68
[79] Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale
[80] S. Sarwar, A. Ankit, and K. Roy. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. arXiv preprint arXiv:1712.02719.
[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 460-471.
[82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys, Prague, 2013.
[83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of SOCC, Cascais, Portugal, 2011.
[84] David Shue, Michael J. Freedman, and Anees Shaikh. Performance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. of OSDI, Hollywood, CA, 2012.
[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report, Google Inc., 2010. https://research.google.com/archive/papers/dapper-2010-1.pdf
[86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. Distributed Resource Management Across Process Boundaries. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Santa Clara, CA, 2017.
[87] Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque
[88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. Scheduler-based Defenses against Cross-VM Side-channels. In Proc. of the 23rd USENIX Security Symposium, San Diego, CA, 2014.
[89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. A Placement Vulnerability Study in Multi-Tenant Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX Security), Washington, DC, 2015.
[90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Ricardo Bianchini. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.
[91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. Root cause analysis of anomalies of multitier services in public clouds. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.
[92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition.
[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16), USENIX Association, Berkeley, CA, USA, 619-634, 2016.
[94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. An Exploration of L2 Cache Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW), Chicago, IL, 2011.
[95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA, 2013.
[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.
[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS, ACM, New York, NY, USA, 489-502, 2016.
[98] Yinqian Zhang and Michael K. Reiter. Düppel: retrofitting commodity operating systems to mitigate cache side channels in the cloud. In Proc. of CCS, Berlin, Germany, 2013.
[99] Mark Zhao and G. Edward Suh. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.

  • Abstract
  • 1 Introduction
  • 2 Related Work
  • 3 End-to-End Applications with Microservices
    • 31 Social Network
    • 32 Media Service
    • 33 E-Commerce Service
    • 34 Banking System
    • 35 Hotel Reservation Site
      • 4 Seer Design
        • 41 Overview
        • 42 Distributed Tracing
        • 43 Deep Learning in Performance Debugging
        • 44 Hardware Monitoring
        • 45 System Insights from Seer
        • 46 Implementation
          • 5 Seer Analysis and Validation
            • 51 Methodology
            • 52 Evaluation
              • 6 Large-Scale Cloud Study
                • 61 Seer Scalability
                • 62 Source of QoS Violations
                • 63 Seers Long-Term Impact on Application Design
                  • 7 Conclusions
                  • Acknowledgements
                  • References
Page 13: Seer: Leveraging Big Data to Navigate the Complexity of ... · data to extract the causal relationships between requests. Even though such tracing systems help cloud providers detect

References[1] [n d] Apache Thrift httpsthriftapacheorg[2] [n d] Decomposing Twitter Adventures in Service-

Oriented Architecture httpswwwslidesharenetInfoQdecomposing-twitter-adventures-in-serviceoriented-architecture

[3] [n d] Golang Microservices Example httpsgithubcomharlowgo-micro-services

[4] [n d] Messaging that just works httpswwwrabbitmqcom[5] [n d] MongoDB httpswwwmongodbcom[6] [n d] NGINX httpsnginxorgen[7] [n d] SockShop A Microservices Demo Application httpswww

weaveworksblogsock-shop-microservices-demo-application[8] [n d] Zipkin httpzipkinio[9] 2016 The Evolution of Microservices httpswwwslidesharenet

adriancockcroftevolution-of-microservices-craft-conference[10] Martiacuten Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng

Chen Craig Citro Greg Corrado Andy Davis Jeffrey Dean MatthieuDevin Sanjay Ghemawat Ian Goodfellow Andrew Harp GeoffreyIrving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz KaiserManjunath Kudlur Josh Levenberg Dan Maneacute Rajat Monga SherryMoore Derek Murray Chris Olah Mike Schuster Jonathon ShlensBenoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker VincentVanhoucke Vijay Vasudevan Fernanda Vieacutegas Oriol Vinyals PeteWarden Martin Wattenberg Martin Wicke Yuan Yu and XiaoqiangZheng [n d] TensorFlow Large-Scale Machine Learning on Hetero-geneous Distributed Systems In Proceedings of OSDI 2016

[11] Adrian Cockroft [n d] Microservices Workshop Why whatand how to get there httpwwwslidesharenetadriancockcroftmicroservices-workshop-craft-conference

[12] Hyunwook Baek Abhinav Srivastava and Jacobus Van der Merwe[n d] CloudSight A tenant-oriented transparency framework forcross-layer cloud troubleshooting In Proceedings of the 17th IEEEACMInternational Symposium on Cluster Cloud and Grid Computing 2017

[13] Luiz Barroso [n d] Warehouse-Scale Computing Entering theTeenage Decade ISCA Keynote SJ June 2011 ([n d])

[14] Luiz Barroso and Urs Hoelzle 2009 The Datacenter as a Computer AnIntroduction to the Design ofWarehouse-Scale Machines MC Publishers

[15] Robert Bell Yehuda Koren and Chris Volinsky 2007 The BellKor 2008Solution to the Netflix Prize Technical Report

[16] Leon Bottou [n d] Large-Scale Machine Learning with StochasticGradient Descent In Proceedings of the International Conference onComputational Statistics (COMPSTAT) Paris France 2010

[17] Martin A Brown [n d] Traffic Control HOWTO httplinux-ipnetarticlesTraffic-Control-HOWTO

[18] Cassandra [n d] Apache Cassandra httpcassandraapacheorg[19] Adrian M Caulfield Eric S Chung Andrew Putnam Hari Angepat

Jeremy Fowers Michael Haselman Stephen Heil Matt HumphreyPuneet Kaur Joo-Young Kim Daniel Lo Todd Massengill KalinOvtcharov Michael Papamichael Lisa Woods Sitaram Lanka DerekChiou and Doug Burger 2016 A Cloud-scale Acceleration Architec-ture In MICRO IEEE Press Piscataway NJ USA Article 7 13 pages

[20] Tianshi Chen Zidong Du Ninghui Sun Jia Wang Chengyong WuYunji Chen and Olivier Temam 2014 DianNao a small-footprint high-throughput accelerator for ubiquitous machine-learning In Proc ofthe 19th intl conf on Architectural Support for Programming Languagesand Operating Systems

[21] Yunji Chen Tianshi Chen Zhiwei Xu Ninghui Sun andOlivier Temam2016 DianNao Family Energy-efficient Hardware Accelerators forMachine Learning Commun ACM 59 11 (Oct 2016) 105ndash112 httpsdoiorg1011452996864

[22] Yunji Chen Tao Luo Shaoli Liu Shijin Zhang Liqiang He Jia WangLing Li Tianshi Chen Zhiwei Xu Ninghui Sun and Olivier Temam2014 DaDianNao A Machine-Learning Supercomputer In Proceedingsof the 47th Annual IEEEACM International Symposium on Microarchi-tecture (MICRO-47) IEEE Computer Society Washington DC USA

609ndash622[23] Ludmila Cherkasova Diwaker Gupta and Amin Vahdat 2007 Com-

parison of the Three CPU Schedulers in Xen SIGMETRICS PerformEval Rev 35 2 (Sept 2007) 42ndash51

[24] Michael Chow David Meisner Jason Flinn Daniel Peek and Thomas FWenisch 2014 The Mystery Machine End-to-end Performance Anal-ysis of Large-scale Internet Services In Proceedings of the 11th USENIXConference on Operating Systems Design and Implementation (OSDIrsquo14)USENIX Association Berkeley CA USA 217ndash231

[25] Eric S Chung Jeremy Fowers Kalin Ovtcharov Michael PapamichaelAdrian M Caulfield Todd Massengill Ming Liu Daniel Lo ShlomiAlkalay Michael Haselman Maleen Abeydeera Logan Adams HariAngepat Christian Boehn Derek Chiou Oren Firestein AlessandroForin Kang Su Gatlin Mahdi Ghandi Stephen Heil Kyle HolohanAhmad El Husseini Tamaacutes Juhaacutesz Kara Kagi Ratna Kovvuri SitaramLanka Friedel van Megen Dima Mukhortov Prerak Patel BrandonPerez Amanda Rapsang Steven K Reinhardt Bita Rouhani AdamSapek Raja Seera Sangeetha Shekar Balaji Sridharan Gabriel WeiszLisaWoods Phillip Yi Xiao Dan Zhang Ritchie Zhao and Doug Burger2018 Serving DNNs in Real Time at Datacenter Scale with ProjectBrainwave IEEE Micro 38 2 (2018) 8ndash20

[26] Guilherme Da Cunha Rodrigues Rodrigo N Calheiros Vini-cius Tavares Guimaraes Glederson Lessa dos Santos Maacutercio Barbosade Carvalho Lisandro Zambenedetti Granville Liane Margarida Rock-enbach Tarouco and Rajkumar Buyya 2016 Monitoring of CloudComputing Environments Concepts Solutions Trends and Future Di-rections In Proceedings of the 31st Annual ACM Symposium on AppliedComputing (SAC rsquo16) ACM New York NY USA 378ndash383

[27] Marwan Darwish Abdelkader Ouda and Luiz Fernando Capretz [nd] Cloud-based DDoS attacks and defenses In Proc of i-SocietyToronto ON 2013

[28] Jeffrey Dean and Luiz Andre Barroso [n d] The Tail at Scale InCACM Vol 56 No 2

[29] Christina Delimitrou Nick Bambos and Christos Kozyrakis [n d]QoS-Aware Admission Control in Heterogeneous Datacenters In Pro-ceedings of the International Conference of Autonomic Computing (ICAC)San Jose CA USA 2013

[30] Christina Delimitrou and Christos Kozyrakis [n d] iBench Quanti-fying Interference for Datacenter Workloads In Proceedings of the 2013IEEE International Symposium on Workload Characterization (IISWC)Portland OR September 2013

[31] Christina Delimitrou and Christos Kozyrakis [n d] Paragon QoS-Aware Scheduling for Heterogeneous Datacenters In Proceedings ofthe Eighteenth International Conference on Architectural Support forProgramming Languages and Operating Systems (ASPLOS) HoustonTX USA 2013

[32] Christina Delimitrou and Christos Kozyrakis [n d] QoS-Aware Sched-uling in Heterogeneous Datacenters with Paragon In ACM Transac-tions on Computer Systems (TOCS) Vol 31 Issue 4 December 2013

[33] Christina Delimitrou and Christos Kozyrakis [n d] Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters withParagon In IEEE Micro Special Issue on Top Picks from the ComputerArchitecture Conferences MayJune 2014

[34] Christina Delimitrou and Christos Kozyrakis [n d] Quasar Resource-Efficient and QoS-Aware Cluster Management In Proc of ASPLOS SaltLake City 2014

[35] Christina Delimitrou and Christos Kozyrakis 2016 HCloud Resource-Efficient Provisioning in Shared Cloud Systems In Proceedings of theTwenty First International Conference on Architectural Support for Pro-gramming Languages and Operating Systems (ASPLOS)

[36] Christina Delimitrou and Christos Kozyrakis 2017 Bolt I KnowWhatYou Did Last Summer In The Cloud In Proc of the Twenty SecondInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS)

[37] Christina Delimitrou and Christos Kozyrakis 2018 Amdahlrsquos Law forTail Latency In Communications of the ACM (CACM)

[38] Christina Delimitrou Daniel Sanchez and Christos Kozyrakis 2015Tarcil Reconciling Scheduling Speed and Quality in Large Shared Clus-ters In Proceedings of the Sixth ACM Symposium on Cloud Computing(SOCC)

[39] Zidong Du Robert Fasthuber Tianshi Chen Paolo Ienne Ling Li TaoLuo Xiaobing Feng Yunji Chen and Olivier Temam 2015 ShiDi-anNao Shifting Vision Processing Closer to the Sensor In Proceed-ings of the 42Nd Annual International Symposium on Computer Ar-chitecture (ISCA rsquo15) ACM New York NY USA 92ndash104 httpsdoiorg10114527494692750389

[40] Daniel Firestone Andrew Putnam SambhramaMundkur Derek ChiouAlireza Dabagh Mike Andrewartha Hari Angepat Vivek BhanuAdrian Caulfield Eric Chung Harish Kumar Chandrappa SomeshChaturmohta Matt Humphrey Jack Lavier Norman Lam FengfenLiu Kalin Ovtcharov Jitu Padhye Gautham Popuri Shachar RaindelTejas Sapre Mark Shaw Gabriel Silva Madhan Sivakumar NisheethSrivastava Anshuman Verma Qasim Zuhair Deepak Bansal DougBurger Kushagra Vaid David A Maltz and Albert Greenberg 2018Azure Accelerated Networking SmartNICs in the Public Cloud In 15thUSENIX Symposium on Networked Systems Design and Implementation(NSDI 18) USENIX Association Renton WA 51ndash66

[41] Brad Fitzpatrick [n d] Distributed caching with memcached InLinux Journal Volume 2004 Issue 124 2004

[42] Rodrigo Fonseca George Porter Randy H Katz Scott Shenker andIon Stoica 2007 X-trace A Pervasive Network Tracing Framework InProceedings of the 4th USENIX Conference on Networked Systems Designamp Implementation (NSDIrsquo07) USENIX Association Berkeley CA USA20ndash20

[43] Yu Gan and Christina Delimitrou 2018 The Architectural Implicationsof Cloud Microservices In Computer Architecture Letters (CAL) vol17iss 2

[44] Yu Gan Meghna Pancholi Dailun Cheng Siyuan Hu Yuan He andChristina Delimitrou 2018 Seer Leveraging Big Data to Navigate theComplexity of Cloud Debugging In Proceedings of the Tenth USENIXWorkshop on Hot Topics in Cloud Computing (HotCloud)

[45] Yu Gan Yanqi Zhang Dailun Cheng Ankitha Shetty Priyal RathiNayan Katarki Ariana Bruno Justin Hu Brian Ritchken BrendonJackson Kelvin Hu Meghna Pancholi Yuan He Brett Clancy ChrisColen Fukang Wen Catherine Leung Siyuan Wang Leon ZaruvinskyMateo Espinosa Rick Lin Zhongling Liu Jake Padilla and ChristinaDelimitrou 2019 An Open-Source Benchmark Suite for Microser-vices and Their Hardware-Software Implications for Cloud and EdgeSystems In Proceedings of the Twenty Fourth International Conferenceon Architectural Support for Programming Languages and OperatingSystems (ASPLOS)

[46] Yilong Geng Shiyu Liu Zi Yin Ashish Naik Balaji Prabhakar MendelRosenblum and Amin Vahdat [n d] Exploiting a Natural NetworkEffect for Scalable Fine-grained Clock Synchronization In Proc ofNSDI 2018

[47] Daniel Gmach Jerry Rolia Ludmila Cherkasova and Alfons Kemper[n d] Workload Analysis and Demand Prediction of Enterprise DataCenter Applications In Proceedings of IISWC Boston MA 2007 10httpsdoiorg101109IISWC20074362193

[48] Zhenhuan Gong Xiaohui Gu and John Wilkes [n d] PRESS PRe-dictive Elastic ReSource Scaling for cloud systems In Proceedings ofCNSM Niagara Falls ON 2010

[49] Donald Gross John F Shortle James M Thompson and Carl M Harris[n d] Fundamentals of Queueing Theory InWiley Series in Probabilityand Statistics Book 627 2011

[50] Sanchika Gupta and PadamKumar [n d] VM Profile Based OptimizedNetwork Attack Pattern Detection Scheme for DDOS Attacks in CloudIn Proc of SSCC Mysore India 2013

[51] BenHindman Andy Konwinski Matei Zaharia Ali Ghodsi AnthonyDJoseph Randy Katz Scott Shenker and Ion Stoica [n d] Mesos APlatform for Fine-Grained Resource Sharing in the Data Center InProceedings of NSDI Boston MA 2011

[52] Jingwei Huang David M Nicol and Roy H Campbell [n d] Denial-of-Service Threat to HadoopYARN Clusters with Multi-Tenancy InProc of the IEEE International Congress on Big Data Washington DC2014

[53] Alexandru Iosup Nezih Yigitbasi and Dick Epema [n d] On thePerformance Variability of Production Cloud Services In Proceedingsof CCGRID Newport Beach CA 2011

[54] Hiranya Jayathilaka Chandra Krintz and Rich Wolski 2017 Perfor-mance Monitoring and Root Cause Analysis for Cloud-hosted WebApplications In Proceedings of WWW International World Wide WebConferences Steering Committee Republic and Canton of GenevaSwitzerland 469ndash478

[55] Norman P Jouppi Cliff Young Nishant Patil David Patterson Gau-rav Agrawal Raminder Bajwa Sarah Bates Suresh Bhatia Nan Bo-den Al Borchers Rick Boyle Pierre-luc Cantin Clifford Chao ChrisClark Jeremy Coriell Mike Daley Matt Dau Jeffrey Dean Ben GelbTara Vazir Ghaemmaghami Rajendra Gottipati William GullandRobert Hagmann C Richard Ho Doug Hogberg John Hu RobertHundt Dan Hurt Julian Ibarz Aaron Jaffey Alek Jaworski Alexan-der Kaplan Harshit Khaitan Daniel Killebrew Andy Koch NaveenKumar Steve Lacy James Laudon James Law Diemthu Le ChrisLeary Zhuyuan Liu Kyle Lucke Alan Lundin Gordon MacKean Adri-ana Maggiore Maire Mahony Kieran Miller Rahul Nagarajan RaviNarayanaswami Ray Ni Kathy Nix Thomas Norrie Mark Omer-nick Narayana Penukonda Andy Phelps Jonathan Ross Matt RossAmir Salek Emad Samadiani Chris Severn Gregory Sizikov MatthewSnelham Jed Souter Dan Steinberg Andy Swing Mercedes Tan Gre-gory Thorson Bo Tian Horia Toma Erick Tuttle Vijay VasudevanRichard Walter Walter Wang Eric Wilcox and Doe Hyun Yoon 2017In-Datacenter Performance Analysis of a Tensor Processing Unit InProceedings of the 44th Annual International Symposium on ComputerArchitecture (ISCA rsquo17) ACM New York NY USA 1ndash12

[56] Harshad Kasture and Daniel Sanchez. 2014. Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XIX).

[57] Harshad Kasture and Daniel Sanchez. 2016. TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications. In Proc. of IISWC.

[58] Yaakoub El Khamra, Hyunjoo Kim, Shantenu Jha, and Manish Parashar. [n. d.]. Exploring the performance fluctuations of HPC workloads on clouds. In Proceedings of CloudCom, Indianapolis, IN, 2010.

[59] Krzysztof C. Kiwiel. [n. d.]. Convergence and efficiency of subgradient methods for quasiconvex minimization. In Mathematical Programming (Series A) (Berlin, Heidelberg: Springer) 90(1), pp. 1–25, 2001.

[60] Jacob Leverich and Christos Kozyrakis. [n. d.]. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. In Proc. of EuroSys, 2014.

[61] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. [n. d.]. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), Minneapolis, MN, 2014.

[62] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. [n. d.]. Heracles: Improving Resource Efficiency at Scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015.

[63] Dave Mangot. [n. d.]. EC2 variability: The numbers revealed. http://tech.mangot.com/roller/dave/entry/ec2_variability_the_numbers_revealed

[64] Jason Mars and Lingjia Tang. [n. d.]. Whare-map: heterogeneity in "homogeneous" warehouse-scale computers. In Proceedings of ISCA, Tel-Aviv, Israel, 2013.

[65] Jason Mars, Lingjia Tang, and Robert Hundt. 2011. Heterogeneity in "Homogeneous" Warehouse-Scale Computers: A Performance Opportunity. IEEE Comput. Archit. Lett. 10, 2 (July 2011), 4.

[66] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. [n. d.]. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of MICRO, Porto Alegre, Brazil, 2011.

[67] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. 2011. Power management of online data-intensive services. In Proceedings of the 38th Annual International Symposium on Computer Architecture, 319–330.

[68] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. [n. d.]. Exploiting platform heterogeneity for power efficient data centers. In Proceedings of ICAC, Jacksonville, FL, 2007.

[69] Ripal Nathuji, Aman Kansal, and Alireza Ghaffarkhah. [n. d.]. Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds. In Proceedings of EuroSys, Paris, France, 2010.

[70] Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. [n. d.]. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service. In Proceedings of ICAC, San Jose, CA, 2013.

[71] Dejan Novakovic, Nedeljko Vasic, Stanko Novakovic, Dejan Kostic, and Ricardo Bianchini. [n. d.]. DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments. In Proceedings of ATC, San Jose, CA, 2013.

[72] Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer, and Dick Epema. [n. d.]. A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing. In Lecture Notes on Cloud Computing, Volume 34, pp. 115–131, 2010.

[73] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. [n. d.]. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of SOSP, Farmington, PA, 2013.

[74] Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, and Jie Xu. [n. d.]. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016.

[75] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In Proc. of the 41st Intl. Symp. on Computer Architecture.

[76] Suhail Rehman and Majd Sakr. [n. d.]. Initial Findings for Provisioning Variation in Cloud Computing. In Proceedings of CloudCom, Indianapolis, IN, 2010.

[77] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. [n. d.]. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of SOCC, 2012.

[78] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. 2010. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro (2010), 65–79. http://www.computer.org/portal/web/csdl/doi/10.1109/MM.2010.68

[79] rightscale. [n. d.]. Rightscale. https://aws.amazon.com/solution-providers/isv/rightscale

[80] S. Sarwar, A. Ankit, and K. Roy. [n. d.]. Incremental Learning in Deep Convolutional Neural Networks Using Partial Network Sharing. In arXiv preprint arXiv:1712.02719.

[81] Jörg Schad, Jens Dittrich, and Jorge-Arnulfo Quiané-Ruiz. 2010. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. Proc. VLDB Endow. 3, 1–2 (Sept. 2010), 460–471.

[82] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. [n. d.]. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of EuroSys, Prague, 2013.

[83] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. [n. d.]. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of SOCC, Cascais, Portugal, 2011.

[84] David Shue, Michael J. Freedman, and Anees Shaikh. [n. d.]. Performance Isolation and Fairness for Multi-tenant Cloud Storage. In Proc. of OSDI, Hollywood, CA, 2012.

[85] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. https://research.google.com/archive/papers/dapper-2010-1.pdf

[86] Lalith Suresh, Peter Bodik, Ishai Menache, Marco Canini, and Florin Ciucu. [n. d.]. Distributed Resource Management Across Process Boundaries. In Proceedings of the ACM Symposium on Cloud Computing (SOCC), Santa Clara, CA, 2017.

[87] torque. [n. d.]. Torque Resource Manager. http://www.adaptivecomputing.com/products/open-source/torque

[88] Venkatanathan Varadarajan, Thomas Ristenpart, and Michael Swift. [n. d.]. Scheduler-based Defenses against Cross-VM Side-channels. In Proc. of the 23rd USENIX Security Symposium, San Diego, CA, 2014.

[89] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. [n. d.]. A Placement Vulnerability Study in Multi-Tenant Public Clouds. In Proc. of the 24th USENIX Security Symposium (USENIX Security), Washington, DC, 2015.

[90] Nedeljko Vasić, Dejan Novaković, Svetozar Miučin, Dejan Kostić, and Ricardo Bianchini. [n. d.]. DejaVu: accelerating resource allocation in virtualized environments. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), London, UK, 2012.

[91] Jianping Weng, Jessie Hui Wang, Jiahai Yang, and Yang Yang. [n. d.]. Root cause analysis of anomalies of multitier services in public clouds. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), 2017.

[92] Ian H. Witten, Eibe Frank, and Geoffrey Holmes. [n. d.]. Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition).

[93] Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. 2016. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 619–634.

[94] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. [n. d.]. An Exploration of L2 Cache Covert Channels in Virtualized Environments. In Proc. of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW), Chicago, IL, 2011.

[95] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. [n. d.]. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of ISCA, 2013.

[96] Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. [n. d.]. Profiling Network Performance for Multi-tier Data Center Applications. In Proceedings of NSDI, Boston, MA, 2011.

[97] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang. 2016. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. In Proceedings of ASPLOS. ACM, New York, NY, USA, 489–502.

[98] Yinqian Zhang and Michael K. Reiter. [n. d.]. Düppel: retrofitting commodity operating systems to mitigate cache side channels in the cloud. In Proc. of CCS, Berlin, Germany, 2013.

[99] Mark Zhao and G. Edward Suh. [n. d.]. FPGA-Based Remote Power Side-Channel Attacks. In Proceedings of the IEEE Symposium on Security and Privacy, May 2018.
