
www.epsagon.com info@epsagon.com

Five Approaches to Serverless Application Observability

Table of Contents

Introduction

About Serverless Applications

Simple Serverless App Example

Key Characteristics of Serverless Apps

More on Serverless Apps

About Observability

Serverless vs. Traditional Observability

Five Approaches to Serverless Observability

1. Cloud Provider Monitoring Tools

2. Log Streaming to an External Service

3. Function-Level Monitoring

4. Self-Built Observability Solutions

5. Automated Distributed Tracing Solutions

How Epsagon Can Help

Introduction

The serverless software development approach went mainstream when

AWS introduced Lambda, its Function-as-a-Service (FaaS) offering, in November 2014.

Although FaaS is only a subset of the serverless ecosystem, the two terms

are often used interchangeably.

Serverless/FaaS is yet another quantum leap in the evolution of computing

abstraction that started in the 1960s when programming languages buffered

developers from machine language. More recently, we have seen the

emergence of virtual machines, which abstract and standardize infrastructure

deployments on commodity servers, as well as containers that abstract apps

from both the operating system and infrastructure layers.

The benefits and challenges of serverless

Event-driven, pay-per-execution serverless applications do, of course, use

both physical and virtual servers. However, developers and the operations

team do not interact with the infrastructure. They rely solely on managed

compute services to execute the code and scale as necessary, leveraging

APIs to consume third-party services. Because pricing is based on the

number of executions versus pre-purchased compute capacity, serverless

applications have lower runtime costs. Serverless also promotes faster

time-to-market and enhances process agility. It achieves this by allowing

developers to focus on the application rather than the infrastructure and by

eliminating the need to set up separate dev/test, staging, and production

environments.

Still, with all the benefits that are driving the growth of serverless computing, it

is not without its challenges. Serverless applications are complex, distributed

architectures of functions and API calls, encapsulated within ephemeral

containers. Without careful development and extensive testing, performance

issues can degrade the user experience and rack up costs. End-to-end

real-time observability is key to monitoring and troubleshooting

application performance, yet it is not easy to achieve in a serverless

environment.

This white paper explores why it is difficult to achieve observability in

serverless systems and how next-generation platforms offer the required

paradigm shift that closes the serverless-observability gap.

About Serverless Applications

Today the uptake of serverless computing is comparable to what we were

seeing for containers back in 2016. According to a recent Cloud Foundry

survey, 46% of the respondents stated that their companies are both using

and evaluating serverless, with another 35% solely in the evaluation stage.

F5’s 2019 State of Application Services report found that 29% of cloud

architects, 24% of SRE/DevOps, and 24% of executives view serverless as

a strategic trend. Gartner also puts serverless computing at the top of their

list of ten trends that will impact infrastructure and operations in 2019. And

market analysts are predicting dramatic growth rates for the global FaaS

market, which is expected to reach a value of ~$12 billion by 2022, at a ~34%

compound annual growth rate as of 2017.

Simple Serverless Application Example

The following graphic is a high-level description of a typical serverless web

application on AWS:

1. The static website UI is hosted in and served through an Amazon S3 bucket.

2. The app’s login and data access services are built as Lambda functions, with the developer specifying the amount of memory required and the maximum execution time. When invoked by calls from the client app, Lambda runs the functions in stateless compute containers that are ephemeral, sometimes lasting for only one invocation.

3. The functions read from and write to a fully managed NoSQL DynamoDB database (not essential to a serverless app and used here only as an example). They also handle the responses back to the client app.

4. Other external services used by the app are Cognito for authenticating users and STS for generating temporary AWS credentials for users to invoke Lambda.

Key Characteristics of Serverless Applications

Serverless apps are usually highly distributed and make very heavy use

of third-party APIs. Although not illustrated in the simple example shown

above, serverless systems will often use an API Gateway to map the Lambda

functions to the API endpoints via RESTful HTTP requests.

Pay-Per-Use

Note that in our example there is no programmed access to a server. One of the

key principles of serverless apps is that, upon invocation, the provider of the

managed function service takes care of all infrastructure requirements. This

ranges from launching and loading the container to automatically provisioning

the compute/storage resources and services required by the executed code.

The FaaS vendor also handles concurrency and ensures virtually unlimited

scalability, with the customer paying only for the time that the code is actually

running. The cost differs depending on the amount of memory

provisioned. By way of example, on AWS a Lambda function with 512 MB

of memory costs $0.000000834 per 100 ms.
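The pricing arithmetic can be sketched in a few lines. The following is a minimal estimate, assuming the 512 MB rate quoted above and assuming that each invocation is billed in whole 100 ms units (rounded up). It ignores per-request charges and any free tier, and the function name and workload figures are hypothetical.

```python
import math

# Rate quoted in the text for a 512 MB function, per 100 ms of execution.
PRICE_PER_100MS_512MB = 0.000000834


def estimate_compute_cost(invocations: int, avg_duration_ms: float,
                          price_per_100ms: float = PRICE_PER_100MS_512MB) -> float:
    """Estimate execution cost, assuming each run is billed in whole 100 ms units."""
    billed_units = math.ceil(avg_duration_ms / 100)  # assumption: round up
    return invocations * billed_units * price_per_100ms


if __name__ == "__main__":
    # Hypothetical workload: 1,000,000 invocations averaging 250 ms each.
    cost = estimate_compute_cost(1_000_000, 250)
    print(f"Estimated compute cost: ${cost:.2f}")
```

Under these assumptions, a 250 ms run is billed as three 100 ms units, so shaving it to 200 ms or less would cut the compute cost by a third.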

Limited Resources

Another key characteristic of serverless apps is the ephemeral lifespan of

the function container (sometimes lasting for only one invocation) as well as

strict timeout limitations for the functions themselves. In AWS, for example,

the maximum execution time for a Lambda function is fifteen minutes, and

the Amazon API Gateway has a 29-second integration timeout. In order to

avoid timed-out function calls or too-frequent container cold starts, both

of which can degrade app performance, it is important to choose function

timeout values carefully. It is also important to monitor and make every effort

to reduce the time that a function waits for a response to an API call.
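One way to keep a slow API call from consuming the whole function timeout is to derive the call's timeout from the time the function has left. The sketch below assumes a Python Lambda function, whose context object exposes `get_remaining_time_in_millis()`. The helper name, the one-second safety margin, and the `FakeContext` stand-in are hypothetical, and the 29-second cap only makes sense when the function sits behind Amazon API Gateway.

```python
API_GATEWAY_LIMIT_S = 29.0  # integration timeout cited in the text


def downstream_timeout_s(context, safety_margin_s: float = 1.0,
                         cap_s: float = API_GATEWAY_LIMIT_S) -> float:
    """Return how long a downstream API call may wait before the function gives up."""
    remaining_s = context.get_remaining_time_in_millis() / 1000.0
    budget = min(remaining_s, cap_s) - safety_margin_s
    return max(budget, 0.0)  # never return a negative timeout


class FakeContext:
    """Stand-in for the Lambda context object, for local experimentation only."""

    def __init__(self, remaining_ms: int):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining_ms


if __name__ == "__main__":
    print(downstream_timeout_s(FakeContext(6_000)))    # 6 s left in the function
    print(downstream_timeout_s(FakeContext(900_000)))  # 15 min left, capped at 29 s
```

The returned value would then be passed as the timeout of whatever HTTP or SDK client makes the call, leaving the safety margin for the function to log the failure and respond.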

Last but not least, the more memory allocated to a function in serverless, the

more CPU resources are available to execute its code. This in turn shortens

its runtime and reduces the frequency of timeouts. However, as noted above,

the amount of memory provisioned will affect the run cost.
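The memory trade-off can be explored with a small comparison. This sketch assumes that the price per 100 ms scales linearly with memory from the 512 MB rate quoted earlier and that billing rounds up to the next 100 ms. The durations are hypothetical measurements, not benchmarks.

```python
import math
from typing import Dict

PRICE_PER_100MS_AT_512MB = 0.000000834  # rate quoted in the text


def cost_per_invocation(memory_mb: int, duration_ms: float) -> float:
    """Cost of one run, assuming price scales linearly with memory
    and billing rounds up to the next 100 ms."""
    price_per_100ms = PRICE_PER_100MS_AT_512MB * (memory_mb / 512)
    return math.ceil(duration_ms / 100) * price_per_100ms


def cheapest_setting(measured: Dict[int, float], timeout_ms: float) -> int:
    """Pick the cheapest memory size whose measured duration fits the timeout."""
    candidates = {mb: cost_per_invocation(mb, ms)
                  for mb, ms in measured.items() if ms < timeout_ms}
    if not candidates:
        raise ValueError("no memory setting finishes within the timeout")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical measurements: memory (MB) -> average duration (ms).
    measured = {256: 3400.0, 512: 1500.0, 1024: 700.0}
    print(cheapest_setting(measured, timeout_ms=3000))
```

With these made-up numbers, the largest setting is also the cheapest, because the run time falls faster than the price rises. Real functions need to be measured, since I/O-bound code may not speed up with more memory.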

More on Serverless Applications

In short, serverless developers need to thoroughly understand and carefully

balance timeout, API, and memory allocation considerations in order to

achieve optimal but cost-effective system performance. The following articles

provide more detailed information on these technical challenges as well as

guidelines on how to overcome them:

• Best Practices for AWS Lambda Timeouts

• The Importance and Impact of APIs in Serverless

• Finding Serverless’ Hidden Costs

About Observability

Today’s application architectures—whether serverless or not—are highly

distributed and highly complex. Throughout the application lifecycle, from

development and testing to staging and production, it is critical that all

aspects of the system’s current status be observable. In Cindy Sridharan’s

book “Distributed Systems Observability,” she notes that observability is

an inherent property of a system that has been “designed, built, tested,

deployed, operated, monitored, maintained, and evolved.” This is with the

understanding that, among other things, “no complex system is ever fully

healthy” and “distributed systems are pathologically unpredictable.”

An observable system, therefore, is one that makes the internal state of all of

its components externally observable, usually through instrumentation code.

Observability signals include logs, metrics, traces, exception trackers, and

detailed profiles to name but a few. Although observability is often considered

just a more sophisticated form of monitoring, monitoring is really only the

starting point of observability. And effective observability must provide all of

the following capabilities:

MONITORING SYSTEM PERFORMANCE

Verifying at all points in time that the system is working

as planned, customers are getting the right service at the

right SLA, errors are being captured and handled, logic and

business flows are correct, and so on.

TROUBLESHOOTING FAILURES

In the event of a detected anomaly, it is important for both

developer productivity and, in the case of production

systems, customer satisfaction that the cause can be easily

investigated. Ideally, the root cause analysis of an observable

system will be highly automated so that the problem can be

remediated as quickly as possible.

VISUALIZATION

With so many interdependent moving parts, it is difficult to

get a meaningful understanding of a distributed system’s

current state without visualization. A graphic representation

of the system elements, their interdependencies, and their

historical and current metrics is important for many reasons.

Such a representation allows for proper troubleshooting and more effective

day-to-day management of the system. It will also raise

the confidence of development and operations teams that

they have a good grasp of system performance.

Serverless vs. Traditional Observability

Yan Cui, AWS Serverless Hero and frequent guest blogger for Epsagon, has

written the definitive article about the new challenges that serverless poses

for current observability practices. With AWS as his frame of reference, the

key points that he raises are the following:

• Because serverless separates the code from the infrastructure running

it, there is no place to install agents to automatically collect and transmit

system data to an observability system. Often, the only way to collect

this data is through manual instrumentation, which is

both tedious and, like everything manual, error-prone.

• Because everything has to be executed within an invoked function, you

can no longer perform your own background processing. You have to rely

on whatever the platform gives you in terms of logs and tracing data.

• Serverless frees you from the need to manage concurrency. But the flip

side of this is that it will be harder to batch observability data, and there

will be a much higher volume of traffic to the observability system. At scale,

this can have both performance and cost implications.

• Defining bigger batch sizes will not solve the data volume problem, as a lot

of valuable data may be lost due to the short lifespan of Lambda functions

and subsequent frequency of garbage collection.

• Functions are often chained together through asynchronous invocations.

Unfortunately, tracing these invocations through so many different event

sources is difficult.

The bottom line is that diligently instrumenting your serverless functions may

improve visibility into your system’s health. But the sheer volume of data that

then gets sent to the observability system may actually obfuscate debugging

as well as impact client-side latency—which may, in turn, have an impact on

business outcomes.

In addition, traditional observability practices and tools are not equipped

to identify and handle the new technical challenges posed by serverless

architectures, such as timeouts, out-of-memory errors, slow API responses,

and too-frequent container cold starts.

Five Approaches to Serverless Observability

This section describes five approaches that are often used separately or in

combination to address the challenges involved in gaining greater serverless

observability.

Cloud Provider Monitoring Tools

Default cloud vendor consoles, such as AWS CloudWatch, are monitoring

and management services that aggregate metrics and logs across

distributed stacks and services. However, in more complex distributed

systems with multiple functions, queues, triggers, etc., it is difficult to

understand connections across the different log entries, which are refreshed

asynchronously.

Log Streaming to an External Service

Log aggregation platforms such as Splunk or Loggly provide single-pane

views of log metrics across distributed systems. Some of these platforms

analyze the data to create baseline performance profiles so that they can

then detect and alert to suspected anomalies.

Source: AWS: Getting Started with CloudWatch

Source: Kibana queries and filters

However, log aggregation platforms still require the development team to

generate the logs. In addition, they do not make it easier to overcome the

often asynchronous nature of serverless systems. Even in a log aggregation

dashboard, it can still be difficult to quickly understand the relationships

between events and triggers, making troubleshooting cumbersome. Last but

not least, these third-party log aggregation platforms can be expensive.
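Because the development team has to generate the logs itself, it helps to emit them in a form that an aggregation platform can index. The following is a minimal sketch of structured logging from a function, writing one JSON object per line to standard output (which, on AWS Lambda, ends up in CloudWatch Logs). The field names, the `data-access` function name, and the handler body are hypothetical.

```python
import json
import sys
import time


def log_event(message: str, level: str = "INFO", stream=sys.stdout, **fields) -> str:
    """Write one JSON object per line so an aggregator can index each field."""
    record = {"timestamp": time.time(), "level": level, "message": message}
    record.update(fields)  # hypothetical context fields, e.g. function name
    line = json.dumps(record, sort_keys=True)
    print(line, file=stream)
    return line


def handler(event, context):
    """Hypothetical Lambda-style handler that logs the start and end of its work."""
    started = time.perf_counter()
    log_event("request received", function="data-access", request_id=event.get("id"))
    result = {"status": "ok"}  # placeholder for the real business logic
    duration_ms = round((time.perf_counter() - started) * 1000, 3)
    log_event("request completed", function="data-access",
              request_id=event.get("id"), duration_ms=duration_ms)
    return result


if __name__ == "__main__":
    handler({"id": "req-123"}, context=None)
```

Structured fields make individual entries searchable, but they do not by themselves link an entry to the event that triggered it, which is the limitation described above.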

Function-Level Monitoring

More advanced observability layers, such as AWS X-Ray, automatically

integrate and instrument Lambda functions, providing an end-to-end function-

level view of requests as they traverse all components of the application.

AWS X-Ray example for monitoring Lambda showing a synchronous request with one downstream call to Amazon S3. Source: AWS Lambda Developer Guide

AWS X-Ray does make it easier to identify slow spots or slow AWS APIs and

even analyze root causes. But currently you cannot connect asynchronous

events with AWS X-Ray, so it is hard to trace a chain of events such as a

message that is published into SNS which, in turn, triggers another Lambda.

Also, the developer has to insert traces manually, which is not optimal when

monitoring a dynamic environment.

Distributed Tracing

The key drawback, however, of observability layers like AWS X-Ray is that

they essentially only measure the metrics of functions. And they only do so

on an individual basis. They will let you see, for example, that Function X

failed five times in the last hour, that the average duration of Function Y is 10

seconds, or that Function Z has an unusually high number of cold starts that

are affecting its latency. This information can be helpful to detect relatively

simple and straightforward issues related to the system’s health. But it cannot

identify application-level issues, such as a user trying to purchase an item

online and abandoning the cart because it's taking too long.

Thus, in a distributed serverless system with many moving parts, function-

level monitoring cannot provide insight into business flows. If the individual

function metrics look good, you can be lulled into thinking that the application

is performing as planned. To achieve application-level observability, you need

distributed tracing. As shown below, distributed tracing tracks what happens

to a request across all the involved components when the user interacts with

the application. This is visualized in logical chunks, or spans.
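The notion of spans can be made concrete with a small in-memory model. This is an illustrative sketch only and not the API of any particular tracing product: the `Span` and `Tracer` classes and the span names are hypothetical. Each span records which trace it belongs to, which span caused it, and how long it took.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    name: str                 # logical chunk of work, e.g. "lambda:checkout"
    parent_id: Optional[str]  # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = 0.0
    end: float = 0.0


class Tracer:
    """Collects spans in memory; a real system would ship them to a back end."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, name, parent, start=time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)  # spans are stored in completion order


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("api-gateway:POST /checkout"):
        with tracer.span("lambda:checkout"):
            with tracer.span("dynamodb:PutItem"):
                time.sleep(0.01)
    for s in tracer.spans:
        print(s.name, s.parent_id, f"{(s.end - s.start) * 1000:.1f} ms")
```

Because every span carries the same trace ID and a pointer to its parent, a back end can reassemble the request as a tree and show where the time in a slow checkout was actually spent.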

Self-Built Observability Solutions

It might be tempting to make use of existing open source libraries and build

a customized tracing solution for your serverless system. Yan Cui, for example,

has written a detailed blog post on why and how

developers can introduce correlation IDs into their serverless functions to

debug transactions that involve multiple functions and event source types.
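The correlation-ID idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the blog post mentioned above: the `x-correlation-id` attribute name and the function names are hypothetical, and a plain dictionary stands in for the message that a queue or topic would carry between functions.

```python
import json
import uuid

CORRELATION_KEY = "x-correlation-id"  # hypothetical attribute name


def get_or_create_correlation_id(event: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise start a new one."""
    attributes = event.get("attributes", {})
    return attributes.get(CORRELATION_KEY) or uuid.uuid4().hex


def log(correlation_id: str, message: str) -> None:
    """Tag every log line with the ID so related entries can be joined later."""
    print(json.dumps({CORRELATION_KEY: correlation_id, "message": message}))


def first_function(event: dict) -> dict:
    """Entry point: creates the ID and forwards it with the outgoing message."""
    cid = get_or_create_correlation_id(event)
    log(cid, "order received")
    # Stand-in for publishing to a queue or topic; the ID travels as an attribute.
    return {"body": {"order": event.get("order")}, "attributes": {CORRELATION_KEY: cid}}


def second_function(event: dict) -> str:
    """Triggered asynchronously: picks the ID up from the incoming message."""
    cid = get_or_create_correlation_id(event)
    log(cid, "order processed")
    return cid


if __name__ == "__main__":
    message = first_function({"order": 42})
    second_function(message)
```

Searching the logs for a single ID then returns the entries from both functions, even though the second one ran asynchronously.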

However, a self-built distributed tracing system is not just about correlating

events throughout a call chain. To be an enterprise-grade serverless

observability system, it also has to meet numerous operational and functional

requirements, such as security, being able to work across multiple external

APIs, and having a back end to analyze events, issue alerts, and present

insights visually.

In short, unless building a distributed tracing solution is a core business

activity of the organization, implementing it in-house is most probably not a

good use of resources.

Automated Distributed Tracing Solutions

Given the complexities of building and maintaining your own distributed tracing

solution, another approach is to adopt a third-party platform developed by a

vendor whose core competency and business is distributed tracing. There

are a number of distributed tracing platforms out there. So when choosing

the right solution for your organization, it’s important to assess the following:

• How quickly will that solution help you troubleshoot issues in your particular

system? Will it identify issues automatically and alert you to them, or will

you have to search for events inside the platform?

• Will you have to manually instrument your code, or will the

solution automatically discover your system components and their

interdependencies?

• Does it provide insight beyond your code into the whole system—APIs,

managed services, orchestration services such as AWS Step Functions,

and so on?

• Does it have a comprehensive single-pane console, or will you have

to aggregate information from other sources such as cloud provider

dashboards?

How Epsagon Can Help

Epsagon has developed a next-generation serverless monitoring and

troubleshooting platform based on distributed tracing technology. With no

code changes whatsoever and in less than five minutes, Epsagon can fully

and automatically discover all of your serverless system’s components and

the relationships among them. After onboarding Epsagon, you are no longer

monitoring discrete functions. Rather, you are gaining actionable insight into

logic and business flows.

With Epsagon’s powerful visualization features, troubleshooting and root

cause analyses are fast and efficient. Epsagon also applies advanced artificial

intelligence methods to predict issues before they happen, allowing you to

fix problems before they impact system performance and user experience.

Epsagon’s serverless observability platform also provides valuable insights

into ongoing system performance. Thus, for example, you can discover

system behavior that may be unnecessarily increasing your system’s

runtime costs.

See Epsagon at work by signing up for a free trial, or contact us to discuss

how we can help you with your serverless observability challenges today.

The ABCs of Serverless Observability

Embrace the serverless revolution to accelerate your development

cycles, enhance process agility, and optimize runtime costs.

Yes, observability is a challenge in highly distributed serverless apps,

with their complex event-driven function call chains and limited

visibility into underlying infrastructures.

But today’s highly automated serverless observability platforms use

distributed tracing technology, advanced data analytics, and intuitive

visualization to monitor system health end to end and quickly alert

to anomalies.
