if you’re trying to achieve observability starting with€¦ · if you’re trying to achieve...

If you’re trying to achieve observability starting with unstructured logs, you’ve got some work ahead of you. Logs are no longer human scale. What worked when hosts typically numbered in less than the dozens, and when servers were treated more like pets than cattle no longer allows for effective management of services in today’s production environment.

When your only view of the system is through the instrumentation of your code, how it outputs what it tells you is as critical as what it outputs. Unfortunately, many of our logs look something like the following, with all of the vivid details flattened into static text.

6:01:00 accepted connection on port 80 from 10.0.0.3:63349 6:01:03 basic authentication accepted for user foo 6:01:15 processing request for /super/slow/server 6:01:18 request succeeded, sent response code 200 6:01:19 closed connection to 10.0.0.3:63349

That’s easy to get going with in development, but generates huge volumes of noisy data that are slow and clunky to use for troubleshooting in production. Sure, sometimes you can do post-hoc processing with things like regular expressions - but if you have to use regular expressions to understand what is happening in production, you are wasting your analytical skills on what amounts to a data formatting problem instead of actually analyzing the information the system is sending.

Your systems need to tell you enough to answer on-the-fly questions involving math and percentiles and breakdowns and group by’s. Making your logs easy to query, filter, and visualize is the first step along the path to observability.

1

Choose your path—where are you headed next?

First, understand whether what you are trying to do is improve your logging incrementally, or achieve observability more rapidly than that—are you a lone hero in the matrix or are you responding to a directive from management? Do you have support for a major project, or is this a skunkworks affair?

If you’re responding to a directive from management and/or have the resources and support to undertake a larger-scale endeavor, consider taking the larger step directly to beginning to instrument your code with events and traces—the intermediate logistical step of structuring your app’s log output is not required—but a reframing of the goals and benefits of instrumentation is. Strive for maximum impact from the outset—start the process of thinking about your business in terms of events, and leave ‘log lines’ behind altogether.

2

Struggling to make incremental changes? Start small

If you’re getting started and trying to see the forest for the trees—just a single day’s worth of effort on this will improve the lives of whoever has to work with the data. Even if you’re the only one participating in your project you can begin to do analysis and start to categorize the kinds of things that are happening in your environment. So if you need to start small, start small. Pulling a few things out into structured data and using a good structured logging library will start you along a path that is ultimately very fruitful. So before you do anything else, decide what logging library you’re going to use and start using it. For instance, logrus is a popular library for Golang that we’ll use for examples here.

All the structured logging libraries also support unstructured logging, so adding your logging library doesn’t have to change what your colleagues see or remove any of the unstructured instrumentation—you can make these changes in parallel and you can choose to leave in the old instrumentation if there is a lot of pushback. Often it’s as simple as changing something like this:

log.Print("Endpoint " + r.URL.Path + " accessed")

To something like this:

logrus.Info("Endpoint " + r.URL.Path + " accessed")

This will set you up nicely to start sprinkling some structure on top as a next step.

There are a lot of articles about how to pick a logging library out there, and most of them say the same things: pick one that has a robust user community, doesn’t have a major impact on your runtime or code hygiene, and allows you to capture, augment, and output the data flexibly. Anything in wide use for your programming language of choice at this point will likely meet your needs.

3

Don’t be overwhelmed

Now is the time to decide what is the highest priority to improve—what will you get the most benefit out of soonest? You could just do a simple frequency analysis of a given log- what happens the most in this service? Another approach is to find out where else in your organization this log’s output is used. You could ask what kind of errors your support engineers see the most and use that kind of information to prioritize your instrumentation work. As a last resort if you don’t have a clear priority, just start by wrapping the existing instrumentation’s strings and see where that takes you.

Don’t sink weeks into changing all of your logging at once. Once you can show that structured logs for even a subset of your code base are helping you better understand what’s happening in production, you will have the data you need to push for a greater investment in standardization...and observability.

Start Adding Structure

Once the initial hurdle of getting your shiny new logging framework everywhere in your codebase is jumped, the real fun begins. It’s time to add some structure. You don’t have to do the whole codebase at once - just add things as you go, starting with the highest value items first. Once you get the ball rolling, your team might even get into the fun too—everyone likes seeing how they can get more detail about the things they actually care about in their code.

For instance, coming back to our example above—maybe you have some existing lines like this:

logrus.Infof("accepted connection on port %d from %s", port, authority)

logrus.Infof("basic authentication accepted for user %s", user) logrus.Infof("processing request for %s", r.URL.Path) logrus.Infof("request succeeded, sent response code %d", status) logrus.Infof("closed connection to %s", authority)

4

Those can be translated into new structured lines that will be dramatically easier to search and use to solve problems. In logrus, there are WithField and WithFields methods that can be used to add structure to any log line. Let’s try applying those to our statements.

logrus.WithFields(logrus.Fields{

"port": port,

"authority": authority, }).Info("accepted connection") logrus.WithField("user", user).Info("basic authentication accepted") logrus.WithField("path", r.URL.Path).Info("processing request") logrus.WithField("status", status).Info("sent response code") logrus.WithField("authority", authority).Info("closed connection")

logrus will now generate a much cleaner output that looks like this:

INFO[0000] accepted connection INFO[0000] basic authentication accepted INFO[0000] processing request INFO[0000] sent response code INFO[0000] closed connection

authority=10.0.0.3:63349 port=80 user=foo

path=/super/slow/server

status=200 authority=10.0.0.3:63349

Resist bikeshedding on standards

Establish standards for field names, content types, etc. but don't get too caught up in this—maybe even wait until after you've converted a couple of apps, to prevent derailment by bikeshedding.

It’s totally reasonable for a mature project or organization to maintain its own module of logging conveniences. You can have some startup configuration that outputs nice colors and pretty formatting when you’re developing locally, but just emits JSON when your service runs in production. You can have helpers for adding domain-specific context to all your requests (for example: customer name, ID, and pricing plan). You can be creative.

5

Do something useful with the output

Once you’ve started to generate structured logs, don’t just sit on them—send them to a tool that offers an easier way to query, visualize, and plan your next steps. In particular, use something that lets you visualize your data, like Honeycomb. Keep moving toward the level of observability your organization needs to support the service you deliver. Show your colleagues and your management what you’re learning to further your case for investing in observability.

Using Honeycomb BubbleUp to quickly analyze and outliers in a heatmap

You have structured logs and want to take the next step

Switching to an event-based approach is more of a cultural change than logistically different from going from unstructured logs to structured ones, which is why we recommend skipping the intervening step if you have the resources to do so.

6

An event represents a unit of work in your systems

When we talk about events, we mean the following definition:

An event contains information about what it took for a service to

perform a unit of work.

Treating an event as a unit of work lets you adjust what it means depending on the goals of the observer. Sometimes a unit of work is downloading a single file, parsing it, and extracting specific pieces of information, but sometimes it’s getting an answer out of dozens of files. Sometimes a unit of work is accepting an HTTP request and doing everything necessary to hand back a response, but sometimes one HTTP request can generate many events.

Ideally, an event contains everything about what it took to perform that unit of work. It should record the input necessary to perform the work, attributes computed or resolved or discovered along the way, the conditions of the service as it was performing the work, and some details on the result of that work.

With wide, context-rich events, when you notice customer experience degrading, you then have a wide range of dimensions to start slicing by to get to the solution. Is the service failing requests on all endpoints or just a handful? Are all regions affected? Does it happen for everybody or just Android users? And once you zoom in, you have individual events that stand alone.

Seeing the events inside your logs

Many logs are only portions of events, regardless of whether those logs are structured. It’s not uncommon to see 5-30 log lines or entries which when taken together represent what could be considered one unit of work. For example, logs representing a single HTTP transaction often go something like our working example:

7

INFO[0000] accepted connection INFO[0000] basic authentication accepted INFO[0000] processing request INFO[0000] sent response code INFO[0000] closed connection

authority=10.0.0.3:63349 port=80 user=foo

path=/super/slow/server

status=200 authority=10.0.0.3:63349

The information contained in that collection of log messages is all related to a single event: the handling of that one connection. But rather than being helpfully grouped into one event, it’s spread out into many log messages. Sometimes there’s a request ID that lets you put them back together, but typically not.

The above log lines could be rolled up into one event that could be represented as something like this:

time="2019-08-22T11:56:11-07:00" level=info msg="Served HTTP request" authority="10.0.0.3:63349" duration_ms=123 path="/super/slow/server" port=80 service_name=slowsvc status=200 trace.trace_id=eafdf3123 user=foo

This can, of course, be formatted in a variety of ways, including as JSON using the JSONFormatter provided by the package.

logrus.SetFormatter(&logrus.JSONFormatter{})

Leading to an output like this, which is generally more readable for machines:

{

"authority":"10.0.0.3:63349", "duration_ms":123, "level":"info", "msg":"Served HTTP request", "path":"/super/slow/server", "port":80, "service_name":"slowsvc", "status":"200", "time":"2019-08-22T11:57:03-07:00",

8

"trace.trace_id":"eafdf3123", "user":"foo" }

Here are some concrete recommendations for creating events from your infra logs and with your own instrumentation:

● Generate unique request IDs at the edge of your infrastructure, and propagate them through the entire request lifecycle, including to your databases (in the comments field).

● Generate one event per service/hop/query/etc. For example, a single API request should generate a log line or event at the edge (ELB/ALB), the load balancer (Nginx), the API service, each microservice it gets passed off to, and for each query it generates on each storage layer. There are other sources of information and events that may be relevant when debugging (for example, your database likely generates a bunch of events that say how long the queue length is and reporting internal statistics, you may have a bunch of system stats information) but one event per hop is the current easiest and best practice).

● Wrap any call out to any other service/data store as a timing event. Finding where the system has become slow can involve either distributed tracing or comparing the view from multiple directions. For example, a DB may report that a query took 100ms, but the service may argue that it actually took 10 seconds. They can both be right--if the database doesn’t start counting time until it begins executing the query, and it has a large queue.

● Collect lots of context. Each event should be as wide as possible, with as many high-cardinality dimensions as possible, because this gives you as many ways to identify or drill down and group the events with other similar events as possible.

● Add redundant information when there’s an enforced unique identifier and a separate field that is easier for the people reading the

9

graph to understand. For example, perhaps in a given service, a Team ID is globally unique, and every Team has a name.

● Add two fields for errors – the error category and the returned error itself, especially when getting back an error from a dependency. For example, the category might include what you’re trying to do in your code (error reading file) and the second what you get back from the dependency (permission denied).

● Opt for wider events (more fields) when you can. It’s easier to add in more context now than it is to discover missing context later.

● Don’t be afraid to add fields that only exist in certain contexts. For example, add user information if there is an authenticated user, don’t if there isn’t.

● Be thoughtful about field names. Common field name prefixes help when skimming the field list if they’re alphabetized.

● Add units to field names, not values (such as parsing_duration_µs or file_size_gb).

What about the logs from my infra?

When you can’t instrument the code yourself (for example, the cloud services you use), look for the instrumentation they do provide, and find ways of extracting it to get what you need. For example, for MySQL events, stream events off the wire, heavily sampled, AND tail the slow query log, AND run MySQL command line commands to get InnoDB stats and queue length.

Can I automate any of this?

You can use tools like cribl.io to structure and clean up your existing infra and app logs, and/or augment them with more context, effectively turning them into events. Tools like this also support scrubbing and routing data so it can be sent to different locations based on what’s needed. For example, you might need to keep log data around for a long period of time for compliance purposes, but don't want to send all that data to your observability vendor.

10

Thinking with traces... for the future

With the advent of distributed tracing, another thing you could do is start thinking about your log lines as potential traces instead--If you collect data in arbitrarily wide structured events, you can infer metrics from those, and if you automatically append some simple span identifiers, you can use those same events for tracing views as well. For instance, using Honeycomb Beelines one can generate detailed traces in a similar fashion to the usage of structured logs above and this is the next logical step along an observability journey. For instance, your code would look something like this and would capture timing information about what each step was doing automatically:

span, ctx := beeline.StartSpan(req.Context(), "do_work") doSomeWork(ctx)

span.Send()

span, ctx = beeline.StartSpan(ctx, "call_db") callDB(ctx)

span.Send()

You don't have to do distributed tracing across the network to start thinking in the tracing mindset. Even just one service can emit multiple spans that “chunk out” and visualize each step of what your app is doing. Therefore, you don't have to emit a log of hitting an edge case any more - emit a span instead. Even if your trace is only one span long, this switch in mindset prepares you and your code for an easier path to additional instrumentation.

What makes a system observable? When do I stop?

A system is “observable” to the extent that you can explain what is happening on the inside just from observing it on the outside. This includes not having to add new instrumentation to get your new question answered. Adding new instrumentation isn't wrong (quite the opposite! Instrument as much as you have to to get the level of observability you need to support your service), but your system isn't appropriately observable until you don't have to add more to ask anything you need to know about it.

11

Before engaging in any major instrumentation project, determine what “appropriately observable” means for your business. What level of observability do you need to deliver the quality of service you must deliver? What do your users care about? What will they notice? If your system is instrumented well enough that you can ask most or all questions required to support it at the level of quality you want to deliver to your users, it is appropriately observable.

Observability isn't about mindlessly laboring over instrumenting every CPU instruction, every mouse move the user makes, or how long it takes to add two numbers together, it is about understanding what level of service you need to deliver, and instrumenting so you can meet those requirements.

Want to know more about observability? Download the first guide in our acclaimed series: “Achieving Observability”

12

About Honeycomb

Honeycomb provides next-gen APM for modern dev teams to better understand and debug production systems. With Honeycomb teams achieve system observability and find unknown problems in a fraction of the time it takes other approaches and tools. More time is spent innovating and life on-call doesn't suck. Developers love it, operators rely on it and the business can’t live without it.

Follow Honeycomb on Twitter LinkedIn

Visit us at Honeycomb.io

if you’re trying to achieve observability starting with€¦ · if you’re trying to achieve...

Documents