Observability with the Elastic Stack



  • In my role as a Product Lead for Observability at Elastic, I get a few different reactions when I use the term 'observability'. By far the most common reaction today is still: "What is 'observability'?" But I also increasingly hear things like: "We just kicked off an 'observability initiative', but we're still figuring out exactly how to go about it." And finally, some organizations we have been fortunate to work with already consider 'observability' an integral part of how they design and build products and services.

    Given that the term is still gaining traction, I thought it would be useful to demystify how we at Elastic view 'observability', what we learned from our thought-leading customers, and how we think about it from the product perspective as we evolve our stack for operational use cases.

    What is 'Observability'?

    We certainly did not invent the term 'observability'. We started hearing about it from users, primarily those within the Site Reliability Engineering (SRE) community. Several sources trace the beginnings of the term back to SRE organizations at Silicon Valley giants like Twitter. And even though the seminal Google SRE Book does not mention the term, it lays out many of the principles associated with 'observability' today.

    'Observability' is not something that a vendor delivers in a box -- it is an attribute of a system you build, much like usability, high availability, and stability. The goal of designing and building an 'observable' system is to make sure that when it is run in production, the operators responsible for it can detect undesirable behaviors (e.g. service downtime, errors, slow responses) and have actionable information to pin down the root cause effectively (e.g. detailed event logs, granular resource usage information, and application traces). Common challenges preventing organizations from achieving this seemingly obvious goal include not collecting enough information, collecting too much information without making it actionable, and fragmenting access to that information.

    The first aspect — detection of undesirable behaviors — usually starts with setting Service Level Indicators (SLIs) and Service Level Objectives (SLOs). These are the internal measures of success by which production systems are judged in observability-minded organizations. If there is a contractual obligation to fulfill these objectives, an SLI/SLO may also translate into a Service Level Agreement (SLA). The most common example of an SLI is system uptime, for which you may set an SLO of 99.9999%. System uptime is also the most common SLA exposed to external customers. However, your internal SLIs/SLOs may be a lot more granular, and monitoring and alerting on these most important indicators of production system behavior is the basis of any observability initiative. This aspect of observability is also commonly known as "monitoring".
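
    To make the SLO idea a bit more tangible, here is a minimal sketch of the arithmetic behind an availability SLO — how a target percentage translates into a downtime "error budget" over a measurement window. The 30-day window and the specific SLO values below are hypothetical examples, not values prescribed by Elastic.

    # Minimal sketch: translating an availability SLO into a downtime budget.
    # The 30-day window and SLO values are hypothetical examples.

    def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
        """Return the downtime budget in minutes for a given availability SLO."""
        window_minutes = window_days * 24 * 60
        return window_minutes * (1 - slo)

    for slo in (0.999, 0.9999, 0.999999):
        print(f"SLO {slo:.4%}: {allowed_downtime_minutes(slo):.2f} minutes of downtime per 30 days")

    Run as-is, this prints roughly 43.2 minutes for 99.9%, 4.3 minutes for 99.99%, and under 3 seconds for 99.9999% — which is why the SLO you pick has a direct bearing on how quickly operators must be able to detect and diagnose problems.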

    The second aspect — providing operators with granular information to debug production issues quickly and efficiently — is an area where we see a lot of movement and innovation. There is quite a bit of talk about the "three pillars of observability" — metrics, logs, and application traces. There is also recognition that simply collecting all this granular data using a patchwork of tools is not necessarily actionable and often not cost effective. 
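
    One way to picture what "actionable" means here is a single structured event that carries all three pillars together, so an operator can pivot from an alert to the related logs, metrics, and traces without stitching separate tools. The sketch below is illustrative only: the field names are loosely modeled on the Elastic Common Schema, and the service name, trace id, and values are made up.

    # Minimal sketch of one structured event correlating the "three pillars":
    # a log message, resource metrics, and a trace identifier.
    # Field names loosely follow the Elastic Common Schema (ECS);
    # the service name, trace id, and values are made-up examples.
    import json
    from datetime import datetime, timezone

    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "service": {"name": "checkout"},              # hypothetical service
        "message": "payment request timed out",       # the log line
        "event": {"duration": 2_300_000_000},         # latency in nanoseconds
        "system": {"cpu": {"total": {"pct": 0.87}}},  # host metric at the time
        "trace": {"id": "abc123"},                    # link to the distributed trace
    }

    # Shipping signals as consistently structured documents (e.g. into
    # Elasticsearch) is what makes it possible to move between pillars
    # instead of querying a patchwork of disconnected tools.
    print(json.dumps(event, indent=2))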



    https://www.elastic.co/blog/observability-with-the-elastic-stack

