Monitoring Application Performance in Kubernetes: How New Relic Sets You Up for Success

sorangutan

This post was originally published on November 27, 2017. It was updated with information on the New Relic cluster explorer and using the New Relic platform to monitor Kubernetes environments.

Over the last few years, Kubernetes has emerged as the de facto standard for orchestrating and managing application containers. Virtually all the significant players in the cloud-native application space—many of them fierce competitors—have thrown their support behind Kubernetes as an industry standard. Measured on developer support, Kubernetes’ journey over the last three years—from first public appearance to serving as the main topic at an event attracting more than 7,000 attendees (KubeCon + CloudNativeCon)—has been simply spectacular.

Here at New Relic, the Kubernetes story boils down to a couple of very strategic points. First, we believe that for all of its success to date, Kubernetes is actually—and amazingly—still in its early journey towards achieving its full potential. Tomorrow’s Kubernetes environments will operate at a scale that will dwarf what we see today. As Kubernetes environments continue to scale, they will also continue to get more complex—and to present bigger challenges for efforts to monitor their performance and health.

Second, many of our customers tell us they see tremendous value in leveraging the New Relic platform to transition to and run their application workloads in Kubernetes. So we have been working hard to expand and upgrade the New Relic platform with a new generation of monitoring tools designed for Kubernetes environments. These capabilities will help our customers to keep moving quickly and to act with confidence when they adopt Kubernetes to manage and orchestrate their containerized workloads.

New Relic and Kubernetes: A powerful pairing

Most organizations that deploy Kubernetes environments today share the same basic goal: To maximize the benefits of running containerized applications, at virtually any scale, through the use of automation, workload orchestration, and other enabling technologies. It’s difficult to overstate the importance of orchestration in general, and Kubernetes in particular, in achieving this goal. Without these capabilities, the sheer complexity of managing and optimizing containerized workloads, microservices, and related innovations—especially when deployed at scale—might easily outweigh their benefits.

While Kubernetes excels at abstracting away much of the toil associated with managing containerized workloads, it also introduces some complexities of its own. The Kubernetes environment includes an additional layer that comes into play between the application and the underlying infrastructure. The challenge of monitoring and maintaining the performance and health of these Kubernetes environments, or of troubleshooting issues when they occur, can be daunting—especially as organizations deploy these environments at massive scale.

This is where the New Relic platform enters the picture as a powerful and uniquely capable complement for today’s Kubernetes environments. New Relic is all in on Kubernetes, which is why we have been steadily expanding and upgrading our ability to instrument, monitor, and support full observability across Kubernetes environments. This includes giving you the ability to move quickly and easily from high-level, holistic views of the performance and health of Kubernetes clusters, down to the node, container, and application-level visibility required to identify and troubleshoot performance issues. It also includes the ability to instrument and monitor Kubernetes infrastructure alongside applications, which critical for understanding dependencies and isolating potential performance issues.

Using New Relic to achieve end-to-end Kubernetes monitoring

With that in mind, we’d like to introduce you to some of the most powerful capabilities currently being introduced in the New Relic platform for monitoring Kubernetes applications and resolving performance issues in these environments. In particular, we’ll demonstrate how to use the Kubernetes Downward API to surface Kubernetes-specific metadata; and then how to use the New Relic APM agent API to tie this metadata into APM in order to explore performance issues and troubleshoot application transaction errors.

The importance of an end-to-end view

By design, applications typically aren’t aware of whether they’re running in a container or on an orchestration platform. A Java web application, for example, doesn’t have any special code paths that get executed only when running inside a Docker container on a Kubernetes cluster. This is a key benefit of containers (and container orchestration engines): your application and its business logic are decoupled from the specific details of its runtime environment. If you ever have to shift the underlying infrastructure to a new container runtime or Linux version, you won’t have to completely rewrite the application code.

Similarly, New Relic APM’s language agent—which instruments the application code to track rich events, metrics, and traces—doesn’t care where the application is running. It could be running in an ancient Linux server in a forgotten rack or on the latest 72-CPU Amazon EC2 instance. When monitoring applications managed by an orchestration layer, however, being able to relate an application error trace, for instance, to the container, pod, or host it’s running in can be very useful for debugging or troubleshooting.

All of these are must-have capabilities for a Kubernetes application monitoring solution because they help our customers answer an all-too-common question: Does a performance problem really reside in the code, or is it actually tied to the underlying infrastructure?

Application instrumentation with the Kubernetes Downward API

Traditionally, for application transaction traces or distributed traces you collect using New Relic APM, our agent can tell you exactly which server the code was running on. In many container environments, though, this gets more complex: the worker nodes (hosts) where the application runs (in the containers/pods) are often ephemeral—they come and go. It’s fairly common to configure policies for applications running in Kubernetes to automatically scale their host count in response to traffic fluctuations, so you may investigate a slow transaction trace, but the container or host where that application was running no longer exists. Knowing the containers or hosts where the application is currently running is not necessarily an indication of where it was running 5, 15, or 30 minutes ago—when the issue occurred.

Fortunately, the Kubernetes Downward API allows containers to consume information about the cluster and pod they’re running in using environment variables instead of external API calls. We can use a small sample app to demonstrate how this works in a Node.js application. (Feel free to fork this repo for your own use.)

To integrate with the Downward API, we edited the environment section of the application manifest; and we defined environment variables that would give a containerized application access to the Kubernetes node name, host IP, pod name, pod namespace, pod tier, and pod service account in which it’s running.

A few examples of these variables include:

- name: K8S_NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: K8S_HOST_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
- name: K8S_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name

From there, we use the New Relic Node.js Agent API to expose those environment variables as custom parameters for all application transaction traces in an express.js application:

// Middleware for adding custom attributes
// These map to environment variables exposed in the pod spec
var CUSTOM_PARAMETERS = {
    'K8S_NODE_NAME': process.env.K8S_NODE_NAME,
    'K8S_HOST_IP': process.env.K8S_HOST_IP,
    'K8S_POD_NAME': process.env.K8S_POD_NAME,
    'K8S_POD_NAMESPACE': process.env.K8S_POD_NAMESPACE,
    'K8S_POD_IP': process.env.K8S_POD_IP,
    'K8S_POD_SERVICE_ACCOUNT': process.env.K8S_POD_SERVICE_ACCOUNT,
    'K8S_POD_TIER': process.env.K8S_POD_TIER
};

app.use(function(req, res, next) {
  newrelic.addCustomParameters(CUSTOM_PARAMETERS);
  next();
});

Exploring application performance in Kubernetes with New Relic

To get the instrumentation in place, we deployed the updated application code to the Kubernetes cluster, and then the custom parameters from Kubernetes began to appear in the New Relic UI.

In the following screenshot, we see some error details; the transaction attributes shows us, among other details, the Kubernetes hostname and IP address where the error occurred:

We also see the same metadata show up under Distributed Trace attributes.

Next, we used New Relic Insights to look at the same instrumentation to see the performance of application transactions based on pod names. To do this in Insights, we simply wrote the following custom New Relic Query Language (NRQL) query:

SELECT percentile(duration, 95) from Transaction where appName='newrelic-k8s-node-redis' and name='WebTransaction/Expressjs/GET//'FACET K8S_POD_NAME TIMESERIES auto

And here’s the result:

The Downward API exposes a good deal of useful metadata for monitoring solutions. You can use it to gain useful information about performance outliers and track down individual errors. In aggregate, APM error profiles also surfaces any correlations between errors in transactions that have been instrumented with data you’ve gathered with the Downward API.

For instance, in this example, APM error profiles automatically notices that nearly 57% of errors come from the same pods and pod IP addresses:

APM error profiles automatically incorporates the custom parameters and uses different statistical measures to determine if an unusual number of errors is coming from a certain pod, IP, or host within the container cluster. From there, you can zero in on infrastructure or cluster-specific root causes of the errors (or maybe you’ll just discover some bad code).

APM and the Kubernetes cluster explorer

The discussion above shows how to leverage Kubernetes metadata within APM to help troubleshoot issues. Conversely, if you are starting your investigation into an issue from the infrastructure side of the house, you can also visualize application performance metrics within the newly announced New Relic Kubernetes cluster explorer. If you identify issues within a node or a pod but don’t see any issues with the infrastructure itself, you can quickly check application metrics (throughput, response time, error rate and violations) for the application instance running on that specific node/pod.

The Kubernetes cluster explorer lets teams drill down to get visibility into applications and infrastructure metrics side-by-side in a rich, curated UI that simplifies complex environments.

Effectively monitoring applications running in Kubernetes requires not just code-level visibility into the applications but also the ability to correlate applications with the containers, pods, or hosts in which they’re running. By identifying pod-specific performance issues for an application’s workload, you can troubleshoot performance issues faster and have confidence that your application is always available, running fast, and doing what it is supposed to do.

Using the Kubernetes cluster explorer, you can get an understanding of how your application is performing. Additionally, clicking on the application (in this example ‘Order-Packaging’) sends you to the APM overview page where you can troubleshoot and dissect that specific application.

Your customers may not care about Kubernetes … but you must

Your customers likely don’t care if you’re using traditional virtual machines; a bleeding-edge, multi-cloud federated Kubernetes cluster; or a home-grown artisanal orchestration layer. They just want the applications and services you provide to be available, reliable, and fast.

With the rise of Kubernetes, however, along with new technical concepts like orchestration layers, teams need new contexts for understanding and exploring performance in their applications and code. We believe that to efficiently build and run modern software, you need visibility into the application layer—especially when those applications are running inside Kubernetes clusters. We look forward to continuing our work with customers and the broader Kubernetes community, and to creating and providing the solutions they need to succeed.

Learn more about monitoring Kubernetes and the new cluster explorer by visiting New Relic at KubeCon + CloudNativeCon North America 2018, in Seattle, on December 11-13, at booth #P20.