Five Steps for an Incident Response Workflow, Featuring the Jira Software and New Relic Integration

sorangutan

There are many ways to set up a successful incident response system, but the end result should always be the same: You need to discover incidents—and ideally fix them—before they affect your customers. Successful incident response—also referred to as incident orchestration—requires the alignment of teams, tools, and processes to prepare for and react to incidents and outages in your software.

You can’t afford to wait until an incident occurs to figure out a plan. You need a plan in place ahead of time so that you can act quickly to:

Maximize efficiency in communication and effort
Minimize the overall impact to your business

Modern software teams can do a lot more than just write a plan for handling incidents; they can connect real-time performance data and alerts to their incident response systems and automate much of the process using tools like the Jira Software and New Relic integration. Jira Software, developed by Atlassian, is a wildly popular tool used by Agile teams to plan, track, and release software. With New Relic data included in Jira, teams can easily file Jira tickets about incidents and include all the context stakeholders need to resolve them.

Let’s highlight five steps for building an incident response workflow using the Jira Software and New Relic integration.

(Note: This is an abridged version of “Incident orchestration: Align teams, tools, processes,” part of our Guide to Measuring DevOps Success.)

1. Assign owners for team dashboards

Team dashboards in New Relic Insights provide, in a single view, the performance status of major components in your applications. But more importantly, they also allow you to visualize the service level indicators (SLIs) and other key performance indicators (KPIs) for your applications.

Each team dashboard should have an owner who assumes responsibility for the health of the applications and features that the dashboard monitors. And there should be no ambiguity about who is responsible for attending to and resolving an alert condition.

How you set up such policies will vary depending on the size, structure, and culture of your organization. For example, some teams may prefer to assign dashboards and alerts based on de facto features or application ownership. Other teams may prefer to adopt an on-call rotation. In on-call rotations, designated team members handle all first-line incident responses, and they resolve or delegate responsibilities based on predetermined incident thresholds.
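To make the on-call option concrete, here is a minimal sketch of a round-robin rotation. It is not part of New Relic or Jira; the function name, the team list, and the seven-day shift length are all assumptions chosen for illustration.

```python
from datetime import date


def on_call_owner(team: list[str], rotation_start: date, today: date,
                  shift_days: int = 7) -> str:
    """Return the team member who is on call on `today`.

    Hypothetical helper: members take turns in list order, and each
    shift lasts `shift_days` days, counted from `rotation_start`.
    """
    if not team:
        raise ValueError("team must not be empty")
    if today < rotation_start:
        raise ValueError("today is before the rotation started")
    shifts_elapsed = (today - rotation_start).days // shift_days
    return team[shifts_elapsed % len(team)]


# Example: a three-person team rotating weekly from 1 January 2024.
team = ["ana", "bo", "cy"]
print(on_call_owner(team, date(2024, 1, 1), date(2024, 1, 10)))
```

Whatever scheme you choose, the point is the same: at any moment, exactly one name comes back as the owner.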

2. Determine incident thresholds for alert conditions

The term “alerting” often carries negative connotations; for too many developers, alerting correlates too closely with errors, mistakes, and ongoing issues. However, developers who are proactive about alerting know they don’t have to stare at their dashboards all day, because effective alerts will tell them when they need to check in. Effective alerts also take context into account: for instance, a certain alert condition may be dismissible during low-traffic periods but require active remediation during peak hours.

For each of your applications, set a proactive alerting policy:

Identify the thresholds for what is officially considered an incident.
As you create alert policies with New Relic Alerts, make sure each set of threshold criteria is context-dependent.
Document incident evaluation and known remediation procedures in runbooks.
Include links to your runbooks when you define conditions and thresholds for your alert policies.
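The idea of a context-dependent threshold can be sketched in a few lines. This is an illustration only, not the New Relic Alerts API: the `Threshold` class, the peak window of 09:00 to 18:00, the error-rate limits, and the runbook URL are all assumed values.

```python
from dataclasses import dataclass
from datetime import time

# Assumed peak-traffic window; adjust to your own traffic patterns.
PEAK_START = time(9, 0)
PEAK_END = time(18, 0)


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold record for one alert condition."""
    peak_error_rate: float      # limit that applies during peak hours
    off_peak_error_rate: float  # more lenient limit outside peak hours
    runbook_url: str            # where responders find remediation steps


def is_incident(error_rate: float, at: time, threshold: Threshold) -> bool:
    """Return True if `error_rate` meets or exceeds the limit for that time of day."""
    in_peak = PEAK_START <= at < PEAK_END
    limit = threshold.peak_error_rate if in_peak else threshold.off_peak_error_rate
    return error_rate >= limit


checkout = Threshold(0.02, 0.05, "https://wiki.example.com/runbooks/checkout")
print(is_incident(0.03, time(12, 0), checkout))  # peak hours
print(is_incident(0.03, time(2, 0), checkout))   # low-traffic period
```

The same 3% error rate counts as an incident at noon but not at 2 a.m., and the runbook link travels with the threshold so responders don't have to hunt for it.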

3. Ensure alerts have auditable notification channels

A key part of your incident response process is communication, which should take place in easily accessible and highly visible channels. A group chat room dedicated to incident communication allows all stakeholders to participate or observe and provides a chronology of notifications, key decisions, and actions for postmortem analysis.

New Relic Alerts integrates with several communication platforms, including Slack and Opsgenie. In fact, at New Relic we use a designated Slack channel for incidents; the channel is automated by a bot, called Nrrdbot (a modified clone of GitHub’s Hubot), that guides our incident responders through the communication process during an incident.
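If you want to see what an auditable chronology might look like in practice, here is a small sketch. It assumes you post to Slack through an incoming webhook, which accepts a JSON body with a `text` field; the helper names and the shape of the log entries are hypothetical, and the sketch only builds the payload rather than sending it.

```python
import json
from datetime import datetime, timezone


def record_event(log: list[dict], actor: str, message: str,
                 when: datetime) -> dict:
    """Append a timestamped entry to the incident chronology and return it."""
    entry = {
        "at": when.astimezone(timezone.utc).isoformat(),
        "actor": actor,
        "message": message,
    }
    log.append(entry)
    return entry


def slack_payload(entry: dict) -> str:
    """Build the JSON body for a Slack incoming webhook from one log entry."""
    text = f"[{entry['at']}] {entry['actor']}: {entry['message']}"
    return json.dumps({"text": text})


chronology: list[dict] = []
event = record_event(chronology, "alice", "Acknowledged alert",
                     datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(slack_payload(event))
```

Keeping the chronology in a list alongside the chat messages gives you a ready-made timeline for the retrospective.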

4. Create faster triage and resolution times with New Relic and Jira Software

When you use New Relic to discover problems in your application that your developers need to fix, you can push the details directly into the ticketing system in your Jira account. We built our integration to work with any combination of required custom fields.

The Jira Software and New Relic integration improves your incident response feedback cycle and helps you transition seamlessly from error trace to hot fix. Specifically, you can correlate incidents detected by New Relic with Jira tickets directly from the New Relic UI, and easily file and assign issues to responsible teams for tracking or remediation.

File a Jira ticket directly from New Relic with the Jira Software and New Relic integration.

From within the corresponding Jira ticket, you can track the current status and severity of an incident from creation to resolution.

When an incident occurs in your system, there’s a broad set of stakeholders—from first responders to devs and ops folks leading the remediation effort to support engineers relaying information to your customers—who need to understand the issue and its status. You need a single source of truth from which you can convey the details. New Relic brings the issue to your attention, and Jira makes sure everyone can track your remediation efforts in real time.
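The integration files tickets for you from the New Relic UI, so you don't need to write any code. Still, it can help to see roughly what an incident ticket contains. The sketch below builds the request body that Jira's REST API expects when creating an issue (a `fields` object with `project`, `summary`, `description`, and `issuetype`). It does not show how the integration works internally, and the project key, issue type, and custom field ID are made-up examples.

```python
import json
from typing import Optional


def build_issue_payload(project_key: str, summary: str, description: str,
                        issue_type: str = "Bug",
                        custom_fields: Optional[dict] = None) -> dict:
    """Build a create-issue request body for the Jira REST API.

    `custom_fields` maps Jira custom field IDs (for example
    "customfield_10010", a hypothetical ID) to their values.
    """
    fields = {
        "project": {"key": project_key},
        "summary": summary,
        "description": description,
        "issuetype": {"name": issue_type},
    }
    fields.update(custom_fields or {})
    return {"fields": fields}


payload = build_issue_payload(
    "OPS",
    "High error rate on checkout",
    "Error rate exceeded the peak-hours threshold. See the linked runbook.",
    custom_fields={"customfield_10010": "sev2"},
)
print(json.dumps(payload, indent=2))
```

Required custom fields vary from one Jira project to the next, which is why they are passed in separately here rather than hard-coded.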

5. Hold a retrospective and document your findings

After you’ve resolved an incident, key stakeholders and participants should hold a retrospective (also known as a “postmortem”) to capture accurate and thorough documentation about the incident. In the documentation, include:

A root-cause analysis
A chronology and summary of remediation steps and their results, whether they were successful or not (including links to relevant Jira tickets)
A measure of the impact to the business in terms of user experience and financial losses, if possible
Recommendations for system or feature improvements to prevent a recurrence
Recommendations for process and communication improvements

Store postmortem reports in a highly visible, searchable repository, like an internal wiki. Culturally, it is essential that in this process, you focus on constructive learning and improvement rather than punishment or blame.
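One lightweight way to keep reports consistent is to treat the sections listed above as a template and check each report against it. The sketch below is one possible shape, not a prescribed format; the class, field names, and the choice to treat business impact as optional (since it is only included "if possible") are assumptions.

```python
from dataclasses import dataclass, field, fields

# Business impact is included only "if possible", so it is not required here.
OPTIONAL_SECTIONS = {"business_impact"}


@dataclass
class Retrospective:
    """Hypothetical postmortem record mirroring the sections listed above."""
    root_cause: str = ""
    remediation_timeline: list[str] = field(default_factory=list)
    business_impact: str = ""
    system_recommendations: list[str] = field(default_factory=list)
    process_recommendations: list[str] = field(default_factory=list)


def missing_sections(report: Retrospective) -> list[str]:
    """Return the names of required sections that are still empty."""
    return [
        f.name for f in fields(report)
        if f.name not in OPTIONAL_SECTIONS and not getattr(report, f.name)
    ]


draft = Retrospective(root_cause="Expired TLS certificate on the checkout service")
print(missing_sections(draft))
```

A check like this flags gaps in the write-up without saying anything about who was involved, which fits the blameless spirit of the process.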

Get started by visiting the Atlassian Marketplace to install and set up the Jira Software and New Relic integration, and join the discussion at the New Relic Online Technical Community. For additional information, check out the documentation page.