Best Practices for Reducing Down-Time on the Way to Production

Serhat Can
May 7, 2018

Incident management is often associated with production incidents that require immediate action. But not all incidents are alike. Incidents can occur when integrating your code into master, releasing your software, changing a configuration item, or receiving a malicious email.

At OpsGenie, an on-call and incident management solution, we aim to reduce your alert fatigue and help you manage all kinds of incidents in the right way. Here are some best practices we have learned along the way that will help your teams develop, release, and run software with confidence.

Monitor: Stay vigilant on the way to production and alert teams of issues

In DevOps, it is important to keep feedback loops short and active so that teams can release and fix problems faster. Continuous integration and delivery tools can alert teams to issues that may break the release pipeline and block the way to production. Being unable to ship code is dangerous, because a bug may arise or a revert might become necessary at any time. That is why it is critical to notify the on-call responder and keep everyone involved, so the release pipeline stays healthy.

Orchestrate: Gather and automate your alerts from different tools

The number of tools organizations use to ease their DevOps processes increases daily. Depending on the level of abstraction, monitoring needs may include network, application performance, SLA, API monitoring, and more.

Development teams that embrace agile practices rely on tools like Jira or Zendesk for ticketing, and Slack or MS Teams for messaging. Because these tools are so interconnected, and so many of them hold information that matters to other stakeholders, keeping everything organized is hard but critical. OpsGenie updates other tools in your stack by opening tickets or closing alerts automatically. Tight integrations with chat tools like Slack allow teams to take full advantage of ChatOps by bringing day-to-day operations into shared chat channels.

Prioritize: Don’t page for low-priority incidents

Not all incidents are the same. Some might have catastrophic effects on multiple services, while others require only a check. While collecting alerts from testing and monitoring tools like AlertSite, teams can add tags or assign different priorities to alerts. Those priorities can then be used to route your alerts to different escalations.

For example, a low-priority incident can use an escalation with a single developer in it, and can even suppress notifications for an hour. A high-priority incident can notify the on-call responder immediately and, if that responder does not acknowledge the alert, page the whole team. This escalation policy can go all the way up to the CTO, depending on the priority.
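To make the idea concrete, here is a minimal sketch of priority-based escalation in Python. It is not OpsGenie's API or configuration format; the `EscalationStep` type, the `ESCALATIONS` table, and the 10- and 30-minute delays are assumptions made up for illustration. Only the one-hour delay for low-priority alerts comes from the example above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EscalationStep:
    delay_minutes: int  # minutes an alert must stay unacknowledged before this step fires
    notify: str         # who is notified at this step


# Hypothetical policies; the delays for the high-priority steps are illustrative assumptions.
ESCALATIONS: Dict[str, List[EscalationStep]] = {
    # Low priority: one developer, and nobody is notified during the first hour.
    "low": [EscalationStep(60, "one developer")],
    # High priority: on-call immediately, then the team, then the CTO.
    "high": [
        EscalationStep(0, "on-call responder"),
        EscalationStep(10, "whole team"),
        EscalationStep(30, "CTO"),
    ],
}


def who_to_notify(priority: str, minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been notified by now for an unacknowledged alert."""
    steps = ESCALATIONS[priority]
    return [s.notify for s in steps if s.delay_minutes <= minutes_unacknowledged]


if __name__ == "__main__":
    print(who_to_notify("low", 5))    # still suppressed: []
    print(who_to_notify("high", 15))  # ['on-call responder', 'whole team']
```

Keeping the policy in a plain table like this makes it easy to review who gets paged and when, separately from the code that sends the notifications.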

Cluster: Group related alerts

Incidents rarely create just 1–2 alerts. Think about a case where a core service in your microservices architecture breaks and multiple dependent services are affected, or when the build is broken and dependent teams can’t integrate their code into production. In real life, incidents are complex.

In complex incidents, the on-call responder might receive hundreds of alerts from various levels of testing and monitoring solutions. It is important to have a clear view of related alerts and to separate out unrelated ones. The biggest risk is being so flooded with alerts that you cannot find the pieces of information that would help your team resolve the issue. Or worse, missing other key alerts in the mess.

This is why OpsGenie introduced its new incident response orchestration solution. Users can group alerts automatically or manually and notify subscribers or predefined stakeholders about the issue. The value lies in reducing alert fatigue with smart grouping capabilities and in managing the dependencies between affected parties.
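As a rough illustration of automatic grouping, the sketch below buckets alerts by a shared tag. It does not describe how OpsGenie groups alerts internally; the `group_alerts` function, the `root_service` tag, and the alert dictionaries are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_alerts(alerts: List[dict], key: str = "root_service") -> Dict[str, List[str]]:
    """Bucket alert messages by the value of a shared tag.

    Alerts that do not carry the tag land in an "ungrouped" bucket,
    so unrelated alerts stay visible instead of being lost in the noise.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        groups[alert.get(key, "ungrouped")].append(alert["message"])
    return dict(groups)


if __name__ == "__main__":
    # Made-up alerts: two caused by the same failing service, one unrelated.
    sample = [
        {"message": "auth: 5xx rate high", "root_service": "auth"},
        {"message": "checkout: login failures", "root_service": "auth"},
        {"message": "disk usage 91% on build-3"},
    ]
    for name, messages in group_alerts(sample).items():
        print(name, messages)
```

The responder then sees one group per suspected cause rather than a flat stream, and the "ungrouped" bucket is where other key alerts would surface.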

Communicate: Keep stakeholders in the loop

As discussed earlier, complex incidents involve multiple teams. Some of those teams play a part in the incident itself; others only use the affected services and might need to take preventive action to protect themselves. It is important to keep all of them up to date.

However, this is easier said than done. Service dependencies and responsible teams should be defined, and alerts should be routed to those teams or individual stakeholders. Status pages and health dashboards let teams subscribe to updates and get quick feedback on the status of the issue without bothering the teams that are already busy solving the problem.
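One way to picture defined service dependencies is as a small graph that can be walked to find out who needs an update. The sketch below assumes a hand-maintained map; the service names, team names, and the `DEPENDENTS` and `OWNERS` tables are invented for the example and are not part of any OpsGenie feature.

```python
from typing import Dict, List, Set

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS: Dict[str, List[str]] = {
    "auth": ["checkout", "profile"],
    "checkout": ["storefront"],
    "profile": [],
    "storefront": [],
}

# Hypothetical map: service -> team responsible for it.
OWNERS: Dict[str, str] = {
    "auth": "identity-team",
    "checkout": "payments-team",
    "profile": "accounts-team",
    "storefront": "web-team",
}


def affected_services(failed: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Return every service that depends on `failed`, directly or transitively."""
    seen: Set[str] = set()
    stack = [failed]
    while stack:
        service = stack.pop()
        for dependent in dependents.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    seen.discard(failed)  # a dependency cycle must not list the failed service itself
    return seen


def stakeholders_to_notify(failed: str) -> List[str]:
    """Teams that own a downstream service and should be kept in the loop."""
    return sorted({OWNERS[s] for s in affected_services(failed, DEPENDENTS)})


if __name__ == "__main__":
    print(stakeholders_to_notify("auth"))
```

The team that owns the failing service is deliberately left out of the result: it is already working the incident, and the goal here is to inform the teams downstream of it.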

How do AlertSite and OpsGenie help together?

AlertSite users can integrate SmartBear’s AlertSite with OpsGenie and use OpsGenie’s powerful incident response orchestration and alerting capabilities to improve uptime for REST and SOAP APIs.

Learn more at OpsGenie’s AlertSite integration page and sign up for a free trial to see how it can help you keep your service up and running!

By Serhat Can, Technical Evangelist at OpsGenie

Originally published at blog.smartbear.com on May 7, 2018.
