Incident Management

Incident management is the process of responding to an unplanned event or service interruption and restoring the service to its operational state. According to ITIL (Information Technology Infrastructure Library), “the incident management process ensures that normal service operation is restored as quickly as possible and the business impact is minimized.”

Incidents are events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too, because it runs slowly and interferes with productivity.

Problem management vs. incident management

ITIL defines a problem as a cause, or potential cause, of one or more incidents. The behaviours behind effective incident management and effective problem management are often similar and overlapping, but there are still key differences. For example, rolling back a recent deployment may get the service operating again and end the incident, but the underlying problem remains.

That said, we believe that problem management and incident management practices are becoming increasingly intertwined. In the time between incidents, IT teams can focus their efforts on problem investigations that lead to improvements and better service quality. This is where problem management becomes most valuable to the organization.

The Problem/Incident management process

When problem management is a heavy, siloed, and separate process, companies can end up creating a dumping ground of problems. In some teams, this backlog is where problems go to die. It’s best to get problems in front of the teams that can investigate them and act on the findings. The process generally follows these steps:

Problem detection - Proactively find problems so teams can fix them or identify workarounds before future incidents happen.
Categorization and prioritization - Track and assess known problems to keep teams organized and working on the most relevant and high-value problems.
Investigation and diagnosis - Identify the underlying contributing causes of the problem and the best course of action for remediation.
Create a known error record - In ITIL, a known error is “a problem that has a documented root cause and a workaround.” Recording this information leads to less downtime if the problem triggers an incident. Known error records are typically stored in a known error database (KEDB).
Create a workaround, if necessary - A workaround is a temporary solution for reducing the impact of problems and keeping them from becoming incidents. Workarounds aren’t ideal, but they can limit business impact and avoid a customer-facing incident if the problem can’t be easily identified and eliminated.
Resolve and close the problem - A closed problem is one that has been eliminated and can no longer cause another incident.
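
The steps above can be sketched as a small state machine. This is a minimal illustration in Python, not an ITIL-prescribed model: the `Problem` class, the state names, the allowed transitions, and the numeric priority scale are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProblemState(Enum):
    DETECTED = "detected"
    PRIORITIZED = "prioritized"
    UNDER_INVESTIGATION = "under_investigation"
    KNOWN_ERROR = "known_error"
    CLOSED = "closed"


# Allowed transitions (an assumption for this sketch, not an ITIL rule).
ALLOWED = {
    ProblemState.DETECTED: {ProblemState.PRIORITIZED},
    ProblemState.PRIORITIZED: {ProblemState.UNDER_INVESTIGATION},
    ProblemState.UNDER_INVESTIGATION: {ProblemState.KNOWN_ERROR, ProblemState.CLOSED},
    ProblemState.KNOWN_ERROR: {ProblemState.CLOSED},
    ProblemState.CLOSED: set(),
}


@dataclass
class Problem:
    title: str
    priority: Optional[int] = None      # hypothetical scale: 1 = highest
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    state: ProblemState = ProblemState.DETECTED
    history: List[ProblemState] = field(default_factory=list)

    def _move(self, new_state: ProblemState) -> None:
        # Reject transitions that skip or repeat a step.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.history.append(self.state)
        self.state = new_state

    def prioritize(self, priority: int) -> None:
        self._move(ProblemState.PRIORITIZED)
        self.priority = priority

    def investigate(self) -> None:
        self._move(ProblemState.UNDER_INVESTIGATION)

    def record_known_error(self, root_cause: str, workaround: str) -> None:
        # A known error pairs a documented root cause with a workaround.
        self._move(ProblemState.KNOWN_ERROR)
        self.root_cause = root_cause
        self.workaround = workaround

    def close(self) -> None:
        self._move(ProblemState.CLOSED)
```

A problem would then move through the steps in order, for example `prioritize(1)`, `investigate()`, `record_known_error(...)`, and finally `close()`. Keeping the transition table explicit makes it harder for a problem to sit unprioritized in a backlog without anyone noticing.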

Major Incident Management

A major incident is typically defined as an incident that has a significant impact on the business, such as a major service outage or data breach. Major incidents can also be defined based on the severity of the incident and the amount of resources that are required to resolve it.

In addition, major incidents are the ones customers and stakeholders are most likely to notice, so managing them well protects confidence in your service, and what you learn from them can reduce repeat incidents and improve your site’s reliability. This is important for keeping your users happy and ensuring that your staff is not overloaded with the constant maintenance of your existing infrastructure.

When major incidents occur, time is always of the essence. Incidents are labelled “major” when they are customer-facing or have a large impact on your product or infrastructure.
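
As a rough illustration of the criteria above, a team might encode its own definition of “major” as a simple check. The sketch below is in Python and is only one possible reading: the `Incident` fields, the `is_major` function, and the 25% impact threshold are hypothetical, and each organization would set its own rules.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    summary: str
    customer_facing: bool
    affected_users_pct: float   # share of users affected, 0-100
    data_breach: bool = False


# Hypothetical threshold for "large impact"; each organization sets its own.
MAJOR_IMPACT_PCT = 25.0


def is_major(incident: Incident) -> bool:
    """Return True when the incident meets this sketch's 'major' criteria."""
    return (
        incident.data_breach
        or incident.customer_facing
        or incident.affected_users_pct >= MAJOR_IMPACT_PCT
    )
```

Writing the rule down, even in a form this simple, means responders do not have to debate severity while the clock is running.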