Problem Management
Problem management is the process of identifying and managing the causes of incidents on an IT service. It is a core component of ITSM frameworks.
Problem management isn’t just about finding and fixing incidents. It is about identifying and understanding the underlying causes of an incident and choosing the best method to eliminate that root cause. Pinpointing the cause has little value to an organization if the work is done in isolation by a siloed team, so problem management should be continuous and practiced across multiple teams, including IT, security, and software development. An incident may be over once the service is up and running again, but until the underlying causes and contributing factors are addressed, the problem remains.
Problem management vs. incident management
ITIL defines a problem as a cause, or potential cause, of one or more incidents. The behaviours behind effective incident management and effective problem management are often similar and overlapping, but there are still key differences. For example, rolling back a recent deploy may get the service operating again and end the incident, but the underlying problem remains.
That said, we believe that problem management and incident management practices are becoming increasingly intertwined. In the time between incidents, IT teams can focus their efforts on problem investigations that lead to improvements and better service quality. This is where problem management is most valuable to the organization.
The Problem/Incident management process
When problem management is a heavy, siloed, and separate process, companies can end up creating a dumping ground of problems. In some teams, this backlog is where problems go to die. It’s better to get problems in front of the teams that are able to investigate them and act on the findings.
- Problem detection: Proactively find problems so they can be fixed, or workarounds identified, before future incidents happen.
- Categorization and prioritization: Track and assess known problems to keep teams organized and working on the most relevant and high-value problems.
- Investigation and diagnosis: Identify the underlying contributing causes of the problem and the best course of action for remediation.
- Create a known error record: In ITIL, a known error is “a problem that has a documented root cause and a workaround.” Recording this information leads to less downtime if the problem triggers an incident. Known error records are typically stored in a known error database.
- Create a workaround, if necessary: A workaround is a temporary solution for reducing the impact of problems and keeping them from becoming incidents. Workarounds aren’t ideal, but they can limit business impact and avoid a customer-facing incident if the problem can’t be easily identified and eliminated.
- Resolve and close the problem: A closed problem is one that has been eliminated and can no longer cause another incident.
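The steps above can be sketched as a small data model, to make the lifecycle of a problem record concrete. This is only an illustration, not an ITIL-defined schema or any tool's API: the `Problem` class, the `Status` values, the 1-to-5 priority scale, and the `next_up` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    DETECTED = "detected"
    PRIORITIZED = "prioritized"
    UNDER_INVESTIGATION = "under_investigation"
    KNOWN_ERROR = "known_error"
    CLOSED = "closed"


# Allowed transitions, mirroring the order of the steps above (an assumption;
# real processes may allow more paths, such as reopening a problem).
ALLOWED = {
    Status.DETECTED: {Status.PRIORITIZED},
    Status.PRIORITIZED: {Status.UNDER_INVESTIGATION},
    Status.UNDER_INVESTIGATION: {Status.KNOWN_ERROR, Status.CLOSED},
    Status.KNOWN_ERROR: {Status.CLOSED},
    Status.CLOSED: set(),
}


@dataclass
class Problem:
    title: str
    priority: int = 3  # hypothetical scale: 1 = highest, 5 = lowest
    status: Status = Status.DETECTED
    incident_ids: List[str] = field(default_factory=list)  # related incidents
    root_cause: Optional[str] = None
    workaround: Optional[str] = None

    def advance(self, new_status: Status) -> None:
        """Move to the next step, rejecting transitions that skip the process."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot move from {self.status.value} to {new_status.value}"
            )
        self.status = new_status

    def record_known_error(self, root_cause: str, workaround: str) -> None:
        """A known error needs both a documented root cause and a workaround."""
        if not root_cause or not workaround:
            raise ValueError("a known error needs a root cause and a workaround")
        self.advance(Status.KNOWN_ERROR)
        self.root_cause = root_cause
        self.workaround = workaround


def next_up(backlog: List[Problem]) -> List[Problem]:
    """Return open problems, highest priority first."""
    open_problems = [p for p in backlog if p.status is not Status.CLOSED]
    return sorted(open_problems, key=lambda p: p.priority)


if __name__ == "__main__":
    problem = Problem(title="Service fails after deploy", priority=1,
                      incident_ids=["INC-1"])
    problem.advance(Status.PRIORITIZED)
    problem.advance(Status.UNDER_INVESTIGATION)
    problem.record_known_error(
        root_cause="Example root cause", workaround="Roll back the deploy"
    )
    print(problem.status.value, "-", problem.workaround)
```

Keeping the incident IDs on the problem record reflects the idea that one problem can sit behind several incidents, and rejecting skipped transitions is one way to keep problems from being closed before they have been investigated.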