Root cause CIs for business services

For alerts associated with discovered business services and manual services, Service Analytics can apply root cause analysis (RCA). RCA identifies the CI that is the underlying root cause in a business service.

If Domain Support - Domain Extensions Installer is activated, then RCA is domain aware. Alerts are analyzed within the context of the domain that business services or manual services belong to. RCA for a business service or manual service runs on the MID Server that is in the same domain as the business service or manual service. If there is no MID Server for a specific domain, then the MID Server from the global domain is used.
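The domain fallback described above can be pictured with a minimal Python sketch. It is an illustration only, not the product's implementation: the function name `select_mid_server`, the `"global"` label, and the MID Server names are all hypothetical.

```python
from typing import Dict, List, Optional

GLOBAL_DOMAIN = "global"  # assumed label for the global domain


def select_mid_server(
    service_domain: str,
    mid_servers_by_domain: Dict[str, List[str]],
) -> Optional[str]:
    """Pick a MID Server in the service's own domain, else fall back to global."""
    for domain in (service_domain, GLOBAL_DOMAIN):
        servers = mid_servers_by_domain.get(domain)
        if servers:
            # Any server in the domain will do for this sketch; take the first.
            return servers[0]
    return None  # no MID Server in the service domain or in the global domain


# Hypothetical inventory of MID Servers keyed by domain.
mid_servers = {"global": ["mid-global-01"], "acme": ["mid-acme-01"]}
```

With this inventory, a service in the `acme` domain resolves to `mid-acme-01`, while a service in a domain that has no MID Server of its own resolves to `mid-global-01`.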

Root cause analysis

The ongoing operations of an application can generate many events and alerts, which can become overwhelming when problems arise. When the system has a problem, manually assessing the impact of the alerts and identifying the underlying cause can be slow and resource-intensive.

To identify underlying problems, RCA algorithms prioritize alerts, group them in the context of impacted services, and identify root cause CIs. The root cause CI is a CI in a business service from which the root alert for an incident originated and which subsequently triggered additional alerts.

RCA has these components:
RCA Learner
An offline job that runs once a day to process past alerts. It collects information about frequent alert patterns within a service context and stores this information in the alert knowledge base. Based on past alerts and on the impact model, the RCA Learner creates a probability model that can be used to answer cause-and-effect queries.
Real-Time Query
A scheduled job that runs every minute to group alerts and to update root cause CIs. It queries past lists of root cause recommendations to get the probability score for real-time alerts associated with other open alerts within the service.

For discovered business services and for manual services, the Learner collects and analyzes data from past alerts. For new alerts, the Learner applies existing knowledge from similar past alerts, and continues to capture and analyze data from new alerts. As more alerts are encountered and resolved, the alert knowledge base grows and the precision of diagnosing the root cause CI improves.
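To make the learn-then-query idea concrete, here is a deliberately simplified Python sketch. It is not the algorithm that Service Analytics uses, which this document does not describe. It assumes that each past episode is an ordered list of CI names and that the first CI to alert was the root cause. The names `learn` and `query` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, FrozenSet, Iterable, List, Optional, Tuple

KnowledgeBase = DefaultDict[FrozenSet[str], Counter]


def learn(episodes: Iterable[List[str]]) -> KnowledgeBase:
    """Count, per alert pattern (set of alerting CIs), how often each CI alerted first."""
    kb: KnowledgeBase = defaultdict(Counter)
    for ordered_cis in episodes:
        if ordered_cis:
            # Assumption: the first CI to alert in an episode is its root cause.
            kb[frozenset(ordered_cis)][ordered_cis[0]] += 1
    return kb


def query(kb: KnowledgeBase, open_alert_cis: Iterable[str]) -> Optional[Tuple[str, float]]:
    """Return (most likely root CI, relative frequency) for a known pattern, else None."""
    roots = kb.get(frozenset(open_alert_cis))
    if not roots:
        return None  # pattern never seen in past alerts
    ci, hits = roots.most_common(1)[0]
    return ci, hits / sum(roots.values())


# Hypothetical history: the same three-CI pattern occurred three times.
past_episodes = [
    ["db01", "app01", "web01"],
    ["db01", "web01", "app01"],
    ["app01", "db01", "web01"],
]
kb = learn(past_episodes)
result = query(kb, ["web01", "app01", "db01"])
```

In this toy history `db01` alerted first in two of three matching episodes, so the query names `db01` with a score of about 0.67. Adding more episodes refines the frequencies, which mirrors how a growing alert knowledge base improves precision.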

Business services are discovered by Service Mapping and are represented internally in the system by a service model. The service model of the business service is used for identifying CIs related to the root cause CI.
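As a rough illustration of using a service model to find CIs related to a root cause CI, the sketch below treats the model as a plain dependency map and walks it breadth-first. The map format, the function name `related_cis`, and the CI names are assumptions made for this example; the internal service model is not specified in this document.

```python
from collections import deque
from typing import Dict, List


def related_cis(depends_on: Dict[str, List[str]], root_ci: str) -> List[str]:
    """List CIs that depend, directly or transitively, on root_ci (nearest first)."""
    # Invert the "depends on" edges so we can walk from a CI to its dependents.
    dependents: Dict[str, List[str]] = {}
    for ci, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(ci)

    seen = {root_ci}
    queue = deque([root_ci])
    ordered: List[str] = []
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                ordered.append(nxt)
                queue.append(nxt)
    return ordered


# Hypothetical service model: each CI maps to the CIs it depends on.
service_model = {
    "lb01": ["web01"],
    "web01": ["app01"],
    "app01": ["db01"],
    "cache01": [],
}
affected = related_cis(service_model, "db01")
```

Here a root cause on `db01` relates to `app01`, `web01`, and `lb01`, while `cache01` is unaffected because nothing links it to the database.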

When the root cause CI in a service is known, operators can create a single incident ticket and engage only the IT operator who is needed, which expedites remediation. The IT operator can then focus troubleshooting on removing the root cause and preventing the undesirable events from recurring.

RCA-related properties

By default, RCA is not applied for business services and for manual services. You can enable RCA and modify other RCA-related behavior by changing the settings of the sa_analytics.aggregation.include_service and sa_analytics.rca_enabled properties.

RCA configurations

RCA uses an RCA configuration that filters and scopes the alerts to be analyzed. The base system includes pre-defined RCA configurations which might not be optimal in every environment. See RCA configurations for more information.

Confidence score

To help you decide where to invest troubleshooting efforts, RCA algorithms calculate a confidence score for the identified root cause CI. The confidence score is based on the Learner data and expresses how certain the identification is. For example, a confidence score of 75% means that RCA is 75% certain that the identified CI is the root cause. If more than one cause is possible, you can investigate the most likely root cause before investigating less likely root causes.

By default, RCA groups are listed regardless of their confidence score. To limit which groups are listed, set the sa_analytics.rca.query_probabality_threshold property to the percentage that an RCA group confidence score must meet for the group to be listed. If a root cause CI has a confidence score that is lower than the specified percentage, Service Analytics does not treat that CI as the root cause.
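The effect of the threshold can be sketched in a few lines of Python. This is an illustration of the filtering rule only. The `RcaGroup` class, the `list_groups` function, and the sample scores are hypothetical, and the sketch assumes that a score equal to the threshold meets it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RcaGroup:
    root_cause_ci: str
    confidence: float  # confidence score as a percentage, 0-100


def list_groups(groups: List[RcaGroup], threshold_percent: float = 0.0) -> List[RcaGroup]:
    """Keep groups whose score meets the threshold, most confident first."""
    kept = [g for g in groups if g.confidence >= threshold_percent]
    return sorted(kept, key=lambda g: g.confidence, reverse=True)


# Hypothetical RCA groups for one service.
groups = [
    RcaGroup("db01", 75.0),
    RcaGroup("app01", 40.0),
    RcaGroup("web01", 90.0),
]
shown = list_groups(groups, threshold_percent=50.0)
```

With a threshold of 50, only the `web01` and `db01` groups remain, ordered so that the most likely root cause is investigated first. With the default threshold of 0, all three groups are listed.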

View root cause CIs

You have these options for viewing root cause CIs in the Event Management dashboard.

In a business service map
Displays root cause CIs highlighted in a business service map, and the relationships between the root cause CI, alerts, and related CIs in business services.
In impacted services
Displays all automated alert groups that are associated with business services. You can drill down to view details about the alerts in the group, and the root cause CI if it exists. Double-click a group, and then click the Impacted Services tab to display the services and root cause CIs if applicable.