The Reactive SLA Problem

For most MSPs running ServiceNow, the SLA dashboard tells them about breaches after they have already happened. The notification fires when has_breached flips to true. The credit gets calculated. The customer success manager writes the apology. The post-mortem identifies a routing rule that has been broken for three weeks. The breach was preventable, but only with hours of warning that nobody was watching for.

SLA Guardian is one of the six scheduled MSP Agents inside SnowCoder. It is built specifically to flip that pattern. It runs on a schedule against every connected instance, scans task_sla, models the trajectory of every active SLA against its business schedule, and surfaces the tickets that will breach if nothing changes. This article unpacks the patterns it uses.

Pattern 1: Burn Rate on task_sla

The simplest pre-breach signal lives on task_sla itself. The business_percentage field tracks how much of the SLA window has elapsed against the attached schedule, while the percentage field tracks elapsed wall-clock time. SLA Guardian inverts the dashboard logic and asks the opposite question: not which SLAs have breached, but which are burning faster than the work to close them is progressing.

A simple version of the query that drives the agent looks like this. The agent runs a richer version that joins to the parent task and the assignment group, but the core pattern is the same.

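// Find active, unbreached SLAs that have used at least 70% of their business time.
var sla = new GlideRecord('task_sla');
sla.addQuery('active', true);
sla.addQuery('stage', 'in_progress');
sla.addQuery('has_breached', false);
// business_percentage is schedule-aware; percentage is wall-clock elapsed time.
sla.addQuery('business_percentage', '>=', 70);
sla.orderByDesc('business_percentage');
sla.query();

while (sla.next()) {
    var task = sla.task.getRefRecord();
    // Skip SLA rows whose parent task can no longer be read.
    if (!task.isValidRecord()) continue;

    gs.info(
        task.getValue('number') + ' | ' +
        sla.getDisplayValue('sla') + ' | ' +
        sla.getValue('business_percentage') + '% of business time elapsed | ' +
        'state=' + task.getValue('state')
    );
}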

The 70 percent threshold is configurable per contract tier. The agent treats Platinum customers more aggressively than Bronze ones, with earlier thresholds and tighter escalation windows.
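
As a rough illustration of per-tier thresholds, the sketch below maps a contract tier to a threshold and checks an SLA against it. It is plain JavaScript rather than the agent's own code. The threshold numbers, the TIER_THRESHOLDS table, and the function names are all assumptions made for the example; only the Platinum and Bronze tier names come from the text above.

```javascript
// Hypothetical per-tier thresholds (percent of business time elapsed).
// The numbers are illustrative, not SLA Guardian's real configuration.
var TIER_THRESHOLDS = {
    platinum: 50,
    bronze: 70
};
var DEFAULT_THRESHOLD = 70;

// Returns the threshold for a tier, falling back to the default for unknown tiers.
function thresholdForTier(tier) {
    var key = String(tier || '').toLowerCase();
    return Object.prototype.hasOwnProperty.call(TIER_THRESHOLDS, key)
        ? TIER_THRESHOLDS[key]
        : DEFAULT_THRESHOLD;
}

// True when an SLA has consumed at least its tier's share of business time.
function isAtRisk(tier, businessPercentage) {
    return Number(businessPercentage) >= thresholdForTier(tier);
}
```

With these example values, an SLA at 55 percent would be flagged for a Platinum customer but not for a Bronze one.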

Pattern 2: Business Hours Math, Not Wall-Clock Math

The mistake most home-grown SLA monitors make is using wall-clock time. A ticket that opens at 4pm on Friday with a 24-hour resolution SLA does not breach at 4pm Saturday. It breaches based on the attached schedule, which probably excludes the weekend. Reading the wrong percentage field is one of the most common reasons internal dashboards lie to MSP service managers.
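
To make the Friday example concrete, here is a minimal sketch in plain JavaScript that compares wall-clock and schedule-aware elapsed percentages. The Monday to Friday, 09:00 to 17:00 UTC schedule, the dates, and every function name are assumptions for illustration. On a real instance the platform's schedule engine performs this calculation and populates business_percentage.

```javascript
var HOUR_MS = 60 * 60 * 1000;
var DAY_MS = 24 * HOUR_MS;

// Hypothetical schedule: Monday to Friday, 09:00 to 17:00, evaluated in UTC.
var WEEKDAY_SCHEDULE = { days: [1, 2, 3, 4, 5], startHour: 9, endHour: 17 };

// Milliseconds of scheduled (business) time between two Date objects.
function businessMsBetween(start, end, schedule) {
    var total = 0;
    // Walk one UTC day at a time, starting from the day that contains start.
    var day = Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate());
    for (; day < end.getTime(); day += DAY_MS) {
        if (schedule.days.indexOf(new Date(day).getUTCDay()) === -1) continue;
        // Clip the day's working window to the [start, end] interval.
        var from = Math.max(day + schedule.startHour * HOUR_MS, start.getTime());
        var to = Math.min(day + schedule.endHour * HOUR_MS, end.getTime());
        if (to > from) total += to - from;
    }
    return total;
}

// Percentage of an SLA duration consumed, by wall clock and by schedule.
function elapsedPercentages(start, now, slaHours, schedule) {
    var slaMs = slaHours * HOUR_MS;
    return {
        wallClock: 100 * (now.getTime() - start.getTime()) / slaMs,
        business: 100 * businessMsBetween(start, now, schedule) / slaMs
    };
}

// Ticket opened Friday 16:00 UTC, checked Saturday 16:00 UTC, 24-hour SLA.
var opened = new Date(Date.UTC(2024, 2, 1, 16, 0, 0));
var checked = new Date(Date.UTC(2024, 2, 2, 16, 0, 0));
var pct = elapsedPercentages(opened, checked, 24, WEEKDAY_SCHEDULE);
// pct.wallClock is 100, while pct.business counts only the one scheduled hour on Friday.
```

A monitor reading the wall-clock figure would report this ticket as breached on Saturday, while under the assumed schedule it has used roughly four percent of its window.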

SLA Guardian uses the business_percentage field, which accounts for the schedule, and cross-checks the schedule reference on each task_sla record to make sure the SLA is running against the right business hours record. If a ticket falls under a global support contract but is attached to a regional schedule, the agent flags the misconfiguration as part of its weekly summary.

That cross-check is also how SLA Guardian catches a class of silent defect: SLAs defined against the wrong schedule that look fine on the dashboard but consistently breach when the timer is recalculated at the end of the month.
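
The cross-check can be pictured as a comparison between the schedule attached to each SLA and the schedule its contract is expected to use. The sketch below works on plain objects, not GlideRecords. The row keys (number, contract, schedule) and the expectedScheduleByContract lookup are hypothetical names chosen for the example, and how the agent actually derives the expected schedule is not described here.

```javascript
// Returns the SLA rows whose attached schedule differs from the one expected
// for their contract. All keys are hypothetical plain-object fields.
function findScheduleMismatches(slaRows, expectedScheduleByContract) {
    var mismatches = [];
    for (var i = 0; i < slaRows.length; i++) {
        var row = slaRows[i];
        var expected = expectedScheduleByContract[row.contract];
        // Contracts without a known expectation are skipped, not flagged.
        if (expected === undefined) continue;
        if (row.schedule !== expected) {
            mismatches.push({
                number: row.number,
                attached: row.schedule,
                expected: expected
            });
        }
    }
    return mismatches;
}
```

Skipping contracts with no known expectation keeps the report limited to definite mismatches, which suits a weekly summary that people are expected to act on.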

Pattern 3: Assignment Group Heat Maps

One pre-breach ticket is a problem for a service manager. Twenty pre-breach tickets all pointing at the same assignment group is a problem for the head of operations. SLA Guardian rolls every flagged ticket up to its assignment group and produces a daily heat map. The agent flags groups where the count of at-risk tickets is more than two standard deviations above their thirty-day average.

The heat map is the early warning system for capacity problems that nobody on the floor has yet raised. The agent does not just say group X is in trouble. It can also join to the customer's roster, on-call, or HR absence sources when those tables are available, so the duty manager can see if the group is short-staffed. A simplified version of the roll-up behind the heat map looks like this.

// Count at-risk SLAs per assignment group.
var counts = {};
var sla = new GlideRecord('task_sla');
sla.addQuery('active', true);
sla.addQuery('has_breached', false);
// Use the schedule-aware percentage, consistent with Pattern 2.
sla.addQuery('business_percentage', '>=', 70);
sla.query();

while (sla.next()) {
    var task = sla.task.getRefRecord();
    if (!task.isValidRecord()) continue;

    var groupId = task.getValue('assignment_group');
    // Unassigned tasks cannot be attributed to a group.
    if (!groupId) continue;

    counts[groupId] = (counts[groupId] || 0) + 1;
}

// Resolve each group sys_id to its display name and report the count.
for (var id in counts) {
    if (!counts.hasOwnProperty(id)) continue;

    var group = new GlideRecord('sys_user_group');
    if (!group.get(id)) continue;

    gs.info(
        group.getDisplayValue() + ' has ' +
        counts[id] +
        ' at-risk SLAs'
    );
}
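
The two-standard-deviation rule described earlier in this section can be sketched in plain JavaScript as follows. It assumes the thirty daily counts for a group are already available as an array of numbers. Using the population standard deviation is a choice made for this example, and the function names are hypothetical.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) sum += values[i];
    return sum / values.length;
}

// Population standard deviation of a non-empty array of numbers.
function populationStdDev(values) {
    var avg = mean(values);
    var squares = 0;
    for (var i = 0; i < values.length; i++) {
        squares += (values[i] - avg) * (values[i] - avg);
    }
    return Math.sqrt(squares / values.length);
}

// True when today's at-risk count is more than two standard deviations
// above the historical average. With no history, nothing is flagged.
function isHotGroup(todayCount, history) {
    if (!history || history.length === 0) return false;
    return todayCount > mean(history) + 2 * populationStdDev(history);
}
```

One property worth knowing: a group whose history is perfectly flat has a standard deviation of zero, so any count above its average is flagged. A production version might add a minimum absolute count to avoid noise from very quiet groups.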

Pattern 4: Intervention Routing

Detection without action is just a nicer dashboard. SLA Guardian is built to route interventions. For each flagged ticket, the agent generates a recommended action and the role that should own it. The actions are deliberately small and reversible.

Reassign: The current assignee is on leave, has not logged in for 48 hours, or is overloaded. Suggest the next eligible group member.
Re-priority: The ticket priority was set lower than its impact and urgency combination warrants. Suggest the corrected value.
Escalate: The ticket needs management visibility. Suggest the assignment group manager and the customer success owner.
Vendor chase: The ticket is in "Awaiting vendor" state. Suggest a vendor-side follow-up with the last update timestamp.
Customer follow-up: The ticket is in "Awaiting caller" with no contact in three business days. Suggest a structured re-engagement.
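
One way to picture the routing is as a small rule function that maps a flagged ticket to a single recommendation. The sketch below uses plain objects, and every field name on the ticket (state, businessDaysSinceContact, assigneeOnLeave, hoursSinceAssigneeLogin, assigneeOverloaded, impact, urgency, priority) is hypothetical. The order in which the rules are checked, and the use of Escalate as the fallback, are assumptions of this example and not a description of the agent's logic. The priority table mirrors the common default impact and urgency lookup, which individual instances may customize.

```javascript
// Assumed default priority lookup: PRIORITY_MATRIX[impact][urgency].
// Instances can customize this, so treat the values as an assumption.
var PRIORITY_MATRIX = {
    1: { 1: 1, 2: 2, 3: 3 },
    2: { 1: 2, 2: 3, 3: 4 },
    3: { 1: 3, 2: 4, 3: 5 }
};

// Expected priority for an impact/urgency pair, or null when unknown.
function expectedPriority(impact, urgency) {
    var row = PRIORITY_MATRIX[impact];
    return row && row[urgency] !== undefined ? row[urgency] : null;
}

// Maps a flagged ticket (plain object, hypothetical keys) to one recommendation.
function recommendAction(t) {
    if (t.state === 'Awaiting vendor') {
        return { action: 'Vendor chase', reason: 'ticket is waiting on the vendor' };
    }
    if (t.state === 'Awaiting caller' && t.businessDaysSinceContact >= 3) {
        return { action: 'Customer follow-up', reason: 'no contact in three business days' };
    }
    if (t.assigneeOnLeave || t.hoursSinceAssigneeLogin >= 48 || t.assigneeOverloaded) {
        return { action: 'Reassign', reason: 'assignee unavailable or overloaded' };
    }
    var expected = expectedPriority(t.impact, t.urgency);
    // A larger priority number means a lower priority.
    if (expected !== null && t.priority > expected) {
        return { action: 'Re-priority', reason: 'priority lower than impact and urgency warrant', suggestedPriority: expected };
    }
    return { action: 'Escalate', reason: 'no smaller intervention applies' };
}
```

Returning a single recommendation per ticket keeps each suggestion small and reversible, in line with the list above.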

Each recommendation is logged. If a service manager accepts a recommendation, the agent records it as a closed-loop intervention. If the manager overrides it, that override becomes a training signal for the next run.

Pattern 5: The Weekly Contract Health View

The same data that drives daily pre-breach detection also rolls up into a weekly contract health view. The agent produces a per-tenant summary that shows breach count, near-miss count, average remaining time on at-risk tickets, and the trend versus the previous four weeks. This is the artifact MSP commercial teams use in their monthly service review meetings.
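
The weekly roll-up can be sketched as a simple aggregation. The example below is plain JavaScript over plain objects. The row keys (hasBreached, businessPercentage, businessMinutesLeft), the definition of a near miss as an unbreached SLA at or above a threshold, and the trend as the difference from the average of the previous weeks' breach counts are all assumptions made for illustration.

```javascript
// rows: SLA records for one tenant and one week (hypothetical keys).
// previousBreachCounts: breach counts for the previous weeks, e.g. four numbers.
// nearMissThreshold: business percentage at or above which an unbreached SLA
// counts as a near miss (assumed definition).
function summarizeWeek(rows, previousBreachCounts, nearMissThreshold) {
    var breaches = 0;
    var nearMisses = 0;
    var minutesLeftTotal = 0;

    for (var i = 0; i < rows.length; i++) {
        var r = rows[i];
        if (r.hasBreached) {
            breaches++;
        } else if (r.businessPercentage >= nearMissThreshold) {
            nearMisses++;
            minutesLeftTotal += r.businessMinutesLeft;
        }
    }

    var previousTotal = 0;
    for (var j = 0; j < previousBreachCounts.length; j++) {
        previousTotal += previousBreachCounts[j];
    }
    var previousAverage = previousBreachCounts.length
        ? previousTotal / previousBreachCounts.length
        : null;

    return {
        breachCount: breaches,
        nearMissCount: nearMisses,
        // null when there are no at-risk tickets to average over.
        avgMinutesLeftAtRisk: nearMisses ? minutesLeftTotal / nearMisses : null,
        // Positive means more breaches than the recent average; null without history.
        breachTrend: previousAverage === null ? null : breaches - previousAverage
    };
}
```

Returning null instead of zero for the missing averages keeps a quiet week distinguishable from a week with no data.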

The view is what makes SLA Guardian an MSP tool rather than a back-office SRE tool. The patterns that prevent the breach are the same patterns that prove value to the customer at renewal.

Why SLA Guardian Lives Inside SnowCoder

SLA Guardian is part of the same Enterprise-tier MSP Agents suite as the Patch Manager, Instance Audit Agent, and Upgrade Readiness Agent. That matters because the patterns above frequently spill across agent boundaries. A surge of at-risk SLAs after a patch landed last weekend is exactly the kind of correlation a single-purpose tool will miss. SnowCoder's scheduled agents share the same instance map and the same knowledge base, so the SLA Guardian recommendation can include a one-line note like "tickets are concentrated in groups affected by the security update applied on Sunday."

That is the difference between a guardian that watches one table and a guardian that watches a fleet.

SLA Guardian Patterns: Preventing ServiceNow Breaches Before They Happen