Outage reporting and customer notifications

OutageKit

Turn outages into trust

OutageKit is a lightweight outage reporting and notification console that centralizes SMS, web, and IVR reports, uses AI to auto-cluster them into incidents, and maps impact live. Built for operations managers at local utilities and ISPs, it broadcasts plain-language ETAs by text, email, and voice, cutting calls by 40–60%, misinformation complaints by 70%, and update delays to under five minutes.

Product Details

Explore this AI-generated product idea in detail. Each aspect has been thoughtfully created to inspire your next venture.

Vision & Mission

Vision
Empower local utilities and ISPs to turn outages into calm transparency, keeping communities informed instantly, accurately, and humanly.
Long Term Goal
Within 5 years, power real-time outage transparency for 5,000 local providers, delivering sub‑5‑minute updates, 60% fewer inbound calls, and 70% fewer misinformation complaints across electricity, water, gas, and broadband.
Impact
For operations managers at local utilities and ISPs, OutageKit cuts inbound outage calls by 40-60%, reduces misinformation complaints by 70%, and shrinks update delays from hours to under 5 minutes, driving 50%+ customer opt-in after the first incident.

Problem & Solution

Problem Statement
Operations managers at small utilities and ISPs drown in outage calls because they lack a simple way to collect SMS/web/IVR reports, cluster incidents, map impact, and push ETAs; costly OMS are complex, while static status pages leave customers guessing.
Solution Overview
OutageKit centralizes SMS, web, and IVR outage reports into one console, auto-grouping them into incidents and pushing segmented text, email, and voice ETAs to affected customers—turning a flood of calls into real-time answers about what’s happening and when.

Details & Audience

Description
OutageKit is a lightweight outage reporting and notification platform that centralizes SMS, web, and IVR reports, visualizes impact on a live map, and broadcasts updates via text, email, and voice—powered by AI that auto-clusters reports into incidents and drafts plain-language updates. Designed for operations leads at small utilities and ISPs, it slashes call volume by 40–60% by pushing timely ETAs and status to affected customers, turning confusion into real-time transparency.
Target Audience
Operations managers (30-55) at local utilities and ISPs, overwhelmed by outage calls, mobile-first transparency champions.
Inspiration
At 2 a.m. in a spring downpour, our street went dark. The utility’s page ticked over every four hours; the ISP looped the same recording. My phone buzzed—neighbors trading guesses on Facebook—while a lone bucket truck zigzagged past, hazard lights blinking. Staring at that pattern, I pictured a live map fed by our texts and calls, auto-grouping incidents and pushing plain-language ETAs. OutageKit began there.

User Personas

Detailed profiles of the target users who would benefit most from this product.

Reliability Director Rhea

- Age 42–50; regional electric/water utility or mid-size ISP operations.
- MBA or engineering undergrad; 12–20 years in reliability/operations leadership.
- Oversees 5–12 managers; service base 80k–500k accounts across mixed geographies.
- Based near HQ; travels to EOCs and board meetings monthly.

Background

Started as a field engineer and was promoted after steering a catastrophic storm response. Built cross-team playbooks to fix ETA confusion. Now accountable for SLAs, public trust, and budgets.

Needs & Pain Points

Needs

1) Single executive dashboard for ETAs, impact, calls.
2) Reliable auto-clusters she can trust at a glance.
3) One-click broadcast approvals with audit trail.

Pain Points

1) Conflicting ETAs create backlash and escalations.
2) Late updates trigger media and regulatory heat.
3) Fragmented tools obscure ownership and accountability.

Psychographics

- Demands measurable outcomes, not hand-waving narratives.
- Prioritizes transparency over perfection during crises.
- Calm under scrutiny; decisive with incomplete data.
- Champions customer trust as the core KPI.

Channels

1) Microsoft Teams — exec updates
2) Outlook — daily briefs
3) Power BI — KPI dashboards
4) LinkedIn — industry pulse
5) Zoom — vendor briefings

Integration Innovator Ian

- Age 29–38; IT/DevOps engineer at utility/ISP operations.
- BS CS/IT; 5–10 years integrating SaaS and on-prem.
- Owns Twilio, IVR, SSO, MDM; on-call during storms.
- Prefers Linux, Terraform, GitHub; automates everything.

Background

Automated a call center using Twilio and webhooks at prior job. Inherited fragile outage scripts that failed during spikes. Now modernizing integrations to be observable and resilient.

Needs & Pain Points

Needs

1) Clear REST/webhook docs and tested examples.
2) SSO, RBAC, and SCIM provisioning.
3) Delivery receipts and retries for SMS/IVR.

Pain Points

1) Undocumented rate limits during incident spikes.
2) Opaque IVR failures without traceability.
3) Breaking changes across API versions.

Psychographics

- Automate boring work; scripts over spreadsheets.
- Trusts clear docs, tests, and versioned APIs.
- Security-first mindset; least privilege always.
- Measures success with clean, actionable logs.

Channels

1) GitHub — sample code
2) Slack — developer community
3) Twilio Console — messaging monitoring
4) ServiceNow — integration tickets
5) Stack Overflow — troubleshooting

Municipal Coordinator Maya

- Age 35–48; Emergency Management in mid-size city/county EOC.
- BA Emergency Management; ICS certified; 8–15 years experience.
- Coordinates police, fire, public works; joint information center lead.
- Uses WebEOC, ArcGIS, Everbridge; 24/7 duty rotations.

Background

After an ice storm stranded neighborhoods, she built data-sharing MOUs with utilities. Previously wrangled conflicting updates across hotlines and social media. Now formalizes common operating pictures before storms hit.

Needs & Pain Points

Needs

1) Real-time public map with accessible legends.
2) Machine-readable feed for WebEOC dashboards.
3) Consistent ETAs for media briefings.

Pain Points

1) Conflicting reports from different channels.
2) Delayed restoration info stalls evacuations.
3) Agency calls lost in overwhelmed queues.

Psychographics

- Public safety over politics, every time.
- Clarity and timestamped facts beat speed.
- Collaborates relentlessly; hates siloed updates.
- Plans for worst, communicates for calm.

Channels

1) ArcGIS Online — situational layers
2) WebEOC — EOC dashboards
3) Everbridge — regional alerts
4) X — public updates
5) Outlook — interagency coordination

GIS Guru Grace

- Age 30–43; GIS Analyst within operations or asset management.
- GISP certified; 6–12 years with Esri stack.
- Manages territories, address locators, and outage layers.
- Supports 3–6 ops teams across districts.

Background

Built internal geocoders to fix rural address quirks. Spent nights reconciling shapefiles after vendor imports drifted. Now demands repeatable geospatial workflows with guardrails.

Needs & Pain Points

Needs

1) High-accuracy geocoding with local overrides.
2) Easy GeoJSON/shapefile import-export.
3) Editable cluster boundaries with change history.

Pain Points

1) Address mismatches inflating impact counts.
2) Polygon drift after recurring imports.
3) Manual dedupe across disparate datasets.

Psychographics

- Precision fanatic; zero tolerance for sloppy layers.
- Defaults to automation over manual edits.
- Obsessed with reproducible, documented processes.
- Communicates maps as stories for operators.

Channels

1) ArcGIS Pro — editing
2) ArcGIS Online — publishing
3) Esri Community — solutions
4) Slack — ops coordination
5) Outlook — change approvals

Experience Analyst Alex

- Age 27–36; CX/Analytics at utility/ISP; ex-contact center.
- BA/BS analytics or comms; 4–8 years experience.
- Partners with PR, NOC, and call center leads.
- Tools: Power BI/Tableau, Salesforce/Zendesk, Excel.

Background

Cut churn by preemptive messaging at a previous ISP. Built first call-deflection model during wildfire season. Now standardizes KPI definitions across teams.

Needs & Pain Points

Needs

1) Calls vs. broadcasts correlation by segment.
2) ETA accuracy and update latency metrics.
3) Export-ready datasets for BI tools.

Pain Points

1) Siloed IVR, SMS, CRM datasets.
2) No shared definition of deflection.
3) Slow access to message timelines.

Psychographics

- Customer-first lens; human outcomes drive metrics.
- Suspicious of vanity KPIs without context.
- Storyteller with evidence and clear visuals.
- Craves near-real-time, trustworthy signal.

Channels

1) Power BI — dashboards
2) Salesforce — case data
3) Zendesk — ticket trends
4) X — rumor tracking
5) Teams — cross-functional sync

Owner-Operator Owen

- Age 34–55; runs a 2–15-person rural/suburban ISP.
- Serves 2k–20k subscribers; mixed fiber and fixed wireless.
- No dedicated NOC; outsources some engineering.
- Budget sensitive; prefers month-to-month tools.

Background

Built the network himself and learned support by necessity. Storms once tripled cancellations after a misinformation spiral. Now invests in clear, fast updates over fancy features.

Needs & Pain Points

Needs

1) One-click outage page and SMS blasts.
2) Mobile-friendly broadcast approvals and edits.
3) Transparent pricing without long contracts.

Pain Points

1) After-hours call avalanches swamp tiny teams.
2) Confusing UIs slow critical actions.
3) Contract lock-ins strain cash flow.

Psychographics

- Pragmatic fixer; time is the scarcest resource.
- Prefers simple, dependable tools over complex suites.
- Communicates plainly; avoids technical jargon.
- Loyal to vendors who pick up phones.

Channels

1) Facebook Pages — community updates
2) Gmail — customer notices
3) X — quick alerts
4) YouTube — how-to guides
5) Stripe — billing status

Product Features

Key capabilities that make this product valuable to its target users.

Dual-Approver Flow

Requires two distinct approvers for mass updates and ETR changes, presenting side-by-side diffs, audience counts, and ETA deltas before confirmation. Prevents fat‑finger blasts, enforces shared accountability, and makes high‑stakes sends safer without slowing teams down.

Requirements

Two-Approver Gate for High-Risk Actions
"As an operations manager, I want a mandatory second approver for mass updates and ETR changes so that we reduce erroneous blasts and enforce shared accountability."
Description

Implements a mandatory two-approver checkpoint for high-risk actions, specifically mass outbound updates and ETR/ETA changes on incidents. The system creates an approval artifact with a cryptographic payload fingerprint capturing message content, audience filters, channels, delivery options, and time estimates. The first approver submits the action into a pending state; a second distinct user must approve before execution. Approvals are enforced consistently across web console, mobile web, and API, preventing circumvention. Rejections cancel the request with a reason, and any payload change invalidates prior approvals and restarts the flow. The feature surfaces real-time status, notifies the second approver via SMS/email/console, and blocks send until quorum is met, ensuring safety without adding unnecessary delay.
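The lifecycle described above (submit into a pending state, approve only by a distinct second user, invalidate on any payload change, reject with a reason) can be sketched as a small state machine. This is a hypothetical illustration; the class and method names are invented, and a real implementation would persist the approval artifact and enforce these checks server-side across web, mobile web, and API.

```python
from copy import deepcopy

class ApprovalRequest:
    """Sketch of the two-approver gate: submit -> Pending Approval -> execute/reject."""

    def __init__(self, submitter: str, payload: dict):
        self.submitter = submitter
        self.payload = payload              # live payload, may be edited later
        self.snapshot = deepcopy(payload)   # what was actually submitted for review
        self.status = "PENDING_APPROVAL"

    def approve(self, user: str) -> None:
        if self.status != "PENDING_APPROVAL":
            raise PermissionError("Request cancelled")
        if user == self.submitter:
            raise PermissionError("Second approver must be distinct")
        if self.payload != self.snapshot:
            # Any payload edit invalidates the pending submission and restarts the flow.
            self.snapshot = deepcopy(self.payload)
            raise PermissionError("Approval invalidated due to payload change")
        self.status = "EXECUTED"

    def reject(self, user: str, reason: str) -> None:
        if self.status != "PENDING_APPROVAL":
            raise PermissionError("Request cancelled")
        if not reason:
            raise ValueError("A non-empty rejection reason is required")
        self.status = "REJECTED"
```

A production version would compare cryptographic fingerprints rather than raw payloads and would record every transition in the audit trail.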

Acceptance Criteria
Submit Mass Update Requires Second Approver Before Send
- Given a user initiates a mass outbound update or an incident ETR/ETA change, When they submit the action, Then the system creates an approval artifact with a cryptographic payload fingerprint capturing message content, audience filters, channels, delivery options, and time estimates, And the request enters Pending Approval state and is not executed.
- Given the request is Pending Approval, When the original submitter attempts to execute or schedule delivery, Then the system blocks the send and displays “Awaiting second approver”.
- Given the request is Pending Approval, When a second distinct user (different user ID from the submitter) opens the approval screen, Then they are shown a review view including preview content and a side-by-side comparison against the last published state with audience counts and ETA/ETR deltas.
- Given the request is Pending Approval, When the same user who submitted attempts to approve, Then the system prevents approval and shows “Second approver must be distinct”.
- Given the request is Pending Approval, When a second distinct user approves, Then the system immediately executes the action (or schedules it as configured), records the approval, and transitions the request to Executed state.
Cryptographic Payload Fingerprint Creation and Validation
- Given a high-risk action is submitted, Then the system generates a deterministic cryptographic fingerprint from a canonicalized payload that includes message content, audience filters, channels, delivery options, and time estimates, And displays the fingerprint to both approvers.
- Given an approved request is about to execute, When the system recomputes the fingerprint, Then it must exactly match the stored fingerprint; otherwise, the send is blocked, the request returns to Pending Approval with zero approvals, and both parties are notified of the mismatch.
- Given a request is submitted or executed, Then the fingerprint and a full payload snapshot are stored immutably for audit and are retrievable via API and console.
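A deterministic fingerprint over a canonicalized payload, as this criterion requires, might look like the sketch below. The helper name is invented; a production canonicalization would also need to pin encodings for numbers, timestamps, and nested ordering.

```python
import hashlib
import json

def payload_fingerprint(payload: dict) -> str:
    """SHA-256 over a canonical JSON form of the payload.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace, so semantically identical payloads always
    produce the same fingerprint.
    """
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to message content, audience, channels, delivery options, or time estimates alters the canonical form and therefore the digest, which is what lets the system detect drift before execution.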
Payload Change Invalidates Prior Approvals
- Given a request is Pending Approval or Approved for execution, When any field within message content, audience filters, channels, delivery options, or time estimates is modified by any user or integration, Then all prior approvals are invalidated, the approval count resets to zero, the fingerprint is regenerated, and the status returns to Pending Approval.
- Given approvals were invalidated by a payload change, Then any prior approval buttons become disabled in all clients, and all approvers receive a notification that re-approval is required.
- Given approvals were invalidated, When an execution is attempted via API or UI, Then the system rejects the attempt with an explicit “Approval invalidated due to payload change” error.
Rejection Cancels Request With Reason
- Given a request is Pending Approval, When a second approver selects Reject and enters a non-empty reason, Then the system sets the request to Rejected state, prevents execution, and records the reason on the approval artifact.
- Given a request is Rejected, Then the original submitter and watchers are notified via console, email, and SMS (if configured) with the rejection reason, and the request cannot be re-opened; a new submission is required for any subsequent attempt.
- Given a request is Rejected, When any client (web, mobile web, API) attempts to approve or execute it, Then the system denies the action with a “Request cancelled” error and logs the attempt for audit.
Enforcement Across Web, Mobile Web, and API
- Given a high-risk action is initiated from the web console, mobile web, or API, When the user or integration attempts to execute without two distinct approvals, Then the system blocks the action consistently across all surfaces; the UI disables Send and the API responds with HTTP 409 and error code APPROVAL_REQUIRED.
- Given an integration attempts to bypass the gate via undocumented parameters, headers, or elevated roles, Then the system still enforces the two-approver requirement and returns APPROVAL_REQUIRED.
- Given a non-high-risk action is executed, Then the two-approver gate is not invoked, and the action proceeds normally, demonstrating scoped enforcement.
Real-Time Status Surfacing and Notifications
- Given a high-risk action is submitted, Then the request status updates to Pending Approval and is visible to the submitter and approvers in the console and via API within 5 seconds.
- Given a high-risk action is submitted, Then the designated second approver receives notification via console, email, and/or SMS per their preferences within 60 seconds, containing a deep link to the approval screen.
- Given the second approver approves or rejects, Then the request status transitions (Approved→Executed or Rejected→Cancelled) are reflected in the console and API within 5 seconds, and the submitter receives confirmation notifications.
- Given the request is awaiting approval, Then the send button remains disabled and any scheduled time is held until quorum is met or the request is rejected/cancelled.
Immutable Audit Trail and Non-Repudiation
- Given any high-risk action progresses through submit, approve, reject, modify, or execute events, Then the system records an immutable audit trail including timestamps, user IDs, auth context (SSO provider), client surface (web/mobile/API), IPs, fingerprint, prior/updated statuses, and any rejection reason.
- Given an auditor queries the system via console or API, Then they can retrieve the complete approval artifact and event history for a request and export it (CSV/JSON) without the ability to alter records.
- Given an attempt is made to modify or delete audit records, Then the system prevents the change, logs the attempt, and surfaces an administrative alert.
Side-by-Side Payload Diff Review
"As an approver, I want a clear side-by-side diff of the update so that I can verify exactly what will change before I approve."
Description

Provides a clear, side-by-side visual diff of the proposed change versus the current state, covering message text, templates after variable resolution, IVR voice transcript, language variants, throttling and suppression rules, and delivery options. Additions and removals are highlighted with color-coded markers and inline ETA/ETR before-and-after timestamps with localized time zones. For structured data (JSON payloads for API-driven sends), collapsible field-level diffs are shown. The diff view loads under two seconds for typical payload sizes, supports keyboard navigation, and is accessible (screen-reader friendly with ARIA annotations). This view is presented to both approvers and is snapshotted into the approval artifact to ensure the reviewed content matches what is ultimately sent.
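The field-level diff for structured payloads could be computed along these lines. This is an illustrative sketch: `json_diff` is an invented helper, real payloads would also need array-aware diffing, and the UI layer would render the resulting paths as collapsible nodes with added/removed/modified badges.

```python
def json_diff(old: dict, new: dict, path: str = "") -> list:
    """Return a flat list of (json_path, change, old_value, new_value) tuples."""
    changes = []
    for key in sorted(set(old) | set(new)):
        p = f"{path}.{key}" if path else key
        if key not in old:
            changes.append((p, "added", None, new[key]))
        elif key not in new:
            changes.append((p, "removed", old[key], None))
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            changes.extend(json_diff(old[key], new[key], p))  # recurse into objects
        elif old[key] != new[key]:
            changes.append((p, "modified", old[key], new[key]))
    return changes
```

Emitting full JSON paths is what makes the breadcrumb display and "next/previous change" navigation straightforward: the UI only has to walk the flat change list.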

Acceptance Criteria
p95 Diff Load Performance Under 2 Seconds
Given a typical payload (<=200 KB aggregated JSON, <=6 language variants, <=5,000 characters per message, <=200 structured fields) When an approver opens the Side-by-Side Diff view Then the first meaningful paint occurs ≤1,200 ms (p50) and ≤2,000 ms (p95); And the view is interactive ≤1,800 ms (p50) and ≤2,000 ms (p95) And expand/collapse and next/previous change actions respond ≤100 ms median on baseline hardware (4-core CPU, 8 GB RAM) and network (≤100 ms RTT)
Comprehensive Visual Diff Across All Payload Elements
Given proposed changes to any of: message text, post-resolution templates, IVR transcript (including SSML), language variants, throttling, suppression rules, or delivery options When the diff renders Then additions are marked with green "+", removals with red "-", and modifications aligned side-by-side with a visible legend And change detection achieves F1-score ≥0.99 against the regression corpus And color-blind safe secondary indicators (icons/patterns) are present And unchanged sections are deemphasized but not hidden unless explicitly collapsed
Collapsible Field-Level JSON Diff for API Payloads
Given an API-driven payload with nested JSON up to depth 8 and up to 1,000 fields When viewing the diff Then each object/array field shows added/removed/modified badges and can be individually expanded/collapsed And "Expand all changes" and "Collapse all" controls are available And expanding a collapsed node completes ≤150 ms median And a "Hide unchanged" filter is available And breadcrumbs display the full JSON path of the focused field
Localized ETA/ETR Before-and-After with Deltas
Given any change that impacts ETA/ETR values When the diff renders Then before-and-after timestamps appear inline adjacent to the affected content And times are formatted per the approver’s profile locale and timezone, including TZ abbreviation and UTC offset And hovering reveals the incident timezone And the delta is displayed as +/−HH:MM (e.g., +00:30) And automated i18n tests validate formatting across ≥10 locales with 0 critical issues
Full Keyboard Navigation and Shortcuts
Given the approver is using keyboard only When navigating the diff Then Tab/Shift+Tab follow reading order without trapping focus And Up/Down arrows move between lines/fields; Enter toggles expand/collapse And "n"/"p" jump to next/previous change; "?" opens shortcut help And visible focus meets WCAG 2.4.7 Focus Visible And all interactions are possible without a pointing device
Screen-Reader Accessible Diff with ARIA and WCAG
Given NVDA (Windows), JAWS (Windows), and VoiceOver (macOS) are used When reading the diff Then change markers announce role and state (e.g., "added text", "removed") And the diff uses appropriate ARIA roles (grid/treegrid) with labels and descriptions And expand/collapse states are announced And color is not the sole indicator; contrast ratios are ≥4.5:1 And axe-core and WAVE scans report 0 serious/critical violations
Immutable Approval Snapshot and Drift Prevention
Given the diff is displayed for approval When Approver A submits approval Then a content-addressed snapshot of the exact diff state and underlying payload (including content hash, approver ID, timestamp, locale/timezone) is stored And when Approver B opens the approval, the system verifies the pending payload hash equals the snapshot And if any mismatch is detected, approvals are blocked and a refresh is required And the artifact is immutable and available via audit UI/API for ≥365 days And upon broadcast the sent payload hash equals the approved snapshot hash
Audience Size and Segment Preview
"As an approver, I want to see audience size and segment breakdown so that I can confirm the scope and avoid over- or under-notifying customers."
Description

Calculates and displays audience impact prior to approval, including total targeted recipients and breakdown by channel (SMS, email, voice), segment, and geography. Counts are de-duplicated across channels and reflect live suppression lists (opt-outs, bounces), quiet hours, and throttling policies. The preview includes estimated delivery windows and concurrency limits, flags unusually large sends relative to historical baselines, and links to a sampled list (privacy-safe) for spot checks. Audience metrics are recomputed on any change and are snapshotted with the approval to provide evidence of intended scope.
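The de-duplication rule (per-channel counts after suppression, with the total as the union across channels) can be sketched as follows. The helper name and data shapes are hypothetical; a production version would first normalize identifiers, e.g. E.164 for phone numbers and case-folded email addresses.

```python
def audience_totals(reachable: dict, suppressed: dict):
    """Compute per-channel and de-duplicated total recipient counts.

    reachable:  {channel: set of recipient ids targeted on that channel}
    suppressed: {channel: set of recipient ids opted out / bounced there}
    """
    # Per-channel counts exclude recipients suppressed on that channel.
    per_channel = {ch: ids - suppressed.get(ch, set())
                   for ch, ids in reachable.items()}
    # The total is the union of post-suppression sets, so a recipient
    # reachable by both SMS and email is counted once.
    total = set().union(*per_channel.values()) if per_channel else set()
    return {ch: len(ids) for ch, ids in per_channel.items()}, len(total)
```

A recipient suppressed on every targeted channel falls out of the union automatically, which matches the intent that they not appear in the total at all.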

Acceptance Criteria
Real-time Audience Breakdown Preview
Given a mass update or ETR change draft with selected segments, channels (SMS, email, voice), and geographies And tenant configurations for suppression lists, quiet hours, and throttling exist When an approver opens the Audience Preview pane Then the UI displays total targeted unique recipients and counts by channel, segment, and geography And the total unique recipient count equals the union of all reachable recipients across targeted channels after normalization and suppression And each breakdown count represents unique recipients within that slice And the preview renders within 2 seconds for audiences up to 500,000 unique recipients and within 6 seconds for up to 2,000,000 unique recipients at the 95th percentile
Cross-Channel De-duplication and Suppression Integrity
Given recipients may appear on multiple channels and normalization rules are configured (E.164 for phone, case-insensitive for email) And live suppression lists include opt-outs and bounces per channel When the audience preview computes counts Then per-channel counts exclude recipients suppressed on that channel And the total unique recipient count equals |SMS_reachable ∪ email_reachable ∪ voice_reachable| And recipients suppressed on all targeted channels are excluded from the total unique recipients And updates to suppression (e.g., a new opt-out) are reflected in counts within 10 seconds
Quiet Hours, Throttling, and Delivery Window Estimation
Given tenant-configured quiet hours by timezone and channel throttling/concurrency limits And the targeted audience spans multiple geographies/timezones When the preview displays delivery estimates Then per-channel estimated delivery windows exclude local quiet hours for each recipient group And windows reflect configured throttling and concurrency limits for each channel And per-channel concurrency limits are displayed alongside the windows And changes to quiet hours or throttling settings trigger a recompute and UI update within 5 seconds
Unusual Audience Size Alert Against Baseline
Given a baseline defined as the median total unique recipients of the last 30 approved sends of the same notification type and region (or global 30-day median if fewer than 10) When the current total unique recipients exceeds 150% of the baseline or is greater than 3 standard deviations above the 90-day mean Then the system displays an Unusually Large Audience alert with baseline, current total, and percent difference And both approvers must acknowledge the alert before approval actions become enabled
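The two trigger conditions above (current total above 150% of the 30-send median baseline, or more than 3 standard deviations above the 90-day mean) amount to a short calculation. This sketch uses invented names and assumes the caller supplies the historical totals.

```python
import statistics

def unusually_large(current: int, last_30_totals: list, ninety_day_totals: list) -> bool:
    """Flag an unusually large audience per the baseline rule above."""
    baseline = statistics.median(last_30_totals)          # median of recent sends
    mean = statistics.mean(ninety_day_totals)
    stdev = (statistics.stdev(ninety_day_totals)
             if len(ninety_day_totals) > 1 else 0.0)      # guard tiny histories
    return current > 1.5 * baseline or current > mean + 3 * stdev
```

For example, with a median baseline of 1,000 recipients the 150% threshold is 1,500, so a 1,600-recipient send trips the alert even if variance is high.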
Privacy-Safe Sampled List for Spot Checks
Given the previewed audience exceeds 100 recipients When an approver selects View Sample Then a deterministic random sample of up to 200 recipient records is displayed And PII is masked (phone shows last 2 digits only; email shows first letter and masked domain; address limited to city and ZIP prefix) And export or download actions are disabled for the sample And the sample includes channel reachability and suppression reason indicators And the sample remains stable for the same filters/session for at least 1 hour or until filters change
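The masking rules for the sampled list (phone shows last 2 digits only; email shows first letter and a masked domain) might be implemented as below. This is a partial sketch with invented function names; it interprets "masked domain" as fully starred and omits the address rule (city plus ZIP prefix).

```python
def mask_phone(phone: str) -> str:
    """Keep only the last 2 digits; everything else becomes '*'."""
    digits = [c for c in phone if c.isdigit()]
    return "*" * (len(digits) - 2) + "".join(digits[-2:])

def mask_email(email: str) -> str:
    """Keep the first letter of the local part; star the rest and the domain."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + "*" * len(domain)
```

Masking at render time, combined with disabled export, keeps the spot-check sample privacy-safe while still letting approvers verify channel reachability and suppression reasons.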
Auto-Recompute on Any Targeting or Schedule Change with Deltas
Given a draft is open with the Audience Preview visible When a user changes any targeting filter, channel selection, geography, message schedule, or ETR Then audience metrics recompute automatically without page reload And a Last computed timestamp updates And recompute completes within 2 seconds for changes affecting less than 10% of the audience and within 6 seconds for changes up to 2,000,000 recipients at the 95th percentile And the UI displays side-by-side deltas (absolute and percentage) versus the immediately prior preview for total, channel, segment, and geography counts
Immutable Snapshot at Dual Approval
Given both approvers are reviewing the same draft in the Dual-Approver flow When the second approver confirms approval Then the system persists an immutable snapshot containing: total unique recipients; per-channel, per-segment, and per-geography counts; suppression breakdowns by reason; quiet-hours exclusion counts; throttling settings; estimated delivery windows; concurrency limits; unusual-size flag status; sample checksum; filter definition hash; last computed timestamp; and approver IDs/timestamps And the snapshot is linked in the audit log, is read-only, and can be retrieved via UI and API within 2 seconds at the 95th percentile And if metrics change between first and second approval, the system requires a refresh and re-acknowledgment so both approvers confirm identical metrics at approval time
Approver Role and Separation of Duties Enforcement
"As a compliance lead, I want enforced separation of duties for approvals so that our controls meet internal policy and regulatory expectations."
Description

Validates that two distinct, authorized users approve each high-risk action, enforcing separation of duties. The system blocks self-approval, prevents the same identity via multiple sessions, and supports tenant-level policy controls (e.g., require approvers from different roles or teams, require the creator to be different from both approvers, enforce MFA at approval time). Integrates with SSO/SCIM for role synchronization and device trust checks. Violations are surfaced with actionable errors, and policy configuration is auditable and versioned per tenant.
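The separation-of-duties checks can be expressed as a pure policy function that returns every violated rule at once. This is a sketch with invented names; the error codes mirror those defined in the acceptance criteria for this feature, and a real system would evaluate the policy server-side against SSO/SCIM-synced roles and teams.

```python
def check_separation(creator: dict, approver_a: dict, approver_b: dict,
                     policy: dict) -> list:
    """Return a list of violated policy error codes (empty list means OK).

    Each user is a dict with "id", "role", and "team" keys; policy flags
    correspond to the tenant-level controls described above.
    """
    errors = []
    if approver_a["id"] == approver_b["id"]:
        errors.append("POL-DUP-001")   # same identity approving twice
    if creator["id"] in (approver_a["id"], approver_b["id"]):
        errors.append("POL-SELF-001")  # creator cannot approve own action
    if policy.get("distinct_roles") and approver_a["role"] == approver_b["role"]:
        errors.append("POL-ROLE-001")  # approvers must hold different roles
    if policy.get("distinct_teams") and approver_a["team"] == approver_b["team"]:
        errors.append("POL-TEAM-001")  # approvers must sit on different teams
    return errors
```

Returning all violations, rather than failing on the first, is what lets the UI surface each actionable error with its code and rule name.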

Acceptance Criteria
Two Approvers and Self-Approval Block
- Given a high-risk action is submitted by user U1, When approvals are collected, Then the system requires approvals from two users U2 and U3 where U2 != U1 and U3 != U1 and U2 != U3 before execution.
- Given user U1 attempts to approve their own submitted action, When U1 clicks Approve, Then the system blocks the approval and displays error code POL-SELF-001 with message "Creator cannot approve this action."
Same Identity via Multiple Sessions Prevention
- Given an approval attempt comes from user account A with IdP subject S or SCIM externalId E, When a second approval is received from user account B with the same S or E, Then the system rejects the second approval as duplicate identity with error code POL-DUP-001 and records both attempts in the audit log.
- Given the same user account attempts to approve twice from different sessions or devices, When the system detects session linkage (same accountId), Then it rejects the duplicate with error code POL-DUP-002.
Cross-Role/Team Separation Policy Enforcement
- Given tenant policy "approvers must have different roles" is enabled, When two approvals are submitted, Then role(approverA) != role(approverB) or the approval is blocked with error POL-ROLE-001 naming the conflicting roles.
- Given tenant policy "approvers must be from different teams" is enabled, When two approvals are submitted, Then teamId(approverA) != teamId(approverB) or the approval is blocked with error POL-TEAM-001.
- Given tenant policy "creator must be different from both approvers" is enabled, When approvals are submitted, Then creatorId != approverAId and creatorId != approverBId or the approval is blocked with error POL-SEP-001.
MFA Step-Up Required at Approval Time
- Given a user attempts to approve a high-risk action, When the user has not completed MFA within the last 5 minutes, Then a step-up MFA challenge is required and the approval is blocked until successfully completed.
- Given the user fails or cancels the MFA challenge, When the retry limit of 3 is exceeded, Then the approval is denied with error POL-MFA-003 and the event is logged with MFA factor and reason.
- Given MFA succeeds, When all other policy conditions are met, Then the approval is accepted.
Device Trust Enforcement for Approvers
- Given tenant policy "require device trust" is enabled, When an approver attempts to approve, Then the device must present a valid, non-expired trust attestation issued within the last 24 hours; otherwise the approval is blocked with error POL-DEV-001 and a remediation link is shown.
- Given a device trust attestation is revoked mid-session, When the approver clicks Approve, Then the system revalidates in real time and blocks with error POL-DEV-002.
SSO/SCIM Role Synchronization and Authorization Freshness
- Given a role change is made in the IdP/SCIM source removing a user's Approver role, When SCIM updates are received, Then the user must lose approval ability within 5 minutes; attempts after propagation are denied with error AUTH-ROLE-403.
- Given a new user is added to an Approver role via SCIM, When the user signs in via SSO, Then the role is recognized at approval time without requiring manual admin action.
- Given the SCIM service is unavailable, When approval is attempted, Then the system uses the last known role state stamped with retrieval time and displays a banner if the snapshot is older than 15 minutes.
Actionable Errors, Policy Versioning, and Auditability
- Given any policy violation occurs, When blocking the approval, Then the UI displays an error code, short message, violated rule name, current policy version, and a "Learn more" link; the API returns 4xx with a machine-readable reason.
- Given a tenant admin updates approval policies, When the change is saved, Then a new policy version is created with version number, author, timestamp, and diff; prior versions remain retrievable per tenant.
- Given a high-risk action is executed after dual approval, When writing audit logs, Then the system records actionId, approverIds, creatorId, timestamps, policy version used, MFA factors, device trust status, and identity claims; logs are immutable and exportable.
Approval Escalation and Timeout Workflow
"As a dispatcher, I want pending approvals to escalate and expire predictably so that urgent communications are not blocked indefinitely."
Description

Introduces time-bound approval windows with automatic reminders and escalation. If a second approver does not act within a configurable timeout, the system escalates via SMS/email to on-call approvers and optionally reassigns the approval request. Approvers can provide reasons on reject, and requesters can cancel or amend (which resets approvals). All notifications include deep links to the diff and audience preview. Expired approvals are safely closed, and the UI clearly communicates remaining time and the escalation path to avoid stalling high-priority communications.

Acceptance Criteria
Second Approver Timeout Triggers Escalation and Expiry
- Given a pending second approval with a configured timeout, When the timeout elapses without action by the second approver, Then the system immediately sends escalation notifications via SMS and email to the on-call approver(s) and records the event in the audit log. - Given an escalation was sent, When the secondary window expires with no decision, Then the approval request status is set to Expired, all approval actions are disabled, and the requester is notified via SMS/email. - Given an approval request is Expired, When any approver follows a prior action link, Then the system blocks the action, displays "Request expired" with a timestamp, and logs the attempt.
Pre-Timeout Reminder Notifications to Pending Approver
- Given a pending second approval with a timeout and scheduled reminder thresholds, When a reminder threshold is reached before timeout, Then the pending approver receives a reminder via SMS/email including the request title, remaining time, and a deep link, and the reminder is logged once per threshold. - Given multiple reminder thresholds are configured, When reminders are sent, Then duplicate reminders are not sent for the same threshold and the timeout is not reset.
Optional Reassignment to On-Call Approver on Timeout
- Given auto-reassign on timeout is enabled in escalation settings, When the initial timeout elapses without second approval, Then the approval request is reassigned to the current on-call approver, the original pending approver loses action permissions, and both parties are notified. - Given auto-reassign on timeout is disabled, When the initial timeout elapses, Then the request remains assigned to the original approver while escalation notifications are still sent to on-call approver(s). - Given a reassigned approval, When the on-call approver approves or rejects, Then the decision is recorded with identity and timestamp and satisfies the dual-approver requirement.
Reject Requires Reason and Communicates Outcome
- Given a second approver opens a pending approval, When they select Reject, Then the system requires a non-empty reason and prevents submission until one is provided. - Given a rejection reason is submitted, When the system processes the rejection, Then the requester and first approver receive SMS/email including the reason and a deep link, the approval closes as Rejected, and the event is logged.
Requester Cancels or Amends Resets Approval Flow
- Given a pending dual-approval request, When the requester cancels the request, Then the status changes to Cancelled, all approval actions are disabled, and all pending approvers are notified. - Given a pending dual-approval request, When the requester amends the update or ETR, Then the approval version increments, prior approvals are invalidated, the approver count resets, fresh diffs/audience/ETA deltas are generated, and new notifications are sent. - Given an amended request with a newer version, When an approver opens a link from an older version, Then they are redirected to the latest version with an indication that the prior version is obsolete.
All Notifications Contain Deep Links to Diff and Audience Preview
- Given the system sends a reminder, escalation, approval, rejection, cancellation, or expiry notification, When the recipient opens the embedded deep link, Then they land on an authenticated page showing side-by-side diffs, audience count, and ETA deltas with an audience preview. - Given a recipient is not authenticated, When they open a deep link, Then they are prompted to authenticate and then redirected to the intended diff and audience preview. - Given a deep link references an expired or cancelled request, When it is opened, Then the page shows the closed status, prevents actions, and still displays the diff and audience preview for auditability.
UI Displays Remaining Time and Escalation Path Clearly
- Given a user views a pending approval in the console, When the page loads, Then the UI shows a live countdown of remaining time, the current approver, the next escalation target(s), and the scheduled escalation time. - Given the approval is reassigned or escalated, When the change occurs, Then the countdown and escalation path indicators update within 2 seconds. - Given an approval is Expired, Cancelled, or Rejected, When the UI renders the request, Then the status is prominently displayed, action buttons are disabled, and the escalation path is greyed out with a tooltip explaining the outcome and timestamp.
Tamper-Proof Approval Audit Trail
"As a security auditor, I want an immutable record of approval decisions and exact content sent so that we can prove due diligence and reconstruct events when needed."
Description

Captures an immutable, append-only record of each high-risk action, including proposer identity, timestamps, payload fingerprint, full diff snapshot, audience metrics, ETA/ETR deltas, approver identities, decisions, reasons, and notification events. Entries are chain-hashed for tamper evidence, time-synced, and exportable to SIEM via webhook or scheduled export. The audit UI supports filtering by incident, approver, and date, and redacts sensitive PII while retaining evidentiary value. Retention policies are configurable per tenant to align with compliance requirements.

Acceptance Criteria
Append-Only Entry Creation for Approved High-Risk Actions
Given a mass update or ETR change is approved by two distinct approvers in the Dual-Approver Flow When the action is executed Then exactly one audit entry is appended to the tenant's audit log and the audit log total count increases by 1 And the entry contains: proposer_id, proposer_name, action_type, incident_ids, created_at_utc, chain_index, payload_fingerprint_sha256, full_diff_snapshot (pre, post), audience_counts {sms,email,ivr,total}, eta_delta_minutes, etr_delta_minutes, approver1_id, approver2_id, approver1_decision, approver2_decision, approver1_reason, approver2_reason, notification_event_ids And attempts to update or delete any existing audit entry via API or DB layer are rejected and result in no persisted change
Chain-Hash Integrity and Tamper Detection
Given an audit chain with at least two prior entries exists When a new audit entry is appended and the system recomputes hash_i = SHA-256(previous_hash_i || canonical_json(entry_i)) across the chain Then the recomputed head_hash equals the stored head_hash And each entry stores previous_hash and its own hash value And if any byte of any entry is altered after write, verification fails, a TamperDetected security event is logged with entry_id, and the UI/export refuses to load the tampered segment
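The chain-hash rule above (hash_i = SHA-256(previous_hash_i || canonical_json(entry_i))) can be sketched in Python. The helper names, the all-zero genesis hash, and the in-memory chain are illustrative assumptions, not OutageKit's actual implementation:

```python
import hashlib
import json

def canonical_json(entry: dict) -> bytes:
    # Deterministic serialization: sorted keys, no extra whitespace.
    return json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()

def append_entry(chain: list, entry: dict) -> None:
    # Each record stores its predecessor's hash and its own.
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    digest = hashlib.sha256(previous_hash.encode() + canonical_json(entry)).hexdigest()
    chain.append({"entry": entry, "previous_hash": previous_hash, "hash": digest})

def verify_chain(chain: list) -> bool:
    # Recompute every hash; any altered byte breaks all downstream links.
    previous_hash = "0" * 64
    for record in chain:
        expected = hashlib.sha256(
            previous_hash.encode() + canonical_json(record["entry"])
        ).hexdigest()
        if record["previous_hash"] != previous_hash or record["hash"] != expected:
            return False  # a real system would emit a TamperDetected security event
        previous_hash = record["hash"]
    return True
```

The head hash of the chain is simply the last record's `hash`, which is what the export metadata would carry.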
Time Synchronization and Timestamp Policy
Given NTP time sources are configured and healthy When an audit entry is created Then created_at_utc is recorded in RFC3339/ISO8601 UTC with millisecond precision And the recorded time differs from a trusted reference by ≤ 200 ms And chain_index is strictly monotonically increasing per tenant And event timestamps within the entry preserve causal order: proposal_at < approvals_at < executed_at ≤ notifications_sent_at
SIEM Export via Webhook and Scheduled Delivery
Given a tenant has configured a SIEM webhook with a signing secret and a daily export at 02:00 UTC When a new audit entry is appended Then a POST is delivered to the webhook within 60 seconds including headers X-OK-Signature (HMAC-SHA256 of body), X-OK-Event-Id, and X-OK-Timestamp And 5xx or timeout responses trigger retries with exponential backoff for up to 24 hours with idempotency keyed by X-OK-Event-Id When the daily export window closes at 02:00 UTC Then a complete, deduplicated NDJSON batch for the previous UTC day is delivered to the configured destination (e.g., S3/SFTP/HTTPS) with metadata record_count, window_start, window_end, head_hash And delivery success or failure is logged per tenant
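A minimal sketch of how the X-OK-* headers described above could be produced and verified, assuming HMAC-SHA256 over the raw request body; the function names and the UUID event-ID format are hypothetical:

```python
import hashlib
import hmac
import time
import uuid

def sign_webhook(body: bytes, secret: bytes) -> dict:
    # Headers the SIEM receiver uses to authenticate and deduplicate retries.
    return {
        "X-OK-Signature": hmac.new(secret, body, hashlib.sha256).hexdigest(),
        "X-OK-Event-Id": str(uuid.uuid4()),  # idempotency key, stable across retries
        "X-OK-Timestamp": str(int(time.time())),
    }

def verify_webhook(body: bytes, secret: bytes, signature: str) -> bool:
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature)
```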
Audit UI Filtering by Incident, Approver, and Date
Given at least 10,000 audit entries exist for a tenant When a user applies filters for incident_id, approver_id(s), and a date range in the Audit UI Then only matching entries are returned and displayed And the first page loads within 2 seconds at p95 with correct total count and pagination And the applied filters persist in the URL and are restored on refresh and when the URL is shared
PII Redaction with Evidentiary Value
Given audit entries may include PII such as phone numbers and email addresses When entries are displayed in the UI or exported to SIEM Then PII fields are redacted by default (e.g., phone: +1-***-***-1234; email: f***@d***.com) and a stable pii_hash is included for correlation And redaction is applied consistently across full_diff_snapshot, audience metrics/details, and notification metadata And users without the View Sensitive Data permission never see unredacted PII in UI or exports
Per-Tenant Retention Policy Enforcement
Given a tenant retention policy of 18 months is configured When the scheduled retention job runs Then all entries older than 18 months are purged and a retention_purge audit entry is appended capturing purge_window, purged_count, previous_head_hash, new_genesis_hash, and chain_segment_id And chain integrity verification passes for the remaining entries after purge And no entries newer than the retention threshold are deleted And changes to the retention policy are audited and take effect on the next scheduled run
Change Invalidation and Concurrency Controls
"As an approver, I want approvals to reset if the content or audience changes so that what I approve is exactly what gets sent."
Description

Ensures that any modification to content, audience filters, channels, or ETR/ETA after the first approval automatically invalidates prior approvals and requires re-approval. Implements optimistic locking and versioning of the approval artifact to prevent race conditions from concurrent editors. The UI surfaces live change banners, disables send on stale versions, and provides a one-click refresh to review the new diff. API endpoints reject outdated approval tokens, guaranteeing that the executed send matches the content and scope both approvers reviewed.

Acceptance Criteria
Invalidate Prior Approvals on Any Post-Approval Edit
- Given a mass update has at least one approval recorded for version Vn, When any of content body, audience filters, delivery channels, or ETR/ETA is modified, Then the system creates version Vn+1, clears all prior approvals (approvalCount = 0), revokes all approval tokens tied to Vn, displays a live "Changes detected" banner to any approver viewing Vn within 2 seconds, disables Send for Vn, and requires two new distinct approvals on Vn+1 before Send is enabled.
Optimistic Locking Blocks Concurrent Overwrites
Given two editors have version Vn of a mass update open concurrently When Editor A saves changes producing version Vn+1 and Editor B attempts to save based on Vn Then Editor B’s save is rejected with a concurrency error indicating the current version (Vn+1), no changes from Editor B are persisted, no approvals are added or retained, and the UI prompts a one-click refresh to load Vn+1 and review the diff
API Rejects Stale Approval Tokens
Given a send request includes approvalToken Tn bound to version Vn When the current version at send time is not Vn (e.g., Vn+1 exists) Then the API responds with HTTP 409 and errorCode = APPROVAL_TOKEN_STALE including latestVersion, no notifications are sent, the stale token is invalidated, and the request performs no partial side effects
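A toy model of the versioning and stale-token behavior from the last three criteria, using an in-memory artifact; the class and exception names are hypothetical stand-ins for the real API's 409/APPROVAL_TOKEN_STALE response:

```python
from dataclasses import dataclass, field

class StaleApprovalToken(Exception):
    """Would map to HTTP 409 with errorCode=APPROVAL_TOKEN_STALE."""

@dataclass
class MassUpdate:
    version: int = 1
    approvals: set = field(default_factory=set)

    def edit(self) -> None:
        # Any post-approval edit bumps the version and invalidates prior approvals.
        self.version += 1
        self.approvals.clear()

    def send(self, token_version: int) -> str:
        if token_version != self.version:
            # Stale token: reject with the latest version, perform no side effects.
            raise StaleApprovalToken(f"APPROVAL_TOKEN_STALE: latestVersion={self.version}")
        if len(self.approvals) < 2:
            raise PermissionError("two distinct approvals required")
        return "sent"
```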
UI Disables Send on Stale Version and Prompts Refresh
Given an approver is viewing version Vn while the server has version Vn+1 When the approver attempts to finalize or send Then the Send control is disabled with an inline stale-version message; and when the approver clicks “Refresh & Review,” the UI loads Vn+1 within 3 seconds, presents a side-by-side diff including audience count delta and ETA/ETR delta, and keeps Send disabled until two distinct re-approvals are captured on Vn+1
Executed Send Equals Approved Version
Given version Vn has two approvals and an approvalDigest computed from content, audience filter, channels, and ETR/ETA at approval time When Send is executed Then the system computes the payload digest at execution and blocks the send if it does not equal approvalDigest; if equal, the send proceeds, and the audit log records version, digest, approver IDs, and timestamp
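One way to realize the approvalDigest check above is to canonicalize the approved fields and recompute the digest at execution time; this sketch assumes JSON canonicalization with sorted keys and sorted channels, which the criterion does not mandate:

```python
import hashlib
import json

def approval_digest(content: str, audience_filter: dict, channels: list, etr: str) -> str:
    # Canonical, order-independent serialization of everything the approvers reviewed.
    payload = json.dumps(
        {"content": content, "audience": audience_filter,
         "channels": sorted(channels), "etr": etr},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def guarded_send(approved_digest: str, content, audience_filter, channels, etr) -> str:
    # Recompute at execution time; block if anything drifted since approval.
    if approval_digest(content, audience_filter, channels, etr) != approved_digest:
        raise RuntimeError("payload diverged from approved version; send blocked")
    return "sent"
```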
Re-Approval Requires Two Distinct Approvers After Invalidation
Given first approval exists on version Vn and a subsequent edit creates version Vn+1 When re-approval is requested for Vn+1 Then approvals must be provided by two distinct user identities; prior approvals from Vn do not carry over; the same user cannot approve twice; and Send remains disabled until both approvals are present on Vn+1

Scoped Roles Matrix

Granular permissions define who can initiate and who can approve by channel (SMS, email, IVR), geography, incident severity, and content type (ETR vs advisory). Keeps changes within a safe blast radius, mirrors real org responsibilities, and blocks unauthorized or overbroad updates.

Requirements

Granular Roles & Scopes Engine
"As an operations manager, I want permissions tied to my channels, region, severity, and content type so that I can act quickly within my remit without risking overbroad updates."
Description

Implements a least-privilege permission model that binds actions (initiate, approve, edit ETR, publish, cancel) to fine-grained scopes across channel (SMS, email, IVR), geography (service territories, polygon geofences), incident severity (minor/major/critical), and content type (ETR vs advisory). Supports composing scopes with AND logic, explicit deny overriding allow, role inheritance, and reusable policy templates that mirror real operational responsibilities. Enforces all checks server-side with a consistent policy evaluation service used by UI and API, returning deterministic allow/deny decisions with rationale. Targets p95 policy evaluation under 50 ms with cached policy artifacts and safe-deny fallbacks if a decision cannot be made. Integrates with OutageKit’s incident and notification services so only permitted users can stage or broadcast updates within their assigned blast radius. Expected outcome: unauthorized or overbroad updates are blocked while legitimate, scoped actions proceed without friction.

Acceptance Criteria
Scoped Publish Within Territory and Channel
- Given user U has publish permission scoped to channels {SMS, email}, geography=G1 polygon, severities {minor, major}, and content {advisory}, When U submits a publish request via SMS for an advisory targeting recipients whose service locations are entirely within G1, Then the policy service returns allow with a rationale listing the matched policy IDs and scopes, and the notification service broadcasts only to recipients within G1. - Given the same U attempts to publish via IVR, or for severity=critical, or for content=ETR, When the request is evaluated, Then the policy service returns deny with a rationale citing the scope mismatch and no stage/broadcast records are created. - Given U targets recipients that include any outside G1, When the request is evaluated, Then the policy service returns deny with rationale "target outside authorized geography" and no partial send occurs.
Explicit Deny Overrides Allow
- Given user V has an allow policy to publish SMS in G2 for all severities and an explicit deny policy for severity=critical, When V submits a critical SMS publish in G2, Then decision=deny and the rationale includes "explicit deny override" with the deny policy ID. - Given V submits a major SMS publish in G2, When evaluated, Then decision=allow with a rationale showing the allow policy and no denies matched. - Given multiple policies grant and deny the same action across inheritance levels, When evaluated, Then deny takes precedence deterministically with precedence rule "explicit_deny_overrides_allow".
Role Inheritance and Least Privilege
- Given role R1 grants actions {initiate, edit_ETR} scoped to geography=G3 and severities {minor, major}, and role R2 inherits R1 and adds approve for channel=SMS only, When a user is assigned R2, Then they can initiate and edit ETR within G3 for {minor, major} and can approve only for SMS, not email or IVR. - Given initiate is removed from R1, When re-evaluated, Then users with R2 immediately lose initiate rights without manual updates. - Given a user with no assigned roles, When they attempt any action, Then decision=deny with rationale "no matching allow".
AND Logic Scope Composition
- Given a policy P grants approve where channel=SMS AND geography=G4 AND severity=major AND content=ETR, When a request matches any three but not all four dimensions, Then decision=deny with a rationale indicating the missing dimension(s). - Given a request matches all four dimensions, When evaluated, Then decision=allow and the rationale lists P as the matched policy. - Given multiple values inside a single dimension of P (e.g., severity in {minor, major}), When evaluated, Then matching still requires channel AND geography AND content to also match; no cross-dimension OR broadening occurs.
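The deny-overrides-allow precedence and AND-composition rules in the last two criteria could be evaluated along these lines. This is a simplified sketch: real policies would also carry role inheritance, polygon geographies, and cached artifacts, and the data shapes here are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    policy_id: str
    effect: str          # "allow" or "deny"
    channels: frozenset
    geographies: frozenset
    severities: frozenset
    content_types: frozenset

    def matches(self, request: dict) -> bool:
        # AND across dimensions; sets permit multiple values within one dimension.
        return (request["channel"] in self.channels
                and request["geography"] in self.geographies
                and request["severity"] in self.severities
                and request["content"] in self.content_types)

def evaluate(policies, request):
    """Deterministic decision with rationale: (decision, matched policy IDs, reason)."""
    matched = [p for p in policies if p.matches(request)]
    denies = [p for p in matched if p.effect == "deny"]
    allows = [p for p in matched if p.effect == "allow"]
    if denies:  # explicit deny always overrides allow
        return ("deny", [p.policy_id for p in denies], "explicit_deny_overrides_allow")
    if allows:
        return ("allow", [p.policy_id for p in allows], "matched allow")
    return ("deny", [], "no matching allow")  # safe default when nothing matches
```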
Deterministic Decisions and Rationale Across UI and API
- Given identical inputs (subject identity, action, resource, attributes), When evaluated 100 times in succession, Then decisions are identical and rationales contain the same ordered policy ID list and precedence notes. - Given a blocked action initiated via UI and the same via API, When evaluated server-side, Then both receive a deny; the UI surfaces disabled controls but enforcement occurs server-side; the API receives HTTP 403 with a machine-readable code and correlation ID; the rationale is returned in both contexts. - Given an allowed action initiated via UI and via API, When evaluated server-side, Then both receive allow with the same decision ID and rationale, and no client-side override is required.
Performance, Caching, and Safe-Deny
- Given the policy evaluation service under nominal load, When measuring end-to-end decision latency over a statistically significant sample, Then p95 latency <= 50 ms at the service boundary. - Given the policy service experiences a timeout, cache corruption, or dependency failure, When a decision cannot be made within the configured timeout, Then the service returns deny with reason "SAFE_DENY" within a bounded time and no side effects occur. - Given warmed caches of policy artifacts, When evaluating decisions after a policy change has been published and cache invalidation triggered, Then evaluations reflect the latest policies and no stale-policy acceptance occurs.
Reusable Policy Templates
- Given a reusable policy template with placeholders for {actions, channels, geography, severities, content_types}, When instantiated for territory T5 with actions {initiate, publish}, channels {SMS, email}, severities {minor, major}, and content_types {advisory}, Then the created policies reflect those values exactly and carry a reference to the source template ID and version. - Given the same template is reused for territory T6, When instantiated, Then policies are generated for T6 without affecting T5, and evaluation respects each territory's scopes. - Given a template update creates a new version, When instantiating after the update, Then the new version is used; existing instantiated policies remain unchanged until explicitly re-instantiated.
Scoped Initiate/Approve Workflow
"As a regional supervisor, I want scoped approvals for outgoing messages so that sensitive updates are reviewed by the right people before customers are contacted."
Description

Provides a two-stage workflow where users authorized to initiate within a given scope can propose notifications and changes, and publication requires approval by a user with matching or broader scope for the same channel/geography/severity/content type. Includes per-channel approval routing, SLA timers with escalation, and clear UI prompts explaining who can approve and why. Supports emergency override (“break-glass”) with dual authorization, mandatory justification, automatic narrowest-possible scoping, time-boxed access, and post-event review. Blocks self-approval unless explicitly allowed by policy. Integrates with OutageKit’s message composer and scheduling to ensure only approved, scoped content reaches subscribers.

Acceptance Criteria
Scoped SMS Advisory Initiation
- Given a user with Initiate permission for channel=SMS, geography=City A, severity=Advisory, contentType=Advisory, When they create and submit a draft with exactly those scopes, Then the draft is saved and its status set to Pending Approval. - Given the same user attempts to include any scope outside their permissions (channel/geography/severity/contentType), When they attempt to save or submit, Then submission is blocked, an inline error lists the unauthorized dimensions, and a denied-attempt audit entry is created. - Given an in-scope draft, When submitted for approval, Then the draft becomes read-only except for comment fields and enters the approval queue.
Approval Scope Matching and Self-Approval Policy
- Given a pending request with scope (channel=SMS, geography=City A, severity=Major, contentType=ETR), When a user with Approve permission whose scope covers SMS, includes City A (or broader), covers severity Major (or higher, as defined by policy), and includes ETR opens it, Then the Approve action is enabled. - Given a pending request whose initiator also has approve permission, When policy selfApproval=false, Then the Approve action is disabled for the initiator, the UI tooltip reads "Self-approval not permitted per policy", and the audit log records the attempted self-approval. - Given a pending request whose initiator also has approve permission, When policy selfApproval=true, Then the Approve action is enabled and the approval record is marked as self-approved. - Given an approver whose scope does not cover any one of channel/geography/severity/contentType, When they open the request, Then the Approve action is not available and the UI lists the missing scope dimensions. - Given an approver approves a request, When the approval is recorded, Then the approval record stores the approver identity, decision timestamp, matched scope, and any scope adjustments, and the message proceeds to publish only after all required approvals for the target channel are satisfied.
Per-Channel Approval Routing and Explainability
- Given a draft targeting channels SMS and Email, When the initiator submits for approval, Then two approval tasks are created, one per channel, each routed to the correct approver group per the roles matrix. - Given a user views a pending draft, When they click "Who can approve?", Then the UI displays the eligible approvers per channel with the reason they qualify (e.g., the scope dimensions they cover). - Given channel-level approvals are independent, When SMS is approved and Email is pending, Then only SMS is eligible to publish/schedule, and the UI shows SMS=Approved, Email=Pending with timestamps.
SLA Timers and Escalation for Approvals
- Given an approval request is created at T0 with SLA=5 minutes for severity=Major (per policy), When no approver acts by T0+5m, Then the system escalates to the next approver group, sends notifications via email and SMS to on-call approvers, and logs an escalation event. - Given an escalation occurs, When approvers are notified, Then the audit log records the escalation level, recipients, and delivery result for each channel. - Given an approver acts after the SLA, When they approve or reject, Then the SLA status is recorded as Breached with the decision timestamp; if before the SLA, it is recorded as Met. - Given severity changes affect the SLA, When severity=Advisory with SLA=15 minutes (per policy), Then the timer reflects the configured SLA for that severity and escalates accordingly.
Composer and Scheduling Approval Enforcement
- Given an unapproved draft, When a user attempts to send or schedule, Then the action is blocked with the message "Requires approval" and a link to request approval. - Given a message is scheduled for Tpublish, When it is not fully approved for the relevant channels and scopes by Tpublish-2 minutes, Then the schedule pauses and notifications are sent to the initiator and approvers; the message does not publish until approval is granted. - Given approvals are granted before Tpublish, When Tpublish occurs, Then the message publishes to subscribers within the approved scope only. - Given the approved scope differs from the initiated scope, When the message publishes, Then the system enforces the approved (narrower or equal) scope and records the delta in the audit log.
Emergency Override (Break-Glass) Dual Authorization
- Given a user triggers Emergency Override on a draft, When they provide a mandatory justification of at least 20 characters, Then the system grants the narrowest possible scope needed to publish the current targeting, starts a 30-minute override window, and marks the draft as Break-Glass Pending. - Given Break-Glass Pending, When a distinct second user with Break-Glass Confirm permission confirms within 3 minutes, Then the message publishes immediately to the minimal required scope, and all actions are tagged Emergency Override in the audit log. - Given Break-Glass Pending, When no second user confirms within 3 minutes, Then the override is cancelled, no publication occurs, and the draft returns to Pending Approval. - Given the override window expires, When the time-boxed access ends, Then all elevated permissions are revoked and a post-event review task is created containing the justification, users involved, scopes affected, and timeline.
Scope Expansion Controls and Audit Trail
- Given an approver reviews a request, When they attempt to broaden its scope (e.g., add geography, increase severity, add a channel, or change contentType), Then the change is allowed only if the approver's permission scope is a superset of the proposed scope and they enter a justification note; otherwise the change is blocked with a specific error listing the unauthorized dimensions. - Given a request is published, When finalization occurs, Then the audit log contains the final published scope, any differences from the initiated scope, who changed them, and timestamps. - Given any unauthorized or overbroad update attempt occurs, When the system blocks the action, Then a security event record is created with the user, attempted change, scopes involved, and timestamp.
IdP Group Mapping & Sync
"As an identity admin, I want roles and scopes to sync from our IdP so that access reflects our org chart and on-call rotations without manual updates."
Description

Maps identity provider (SSO) groups and attributes to OutageKit roles and scopes, enabling automated assignment by geography, channel responsibility, and on-call status. Supports SCIM 2.0 and LDAP sync, just-in-time provisioning, periodic reconciliation, and immediate deprovisioning. Allows attribute-based rules (e.g., territory=“North” AND channel=“SMS”) to drive scope membership. Provides dry-run previews to validate mappings before applying. Ensures the Scoped Roles Matrix stays aligned with real org structures without manual user management.

Acceptance Criteria
SCIM 2.0 Provisioning Maps Groups to Scoped Roles
Given a valid SCIM 2.0 POST /Users with attributes territory=“North” and channel=“SMS” and group “Ops-Initiators” When OutageKit receives the request Then a new user record is created if no matching externalId exists within 5 seconds And the user is assigned the role Initiator with scope {channel: SMS, geography: North} And the assignment is visible in the admin UI and via API within 30 seconds And if the user already exists, attributes and role/scopes are updated idempotently without creating duplicates
LDAP Scheduled Reconciliation Updates Scope Membership
Given an LDAP directory where user A is removed from group “North-SMS-Initiators” and added to “South-Email-Approvers” And a reconciliation interval configured to 15 minutes When the next scheduled LDAP sync runs Then OutageKit removes user A from {channel: SMS, geography: North, role: Initiator} And adds user A to {channel: Email, geography: South, role: Approver} And no other users’ role/scope assignments change And the total adds/removes match the computed delta from LDAP And the sync completes within the configured window and surfaces a success summary
Just-In-Time Provisioning on First SSO Login
Given a user not present in OutageKit initiates SSO and the IdP assertion contains territory=“Central”, channel=“IVR”, on_call=true When the user completes SSO Then OutageKit creates the user and assigns scopes per mapping rules before redirecting to the app (<= 3 seconds post-assertion) And the user lands in the app with the mapped permissions effective immediately And if no mapping rule yields any role, the login is denied with a clear error and zero access is granted
Immediate Deprovisioning on IdP Disable/Delete
Given a user is deactivated or deleted in the IdP and a SCIM PATCH/DELETE for the user is issued When OutageKit receives the deprovision event Then all role/scope assignments are revoked within 60 seconds And all active sessions are terminated within 60 seconds And subsequent API/UI access attempts return 401/403 And the user no longer appears in any scope membership queries
Attribute-Based AND Rules Drive Scope Assignment
Given a mapping rule: (territory=“North” AND channel=“SMS” AND on_call=true) -> Approver: {SMS, North} When a user’s IdP attributes match all rule predicates (case-insensitive, trimmed) Then the user is assigned Approver with scope {channel: SMS, geography: North} And when any predicate does not match, no assignment from this rule occurs And when multiple rules match, the resulting roles/scopes are the de-duplicated union, bounded by any org-wide max-scope limits
Dry-Run Preview of Mappings Shows Deterministic Changes
Given a set of mapping rules and a selected IdP directory snapshot When an administrator runs a dry-run preview Then OutageKit produces counts of users to create, update, and deprovision, plus per-user role/scope diffs And no user, role, or scope changes are persisted And running the same dry-run again without directory or rule changes yields identical results And the administrator can apply or discard the changes explicitly after review
Union of Multiple Matching Mappings Without Overreach
Given a user matches three mapping rules that assign overlapping roles/scopes When the mappings are evaluated Then the final assignment is the union of roles/scopes without duplicates And the union cannot exceed preconfigured organizational maximums per role, channel, or geography And evaluation order does not change the final result (deterministic outcome)
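The case-insensitive AND-matching and de-duplicated union semantics described in the last two criteria can be sketched as a pure function over normalized attributes. The rule representation (predicate dict paired with a role/scope grant) is an assumption for illustration, not the actual mapping schema, and the sketch omits org-wide max-scope limits:

```python
def normalize(value) -> str:
    # Case-insensitive, trimmed comparison per the mapping rule criterion.
    return str(value).strip().lower()

def evaluate_mappings(rules, attributes):
    """Return the de-duplicated union of role/scope grants from all matching rules.

    rules: list of (predicates: dict, grant: tuple) pairs.
    A rule matches only if EVERY predicate equals the user's attribute (AND logic).
    Using a set makes the result independent of rule evaluation order.
    """
    attrs = {k: normalize(v) for k, v in attributes.items()}
    grants = set()
    for predicates, grant in rules:
        if all(attrs.get(k) == normalize(v) for k, v in predicates.items()):
            grants.add(grant)
    return grants
```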
Blast Radius Preview & Guardrails
"As a duty manager, I want a blast radius preview before sending so that I can confirm the audience and scope are correct and within policy."
Description

Adds a preflight check that visualizes and quantifies the impact of a proposed action, showing estimated recipients by channel, affected geographies on the map, and severity/content scope alignment. Validates that the selected audience and content are within the initiator’s and approver’s allowed scopes; surfaces explainable errors when out of bounds. Provides configurable thresholds and warnings (e.g., unusually large audience for a minor incident) and requires justification for crossing soft limits. Integrates directly into the compose and approve flows to reduce accidental overreach before publication.

Acceptance Criteria
Preflight Preview: Recipients, Geography, Severity/Content
Given a logged-in initiator has selected channels (SMS, email, IVR), geographies, severity, and content type in the compose flow When they open or update the preflight panel Then the panel displays estimated recipient counts per selected channel And highlights the selected geographies on the map And shows the chosen severity and content type And updates all values within 2 seconds of any change to selections (p95) And displays a zero-impact notice if no recipients are targeted
Initiator Scope Validation and Explainable Errors
Given an initiator’s allowed scope is defined by channel, geography, incident severity, and content type per the Scoped Roles Matrix When the draft targets any dimension outside the initiator’s allowed scope Then the system disables Send and Request Approval actions And displays an error that enumerates each violating dimension and the initiator’s permitted values And provides a link or inline view to the relevant scope policy And the error clears immediately when the draft is revised to be within scope
Soft Threshold Warning and Justification Capture
Given soft thresholds are configured (e.g., Minor severity audience > 5,000 recipients) When the draft exceeds a soft threshold but remains within role scope Then the system displays a warning specifying the threshold exceeded and the current calculated metric And requires the initiator to enter a free-text justification of at least 20 characters And records justification, user ID, timestamp, threshold type, and values in the audit log And enables Send or Request Approval only after a valid justification is provided
Hard Threshold Enforcement
Given hard limits are configured (e.g., Minor severity cannot target more than one county) When the draft exceeds any hard limit Then the system disables Send and Request Approval And displays a non-overridable error listing the specific limit and the current value And does not accept justification to bypass the limit And the error clears only when the draft is adjusted to comply with the hard limit
Approver Review Validates Approver Scope
Given an approver opens a pending action in the approve flow When any targeted dimension exceeds the approver’s allowed scope Then the Approve action is disabled And an error identifies each out-of-scope dimension relative to the approver’s permissions And the approver can reassign to an approver with sufficient scope And the system logs the blocked approval attempt with user, timestamp, and violating dimensions
Compose and Approve Flow Integration Gates Progress
Given a user is in the compose flow When all validations pass (scope, thresholds, required fields) Then the primary button state reflects the next permitted step based on role (Send or Request Approval) And a status banner shows Validation OK When any validation is Warning (soft threshold) Then the primary action is enabled only after justification is entered and validated When any validation is Error (scope or hard limit) Then the primary action remains disabled until resolved And in the approve flow, the same validations and preflight details are presented identically before approval is enabled
Performance and Resilience of Preflight Computation
Given a draft targeting up to 250,000 total recipients across channels When preflight is triggered or inputs change Then recipient estimates, map highlights, and validations render within 2 seconds p95 and 4 seconds p99 And the UI remains responsive during computation And if computation fails, the system retries up to 3 times with exponential backoff And after final failure, an error is shown with guidance to retry or adjust targeting, and the failure is logged with correlation ID
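The retry behavior above — up to 3 retries with exponential backoff, and a correlation ID surfaced on final failure — can be sketched as a small wrapper. `run_with_backoff` and its parameters are hypothetical names, not part of OutageKit's codebase; the injectable `sleep` exists only to make the sketch testable.

```python
import time
import uuid

def run_with_backoff(compute, max_retries=3, base_delay=0.5, sleep=time.sleep):
    correlation_id = str(uuid.uuid4())
    for attempt in range(max_retries + 1):
        try:
            return compute()
        except Exception as exc:
            if attempt == max_retries:
                # Final failure: surface a correlation ID so the logged error
                # can be tied to the retry-or-adjust-targeting guidance shown
                # in the UI.
                raise RuntimeError(
                    f"preflight failed after {max_retries} retries "
                    f"(correlation_id={correlation_id})"
                ) from exc
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```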
Immutable Audit Trail & Evidence
"As a compliance officer, I want an immutable audit trail of scoped actions so that I can demonstrate proper controls and investigate incidents quickly."
Description

Captures append-only logs for all permission evaluations, role/scope changes, initiations, approvals, overrides, and publications, including actor identity, decision rationale, content diffs, scopes, timestamps, IP/agent, and related incident IDs. Provides search, filters, and export to CSV/JSON and SIEM for compliance. Protects integrity with tamper-evident hashing and retention controls. Powers compliance reports demonstrating who did what, when, and under which authorized scope within OutageKit.

Acceptance Criteria
Permission Evaluation Logging for Scoped SMS Approval
Given a user attempts to approve an SMS ETR update for Incident I within Geography G and Severity S When the permission engine evaluates the request Then an append-only audit entry is written containing: actor_id, actor_role, action="approve", channel="SMS", content_type="ETR", decision in {allow, deny}, decision_rationale (non-empty), policy_version, requested_scopes {channel, geography, severity, content_type}, authorized_scopes, incident_id, request_id, tenant_id, timestamp_utc (ISO 8601), ip_address, user_agent, environment, prev_hash, entry_hash And the entry is visible via audit UI and API within 5 seconds of the evaluation completing And attempts to alter or delete the entry are rejected with HTTP 403 and the attempt is separately logged And the hash chain verifies such that entry_hash = H(prev_hash || payload) and the verification endpoint returns "valid" for the entry
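The hash chain in this criterion (entry_hash = H(prev_hash || payload)) can be illustrated with SHA-256. The sorted-key JSON serialization and the all-zeros genesis value are assumptions made for the sketch, not OutageKit's documented wire format.

```python
import hashlib
import json

GENESIS = "0" * 64

def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical serialization so the same payload always hashes identically.
    serialized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + serialized).encode("utf-8")).hexdigest()

def append_entry(chain: list, payload: dict) -> dict:
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    entry = {"payload": payload, "prev_hash": prev,
             "entry_hash": entry_hash(prev, payload)}
    chain.append(entry)
    return entry

def verify_chain(chain: list) -> bool:
    # Walk the chain recomputing each hash; any edited payload or broken
    # prev_hash link makes verification fail.
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["entry_hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["entry_hash"]
    return True
```

Exporting an entry and recomputing its hash (as the role-change criterion below requires) is exactly the `entry_hash` call on the exported payload.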
Tamper-Evident Role and Scope Change Logging
Given an administrator modifies a user's role or scope assignment When the change is saved Then an append-only audit entry is created with: actor_id, target_principal, change_type in {add, remove, update}, before_state_hash, after_state_hash, field_level_diff, change_reason (required), approval_reference (optional), timestamp_utc, ip_address, user_agent, tenant_id, prev_hash, entry_hash And the entry is stored in WORM mode under the active retention policy and cannot be edited or deleted And exporting the entry and recomputing the hash reproduces the same entry_hash And verification of the hash chain including this entry returns "valid"
End-to-End Traceability for Initiation–Approval–Override–Publication
Given an incident update is initiated, approved, optionally overridden, and published to SMS, Email, and IVR When each step is performed Then each step writes an audit entry including: step_type, content_diff from prior step, requested_scopes, authorized_scopes, actor_id, timestamp_utc, incident_id, correlation_id, prev_hash, entry_hash And querying by incident_id returns a time-ordered, contiguous sequence of these entries linked by correlation_id with no gaps And the timeline view displays the full sequence within 3 seconds for incidents with <= 100 steps And if any expected step is missing, the system flags the timeline as incomplete and emits an alert event
Audit Log Search, Filter, and Pagination Performance
Given an auditor provides filters for actor_id, incident_id, date_range (UTC), decision, channel, geography, severity, action_type, and ip_address When a search request is executed via UI or API Then only records matching all provided filters (AND semantics) are returned sorted by timestamp_utc descending And the first page (up to 100 records) returns within 2 seconds for result sets <= 10,000 records And pagination via next_page_token returns the full result set without duplicates or omissions And access controls enforce tenant and scope isolation; unauthorized users receive HTTP 403 with no data leakage
Export to CSV/JSON and SIEM Forwarding with At-Least-Once Delivery
Given an auditor selects a filtered set of audit records up to 100,000 entries When exporting to CSV or JSON Then the export is delivered within 30 seconds in UTF-8 encoding with a stable schema that includes prev_hash and entry_hash, and a SHA-256 checksum file is produced And CSV escaping conforms to RFC 4180; JSON export is a well-formed array with consistent field names and types And SIEM forwarding can be configured via TLS syslog or HTTPS webhook using at-least-once delivery with exponential backoff and a dedupe_key based on request_id And delivery metrics and permanent failures are surfaced in the UI/API, with retry attempts capped and alerts emitted on failure
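At-least-once delivery implies the SIEM side may see the same record twice, which is why the criterion calls for a dedupe_key based on request_id. A minimal receiver-side sketch, with illustrative names:

```python
def dedupe_key(record: dict) -> str:
    # Stable key derived from the record's request_id, per the criterion.
    return f"audit:{record['request_id']}"

class SiemReceiver:
    """Toy receiver that drops redelivered records idempotently."""

    def __init__(self):
        self.seen = set()
        self.accepted = []

    def ingest(self, record: dict) -> bool:
        key = dedupe_key(record)
        if key in self.seen:
            return False  # duplicate from an at-least-once redelivery
        self.seen.add(key)
        self.accepted.append(record)
        return True
```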
Compliance Report: Who/What/When/Scope
Given a compliance officer requests a report for a date range and optional filters (geography, incident_id) When the report is generated Then it lists each action with actor_id, action_type, requested_scopes, authorized_scopes, incident_id, timestamp_utc, decision, decision_rationale, and override_justification (if any) And it includes summaries by action_type and actor and validates against the hash chain, producing a report_signature And the report supports drill-through links to underlying audit entries and can be exported to CSV and JSON within 60 seconds for up to 10,000 actions
Retention Policy and Legal Hold Enforcement
Given a tenant-level retention period R is configured and a legal_hold flag may be applied to specific incidents or actors When log entries exceed age R and are not under legal hold Then they are expired via a WORM-compliant process that writes a cryptographic tombstone entry linking prev_hash and including summary metadata And any deletion, retention change, or legal hold application/removal generates its own audit entry And entries under legal hold are not deleted; attempts to delete return HTTP 403 and are logged And backup and restore operations preserve the hash chain; post-restore verification for a random 1% sample passes
Policy Versioning & Rollback
"As a security administrator, I want versioned, reviewable permission policies so that we can change access safely and revert quickly if needed."
Description

Introduces versioned Scoped Roles Matrix policies with draft, review, and publish states, scheduled effective dates, and change summaries. Provides diff views between versions, impact analysis (who gains/loses capabilities), and one-click rollback to a prior known-good configuration. Requires approval to publish policy changes and logs full provenance. Ensures safe evolution of permissions without unintended gaps or excessive access.

Acceptance Criteria
Draft Version Creation & Change Summary
Given a user with Policy Editor permissions is on the Scoped Roles Matrix policies page When the user creates a new policy version from the currently published policy Then a new Draft version is created with version number incremented by 1 And the draft is not enforced in runtime permissions And the user must enter a non-empty change summary of at least 15 characters before the first save is allowed And createdBy and createdAt are recorded on the draft And the draft can be saved and reopened with all edits persisted
Version Diff View Between Policy Versions
Given a draft version and a baseline version exist When the user opens the Diff view Then additions, removals, and modifications are shown, grouped by channel, geography, severity, and content type And each change shows previous and new values and the affected principal type (user or group) And the diff view supports filtering by dimension and change type, and search by principal And the diff summary displays total counts of added, removed, and modified capabilities And the diff can be exported to CSV and PDF with totals identical to the on-screen summary
Impact Analysis of Gains and Losses
Given a draft version exists And the user selects a baseline version to compare against When the user runs Impact Analysis Then the system lists users and groups gaining or losing each capability with counts per dimension And cross-geo or cross-severity expansions are flagged as high risk And the analysis completes within 10 seconds for up to 10,000 policy rows And the analysis output can be downloaded as CSV and includes a generated analysisId for traceability
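At its core, the gains/losses listing is a set difference over capability tuples — here assumed to be (principal, channel, geography, severity, content_type), which is an illustrative shape rather than the actual schema:

```python
def impact(baseline: set, draft: set) -> dict:
    # Gains are capabilities present only in the draft; losses only in the
    # baseline. Sorting makes the report output deterministic.
    return {
        "gains": sorted(draft - baseline),
        "losses": sorted(baseline - draft),
    }
```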
Approval Workflow With Review State and Dual Approval
Given a draft version with no outstanding validation errors exists When the author submits the draft for review Then the version state changes to In Review and the author cannot approve their own draft And at least 2 distinct approvers with Policy Approver role must approve before publish is enabled And approvers can request changes, which moves the version back to Draft with a required comment And all approvals and rejections are recorded with actor, timestamp, and comment
Scheduled Publish and Atomic Cutover
Given an approved version exists When a publisher schedules it with a future effective date and timezone Then the schedule cannot be set in the past and cannot overlap with another pending schedule And at the effective timestamp the system atomically promotes the version to Published within 5 seconds And publishedAt, publishedBy, effectiveAt, and priorVersion are recorded And a notification is sent to configured channels confirming successful cutover And if promotion fails the system automatically rolls back to the prior Published version and alerts on failure
One-Click Rollback to Prior Published Version
Given a Published version is active and at least one prior Published version exists When an authorized user triggers One-Click Rollback and provides a rollback reason Then the selected prior Published version becomes the active Published version within 5 seconds And the rollback creates a new change summary referencing the source and target version numbers and the reason And all pending schedules tied to the superseded version are canceled And a notification is sent to configured channels confirming rollback
Provenance and Safety Validation
Given a draft or in-review version exists When the user runs Validate Changes Then the validator blocks publish if any capability that previously existed would be removed, causing a gap for a geography/channel pair And the validator blocks publish if any single principal gains permissions exceeding the configured blast radius threshold And non-blocking warnings are displayed for low-risk changes And all validation results, approvals, publishes, schedules, rollbacks, and exports are written to an immutable audit log with actor, timestamp, object IDs, and a cryptographic hash chain And audit events are viewable with filters and exportable to CSV
Permissions Admin Console & API
"As a platform admin, I want a robust UI and API to manage permissions so that I can maintain and automate our Scoped Roles Matrix efficiently."
Description

Delivers an admin UI for creating roles, defining scopes, assigning users/groups, and importing/exporting policies as JSON/CSV. Includes validation, test-as-user capability, bulk operations, and a sandbox mode to trial policies against historical incidents without affecting production. Exposes REST endpoints for policy CRUD, evaluation, and sync status with pagination, rate limits, and fine-grained access controls. Ensures the Scoped Roles Matrix is manageable at scale and integrable with external tooling.

Acceptance Criteria
Create Role with Multi-Dimensional Scope via UI
Given I am an org admin with permission to manage roles When I create a new role and set Initiate privileges for channels [SMS, Email] and Approve for [IVR] And I scope geography to regions [R-101, R-202], severity to [Major, Critical], and content types to [ETR, Advisory] Then the role is saved successfully And the role detail view displays the exact channels, geographies, severities, and content types for Initiate and Approve And the role is retrievable via API and UI with identical scope values Given I attempt to save a role with no channels selected or with an empty scope on all dimensions When I click Save Then the save is blocked and inline validation messages identify each missing or invalid field Given I define overlapping scopes with an existing role When I save Then the system allows overlap and displays a non-blocking warning that overlap exists
User/Group Assignment and Effective Permission Preview
Given a user Jane belongs to groups FieldOps and RegionWest and has roles assigned via both user and group When I open Effective Permissions for Jane Then the UI shows a matrix of Initiate and Approve by channel, geography, severity, and content type that equals the union of scopes from all assigned roles And removing role B from Jane and refreshing recalculates the matrix to exclude B's scopes And the Effective Permissions view and the evaluation API return the same decision and rationale for the same test action
Policy Import/Export JSON and CSV with Validation and Dry-Run
Given I export policies as JSON and CSV When I download the files Then the files include all roles, scopes, assignments, and metadata with stable identifiers and a schema version Given I perform a dry-run import with a mixed-validity file When I submit the file with mode=dry-run Then I receive a report containing per-record status (Valid/Invalid), error messages, and summary counts without persisting any changes Given I perform a commit import with the same file When I submit with mode=commit Then valid records are applied idempotently, invalid records are rejected with reasons, and the response includes per-record results and a transaction ID And a subsequent export reflects the applied changes exactly
Test-As-User Policy Evaluation in UI
Given I open Test-as-User and select user Jane When I input action "Initiate SMS Advisory" for geography R-101 and severity Major Then the result displays Allow or Deny, the matched role ID(s), and the specific scope criteria that determined the outcome And the evaluation result matches the evaluation REST API for the same input
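The Test-as-User decision above — Allow/Deny plus the role IDs and scope criteria that determined the outcome — can be sketched as a filter over the user's assigned roles. The role shape (sets per dimension) is an assumption for the sketch.

```python
def evaluate(action: dict, roles: list) -> dict:
    # A role matches only if every dimension of the action falls within
    # that role's scope; matched role IDs form the decision rationale.
    matched = [
        r["id"] for r in roles
        if action["type"] in r["privileges"]
        and action["channel"] in r["channels"]
        and action["geography"] in r["geographies"]
        and action["severity"] in r["severities"]
        and action["content_type"] in r["content_types"]
    ]
    return {"decision": "Allow" if matched else "Deny",
            "matched_roles": matched}
```

Running the same function behind both the UI panel and the evaluation endpoint is one way to satisfy the "UI matches API" requirement by construction.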
Bulk Edit Roles and Assignments
Given I select 200 users in the Users table When I bulk-assign role A and remove role B Then the operation completes with a progress indicator and a per-user result list of Success or Failure with reason And no user ends in a partial state (atomic per user) And the final count of successes equals the number of users that now have A and no longer have B
Sandbox Trial Against Historical Incidents
Given Sandbox mode is enabled When I select a date range and choose a draft policy set And I simulate actions against historical incidents within that range Then no production roles, assignments, or incident records are modified And I receive a report that lists each evaluated historical action with Allow/Deny, matched policy references, and counts aggregated by channel, geography, severity, and content type And the report is exportable as CSV and JSON
REST API: CRUD, Evaluate, Sync Status, Pagination, Rate Limits, Access Control
Given a client with token scope policies.read When it calls GET /roles and GET /assignments with a limit and page token Then the API returns 200 with items, pagination metadata (nextPageToken when more results exist), and respects the limit parameter Given a client with policies.write When it calls POST, PUT, PATCH, or DELETE on /roles and /assignments with valid payloads Then the API performs the operation and returns 201/200 and ETag headers; on precondition failure with If-Match, returns 412 Given a client exceeds rate limits When it continues to call endpoints Then the API returns 429 with rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) Given a client with evaluation.invoke When it calls POST /evaluate with an action (userId, actionType, channel, geography, severity, contentType) Then the API returns 200 with decision Allow/Deny and rationale including matched role IDs and scopes Given a client calls GET /sync/status When the system is healthy Then the response includes lastSyncTime, source system identifiers, and healthy=true
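A client consuming these endpoints would follow nextPageToken until exhausted and back off on 429 using the rate-limit headers. The sketch below injects a `fetch` transport so it stays self-contained; the endpoint path, header names, and response shape follow the criteria above but are not a published OutageKit API.

```python
import time

def list_all_roles(fetch, limit=100, sleep=time.sleep):
    items, token = [], None
    while True:
        resp = fetch("/roles", {"limit": limit, "pageToken": token})
        if resp["status"] == 429:
            # Wait until the rate-limit window resets, then retry the page;
            # the token is unchanged, so no records are skipped or repeated.
            sleep(float(resp["headers"].get("X-RateLimit-Reset", 1)))
            continue
        items.extend(resp["body"]["items"])
        token = resp["body"].get("nextPageToken")
        if not token:
            return items
```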

Timeboxed Overrides

Break-glass access for emergencies requires MFA, justification, and a set duration, with automatic rollback when the window expires. Enables fast action during storms while preserving guardrails, visibility, and a clean audit trail for every exception.

Requirements

MFA-Gated Break-Glass Initiation
"As an on-call incident commander, I want to initiate break-glass with MFA so that I can act quickly during emergencies while keeping access secure."
Description

Enforces step-up authentication when initiating an emergency override. Supports enterprise IdP integration (SAML/OIDC) and multiple MFA factors (WebAuthn/FIDO2, TOTP, IdP push; optional SMS OTP per policy). Presents a dedicated break-glass initiation flow in UI and API, validates active incident context, and rate-limits attempts. Records actor identity, factor type, device/browser fingerprint, and source IP for traceability. Integrates with OutageKit’s role model to ensure only designated roles can attempt overrides and that sessions are elevated only for the approved scope and timebox.

Acceptance Criteria
UI Break-Glass Step-Up with Factor Selection
Given a signed-in user with a designated break-glass role navigates to the Break-Glass Initiation UI When they select an active incident, enter a justification and requested duration, and click Initiate Then the system requires step-up authentication regardless of prior login state And the user is presented only with policy-allowed MFA factors (WebAuthn/FIDO2, TOTP, IdP push, SMS OTP if enabled) And upon successful factor verification, the system confirms initiation and shows the approved scope and expiration timestamp And if factor verification fails, initiation is not created and a non-enumerating error message is shown
API Step-Up via IdP (SAML/OIDC)
Given an organization configured for SAML or OIDC federation with step-up supported by the IdP When a client calls POST /api/break-glass/initiate without a valid step-up context Then the response is 401 Unauthorized with a step-up challenge that includes a transaction_id and IdP redirect URL (SAML AuthnContext or OIDC acr_values requesting MFA) And when the client completes IdP MFA and retries with the transaction_id Then the API responds 201 Created with an elevation_token scoped to the requested incident and timebox and an expires_at timestamp And if the IdP denies or times out, the API responds 403 Forbidden and no elevation is created
Policy-Driven MFA Factor Availability
Given an organization security policy that disables SMS OTP and enables WebAuthn, TOTP, and IdP push When a user initiates break-glass in UI or API Then SMS OTP is not offered as a factor option anywhere And WebAuthn, TOTP, and IdP push are offered when technically available on the device/client And attempts to force a disallowed factor via API are rejected with 400 Bad Request and are audited
Role-Based Authorization and Scoped Timebox Elevation
Given a user without a designated break-glass role When they attempt to access the Break-Glass Initiation UI or API Then access is denied with 403 Forbidden and an audit event is recorded Given a user with a designated break-glass role limited to Distribution Ops scope and max 2 hours When they request a scope outside Distribution Ops or a duration > 2 hours Then the request is rejected with validation errors and no elevation is created Given a user with appropriate role requests an allowed scope and duration When initiation succeeds Then the resulting elevated session/token carries claims restricting actions to the approved scope and expires at the approved time, after which calls are denied with 401/403 until a new initiation is completed
Active Incident Context Validation
Given there is no active incident in the selected region or the incident is resolved/archived When a user attempts to initiate break-glass Then the system blocks initiation with a clear error indicating no valid active incident context, and no elevation is created Given a valid active incident ID within the user’s organization When the user submits the initiation Then the system validates ownership and status and proceeds; cross-tenant or invalid IDs are rejected with 404/403 as appropriate
Attempt Rate Limiting and Lockout Messaging
Given a rate-limit policy of N attempts per user per T minutes and M attempts per source IP per H minutes When a user or IP exceeds the configured thresholds for break-glass initiation or MFA verification Then further attempts are blocked until the window resets, the API returns 429 Too Many Requests with Retry-After, and the UI shows a generic rate-limit message And each blocked attempt is audited without revealing which factor failed
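The N-attempts-per-T-window policy can be sketched as a sliding-window limiter, applied independently under a per-user key and a per-IP key. Thresholds and key formats are illustrative policy values.

```python
import time
from collections import defaultdict, deque

class AttemptLimiter:
    def __init__(self, max_attempts: int, window_seconds: float):
        self.max_attempts = max_attempts
        self.window = window_seconds
        self.attempts = defaultdict(deque)  # key -> timestamps in window

    def allow(self, key: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.attempts[key]
        while q and now - q[0] >= self.window:
            q.popleft()  # drop attempts that have aged out of the window
        if len(q) >= self.max_attempts:
            return False  # caller answers 429 with Retry-After
        q.append(now)
        return True
```

A blocked call records nothing about which MFA factor failed, matching the non-enumerating error requirement; the audit entry is written by the caller.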
Audit Trail Completeness and Integrity
Given any break-glass initiation attempt (success or failure) When the attempt completes Then an immutable audit record is written containing: actor user ID and role, authenticated IdP, factor type used, device/browser fingerprint, source IP, timestamp, incident ID, requested scope and duration, result (success/failure) with reason code, and correlation ID And when initiation succeeds Then the audit record links to the elevation session/token ID and the UI/API can retrieve it by correlation ID for traceability
Justification & Policy-Based Approval
"As an operations manager, I want to provide a justification and get policy-driven approval so that emergency overrides are accountable and compliant without slowing response."
Description

Requires a structured justification (free text, incident ID, severity, affected regions, expected actions) before an override can start. Evaluates org-defined policies to auto-approve certain conditions (e.g., declared storm, P1 outage) or route to approvers (duty lead, security) with time-bound SLAs. Supports one-click approvals via email/Slack and UI with full context, and captures approver identity and rationale. Falls back to post-facto review if policy permits immediate auto-start. Integrates with OutageKit’s incident objects to bind overrides to specific events for reporting and accountability.

Acceptance Criteria
Override Start Requires Structured Justification
Given a user initiates a timeboxed override request When they attempt to submit without all required justification fields (free-text justification, incident ID, severity, affected regions, expected actions) Then the system blocks submission, highlights missing/invalid fields, and does not create an override request ID Given all required fields are present and valid (incident ID exists, severity is from the org-defined set, affected regions match configured regions) When the user submits Then the request is accepted, a unique request ID is created, and the justification is stored immutably with that ID
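A minimal validator for the structured justification above — all required fields present, incident ID known, severity and regions drawn from org configuration — might look like this. Field names mirror the criterion; the error-code format is an assumption.

```python
REQUIRED = ("justification", "incident_id", "severity",
            "affected_regions", "expected_actions")

def validate_justification(payload, known_incidents, severities, regions):
    # Missing fields block submission outright; no request ID is issued.
    missing = [f for f in REQUIRED if not payload.get(f)]
    if missing:
        return {"ok": False, "errors": [f"missing:{f}" for f in missing]}
    errors = []
    if payload["incident_id"] not in known_incidents:
        errors.append("invalid:incident_id")
    if payload["severity"] not in severities:
        errors.append("invalid:severity")
    if not set(payload["affected_regions"]) <= regions:
        errors.append("invalid:affected_regions")
    return {"ok": not errors, "errors": errors}
```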
Policy Auto-Approval for Declared Storm or P1
Given org policy defines auto-approval for declared storms and/or P1 incidents And the linked incident is flagged as a declared storm or severity P1 When a valid override request is submitted Then the system auto-approves the request without manual approver action and marks the decision source as policy-engine And the audit record includes the matched policy rule ID(s), evaluation timestamp, and inputs used
Approval Routing with SLA and Escalation
Given org policy requires approvals (e.g., Duty Lead and Security) with configured SLAs When a valid override request enters the approval workflow Then the named approvers are notified in UI, email, and Slack with full context and one-click Approve/Deny actions And the request shows per-approver SLA countdown timers in the UI When an approver SLA expires without action Then the system escalates to the configured alternate approver(s), records the escalation event, and re-notifies And if all required approvers approve within SLA, the request state becomes Approved; if any Deny, the request state becomes Denied
One-Click Approval via Email/Slack Captures Identity and Rationale
Given an approver receives a one-click Approve/Deny link in email or Slack When they use the link and submit a decision with a mandatory rationale Then the system records approver identity (user ID, role), channel (email/Slack/UI), decision, rationale, timestamp, and requester-visible comment And the decision link becomes single-use and is invalid after first use or after the configured TTL And the requester and other approvers are notified of the decision and current state
Immediate Auto-Start with Post-Facto Review
Given org policy permits immediate auto-start with post-facto review for specified conditions When a valid override request meets those conditions Then the override starts immediately in Provisional state and the approver queue is bypassed And a post-facto review task is created for designated approvers with a configured due time and reminders When the review is completed as Ratified within the due time Then the override is marked Ratified with audit entry referencing the review decision When the review is Denied or overdue per policy Then the system flags non-compliance, notifies compliance owners, and terminates the override if still active
Override Bound to Incident Object for Reporting
Given the request includes an OutageKit incident ID When the override is created Then the override is linked to the incident timeline and analytics, and appears in incident-based reports with request and decision details And creating an override without an incident is blocked unless policy allows no-incident overrides; if allowed, the system creates a placeholder incident and flags it for reconciliation
Comprehensive Audit Trail of Justification, Policy, and Approvals
Given any state change or action occurs on an override request (submit, auto-approve, manual approve/deny, escalation, start, ratify, terminate) Then an immutable audit event is written including: request ID, incident ID, actor (user or policy-engine), action, inputs (justification snapshot, policy rule IDs evaluated and outcomes), timestamps, and previous/new states And audit events are queryable by request ID and incident ID, exportable in CSV/JSON, and visible to authorized roles And audit events cannot be edited or deleted; corrections are logged as new append-only events with references to the corrected event
Configurable Timebox & Extension Rules
"As a security admin, I want configurable duration and extension rules so that overrides expire automatically and cannot persist beyond policy limits."
Description

Provides admin-defined default and maximum override durations by role, environment, and action type (e.g., broadcast limits vs. template edits). Displays a visible countdown and enforces automatic expiry. Supports controlled extension requests requiring renewed MFA and updated justification; applies stricter caps under normal operations and relaxed caps during declared incidents as per policy. Prevents silent lingering by notifying stakeholders before expiry and logging any extensions with reasons. Applies consistently across UI, API, and CLI.

Acceptance Criteria
Admin-Defined Defaults and Maximums by Role/Env/Action
Given an admin configured default=30m and max=120m for role=Ops Manager, env=Production, action=BroadcastLimitOverride When an Ops Manager starts a break-glass override via UI without entering a duration Then the override is created with duration=30m and the confirmation shows expiresAt within ±5s of now+30m Given the same policy caps When the same user requests a 180m override via API Then the request is rejected with HTTP 422 and error "duration_exceeds_max" and no override record is created Given the same policy caps When the same user requests a 90m override via CLI Then the override is created with duration=90m and the audit log records policyCaps {default:30m, max:120m}
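The cap lookup in this criterion is keyed by (role, environment, action type): a missing duration falls back to the default, and anything over the maximum is rejected with the "duration_exceeds_max" error regardless of channel (UI, API, or CLI). The policy-table shape is an assumption for the sketch.

```python
# Illustrative policy table; real caps would be admin-configured per tenant.
CAPS = {
    ("Ops Manager", "Production", "BroadcastLimitOverride"): {
        "default_minutes": 30,
        "max_minutes": 120,
    },
}

def resolve_duration(role, env, action, requested_minutes=None):
    caps = CAPS[(role, env, action)]
    if requested_minutes is None:
        minutes = caps["default_minutes"]  # no duration entered: use default
    else:
        minutes = requested_minutes
    if minutes > caps["max_minutes"]:
        # Maps to HTTP 422 "duration_exceeds_max" at the API layer.
        raise ValueError("duration_exceeds_max")
    return minutes
```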
Visible Countdown and Auto-Expiry with Rollback
Given an active override with expiresAt=T When the requester views the override in the UI Then a countdown displays remaining time in mm:ss and updates every 1s Given the same override When the API GET /overrides/{id} is called Then the payload includes remainingSeconds that decrements on subsequent calls and matches the UI within ±2s Given the same override When current time reaches T Then elevated permissions are revoked within 5s, override status changes to "Expired", endedAt is set, and privileged tokens/session are invalidated; any privileged action attempted after T is blocked with 403 "override_expired"
Controlled Extension with Renewed MFA and Updated Justification
Given an active override with 5m remaining and policy max of 120m total When the requester clicks Extend, successfully completes MFA, and submits an updated justification ≥ 15 characters for +20m Then the extension is granted, expiresAt increases by 20m, and the audit log captures extension_id, previous_expiresAt, new_expiresAt, MFA method, and justification
Given the same policy caps When an extension request would cause total duration to exceed 120m Then the system denies the request with 403 "extension_not_allowed" and no change to expiresAt
Given an override that already expired When the requester attempts to extend it Then the system denies with 403 "override_expired" and instructs to create a new override
Policy Mode Switching: Normal vs Declared Incident Caps
Given normal mode caps default/max=15m/60m and incident mode caps default/max=30m/180m for role=NOC Analyst, env=Production, action=TemplateEditOverride When an admin toggles Incident Mode to ON at 12:00 Then new overrides created at or after 12:00 use default=30m and max=180m and the mode field="incident" in records
Given an active override created at 12:10 in incident mode with 45m remaining When the requester submits an extension that keeps total ≤180m Then the extension is allowed
Given Incident Mode is turned OFF at 14:00 When a user requests a new override at 14:05 Then normal caps (15m/60m) apply And no existing override is auto-extended And an audit entry records the mode change with actor, timestamp, and reason
Pre-Expiry Stakeholder Notifications
Given an active override expiring at T with notification policy of 5m and 1m pre-expiry When current time reaches T-5m Then the requester and designated stakeholders receive notifications via email and SMS containing override_id, role, env, action, and remaining time
Given the same override When current time reaches T-1m Then a second notification is sent via the same channels unless already acknowledged within the last 60s
Given an extension is granted before T When expiresAt changes Then pre-expiry notifications are rescheduled to the new T and no duplicate 5m alerts are sent within 60s
Given a notification delivery attempt fails When a retry policy is configured Then at least one retry occurs within 60s and failures are logged with channel and error code
Comprehensive Audit Trail for Overrides and Extensions
Given an override is created or extended via UI, API, or CLI When the operation completes (success or failure) Then an immutable audit record is written within 2s containing override_id, actor, role, env, action_type, channel, mode (normal/incident), justification, MFA method, requestedDuration, effectiveDuration, createdAt, expiresAt, policyCaps, outcome (success/failure), and error code if any
Given audit records exist When querying by override_id, actor, time range, or channel Then matching records are returned within 2s and can be exported as CSV and JSON
Given audit integrity requirements When attempting to modify an existing audit record via any interface Then the attempt is rejected with 403 and a new audit entry is created noting the forbidden attempt
Consistency Across UI, API, and CLI
Given identical policy caps for a role/env/action When creating overrides and extensions via UI, API, and CLI with equivalent inputs Then duration enforcement, error codes/messages, audit fields, and countdown semantics are identical across channels
Given an override is created via API When the UI or CLI fetches the override within 5s Then it shows the active state and remaining time consistent within ±2s of the API value
Given the API supports idempotency keys for creation When duplicate create requests with the same idempotency key arrive within 60s Then only one override is created and subsequent responses return the original resource
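The idempotency-key behavior above might look like this in-memory sketch; a real service would use a database with a unique constraint on the key. The window length and store shape are assumptions taken from the criteria.

```python
import time


class OverrideStore:
    """Illustrative idempotent creation: duplicate requests with the same
    key inside the window return the original resource."""

    IDEMPOTENCY_WINDOW_S = 60

    def __init__(self):
        self._by_key = {}  # idempotency_key -> (created_at, override record)
        self._next_id = 1

    def create(self, idempotency_key: str, payload: dict) -> dict:
        now = time.monotonic()
        hit = self._by_key.get(idempotency_key)
        if hit and now - hit[0] < self.IDEMPOTENCY_WINDOW_S:
            return hit[1]  # replay: return the original resource unchanged
        override = {**payload, "id": self._next_id}
        self._next_id += 1
        self._by_key[idempotency_key] = (now, override)
        return override
```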
Scoped Least-Privilege Elevation
"As a platform engineer, I want overrides to grant only the specific actions needed so that we minimize risk while enabling urgent work."
Description

Grants only the minimum necessary permissions during an override, scoped to specific actions (e.g., bypass message throttling, edit ETA templates, modify geo-targeting) and resources (regions, customer segments). Issues ephemeral, scope-limited tokens/role bindings that work across OutageKit’s UI and APIs. Deny-by-default with explicit allowlists; incompatible actions remain blocked. Provides dry-run validation showing what will be allowed/denied before activation. Integrates with existing permission checks to enforce scope at execution time and logs all access decisions.
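A deny-by-default scope check of the kind described could be sketched as below. The error codes mirror the acceptance criteria; the scope shape and incompatible-action list are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideScope:
    actions: frozenset
    regions: frozenset
    segments: frozenset


# Actions that remain blocked even if requested (illustrative policy).
INCOMPATIBLE_ACTIONS = frozenset({"delete_incident"})


def check(scope: OverrideScope, action: str, region: str, segment: str):
    """Deny by default: only explicit (action, region, segment) matches pass."""
    if action in INCOMPATIBLE_ACTIONS:
        return ("DENY", "ACTION_INCOMPATIBLE")
    if action not in scope.actions:
        return ("DENY", "ACTION_NOT_IN_SCOPE")
    if region not in scope.regions or segment not in scope.segments:
        return ("DENY", "SCOPE_DENIED")
    return ("ALLOW", None)
```

Because the function only returns ALLOW on an exact allowlist match, nothing is granted implicitly, which is the property the criteria below exercise.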

Acceptance Criteria
Scoped override allows only requested actions on selected resources
Given an approved override request with actions [bypass_message_throttling, edit_eta_templates] and resources {region:"North", segments:["Residential"]} When the override is activated Then attempting to bypass message throttling for North Residential succeeds with HTTP 200 and audit event "ALLOW" with scope {region:"North", segment:"Residential", action:"bypass_message_throttling"} And attempting to bypass throttling for any other region or segment returns HTTP 403 with error_code "SCOPE_DENIED" And attempting any non-requested action (e.g., modify_geo_targeting) returns HTTP 403 with error_code "ACTION_NOT_IN_SCOPE" And UI controls for out-of-scope actions are disabled or show a permission error state
Deny-by-default and incompatible actions enforcement
Given a requested override with actions [bypass_message_throttling, delete_incident] and resources {region:"West"}, where delete_incident is marked incompatible When the override is reviewed or activated Then delete_incident remains blocked and returns HTTP 403 with error_code "ACTION_INCOMPATIBLE" and reason "break-glass scope policy" And any action or resource not explicitly allowlisted is denied by default with HTTP 403 and error_code "SCOPE_DENIED" And the system does not broaden scope implicitly (no wildcard regions/segments are granted) And the final active scope contains only explicitly allowed actions/resources
Dry-run preview enumerates allowed and denied operations
Given a proposed override with actions [edit_eta_templates, modify_geo_targeting] and resources {region:"South", segments:["SMB"]} When a dry-run is executed before activation Then the response lists allowed_operations with entries specifying {action, resources} that will be permitted And the response lists denied_operations with entries specifying {action, resources, reason} for each denial (e.g., incompatible, not allowlisted) And no permissions are changed during dry-run (subsequent permission checks remain unchanged) And the dry-run result has a unique ID and is immutable and auditable
Ephemeral scoped token/role binding applies across UI and APIs
Given an override is activated and an ephemeral scoped token/role binding is issued When the user performs an in-scope action via the UI Then the action succeeds and the audit log associates it with the override ID and token ID When the same user or service uses the token via API to perform the same in-scope action Then the API call succeeds with HTTP 200 and the audit log shows identical scope attribution And attempts via UI or API to perform out-of-scope actions return HTTP 403 with error_code "SCOPE_DENIED" using the same enforcement decision engine
Execution-time scope enforcement at resource boundaries
Given an active override scoped to {regions:["East"], segments:["Residential"], actions:[modify_geo_targeting]} When the user attempts to modify geo-targeting for East Residential Then the request is allowed and the decision engine records an ALLOW with resource predicates matching the scope When the user attempts to modify geo-targeting for East Commercial or any non-East region Then the request is denied with HTTP 403 and error_code "RESOURCE_OUT_OF_SCOPE" And the denial occurs at execution time in each service enforcing existing permission checks (no bypass paths) And partial-batch requests are split so that in-scope items succeed and out-of-scope items are rejected with per-item results
Automatic rollback removes elevated access at expiry
Given an override with duration 30 minutes is activated at T0 When T0+30m is reached Then the ephemeral token/role binding is revoked automatically and cannot be used in UI or API And subsequent attempts to use the token return HTTP 401 with error_code "TOKEN_EXPIRED" or 403 with "SCOPE_REVOKED" And affected UI controls revert to baseline permissions without requiring a page reload within 60 seconds And an audit event "OVERRIDE_EXPIRED" is recorded with {override_id, actor, start_time, end_time, revoked_by:"system"}
Audit logging of all override access decisions
Given an override is in effect When any access decision (allow or deny) is made for an action evaluated against the override scope Then an audit record is written containing {timestamp, actor_id, override_id, token_id, action, resource_ref, decision:[ALLOW|DENY], reason_code, service, request_id} And audit records are immutable, queryable by override_id, and available within 5 seconds of the decision And exporting the audit trail for the override returns a complete, ordered sequence of decisions with no gaps And redacted fields (if any) follow the organization’s logging policy without omitting required decision metadata
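The partial-batch behavior required above (in-scope items succeed, out-of-scope items are rejected with per-item results) can be sketched generically; the result shape is an assumption.

```python
def apply_batch(check_item, items):
    """Split a batch by scope: apply in-scope items, reject the rest
    per item instead of failing the whole request."""
    results = []
    for item in items:
        decision, reason = check_item(item)  # e.g., the scope check above
        if decision == "ALLOW":
            results.append({"item": item, "status": "applied"})
        else:
            results.append({"item": item, "status": "rejected", "error_code": reason})
    return results
```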
Automatic Rollback & State Restore
"As an operations manager, I want the system to automatically restore guardrails and configs when the window ends so that we return to safe defaults without manual cleanup."
Description

Captures pre-override configuration snapshots (e.g., notification throttles, approval requirements, template locks) and diffs changes made under an override. On expiry or manual revoke, automatically re-enforces guardrails and reverts eligible changes to the pre-override state in a safe sequence with retries and conflict detection. Flags non-revertible operations and opens a post-incident task for manual review. Ensures broadcasts initiated under override complete, while preventing new actions after expiry. Emits clear UI banners and webhooks when rollback starts, succeeds, or requires intervention.
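The retry-and-classify loop implied by this description might be sketched as follows, using the backoff schedule from the acceptance criteria. Exception names and the injectable `sleep` are illustrative assumptions.

```python
import time


class TransientError(Exception):
    """Retriable failure (e.g., timeout)."""


class ConflictError(Exception):
    """Version mismatch: item was edited outside the override context."""


RETRY_BACKOFFS = (0.5, 1.0, 2.0)  # seconds, per the acceptance criteria


def revert_item(item, revert_fn, sleep=time.sleep) -> str:
    """Idempotent revert with up to 3 retries on transient errors;
    conflicts are skipped so they don't block other reverts."""
    for delay in (0.0,) + RETRY_BACKOFFS:
        if delay:
            sleep(delay)
        try:
            revert_fn(item)  # must be idempotent
            return "reverted"
        except ConflictError:
            return "conflicted"
        except TransientError:
            continue
    return "failed"


def rollback(items, revert_fn, sleep=time.sleep):
    """Revert all eligible items; succeed only if everything reverted."""
    counts = {"reverted": 0, "conflicted": 0, "failed": 0}
    for item in items:
        counts[revert_item(item, revert_fn, sleep)] += 1
    ok = counts["conflicted"] == 0 and counts["failed"] == 0
    return ("succeeded" if ok else "intervention_required"), counts
```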

Acceptance Criteria
Pre-Override Snapshot Capture
Given an approved timeboxed override with MFA, justification, and duration When the override is activated Then the system captures and persists an immutable pre-override snapshot before any changes are applied And the snapshot includes notification throttles, approval requirements, template locks, escalation routes, and rate limits And the snapshot is assigned a snapshot_id linked to override_id, with timestamp and checksum recorded And activation does not complete until the snapshot is successfully persisted And if snapshot capture fails, the override activation is aborted and an error is displayed and logged to the audit trail
Override Change Diff Generation
Given an active override window When any eligible configuration is modified during the window Then a diff entry is recorded capturing resource identifier, field path, old_value, new_value, actor_id, method (UI/API), correlation_id, and timestamp And only changes occurring within the override window are included in the override diff And a consolidated diff report is available within 30 seconds of override expiry or revoke and is exportable as JSON and CSV And the diff report is immutable and associated with snapshot_id and override_id
Automatic Rollback on Expiry or Revoke
Given an override with a recorded snapshot S When the override expires or is manually revoked Then rollback begins within 5 seconds and proceeds in a dependency-aware safe sequence: re-enable guardrails, revert configuration values to S, clear caches, and re-open standard controls And each item revert is idempotent and retried up to 3 times with exponential backoff (0.5s, 1s, 2s) on transient errors And conflicts are detected using version/ETag or last-modified checks; conflicting items are skipped, labeled as conflicts, and do not block other reverts And the overall rollback result is marked succeeded only if all eligible items revert; otherwise it is marked intervention_required with counts of reverted, conflicted, and failed items
Non-Revertible Operations Handling
Given one or more non-revertible operations occurred during the override (e.g., broadcast deliveries initiated, external system side-effects, deleted artifacts without restore points) When rollback executes Then the system does not attempt to revert those operations and flags each as non_revertible with rationale And a post-incident task is created for each non-revertible operation (or one task per resource type when grouped), assigned to the on-call role with a due date within 24 hours And the UI displays a banner and list of non-revertible items with deep links; the audit trail records task_ids and item details
In-Flight Broadcast Completion and Post-Expiry Blocking
Given a broadcast job was initiated before override expiry When the override expires during the broadcast Then the broadcast continues to completion without cancellation and all intended recipients are attempted And any new broadcast initiation attempts after expiry are blocked with HTTP 403 (code: override_expired) and corresponding UI error And scheduled broadcasts with start times after expiry are not executed and are marked cancelled (reason: override_expired) And configuration write attempts after expiry are blocked except for system-initiated rollback operations
Rollback UI Banners and Webhook Emission
Given rollback state changes (started, succeeded, intervention_required) When these states occur Then a UI banner is displayed within 2 seconds with severity, override_id, snapshot_id, and counts (reverted, conflicted, failed, non_revertible) And a webhook is emitted for each state with topics rollback.started, rollback.succeeded, rollback.intervention_required including payload fields: override_id, snapshot_id, started_at, completed_at (if applicable), counts, and status And webhooks are signed, retried up to 10 times over 24 hours with exponential backoff, and include an idempotency key to prevent duplicates And webhook failures are visible in an admin log with last error and next retry time
Conflict Detection and Partial Rollback Handling
Given one or more targeted settings have been modified after the override window by users outside the override context When rollback attempts to revert those settings Then compare-and-swap or version checks prevent overwrite if the current version does not match the snapshot baseline And the system records a conflict entry per item with current_value, snapshot_value, last_editor, last_modified_at, and a remediation recommendation And rollback continues for non-conflicting items and publishes a summary with counts and links to conflicted items for manual resolution
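The compare-and-swap guard above can be illustrated with a versioned key-value store; the `(version, value)` tuple shape is an assumption standing in for an ETag or last-modified check.

```python
def revert_with_cas(store: dict, key: str, snapshot_value, snapshot_version: int):
    """Revert only if the current version still matches the snapshot baseline;
    otherwise record a conflict instead of overwriting a later edit."""
    current_version, current_value = store[key]
    if current_version != snapshot_version:
        return {
            "status": "conflict",
            "current_value": current_value,
            "snapshot_value": snapshot_value,
        }
    store[key] = (current_version + 1, snapshot_value)
    return {"status": "reverted"}
```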
Real-time Override Visibility & Alerts
"As a duty lead, I want immediate visibility and alerts for active overrides so that the team can coordinate and intervene if something looks risky."
Description

Surfaces active overrides with a prominent UI banner, countdown timer, and activity feed of actions executed under the override. Sends real-time alerts to on-call channels (SMS, email, Slack/Teams) on start, extension, and expiry. Offers a dashboard listing current and recent overrides by incident, owner, scope, and remaining time. Allows authorized users to terminate early or request extensions from the alert itself. Provides webhooks/stream events for SOC/SIEM and integrates with incident rooms for shared awareness.

Acceptance Criteria
Active Override Banner and Countdown Visibility
Given an override is active for an incident When any authenticated console user loads any page in OutageKit Then a persistent, high-contrast banner appears at the top within 2 seconds and cannot be dismissed while the override remains active And the banner displays owner, incident, scope, justification summary, start time, and a remaining-time countdown And the countdown updates at least once per second and is accurate to ±1 second And the banner includes a visible link to “View activity” and deep-links to the override detail And authorization is enforced: “Terminate” and “Request extension” controls are shown only to users with the required role
Automatic Expiry and Countdown Behavior
Given an active override with an expiry time T When the countdown reaches T Then the system automatically revokes elevated access and rolls back the override state within 5 seconds And the UI banner is removed within 5 seconds and an activity entry “override.expired” is recorded with timestamp And a final expiry alert is sent to all configured channels And the countdown never displays negative values And if the override is extended before T, the banner countdown updates immediately and the previous and new expiry times are logged
Real-time Activity Feed During Override
Given an override is active When any action is executed under the override (e.g., config change, data export, termination/extension request) Then an activity entry is appended within 3 seconds including timestamp (UTC), actor, action, target, outcome (success/failure), and redacted parameters And entries are strictly scoped to the override window (only actions while active are shown) And the feed streams live without page refresh and supports filtering by actor and action type And 99.9% of actions during the override window are captured and persisted for at least 30 days
Start/Extend/Expiry Alerts to On-call Channels
Given on-call notification channels (SMS, email, Slack/Teams) are configured When an override starts, is extended, or expires Then exactly one alert per event per channel is sent with incident, owner, scope, justification, start time, previous/new expiry, and deep link to details And delivery SLOs are met: Slack/Teams ≤ 10s median, SMS ≤ 60s median, email ≤ 60s median from event time And failed deliveries are retried up to 3 times with exponential backoff and are traceable via message IDs And alerts are logged in the activity feed with delivery status per channel
Actionable Alerts: Early Termination and Extension
Given an alert is received in Slack/Teams, email, or SMS When an authorized user selects “Terminate now” or “Request extension” from the alert Then the user’s authorization is validated and an MFA challenge is required if not satisfied in the last 8 hours And “Terminate now” ends the override within 5 seconds and posts a confirmation to the originating channel/thread And “Request extension” collects justification and a new duration within policy limits and applies it within 5 seconds, posting confirmation with the new expiry And action links/tokens expire after 10 minutes and are single-use; unauthorized attempts are denied and audited
Override Dashboard Listing and Controls
Given a user with dashboard access opens the Overrides dashboard When the page loads Then current overrides and those from the past 30 days are listed with columns: incident, owner, scope, status (active/expired/terminated), start time, expiry, and remaining time And list supports sort and filter by incident, owner, scope, status, and time range and returns results within 2 seconds for up to 200 rows And selecting a row opens details with the live activity feed and countdown And authorized users can terminate early or request an extension from the dashboard; unauthorized users cannot see these controls
Webhooks/Stream Events and Incident Room Integration
Given a webhook endpoint and/or event stream subscription is configured When an override starts, extends, expires, terminates, or logs an action Then an event (override.started, override.extended, override.expired, override.terminated, override.activity) is emitted within 10 seconds with payload containing override ID, incident, owner, scope, timestamps, and correlation/idempotency keys And webhook requests are HMAC-SHA256 signed; non-2xx responses are retried up to 5 times with exponential backoff; per-override ordering is preserved And an incident room (Slack/Teams channel) linked to the incident receives threaded posts for start, extension, and expiry with a live countdown link and can be muted per incident
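The HMAC-SHA256 signing called for above could look like this minimal sketch (canonical JSON plus a hex digest header; retry and ordering logic are omitted):

```python
import hashlib
import hmac
import json


def sign_webhook(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Serialize the event canonically and sign the exact bytes sent."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: constant-time comparison against the recomputed MAC."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

`hmac.compare_digest` avoids timing side channels when comparing signatures; the signature would typically travel in a request header alongside the idempotency key.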
Tamper-Evident Audit Trail & Export
"As a compliance officer, I want a tamper-evident audit trail of overrides so that we can satisfy audits and investigate exceptions with confidence."
Description

Produces an immutable, tamper-evident log for each override: initiation details, MFA factor, justification, approvals, scope, actions taken, configuration diffs, extensions, expiry, and rollback outcomes. Uses hash-chaining and time-stamping to detect alteration, with secure retention policies. Supports export to SIEM/archival via API, syslog/webhook, and downloadable reports filtered by incident or time range. Redacts secrets but preserves evidence fidelity to meet regulatory and internal audit needs. Correlates entries to incident timelines within OutageKit for end-to-end traceability.
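The hash-chaining scheme described here can be sketched directly: each entry carries the SHA-256 of its canonical predecessor, so altering any record breaks verification at that index (matching the criteria below). Field names follow the spec; the genesis hash is an assumption.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed sentinel for the first entry


def _canonical(record: dict) -> bytes:
    return json.dumps(record, separators=(",", ":"), sort_keys=True).encode()


def append_entry(chain: list, entry: dict) -> dict:
    """Link a new audit entry to the chain and stamp its own hash."""
    prev_hash = chain[-1]["entryHash"] if chain else GENESIS_HASH
    record = {"prevHash": prev_hash, **entry}
    record["entryHash"] = hashlib.sha256(_canonical(
        {k: v for k, v in record.items() if k != "entryHash"}
    )).hexdigest()
    chain.append(record)
    return record


def verify_chain(chain: list):
    """Return the index of the first invalid entry, or None if intact."""
    prev_hash = GENESIS_HASH
    for i, record in enumerate(chain):
        body = {k: v for k, v in record.items() if k != "entryHash"}
        if body.get("prevHash") != prev_hash:
            return i
        if hashlib.sha256(_canonical(body)).hexdigest() != record["entryHash"]:
            return i
        prev_hash = record["entryHash"]
    return None
```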

Acceptance Criteria
Override Initiation Entry Captures Required Metadata
Given a privileged user initiates a timeboxed override with required MFA, justification, and approvals When the override is created Then an immutable audit entry is committed before the override becomes active and includes: overrideId, actorId, actorRole, sourceIp, userAgent, requestId, mfaFactorType, mfaOutcome, justification, approvalIds with timestamps, overrideScope (resources/permissions), startTimestamp (UTC RFC3339), requestedDuration, and relatedIncidentIds (if any)
Action and Configuration Diff Logging During Override Window
Given an active override window When the user performs any privileged action or changes configuration within the defined scope Then each action is logged as a separate audit entry with actorId, timestamp, resource, operation, outcome, and a canonical before/after diff (with secrets redacted) and entries are strictly ordered by sequence number
Override Extensions, Expiry, and Automatic Rollback Outcomes Recorded
Given an existing timeboxed override When an extension is requested and approved Then a new audit entry records priorExpiry, newExpiry, approverId, justification, and timestamp When the override expires Then an audit entry records automatic rollback outcome including success/failure, itemsReverted count, durationMs, and errorDetails (if any)
Tamper-Evident Hash Chain and Time-Stamp Verification
Given a set of audit entries for a single override When integrity verification is executed Then each entry contains prevHash (SHA-256 of the canonical previous entry), entryHash, and a UTC RFC3339 timestamp with non-decreasing order And the verification succeeds for an unmodified log and fails with the index of the first invalid entry if any entry is altered
Secure Retention and Access Controls for Audit Log
Given a retentionPolicyYears is configured When audit entries are written Then entries are stored in WORM-compliant storage, encrypted at rest and in transit, retained for at least retentionPolicyYears, and cannot be modified or deleted by users And any administrative purge requires dual authorization, is logged, and preserves a purge receipt with hashes And all read/export access is authorized via RBAC and individually logged
API Export with Filtering, Redaction, Pagination, and Signing
Given a user with audit.export permission When they call the Audit Export API with incidentId and/or timeRange filters Then the response contains only matching entries in canonical JSON with stable redaction tokens for secrets, includes pagination (limit, cursor), and an HMAC-SHA256 signature header over the payload And the API enforces rate limits and returns 401/403 for unauthorized requests
Streaming and Downloadable Reports (Syslog/Webhook/CSV)
Given export destinations are configured When streaming is enabled Then entries are sent to syslog over TLS in RFC5424 format and to webhooks signed with a shared secret, with retry and exponential backoff on failures When a user requests a downloadable report with filters (incidentId or timeRange) Then a CSV file is generated containing matching entries, preserving redactions and including a chain verification checksum
Correlation to Incident Timeline View
Given an incident with related overrides When a user opens the incident timeline in OutageKit Then all related override audit entries appear in temporal order with deep links to the underlying entries and can be filtered (e.g., show only overrides) and exported while preserving incident correlation identifiers

Context Snapshot

Locks the exact message, targeting, affected clusters, map extent, and evidence at approval request time so approvers review a frozen, consistent view. Eliminates last‑second drift, ensures everyone approves the same payload, and reduces retractions.

Requirements

Immutable Snapshot Capture
"As an operations manager, I want the system to freeze the exact payload and evidence at approval time so that approvers review a consistent, unchanging context."
Description

On approval request, capture and freeze the full broadcast context into an immutable snapshot: message body and localization variants, channel selections (SMS, email, IVR), targeting rules and resolved recipient sets, affected outage clusters (IDs and attributes), map extent (bounds and zoom), ETA values and source, evidence attachments/links with checksums, and model/build versions used for clustering/ETAs. Assign a unique Snapshot ID, compute a content hash, record timestamps, requesting user, environment, and incident linkage. Persist synchronously so approvers always load the exact frozen payload and visuals, eliminating last‑second drift.
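The content hash mentioned above needs a canonical serialization so that key ordering cannot change the digest; the criteria below depend on identical drafts hashing identically. A minimal sketch, assuming canonical JSON and SHA-256:

```python
import hashlib
import json


def content_hash(snapshot: dict) -> str:
    """Hash over canonical JSON: sorted keys, no whitespace, UTF-8 bytes,
    so logically equal snapshots always produce the same digest."""
    canonical = json.dumps(
        snapshot, separators=(",", ":"), sort_keys=True, ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```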

Acceptance Criteria
Atomic Snapshot on Approval Request
Given a draft broadcast with defined message variants, channels, targeting, clusters, map extent, ETAs, and evidence And the draft is editable When the requester clicks "Request Approval" Then the system creates a snapshot record atomically and returns a Snapshot ID and content hash And the approval view loads the snapshot by Snapshot ID And subsequent edits to the draft do not affect the approval view or the snapshot payload And re-opening the approval view always resolves to the same snapshot hash And attempting to start an approval without a successful snapshot returns HTTP 409 with an actionable error
Snapshot Field Completeness and Structure
Given a newly created snapshot Then the snapshot includes: message body and all localization variants; channel selections; targeting rules and the materialized recipient set; affected cluster IDs and required attributes; map bounds and zoom; ETA values and source; evidence attachments/links each with checksum; model/build versions for clustering and ETA; requesting user; environment; incident linkage; created/updated timestamps; Snapshot ID; content hash And all required fields are populated and pass schema validation And the materialized recipient count and IDs are stored and match the approval UI numbers And the snapshot is retrievable via API by Snapshot ID and returns HTTP 200
Content Hash and Immutable Enforcement
Given an existing snapshot When the same draft state is snapshotted again with no changes Then the content hash is identical When any source field changes and a new snapshot is created Then the content hash differs And the snapshot record is write-protected; any update attempts are denied with HTTP 403 and are fully audited And retention policies may delete snapshots, but no in-place mutation is permitted
Recipient Resolution Freezing
Given a snapshot with a resolved recipient set When the directory or targeting rules change after snapshot creation Then the snapshot's recipient IDs and counts remain unchanged in the approval UI and at send time And a dry-run send using the snapshot emits to exactly that set (zero drift) And the UI surfaces a non-blocking notice if live targeting no longer matches the snapshot
Synchronous Persistence and Performance
Given a draft with <= 50,000 resolved recipients, <= 500 affected clusters, and <= 20 evidence items totaling <= 200 MB And <= 5 concurrent approval requests per tenant under normal load When "Request Approval" is initiated Then p95 snapshot creation latency is <= 2.0 seconds and p99 <= 5.0 seconds (server-side) And the approval UI does not render until the snapshot is fully persisted And on any persistence failure, no partial snapshot is readable; the operation fails with a retriable error and cleanup occurs automatically
Evidence Integrity and Checksums
Given a snapshot containing evidence attachments and links Then each attachment stores checksum (e.g., SHA-256) and byte size; on download, checksum verification passes And each link evidence stores URL, fetch timestamp, and stable title/preview; if the target later 404s, the snapshot still renders with stored metadata And attachments/links are read-only via snapshot APIs; replace or delete attempts return HTTP 403
Version Provenance Pinning
Given clustering and ETA model/build identifiers stored in the snapshot When those models/builds are upgraded in the environment after snapshot creation Then the approval view renders results from the versions recorded in the snapshot And sending from the approval uses the versions recorded in the snapshot And the snapshot displays the exact model/build identifiers and timestamps for auditability
Tamper‑Evident Snapshot Artifact Storage
"As a compliance officer, I want snapshots to be tamper‑evident and securely stored so that we can audit approvals and prove what was approved."
Description

Generate a signed JSON snapshot artifact and store it along with any binary evidence in encrypted, access‑controlled storage. Include the content hash, signature, signer key ID, creation timestamp, and retention policy metadata. Enforce role‑based access, redact designated PII fields, and support geo‑replication. Provide low‑latency retrieval for review, and ensure write‑once semantics for the artifact while allowing non‑destructive metadata updates (e.g., approval outcome).

Acceptance Criteria
Signed Snapshot Artifact Content and Verification
Given a context snapshot is approved for capture When the system generates the snapshot Then the stored JSON artifact includes fields: contentHash (SHA-256 hex), signature (base64), signerKeyId, createdAt (RFC 3339 UTC), and retentionPolicy (name, durationDays) And the contentHash equals the SHA-256 of the stored payload bytes And signature verification using signerKeyId from the configured KMS succeeds for the stored payload and contentHash And altering any byte of the stored payload or metadata causes signature verification to fail and a verification status of "invalid" is returned
Encrypted Storage and Role-Based Access Controls
Given an artifact and any associated binary evidence are persisted When data is written to storage Then server-side encryption with KMS-managed keys is applied and recorded in object metadata And RBAC is enforced: Approver and Auditor can read; System can create; Admin can update metadata; all others receive 403 on read/write/delete And overwrite of existing artifact content is denied with a conflict or precondition failure And all access attempts are audit-logged with subject, action, outcome, and timestamp
Write-Once Artifact with Non-Destructive Metadata Updates
Given an artifact content has been stored When a client attempts to modify the artifact content bytes Then the operation is rejected and the original contentHash remains unchanged When a client updates allowed metadata fields (e.g., approvalOutcome, notes) Then a new metadata version is appended (monotonic metadataVersion increment) without altering the content bytes And GET returns the immutable content with the latest metadata view And the audit log shows the content unchanged across metadata updates
Designated PII Redaction on Persist
Given a configured list of PII field paths and classifier rules When a snapshot is generated for storage Then all designated PII fields are irreversibly redacted or masked in the JSON payload prior to hashing and signing And a redaction report is included in metadata listing fields redacted and the rule version applied And verification confirms no original PII values are present in the stored artifact by exact-match search And non-designated fields are preserved unmodified
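Redaction before hashing/signing, as required above, might be sketched like this; dotted field paths, the mask token, and the report shape are illustrative assumptions.

```python
import copy


def redact(snapshot: dict, pii_paths: list, mask: str = "[REDACTED]"):
    """Irreversibly mask designated dotted field paths in a copy of the
    snapshot, returning the redacted payload plus a redaction report.
    Must run before content hashing and signing."""
    redacted = copy.deepcopy(snapshot)
    report = []
    for path in pii_paths:
        node = redacted
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if isinstance(node, dict) and leaf in node:
            node[leaf] = mask
            report.append(path)
    return redacted, report
```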
Geo-Replication and Consistency of Artifacts
Given the primary region is operational When an artifact is stored Then it is replicated to at least two configured regions and becomes readable in those regions within 2 minutes p99 And the contentHash and signature bytes are identical across all replicas And if the primary region is unavailable, reads from a secondary region succeed within 30 seconds of failover initiation
Low-Latency Retrieval for Approval Review
Given an artifact <= 200 KB with associated evidence totaling <= 10 MB When an approver retrieves the snapshot within the same region under 100 concurrent requests Then p95 time-to-first-byte is <= 150 ms and p95 total download time is <= 1.5 s And the retrieved payload, evidence references, and metadata exactly match the stored artifact version
Retention Policy and Legal Hold Enforcement
Given retentionPolicy metadata (name, durationDays, legalHold flag) is set at artifact creation When the retention duration elapses and legalHold=false Then the artifact and evidence are purged or transitioned per policy within 24 hours and a deletion audit record is written And if legalHold=true, deletion is prevented and a blocked-deletion event is logged And after deletion, retrieval returns 404 and an admin-only tombstone remains for 30 days with contentHash and deletion timestamp
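The purge decision above reduces to a small pure function; this sketch assumes the purge job evaluates each artifact against its retention metadata (the return labels are illustrative):

```python
from datetime import datetime, timedelta, timezone

def retention_action(created_at: datetime, duration_days: int,
                     legal_hold: bool, now: datetime) -> str:
    """Decide what the retention job should do with an artifact.
    A 'blocked_by_legal_hold' outcome would be audit-logged as a
    blocked-deletion event; 'purge' triggers deletion plus a tombstone."""
    expired = now >= created_at + timedelta(days=duration_days)
    if not expired:
        return "retain"
    return "blocked_by_legal_hold" if legal_hold else "purge"
```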
Frozen Approval Review UI
"As an approver, I want to review an uneditable snapshot with all context so that my decision is based on a consistent payload across devices."
Description

Render a read‑only approval screen that loads the snapshot artifact (not live data) and displays: message preview by channel, recipient counts from the frozen targeting, affected cluster overlay within the frozen map extent, ETAs, and linked evidence. Disable edits, clearly label the snapshot timestamp and ID, and provide approve/reject actions with comment capture. Ensure consistent rendering across web and mobile, with accessibility compliance and deterministic map tiles for the stored extent.

Acceptance Criteria
Load Snapshot Artifact (Not Live Data)
Given a valid Snapshot ID is provided When the approver opens the Frozen Approval Review UI Then all displayed data (messages, counts, clusters, map tiles, ETAs, evidence) is loaded exclusively from the snapshot artifact and no live data endpoints are queried (excluding a single GET to retrieve the snapshot)
Given live incident data changes after the UI is opened When the approver remains on the page or refreshes using the same Snapshot ID Then the displayed data remains unchanged and matches the snapshot artifact checksum
Given the snapshot loads successfully When the UI renders Then the snapshot timestamp (ISO 8601 with timezone) and Snapshot ID are visible in the header within 1 second of load
Read-Only UI and Controls Disabled
Given the Frozen Approval Review UI is open When the approver attempts to type, paste, drag-drop, or edit any field Then all input controls are disabled/read-only and no values change
Given the Frozen Approval Review UI is open When any client action would trigger POST/PUT/PATCH to modify message, targeting, map, ETAs, or evidence Then no such requests are sent and no server-side mutations occur
Given the Frozen Approval Review UI is open When the approver hovers the header info icon Then a tooltip indicates the view is a read-only snapshot
Channel Previews and Frozen Recipient Counts
Given a snapshot contains channel payloads (SMS, Email, Voice) When the UI renders Then a preview for each included channel is shown with variables resolved from the snapshot and excluded channels display "Not included"
Given frozen targeting results exist in the snapshot When the UI displays recipient counts Then counts are sourced from the snapshot’s stored recipient set cardinality and do not change if live targeting groups change
Given counts are displayed When values are formatted Then thousands separators are applied and zero values appear as 0
Frozen Map Extent with Deterministic Tiles and Cluster Overlay
Given the snapshot includes a bounding box, zoom, style, and tile seed When the map renders Then panning and zooming are disabled and the view is fixed to the stored extent
Given affected cluster geometries are stored in the snapshot When the overlay renders Then only those clusters are displayed and their IDs and shapes match the snapshot exactly
Given tiles are requested for the stored extent When the UI loads on different devices/browsers Then tile URLs include the stored style/version/hash and the resulting image checksums are identical for the same snapshot
Given tile requests fail for any reason When the map renders Then a cached image of the stored extent is shown with an error badge and the overlay still renders
Evidence Links and ETA Display
Given the snapshot contains evidence items When the UI renders Then each evidence item shows title, source type, and immutable URL and opens in a new tab/window on click
Given ETAs are stored per cluster or globally in the snapshot When the UI renders Then ETAs are displayed exactly as stored without recalculation, using the format "ETA HH:MM TZ" or "Unknown" if null
Given an evidence URL returns 4xx/5xx or times out When the list renders Then the item remains visible with an Unavailable badge and no substitution or deletion occurs
Approval Actions with Mandatory Comment and Audit Trail
Given the snapshot has loaded successfully When rendering action buttons Then Approve and Reject are visible and enabled; both are disabled during loading or while submitting
Given the approver selects Reject When submitting Then a non-empty comment of at least 5 characters is required and inline validation prevents submission until satisfied
Given the approver submits a decision When the request is sent Then a single POST to the approvals endpoint includes snapshot_id, snapshot_timestamp, decision, comment, approver_id, client_platform, and submitted_at (ISO 8601) and duplicate clicks are ignored
Given the server responds 201 Created When handling the response Then a success banner appears and navigation returns to the approvals list; if 4xx/5xx occurs an error banner appears and no duplicate records are created
Given a decision is submitted successfully When auditing via the audit API Then an event exists containing the snapshot hash and exact decision details
Cross-Platform Rendering Consistency and Accessibility Compliance
Given the same Snapshot ID is opened on web (latest Chrome, Safari, Firefox, Edge) and mobile (current iOS and Android app) When the UI renders Then content parity is 100% for messages, counts, clusters, ETAs, evidence, snapshot ID, and timestamp (layout may differ)
Given a keyboard-only user navigates the page When tabbing through elements Then focus order is logical, focus is visible, Enter activates the focused button, and Escape closes any transient alert/toast if dismissible
Given a screen reader is active When the page loads Then the snapshot ID and timestamp are announced with the page title and all controls have accessible names; the map region is labeled "Read-only map snapshot"
Given visual elements are evaluated When checking contrast and semantics Then all text and interactive elements meet WCAG 2.2 AA contrast ratios and live status messages are announced via ARIA live regions
Drift Detection and Re‑snapshot Flow
"As a requester, I want to be alerted if anything changes after I request approval so that I can revalidate or refresh the snapshot before sending."
Description

Continuously compare live entities referenced by the snapshot (clusters, targeting lists, recipient opt‑outs, ETA sources) to detect divergence between request and decision time. Surface a clear "drift detected" banner with a concise diff (e.g., recipient deltas, cluster boundary changes, ETA updates) and options to approve anyway, cancel, or create a new snapshot. Notify the requester and watchers on drift via in‑app and email/SMS per settings.

Acceptance Criteria
Display Drift Banner and Actions on Approval Screen
Given an approval request with snapshot S is opened by an approver And one or more referenced live entities have changed since S was created When the approval screen loads or changes are detected while open Then a persistent "Drift detected" banner appears within 5 seconds And the banner lists each diverged entity type and count (recipients, clusters, ETAs) And the actions "Approve anyway", "Cancel", and "Create new snapshot" are visible and enabled per permissions And if no divergence exists, no drift banner is shown
Concise Drift Diff Summarizes Recipients, Clusters, and ETAs
Given snapshot S references targeting lists, recipient opt-out states, and clusters with ETAs When any of these live entities diverge after S is created Then the diff displays:
- Recipients: added count (+N), removed count (−M), and net delta computed from a fresh recomputation against live data
- Clusters: list of changed cluster IDs/names with before/after impacted count per cluster and a "boundaries/membership changed" indicator
- ETAs: old vs new ETA values with source names and timestamps per affected cluster
And all counts and values are accurate to the latest detected state and rendered within 5 seconds of detection And no recipient PII beyond aggregate counts is shown in the diff
Approve Anyway Publishes Frozen Snapshot Payload
Given snapshot S has detected drift at approval time When the approver clicks "Approve anyway" and confirms Then the system publishes exactly the frozen payload from S (message, recipients, clusters, map extent, evidence, ETAs) And no recalculation against live data occurs during publish And the published recipient count equals the snapshot's frozen count And an audit log entry records "Approved with drift" including approver, timestamp, snapshot ID, and drift summary And the publish operation completes within 10 seconds
Create New Snapshot (Re-snapshot) and Update Approval View
Given snapshot S has detected drift When the approver selects "Create new snapshot" and confirms Then the system creates a new snapshot S2 from current live entities with a new snapshot ID And the approval request view updates to reference S2 and clears the drift banner And S2 has no drift at creation time (diff is empty) And an audit log links S to S2 with a "Resnapshot" reason and both snapshot IDs And no outbound communications are sent until S2 is approved
Cancel Approval Request From Drift State
Given an approval request with snapshot S shows a drift banner When the approver clicks "Cancel" and confirms Then the approval request transitions to a Cancelled state And no publish or notifications to end recipients occur And the drift banner is dismissed and the request becomes read-only And an audit log records the cancellation with user, timestamp, snapshot ID, and drift summary And the UI reflects the Cancelled status within 2 seconds
Notify Requester and Watchers on Drift per Preferences
Given an open approval request with snapshot S and configured requester/watchers notification preferences When drift is detected for S Then an in-app notification is created immediately for the requester and watchers And email and/or SMS notifications are sent within 1 minute according to each recipient's preferences And notifications are de-duplicated by batching changes within a 10-minute window per approval request And notifications include a link to the approval request and a concise summary of drift type(s) and counts And no notifications are sent to recipients with all channels disabled
Snapshot‑Aware Broadcast Execution
"As a release manager, I want the approved snapshot to be the source of truth for broadcast so that the sent messages exactly match what was approved."
Description

On approval, execute the broadcast strictly from the approved snapshot: use the frozen message, resolved recipient set, channel list, map extent, and evidence references. Tag all outbound messages with the Snapshot ID for traceability, and record a delivery report linked back to the snapshot and approval record. Enforce idempotency on Snapshot ID to prevent duplicate sends and handle partial failures with safe retries that do not alter the approved payload.

Acceptance Criteria
Execute From Frozen Snapshot Payload
Given an approved snapshot with ID S containing message M, recipient set R, channel list C, map extent E, and evidence references V When the broadcast is executed Then 100% of outbound requests must use M verbatim And the resolved recipient set must equal R exactly (no additions/removals) And the channels used must equal C exactly And any map content or links must use E exactly And any evidence pointers in outbound payloads must reference V exactly And no live data recomputation alters M, R, C, E, or V after execution starts
Snapshot ID Tagging Across All Outbound Messages
Given an approved snapshot with ID S When messages are sent over SMS, Email, Voice/IVR, and Webhook channels Then every outbound message record stored by the system must include snapshotId = S And every provider API request must include S as metadata/label/header as supported per channel And any human-visible message bodies must include S only where channel policy requires; otherwise traceability is via metadata And sampling 100 random outbound messages returns 100% with snapshotId = S
Delivery Report Linked to Snapshot and Approval
Given a completed execution for snapshot ID S with approval record A When delivery receipts and status callbacks are processed Then a delivery report must be created with aggregate counts per channel and per status linked to S and A And 100% of per-recipient delivery records must store snapshotId = S And querying reports by S must return execution time, totals, successes, failures, and retry counts And exporting the report yields the same totals as the in-app view
Idempotent Execution by Snapshot ID
Given the broadcast execution endpoint is called N times with the same snapshot ID S (where N ≥ 2) When calls occur concurrently or within a 5-minute window Then only one execution record is created and returned (same executionId for all calls) And no duplicate outbound messages are created per (recipient, channel) for S And subsequent calls return HTTP 200 with idempotency metadata referencing the original execution And system logs record deduplication events without altering the approved payload
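The dedup-by-Snapshot-ID behavior above can be sketched with a lock-guarded registry; the class and method names are illustrative, and a real service would back this with a durable store rather than in-process state:

```python
import threading
import uuid

class BroadcastExecutor:
    """Idempotent execution keyed on Snapshot ID: the first caller creates the
    execution record; concurrent or repeated callers get the same executionId
    back, flagged as a replay, and no duplicate sends are enqueued."""

    def __init__(self):
        self._lock = threading.Lock()
        self._executions = {}  # snapshot_id -> execution_id

    def execute(self, snapshot_id: str):
        with self._lock:
            if snapshot_id in self._executions:
                # Deduplicated: return the original execution, mark as replay.
                return self._executions[snapshot_id], True
            execution_id = str(uuid.uuid4())
            self._executions[snapshot_id] = execution_id
        # ...enqueue sends strictly from the frozen snapshot payload here...
        return execution_id, False
```

The check-and-insert happens under a single lock, so even concurrent calls within the dedup window observe exactly one execution record, mirroring the "same executionId for all calls" criterion.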
Safe Partial-Failure Retries Without Payload Drift
Given some deliveries for snapshot ID S fail with retryable errors When the retry worker runs Then only failed (recipient, channel) pairs are retried And all retry attempts use a byte-identical payload to the approved snapshot (hash(Payload) constant across attempts) And successful prior deliveries are never resent And retries back off according to policy and cap at the configured max attempts And final report distinguishes original sends vs. retries without changing M, R, C, E, or V
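The retry-selection and backoff rules can be isolated into two small pure functions; the status labels and default backoff constants are assumptions standing in for the configured retry policy:

```python
def backoff_schedule(max_attempts: int, base_seconds: float = 2.0,
                     cap_seconds: float = 300.0):
    """Exponential backoff delays, capped, for up to max_attempts retries.
    Base and cap are illustrative; the real policy is configuration-driven."""
    return [min(cap_seconds, base_seconds * (2 ** n)) for n in range(max_attempts)]

def pairs_to_retry(deliveries: dict):
    """Select only (recipient, channel) pairs whose last status is a retryable
    failure; delivered and permanently-failed pairs are never resent."""
    return [pair for pair, status in deliveries.items()
            if status == "failed_retryable"]
```

Because retries are addressed by (recipient, channel) pair and the payload is taken byte-identical from the approved snapshot, a retry can never drift from what was approved or resend a successful delivery.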
Immutable Evidence and Map Extent Usage
Given snapshot ID S includes evidence references V and map extent E When outbound content is generated and links are embedded Then evidence links/files used in messages must resolve to the versions pinned in V And any generated map images/links must reflect the exact bounding box E from S And later changes to live evidence or map do not affect messages already queued or retried for S And audit records show evidence IDs and map parameters matching the snapshot
Audit Trail and Evidence Chain of Custody
"As an auditor, I want a verifiable chain of custody for each snapshot and broadcast so that I can confirm integrity and decisions during reviews."
Description

Append comprehensive entries to the audit log at each step: snapshot created (with hash and signer), drift detected, resnapshot created, approval outcome, and broadcast executed. Store evidence file hashes and sizes to verify integrity. Expose an audit view and export API that reconstructs the full chain of custody for any incident, enabling rapid investigations and post‑mortems.

Acceptance Criteria
Snapshot Creation Logged with Hash and Signer
Given a Context Snapshot is created for an incident When the snapshot is persisted Then an audit entry is appended with fields: incidentId, snapshotId, snapshotHash (SHA-256 of snapshot payload), signerUserId, signerKeyFingerprint, createdAt (UTC ISO-8601), eventType='SNAPSHOT_CREATED' And the audit entry is immutable via API (PATCH/DELETE return 405) and UI And recomputing the snapshotHash from the exported snapshot payload equals the stored snapshotHash
Drift Detection and Resnapshot Chain
Given the message/targeting/clusters/map extent/evidence changes after a snapshot When drift is detected Then an audit entry eventType='DRIFT_DETECTED' is appended with fields: incidentId, previousSnapshotId, previousSnapshotHash, changedFields[], detectedAt (UTC) And when a resnapshot is created Then an audit entry eventType='SNAPSHOT_RESNAPSHOT' is appended with fields: parentSnapshotId, newSnapshotId, newSnapshotHash, createdAt (UTC) And export and UI present DRIFT_DETECTED before the corresponding SNAPSHOT_RESNAPSHOT in chronological order And approval attempts referencing a non-latest snapshotId/hash return 409 and append eventType='APPROVAL_SNAPSHOT_MISMATCH'
Approval Outcome Logged and Bound to Snapshot
Given an approver submits a decision for a snapshot When the decision is recorded Then an audit entry eventType='APPROVAL_DECISION' is appended with fields: incidentId, snapshotId, snapshotHash, decision ('approved'|'rejected'), approverUserId, approverRole, reason (optional if rejected), decidedAt (UTC) And only the latest snapshotId/hash for the incident can be approved; otherwise the API rejects with 409 and logs 'APPROVAL_SNAPSHOT_MISMATCH' And the audit view shows the decision adjacent to the referenced snapshot
Broadcast Execution Logged and Verifiable
Given a broadcast is initiated from an approved snapshot When execution starts Then an audit entry eventType='BROADCAST_STARTED' is appended with fields: incidentId, jobId, snapshotId, snapshotHash, channels[], plannedRecipientCounts per channel, startedAt (UTC) And when execution completes Then an audit entry eventType='BROADCAST_COMPLETED' includes: jobId, per-channel attempted/succeeded/failed counts, payloadHash per channel (SHA-256 of rendered payload/template), completedAt (UTC), durationMs And recomputing each payloadHash from the exported payload/template matches the stored hash
Evidence Integrity: Hashes and Sizes Stored
Given evidence files are attached to an incident or snapshot When an evidence file is uploaded Then an audit entry eventType='EVIDENCE_ATTACHED' is appended with fields: incidentId, snapshotId (if applicable), evidenceId, fileName, mimeType, byteSize, sha256, addedByUserId, addedAt (UTC) And when evidence is exported or downloaded Then the recomputed sha256 equals the stored sha256 and the byteSize matches; otherwise the operation returns 422 and an 'EVIDENCE_INTEGRITY_MISMATCH' audit entry is appended
Audit View Reconstructs Full Chain of Custody
Given a user with Audit.View permission opens the audit view for an incident When the audit timeline loads Then it renders a chronological, numbered chain containing: SNAPSHOT_CREATED, DRIFT_DETECTED, SNAPSHOT_RESNAPSHOT, APPROVAL_DECISION, BROADCAST_STARTED, BROADCAST_COMPLETED, EVIDENCE_ATTACHED (and any mismatch events) And each snapshot node shows snapshotHash, signer identity, and UTC timestamp, and links to its parentSnapshotId if resnapshotted And the chain loads within 2 seconds (P95) for incidents with up to 1000 audit entries And exporting from the UI produces JSON identical (byte-for-byte) to the API export for the same incident and parameters
Audit Export API: Complete Chain, Filters, and Performance
Given an authenticated client with Audit.Export scope requests GET /audit/incidents/{incidentId}/chain?format=json When the incident exists and the client is authorized Then the API returns 200 with a fully ordered chain including all event types and fields defined by the audit schema, plus pagination cursors when entries exceed pageSize And format=csv returns 200 with a header row and rows ordered by sequence; unsupported formats return 400 And unauthorized access returns 403; nonexistent incident returns 404 And for incidents with 1000 entries and pageSize=200, the first page responds within 800 ms (P95)
Snapshot APIs and Webhooks
"As an integration engineer, I want APIs and webhooks for snapshot events so that I can automate workflows and external auditing in our tooling."
Description

Provide REST endpoints and OAuth scopes to create, fetch, and list snapshots; verify signatures; and retrieve approval and broadcast outcomes by Snapshot ID. Emit webhooks for snapshot.created, snapshot.drift_detected, snapshot.approved, snapshot.rejected, and broadcast.sent. Include schema versioning, rate limits, and idempotency keys to support integrations and external audit systems.

Acceptance Criteria
Create Snapshot with Idempotency and Versioning
Given a valid OAuth 2.0 bearer token with scope snapshot.create And an Idempotency-Key header containing a unique UUID v4 And an optional Accept-Version header set to a supported API version (e.g., v1) When the client POSTs /v1/snapshots with a well-formed body including message, targeting, affected_clusters[], map_extent, evidence[], and approval_request_id Then the API returns 201 Created with application/json containing id (UUID v4), version, status="pending_approval", checksum (sha256), created_at (ISO 8601), and idempotency_key And the snapshot content is immutable; subsequent GET /v1/snapshots/{id} returns an identical payload and checksum And a retry with the same Idempotency-Key within 24h returns 200 OK with an identical body and header X-Idempotent-Replay:true And a request with the same Idempotency-Key but a different body returns 409 Conflict with error.code="idempotency_conflict" And an unsupported Accept-Version returns 406 Not Acceptable with body.supported_versions including "v1"
List and Fetch Snapshots with Filtering and Pagination
Given a valid token with scope snapshot.read When GET /v1/snapshots?approval_status=pending&limit=50&cursor=<token> Then 200 OK returns items[] sorted by created_at desc; each item includes id, version, status, approval_status, cluster_count, created_at And the response includes next_cursor when more results exist; next_cursor is null when at the end And limit max is 100; values >100 are coerced to 100 And unauthorized or cross-tenant IDs are not listed
Given a valid token with scope snapshot.read When GET /v1/snapshots/{id} Then 200 OK returns the full frozen payload; frozen_at equals created_at And 404 Not Found is returned for unknown or cross-tenant IDs
Webhook Delivery and Retries for Snapshot Lifecycle Events
Given a tenant has configured an HTTPS webhook endpoint and subscribed to snapshot.created, snapshot.drift_detected, snapshot.approved, snapshot.rejected, broadcast.sent When the corresponding lifecycle action occurs Then the system sends a POST within 5 seconds to the endpoint with JSON including event_type, schema_version, delivery_id, occurred_at (ISO 8601), snapshot_id, and event-specific fields And the request includes X-OK-Timestamp and X-OK-Signature (HMAC-SHA256) headers And a 2xx response marks delivery success; 429/5xx trigger retries with exponential backoff for up to 12 attempts; 4xx (except 429) stop retries after the first attempt And deliveries are at-least-once; duplicates include X-OK-Delivery-Count incremented per attempt and a stable delivery_id And for each unique state change, only one event is generated; any additional deliveries are retries of the same event
Signature Verification Endpoint
Given a developer holds scope webhook.verify When POST /v1/signatures/verify with body {payload:string, timestamp:number, signature:string} Then 200 OK returns {valid:true, algorithm:"HMAC-SHA256", tolerance_seconds:300} when the signature matches using the tenant webhook secret and |now - timestamp| <= 300s And mismatched signatures or stale timestamps return 200 OK with {valid:false, reason:"mismatch"|"timestamp_out_of_range"} And malformed input returns 400 Bad Request with error.code and details And missing or insufficient scope returns 403 Forbidden with error.code="insufficient_scope"
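The verification semantics above (HMAC-SHA256 over the payload, 300-second timestamp tolerance, distinct "mismatch" vs "timestamp_out_of_range" reasons) can be sketched client-side; the signed-string format `timestamp.payload` is an assumption, since the document does not publish the exact scheme behind X-OK-Signature:

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300

def verify_webhook(payload: str, timestamp: int, signature: str,
                   secret: bytes, now=None) -> dict:
    """Mirror of the documented verify responses. Checks freshness first,
    then a constant-time signature comparison with the tenant secret."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return {"valid": False, "reason": "timestamp_out_of_range"}
    expected = hmac.new(
        secret, f"{timestamp}.{payload}".encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return {"valid": False, "reason": "mismatch"}
    return {"valid": True, "algorithm": "HMAC-SHA256",
            "tolerance_seconds": TOLERANCE_SECONDS}
```

Binding the timestamp into the signed string (rather than signing the payload alone) is what makes replayed deliveries outside the tolerance window detectable.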
OAuth Scopes and Tenancy Isolation
Given endpoints require snapshot.create for POST /v1/snapshots; snapshot.read for GET /v1/snapshots and GET /v1/snapshots/{id}; snapshot.outcome.read for GET /v1/snapshots/{id}/outcomes; webhook.verify for POST /v1/signatures/verify When a request is made without a required scope Then 403 Forbidden is returned with WWW-Authenticate: Bearer and error.code="insufficient_scope" And invalid or expired tokens return 401 Unauthorized with WWW-Authenticate: Bearer error="invalid_token" And callers cannot access resources belonging to other tenants; cross-tenant access attempts return 404 Not Found And all responses include X-Request-Id; access attempts are logged with token.client_id and granted scopes
Outcomes Retrieval by Snapshot ID
Given a valid token with scope snapshot.outcome.read When GET /v1/snapshots/{id}/outcomes Then 200 OK returns approval {status in ["pending","approved","rejected"], approver_ids[], decided_at|null, notes|null} and broadcast {status in ["pending","in_progress","sent","failed"], channels[], counts{sms,email,ivr,web}, started_at|null, completed_at|null, errors[]} And if outcomes are not yet available, fields are present with status "pending" and timestamps null And response includes X-OK-Data-Staleness header indicating maximum seconds since last update (<=5) And 404 Not Found is returned for unknown or cross-tenant snapshot IDs
Rate Limiting, Error Format, and Idempotent Replay Under Limit
Given per-tenant rate limits are enforced When a token exceeds 600 requests per minute on any endpoint Then subsequent requests receive 429 Too Many Requests with headers X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset And error responses use schema {error:{code, message, details?, correlation_id}} And if a POST includes a previously used Idempotency-Key and the original request completed within 24h, the stored response is returned with 200 and X-Idempotent-Replay:true even when over the rate limit And all responses include X-Request-Id; body.error.correlation_id equals X-Request-Id on errors

Risk Scoring Gate

Scores each broadcast’s risk based on audience size, ETA change magnitude, channel mix, and model confidence, then adjusts policy (e.g., require senior approver, add checklist, or stagger channels). Applies proportionate scrutiny to high‑impact updates while keeping routine notices quick.

Requirements

Risk Scoring Engine
"As an operations manager, I want each update to be automatically scored for risk so that high-impact communications get extra scrutiny while routine notices stay fast."
Description

Compute a real-time risk score (0–100) for each broadcast based on audience size, ETA change magnitude, channel mix, and model confidence from incident clustering. Normalize and weight inputs, apply configurable thresholds, and output a score with contributing factors. Expose a stateless API and SDK hook that evaluates within 100 ms per request and returns score, factors, and versioned model metadata. Support weight/version management, safe defaults when inputs are missing, idempotent evaluation by broadcast ID, and fallbacks if upstream confidence signals are delayed. Persist the final score on the broadcast record for downstream policy decisions and reporting.

Acceptance Criteria
Real‑time Weighted Risk Score (0–100) Computation
Given a broadcast with inputs audience_size, eta_change_minutes, channel_mix, and model_confidence and an active weight config version W When the Risk Scoring API POST /risk-score is called with the broadcast_id and inputs Then response.score is between 0 and 100 inclusive And response.factors includes exactly ["audience_size","eta_change","channel_mix","model_confidence"] And each factor has normalized_value in [0,1] and weight in [0,1] And the sum of all factor weights equals 1.0 ± 0.001 And score equals round(100 * Σ(normalized_value * weight)) within ±1 tolerance And response.metadata.config_version equals W And updating the active weight config to version W2 results in subsequent evaluations reflecting W2 weights and response.metadata.config_version=W2
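The scoring formula in this criterion, score = round(100 × Σ(normalized_value × weight)) with weights summing to 1.0 ± 0.001, reduces to a few lines; this sketch omits the normalization step itself, which depends on the configured weight version:

```python
def risk_score(factors: list) -> int:
    """factors: [{'name': ..., 'normalized_value': 0..1, 'weight': 0..1}].
    Weights must sum to 1.0 within the documented 0.001 tolerance."""
    total_weight = sum(f["weight"] for f in factors)
    assert abs(total_weight - 1.0) <= 0.001, "weights must sum to 1.0"
    score = round(100 * sum(f["normalized_value"] * f["weight"] for f in factors))
    return max(0, min(100, score))  # clamp to the documented 0-100 range
```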
P95 Latency ≤ 100 ms for Stateless API and SDK Hook
Given 10,000 evaluation requests with typical payloads and no external network calls When measuring end-to-end latency for POST /risk-score and the SDK method evaluateRisk(broadcastId, inputs) Then the 95th percentile latency is ≤ 100 ms and error rate ≤ 0.1% And results from API and SDK for the same inputs are identical by value for all returned fields And concurrent evaluations for different broadcasts produce consistent outputs regardless of call order, demonstrating statelessness And official Node.js and Python SDKs expose evaluateRisk(broadcastId, inputs, options) and pass integration tests against the API
Idempotent Evaluation by Broadcast ID
Given a broadcast_id B and identical inputs and config_version When evaluate is called multiple times within the idempotency window Then all responses have the same score, factors, risk_level, and metadata.idempotency_key And only one persistence write occurs for broadcast B for that config_version And re-ordering or retrying the calls does not change any returned values
Safe Defaults and Fallbacks on Missing or Delayed Inputs
Given a broadcast with one or more missing inputs (e.g., model_confidence, audience_size) When evaluation is performed Then the engine substitutes configured safe defaults for each missing input without error And each affected factor has default_used=true and appears in response.metadata.fallbacks And the score is computed using these defaults and remains within [0,100] And when the delayed upstream signal arrives and reevaluation is allowed by config, the new evaluation supersedes the defaulted score while preserving idempotency semantics per config
Configurable Thresholds and Risk Level Derivation
Given a thresholds configuration T that maps score bands to risk levels with defined inclusive/exclusive edges When evaluating broadcasts whose scores fall exactly on each band boundary and within each band Then response.risk_level matches T for all tested scores And updating T in the configuration service takes effect on subsequent evaluations without redeploy And response.metadata.thresholds_version equals T.version
Output Payload Completeness and Versioned Model Metadata
Given a successful evaluation When the response is returned Then it contains broadcast_id, score, risk_level, and factors where each factor has name, raw_value, normalized_value, weight, and contribution And metadata includes model_id, model_version, config_version, thresholds_version, evaluated_at (ISO8601 UTC), request_id, and idempotency_key And the response validates against the published JSON Schema version S And response.metadata.schema_version equals S And no undocumented fields are present
Persistence of Final Score on Broadcast Record
Given a successful evaluation for broadcast_id B When persistence completes Then the broadcast record contains risk_score, risk_level, factors (or a hash/reference), config_version, thresholds_version, and evaluated_at And reading the broadcast record immediately after returns values identical to the response And a single event "risk.score.computed" is emitted with correlation to B and the stored values And repeated evaluations with identical inputs do not create duplicate persisted entries
Policy Decision Matrix
"As a compliance lead, I want risk-based policies applied consistently so that high-risk messages follow stricter controls without slowing low-risk communications."
Description

Map risk score bands to deterministic actions, such as requiring senior approver, presenting a pre-send checklist, staggering channels, throttling SMS batch size, or blocking sends above a hard threshold. Provide an admin-configurable rules engine with versioning, effective dates, and auditability. Ensure precedence and conflict resolution are explicit, and expose a dry-run endpoint to preview which actions a score will trigger. Integrate tightly with the broadcast workflow so that policy actions are enforced before send and are recorded on the broadcast timeline.

Acceptance Criteria
Deterministic Risk Band to Action Mapping
Given active ruleset v2 defines bands: 0–39=None, 40–69=Checklist, 70–89=SeniorApprover+Checklist+Stagger+Throttle(500/min), 90–100=Block When evaluating a broadcast with riskScore=70 Then required actions are SeniorApprover, Checklist, Stagger, Throttle(500/min) and Block is not returned Given riskScore=69 When evaluating under ruleset v2 Then required actions are Checklist only Given riskScore=89 When evaluating under ruleset v2 Then required actions are SeniorApprover, Checklist, Stagger, Throttle(500/min) Given riskScore=90 When evaluating under ruleset v2 Then send is blocked and only Block is returned as the action Given identical inputs evaluated 1000 times When evaluating under ruleset v2 Then the action set returned is identical across all evaluations
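The ruleset-v2 mapping in this criterion can be sketched as a pure function, which also makes the determinism requirement (identical action sets across repeated evaluations) trivial to satisfy. The band edges and action labels mirror the criterion; the function and field names are illustrative:

```python
# Sketch of the ruleset-v2 band-to-action mapping described above.
def actions_for_score(risk_score: int) -> list:
    if not 0 <= risk_score <= 100:
        raise ValueError("riskScore must be in 0-100")
    if risk_score >= 90:
        return ["Block"]  # Block suppresses all other actions
    if risk_score >= 70:
        return ["SeniorApprover", "Checklist", "Stagger", "Throttle(500/min)"]
    if risk_score >= 40:
        return ["Checklist"]
    return []  # 0-39: no gating actions
```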
Admin Rule Configuration with Versioning and Effective Dates
Given ruleset v2 is active and an admin publishes ruleset v3 with effectiveAt=2025-08-15T12:00:00Z When a policy evaluation occurs at 2025-08-15T11:59:59Z Then v2 is used for evaluation Given the same v3 effectiveAt When a policy evaluation occurs at or after 2025-08-15T12:00:00Z Then v3 is used for evaluation Given an admin edits a rule in v3 When saving changes Then a new version v4 is created and v3 remains immutable and viewable Given a ruleset publish attempt with missing effectiveAt When saving Then validation fails with a required-field error and ruleset is not activated Given version history UI When an admin views ruleset v3 Then they can see author, createdAt, effectiveAt, and a diff of changes from v2
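The effective-date selection above reduces to "pick the latest published version whose effectiveAt is not in the future." A minimal sketch, with an assumed in-memory list of published, immutable versions:

```python
# Sketch of effective-date ruleset selection. RULESETS stands in for the
# rules engine's version store; names and dates are illustrative.
from datetime import datetime, timezone

RULESETS = [  # (version, effectiveAt) — published versions are immutable
    ("v2", datetime(2025, 8, 1, tzinfo=timezone.utc)),
    ("v3", datetime(2025, 8, 15, 12, 0, 0, tzinfo=timezone.utc)),
]

def active_ruleset(at: datetime) -> str:
    """Return the latest version whose effectiveAt is <= the evaluation time."""
    eligible = [(v, eff) for v, eff in RULESETS if eff <= at]
    return max(eligible, key=lambda r: r[1])[0]
```

This reproduces the boundary in the criterion: an evaluation at 11:59:59Z still uses v2, while one at exactly 12:00:00Z uses v3.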
Explicit Precedence and Conflict Resolution
Given two applicable rules where one sets Throttle=1000/min and another sets Throttle=200/min When evaluating Then the resulting Throttle is 200/min (the stricter value) Given applicable actions include Block and other non-blocking actions When evaluating Then Block overrides and the resulting action set indicates Block and no non-blocking actions are enforced at send time Given precedence order configured as Block > RequireSeniorApprover > Checklist > Stagger > Throttle When multiple rules generate conflicting or duplicate actions Then the engine applies the configured precedence deterministically and records the evaluated order in the evaluation log Given the same inputs and ruleset version When evaluating repeatedly Then the resolution outcome and evaluation log are identical across evaluations
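The three resolution rules above — Block overrides everything, conflicting throttles resolve to the stricter (lower) rate, and surviving actions are ordered by the configured precedence — can be sketched as follows. All names are illustrative:

```python
# Sketch of deterministic conflict resolution over rule-emitted actions.
PRECEDENCE = ["Block", "RequireSeniorApprover", "Checklist", "Stagger", "Throttle"]

def resolve(actions: list) -> list:
    """actions: list of (name, param) tuples emitted by applicable rules."""
    if any(name == "Block" for name, _ in actions):
        return [("Block", None)]  # Block suppresses all non-blocking actions
    merged = {}
    for name, param in actions:
        if name == "Throttle":
            # stricter (lower) per-minute rate wins on conflict
            merged[name] = min(param, merged.get(name, param))
        else:
            merged.setdefault(name, param)  # de-duplicate identical actions
    return sorted(merged.items(), key=lambda a: PRECEDENCE.index(a[0]))
```

Because the function is pure over its inputs, repeated evaluation of the same inputs under the same ruleset version yields identical outcomes, as the criterion requires.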
Dry-Run Policy Preview Endpoint
Given POST /policy/dry-run with a valid payload including riskScore, audienceSize, etaChangeMagnitude, channelMix, modelConfidence When called with a valid auth token Then response is 200 with JSON containing actions[], rulesVersion, effectiveAt, and evaluationLog[] Given valid dry-run inputs When called under normal load (<=100 concurrent requests) Then p95 latency is <=300ms Given invalid payload (e.g., riskScore > 100 or missing required fields) When calling the endpoint Then response is 422 with field-level error details and no actions are returned Given no or invalid auth token When calling the endpoint Then response is 401 and no policy evaluation occurs Given identical inputs and no ruleset change When calling dry-run multiple times Then identical outputs are returned
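The 422 path of the dry-run contract — field-level errors for missing or out-of-range inputs — might be validated along these lines. The required field names follow the criterion; the validator itself is a hypothetical sketch, not the endpoint's real implementation:

```python
# Illustrative validation for a /policy/dry-run payload (the 422 path).
REQUIRED = ("riskScore", "audienceSize", "etaChangeMagnitude", "channelMix", "modelConfidence")

def validate_dry_run(payload: dict) -> dict:
    """Return a map of field -> error message; an empty map means valid."""
    errors = {f: "required field missing" for f in REQUIRED if f not in payload}
    score = payload.get("riskScore")
    if isinstance(score, (int, float)) and not 0 <= score <= 100:
        errors["riskScore"] = "must be between 0 and 100"
    return errors
```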
Pre-Send Workflow Enforcement and Gating
Given policy evaluation returns RequireSeniorApprover When a sender attempts to dispatch a broadcast Then the UI blocks send until a user with the Senior Approver role approves, and the approver identity and timestamp are recorded Given policy evaluation returns Checklist with N configured items When a sender proceeds to send Then all N items must be affirmatively completed and captured with userId and timestamp before send is enabled Given policy evaluation returns Stagger(channels=Web+Email -> SMS -> IVR) with configured gaps When the broadcast is sent Then channel dispatch timestamps follow the configured sequence and minimum gaps, and deviations >5s are flagged in logs Given policy evaluation returns Throttle(500/min) When sending SMS Then SMS are dispatched in batches not exceeding 500 per minute until completion, with progress visible to the sender Given policy evaluation returns Block When a sender attempts to dispatch Then no messages are sent, the UI shows error code POLICY_BLOCKED, and the event is recorded on the broadcast timeline
Auditability of Rules and Broadcast Policy Actions
Given an admin creates, updates, or deletes any rule or ruleset When the action is saved Then an immutable audit record is created with actorId, actorRole, IP, timestamp (UTC), action type, and before/after JSON Given the audit log UI When filtering by date range and actor Then matching records are returned and can be exported as CSV Given a broadcast is evaluated against policy When viewing its timeline Then a Policy Evaluated event shows inputs (riskScore, audienceSize, etaChangeMagnitude, channelMix, modelConfidence), outputs (actions), rulesVersion, and evaluation timestamp Given approvals, checklist completions, throttling, and channel staggers occur When viewing the broadcast timeline Then each action is recorded with actor, timestamp, and outcome Given a non-admin user When attempting to modify or delete audit or timeline entries Then the system denies the operation and logs the attempt
Fallback Behavior and Invalid Rule Definitions Handling
Given a ruleset where a default band is configured for unmatched inputs When riskScore is null, NaN, or outside 0–100 Then the default band is applied and actions are returned accordingly Given a ruleset publish attempt with overlapping bands (e.g., 40–69 and 65–80) or gaps (e.g., 0–39, 50–69) When validating Then the publish is rejected with specific errors indicating the conflicting or missing ranges Given a ruleset without a default band When attempting to activate it Then activation is rejected with an error and no evaluations use the incomplete ruleset Given policy evaluation fails due to misconfiguration When a sender attempts to send Then send is blocked with error POLICY_RULES_MISCONFIGURED and the failure is recorded on the broadcast timeline with rulesVersion and reason
Approval Gate UI
"As a duty manager, I want a clear approval page that tells me why an update is high risk and what I must do so that I can make informed, accountable decisions quickly."
Description

Present a unified pre-send screen that surfaces the risk score, key drivers, required checklist items, and the exact policy actions triggered. Enable escalation to a senior approver when required, capture attestations, and block sending until all gated steps are satisfied. Provide clear, human-readable explanations, inline diffs of ETA changes, and a one-click route to view related incidents. Enforce role-based access and capture who approved what and when. Optimize for desktop and mobile with accessibility compliance and fast load times.

Acceptance Criteria
Pre-send risk score with key drivers
Given a draft broadcast is opened in the Approval Gate When the UI renders Then it displays the current risk score as an integer 0–100 And shows a risk band label and color derived from configuration for that score And lists drivers: audience size, ETA change magnitude (minutes), channel mix, and model confidence with their current values and percentage contribution And shows a human-readable explanation summarizing the primary drivers of the score And displays a last-calculated timestamp in the user’s local timezone When any input affecting risk changes Then the risk score, drivers, and explanation recalculate and update within 1 second
Triggered policy actions surfaced and enforced
Given a risk score with associated policy actions When the Approval Gate loads Then the UI lists each triggered action with a clear label and status (Required/Complete) And the primary Send action is disabled until all Required actions are complete When a policy includes a staggered channel rollout Then the UI shows the channel schedule with configured offsets and a preview per channel When any required action becomes incomplete due to a change in risk or content Then the action list updates in real time and the Send action re-disables When all required actions are complete Then the Send action becomes enabled
Checklist rendering and attestations capture
Given the policy requires a pre-send checklist When the Approval Gate loads Then each checklist item renders with description, optional help text, and a required checkbox And a user must confirm each required item before sending When an item is checked Then the system records the user ID, UTC timestamp, and item ID as an attestation And the attestation is visible in the UI When the broadcast is sent Then attestations become read-only and are preserved in the audit log
Approval workflow, RBAC, and audit trail
Given the risk policy requires senior approval When the initiator clicks Request Approval Then a selectable list shows only users with the Senior Approver role When a senior approver is selected Then the approver receives an in-app notification immediately and an email within 1 minute with a deep link to the gate When the approver opens the gate and submits Approve or Decline Then the system records approver ID, decision, optional comment, and UTC timestamp And only users with the Senior Approver role may approve; other attempts are blocked with a permission error When approval is declined Then the initiator sees the decline reason and Send remains disabled When no decision is made within 30 minutes Then the request escalates to a configured fallback approver group And all approval and attestation events are written to an immutable audit log including the risk score snapshot, drivers, policy actions, and timestamps
Inline ETA change diff presentation
Given a broadcast updates an ETA relative to the last published ETA When the Approval Gate loads Then it shows an inline diff with the previous ETA and the new ETA side by side And highlights increases in ETA in red and decreases in green with the delta in minutes And displays the timezone used for both values When multiple incidents are affected Then the diff lists each impacted incident with its own before/after ETA and delta When there is no prior ETA Then the UI labels the change as New ETA and omits a delta
One-click related incidents view
Given the broadcast is linked to one or more incidents When the user clicks View Related Incidents Then the app opens a related-incidents view filtered to the broadcast’s incident IDs And the user can return to the Approval Gate via a Back action without losing unsaved progress And access is restricted by incident permissions; unauthorized incidents are omitted and a notice is shown And the related-incidents view opens within 1 second on desktop and 2 seconds on mobile on a baseline network
Performance, responsiveness, and accessibility
Given a draft broadcast is opened When the Approval Gate loads on desktop over a 10 Mbps connection Then Time to Interactive is ≤ 2.0 s at p50 and ≤ 4.0 s at p95 When loaded on mobile over a 4G connection Then Time to Interactive is ≤ 3.5 s at p50 and ≤ 5.0 s at p95 And the layout is responsive for widths 320–1440 px with no horizontal scrolling and tap targets ≥ 44×44 px on mobile And all interactive controls are reachable via keyboard (Tab/Shift+Tab), show visible focus, and have ARIA labels And the page meets WCAG 2.1 AA for color contrast (≥ 4.5:1), semantics, and screen-reader announcements for risk score, policy actions, errors, and confirmations
Channel Stagger Orchestrator
"As a communications lead, I want high-risk broadcasts to roll out in controlled phases so that we can catch issues early and minimize impact if a correction is needed."
Description

Execute staggered delivery policies across SMS, email, voice, and web, sequencing channels and cohorts based on risk. Support configurable delays, batch sizes, and hold windows; provide automatic cancel, amend, or roll-forward if a correction is issued mid-stagger. Ensure idempotent scheduling, per-channel success tracking, and backoff on delivery failures. Expose real-time progress and allow safe manual override with appropriate audit logging.

Acceptance Criteria
Risk-Based Channel and Cohort Sequencing Enforcement
Given a broadcast with risk band mapped to a policy that defines channel order, cohort batch sizes, per-channel delay offsets, and inter-channel hold windows When the Channel Stagger Orchestrator generates the schedule Then the scheduled channel order matches the policy exactly for that risk band And recipient cohorts are created per policy segmentation with batch sizes not exceeding the configured limit And per-channel delay offsets and inter-channel hold windows are applied with computed dispatch times within ±2 seconds of expected And the schedule is rejected with an error if any required policy parameter is missing
Mid-Stagger Correction Handling: Cancel, Amend, Roll-Forward
Given a broadcast is mid-stagger with pending (not yet dispatched) jobs When a correction with action=cancel is submitted Then all pending jobs are canceled within 10 seconds and no further sends occur for the original content Given a broadcast is mid-stagger with pending jobs When a correction with action=amend and new content/metadata is submitted Then all pending jobs are updated before dispatch and already dispatched deliveries remain immutable with a link to the amended version in audit logs Given a broadcast is mid-stagger with pending jobs When a correction with action=roll-forward is submitted Then a new schedule continues the remaining sequence with updated content and excludes recipients already contacted in any channel
Idempotent Scheduling on Duplicate Broadcast Requests
Given the orchestrator receives duplicate schedule requests carrying the same broadcastId and content/version hash When the requests are processed concurrently or retried Then exactly one schedule is created and subsequent requests return the existing scheduleId And no recipient is scheduled more than once per channel due to duplication And repeated client retries are safe and do not create additional sends
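Idempotency keyed on (broadcastId, content hash) can be sketched as below. An in-memory dict stands in for the schedule store; a production version would need an atomic compare-and-set (or unique constraint) so that concurrent duplicates also collapse to one schedule. Names are illustrative:

```python
# Sketch of idempotent schedule creation: duplicate or retried requests
# carrying the same broadcastId and content return the existing scheduleId.
import hashlib
import uuid

_schedules = {}  # (broadcast_id, content_hash) -> schedule_id

def create_schedule(broadcast_id: str, content: str) -> str:
    key = (broadcast_id, hashlib.sha256(content.encode()).hexdigest())
    if key in _schedules:
        return _schedules[key]  # duplicate request: no second schedule
    schedule_id = str(uuid.uuid4())
    _schedules[key] = schedule_id
    return schedule_id
```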
Delivery Failure Backoff, Retry, and Escalation Policy
Given a channel send attempt returns a transient failure (e.g., timeout, 429, or 5xx) When retrying the batch Then exponential backoff with jitter is applied starting at 2 seconds, doubling each attempt up to a maximum delay of 5 minutes, with a maximum of 5 attempts And if the failure persists beyond the retry limit, the batch is marked failed and an operator alert is emitted per policy And permanent failures (e.g., invalid address 4xx) are not retried and are recorded per recipient with a machine-readable reason code
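The retry schedule above (start at 2 s, double per attempt, cap at 5 minutes, at most 5 attempts) can be sketched as a delay generator. The criterion does not pin down the jitter scheme; "equal jitter" (half fixed, half random) is assumed here so the first delay still starts near 2 s:

```python
# Sketch of exponential backoff with jitter for transient send failures.
import random

def backoff_delays(max_attempts: int = 5, base: float = 2.0, cap: float = 300.0):
    """Yield one randomized delay (seconds) per retry attempt."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))      # 2, 4, 8, 16, 32, ... capped at 300
        yield ceiling / 2 + random.uniform(0, ceiling / 2)  # "equal jitter"
```

Permanent failures (4xx such as invalid address) would bypass this generator entirely and be recorded per recipient, as the criterion requires.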
Real-Time Progress and Per-Channel Success Tracking
Given a broadcast is in progress When a client queries the progress API or opens the progress UI Then per-channel and per-cohort metrics are available: scheduled, dispatched, delivered, failed, canceled counts and percentages And the data updates at least every 5 seconds with a lastUpdated timestamp and estimated time to complete per channel And delivery outcomes are recorded per recipient with provider response codes to support success/failure attribution
Safe Manual Override with RBAC and Audit Logging
Given an authorized operator with the required role accesses a live broadcast When they issue a manual action (pause, resume, cancel, reorder channels, or adjust delays) Then the system validates permissions and concurrency constraints before applying the change And approved changes take effect within 5 seconds and are reflected in the progress API/UI And an immutable audit record is written capturing operator identity, timestamp, action, before/after state, and broadcast identifiers
Explainability & Audit Trail
"As a regulatory auditor, I want a transparent record of how a broadcast was scored and approved so that I can verify compliance and decision rationale."
Description

Record risk inputs, normalized values, weights, final score, policy decisions, approvals, checklist responses, and timestamps in an immutable audit log linked to the broadcast. Provide an explainability view that shows how each factor contributed to the score and which rule fired. Support export via API and CSV, retention policies, and privacy controls for sensitive data. Ensure logs are tamper-evident and searchable for compliance reviews and postmortems.

Acceptance Criteria
Audit Log Captures Complete Risk Evaluation Data
Given a broadcast passes through the Risk Scoring Gate When the risk score is computed and the policy decision is finalized Then an audit log entry is appended linked by broadcast_id and includes: raw risk inputs (audience_size, eta_change_minutes, channel_mix, model_confidence), normalized_values per factor, factor_weights, final_risk_score, fired_rules[], policy_decision, approver_user_ids[], approval_outcomes with timestamps, checklist_responses[], engine/model version, and created_at And all timestamps are UTC ISO‑8601 with millisecond precision And the entry is retrievable via API by broadcast_id within 1 second for p95 And no update endpoint exists for audit entries; attempts to modify return 405/Method Not Allowed
Immutable, Tamper‑Evident Audit Chain
Given any audit log stream for a broadcast When a new audit event is written Then it stores content_hash=SHA‑256(payload), prev_hash of the prior event (or null for first), and a service signature And a daily chain anchor is published and retrievable via the verify endpoint Given the verify endpoint is called for a broadcast When no records have been altered Then the endpoint returns 200 with verified=true and the index of the latest anchored event Given any event content is altered or removed outside of retention policy When the verify endpoint is called Then it returns verified=false and identifies the first failing index
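The hash chain and verification behavior above can be sketched as follows. This illustrative version omits the service signature and daily anchor, and assumes a canonical JSON serialization of each payload:

```python
# Sketch of a tamper-evident audit chain: each event stores a SHA-256 of its
# payload combined with the previous event's hash; verification walks the
# chain and reports the first failing index.
import hashlib
import json

def append_event(chain: list, payload: dict) -> None:
    prev_hash = chain[-1]["content_hash"] if chain else None
    body = json.dumps(payload, sort_keys=True)
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "content_hash": hashlib.sha256((body + str(prev_hash)).encode()).hexdigest(),
    })

def verify(chain: list):
    """Return (True, last_index) if intact, else (False, first_failing_index)."""
    prev_hash = None
    for i, ev in enumerate(chain):
        body = json.dumps(ev["payload"], sort_keys=True)
        expected = hashlib.sha256((body + str(prev_hash)).encode()).hexdigest()
        if ev["prev_hash"] != prev_hash or ev["content_hash"] != expected:
            return False, i
        prev_hash = ev["content_hash"]
    return True, len(chain) - 1
```

Altering any stored payload breaks the recomputed hash at that index, and every later event's `prev_hash` link depends on it, which is what makes post-hoc edits detectable.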
Explainability View Shows Factor Contributions and Fired Rules
Given a user opens the Explainability view for a specific broadcast When the page loads Then it displays for each factor: raw_value, normalized_value (0..1), weight, contribution=(normalized_value*weight), and contribution_percent of total And it displays final_risk_score, selected policy_decision, and fired_rules with human‑readable descriptions And all displayed values exactly match the latest audit log entry for that broadcast (tolerance: exact to 3 decimal places) And the user can expand to see approval history and checklist responses with timestamps And the view loads within 2 seconds at p95 for the last 30 days of data
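The contribution math in this view — contribution = normalized_value × weight, with contribution_percent as each factor's share of the total — can be sketched directly. The factor names match the scoring inputs elsewhere in this document; the example values and weights are made up:

```python
# Sketch of per-factor contribution and percentage-of-total computation
# for the Explainability view.
def explain(factors: dict, weights: dict) -> dict:
    contributions = {f: factors[f] * weights[f] for f in factors}
    total = sum(contributions.values())
    return {
        f: {
            "contribution": round(c, 3),
            "contribution_percent": round(100 * c / total, 1) if total else 0.0,
        }
        for f, c in contributions.items()
    }

# Hypothetical normalized values (0..1) and weights for one broadcast.
factors = {"audience_size": 0.8, "eta_change": 0.5, "channel_mix": 0.4, "model_confidence": 0.3}
weights = {"audience_size": 0.4, "eta_change": 0.3, "channel_mix": 0.2, "model_confidence": 0.1}
```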
Search and Filter for Compliance Reviews and Postmortems
Given a compliance reviewer queries the audit log via UI or API When filters are applied for time_range, broadcast_id, score_range, policy_decision, rule_name, approver_user_id, and channel Then results include only matching records, return total_count, and are sortable by created_at and final_risk_score And p95 query latency is ≤ 2 seconds for an index containing at least 1,000,000 audit entries And results pagination is stable and deterministic with cursor or page+size And clicking a result opens the Explainability view for that exact audit entry
Export via API and CSV with Stable Schema
Given an auditor requests an export via API for a time window and filters When Accept: application/json is used Then the API returns 200 with a JSON array using the documented schema including: broadcast_id, created_at, inputs, normalized_values, factor_weights, final_risk_score, fired_rules, policy_decision, approvals, checklist_responses, engine_version, integrity_hashes Given the same request with Accept: text/csv (or .csv route) When the response is returned Then the CSV is UTF‑8 with header row and the same fields flattened, using RFC4180 quoting And large exports are chunked/paginated with a consistent cursor and no duplicates across pages And exported timestamps are UTC ISO‑8601 with millisecond precision
Retention Policy and Legal Hold Enforcement
Given an admin sets retention_days=N and optionally enables legal_hold for specific broadcasts When current_time ≥ created_at+N days for an audit entry without legal_hold Then the entry is purged or irreversibly anonymized per policy, and a retention_purge tombstone is appended to the chain And purged entries no longer appear in search or exports, and verify excludes purged payloads while preserving chain integrity across remaining entries Given legal_hold is set on a broadcast When retention thresholds are reached Then no purge/anonymization occurs until legal_hold is removed, and this state is visible in admin configuration and audit logs
Privacy Controls for Sensitive Data
Given default system configuration When audit entries contain sensitive fields (e.g., phone_number, email, free_text_notes) Then those fields are masked/redacted in UI, search results, and exports by default Given a user with PII_VIEW permission accesses the same records When they explicitly toggle "Show PII" or request API with scope=pii:read Then unmasked values are shown/returned, and the access itself is logged as an audit event with user_id, reason, and timestamp And requests without required permission receive 403 and only masked data And explainability view never exposes PII beyond masked defaults unless explicitly revealed by an authorized user
Calibration Console
"As a product owner, I want to calibrate the scoring and thresholds using real outcomes so that the gate is strict where it matters and unobtrusive elsewhere."
Description

Provide an admin interface to tune factor weights and score thresholds, simulate changes on historical broadcasts, and preview downstream policy effects. Surface metrics like false-positive/negative gating rates, average time-to-send, and incidents of post-send corrections by risk band. Offer safeguarded deployment of new configurations with staged rollout and automatic rollback if key KPIs regress.

Acceptance Criteria
Weight and Threshold Tuning UI Saves and Validates Inputs
Given an admin opens the Calibration Console, When they edit factor weights for audience size, ETA change magnitude, channel mix, and model confidence, Then each weight must accept numeric values in the range -1.00 to 1.00 with a step of 0.01 and inline validation errors are shown for out-of-range or non-numeric inputs. Given risk band thresholds are edited, When the admin sets Low/Medium/High score cutoffs, Then thresholds must be integers between 0 and 100 and strictly ascending (Low < Medium < High), otherwise Save is disabled and a descriptive error is shown. Given all inputs are valid, When the admin clicks Save, Then a new Draft configuration version is created with version ID, timestamp, and author, response time <= 2 seconds, and live scoring remains unchanged.
Historical Simulation Accuracy & Performance
Given a Draft configuration and a selected historical window up to 90 days and max 10,000 broadcasts, When the admin runs Simulation, Then the system scores all included broadcasts and completes within 60 seconds for 10,000 records, otherwise a progress indicator and ETA are shown. Then the result set includes, for each broadcast, baseline (Live) risk score/band, Draft risk score/band, baseline gate decision, and Draft gate decision. Then the console computes and displays distribution deltas by risk band and gating deltas vs Live; FP gating rate = count(Draft Gate=Yes AND Live Gate=No)/N; FN gating rate = count(Draft Gate=No AND Live Gate=Yes)/N. And repeating the same simulation with the same inputs produces identical results (tolerance ±0.1 score and identical gate decisions).
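The FP/FN gating-rate definitions above compare each broadcast's Live gate decision with its Draft gate decision. A minimal sketch, with an assumed record shape of (live_gated, draft_gated) boolean pairs:

```python
# Sketch of the simulation's gating-rate metrics:
#   FP rate = Draft gates but Live did not, over N
#   FN rate = Live gated but Draft does not, over N
def gating_rates(records: list) -> dict:
    """records: list of (live_gated: bool, draft_gated: bool) pairs."""
    n = len(records)
    fp = sum(1 for live, draft in records if draft and not live)
    fn = sum(1 for live, draft in records if live and not draft)
    return {"fp_rate": fp / n, "fn_rate": fn / n}
```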
Policy Effect Preview by Risk Band
Given simulation results are available, When the admin selects any risk band filter, Then the console lists the policy actions that would be triggered (e.g., senior approver required, checklist required, stagger channels) with counts and percentage change versus Live. When the admin drills down into any action, Then a paginated sample (up to 50 broadcasts) with IDs, timestamps, and channels is shown with the exact rule(s) that caused the action. Then all displayed counts reconcile with the underlying simulation dataset within ±1 item.
Metrics Dashboard Surface & Consistency
Given the Metrics view is opened after a simulation, Then the console displays, side-by-side for Live vs Draft, FP gating rate, FN gating rate, average time-to-send (p50, p90), and post-send corrections per 1,000 broadcasts, each broken down by risk band. Then metric values equal an independent offline recomputation on the same dataset within ±1 percentage point absolute or ±5% relative (whichever is larger). When the admin exports metrics, Then CSV and PNG exports are generated within 5 seconds and reflect the current filters and dataset.
Staged Rollout with Guardrails and Auto-Rollback
Given a Draft configuration has completed simulation, When an admin with Deploy permission initiates rollout, Then the console supports staged exposure across broadcasts/channels in steps of 10% -> 25% -> 50% -> 100% (editable) and shows current stage and cohort size. During rollout, Then the system computes KPIs every 15 minutes on live traffic and triggers automatic rollback to the last good configuration within 2 minutes if any guardrail is breached for two consecutive intervals: FN gating rate increases by >2 percentage points vs baseline, post-send corrections per 1,000 increase by >20% vs baseline, or average time-to-send increases by >10% vs baseline. When rollback occurs, Then the system sends notifications to the configured email/Slack channels and logs the event in the audit trail with reason and metrics snapshot. At any point in rollout, When an admin clicks Manual Rollback, Then the live configuration reverts within 2 minutes and all in-flight evaluations switch to the previous configuration on the next request.
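The auto-rollback trigger above can be sketched as a check over the most recent KPI intervals. The thresholds mirror the criterion; the reading of "any guardrail is breached for two consecutive intervals" as "some guardrail breach occurs in each of the last two intervals" is an assumption, as is the KPI dict shape:

```python
# Sketch of the rollback guardrail check over 15-minute KPI samples.
def should_rollback(samples: list, baseline: dict) -> bool:
    """samples: chronological KPI dicts; roll back only when a breach holds
    in the two most recent consecutive intervals."""
    def breached(s):
        return (
            s["fn_rate"] - baseline["fn_rate"] > 0.02                          # >2 pp FN increase
            or s["corrections_per_1k"] > baseline["corrections_per_1k"] * 1.20  # >20% corrections
            or s["avg_time_to_send"] > baseline["avg_time_to_send"] * 1.10      # >10% time-to-send
        )
    return len(samples) >= 2 and breached(samples[-1]) and breached(samples[-2])
```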
Versioning, Audit, and Access Control
Given role-based access control is enabled, When a user without Admin role attempts to edit weights/thresholds or deploy, Then the action is blocked and a read-only view is shown. Given an Admin creates or edits a configuration, Then every change is recorded in an immutable audit log with version ID, user, timestamp (UTC), changed fields with before/after values, simulation ID (if run), and deployment actions. When viewing Versions, Then the console can diff any two versions, highlighting changes in weights, thresholds, and policies; diff renders within 2 seconds for up to 100 fields. When exporting the audit log, Then a CSV covering the last 12 months is generated within 10 seconds and matches on-screen entries 1:1.
Approver Alerts & SLA Escalations
"As an on-call director, I want timely alerts and SLA tracking for pending high-risk approvals so that critical updates are not delayed."
Description

Notify required approvers when a high-risk broadcast is awaiting action and track SLA timers to escalate to on-call leadership if thresholds are missed. Support multi-channel alerts (in-app, email, SMS, chat) with quiet hours and acknowledgement tracking. Expose a dashboard of pending approvals with aging, and integrate with incident priority to adjust SLAs dynamically.

Acceptance Criteria
High-Risk Broadcast Alert to Required Approvers
Given a broadcast is classified High risk and is Awaiting Approval And required approvers are assigned with channel preferences When the approval request is created Then alerts are sent to all required approvers via their configured channels within 60 seconds And each alert includes broadcast ID, incident priority, risk class/score, SLA deadline timestamp, and Approve/Decline/Acknowledge actions And no non-assigned users receive alerts And no more than one alert per channel per approver is sent within a 60-second window
Quiet Hours and Channel Preferences Enforcement
Given an approver has quiet hours set from 22:00–06:00 in their local timezone And SMS and Voice are suppressed during quiet hours, while Email and Chat are allowed When a high-risk approval request is created at 23:15 local time Then only Email and Chat alerts are sent to that approver And SMS and Voice alerts are not sent And an audit log records which channels were suppressed due to quiet hours
Acknowledgement Tracking and De-duplication
Given an approver receives alerts across multiple channels for the same approval request When the approver acknowledges via any supported channel (in-app, email link, SMS keyword ACK, chat action) Then the system records the acknowledgement with approver identity and timestamp And suppresses further reminders to that approver for this request And updates the approval detail and dashboard acknowledgement status within 10 seconds
SLA Timer and Multi-Stage Escalation
Given a priority-to-SLA mapping exists: P1=5 minutes, P2=10 minutes, P3=20 minutes And an approval request for a P1 incident remains unacknowledged When the SLA timer reaches 5 minutes without acknowledgement Then an escalation alert is sent to the current on-call leader via Email, SMS, and Chat And if there is still no acknowledgement after an additional 5 minutes, a second-stage escalation is sent to the duty executive And each escalation stage is sent only once and is recorded with timestamps in the audit log
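The SLA timing above can be sketched as a stage function over minutes-unacknowledged. The criterion states the additional 5 minutes to second-stage escalation only for the P1 example; applying the same +5 offset to P2/P3 is an assumption here:

```python
# Sketch of staged SLA escalation: stage 1 (on-call leader) fires at the SLA
# deadline, stage 2 (duty executive) 5 minutes later; each stage fires once.
SLA_MINUTES = {"P1": 5, "P2": 10, "P3": 20}

def escalation_stage(priority: str, minutes_unacked: float) -> int:
    """0 = no escalation yet, 1 = on-call leader, 2 = duty executive."""
    deadline = SLA_MINUTES[priority]
    if minutes_unacked >= deadline + 5:
        return 2
    if minutes_unacked >= deadline:
        return 1
    return 0
```

A dynamic priority change (next criterion) would simply swap the `deadline` used from the moment of the change, with the timers recomputed from the updated SLA.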
Dynamic SLA Adjustment on Priority Change
Given an approval request exists with priority P2 and an active SLA timer When the incident priority changes to P1 Then the remaining SLA is recalculated to 5 minutes from the time of change And all future reminders and escalations use the updated SLA And the dashboard displays a note of the priority change and SLA adjustment
Pending Approvals Dashboard Visibility and Aging
Given one or more approval requests are awaiting action When a user opens the Approvals dashboard Then each item displays broadcast title, incident priority, risk class, age since request (hh:mm), next SLA deadline (timestamp), and acknowledgement status And the list can be sorted by age and filtered by priority, risk, and approver And dashboard data auto-refreshes at least every 15 seconds

Escalation Ladder

Routes pending approvals to on‑call alternates with SLA timers, nudges via push/SMS/voice, and supports one‑tap approvals with secure links or codes. Keeps the two‑key path unblocked when people are busy, cutting approval delays during peak events.

Requirements

On-Call Roster & Schedule Sync
"As an operations supervisor, I want our on-call roster to sync automatically so that escalations always reach the correct person without manual updates."
Description

Synchronize on-call primary and alternate approvers from internal rosters and third‑party schedulers (e.g., PagerDuty, Opsgenie, Google/Microsoft calendars) with timezone, rotation, and holiday overrides. Provide an admin UI and API to manage teams, shifts, and escalation order, with validation for gaps, overlaps, and inactive users. The Escalation Ladder reads this live roster to target the right approver at each step and to auto-select alternates when someone is off-duty. Changes propagate in near real time, ensuring escalations reflect the latest staffing without manual intervention, reducing missed pings and delays.

Acceptance Criteria
Third‑Party Scheduler Sync with Timezone and Holiday Overrides
Given OutageKit is connected to PagerDuty and/or Opsgenie schedules for Team A with defined primary and alternate and associated timezones When the current time enters a new shift window or a schedule change is received via webhook Then the active primary and alternate for Team A in OutageKit are updated within 60 seconds of the change Given a holiday override is present on the third‑party schedule for Team A When the override period is active Then the override assignee becomes primary and the normal assignee is not targeted Given a user on the incoming schedule is inactive in OutageKit or the IdP When syncing assignments Then that user is skipped and the next eligible alternate is selected Given a transient API error occurs during sync When the sync executes Then the system retries up to 3 times with exponential backoff and records a visible sync warning
Calendar‑Based Shift Import (Google/Microsoft) with Rotation
- Given OutageKit is connected to a Google Calendar or Microsoft 365 calendar containing events titled "On‑Call: Team A" and an alternate mapping is configured in the Admin UI, When the current time falls within an event window, Then the event owner (or configured primary attendee) is set as primary for Team A and the configured alternate is set as alternate within 60 seconds.
- Given consecutive events define a rotation, When one event ends and the next begins, Then the primary/alternate switch according to the new event within 60 seconds.
- Given two events overlap for the same team, When importing the calendar, Then the overlap is flagged and the conflicting events are not activated until resolved.
Admin UI Validation for Gaps, Overlaps, and Inactive Users
- Given an admin creates or edits shifts for a team, When attempting to save a schedule with overlapping shifts for the same team, Then the save is blocked and the overlapping intervals are highlighted with error messages indicating the conflicting ranges.
- Given there is any gap between consecutive shifts for a team, When attempting to publish the schedule, Then publishing is blocked and gaps are listed with start/end timestamps until filled.
- Given the schedule includes a user marked inactive, When validating before save, Then the UI requires replacement of the inactive user before the schedule can be saved.
- Given an escalation order is missing an alternate, When validating before publish, Then the UI requires selection of an alternate or removal of the step.
Roster Management API for Teams, Shifts, and Escalation Order
- Given a caller with the RosterAdmin role and a valid Idempotency-Key header, When POST /api/roster/teams is called with a valid payload, Then the API returns 201 Created with the canonical team object including id, name, timezone, and ETag.
- Given an existing team with shifts, When GET /api/roster/teams/{id}/shifts is called, Then the API returns active and future shifts with ISO 8601 timestamps including timezone offsets.
- Given an existing shift definition, When PATCH /api/roster/shifts/{id} is called with an If-Match ETag and valid changes, Then the API returns 200 OK with the updated resource and a new ETag; when the If-Match does not match, Then 409 Conflict is returned.
- Given a caller without the RosterAdmin role, When calling any modifying endpoint, Then the API returns 403 Forbidden.
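The PATCH criterion above relies on standard ETag optimistic concurrency. A minimal in-memory sketch (hypothetical `etag_of` and `patch_shift` helpers, not the real API surface) shows the 200-vs-409 behavior:

```python
import hashlib
import json

def etag_of(resource: dict) -> str:
    """Content hash over the canonical JSON form (an illustrative scheme)."""
    canon = json.dumps(resource, sort_keys=True).encode()
    return hashlib.sha256(canon).hexdigest()[:16]

def patch_shift(store: dict, shift_id: str, changes: dict, if_match: str):
    """Apply the PATCH only if the caller's ETag still matches the resource."""
    resource = store[shift_id]
    if etag_of(resource) != if_match:
        return 409, etag_of(resource)   # stale ETag -> 409 Conflict
    resource.update(changes)
    return 200, etag_of(resource)       # new ETag for subsequent updates
```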
Escalation Ladder Targets Correct Primary/Alternate from Live Roster
- Given an escalation is initiated for Team A at time T, When the current primary is off‑duty per schedule, inactive, or marked unavailable, Then the designated alternate is targeted as the first approver for Team A.
- Given the primary is targeted first, When the primary does not approve within the configured SLA (e.g., 2 minutes), Then the system escalates to the next on‑call approver according to the escalation order.
- Given a DST transition or differing team timezone, When selecting the current assignee, Then the correct primary/alternate is chosen according to the team’s configured timezone and schedule.
Near Real‑Time Propagation of Roster Changes to Escalations
- Given a roster change is saved in the Admin UI or received from an external sync, When the change is committed, Then all new escalations initiated thereafter reflect the updated assignees within 60 seconds; the 95th percentile across 50 test updates is ≤ 60 seconds and no update exceeds 120 seconds.
- Given an escalation is already in progress, When a roster change occurs, Then subsequent escalation steps (not yet sent) use the updated roster while already-sent notifications are not retroactively changed.
- Given multiple concurrent roster updates, When processing changes, Then no two different primaries are active for the same team and time window; the last write by change timestamp determines the final assignment.
SLA Rules & Escalation Engine
"As an incident commander, I want SLA-driven escalation rules so that pending approvals advance automatically and reliably when time limits are reached."
Description

Configurable SLAs and stepwise escalation logic that define how long to wait for an approval, which channels to use per step, and when to advance to alternates or broader groups. Includes per-approval-type policies, time-of-day exceptions, maximum total wait, and quorum requirements. Implements reliable timers, idempotent step transitions, and persistence so escalations survive restarts. Integrates with incidents and approval objects in OutageKit to start, pause, or cancel escalations as context changes, ensuring the two‑key path stays unblocked during peak events.

Acceptance Criteria
Escalate to Alternate on SLA Timeout
Given an approval request with step 1 SLA wait of 2 minutes and configured alternates When the primary approver does not act before the 2-minute SLA expires Then within 5 seconds of expiry the engine advances to the next step and routes to the configured alternate(s) And exactly one escalation event is created with a unique idempotency key And notifications are sent once per configured channel for that step (no duplicates) And the audit log records the step change, target recipients, channels, and timestamps
Time-of-Day Exception Routing
Given org timezone is America/New_York and an off-hours policy (18:00–08:00) sets channels to Voice+SMS with a 3-minute step wait and on-hours defaults to Push+SMS with a 5-minute step wait When an approval is requested at 19:30 local time Then the engine applies the off-hours policy, uses Voice+SMS, and sets the wait to 3 minutes And when an approval is requested at 10:00 local time Then the engine applies the on-hours defaults, uses Push+SMS, and sets the wait to 5 minutes And the audit log indicates which policy variant was applied with timestamps
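A rough sketch of the policy selection above, assuming a simple two-variant config and Python's `zoneinfo` for the org timezone (names like `pick_policy` are illustrative, not OutageKit's configuration model):

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Assumed policy shapes; the real configuration model may differ.
ON_HOURS = {"channels": ["push", "sms"], "wait_minutes": 5}
OFF_HOURS = {"channels": ["voice", "sms"], "wait_minutes": 3}
OFF_START, OFF_END = time(18, 0), time(8, 0)  # 18:00-08:00 local

def pick_policy(now_utc: datetime, tz: str = "America/New_York") -> dict:
    """Choose the on-hours or off-hours variant in the org's local time."""
    local = now_utc.astimezone(ZoneInfo(tz)).time()
    off = local >= OFF_START or local < OFF_END  # window wraps past midnight
    return OFF_HOURS if off else ON_HOURS
```

The key detail is evaluating the window in the org's local timezone, which also makes DST transitions come out right.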
Maximum Total Wait Enforcement
Given an escalation with max_total_wait set to 20 minutes across all steps When 20 minutes elapse without meeting the approval quorum Then the escalation terminates with status "SLA Exceeded" And no further escalation steps or notifications are executed after termination And a final notification is sent to the escalation owner and incident channel indicating SLA exceeded And a metrics counter for "escalation_sla_exceeded" increments by 1 and the event is logged with reason
Quorum-Based Approval Completion
Given an approval type requiring a 2-of-N quorum and an active escalation in progress When two distinct approvers submit approvals via any supported channel Then the escalation completes immediately, all pending timers are canceled, and queued notifications are suppressed And the final decision is recorded with approver identities, channels, and timestamps And any subsequent approvals for the same approval object are ignored with a 409-like "already decided" outcome logged and no side effects And the incident timeline reflects that quorum was met and escalation ended
Idempotent Step Transitions
Given a step transition from step 1 to step 2 is scheduled When duplicate timer expiry events or webhook retries deliver the same transition up to 5 times within 60 seconds Then the engine processes the transition at most once: the step index increments once, one set of notifications is created, and one audit entry is written And subsequent duplicate events return an idempotent no-op response and are counted in a "duplicate_transition" metric without changing state And the persisted state shows a single transition with an idempotency key traceable to the first processed event
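One common way to get this at-most-once behavior is to key each transition by (approval id, from-step) and record processed keys. A toy in-memory sketch; a real deployment would back `_seen` with a database unique constraint so duplicates are rejected transactionally:

```python
import threading

class EscalationEngine:
    """Processes each (approval_id, from_step) transition at most once."""
    def __init__(self):
        self._seen = set()       # processed idempotency keys
        self._lock = threading.Lock()
        self.steps = {}          # approval_id -> current step index

    def advance(self, approval_id: str, from_step: int) -> bool:
        key = (approval_id, from_step)
        with self._lock:
            if key in self._seen:
                return False     # duplicate timer/webhook delivery: no-op
            self._seen.add(key)
            self.steps[approval_id] = from_step + 1
        return True              # first delivery: notify, write audit entry
```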
Persistence & Resume After Restart
Given an active escalation with 90 seconds remaining on the current step When the escalation service restarts unexpectedly Then within 10 seconds of startup the engine reloads pending escalations and reschedules the remaining 90±2 seconds for the current step And the next step fires at the correct adjusted time without skipping or duplicating any step And no escalation records are lost; a "recovered_after_restart" event is logged for the escalation And timers scheduled before restart are reconciled against persisted timestamps to prevent clock drift beyond ±2 seconds
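The recovery rule above boils down to recomputing each timer from persisted timestamps rather than trusting any in-memory countdown; a minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def remaining_after_restart(step_started_at: datetime, sla: timedelta,
                            now: datetime) -> timedelta:
    """Rebuild the timer from persisted timestamps, not in-memory state."""
    deadline = step_started_at + sla
    return max(deadline - now, timedelta(0))  # fire immediately if overdue
```

Rescheduling from `step_started_at + sla` rather than a stored "seconds remaining" value is what keeps clock drift bounded across restarts.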
Integration with Incident Context Changes
Given an escalation linked to Incident #123 with status Active When the incident transitions to Resolved Then the escalation cancels immediately: all timers are canceled, queued notifications are suppressed, and a final cancellation is sent to prior recipients And when the incident transitions to On Hold Then all escalation timers pause and the remaining time is preserved And when the incident returns to Active Then timers resume with the preserved remaining time and the current step context And when the associated approval object is canceled Then the escalation cancels with an audit entry linking the cancellation reason to the approval change
Multi-Channel Nudges & Retry Policy
"As an on-call approver, I want timely nudges over my preferred channels with smart retries so that I can respond quickly without being spammed."
Description

Deliver approval prompts via push, SMS, email, and voice with per-user preferences, quiet hours, and severity-based overrides. Provide templated, localized messages with incident context and one-tap approval links or codes. Implement deduplication across channels, configurable retry cadence with exponential backoff, and provider failover with delivery receipts and webhook-driven status updates. Throttle to prevent alert fatigue while ensuring time-bound attention for critical requests.

Acceptance Criteria
Per-User Delivery Preferences and Quiet Hours Enforcement
- Given a user with channel preferences [Push, SMS, Email, Voice] and quiet hours set to 22:00–06:00 in their timezone, When a Severity=Medium approval request is created at 23:30 local time, Then no nudge is sent until quiet hours end. - Given the same user and a Severity=Critical request with “Bypass quiet hours” enabled, When the request is created at 23:30 local time, Then the initial nudge is sent immediately on the highest-ranked available channel (Push) and logged. - Given a user with Push disabled and SMS enabled, When an approval is triggered, Then the system skips Push and sends via SMS, honoring the preference order. - Given a user’s timezone differs from system timezone, When evaluating quiet hours, Then local time is computed from the user’s timezone and evaluated correctly. - Given a user has opted out of Voice, When retries escalate across channels, Then Voice is never used unless an admin override is explicitly set for Critical severity and audit-logged.
Localized Templated Approval Messages with One‑Tap Links/Codes
- Given a user with language=Spanish, When an approval prompt is sent, Then the message body and IVR prompts render in Spanish using the selected template variant, with fallback to English only if Spanish template is unavailable. - Given an approval request, When the message is generated, Then it includes incident ID, short description, location (if available), severity, requester/team, and SLA deadline timestamp. - Given link-based approvals are enabled, When the message is generated, Then it includes a signed, single-use, 10-minute TTL link that deep-links to the app on mobile and to a secure web page otherwise. - Given code-based approvals are enabled, When the message is generated, Then it includes a 6–8 digit one-time code with 10-minute TTL and rate-limited verification (max 5 attempts). - Given message length limits per channel (e.g., SMS 160 chars segments), When the content exceeds limits, Then it is auto-truncated with a hosted detail link while preserving the approval link/code and critical context.
Cross-Channel Deduplication and Auto-Cancel on First Response
- Given an approval request is sent initially via Push, When SMS is scheduled as a backup within 2 minutes, Then SMS is suppressed if a delivery receipt or user interaction is received on Push within that window. - Given the user approves via any channel, When other channel messages are pending or in-flight, Then all pending retries across all channels are canceled within 5 seconds and no further nudges are sent. - Given multiple systems attempt to trigger the same approval within 60 seconds, When deduplication is enabled, Then only one approval thread is created and referenced by a stable dedupe key. - Given a channel provider reports definitive failure (e.g., invalid number), When deduplication rules evaluate, Then subsequent attempts on other channels proceed but are still deduped against the single approval thread. - Given idempotency keys are reused within 10 minutes, When an identical request is received, Then the system returns the original approval thread reference without sending duplicate notifications.
Configurable Exponential Retry Cadence with SLA Awareness
- Given default retry policy is initial=0m, backoff=2x, maxAttempts=5, jitter=±10%, When an approval is not acknowledged, Then retries occur approximately at 0m, 2m, 4m, 8m, 16m with jitter applied. - Given severity=Critical with SLA=15 minutes to approval, When the default schedule would exceed the SLA, Then the system compresses intervals to ensure at least 4 attempts before the SLA deadline. - Given quiet hours are active, When retries are due for Severity=Medium, Then retries are deferred until quiet hours end; if Severity=Critical with override enabled, Then retries proceed. - Given an admin updates the retry policy for a team, When a new approval is created, Then the new policy is applied and recorded in the audit log; existing approvals retain their original policy. - Given a user interacts (approve/deny/snooze), When further retries are scheduled, Then the schedule is canceled or adjusted per the interaction outcome within 5 seconds.
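The cadence above (send offsets of 0, 2, 4, 8, 16 minutes with ±10% jitter) can be sketched as follows. The SLA-compression step here is one illustrative heuristic for "at least 4 attempts before the deadline", not the product's actual rule:

```python
import random

def retry_schedule(base_minutes=2.0, factor=2.0, max_attempts=5,
                   jitter=0.10, sla_minutes=None, rng=None):
    """Send-time offsets in minutes: 0, 2, 4, 8, 16, each jittered.
    If the 4th attempt would miss the SLA, scale all intervals down so it
    lands inside the deadline (a simple compression heuristic)."""
    rng = rng or random.Random()
    offsets = [0.0] + [base_minutes * factor ** (i - 1)
                       for i in range(1, max_attempts)]
    if sla_minutes is not None and len(offsets) > 3 and offsets[3] >= sla_minutes:
        scale = (sla_minutes * 0.9) / offsets[3]   # fit attempt 4 in the SLA
        offsets = [o * scale for o in offsets]
    return [o + o * rng.uniform(-jitter, jitter) for o in offsets]
```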
Provider Failover with Delivery Receipts and Webhook Status
- Given Provider A is configured as primary for SMS, When no delivery receipt is received within 30 seconds of send, Then the message is retried via Provider B within 10 seconds and annotated as failover in logs. - Given a provider webhook delivers status updates (queued, sent, delivered, failed), When updates arrive, Then the approval thread status is updated idempotently within 3 seconds and visible to operators. - Given both Provider A and B fail, When the system detects repeated failures, Then it escalates to the next available channel (e.g., Voice) if permitted by user preferences and severity overrides. - Given webhook retries from providers may arrive out of order, When processing updates, Then the system preserves the latest terminal state using message timestamps and sequence IDs. - Given a provider returns a transient error, When retry policy applies, Then the resend is attempted on the same provider up to the provider-specific cap before failing over.
Secure One‑Tap Approval and Code Entry with Anti‑Replay
- Given a one-tap approval link with 10-minute TTL, When the link is opened after expiry or after being used once, Then the action is rejected with a message to request a new approval. - Given severity=Critical and policy requires step-up verification, When the one-tap link is clicked, Then the user must complete 2FA (e.g., OTP) before the approval is recorded. - Given a code-based approval is attempted more than 5 times incorrectly, When further attempts occur within 15 minutes, Then verification is temporarily locked and the approver is notified via preferred channel. - Given a valid approval is recorded, When audit logs are written, Then logs include approver identity, channel, device/user agent (where available), IP/geolocation (where permitted), timestamp, and request fingerprint. - Given CSRF or open-redirect vectors, When the approval endpoint is invoked, Then the system enforces origin checks and redirects only to allow-listed domains.
Throttling and Severity-Based Overrides to Prevent Alert Fatigue
- Given per-user throttle is set to max 3 nudges per 15 minutes for non-critical requests, When more than 3 approvals target the same user within that window, Then additional nudges are suppressed and aggregated into a single digest message. - Given severity=Critical, When throttling would suppress a critical approval, Then up to 2 additional critical nudges are allowed within 15 minutes before suppression applies, and this bypass is audit-logged. - Given multiple approvals for the same incident are generated within 5 minutes, When sending nudges, Then they are coalesced into a single multi-approval message with distinct action links for each approval. - Given throttling suppresses a nudge, When the suppression occurs, Then the system notifies the requester with a reason and next eligible send time. - Given an SLA deadline is within 5 minutes, When throttling is in effect, Then at least one nudge is sent before the deadline unless the user has fully opted out of all channels.
One-Tap Secure Approvals (Links & Codes)
"As an approver in the field, I want secure one-tap approvals with expiring links or codes so that I can approve safely without logging into a console."
Description

Enable frictionless approvals via short-lived, signed links and one-time codes usable in web, mobile app deep links, SMS, and IVR DTMF. Enforce device and session verification, optional step-up authentication (2FA/passkeys) based on risk, and automatic expiration and single-use constraints. Bind tokens to request scope, IP/risk checks, and brand-protected domains to reduce phishing risk. All approvals are recorded with channel, device, and geo metadata, integrating with OutageKit auth and audit subsystems.
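A minimal sketch of a signed, single-use, scope-bound link token of the kind described above, using HMAC with an expiry and a one-use `jti`. The shared secret and in-memory redemption set are placeholders; a real system would use a managed signing key and persistent storage:

```python
import base64
import hashlib
import hmac
import secrets
import time

SECRET = b"rotate-me-in-kms"   # placeholder; use a managed signing key
_redeemed = set()              # consumed jti values (a DB table in practice)

def mint_link_token(request_id: str, recipient: str, ttl_s: int = 300) -> str:
    """Sign request scope + recipient + expiry + a single-use jti."""
    jti = secrets.token_urlsafe(8)
    exp = int(time.time()) + ttl_s
    body = f"{request_id}|{recipient}|{exp}|{jti}"
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{body}|{sig}".encode()).decode()

def redeem(token: str, request_id: str, recipient: str) -> bool:
    try:
        req, rcp, exp, jti, sig = (base64.urlsafe_b64decode(token)
                                   .decode().split("|"))
    except ValueError:
        return False                                 # malformed
    body = f"{req}|{rcp}|{exp}|{jti}"
    want = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, want):
        return False                                 # tampered
    if req != request_id or rcp != recipient:
        return False                                 # wrong scope/recipient
    if int(exp) < time.time() or jti in _redeemed:
        return False                                 # expired or replayed
    _redeemed.add(jti)                               # single use: consume jti
    return True
```

Binding the token to both the request ID and the recipient is what makes a forwarded or phished link useless for any other approval.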

Acceptance Criteria
SMS/App One‑Tap Approval via Signed Link
Given a pending approval assigned to an approver with a verified mobile number and identity When the system sends an SMS containing a signed, short‑lived HTTPS link on a brand‑protected domain Then tapping the link opens the approval summary with one‑tap Approve/Reject actions And the token’s signature, audience, subject (request ID), recipient, and expiry are validated server‑side And the default token TTL is 5 minutes (configurable 1–15 minutes) And if the mobile app is installed, the link deep‑links via OS universal links into the Approvals screen; otherwise it falls back to web And a valid tap commits the decision within 2 seconds end‑to‑end and returns HTTP 200 And any tampered, wrong‑recipient, expired, or scope‑mismatched token returns HTTP 401/403 and no state change
IVR Approval via One‑Time Code (DTMF)
Given a pending approval and an approver calling from any phone When the IVR verifies identity by request ID and sends or prompts a one‑time code Then a 6‑digit random code expires in 5 minutes, is single‑use, and allows max 10 attempts before 10‑minute lockout And pressing 1 approves and 2 rejects after code verification, with confirmation playback And success/failure is committed within 2 seconds and includes channel=IVR in the audit record And invalid or expired codes never change state and respond with a clear TTS error
Single‑Use and Expiry Enforcement Across Channels
Given any approval token or code previously issued When the same token/code is replayed, used after expiry, or used concurrently Then the request is rejected with HTTP 409/410 (web/app) or clear TTS (IVR) and no state change And the first successful redemption marks the jti as consumed and invalidates all siblings And issuing a new token/code for the same request automatically revokes all prior ones
Risk‑Based Step‑Up Authentication
Given risk rules (e.g., new device fingerprint, IP country change, TOR/hosting ASN, after‑hours access) When a one‑tap approval is initiated on a risky signal Then step‑up authentication is required (TOTP or platform passkey on web/app; secondary code on IVR) And low‑risk events proceed without step‑up And all decisions log risk score, factors, and step‑up outcome; denials show reason to user without leaking specifics
Device/Session Verification and Ephemeral Sessions
Given an approver with an active OutageKit session on the brand domain When the signed link is opened on the same device and session Then the approval action is executable in one tap without re‑authentication And opening the link without an active session creates an ephemeral approval‑only session bound to device fingerprint and IP And device/session mismatch triggers step‑up per policy or denies with 401
Brand‑Protected Domains and Anti‑Phishing Controls
Given links are delivered by SMS/email/push When the recipient inspects or opens the link Then the URL uses HTTPS on a customer‑approved subdomain with HSTS and no open redirects And links are never shortened by public shorteners and include human‑readable branded host And DKIM+SPF+DMARC alignment is enforced for email; registered sender ID or long code is used for SMS And any request from an unapproved host is rejected with 403
Comprehensive Audit Logging and Auth Integration
Given any approval attempt (success or failure) When the attempt is processed Then an audit record is written within 2 seconds containing request ID, approver ID, channel, device fingerprint, user agent/app version, IP, ASN, country, coarse geo, risk score, step‑up method/result, token ID (hashed), timestamps, result, and reason And audit records are immutable, searchable by request/approver within 1 second, and exportable via API And audit events are correlated with OutageKit auth subsystem session logs and share a clock‑synchronized timeline with <200 ms skew And PII is stored per policy with masking where required
Two-Person Integrity & Conflict Controls
"As a compliance-conscious operations manager, I want enforced two-person integrity and conflict checks so that approvals meet policy without creating bottlenecks."
Description

Enforce the two‑key rule: block the requester from approving their own changes, block members of restricted groups from approving, detect duplicate identities across channels/devices, and require distinct approver roles where policy demands. Support dynamic quorum policies for major incidents, explicit override workflows with justification, and hard blocks where policy forbids overrides. Integrate checks at approval time and at each escalation step to maintain separation of duties without stalling the workflow.
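The conflict screen and quorum check described above might look roughly like this; the dict shapes for approvers and requests are assumptions for illustration only:

```python
def eligible(approver: dict, request: dict) -> bool:
    """Conflict screen: no self-approval, no restricted-group members."""
    if approver["person_id"] == request["requester_id"]:
        return False
    if set(approver["groups"]) & set(request["restricted_groups"]):
        return False
    return True

def quorum_met(approvals, request, quorum=2, required_roles=()):
    """Count unique, conflict-free people and check the required role mix."""
    people, roles = set(), set()
    for a in approvals:
        if not eligible(a, request) or a["person_id"] in people:
            continue               # conflicted, or duplicate identity
        people.add(a["person_id"])
        roles.add(a["role"])
    return len(people) >= quorum and set(required_roles) <= roles
```

Keying on a canonical `person_id` (rather than per-channel identities like phone or email) is what collapses duplicate identities to a single vote.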

Acceptance Criteria
Block Self-Approval by Requester
Given a pending change request created by User A When User A attempts to approve via web UI, SMS link, or IVR code Then the system rejects the action with HTTP 403 / IVR denial and message "Self‑approval not permitted" And the approval remains pending with no state change And the attempt is logged with user ID, channel, device fingerprint, timestamp, and IP
Prevent Approval by Restricted Groups
Given the request is tagged with policy restricting Group X from approving And User B is a member of Group X When User B attempts approval via any channel Then the system blocks the approval with message "Conflict: restricted group" And the attempt is logged and counted as 0 toward quorum And the escalation ladder skips Group X members and notifies the next eligible alternate
Detect and Collapse Duplicate Identities Across Channels/Devices
Given a single person is mapped to identities {email, phone, SSO account, device} When two approval attempts are received from any combination of these identities Then the system counts them as one unique approver toward quorum And the second attempt returns message "Duplicate identity not counted" And the audit trail links the attempts under the same person record
Enforce Distinct Approver Roles in Quorum
Given the workflow requires roles: Operations and Duty Manager When approvals are collected from two Operations users Then the request remains pending with status "Waiting for Duty Manager" And the UI/API presents the missing role requirement And the escalation ladder targets the on‑call Duty Manager next
Dynamic Quorum Policy for Major Incidents
- Given incident severity = Major and policy quorum = 3 with at least 1 Security Officer, When approvals are received from three unique, conflict‑free identities within the SLA window, Then the approval is granted and the action executed.
- When approvals do not meet the role mix or arrive after SLA expiry, Then the system escalates, expires stale approvals per TTL, and restarts quorum collection.
Override Workflow With Justification and Hard Blocks
- Given policy allows overrides on Medium severity and forbids them on Critical, When an Override approver initiates an override on Medium, Then the system requires 2FA re‑auth, a justification of at least 20 characters, and a second independent approver; on success the action executes and the audit log records justification, approvers, channels, and timestamps.
- When an override is attempted on Critical, Then the system hard‑blocks the action, displays "Override forbidden by policy", and notifies Compliance.
Conflict Screening During Escalation
Given a pending approval has entered escalation When the system selects candidates and sends push/SMS/voice nudges Then each candidate is pre‑screened for self‑approval, restricted groups, duplicate identity, and role requirements And conflicted candidates are skipped automatically And the next eligible alternate is contacted within 30 seconds And the audit log records all skips, contact attempts, and SLA timer state
Approval Queue & Operations Console
"As a duty manager, I want a live queue of pending approvals with context and controls so that I can keep escalations moving during busy events."
Description

Provide a real-time console showing pending approvals, current SLA stage, recipient history, aging, and the next escalation step. Offer filters, sorting, bulk reassignment, snooze/deferral with reason, and inline comments. Display concise incident context (summary, impact footprint, ETA) and recent communications so operators can take corrective action quickly. Integrates into OutageKit’s incident view, supports keyboard shortcuts, and meets accessibility standards for rapid triage during peak load.

Acceptance Criteria
Real-time Queue Refresh & SLA Stage Accuracy
- Given the Approval Queue is open, when a new approval is created or an item's SLA stage changes, then the row appears/updates within 5 seconds without manual refresh and displays the correct SLA stage and aging timer. - Rule: A visible "Last updated" timestamp reflects the most recent event time within ±1 second accuracy. - Rule: Aging increments every second and matches the server-calculated age within ±2 seconds. - Rule: With 1,000 items in the queue, scrolling maintains >45 FPS on baseline hardware and memory usage remains <500 MB in Chrome. - Given no backend events are received for 10 seconds, when the console detects staleness, then a "Connection delayed" banner is shown and aging timers pause until connectivity resumes.
Advanced Filtering, Sorting, and Saved Views
- Given items exist, when filters (SLA stage, assignee, team, severity, aging range) are applied, then only matching rows display and the results count matches the filtered total. - Given a search query (incident ID, summary text, recipient name, phone/email), when entered, then matching items display within 300 ms on a dataset of 500 rows. - Given the user sets sort by aging, SLA stage, severity, or assignee, then the list orders correctly and the active sort indicator is visible. - Given filters, sort, and search are combined, when the user saves the view with a name, then it persists to the user profile, is selectable later, and can be set as default. - Rule: "Clear all" resets filters, sort, and search to defaults and restores the full list.
Bulk Reassignment with Audit Trail
- Given the user has Manage Approvals permission and selects N (>=1) items, when Reassign is confirmed to a target user/team, then all N items update assignee and a success toast shows the reassigned count. - Rule: An audit log entry is created per item capturing actor, timestamp (UTC), previous assignee, new assignee, and optional reason. - Given partial failures occur, then per-item error messages are displayed, successful items remain reassigned, and a Retry option is available for failed ones. - Rule: Reassignment immediately updates the "Next escalation step" preview to reflect the new assignee's path.
Snooze/Defer with Reason and SLA Adjustment
- Given an item is selected, when Snooze is applied for 5, 15, 30 minutes or a custom duration (max 120 minutes) with a required reason, then the item shows "Snoozed until HH:MM" and appears in the Snoozed filter. - Rule: Snoozing shifts the next escalation countdown by the snooze duration and displays the updated time and channel. - Given an item is snoozed, when Unsnooze is invoked, then original SLA timers resume from current time and the snooze audit record persists. - Rule: All snooze/unsnooze actions require a reason (list or free text), are time-stamped, actor-recorded, and visible in the item's history.
Inline Comments with @Mentions and Notifications
- Given an item detail pane is open, when a comment with @mentions to users/teams is posted, then the comment appears within 2 seconds with author and timestamp, and mentioned parties receive notifications per preferences with a deep link to the item. - Rule: Comments support plain text up to 1,000 characters; edits are allowed for 5 minutes; subsequent edits create a new revision with history. - Rule: Comments entered in the console are visible in the incident view's conversation and vice versa within 5 seconds. - Rule: Comments respect incident permissions; users without incident access cannot view them.
Incident Context Panel and Recent Communications
- Given an item is selected, then the side panel displays Incident Summary (≤140 chars, truncated with tooltip), Impact Footprint (customer count and affected areas), current ETA (or "ETA not set") with an edit link. - Rule: The panel lists the 5 most recent communications (SMS, email, voice) with timestamps, channel icons, and direction (inbound/outbound), plus a "View all" control opening the full thread in the incident view. - Rule: Recipient history is displayed showing sent/delivered/read/answered events with timestamps, channel, and response codes. - Rule: Context data mirrors the incident model and reflects external updates within 5 seconds; if a field is missing, placeholder text and CTAs are shown instead of blanks.
Keyboard Shortcuts and Accessibility (WCAG 2.2 AA)
- Rule: Keyboard shortcuts exist and function: Up/Down (navigate), Enter (open details), R (reassign), S (snooze/unsnooze), C (add comment), / (focus search), F (focus filters), ? (show shortcuts). A visible cheat sheet lists them. - Rule: All actions are possible using keyboard only; focus order is logical; focus indicators are visible on all interactive elements. - Rule: Live updates are announced via ARIA live regions (e.g., "Item updated, SLA stage: Level 2"); all controls have accessible names/roles/states. - Rule: Contrast ratio ≥ 4.5:1; no keyboard traps; no content flashes >3 times/second. - Rule: Automated accessibility scan (axe-core) reports 0 critical violations; manual screen reader smoke test confirms labels for primary controls.
SLA Compliance Metrics & Immutable Audit Log
"As a reliability lead, I want detailed metrics and an immutable audit trail so that I can prove compliance and improve our escalation effectiveness over time."
Description

Capture every nudge, response, and escalation transition with timestamps, actor, channel, device fingerprint, and policy decisions in an append-only, tamper-evident log. Provide dashboards for mean/median time to approve, breach rates by step/policy, approver responsiveness, and channel effectiveness. Support exports (CSV/JSON), retention policies, and privacy safeguards (PII minimization, encryption at rest), enabling post-incident review and regulatory compliance for approval workflows.
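A hash-chained append-only log of the kind described here is straightforward to sketch: each record stores its predecessor's hash, so any edit or gap breaks verification from that point on. Names are illustrative, not OutageKit's schema:

```python
import hashlib
import json

GENESIS = "0" * 64

class AuditLog:
    """Append-only log; each record hashes its predecessor (tamper-evident)."""
    def __init__(self):
        self.records = []

    def append(self, event: dict) -> dict:
        prev = self.records[-1]["hash"] if self.records else GENESIS
        body = {**event, "prev_hash": prev}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        record = {**body, "hash": digest}
        self.records.append(record)
        return record

    def verify(self) -> int:
        """Return -1 if the chain is intact, else the first broken index."""
        prev = GENESIS
        for i, rec in enumerate(self.records):
            body = {k: v for k, v in rec.items() if k != "hash"}
            ok = (rec["prev_hash"] == prev and
                  hashlib.sha256(json.dumps(body, sort_keys=True)
                                 .encode()).hexdigest() == rec["hash"])
            if not ok:
                return i
            prev = rec["hash"]
        return -1
```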

Acceptance Criteria
Append‑Only Tamper‑Evident Audit Log for Escalation Events
- Given any nudge, response, approval, rejection, or escalation transition occurs in the Escalation Ladder, When the event is processed, Then a new audit record is appended containing: UTC ISO‑8601 timestamp, event type, actor ID (or system), channel (SMS/Web/IVR/Voice/Email/Push), device fingerprint (hashed), policy and step, decision outcome, request/trace ID, previous record hash, and current record hash.
- Given an API client attempts to update or delete an existing audit record, When the request is executed, Then the operation is rejected (HTTP 405 or 403), no existing record is modified, and the attempt is logged with outcome "blocked".
- Given a sequence of audit records for a workflow instance, When integrity is verified, Then the hash chain validates from genesis to latest with no gaps; if a record is missing or altered, verification fails and the index of the first failing link is returned.
SLA Metrics Computation and Breach Attribution
Given approvals across multiple policies and steps within a selected date range When the metrics computation runs (near‑real‑time with ≤60s freshness) Then mean and median time‑to‑approve are computed per policy and per step; breach rate = breached/total is computed per policy and per step; metrics are derived solely from audit log timestamps. Given a step has an SLA target (e.g., 5 minutes) When an approval exceeds the target Then the event is marked as a breach with attribution fields: policy, step, approver actor ID, channel used, and last nudge timestamp. Given a random sample of 100 approvals When metrics are recomputed directly from raw audit entries Then dashboard metrics match within ±1% for rates and ±1 second for times.
SLA Compliance Dashboards and Filters
Given a user with Operations Manager role When they open the SLA dashboard and apply filters (date range, policy, step, channel, team, region) Then cards and charts for mean/median time‑to‑approve, breach rates, approver responsiveness distribution, and channel effectiveness (CTR and approval conversion) render within 3 seconds at the 90th percentile. Given a dashboard data point is clicked When the user drills down Then a paginated list of underlying audit events is shown with consistent counts, and each row links to the immutable audit trail for that workflow instance. Given a timezone is selected When the dashboard renders Then timestamps are displayed in the selected timezone while exports preserve UTC.
Export Metrics and Audit Logs
Given a filtered dashboard or audit log view When the user exports as CSV or JSON Then the file contains only the filtered rows and the defined schema (timestamp, event type, actor ID, channel, device fingerprint hash, policy, step, decision, request/trace ID, previous hash, current hash); row count equals the on‑screen total. Given up to 100,000 rows are requested When export is initiated Then the download is ready in ≤10 seconds; for larger datasets, a streamed or asynchronous export with progress and completion notification is provided. Given PII minimization is enabled When exporting Then phone numbers and emails are masked (e.g., last 4 visible), device fingerprints are salted hashes, and no raw tokens, access codes, or message bodies are included. Given an export completes When it is delivered Then the export action is logged with requester ID, scope, time, format, and a checksum of the file.
Retention Policy Enforcement and Encryption at Rest
Given a retention policy of 180 days is configured When records exceed 180 days Then PII fields are irreversibly anonymized or the record is purged per policy, and a retention‑tombstone entry is appended containing the hash of the removed/redacted record to preserve chain integrity. Given data‑at‑rest requirements When storing audit logs and metrics Then all data is encrypted at rest with AES‑256 via managed KMS; keys are rotated per policy; only the audit service account can write; read access is role‑based and all accesses are logged. Given a retention policy change When an admin updates the policy Then the change requires dual authorization, is logged with before/after values, and takes effect prospectively only.
Automated Integrity Verification and Alerting
Given the scheduled integrity verification job When it runs Then it validates the full hash chain and publishes an attestation (Merkle root and timestamp) to an external store; the latest attestation is viewable in the UI. Given any integrity verification failure When detected Then a Sev‑2 alert is sent to on‑call within 2 minutes via SMS/Push/Email, and the dashboard shows "Integrity Check Failed" with the failing record index. Given successful verification over the last 24 hours When viewed in the UI Then the dashboard shows a "Last verified" timestamp within the past 24 hours and zero failures.

Immutable Audit Ledger

Tamper‑evident, append‑only log recording initiator, approvers, timestamps, diffs, and justifications with exportable reports. Simplifies compliance reviews, proves who changed what and when, and builds trust with regulators and leadership.

Requirements

Cryptographic Append‑Only Ledger
"As a compliance officer, I want an immutable audit ledger that proves every change and broadcast was recorded with cryptographic integrity so that I can demonstrate tamper‑evidence to regulators and leadership."
Description

Implement an immutable, append‑only event store that links each audit record via a cryptographic hash chain and per‑tenant Merkle roots to provide tamper‑evidence. The ledger records all material actions in OutageKit—including incident lifecycle changes, ETA updates, notification broadcasts, configuration edits, and permission changes—with write‑once semantics, idempotent ingestion, and at‑least‑once persistence. Integrate with existing event pipelines to capture normalized payloads and metadata, including actor identity, source service, and correlation IDs. Support multi‑tenant partitioning, encryption at rest, high‑throughput writes, and horizontal scalability. Provide read APIs to fetch entries, paginate by time and cursor, and retrieve Merkle proofs for integrity verification. Optionally anchor daily Merkle roots to an external transparency mechanism to increase evidentiary strength without introducing on‑chain dependencies.

Acceptance Criteria
Hash Chain Integrity on Append
Given a tenant ledger with at least one existing entry When a new entry Ei is appended Then Ei.prev_hash equals SHA-256(canonical(Ei-1)) And Ei.hash equals SHA-256(canonical(Ei) || Ei.prev_hash) And canonical(record) is JSON with UTF-8 encoding, sorted keys, and no insignificant whitespace And the verification API or CLI for the tenant returns valid=true for the full chain from genesis to head When any persisted field of an earlier entry Ej is altered outside the append pathway Then the verification API or CLI returns valid=false And reports firstBrokenEntryId == Ej.entry_id
Per-Tenant Daily Merkle Roots and Proof Verification
Given tenant T and UTC day D (00:00:00 to 23:59:59.999 UTC) When requesting the daily Merkle root for T on D Then the system returns root R(T,D) computed over the list of entry hashes for T on D in ascending timestamp, breaking ties by entry_id ascending When requesting a Merkle proof for entry E in T on D Then the proof verifies E.hash against R(T,D) When attempting to verify E against a root for a different tenant or day Then verification fails with a mismatchedRoot error
Write-Once, Idempotent Ingestion, At-Least-Once Delivery
Given an event with idempotency_key K for tenant T When the event is submitted N>=2 times due to retries Then exactly one ledger entry is persisted for K And all responses return the same entry_id and hash Given an ingest attempt for event X When the system crashes after durable write but before acknowledging upstream Then upon recovery the event may be retried And the ledger contains exactly one entry for X When any client attempts to update or delete an existing ledger entry Then the API responds 405 Method Not Allowed (or 409 Conflict if applicable) And no existing entries are modified or removed
Normalized Metadata and Required Fields
For every persisted ledger entry Then the following fields are present and valid:
- tenant_id (UUID v4)
- entry_id (ULID or UUID), unique per tenant
- timestamp (UTC ISO 8601 with millisecond precision)
- event_type (from registered enumeration)
- actor_id (UUID or service principal ID)
- actor_type (human|service|system)
- source_service (registered service name)
- correlation_id (non-empty string, max 128 chars)
- payload_hash (SHA-256 of normalized payload or diff)
- justification (string, may be empty)
- prev_hash (omitted only for genesis entry)
- hash
And payload or diff is stored in normalized canonical JSON (UTF-8, sorted keys, no insignificant whitespace) And payload_hash matches the stored payload/diff And missing or invalid fields cause the write to be rejected with HTTP 400 and field-specific error codes
Read APIs: Time-Ordered Cursor Pagination
Given tenant T with more than 1,000 entries between start_time S and end_time E When calling GET /ledger?tenant=T&start_time=S&end_time=E&page_size=200 Then results are ordered by timestamp ascending, breaking ties by entry_id ascending And each page contains at most 200 entries and an opaque next_cursor when more data remains When iterating using next_cursor until completion Then the union of entries across pages contains every entry in [S,E] exactly once with no gaps or duplicates When requesting include_proof=true for a specific entry Then the response includes a Merkle proof that verifies against the corresponding daily root
Multi-Tenant Isolation and Partitioning
Given tenants T1 and T2 When writing entries for T1 Then they are not readable via T2 credentials or API tokens And Merkle roots and proofs for T1 do not verify entries from T2 And cross-tenant read attempts return 403 Forbidden And storage/index metrics or explain plans show queries scoped by tenant_id only touch T1 partitions
Encryption at Rest and Throughput/Scalability SLAs
Given the ledger data store Then encryption at rest is enabled with AES-256-GCM using KMS-managed keys And data encryption keys (DEKs) are rotated at least every 90 days via customer master key (CMK) re-wrapping When a CMK rotation occurs Then new writes use the new key version And previously written data remains decryptable And an online re-encryption job can be run without downtime Under a load test of 10,000 writes/sec sustained for 10 minutes across at least 10 tenants Then write error rate <= 0.1% And p99 write latency <= 200 ms And adding two additional nodes to the ledger cluster increases sustained throughput by >= 80% compared to baseline
Approval & Justification Capture
"As an operations manager, I want sensitive actions to require documented approval and justification so that we can satisfy change‑control policies and clearly show who authorized what and why."
Description

Capture and enforce recording of initiator, approvers, timestamps, and explicit justifications for sensitive actions (e.g., ETA overrides, mass notifications, template edits, permission changes). Integrate pre‑commit guards in UI and API to block completion until required approvals and a reason are supplied, with configurable approval flows by action type and tenant policy. Store reason codes (taxonomy) plus free‑text rationale with minimum length and optionally require attachment links (e.g., incident ticket). Persist the full approval graph (requested, approved, rejected, escalated) with actor identities from SSO, device/IP context, and step timestamps, all bound into the audit entry’s signature to prevent repudiation.

Acceptance Criteria
UI Pre-Commit Guard for Sensitive Action (ETA Override)
Given a tenant policy for ETA override requires 2 distinct approvers (excluding the initiator), a reason code from the tenant taxonomy, a free-text rationale of at least 20 characters, and an attachment link When an initiator attempts to commit an ETA override without selecting a reason code Then the UI blocks submission, highlights the missing field, and displays a clear validation message Given the same policy When the initiator provides a free-text rationale shorter than 20 characters Then the UI blocks submission and displays a validation message indicating the minimum length requirement Given the same policy When the initiator attempts to select themselves as an approver or is the sole approver Then the UI blocks submission and displays a policy violation message Given all required fields are satisfied and 2 distinct approvers have approved When the final required approval is recorded Then the action is committed, the UI shows success, and an audit entry is created referencing the approval request ID
Tenant-Scoped Configurable Approval Flow by Action Type
Given tenant policy configures Mass Notification to require 1 approver from role Operations within 15 minutes with escalation to Duty Manager, and Permission Change to require 2 approvers from role Admin When an initiator submits a Mass Notification request Then the system routes the approval to an Operations approver, starts a 15-minute SLA timer, and marks the request as awaiting_approval Given the Mass Notification approval remains pending beyond 15 minutes When the SLA timer elapses Then the system escalates to Duty Manager and records an escalation step with timestamp and target role Given a Permission Change request is submitted When approvals are collected Then the system enforces 2 distinct Admin approvers and blocks commit until both are recorded
API Enforcement and Error Semantics
Given a client submits an API request to perform a Template Edit with no reason code and rationale When the request is received Then the API responds 422 with a machine-readable error payload enumerating missing reason_code and rationale fields Given a client submits an API request to commit a Mass Notification action that requires prior approvals When approvals have not been satisfied Then the API responds 409 with status awaiting_approval and includes a link to the approval resource Given a client includes the initiator as an approver in the same request When policy forbids self-approval Then the API responds 403 with code policy_violation.self_approval Given a client retries the same approval request with the same Idempotency-Key within 24 hours When the original request is still processing Then the API returns 202 with the same approval request ID and status without creating duplicates
Approval Graph Persistence and Context Capture
Given an action requiring approval is initiated and progresses through request, approval, and escalation steps When the action completes Then the audit entry persists the full approval graph including each step type (requested, approved, rejected, escalated), actor SSO identity (subject ID, display name, email), device fingerprint, IP address, and step timestamps in order Given the audit entry exists When retrieved via the audit API by ID Then the response returns the approval graph exactly as stored, with immutable identifiers for each step and actors Given an attempt is made to modify an existing approval step via any API When the request is processed Then the API responds 405 or 409 indicating immutability, and no persisted data is changed
Audit Entry Signature Binding and Tamper Evidence
Given an approval-complete action is committed When the audit entry is written Then the system generates a signature over the action diff, initiator identity, approver identities, reason code, free-text rationale, device/IP context, and all step timestamps, and stores the signature with the entry Given the audit entry is retrieved When the signature is verified using the system's verification key Then the verification returns valid Given any field covered by the signature is altered (simulated tamper) When verification runs Then verification fails and the entry is flagged tampered=true, a security event is emitted, and the entry remains read-only
Rejection and Escalation Flow Controls
Given a Permission Change action requires 2 approvers When the first approver rejects with a reason Then the request status changes to rejected, the action is not committed, the rejection reason is recorded in the graph, and the initiator is notified Given tenant policy allows escalation after rejection When the initiator chooses to escalate Then a new escalation step is added with target role/time, prior approvals remain immutable, and a new approval cycle begins Given a request was rejected When an approver attempts to approve without a new approval cycle Then the system blocks the approval and returns 409 indicating the prior cycle is closed
Reason Code Taxonomy and Rationale/Attachment Validation
Given tenant policy defines allowed reason codes {Safety, Vendor, Compliance} and sets minimum rationale length to 20 characters When a user selects a reason code not in the taxonomy or leaves it blank Then the system blocks submission with a validation error Given the same policy When the user enters a rationale of fewer than 20 characters (after trimming) Then the system blocks submission with a message indicating remaining characters required Given tenant policy for ETA override requires an attachment link When the user provides a non-URL value or a URL exceeding 2048 characters Then the system blocks submission with a validation error Given a valid reason code, rationale meeting minimum length, and a valid attachment URL (when required) When the user submits Then the request is accepted and proceeds to the approval workflow
Structured Diff Recording
"As a reviewer, I want precise before/after diffs for each audited change so that I can quickly understand the impact and verify correctness without inspecting raw payloads."
Description

Generate and store normalized, field‑level before/after diffs for audited objects (incidents, ETAs, customer impact scopes, templates, routing rules). Use deterministic serialization to ensure consistent hashing and include summaries for complex structures (e.g., geo‑diffs for polygons, recipient count deltas for broadcasts). Redact or tokenize sensitive fields per data‑classification policy while preserving a cryptographic digest for integrity. Attach diffs to their parent audit entries and expose diff‑aware views and APIs to enable precise review, rollback analysis, and compliance evidence of what exactly changed.

Acceptance Criteria
ETA Update Diff for Incident
Given an existing incident with ETA="2025-08-11T10:00:00Z" and a user updates ETA to "2025-08-11T12:30:00Z" via the console When the update is saved Then an audit entry is created with a structured, field-level diff containing field "eta" with before/after ISO 8601 UTC values And the diff is stored in normalized form and attached to the audit entry with a deterministic serialization SHA-256 hash And GET /api/audit/{entryId}/diff returns the normalized diff with HTTP 200 and Content-Type application/json And the UI diff view renders the field-level change with policy-based redaction applied And the diff record includes initiator_id and millisecond-precision timestamp
Geo-Diff Summary for Impact Polygon Change
Given a customer impact scope polygon is modified by adding/removing vertices and/or rings When the change is saved Then the structured diff includes a geo_summary with: area_before_sqkm, area_after_sqkm, area_delta_sqkm (rounded to 0.01), vertex_count_before/after, vertex_count_delta, ring_count_before/after, and bounding_box_before/after And the diff references the impacted geometry IDs and change operations (added, removed, moved) And deterministic serialization produces the same geo_summary and SHA-256 hash for identical geometries across runs And for polygons up to 10,000 vertices, geo-diff generation completes in under 500 ms on reference hardware
Redaction and Tokenization of Sensitive Fields
Given audited objects contain sensitive fields (e.g., phone_number, email, auth_token) per data-classification policy When a structured diff is generated Then plaintext sensitive values are redacted or tokenized and are not stored in diffs, logs, or exports And a cryptographic digest (HMAC-SHA256 with key_id) of the original value is stored to preserve integrity verification And re-computing the digest with the correct key reproduces the stored digest; with an incorrect key it does not And API and UI present masked values and digest metadata only, with zero plaintext leakage verified by scan
Deterministic Serialization and Stable Hashing
Given two semantically equivalent audited objects differ only by key order, whitespace, numeric formatting, or timezone offsets When diffs are serialized and hashed Then canonical serialization normalizes: sorted keys, UTF-8 NFC strings, canonical numeric formatting, and ISO 8601 Z timestamps And SHA-256 digests are identical across repeated serializations and environments for the same semantic state And any material field change yields a different digest And re-serializing the same diff 3 times yields byte-for-byte identical output
Diff Attachment, Indexing, and Retrieval
Given an audit entry is created for a change to an audited object When requesting GET /api/audit/{entryId}/diff Then the response includes parent_audit_id, object_type, object_id, version_before, version_after, and the structured diff And the diff is discoverable via GET /api/audit?objectId={id}&page={n} with stable, timestamp-desc sorting and pagination metadata And unauthorized callers receive 403 without revealing the existence of the entry; non-existent entryId returns 404 And the UI "Diff" tab renders the same normalized diff as the API
Broadcast Recipient Delta and Template Diff
Given a routing rule or template change alters broadcast recipients from 10000 to 12500 and modifies message text When the change is saved Then the diff includes recipient_count_before=10000, recipient_count_after=12500, and recipient_count_delta=2500 And a digest of the recipient set (SHA-256 over sorted canonical IDs) is stored; no individual PII is recorded in the diff And the template diff lists added/removed/modified placeholders and per-channel content changes (sms, email, voice) And diff generation completes in under 300 ms for recipient sets up to 50,000
Trusted Timestamping & Time Cohesion
"As a security architect, I want trusted, signed timestamps on audit entries so that event order and timing are defensible in investigations and audits."
Description

Issue server‑signed timestamps for every audit entry using synchronized clocks (NTP/Chrony) with drift monitoring and alarms. Record both event_time (when the change occurred) and ledger_time (when persisted) plus monotonic sequence numbers per partition to establish order. Optionally obtain RFC 3161 timestamp tokens from a Time Stamping Authority for high‑assurance cases. Persist clock health metrics and include timestamp proofs in exports and verification APIs to increase evidentiary value during audits.

Acceptance Criteria
Server-Signed Timestamps on Audit Entries
Given a new audit entry is created When the entry is persisted Then a server-signed timestamp (UTC, microsecond precision) is attached using the platform signing key and includes key_id Given the exported public key set When the timestamp signature is verified Then verification succeeds and the timestamp equals ledger_time Given an invalid or missing signature When persistence is attempted Then the write is rejected and an error is logged with reason=signature_verification_failed
Dual Times: event_time and ledger_time Cohesion
Given a server-originated change When the entry is persisted Then both event_time and ledger_time are recorded in UTC with microsecond precision and ledger_time >= event_time Given the recorded times When delta_ms = ledger_time - event_time is computed Then delta_ms <= 100 for server-originated events; otherwise the entry is flagged time_delta_exceeds_threshold Given a client-supplied event_time When delta_ms > 5000 Then the entry is persisted with flag=client_clock_suspect and included in drift metrics
Monotonic Sequence Numbers per Partition
Given a partition_id When successive entries are persisted Then their sequence numbers are strictly increasing by 1 starting from 1 and are unique within the partition Given concurrent writers When entries are read Then total order within each partition is deterministic by sequence number without gaps or duplicates Given a failed transaction When it aborts Then no sequence number is consumed
Clock Drift Monitoring and Alarms
Given NTP/Chrony is configured with at least 2 upstream servers When the service is running Then clock metrics (offset_ms, jitter_ms, stratum, last_sync_at) are sampled and persisted every 60 seconds Given three consecutive samples where |offset_ms| > 200 When evaluated Then a Critical alarm is emitted and future audit entries are flagged clock_drift_critical=true Given no successful sync for > 300 seconds When evaluated Then a Warning alarm is emitted and metrics include unsynced_for_secs
RFC 3161 TSA Token Acquisition (Optional)
Given TSA is configured and reachable When an audit entry is persisted Then a hash over the canonicalized entry is submitted to the TSA and a valid RFC 3161 timestamp token is stored with the entry Given the TSA's public certificate When the stored token is verified Then verification succeeds and the TSA time is within 2 seconds of ledger_time Given TSA is unavailable for up to 60 seconds When retries (max 3) fail Then the entry is persisted with tsa_status=pending and a background job obtains and attaches the token within 10 minutes or marks tsa_status=failed
Exports and Verification API Include Proofs and Metrics
Given an export is requested for a time window and partition set When the export is generated Then each record includes event_time, ledger_time, sequence, server signature, TSA token (if any), and a reference to clock metrics covering the record time Given the verification API and an export file When verification is executed Then all server signatures and TSA tokens validate, record counts match, and per-partition ordering is strictly increasing by sequence Given any record flagged for clock drift or tsa_status!=ok When verification completes Then the report lists the count and identifiers of exceptions with reasons
Exportable Compliance Reports
"As a compliance analyst, I want to export signed, filterable audit reports with integrity proofs so that I can respond to regulator inquiries quickly and confidently."
Description

Provide self‑service and scheduled exports of audit data filtered by date range, actor, action type, incident, and tenant. Support JSONL and CSV for machine analysis and digitally signed PDF for human‑readable reports, including integrity proofs (Merkle proof for the selection and daily root), chain‑of‑custody metadata, and optional PII redaction. Deliver exports via download, secure email, and SFTP, with API endpoints for automation. Include watermarks, pagination, and reproducibility guarantees (stable sorting and deterministic generation) to streamline regulator requests and internal reviews.

Acceptance Criteria
On-Demand Filtered Export via Web Console (CSV/JSONL)
Given a user selects a tenant and applies filters for date range (UTC, inclusive start, exclusive end), actor(s), action type(s), and incident ID(s), When the user requests CSV, Then the downloaded file contains only matching records, is UTF-8 encoded, RFC 4180 compliant with a single header row, and uses a consistent column order across exports. Given the same filters and the user requests JSONL, Then the downloaded file contains only matching records, is UTF-8 encoded, newline-terminated, and contains one valid JSON object per line with deterministic key ordering. Then results are sorted by event_timestamp ASC, then event_id ASC, yielding a stable order across runs. Then repeating the same export within the same ledger daily root produces byte-identical CSV and JSONL files (matching SHA-256 checksums). Then the filename follows the pattern outagekit_audit_{tenant}_{fromUTC}_{toUTC}_{format}_{sha256-8}.{csv|jsonl}.
Digitally Signed PDF Report with Integrity Proofs
Given a user requests a PDF export for a filtered selection, When the export completes, Then the PDF is digitally signed (PAdES-compatible) and signature validation succeeds against the configured certificate chain and timestamp authority. Then each page displays a visible watermark "OutageKit Compliance Report" and a footer with "Page X of Y" and a UTC generated-at timestamp. Then the PDF package includes an embedded appendix or attachment containing: the selection Merkle proof set, the daily ledger root value and date, the hash algorithm (SHA-256), and checksums of export files. Then independently recomputing the selection root from the exported records and provided proofs matches the embedded selection root and links to the published daily root for that date. Then any modification to the PDF after generation causes signature validation to fail.
Scheduled Exports via Secure Email and SFTP
Given a user creates a schedule specifying tenant, filters, format(s), delivery channel(s), and time zone, When the schedule triggers, Then the export runs at the configured local time (±1 minute) and generates the specified formats. Then secure email deliveries transmit using TLS 1.2+ and contain expiring download links valid for 72 hours; no PII appears in the email body. Then SFTP deliveries authenticate with the configured SSH key and host key verification, upload files to the configured path, and set file permissions to 0600 on success. Then if any delivery fails, the system retries up to 3 times with exponential backoff and records the final status; the requester is notified of success or failure. Then the schedule’s audit trail records initiator, schedule definition, last run timestamp, outcome, and next run time.
API-Driven Export Automation
Given an API client POSTs to /v1/audit/exports with tenant, filters (UTC date range, actor, action type, incident), and format, When the request is valid, Then the API returns 202 with export_id and status=processing. Then GET /v1/audit/exports/{export_id} returns status in {processing, ready, failed} plus metadata: created_at (UTC), parameters hash, checksum(s), and file size(s). Then GET /v1/audit/exports/{export_id}/download streams the artifact with correct Content-Type and Content-Disposition filename per naming pattern. Then repeated POSTs with the same idempotency key and identical parameters return the original export_id without creating duplicates. Then all API responses for a tenant are scoped to that tenant; no cross-tenant data is ever returned.
Optional PII Redaction Across Formats
Given a user enables PII redaction for an export, When the export is generated, Then fields classified as PII (e.g., phone_number, email, caller_name) are masked or removed per policy in CSV, JSONL, and PDF outputs, and metadata indicates Redaction: enabled. Then non-PII fields, column/key order, and sort order remain unchanged relative to non-redacted exports. Then integrity proofs included with the PDF continue to verify record membership and order for the selection and daily root; redaction does not invalidate proof verification. Then filenames and API metadata include a "-redacted" indicator.
Reproducibility and Stable Sorting Guarantees
Given two exports are produced with identical parameters and within the same ledger snapshot (same daily root), Then CSV and JSONL outputs are byte-identical (equal SHA-256), and the PDF’s pre-signature content hash is identical. Then sorting is stable: records with identical timestamps maintain relative order using event_id as a deterministic tiebreaker. Then PDF pagination is deterministic for identical inputs: the same records appear on the same page numbers with identical page breaks and footers. Then each export includes a reproducibility token comprising parameters, selection root, tool version, and format version to allow regeneration of identical outputs.
Chain-of-Custody Metadata and Tenant Isolation
Given any export is generated, Then it embeds metadata including: export_id (UUID), tenant_id, initiator user_id, approver(s) if present, requested_at and generated_at (UTC), applied filters, tool and format version, selection root (SHA-256), daily root (SHA-256) with date, delivery channel(s), and checksum(s) per file. Then all delivery methods (download, secure email, SFTP, API) include or link to this metadata and a checksums.txt manifest. Then attempts to include records from another tenant are rejected with 403, and validation confirms that no cross-tenant records appear in the export. Then audit trail entries are appended for export initiation, completion or failure, and each delivery attempt with precise UTC timestamps.
Integrity Verification & Anomaly Alerts
"As a platform owner, I want automated integrity checks with real‑time alerts so that any tampering or data corruption is detected and addressed immediately."
Description

Continuously verify ledger integrity by recomputing hash chains and Merkle roots, comparing against stored values and any external anchors. Surface verification status and history in a dashboard and via API, and emit alerts to email, Slack, and PagerDuty on detection of gaps, reordering, or corruption. Quarantine suspect segments to read‑only mode, capture forensic artifacts, and provide guided remediation procedures. Track verification coverage SLIs and expose metrics for observability to ensure continuous trust in the ledger.
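The core of the verifier is mechanical: recompute each entry's link hash from its predecessor and rebuild the Merkle root from the leaf hashes. A minimal sketch follows (illustrative Python; the entry fields, genesis value, and odd-leaf rule are assumptions about the ledger layout, not the shipped implementation):

```python
import hashlib

GENESIS = "0" * 64

def link_hash(prev_hash: str, payload: bytes) -> str:
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def verify_chain(entries):
    """Recompute each link hash from its predecessor; flag gaps, reordering,
    and corruption with the anomaly types used by the alerting pipeline."""
    anomalies = []
    prev, expected = GENESIS, 0
    for e in entries:
        if e["index"] != expected:
            anomalies.append(("GAP" if e["index"] > expected else "REORDER", e["index"]))
            expected = e["index"]
        if link_hash(prev, e["payload"]) != e["link_hash"]:
            anomalies.append(("CORRUPTION", e["index"]))
        prev = e["link_hash"]
        expected += 1
    return anomalies

def merkle_root(leaf_hashes):
    """Pairwise SHA-256 reduction; an odd trailing leaf is paired with itself."""
    level = list(leaf_hashes)
    while len(level) > 1:
        level = [
            hashlib.sha256((level[i] + level[min(i + 1, len(level) - 1)]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]

# Build a tiny valid ledger, then verify it end-to-end.
entries, prev = [], GENESIS
for i, payload in enumerate([b"report-1", b"report-2", b"report-3"]):
    h = link_hash(prev, payload)
    entries.append({"index": i, "payload": payload, "link_hash": h})
    prev = h
root = merkle_root([e["link_hash"] for e in entries])
```

Mutating any stored payload breaks the recomputed link hash for that index without disturbing later links, which is exactly the per-segment granularity quarantine needs.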

Acceptance Criteria
Scheduled Incremental and Full Ledger Verification
Given the verifier is enabled and the ledger has at least 1,000 entries, When the incremental job runs every 60 seconds, Then 100% of entries appended since the last run have their hash-chain recomputed and match stored link-hash values, And any gap, duplicate, or out-of-order index is flagged. Given a full verification window is due at 02:00 UTC daily, When the full job executes, Then 100% of on-disk entries are verified end-to-end within 30 minutes for ledgers up to 10 million entries, And the resulting Merkle root equals the persisted root for that snapshot, And job success is recorded with duration, coverage percentage, and commit id.
External Anchor Verification and Degraded Mode
Given an external anchor is configured hourly, When an anchor window closes, Then the computed anchor value matches the last published anchor with the same anchor_id, And if mismatched, a Critical verification anomaly is created with mismatch details. Given the anchor service is unavailable, When the verifier attempts retrieval, Then status is set to Degraded with cause=ANCHOR_UNAVAILABLE, And no Critical anomaly is raised, And a Warning is logged and exported as a metric.
Multi-Channel Anomaly Alerting and Deduplication
Given an anomaly of type GAP, REORDER, or CORRUPTION is detected, When the anomaly is created, Then an alert is delivered to Email, Slack, and PagerDuty within 30 seconds (p95) with unique incident_key, affected_range, severity, and remediation link. Given repeated detections for the same anomaly_key within 30 minutes, When alerts are generated, Then downstream alerts are deduplicated and suppressed, And a heartbeat update is sent at 15-minute intervals until resolution. Given the anomaly is resolved, When verification passes for the affected range, Then a single Resolved notification is sent to all channels and PagerDuty incident auto-closure is triggered.
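The deduplication and heartbeat behavior above reduces to two timestamps per anomaly_key. A sketch under the stated 30-minute dedup and 15-minute heartbeat windows (illustrative; the class and return values are assumptions, and actual channel fan-out is not shown):

```python
DEDUP_WINDOW_S = 30 * 60
HEARTBEAT_S = 15 * 60

class AlertDeduper:
    """Dedupe repeated detections of the same anomaly_key within the 30-minute
    window, emitting a heartbeat at most every 15 minutes until resolution."""

    def __init__(self):
        self._first_alert = {}
        self._last_signal = {}

    def on_detection(self, anomaly_key, now):
        first = self._first_alert.get(anomaly_key)
        if first is None or now - first > DEDUP_WINDOW_S:
            self._first_alert[anomaly_key] = now
            self._last_signal[anomaly_key] = now
            return "ALERT"  # fan out to Email, Slack, PagerDuty
        if now - self._last_signal[anomaly_key] >= HEARTBEAT_S:
            self._last_signal[anomaly_key] = now
            return "HEARTBEAT"
        return "SUPPRESSED"

    def on_resolved(self, anomaly_key):
        self._first_alert.pop(anomaly_key, None)
        self._last_signal.pop(anomaly_key, None)
        return "RESOLVED"  # single Resolved notification, then auto-close
```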
Verification Status Dashboard and API Exposure
Given a user with role=OpsManager, When they open the Integrity tab, Then they see current overall status (OK/Degraded/Failed), coverage ratio, last run times (incremental/full), open anomalies count, and a 24h timeline, all updating within 5 seconds of backend change. Given an API client with token scope=ledger.read, When they GET /api/v1/ledger/verification/status, Then the response is 200 JSON containing fields: status, coverage_ratio, last_incremental_run_at, last_full_run_at, anomalies[], anchors[], and job_durations, with p95 latency < 300 ms. Given an unauthorized client, When they call the same endpoint, Then response is 401 or 403 with no sensitive data leakage.
Automatic Quarantine and Forensic Artifact Capture
Given an anomaly affects entries i..j, When the anomaly is confirmed by the verifier, Then the segment [i..j] is marked read-only within 5 seconds, And all write attempts overlapping the segment are rejected with 409 Conflict and audited. Given quarantine is active, When forensic capture runs, Then artifacts include: offending hashes, diffs, timestamps, node ids, approvers, justifications, and raw blocks, stored immutably and retrievable via /api/v1/ledger/forensics/{anomaly_id}. Given a user lacks role=LedgerAdmin, When they attempt to lift quarantine, Then the action is blocked with 403 and audited.
Guided Remediation Playbooks with Approvals
Given an active anomaly, When a user with role=LedgerAdmin opens the remediation panel, Then a context-specific playbook is presented with pre-checks, dry-run option, and estimated impact. Given the playbook requires approvals=2, When the first approver submits with MFA and justification, Then the request is recorded in the audit ledger, And execution is blocked until a second approver authorizes. Given remediation is executed, When steps complete, Then all actions are logged with timestamps and diffs, And the anomaly status transitions to Resolved, And a post-remediation verification passes for the affected range.
Verification Coverage SLIs and Observability Metrics
Given metrics scraping is enabled, When Prometheus scrapes /metrics, Then it exposes series: outagekit_ledger_verification_coverage_ratio, outagekit_ledger_verification_last_run_age_seconds, outagekit_ledger_anomalies_total{type}, outagekit_quarantine_segments, outagekit_anchor_mismatch_total, outagekit_alert_delivery_latency_seconds, outagekit_verification_job_duration_seconds, each with labels env, tenant, region. Given 24-hour operation, When SLI calculations run hourly, Then coverage_ratio >= 0.999, last_run_age_seconds p95 < 120, alert_delivery_latency_seconds p95 < 30, and verification_job_success_ratio >= 0.99, with SLO breach alerts emitted if thresholds are not met.

Lifeline Login

Emergency, token-based access for when your identity provider is down. During storms, users sign in securely without reconfiguring SSO: after passing hardware-key and IP checks, they receive a time-limited session that keeps operations moving while preserving strong security.

Requirements

IdP Outage Detection & Auto-Fallback
"As an operations manager, I want the system to detect when SSO is down and offer Lifeline Login automatically so that I can still access the console during incidents."
Description

Continuously monitor the configured identity provider (OIDC/SAML) for health, latency, and error rates. When thresholds indicate an outage or severe degradation, automatically switch the OutageKit sign-in flow to the Lifeline Login path with clear in-product messaging. Preserve tenant-level feature flags and policies, gracefully fail back to SSO when health is restored, and expose observability metrics and events for operations dashboards. Integrates with existing authentication gateway without requiring SSO reconfiguration.
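The threshold logic described here can be evaluated over a rolling window of health probes. The sketch below assumes one probe per second and hypothetical default thresholds; real values come from tenant configuration, and the function name is an assumption:

```python
def idp_status(samples, latency_ms=2000, error_rate_pct=50.0,
               outage_unreachable_secs=30):
    """Classify IdP health from a window of probes.
    samples: list of (latency_ms or None if unreachable, ok: bool),
    one probe per second, oldest first."""
    # Outage: the health endpoint has been unreachable for the trailing run.
    unreachable_run = 0
    for latency, ok in reversed(samples):
        if latency is not None:
            break
        unreachable_run += 1
    if unreachable_run >= outage_unreachable_secs:
        return "outage"
    reachable = [(l, ok) for l, ok in samples if l is not None]
    if not reachable:
        return "outage"
    # Degraded: p95 latency or 5xx/timeout error rate breaches its threshold.
    errors = sum(1 for _, ok in reachable if not ok)
    err_pct = 100.0 * errors / len(samples)
    latencies = sorted(l for l, _ in reachable)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    if p95 >= latency_ms or err_pct >= error_rate_pct:
        return "degraded"
    return "healthy"
```

A `degraded` or `outage` result for a tenant is what routes that tenant's sign-in flow to Lifeline, while the anti-flapping guard and recovery window (described in the criteria) gate the transition back.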

Acceptance Criteria
IdP degradation triggers Lifeline auto-fallback
Given tenant T has configured IdP monitoring thresholds (latency_ms, error_rate_pct, outage_unreachable_secs, window_secs) And the system observes either p95 latency ≥ latency_ms for ≥ window_secs, or 5xx/timeout error rate ≥ error_rate_pct for ≥ window_secs, or health endpoint unreachable for ≥ outage_unreachable_secs When a user from tenant T initiates sign-in Then the sign-in flow for tenant T is routed to the Lifeline Login path within ≤ 5 seconds of threshold breach And the UI displays a notice that SSO is temporarily unavailable and Lifeline is enabled And a fallback_transition event is recorded with tenant_id, trigger_type, thresholds, and timestamps
Tenant-specific outage isolation
Given tenants T1 and T2 are hosted on the same cluster with independent IdPs And only T1’s IdP exceeds the configured outage/degradation thresholds When users from T1 and T2 attempt to sign in Then T1 is routed to Lifeline Login while T2 continues normal SSO And metrics, events, and routing changes are scoped only to T1 with no impact to T2
Clear in-product messaging during fallback
Given fallback mode is active for tenant T When a user views the sign-in page Then a prominent warning banner is shown indicating SSO is unavailable and Lifeline Login is active And the banner includes last_updated timestamp and a link to status details And the banner meets accessibility requirements (aria-live polite, contrast ≥ 4.5:1) and is localized to the user’s language
Preservation of tenant feature flags and policies in fallback
Given tenant T has feature flags F and access policies P configured When a user authenticates via Lifeline during fallback Then the same flags F and policies P are resolved server-side and applied to the session And authorization outcomes (roles/entitlements) match those under SSO for the same user identity attributes And no feature flag defaults or policy enforcement are altered by fallback mode
Graceful failback with anti-flapping
Given IdP health for tenant T remains below all outage/degradation thresholds for recovery_window_secs When the system evaluates health on the next check interval Then the sign-in flow automatically returns to SSO with no admin action required And existing Lifeline sessions persist until their normal expiry with no forced logout And an anti-flapping guard enforces at most one transition (fallback or failback) per 10 minutes And a failback_transition event is recorded with tenant_id, reason, and timestamps
Observability metrics and events published
Given monitoring is enabled When fallback or failback occurs for tenant T Then the system publishes metrics: ok_idp_health_status{tenant_id} in {healthy,degraded,outage}, ok_idp_latency_p95_ms{tenant_id}, ok_idp_error_rate_pct{tenant_id}, ok_fallback_active{tenant_id} (0/1), ok_fallback_transitions_total{tenant_id}, ok_time_in_fallback_seconds_total{tenant_id} And emits a structured event to the event bus/webhook within ≤ 2 seconds including tenant_id, status, trigger, reason, thresholds, and correlation_id And the operations dashboard reflects the state change within ≤ 10 seconds
Gateway integration without SSO reconfiguration
Given tenant T uses the existing authentication gateway with OIDC/SAML configured When fallback activates Then no changes are made to IdP metadata, redirect URIs, or SP configuration And routing to the Lifeline endpoint is toggled internally without admin intervention And the admin UI shows SSO configuration unchanged throughout fallback and after failback And sign-in via SSO functions normally after failback without any reconfiguration
Time-Bound, Scoped Session Creation
"As a NOC supervisor, I want lifeline sessions to be time-limited and least-privileged so that we reduce risk while keeping operations moving."
Description

Upon successful Lifeline verification, create a least-privilege session with configurable, short-lived TTL (e.g., 60–240 minutes), enforced server-side expiration, and device/IP binding. Limit accessible resources to essential outage operations, require re-authentication after TTL or upon IdP recovery, and provide immediate admin-driven revocation. Integrates with OutageKit RBAC to map lifeline roles to minimal permissions and logs all scope decisions for audit.
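The TTL validation, device/IP binding, and scope checks could look like the following sketch (illustrative Python; the scope names, field layout, and function names are assumptions, not OutageKit's RBAC model):

```python
import hashlib
import secrets
import time

DEFAULT_TTL_MIN = 120
LIFELINE_SCOPES = {"outage.map.view", "incident.status.update", "notifications.send"}

def create_lifeline_session(user_id, device_fingerprint, source_ip,
                            ttl_minutes=DEFAULT_TTL_MIN, now=None):
    """Issue a least-privilege lifeline session: short TTL enforced via a
    server-side expiry timestamp, bound to device fingerprint and source IP."""
    if not 60 <= ttl_minutes <= 240:
        raise ValueError("lifeline TTL must be 60-240 minutes")
    now = time.time() if now is None else now
    return {
        "session_id": secrets.token_urlsafe(32),
        "user_id": user_id,
        "scopes": frozenset(LIFELINE_SCOPES),
        "device_hash": hashlib.sha256(device_fingerprint.encode()).hexdigest(),
        "source_ip": source_ip,
        "expires_at": now + ttl_minutes * 60,
        "lifeline": True,
    }

def authorize(session, scope, device_fingerprint, source_ip, now):
    """Server-side checks on every request: expiry first, then bindings,
    then least-privilege scope. Server time alone decides expiry."""
    if now >= session["expires_at"]:
        return 401, "expired"
    if hashlib.sha256(device_fingerprint.encode()).hexdigest() != session["device_hash"]:
        return 401, "device mismatch"
    if source_ip != session["source_ip"]:
        return 401, "ip mismatch"
    if scope not in session["scopes"]:
        return 403, "out of scope"
    return 200, "ok"
```

Admin revocation would simply delete the server-side record; because every check runs server-side, a revoked or expired token fails on the next request regardless of client clock.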

Acceptance Criteria
Configurable Lifeline Session TTL
Given Lifeline login is enabled and TTL is configured to a value between 60 and 240 minutes inclusive When a user completes Lifeline verification Then the created session TTL equals the configured value and the expiry timestamp is set server-side Given TTL is configured below 60 or above 240 minutes When the configuration is saved Then the system rejects the value with a validation error and logs the rejected setting Given TTL is not explicitly configured When a Lifeline session is created Then the system applies the default TTL of 120 minutes and logs the applied default
Server-Side Expiration Enforcement
Given a Lifeline session with a 120-minute TTL When 120 minutes have elapsed since issuance Then any API call with the session token returns 401 Unauthorized and the UI forces sign-out Given a Lifeline session is expired When the client attempts to refresh or extend the session without re-authentication Then the request is denied and no new token is issued Given a client device has an incorrect local clock When requests are made before or after TTL Then expiration is determined solely by server time and behavior is consistent
Device and IP Binding Enforcement
Given a Lifeline session is created When subsequent requests originate from a different device fingerprint or user agent Then the session is invalidated and the user is prompted to re-authenticate Given a Lifeline session is created from source IP A When a request is received from source IP B Then the request is rejected with 401 Unauthorized and an audit event records the IP mismatch Given simultaneous requests from multiple devices using the same Lifeline token When processed by the server Then only the original bound device is accepted and other requests are denied
Least-Privilege Scope via RBAC
Given the user signs in via Lifeline When the session scope is established Then only permissions required for essential outage operations are granted (e.g., view outage map, update incident status, send customer notifications) and all administrative/configuration actions are excluded Given the user attempts to access a non-essential endpoint or UI action When the request is evaluated Then access is denied with 403 Forbidden and the denial is logged with the blocked permission Given OutageKit RBAC role mappings exist When a Lifeline role is resolved Then it maps to the minimal corresponding RBAC permissions and the mapping decision is recorded for audit
Admin-Initiated Immediate Revocation
Given an active Lifeline session exists When an admin revokes the session via console or API Then the session is invalid within 10 seconds across all nodes and further requests return 401 Unauthorized Given a session is revoked When the user has an open UI Then the UI receives a forced sign-out event and displays a revocation message Given a session is revoked When audit logs are queried Then an entry shows who revoked it, when, target user/session ID, and optional reason
Re-Authentication on TTL Expiry or IdP Recovery
Given a Lifeline session reaches TTL expiry When the user performs any action Then the user is prompted to re-authenticate and cannot proceed without a new session Given the primary IdP is detected as recovered When the user next performs a privileged action Then the user is required to re-authenticate via SSO instead of Lifeline Given a Lifeline session is active When the user attempts to extend it without re-authentication Then extension is denied; no sliding refresh is allowed for Lifeline sessions
Comprehensive Audit Logging of Scope Decisions
Given a Lifeline session is created When scope and bindings are determined Then the audit log records user, assigned lifeline role, mapped RBAC permissions, TTL, expiry timestamp, device fingerprint, and source IP Given access to a non-scoped resource is attempted When the request is denied Then an audit entry records the denied permission/resource and reason (out of scope) Given audit logs are exported When filtered by session ID or user Then all related scope and revocation events are retrievable with tamper-evident timestamps
One-Time Token Issuance & Validation
"As an on-call engineer, I want to receive a one-time code to complete sign-in so that I can authenticate securely when SSO is unavailable."
Description

Generate cryptographically strong, single-use lifeline tokens tied to the user, device fingerprint, and policy context. Enforce short token expiry (e.g., 10 minutes), replay protection, and attempt throttling with opaque, non-enumerable responses. Validate token, nonce, and state server-side before session creation, and record the lifecycle for auditing. Works independently of IdP availability and leverages OutageKit’s secure key management.
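The issuance and single-use validation path might be sketched as follows, with `secrets.token_urlsafe` supplying the entropy and an in-memory HMAC digest store standing in for the KMS-backed persistence (illustrative; the store and function names are assumptions):

```python
import hashlib
import hmac
import secrets
import time

SERVER_KEY = secrets.token_bytes(32)  # stand-in for a KMS-managed signing key
TOKEN_TTL_S = 10 * 60

_tokens = {}  # HMAC digest -> binding metadata; a durable store in practice

def issue_token(user_id, device_fingerprint, now=None):
    """Mint a single-use lifeline token: 256 bits of entropy, URL-safe
    encoding, persisted only as an HMAC-SHA-256 digest bound to the user,
    device fingerprint, and expiry. The plaintext is returned exactly once."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits of entropy
    digest = hmac.new(SERVER_KEY, token.encode(), hashlib.sha256).hexdigest()
    _tokens[digest] = {"user": user_id, "device": device_fingerprint,
                       "expires_at": now + TOKEN_TTL_S, "redeemed": False}
    return token

def redeem_token(token, user_id, device_fingerprint, now=None):
    """All failure paths (unknown, expired, replayed, wrong binding) collapse
    to the same opaque False, keeping responses non-enumerable."""
    now = time.time() if now is None else now
    digest = hmac.new(SERVER_KEY, token.encode(), hashlib.sha256).hexdigest()
    rec = _tokens.get(digest)
    ok = (rec is not None and not rec["redeemed"] and now < rec["expires_at"]
          and rec["user"] == user_id and rec["device"] == device_fingerprint)
    if ok:
        rec["redeemed"] = True  # burn on first successful use
    return ok
```

Because only the HMAC digest is stored, a database leak reveals no redeemable tokens, and validation needs no call to the external IdP.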

Acceptance Criteria
Cryptographically Strong One-Time Token Generation
Given OutageKit is operational and the IdP is unavailable When an authorized operator triggers issuance of a lifeline token for a valid user and specified policy context Then the system generates a token with at least 256 bits of entropy, encoded as URL-safe Base64 without predictable prefixes/suffixes And the token is server-bound to userId, deviceFingerprint, and policyContext And the token value is stored only as a KMS-backed HMAC-SHA-256 (no plaintext persistence) And metadata includes createdAt and expiresAt set per the configured TTL (default 10 minutes) And issuance completes without any outbound calls to the external IdP
Short Expiry Enforcement with Clock Skew Tolerance
Given a lifeline token is issued with a 10-minute TTL When the token is presented within its validity window allowing up to 2 minutes of clock skew Then validation succeeds and proceeds to subsequent checks When the token is presented outside the validity window (including skew) Then validation fails with the same generic response and no session is created And expiresAt is immutable after issuance and policy-configurable between 5 and 15 minutes And server responses never disclose remaining TTL or expiry details
Single-Use and Replay Protection
Given a valid lifeline token with a server-issued nonce and state bound to the token, deviceFingerprint, and policyContext When the token+nonce+state are presented successfully the first time Then the validation is performed atomically and the token is immediately invalidated for any future use And the nonce is verified as server-issued, single-use, and unexpired And any subsequent attempt reusing the token (with any nonce/state or from any device/IP) fails with the same generic response And all replay attempts are recorded with a correlationId and do not reveal whether the token was ever valid
Attempt Throttling and Enumeration-Resistant Responses
Given a client submits token validation requests When more than 5 failed attempts occur within 10 minutes for the same userId, deviceFingerprint, or source IP Then subsequent attempts are throttled with exponential backoff up to 60 seconds per attempt And after 20 failed attempts within 60 minutes, the subject is temporarily locked for 15 minutes per policy And invalid, expired, or nonexistent tokens all return the same status code and message template with response-time jitter of 100–300 ms And responses do not indicate whether the user exists, the token format is valid, or the token is expired
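The throttling numbers in this criterion translate directly into a small policy function (illustrative; the window bookkeeping for the failure counters is assumed to happen elsewhere, and function names are hypothetical):

```python
import random

MAX_ATTEMPTS_SHORT = 5   # failed attempts allowed per 10-minute window
MAX_ATTEMPTS_LONG = 20   # failed attempts allowed per 60-minute window
LOCKOUT_S = 15 * 60

def backoff_delay(failed_attempts):
    """Exponential backoff after the 5th failure, capped at 60 s per attempt."""
    if failed_attempts <= MAX_ATTEMPTS_SHORT:
        return 0.0
    return min(60.0, 2.0 ** (failed_attempts - MAX_ATTEMPTS_SHORT))

def response_jitter(rng=random):
    """Uniform 100-300 ms padding applied to every validation response so
    invalid, expired, and nonexistent tokens are indistinguishable by timing."""
    return rng.uniform(0.100, 0.300)

def throttle_decision(fails_10m, fails_60m):
    """Return (state, delay_seconds) for the next validation attempt."""
    if fails_60m >= MAX_ATTEMPTS_LONG:
        return ("locked", LOCKOUT_S)
    if fails_10m > MAX_ATTEMPTS_SHORT:
        return ("throttled", backoff_delay(fails_10m))
    return ("allowed", 0.0)
```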
Server-Side Validation and Session Creation
Given a token, nonce, and state are presented from a device matching the bound deviceFingerprint and within the allowed IP policy When all validations succeed (token integrity, binding, nonce/state validity, and expiry) Then the system creates a lifeline session independent of the IdP with a policy-defined TTL (default 1 hour) And the session is signed using OutageKit KMS-backed keys and flagged lifeline=true And only least-privilege scopes defined by the policyContext are granted And the response sets a single HTTP-only, Secure, SameSite cookie and does not include the token in URLs, headers, or logs
Audit Trail and Token Lifecycle Recording
Given token issuance, validation attempts, success, expiry, and revocation events occur Then an append-only audit record is written for each event with fields: eventType, pseudonymous tokenId, userId, deviceFingerprint hash, source IP, timestamp, outcome, reason, correlationId, and policyVersion And audit streams are tamper-evident via hash chaining and KMS signing at least every 5 minutes And authorized roles can query the last 30 days of events in under 5 seconds and records are retained for at least 365 days per policy And plaintext tokens are never logged or exported
Operation During IdP Outage and KMS Key Management
Given the external IdP is unreachable When issuing and validating lifeline tokens Then all flows complete without any network calls to the IdP And cryptographic operations use OutageKit KMS keys rotated at least every 90 days And token HMAC verification accepts the current and immediately previous key to support seamless rotation And administrators can revoke all unredeemed tokens immediately via policy without affecting already established valid sessions
Hardware Key Verification (WebAuthn)
"As a security admin, I want hardware key verification enforced during lifeline sign-in so that only trusted users gain access."
Description

Require a successful WebAuthn/FIDO2 assertion with an enrolled hardware or platform security key as part of the lifeline flow. Support roaming and platform authenticators, enforce user presence/verification, and validate against a securely cached set of registered credentials for offline resilience. Provide clear UX prompts and fallback policies configurable by admins, and log attestation details for security review.
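Enforcing the UP/UV requirement means parsing the fixed 37-byte header of WebAuthn authenticatorData: a 32-byte rpIdHash, one flags byte (bit 0 = UP, bit 2 = UV), and a 4-byte big-endian signCount. A sketch of that check (illustrative Python; error strings follow the WEB_AUTHN_UV_REQUIRED convention in the acceptance criteria, and signature verification against the cached public key is omitted):

```python
import hashlib
import struct

FLAG_UP = 0x01  # user presence (flags bit 0)
FLAG_UV = 0x04  # user verification (flags bit 2)

def check_assertion_auth_data(auth_data: bytes, rp_id: str, last_sign_count: int):
    """Parse the WebAuthn authenticatorData header (rpIdHash 32 bytes,
    flags 1 byte, signCount 4 bytes big-endian) and enforce rpId match,
    UP=1, UV=1, and an increasing counter (0 = counterless authenticator)."""
    if len(auth_data) < 37:
        return False, "AUTH_DATA_TOO_SHORT"
    rp_id_hash, flags = auth_data[:32], auth_data[32]
    (sign_count,) = struct.unpack(">I", auth_data[33:37])
    if rp_id_hash != hashlib.sha256(rp_id.encode()).digest():
        return False, "RP_ID_MISMATCH"
    if not flags & FLAG_UP:
        return False, "WEB_AUTHN_UP_REQUIRED"
    if not flags & FLAG_UV:
        return False, "WEB_AUTHN_UV_REQUIRED"
    if sign_count != 0 and sign_count <= last_sign_count:
        return False, "REPLAY_SUSPECTED"
    return True, sign_count

# Simulated assertions against a hypothetical rpId: one with UP|UV and an
# advanced counter, one with user presence but no user verification.
RP_ID = "outagekit.example"
good = hashlib.sha256(RP_ID.encode()).digest() + bytes([FLAG_UP | FLAG_UV]) + struct.pack(">I", 7)
up_only = hashlib.sha256(RP_ID.encode()).digest() + bytes([FLAG_UP]) + struct.pack(">I", 8)
```

All of this runs against the locally cached credential record, which is what makes offline verification during an IdP outage possible.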

Acceptance Criteria
Successful WebAuthn Assertion Grants Lifeline Session
Given a user initiates Lifeline Login and has at least one enrolled lifeline credential And the server issues a cryptographically random, single-use challenge that expires in 60 seconds When the user completes a WebAuthn get() assertion using an enrolled credential Then the server verifies the origin matches the configured allowlist and the rpId matches the OutageKit domain And validates the signature using the cached public key for the credentialId And confirms authenticator flags UP=1 and UV=1 And detects no replay (challenge unused and not expired; signCount increased or authenticator is counterless) And creates a lifeline session with the policy-configured lifeline TTL linked to the user And returns HTTP 200 with a session token
Offline Verification Using Securely Cached Credentials
Given IdP/SSO is unavailable and the credential cache is healthy When a user presents a valid WebAuthn assertion for a registered lifeline credential Then the server performs all verification using the securely cached credential material without any outbound IdP calls And denies the attempt with an actionable error if the credential is missing from cache And records an audit entry noting offline mode used and cache hit/miss And p95 verification latency remains under 500 ms during offline mode
Support for Roaming and Platform Authenticators
Given the user's account has enrolled platform and/or roaming authenticators permitted for lifeline When the WebAuthn challenge is presented Then allowCredentials includes all enrolled lifeline credential IDs regardless of transport And assertions from platform (built-in) or roaming (USB/NFC/BLE) authenticators are accepted if policy allows And the UI prompt adapts to indicate the expected authenticator type based on recent successful use And the authenticator transport and AAGUID are captured in logs
Enforce User Presence and Verification Flags
Given WebAuthn policy for lifeline requires user verification When an assertion response is received Then the assertion is accepted only if UP=1 and UV=1 in the authenticator flags And assertions with UP=1 and UV=0 are rejected with HTTP 401 and error code WEB_AUTHN_UV_REQUIRED And all rejected attempts are logged with reason and without issuing a session
Admin-Configurable Fallback and Error UX
Given an admin-defined fallback policy for lifeline is configured (Disabled | Secondary-Approval OTP | Break-Glass) When a WebAuthn assertion fails due to unsupported authenticator, UV not available, or timeout Then the system enforces the configured fallback path And the UI displays a clear, non-technical message with next steps aligned to policy And if a fallback path is used, it requires the specified approvals and no session is issued until approvals complete And all fallback invocations are fully audited with actor, approver, reason, and outcome
Security Audit Logging of Assertion Metadata
Given any WebAuthn attempt (success or failure) When processing the assertion Then the system writes an immutable audit record including timestamp, user ID, hashed credentialId, AAGUID, authenticator type (platform/roaming), transport, algorithm, rpId, origin, flags (UP/UV), signCount delta, offline/online mode, client IP, result, and error code if any And logs avoid storing biometric data or private key material And security reviewers with the proper role can query these logs within 1 minute of the event
Origin, RP ID, Challenge, and TLS Enforcement
Given Lifeline Login is accessed over the web When generating and verifying WebAuthn challenges Then challenges are cryptographically random (≥128 bits of entropy), single-use, and expire in 60 seconds And the request origin must match a configured allowlist and the rpId must match the configured base domain And requests over non-TLS or from disallowed origins/rpId are rejected with HTTP 400 and logged And successful verifications invalidate the challenge immediately to prevent replay
IP Risk & Network Posture Enforcement
"As a compliance officer, I want lifeline access limited to trusted networks so that we comply with security controls during outages."
Description

Evaluate the requester’s IP against tenant-defined allowlists, geolocation and ASN constraints, and threat intelligence (e.g., TOR/VPN/proxy indicators). Apply block, allow, or step-up actions before issuing lifeline tokens, and bind approved sessions to the originating IP/subnet where policy requires. Expose policy configuration per tenant, capture rationale in audit logs, and surface clear error states without leaking sensitive details.
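Evaluated in the order the description gives (allowlist, then geo/ASN, then threat intelligence), the policy decision reduces to a small pure function, and the subnet-binding key for approved sessions falls out of the same `ipaddress` module. The field names and policy shape below are assumptions:

```python
import ipaddress

def evaluate_network_posture(source_ip, policy, geo, threat_tags):
    """Return (decision, reason_code). Reason codes feed the audit log;
    client-facing messages stay generic (NET_POLICY_BLOCKED) no matter
    which rule fired. An empty allowlist means no restriction."""
    ip = ipaddress.ip_address(source_ip)
    cidrs = [ipaddress.ip_network(c) for c in policy.get("allow_cidrs", [])]
    if cidrs and not any(ip in net for net in cidrs):
        return "block", "allowlist_miss"
    if policy.get("allow_countries") and geo["country"] not in policy["allow_countries"]:
        return "block", "geo_asn_mismatch"
    if policy.get("allow_asns") and geo["asn"] not in policy["allow_asns"]:
        return "block", "geo_asn_mismatch"
    if threat_tags:  # e.g. Tor/VPN/proxy indicators
        return policy.get("threat_action", "block"), "threat_indicator"
    return "allow", "ok"

def bind_key(source_ip, bind_mode="ip", prefix=24):
    """Session binding key: exact IP, or the enclosing subnet per policy."""
    if bind_mode == "subnet":
        return str(ipaddress.ip_network(f"{source_ip}/{prefix}", strict=False))
    return source_ip
```

Later requests simply recompute `bind_key` for their source IP and compare against the stored value; a mismatch is the binding violation that invalidates the session.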

Acceptance Criteria
IP Allowlist Enforcement Prior to Token Issuance
Given a tenant policy defines an IP allowlist as one or more CIDR ranges When a Lifeline Login request originates from an IP within any allowed CIDR Then the request passes allowlist evaluation and proceeds to the next policy checks Given a tenant policy defines an IP allowlist and the request IP is outside all allowed CIDRs When the request is evaluated Then the system denies the request before token issuance with HTTP 403 and error code NET_POLICY_BLOCKED And the response message is generic and does not disclose CIDRs, rule IDs, or provider names And an audit record is created with decision=blocked and reason=allowlist_miss Given the IP allowlist is empty for a tenant When a Lifeline Login request is evaluated Then the allowlist check is treated as pass (no restriction) and evaluation proceeds
Geolocation and ASN Constraint Evaluation
Given a tenant policy specifies allowed ISO country codes and allowed ASNs When the requester’s IP resolves to a country and ASN that are both on the allowed lists Then the request passes geolocation/ASN evaluation and proceeds to the next checks Given a tenant policy specifies allowed ISO country codes and/or allowed ASNs When the requester’s resolved country or ASN is not on the allowed lists Then the system denies the request before token issuance with HTTP 403 and error code NET_POLICY_BLOCKED And the response does not reveal the evaluated country, ASN, or rule details And an audit record is created capturing country_code, asn, matched_rule_ids, decision=blocked, reason=geo_asn_mismatch
Threat Intelligence Indicators With Step-Up Action
Given a tenant policy sets action=step_up for threat indicators (e.g., Tor/VPN/proxy) When the requester’s IP is flagged by threat intelligence with one or more threat tags Then the system requires a hardware key challenge before issuing a lifeline token And upon successful challenge, a session is issued with a time-limited TTL as configured by policy And upon failed or skipped challenge, the request is denied with HTTP 401 and error code NET_POLICY_STEP_UP_FAILED And an audit record includes threat_tags, action=step_up, outcome, and reason Given a tenant policy sets action=block for threat indicators When the requester’s IP is flagged Then the request is denied with HTTP 403 and error code NET_POLICY_BLOCKED And the response message is generic and non-disclosing And an audit record includes threat_tags, action=block, decision=blocked
Session Binding to Originating IP/Subnet
Given a tenant policy requires session binding with bind_mode=ip When a lifeline session is issued Then the session is bound to the exact originating IP and stored with the session metadata Given a tenant policy requires session binding with bind_mode=subnet and prefix (e.g., /24) When a lifeline session is issued Then the session is bound to the originating subnet per the configured prefix Given a session bound by policy When a subsequent request presents a source IP outside the bound IP/subnet Then the session is invalidated and the request is denied with HTTP 401 and error code NET_POLICY_BINDING_VIOLATION And an audit record is created with decision=terminated and reason=binding_violation Given a session bound by policy When a subsequent request presents a source IP within the bound IP/subnet Then the request is accepted without additional authentication
Tenant Policy Configuration Exposure and Validation
Given a tenant admin with permission Security.Policy.Edit When they call GET /api/tenants/{tenantId}/lifeline/network-policy Then the API returns the current policy including allowlist CIDRs, allowed countries, allowed ASNs, threat_actions, bind_mode, and token_ttl Given a tenant admin submits a PUT to /api/tenants/{tenantId}/lifeline/network-policy with valid values When the request is processed Then the API responds 200 and persists the changes And the new policy version is audit-logged with before/after diff, actor, and timestamp And the effective policy is applied across evaluators within 60 seconds Given a tenant admin submits invalid values (e.g., malformed CIDR, non-numeric ASN, unknown country code) When the request is processed Then the API responds 422 with field-level validation errors And no partial changes are applied And an audit record captures the failed attempt with reason=validation_error
Comprehensive Audit Logging of Policy Decisions
Given any Lifeline Login request undergoes network posture evaluation When a policy decision is made (allow, block, or step_up) Then an immutable audit entry is created containing tenant_id, request_id, user_id (if available), source_ip, country_code, asn, threat_tags, matched_rule_ids, decision, reason_code, and timestamp And the audit entry excludes sensitive rule contents, provider API keys, or internal IP intelligence sources And the audit entry is queryable by tenant admins within 1 minute of the decision
Clear, Non-Disclosing Error States
Given a request is blocked by allowlist, geo/ASN, or threat policy When the API responds Then it returns a standardized error code NET_POLICY_BLOCKED and a generic message that does not reveal the specific rule, IP range, country, ASN, or threat provider And a correlation_id is included in the response for support reference Given a request requires step-up per policy When the API responds Then it returns a standardized error code NET_POLICY_STEP_UP_REQUIRED and a generic message prompting step-up without revealing evaluation details Given a bound session violates IP/subnet binding When the API responds Then it returns a standardized error code NET_POLICY_BINDING_VIOLATION and a generic message And all three error conditions are consistently represented in UI surfaces consuming the API
Multi-Channel Token Delivery & Rate Limiting
"As a field dispatcher, I want to get my access code over SMS or a phone call so that I can log in even if email is delayed."
Description

Deliver lifeline tokens via SMS, email, and voice IVR using OutageKit’s communications stack with provider redundancy. Honor user channel preferences, automatically fail over between channels, and localize content. Implement per-user and global rate limits, challenge/response to prevent enumeration, and masked notifications to avoid data leakage. Track delivery status and surface resend options with backoff.
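Channel preference plus provider redundancy is a two-level loop: walk the user's ranked channels, and within each channel walk providers, treating transient failures as "retry on a secondary provider" and definitive failures as "move to the next channel". A sketch with hypothetical provider callables (the routing classification and names are assumptions):

```python
def deliver_token(token_id, preferences, channels, single_channel_only=False):
    """Attempt delivery across ranked channels with per-channel provider
    redundancy. `channels` maps channel name -> ordered (provider_name, send)
    pairs; `send` returns True (delivered), False (definitive failure), or
    raises on transient errors. Stops on the first success receipt, so no
    duplicate tokens go out."""
    attempts = []
    for channel in preferences:
        for provider_name, send in channels.get(channel, []):
            try:
                delivered = send(token_id)
            except Exception:
                attempts.append((channel, provider_name, "transient_error"))
                continue  # provider redundancy: try the next provider
            attempts.append((channel, provider_name, "delivered" if delivered else "failed"))
            if delivered:
                return channel, provider_name, attempts
            break  # definitive failure on this channel: try next channel
        if single_channel_only:
            break  # user restricted delivery to a single channel
    return None, None, attempts
```

The `attempts` list is what would be logged with error class, provider code, and correlation id; localization and rate limiting would wrap this routine rather than live inside it.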

Acceptance Criteria
Honor User Channel Preferences
Given a verified user with saved channel preferences and opt-out flags And the organization has enabled Lifeline Login When the user requests a lifeline token Then the token is attempted first on the user’s highest-ranked preferred channel that is enabled and not opted-out And if that channel returns a definitive failure, the system attempts the next preferred channel in order And if the user has restricted delivery to a single channel, no other channels are attempted And the final selected channel is recorded with timestamp, locale, and provider id
Provider Redundancy and Cross-Channel Failover
Given SMS, email, and IVR providers are configured with a primary and at least one secondary per channel When the primary provider returns a transient or provider-specific failure as classified by the routing rules Then the system retries on a secondary provider for the same channel before attempting another channel And no duplicate tokens are delivered; if a success receipt is received, subsequent attempts are canceled And failures and retries are logged with error class, provider code, and correlation id
Localized Token Content Across Channels
Given the user has a locale and timezone set, and the organization has a default locale When a token is generated for delivery Then the content is localized to the user’s locale with fallback to organization default And the message includes token value, purpose, expiration timestamp in the user’s timezone, and support contact And IVR uses the correct TTS language/voice and reads digits with appropriate pacing And no untranslated strings or placeholder keys appear in any channel
Per-User and Global Rate Limits
Given per-user and global limits for token sends are configured When token requests exceed the per-user limit within the configured window Then further sends for that user are blocked until the window resets, with a Retry-After communicated to the client And the UI disables the resend control with a visible countdown matching Retry-After And when global limits are exceeded, token sends are queued or rejected according to policy with generic messaging And all limit decisions are captured in audit logs with user id (or anonymous hash), timestamp, and limit counters
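One simple way to satisfy the per-user limit plus Retry-After behavior is a fixed-window counter. This is a sketch under assumptions (class and method names are invented; a production system would likely back this with a shared store rather than in-process memory):

```python
import math

class FixedWindowLimiter:
    """Allow at most `limit` sends per user per window; report Retry-After seconds."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # user_id -> (window_start, count)

    def check(self, user_id: str, now: float):
        """Return (allowed, retry_after_seconds)."""
        start, count = self.counters.get(user_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired; reset the counter
        if count >= self.limit:
            # Blocked until the window resets; Retry-After drives the UI countdown.
            return False, math.ceil(start + self.window - now)
        self.counters[user_id] = (start, count + 1)
        return True, 0
```

The same structure with a tenant-wide or global key covers the global limit; the decision tuple is what would be written to the audit log alongside the limit counters.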
Challenge/Response Anti-Enumeration
Given a user initiates lifeline token delivery When the system prompts for a challenge response tied to the account (e.g., last 2 digits of phone or org code) Then tokens are only sent if the correct response is provided within the allowed attempts window And incorrect or unknown inputs receive the same generic response without confirming account existence And repeated failures increase delay between attempts per configured backoff and cause temporary lockout after the maximum attempts And challenge outcomes and throttling events are logged without exposing PII
Masked Notifications and UI
Given notification confirmations and delivery status are displayed to the requester When presenting destination addresses or numbers Then email addresses are masked (e.g., a***@d***.com) and phone numbers are masked (e.g., +1 5••• ••23) And IVR references the destination generically (e.g., “ending in 23”) without stating the full number And system responses are identical for unknown accounts or unsubscribed destinations And application logs and webhooks deliver masked values only unless explicitly configured for secure sinks
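The masking formats in the examples above (a***@d***.com, "ending in 23") can be produced by two small helpers. A minimal sketch; the function names are assumptions and real masking rules would need review against the organization's PII policy:

```python
import re

def mask_email(addr: str) -> str:
    """Mask an email to the first character of the local part and domain,
    e.g. alice@demo.com -> a***@d***.com."""
    local, _, domain = addr.partition("@")
    labels = domain.split(".")
    masked_labels = [labels[0][:1] + "***"] + labels[1:] if labels else []
    return local[:1] + "***@" + ".".join(masked_labels)

def mask_phone(number: str) -> str:
    """Mask a phone number, keeping only the last two digits for IVR-style
    readout, e.g. +1 555-010-1123 -> 'ending in 23'."""
    digits = re.sub(r"\D", "", number)
    return "ending in " + digits[-2:]
```

Applying these at the logging/webhook boundary (rather than in the UI only) is what keeps "masked values only" true for all sinks unless a secure sink is explicitly configured.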
Delivery Status Tracking and Resend With Exponential Backoff
Given a token delivery attempt has been initiated When status callbacks or polling responses are received from providers Then the system records per-attempt events with timestamp, channel, provider, status (queued, sent, delivered, failed), and reason code And the UI exposes a resend option only when allowed by rate limits and not while an attempt is in-flight And resend attempts follow a configured exponential backoff schedule and respect channel/provider failover rules And users receive a single valid token; prior tokens are invalidated on resend according to policy
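The "configured exponential backoff schedule" for resends can be expressed as a short generator of wait times. Default values here (30s base, doubling, cap) are illustrative assumptions, not values from this spec:

```python
def resend_backoff_schedule(base_seconds: float = 30.0, factor: float = 2.0,
                            max_attempts: int = 5, cap_seconds: float = 600.0):
    """Wait time (seconds) before each resend attempt: base * factor^i, capped."""
    return [min(base_seconds * factor ** i, cap_seconds) for i in range(max_attempts)]
```

The resend control in the UI would stay disabled while an attempt is in-flight and re-enable only once both the rate limiter and the next backoff slot allow it.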
Audit Logging, Monitoring & Alerting
"As a security lead, I want complete audit trails and alerts for lifeline usage so that I can investigate and respond to any suspicious access."
Description

Capture end-to-end lifeline activity including detection events, token issuance/validation, hardware key checks, IP decisions, and session lifecycle with tamper-evident logs and retention controls. Provide real-time alerts to designated channels (e.g., email/Slack/SIEM) on lifeline usage and anomalies, plus dashboards with trends and success/failure rates. Support exports and APIs for compliance reporting and incident investigations.

Acceptance Criteria
End-to-end Lifeline Login Audit Trail
- Log Coverage: 100% of lifeline events are captured: lifeline_detected, token_issued, token_validation_succeeded, token_validation_failed, hardware_key_check_passed, hardware_key_check_failed, ip_allowed, ip_blocked, session_started, session_refreshed, session_ended.
- Required Fields: Each entry includes timestamp (UTC ISO8601 ms), tenant_id, user_identifier, actor_type, event_type, correlation_id, request_id, source_ip, geo, device_fingerprint, result, reason_code, service_version, sequence_number, prev_hash, entry_hash, signature.
- Redaction: Token values are masked except last 4 chars; hardware key private material is never logged; IPs retained as collected; secrets never logged.
- Ordering & Correlation: Events for the same correlation_id have strictly increasing sequence_number with no gaps; session lifecycle events share the same correlation_id.
- Persistence SLO: At a tested load of ≥10 tenants each producing ≥1,000 lifeline events/min, ≥99.9% of entries are durably persisted within 2 seconds of event time; 0 data loss on process/node restarts.
- Time Accuracy: System clocks are NTP-synchronized; inter-event timestamp skew within a correlation_id is ≤ 100 ms, preserving relative order.
Tamper-Evident Logs and Retention Enforcement
- Hash Chain: Each log record’s entry_hash = SHA-256(record_body) and includes prev_hash of prior record in the per-tenant stream; records are digitally signed with a rotating signing key.
- Verifiability: A verification job over any selected time range returns PASS with 0 broken chains and 0 invalid signatures.
- Daily Anchoring: A daily root hash is published/anchored; anchor verification for any day returns PASS.
- Retention Policy: Per-tenant retention is configurable (90/180/365/1825 days), default 365; first 90 days are WORM-locked; deletions after expiry are irreversible and logged with who/when/what.
- Change Control: Retention and signing-key changes require dual-approval; all changes are auditable with user, timestamp, and justification.
- Integrity Alerts: Any verification failure (broken prev_hash, invalid signature, missing anchor) triggers a logging_integrity_alert to all configured channels within 60 seconds.
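The hash-chain requirement can be sketched as follows. This is one possible construction (folding prev_hash into the digest and using an all-zero genesis sentinel are assumptions; the spec only requires that each entry carry prev_hash and a SHA-256 entry_hash, plus a signature, which is omitted here):

```python
import hashlib
import json

def entry_hash(record_body: dict, prev_hash: str) -> str:
    """SHA-256 over the canonicalized record body chained with the prior hash."""
    payload = json.dumps(record_body, sort_keys=True, separators=(",", ":")) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(records: list) -> bool:
    """Each record carries body, prev_hash, entry_hash; verify linkage and digests."""
    prev = "0" * 64  # genesis sentinel for the per-tenant stream (assumption)
    for rec in records:
        if rec["prev_hash"] != prev:
            return False  # broken chain
        if rec["entry_hash"] != entry_hash(rec["body"], rec["prev_hash"]):
            return False  # tampered body
        prev = rec["entry_hash"]
    return True
```

A daily anchor is then just the last entry_hash of the day published to an external, append-only location; verifying any range reduces to replaying this loop and comparing against the anchor.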
Real-Time Alerts to Email/Slack/SIEM
- Trigger Set: Alerts fire for token_issued, token_validation_failed, ip_blocked, hardware_key_check_failed, session_started, session_ended.
- Delivery Targets: Slack (webhook), Email, and SIEM (HTTPS webhook) are supported; tenants can enable/disable targets per trigger.
- Latency: From event time to receipt at target, p95 ≤ 30s and p99 ≤ 60s (measured per target) during tested load (≥10 tenants, ≥1,000 events/min each for 10 min).
- Payload: Alerts include tenant_id, event_type, correlation_id, user_identifier (or service account), source_ip, geo, result/reason_code, timestamp, and a deep link to the event view; secrets and token values are masked.
- De-duplication: Repeated identical alerts (same tenant_id, correlation_id, event_type) within 60s are emitted once with an incremented count.
- Delivery Reliability: On failure, each target is retried with exponential backoff for ≥5 attempts over ≥15 minutes; undelivered alerts are queued for ≥72 hours and visible in an "Alert Delivery" status panel.
Anomaly Detection for Lifeline Activity
- Threshold Rules (defaults, per-tenant configurable):
  • ≥5 token_validation_failed from the same source_ip within 5 minutes -> severity=high.
  • ≥10 token_issued for a tenant within 1 minute -> severity=medium.
  • token_validation_succeeded from a country not in tenant allowlist -> severity=high.
  • ≥3 hardware_key_check_failed for the same user_identifier within 10 minutes -> severity=medium.
  • ≥3 concurrent lifeline sessions for the same user_identifier -> severity=medium.
- Detection Latency: p99 detection-to-alert time ≤ 60s.
- Suppression: After an anomaly alert, identical condition is suppressed for 10 minutes per key (IP/user/tenant) while counts continue to be tracked.
- Auditability: Changes to anomaly thresholds or allowlists are logged with old_value, new_value, editor, and timestamp.
- Notification: Anomaly alerts are sent to all enabled channels with severity, rule_id, counts, and sample events.
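The count-within-window rules above are all instances of one sliding-window pattern. A minimal sketch (class name and API are assumptions; real detection would run against the event stream, not in-process state):

```python
import time
from collections import defaultdict, deque

class ThresholdRule:
    """Fire when `limit` events for the same key arrive within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # key (IP/user/tenant) -> event timestamps

    def record(self, key: str, now: float = None) -> bool:
        """Record one event; return True when the rule's threshold is reached."""
        now = time.time() if now is None else now
        q = self.events[key]
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()  # age out events that fell outside the window
        return len(q) >= self.limit
```

For example, the first rule would be `ThresholdRule(limit=5, window_seconds=300)` keyed by source_ip; the 10-minute suppression after an alert fires would wrap this, muting repeat alerts per key while the deque keeps counting.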
Lifeline Monitoring Dashboard and Trends
- Metrics: Shows time series for lifeline event volume, success/failure rates, anomalies by severity, active sessions, median and p95 time from token_issued to session_started.
- Filters: Time range (Last 1h/24h/7d/custom), tenant, user_identifier, event_type, source_ip.
- Drill-down: Clicking any chart point opens a correlated event list filtered by correlation_id/time slice; selecting an event opens full details and raw log.
- Freshness: Data freshness (ingestion-to-visualization lag) ≤ 60s p95.
- Availability: Dashboard endpoints meet ≥99.9% monthly availability.
- Access Control: Visible only to roles Ops Manager and Security Analyst; unauthorized users receive 403 and no data leakage.
- Export: Current view exportable to CSV and JSON within 60s for up to 1M rows, with progress indicator.
Compliance Exports and Investigation APIs
- Query API: GET /audit/lifeline/logs supports filters (tenant_id, time_from, time_to, user_identifier, event_type, correlation_id, source_ip), pagination via opaque cursor, and returns results within 2s p95 for pages ≤10k records.
- Bulk Export: POST to create an export job for a time range; up to 1,000,000 records are delivered within 2 minutes as chunked CSV and JSON; larger ranges are segmented automatically.
- Integrity Manifests: Each export includes a manifest with SHA-256 for each file and a detached signature; a verification tool validates all hashes and the signature.
- Redaction & Schema: Sensitive fields (token values, secrets) are masked; a machine-readable data dictionary (JSON Schema) is included with every export.
- Access Logging: All export/API access is itself logged with actor, purpose (if provided), IP, and timestamp.
- Errors & Limits: Invalid filter combinations return 400 with field-specific errors; unauthorized access returns 401/403; tenant-level rate limit is enforced at ≥10 req/s with 429 responses carrying retry-after.

Hardware Key Bind

Enforces FIDO2/WebAuthn hardware keys for issuing and using Lifeline access. Tokens are cryptographically bound to a registered physical key and device, stopping phishing and shared-credential risks so only authorized staff can enter during outages.

Requirements

WebAuthn Hardware Key Enrollment
"As an operations manager, I want to register my hardware security key to my account so that I can securely access Lifeline functions during outages without relying on passwords."
Description

Implement a FIDO2/WebAuthn enrollment flow that allows authorized staff to register one or more hardware security keys (USB/NFC/BLE) to their OutageKit account. Enforce attestation verification using the FIDO Metadata Service and an allowlist of approved AAGUIDs to ensure only compliant roaming authenticators are accepted. Store credential ID, public key, AAGUID, and signature counter securely (encrypted at rest), and require user verification during registration. Provide a guided UI to add, nickname, set primary/backup keys, and remove keys, with clear error states for unsupported authenticators or failed attestations. Expose backend APIs for registration options and finalization, integrate with existing SSO/IdP where applicable, and ensure cross-browser support for modern WebAuthn-capable clients.

Acceptance Criteria
Successful Enrollment with Approved Hardware Key
Given an authorized user initiates WebAuthn registration from OutageKit on a supported browser When the backend returns PublicKeyCredentialCreationOptions with:
- rp.id equal to the effective OutageKit domain
- userVerification set to "required"
- authenticatorSelection.authenticatorAttachment set to "cross-platform"
- attestation set to "direct"
- challenge >= 32 random bytes and unique per request
- excludeCredentials containing the user’s existing credential IDs (if any)
And the user completes registration with a roaming authenticator whose AAGUID is on the approved allowlist And the attestation chain validates against the FIDO Metadata Service with status not revoked/compromised Then the server verifies origin matches the allowed origins, rpIdHash, challenge, attested credential data, and attestation statement And the system stores credentialId, publicKey (COSE), AAGUID, and signCount associated to the user, encrypted at rest And the UI confirms success within 2 seconds and prompts the user to set or edit a nickname (prefilled with detected model)
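The creation options required by this scenario can be assembled server-side as a plain JSON-serializable structure. A sketch under assumptions (the function name, rp name, and timeout are illustrative; algorithm identifiers -7 and -257 are the COSE values for ES256 and RS256):

```python
import base64
import os

def registration_options(rp_id, user_id, user_name, existing_credential_ids):
    """Build PublicKeyCredentialCreationOptions matching the policy above."""
    # >= 32 random bytes, unique per request, base64url-encoded without padding
    challenge = base64.urlsafe_b64encode(os.urandom(32)).rstrip(b"=").decode()
    return {
        "rp": {"id": rp_id, "name": "OutageKit"},
        "user": {"id": user_id, "name": user_name, "displayName": user_name},
        "challenge": challenge,
        "pubKeyCredParams": [{"type": "public-key", "alg": -7},    # ES256
                             {"type": "public-key", "alg": -257}], # RS256
        "authenticatorSelection": {"authenticatorAttachment": "cross-platform",
                                   "userVerification": "required"},
        "attestation": "direct",
        "timeout": 60000,  # <= 60s, per the API contract later in this section
        "excludeCredentials": [{"type": "public-key", "id": cid}
                               for cid in existing_credential_ids],
    }
```

The server would persist the challenge with a short TTL bound to the user/session and reject any finalize call whose clientDataJSON does not echo it back.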
Block Enrollment for Disallowed or Untrusted Authenticators
Given an authorized user attempts registration with an authenticator whose AAGUID is not on the allowlist or whose MDS status is revoked/compromised When attestation verification runs Then the server rejects the registration with HTTP 400 and error code "unsupported_authenticator" And no credential record is created or partially stored And the UI displays a clear error message instructing the user to use an approved hardware key without exposing sensitive certificate details And the event is audit-logged with user ID, AAGUID, reason, timestamp, and request ID And the response is returned within 2 seconds
Enforce User Verification During Registration
Given an authorized user starts registration When PublicKeyCredentialCreationOptions are generated Then userVerification is set to "required" And authenticatorSelection requires cross-platform roaming authenticators When the authenticator response is received Then the server validates that user verification (UV) is true in the authenticator data And if UV is false or absent, the server rejects with error code "user_verification_required" and no credential is stored
Secure Storage and Audit of Credential Material
Given a registration completes successfully Then the system stores credentialId (base64url), publicKey (COSE), AAGUID (UUID), and signCount encrypted at rest using a KMS-managed key And no private keys are stored anywhere And access to decrypt is restricted to the enrollment service role via IAM and is audited And direct database inspection shows ciphertext for encrypted columns And all create/update/delete actions on credentials are audit-logged with actor, action, credentialId hash, and timestamp And key material can be rotated without data loss (successful decrypt after rotation verified in test)
Key Management UI: Nickname, Primary/Backup, and Removal Rules
Given a user has 0 or more registered hardware keys When adding a key nickname Then the nickname is required (1–40 chars), trims whitespace, prevents duplicates per user, and supports ASCII and common Unicode letters/numbers/spaces When setting a primary key Then exactly one key is primary if the user has >=1 keys; setting a new primary demotes the previous primary When removing a key Then removing the primary requires selecting a new primary if backups exist And removing the last key shows a blocking warning explaining loss of Lifeline access until a new key is added and requires explicit confirmation And all changes reflect immediately in the UI and via API within 1 second and are audit-logged
Registration API Contracts and IdP Session Binding
Given the user has an active authenticated session via the existing SSO/IdP When calling GET /webauthn/registration/options Then the API returns 200 with creation options including rp, user (stable ID), challenge (>=32 bytes), pubKeyCredParams (ES256 and RS256 at minimum), authenticatorSelection (cross-platform), attestation (direct), timeout (<=60000ms), and excludeCredentials And the challenge is single-use with TTL of 5 minutes and is bound to the user/session When calling POST /webauthn/registration/finalize with clientDataJSON and attestationObject Then the API validates origin, rpId, challenge, attestation, AAGUID allowlist, MDS trust chain, and UV And it is CSRF-protected, idempotent (same payload returns 200 without duplicates), and rate-limited (e.g., <=5 attempts/min/user) And on success returns 201 with credential metadata (id, nickname, AAGUID, primary flag) and no sensitive attestation certificates
Cross-Browser and Transport Support with Clear Errors
Given supported environments (latest two versions): Chrome, Edge, Firefox (Windows/macOS/Linux), and Safari (macOS/iOS/iPadOS) When enrolling with approved roaming authenticators over USB, NFC, and BLE from at least two vendors on the allowlist Then enrollment completes successfully with UV required in each environment And unsupported environments or blocked contexts (e.g., insecure HTTP) detect lack of WebAuthn support and display a clear, actionable error banner with documentation links And transport prompts/instructions are shown contextually (e.g., tap NFC, insert USB) and time out gracefully with retry And the overall enrollment completes within 30 seconds in 95% of attempts across tested environments
Lifeline Step-up Authentication Enforcement
"As a field supervisor, I want a hardware key prompt before performing Lifeline actions so that only verified staff can make high-impact changes during an outage."
Description

Require a successful WebAuthn assertion with user verification for any action that issues or uses Lifeline access (e.g., unlocking consoles, escalating privileges, or approving outage overrides). Gate relevant UI controls and backend endpoints behind a step-up auth check, with configurable re-authentication TTL (e.g., 15–60 minutes) and forced re-prompt on risk signals (new IP, device, abnormal time). Deny access by default if assertion fails, is absent, or uses a non-approved authenticator. Provide clear UX prompts and fallback messaging while ensuring consistent enforcement across web and native clients.

Acceptance Criteria
Step-up Prompt on Lifeline Action (Web Console)
Given a logged-in user without an active Lifeline step-up window When the user clicks Approve Outage Override in the web console Then a WebAuthn prompt requiring user verification from an approved hardware key is displayed And on successful assertion the action completes and a new step-up window is started for the configured TTL And on assertion failure or cancel the action is blocked, no state change occurs, and a message "Verification required to continue" is shown
Backend Enforcement for Lifeline Endpoints
Given a POST to /lifeline/overrides with a valid user session but no current step-up assertion When the request is processed by the API Then the API responds 401 with error_code=STEP_UP_REQUIRED and no side effects Given a POST with an expired or tampered step-up token When validated Then the API responds 403 with error_code=STEP_UP_INVALID and no side effects Given a POST with a step-up assertion from a non-approved authenticator When validated Then the API responds 403 with error_code=STEP_UP_UNAPPROVED_AUTH and no side effects And all denials are audit-logged with user_id, endpoint, reason, and timestamp
TTL Configuration and Enforcement
Given an admin sets the Lifeline step-up TTL to a value between 15 and 60 minutes When saved Then the system accepts the value and applies it within 1 minute to new sessions Given a value outside 15–60 minutes When saved Then the system rejects it with validation error "TTL must be 15–60 minutes" Given a successful step-up at T0 with TTL=30 minutes When the user performs a Lifeline action at T0+29m Then no re-prompt occurs When the user performs a Lifeline action at T0+30m+15s Then a re-prompt is required before proceeding
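The TTL validation and window check in this scenario reduce to a few lines. A hedged sketch (function and state names are assumptions):

```python
def step_up_state(last_assertion_at: float, now: float, ttl_minutes: int) -> str:
    """Return 'valid' while inside the step-up TTL window, else 'reprompt'.

    Times are epoch seconds; TTL is validated to the configurable 15-60 min range.
    """
    if not 15 <= ttl_minutes <= 60:
        raise ValueError("TTL must be 15-60 minutes")
    return "valid" if now - last_assertion_at < ttl_minutes * 60 else "reprompt"
```

With TTL=30, an action at T0+29m passes without a re-prompt, while one at T0+30m15s requires a fresh WebAuthn assertion, matching the boundary cases above. Risk signals (new IP, new device, abnormal time) would force "reprompt" regardless of the window.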
Risk Signal Forced Re-authentication
Given a user has an active step-up window When a Lifeline action is attempted from a new IP address not seen in the last 30 days Then the system forces a WebAuthn re-prompt before proceeding Given the same user attempts within TTL from a new device fingerprint or outside configured business hours When the action is initiated Then the system forces a WebAuthn re-prompt and blocks the action until successful assertion And the risk reason (new_ip|new_device|abnormal_time) is recorded in audit logs
Authenticator Policy Enforcement (Hardware Keys Only)
Given organization policy requires FIDO2 roaming hardware keys with user verification (UV=true) When a step-up assertion is made with a platform authenticator or passkey not on the approved list Then the system rejects it with error "Hardware security key required" When a step-up assertion is made with a registered hardware key whose AAGUID is on the approved list and UV=true Then the system accepts the assertion And assertions with UV=false are rejected with error "User verification required"
Cross-Client Consistency (Web, iOS, Android)
Given identical user, policy, and TTL When the user initiates a Lifeline action on web, iOS, and Android clients Then each client prompts for step-up under the same conditions, uses the same backend validations, and produces the same allow/deny outcomes And error codes and messages returned by the backend are consistent across clients And no client allows the action to proceed without a valid step-up assertion
UX Prompting and Fallback Messaging
Given the user triggers a step-up prompt When the prompt is displayed Then the UI copy includes a clear title "Verify with your security key" and guidance to insert/tap key, with a cancel option Given no registered hardware key is found for the user When step-up is required Then the UI shows fallback messaging "No key registered" with a link to Manage Security Keys and contact support, and the action remains blocked And all prompt dialogs meet WCAG 2.1 AA for contrast, have focus management, and are announced correctly by screen readers
Hardware-Bound Token Binding
"As a security engineer, I want tokens tied to a specific registered hardware key so that stolen sessions cannot be reused to access Lifeline capabilities."
Description

Bind session and authorization tokens for Lifeline operations to the user’s registered WebAuthn credential by embedding the credential ID and last verified signature counter into token claims. Issue or refresh tokens only after a fresh WebAuthn assertion and validate claims server-side before executing privileged operations. Invalidate tokens on credential revocation or signature counter regression to mitigate cloning. Ensure tokens are short-lived and scoped to Lifeline operations, preventing replay or use on sessions without the corresponding hardware key assertion.

Acceptance Criteria
Fresh WebAuthn Assertion for Lifeline Token Issuance
Given a user with an active registered WebAuthn credential and an authenticated session When the user requests a Lifeline authorization token Then the system must prompt for a WebAuthn assertion using the registered credential And the token must not be issued unless the assertion is successfully verified against the stored public key And the assertion verification timestamp is recorded and associated with the session
Token Claims: Credential Binding, Scope, and TTL
Given a successful WebAuthn assertion for a Lifeline operation When the authorization server issues a token Then the token must include claims: credentialId (equal to the asserted credential ID), signCount (from the verified assertion), scope limited to Lifeline operations only, exp no more than 15 minutes from issuance, and a unique jti And the token must be cryptographically signed by the authorization server
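The claim set above can be sketched as a small constructor. This shows only the claims payload; signing (and the aud claim checked later) would be handled by the authorization server, and the function name is an assumption:

```python
import time
import uuid

def lifeline_token_claims(credential_id: str, sign_count: int, ttl_seconds: int = 900):
    """Claims for a hardware-bound Lifeline token (exp <= 15 minutes, unique jti)."""
    now = int(time.time())
    return {
        "credentialId": credential_id,  # binds token to the asserted WebAuthn credential
        "signCount": sign_count,        # signature counter from the verified assertion
        "scope": "lifeline",            # scoped to Lifeline operations only
        "iat": now,
        "exp": now + min(ttl_seconds, 900),  # cap at 15 minutes
        "jti": str(uuid.uuid4()),            # unique per token, for replay tracking
    }
```

Server-side validation then checks signature, exp, aud, scope, that credentialId is still an active registered credential for the user, and that the jti has not been seen before.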
Server-Side Verification on Privileged Lifeline Operation
Given an API request to perform a privileged Lifeline operation with a presented token When the server validates the request Then it must verify token signature, expiration, audience, and required Lifeline scope And verify credentialId in the token corresponds to an active registered credential for the requesting user And verify the token has not been revoked and its jti has not been seen before And only then execute the operation; otherwise return 401/403
Session Binding Enforcement (No Cross-Session Use)
Given a Lifeline token issued after a successful WebAuthn assertion and bound to the current session When the token is presented from a different browser session, device, or a session without the recorded assertion Then the server must reject the request with 401/403 And the token must only be accepted when presented from the bound session
Token Refresh Requires Fresh Assertion and Rotation
Given a valid Lifeline token nearing expiration When the client requests a token refresh Then the server must require a fresh WebAuthn assertion using the same registered credential And upon successful verification, issue a new token with a new jti, updated exp (<= 15 minutes), and updated signCount And revoke the prior token immediately And deny refresh if the assertion fails or is not provided
Immediate Invalidation on Credential Revocation
Given an administrator revokes or disables a user’s registered WebAuthn credential When the revocation is committed Then all active Lifeline tokens bound to that credential must become invalid within 60 seconds And subsequent attempts to use those tokens must return 401/403 And issuing or refreshing tokens with the revoked credential must be blocked
Signature Counter Regression Blocks Issuance and Revokes Tokens
Given a WebAuthn assertion returns a signature counter lower than the server’s last stored counter for that credential When a Lifeline token issuance or refresh is attempted Then the issuance must be denied and a 403 returned And all active tokens bound to that credential must be revoked immediately to mitigate potential key cloning
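The counter-regression decision in this scenario is a simple comparison against the last stored counter. A sketch (function name and return shape are assumptions; the spec only fixes the deny/revoke behavior and the 403):

```python
def check_sign_count(stored: int, asserted: int) -> dict:
    """A counter lower than the stored value suggests a cloned authenticator:
    deny issuance/refresh (403) and revoke all tokens bound to the credential."""
    if asserted < stored:
        return {"issue": False, "revoke_bound_tokens": True, "status": 403}
    # Accept and advance the stored counter for future comparisons.
    return {"issue": True, "revoke_bound_tokens": False, "new_stored": asserted}
```

Note that some authenticators always report a counter of 0; a deployment would need a policy for that case (this sketch treats any non-decreasing value as acceptable).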
Authenticator Policy Controls
"As a system administrator, I want to define which authenticators are allowed and required for Lifeline so that our organization consistently meets phishing-resistant access standards."
Description

Provide admin-configurable security policies to enforce hardware-key-only access for Lifeline, including allowed AAGUIDs, attestation requirements (trusted roots only), mandatory user verification, and minimum authenticator capabilities (CTAP2, resident key support if needed). Allow setting the number of required keys per user (e.g., primary + backup), re-enrollment intervals, and restrictions by role, environment, or geography. Integrate with RBAC so Lifeline roles cannot be assigned or used without compliant credential enrollment. Surface policy status and violations in the admin console with remediation guidance.

Acceptance Criteria
Allowed AAGUID and Trusted Attestation Enforcement
Given an admin configures an Allowed AAGUIDs list and enables “Trusted attestation roots only” When a user attempts registration with an authenticator whose AAGUID is not on the allowed list Then the registration is rejected with error code POLICY_AAGUID_BLOCKED and an audit event records userId, AAGUID, rpId, timestamp, IP, device, and policyVersion When a user attempts registration with missing attestation or an attestation chain not anchored to a trusted root Then the registration is rejected with error code POLICY_ATTESTATION_UNTRUSTED and an audit event is recorded with attestation metadata When a user registers with an allowed AAGUID and a valid attestation chain Then registration succeeds and the credential is stored with attestation metadata, AAGUID, and policyVersion And policy updates to AAGUID/attestation settings take effect within 60 seconds of admin save
Mandatory User Verification and Minimum Authenticator Capabilities
Given policy requires User Verification (UV=true), CTAP2 >= 2.1, and resident keys when configured When a user attempts to authenticate and the assertion indicates UV flag = false Then the authentication is rejected with error code POLICY_UV_REQUIRED and an audit event is recorded When a user attempts registration/authentication using CTAP1/U2F or CTAP2 < 2.1 Then the operation is rejected with error code POLICY_CTAP_VERSION and an audit event is recorded When resident/discoverable credentials are required and the authenticator cannot create them Then registration is rejected with error code POLICY_RESIDENT_KEY_REQUIRED and an audit event is recorded When the authenticator proves UV, is CTAP2 >= 2.1, and creates a resident credential when required Then the operation succeeds and server persists capability flags (uv, up, be, rk, aaguid) with the credential
Required Keys per User Enforcement
Given policy requires a minimum of 2 compliant hardware keys per user for Lifeline roles When a user with fewer than 2 compliant keys is assigned a Lifeline role Then the assignment is blocked with status 409 POLICY_KEYS_MIN_NOT_MET and the admin UI shows the missing count When a user with fewer than 2 compliant keys attempts to access Lifeline features Then access is denied with status 403 POLICY_KEYS_MIN_NOT_MET and an audit event is recorded When the user enrolls additional compliant keys to meet the minimum Then role assignment and access succeed and duplicate credential IDs or duplicated physical keys are prevented And the compliance state in the admin console updates within 60 seconds
Re-enrollment Interval and Credential Expiry
Given policy sets a re-enrollment interval (e.g., 12 months), a reminder window (e.g., 30 days), and a grace period (e.g., 7 days) When a credential enters the reminder window Then the system sends notifications to the user and admins at least weekly until renewal or expiry and logs these events When the credential exceeds interval + grace without renewal Then the credential is marked expired and Lifeline access using that credential is denied with 401 CREDENTIAL_EXPIRED; an audit event is recorded When the user re-enrolls a compliant key Then the new credential is activated, the expired one is retired, and the user regains access immediately And compliance status reflects the change within 60 seconds
RBAC Gating of Lifeline Role Assignment and Use
Given Lifeline roles (e.g., Operator, Supervisor) are bound to a specific authenticator policy When an admin attempts to assign a Lifeline role to a noncompliant user Then the assignment is blocked with 409 RBAC_POLICY_NONCOMPLIANT and the UI presents a remediation link to enroll required keys When a noncompliant user invokes Lifeline APIs or UI routes Then authorization is denied with 403 RBAC_POLICY_NONCOMPLIANT and the response includes the policyId and missing requirements When the user becomes compliant with the bound policy Then role assignment and all protected actions succeed without further admin intervention And all events are audit-logged with actor, target, role, policyId, and timestamp
Environment and Geography Restriction Enforcement
Given policy restricts Lifeline access to Production and allowed geographies (e.g., US-CA, US-OR) using IP-to-geo and admin-defined network ranges When a user authenticates from an IP mapped outside the allowed geographies or in a disallowed environment (e.g., Staging) Then the request is denied with 403 POLICY_GEO_RESTRICTED or POLICY_ENVIRONMENT_RESTRICTED and an audit event captures source IP, resolved region, environment, and policyId When a user authenticates from within an allowed geography and the Production environment Then the request proceeds and the resolved region/environment are stored with the session And an emergency override (time-boxed, dual-approval) can be enabled, after which all overridden attempts are logged and reported in the console
Admin Console Policy Status and Violation Surfacing
Given an admin opens the Policies view in the console When viewing a policy detail page Then the console shows current effective status, scope (roles/environments/geographies), compliant vs noncompliant user counts, last update time, and policy version And a violations table lists each violation (type, user, role, env, region, last seen) with filters and sorting and provides remediation guidance with deep links And export to CSV and JSON is available and reflects data consistent with on-screen counts within 1% and generated within 2 minutes When a policy is changed and saved Then a new version is created, changes are audit-logged (actor, diff), and UI reflects the update within 60 seconds
Credential Recovery & Break-Glass Workflow
"As a regional lead, I want a controlled recovery and break-glass process so that work can continue during emergencies without weakening our security posture."
Description

Implement secure recovery for lost or damaged hardware keys, including support for pre-registered backup keys, revocation of compromised credentials, and guided re-enrollment. Provide a time-bound, least-privilege break-glass path requiring multi-party approval and out-of-band verification (e.g., manager + security approver) to temporarily grant Lifeline access while a new key is issued. Automatically log and notify on all recovery and break-glass events, enforce rapid expiration, and require WebAuthn re-binding before normal access resumes.

Acceptance Criteria
Backup Hardware Key Authentication
Given a user account with at least one pre-registered backup FIDO2/WebAuthn key and the primary key is unavailable When the user authenticates using a registered backup key via WebAuthn Then Lifeline access is granted with the same role-based permissions as the primary key And the authentication event is recorded in the audit log with key ID, user ID, timestamp, IP, and device fingerprint And the user is prompted to mark the primary key as lost/compromised or keep it active And no break-glass token is issued for this session
Immediate Revocation of Lost/Compromised Keys
Given a user or security admin initiates revocation of a specific registered key When the revocation is confirmed with second-factor approval Then the key becomes unusable for authentication within 60 seconds across all services And subsequent WebAuthn assertions with the revoked key are rejected with error code RK-403 and message "Security key revoked" And an immutable audit entry is created with actor, target key ID, reason, and timestamp And notifications are sent to the user and security distribution list within 60 seconds
Guided Re-Enrollment Flow After Key Loss
Given a user has zero active keys on file When the user starts the recovery flow from the login screen or profile Then the system collects a loss reason and initiates identity verification per policy And upon successful approval or backup-key authentication, the user is guided to register a new hardware key via WebAuthn And the new key is activated and bound per Hardware Key Bind policy; revoked keys remain disabled And the user cannot access non-read-only Lifeline functions until at least one new key is registered And all steps in the flow are logged and notifications are sent to the user and security distribution list
Break-Glass Token: Least-Privilege, Time-Bound Access
Given a break-glass request is approved per policy When a break-glass token is issued Then the token grants only minimal Lifeline permissions to issue outage updates and view incident dashboards; all admin/config endpoints are blocked And the token duration is a maximum of 120 minutes (default 60), is non-renewable, and auto-expires server-side And every API request using the token is tagged and logged; the UI displays a persistent "Break-Glass Mode" banner And attempts outside the allowed scope return BG-403 authorization errors
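The scope and expiry rules above can be sketched as follows. This is an illustrative sketch, not an OutageKit API: `issue_token`, `authorize`, and the scope strings are assumed names; only the 120-minute cap, 60-minute default, non-renewability, and BG-403 behavior come from the criteria.

```python
import time

# Hypothetical break-glass token record with least-privilege scopes.
BG_SCOPES = {"outage_updates:write", "incident_dashboards:read"}
MAX_TTL_MIN = 120      # hard cap from the acceptance criteria
DEFAULT_TTL_MIN = 60   # default duration

def issue_token(user_id, ttl_min=DEFAULT_TTL_MIN, now=None):
    now = time.time() if now is None else now
    ttl = min(ttl_min, MAX_TTL_MIN)          # requests above 120 min are capped
    return {"user": user_id, "scopes": BG_SCOPES,
            "expires_at": now + ttl * 60, "renewable": False}

def authorize(token, scope, now=None):
    now = time.time() if now is None else now
    if now >= token["expires_at"]:           # server-side auto-expiry
        return (403, "BG-403: token expired")
    if scope not in token["scopes"]:         # admin/config endpoints blocked
        return (403, "BG-403: out of scope")
    return (200, "ok")

tok = issue_token("op-7", ttl_min=180, now=0)    # asks for 180, capped to 120
print(authorize(tok, "incident_dashboards:read", now=60))       # (200, 'ok')
print(authorize(tok, "admin/config:write", now=60))             # BG-403 scope
print(authorize(tok, "incident_dashboards:read", now=121 * 60)) # BG-403 expired
```

Keeping expiry checks server-side (rather than trusting a client-held TTL) is what makes the "non-renewable, auto-expires" guarantee enforceable.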
Multi-Party Approval and Out-of-Band Verification
Given a user without an active key requests break-glass access When approvals are sought Then approval from two distinct approvers is required: the user's manager and a security approver, each in a hardware-key-authenticated session And the system performs out-of-band verification by sending a one-time code to the requestor's verified phone on file and requires code entry within 10 minutes And no individual can fulfill both approver roles; delegated approvals (if used) are recorded with delegator link And if approvals and OOB verification are not completed within 30 minutes, the request auto-expires and no token is issued
Comprehensive Logging and Notifications
Given any recovery or break-glass event occurs When the event is processed Then an immutable audit log record is created with event type, actor(s), approver(s), affected user, reason, timestamps (request, approval, issuance, expiration), IP addresses, device fingerprints, and outcome And audit records are retained for at least 1 year and are exportable in JSON and CSV formats And notifications are sent to the requestor, their manager, and the security channel (email and SIEM webhook) within 60 seconds And sensitive fields (e.g., phone numbers) are redacted in notifications while stored in full in the audit log
Mandatory WebAuthn Re-Binding Before Normal Access
Given a user has used break-glass access or has all keys revoked When the user next signs in Then the system forces WebAuthn registration of a new hardware key before restoring normal (non-break-glass) access And until at least one active key is registered, non-read-only Lifeline operations remain blocked with RB-401 errors And upon successful registration, any outstanding break-glass tokens are immediately invalidated and normal access resumes And if re-binding is not completed within 48 hours of break-glass issuance, the account is suspended pending admin review
Access Audit & Anomaly Alerts
"As a compliance officer, I want comprehensive logging and alerts for Lifeline access so that I can detect misuse and demonstrate control effectiveness."
Description

Capture detailed, immutable audit logs for WebAuthn registrations, assertions, failures, policy violations, and break-glass activity, including user, time, IP, RP ID, AAGUID, and outcome. Provide searchable logs, export to SIEM, and configurable alerts for anomalous patterns (e.g., repeated failures, new geographies, frequent step-up prompts). Surface per-user and organization-level reports to support post-incident reviews and compliance requirements.

Acceptance Criteria
Immutable Audit Logging for WebAuthn Events
- Given a successful WebAuthn registration, When the registration completes, Then the system writes an immutable audit record containing event_type=registration, user_id, username, timestamp (UTC ISO 8601), source_ip, rp_id, aaguid, outcome=success, and event_id (UUID).
- Given a successful WebAuthn assertion, When the assertion completes, Then the system writes an immutable audit record with the same fields and event_type=assertion and outcome=success.
- Given any audit record exists, When a user with admin privileges attempts to edit or delete it via UI or API, Then the mutation is rejected and a separate audit record is created with event_type=audit_mutation_attempt and outcome=blocked.
- Given the last 10,000 audit records, When the integrity verification job runs, Then 100% of records validate against the tamper-evidence chain and the job completes within 60 seconds.
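One common way to realize the "tamper-evidence chain" the last criterion assumes is a hash chain, where each record commits to the hash of its predecessor. A minimal sketch (function names are illustrative):

```python
import hashlib, json

# Each entry stores the SHA-256 of (previous hash + canonical record JSON),
# so any edit breaks verification from that point on.
def chain_hash(prev_hash, record):
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def append(log, record):
    prev = log[-1]["hash"] if log else "genesis"
    log.append({"record": record, "hash": chain_hash(prev, record)})

def verify(log):
    prev = "genesis"
    for i, entry in enumerate(log):
        if entry["hash"] != chain_hash(prev, entry["record"]):
            return False, i           # index of the first tampered record
        prev = entry["hash"]
    return True, None

log = []
append(log, {"event_type": "registration", "user_id": "u1"})
append(log, {"event_type": "assertion", "user_id": "u1"})
print(verify(log))                        # (True, None)
log[0]["record"]["user_id"] = "u2"        # tamper with the first record
print(verify(log))                        # (False, 0)
```

Verification is a single linear pass, which is why a 60-second budget for 10,000 records is generous.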
Capture of Failures, Policy Violations, and Break-Glass Activity
- Given a failed WebAuthn assertion occurs, When the failure is returned, Then an audit record is created with event_type=assertion, outcome=fail, user_id, username, timestamp, source_ip, rp_id, aaguid (if provided), error_code, and error_reason.
- Given a policy violation is detected (e.g., disallowed AAGUID or RP ID mismatch), When the attempt occurs, Then an audit record is created with event_type=policy_violation, policy_rule_id, outcome=blocked, plus standard fields (user_id, timestamp, source_ip, rp_id, aaguid when available).
- Given break-glass access is initiated, When the flow completes, Then an audit record is created with event_type=break_glass, actor_user_id, approver_user_id (if applicable), justification, scope, start_time, end_time, and outcome in {approved, denied, expired}.
- Given an unauthorized attempt to invoke break-glass occurs, When approvals are missing, Then an audit record is created with event_type=break_glass and outcome=blocked and severity=high.
Searchable Audit Log Interface and API
- Given a user with the Security Auditor role, When they filter by user_id, date range, event_type in {registration, assertion, policy_violation, break_glass}, outcome, rp_id, aaguid, and source_ip/CIDR, Then matching records are returned sorted by timestamp desc within 3 seconds for up to 10,000 results.
- Given a large result set, When pagination is used, Then the API returns a stable next_cursor and total_count, and the UI paginates consistently with 100–500 records per page.
- Given a free-text query on justification and error_reason, When the query is executed, Then only records containing the terms in those fields are returned.
- Given a filtered result set, When Export is requested, Then JSON and CSV exports contain identical records and fields to the on-screen results.
- Given a user without the Security Auditor role, When they attempt to access logs, Then access is denied and the attempt is audited.
SIEM Export and Streaming Delivery
- Given a SIEM destination is configured (HTTPS webhook with HMAC or Syslog over TLS), When audit events are generated, Then ≥99% of events are delivered within 60 seconds and each payload includes a verifiable signature.
- Given transient delivery failures occur, When retries are attempted, Then exponential backoff is applied for up to 24 hours with at-least-once delivery guarantees.
- Given connectivity is restored after an outage, When streaming resumes, Then the backlog is drained in order and duplicates, if any, are marked replay=true.
- Given an on-demand export is requested for a date range, When the job completes, Then a downloadable file (NDJSON or CSV) contains all matching records and the count matches the UI/API within ±0.1%.
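The exponential-backoff retry in the second bullet could follow a schedule like the one below. Only the 24-hour budget comes from the criteria; the base delay, growth factor, and per-attempt cap are illustrative assumptions.

```python
# Build the retry delay schedule: start small, double each attempt, cap the
# per-attempt delay, and stop once the cumulative delay would exceed 24 hours.
def backoff_schedule(base=5, factor=2, cap=3600, budget=24 * 3600):
    delays, total, d = [], 0, base
    while total + d <= budget:
        delays.append(d)
        total += d
        d = min(d * factor, cap)     # cap individual waits (here: 1 hour)
    return delays

sched = backoff_schedule()
print(len(sched), "attempts over", sum(sched), "seconds")
```

In practice each delay would also get random jitter so that many failed deliveries do not retry in lockstep, and the event would be parked in a dead-letter queue once the budget is exhausted.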
Anomaly Alert: Repeated Authentication Failures
- Given an alert policy of N=5 failures in M=10 minutes per user, When a user accrues ≥5 assertion failures within any rolling 10-minute window, Then an alert is generated within 1 minute containing user_id, username, window_start/end, failure_count, distinct source_ips, and sample event_ids.
- Given an alert has fired for a user, When additional failures occur within a 30-minute suppression window, Then no duplicate alerts are sent and the existing alert is updated with the new counts.
- Given a successful assertion for the same user occurs after an alert, When the success is logged, Then the alert is auto-resolved (if auto-resolve is enabled) and resolution is audited.
- Given delivery channels are configured (email and webhook), When the alert triggers, Then notifications are sent to all active channels with a testable payload schema.
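The rolling-window rule with suppression can be sketched in a few lines. The class name and in-memory state are illustrative (a real deployment would persist state and handle delivery); the N=5 / 10-minute / 30-minute thresholds mirror the policy above.

```python
from collections import deque

# Sliding-window failure counter per user, with a suppression window that
# prevents duplicate alerts for the same burst.
class FailureAlerter:
    def __init__(self, n=5, window_s=600, suppress_s=1800):
        self.n, self.window_s, self.suppress_s = n, window_s, suppress_s
        self.failures = {}     # user -> deque of failure timestamps
        self.last_alert = {}   # user -> timestamp of last alert fired

    def record_failure(self, user, ts):
        q = self.failures.setdefault(user, deque())
        q.append(ts)
        while q and ts - q[0] > self.window_s:   # evict events outside window
            q.popleft()
        if len(q) >= self.n:
            last = self.last_alert.get(user)
            if last is None or ts - last > self.suppress_s:
                self.last_alert[user] = ts
                return "alert"
            return "suppressed"   # existing alert gets updated, no duplicate
        return None

a = FailureAlerter()
out = [a.record_failure("u1", t) for t in [0, 60, 120, 180, 240, 300]]
print(out)   # [None, None, None, None, 'alert', 'suppressed']
```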
Anomaly Alert: New Geography Access Detection
- Given a 90-day baseline of prior successful assertion geographies per user, When a new successful assertion originates from a country/region not seen in the baseline, Then an alert is generated within 2 minutes containing user_id, prior_locations, new_location, source_ip, and geo_confidence.
- Given an organization-defined allowlist of locations, When an assertion matches an allowlisted location, Then no new-geo alert is generated.
- Given the geo_confidence is below a threshold (e.g., <0.6), When location cannot be reliably determined, Then the event is flagged as low-confidence and no alert is sent.
- Given an alert is generated, When viewed in the UI, Then it links to the underlying audit records and a map pin is shown for context.
Per-User and Organization-Level Compliance Reports
- Given a user is selected, When the per-user report is loaded for a 90-day range, Then it displays counts by event_type, a time series, registered AAGUIDs, last 5 source IPs, and any anomalies, and loads within 3 seconds for ≤50k events.
- Given an organization-level report is requested for a custom date range, When it is generated, Then it includes totals by event_type, top IPs/geographies, break-glass summary, and anomaly counts, and totals reconcile with raw logs within ±0.5%.
- Given a report is exported, When CSV/PDF is generated, Then the contents match the on-screen data and include a generation timestamp and the filters applied.
- Given role-based access control is enforced, When a user without reporting permissions attempts access, Then access is denied and the attempt is audited.
- Given any report is viewed or exported, When the action completes, Then a corresponding audit record is created with event_type=report_access and outcome=success.

IP Safe Zones

Restricts Lifeline sessions to approved networks and locations with granular IP allowlists (NOC, EOC, depots, designated trucks). Geofenced access slashes exposure if a token leaks, while letting critical teams connect from pre-cleared sites.

Requirements

Safe Zone Policy Engine
"As a security administrator, I want to define and apply IP-based Safe Zones to Lifeline sessions so that only approved networks and sites can access critical outage controls."
Description

Implements named Safe Zones composed of IPv4/IPv6 CIDR allowlists for NOC, EOC, depots, and designated truck networks. Supports zone metadata (owner, location, purpose), tags, effective time windows, and environment scoping. Policies bind to Lifeline session types and roles, enforcing deny-by-default outside approved zones. Includes CIDR normalization, overlap detection, and validation against reserved/private ranges. Ensures multi-tenant isolation, versioned policy changes with rollback, and propagation to all enforcement points within seconds.

Acceptance Criteria
Create Named Safe Zone with Required Metadata and Time Window
Given an admin user operating within a tenant and environment scope When the admin creates a Safe Zone with a unique name, at least one IPv4 or IPv6 CIDR, owner, location, purpose, tags, and an effective start/end time Then the zone is persisted with a unique ID and the provided metadata And the zone evaluates as active only when the current time is within the effective window And attempts to save without a name, environment, owner, or at least one CIDR are rejected with field-level errors
Enforce Deny-by-Default by Role and Session Type
Given a user with role and a Lifeline session type attempting to connect from a source IP not in any bound Safe Zone for the environment When policy evaluation runs Then the session is denied with HTTP 403 and reason "outside approved Safe Zone" And the decision log includes user, role, session type, source IP, matched/none zone, and policy version Given the same user from a source IP within any CIDR of a bound zone When policy evaluation runs Then the session is allowed and the log includes the matched zone name and CIDR
CIDR Normalization and Overlap Detection
Given a zone with IPv4 entries [10.0.0.0/24, 10.0.0.0/23] When the zone is saved Then the stored allowlist is normalized to [10.0.0.0/23] and duplicates are removed Given two zones within the same tenant and environment whose CIDRs overlap When binding both zones to the same policy Then the binding is rejected with an error listing overlapping ranges and zone names Given non-overlapping CIDRs in a zone When saving Then no overlap warnings or errors are produced
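In Python, `ipaddress.collapse_addresses` performs exactly this kind of merge (containment and adjacency), which makes the normalization and cross-zone overlap checks straightforward to sketch. Function names here are illustrative:

```python
import ipaddress

# Normalize a zone's allowlist: merge contained/adjacent networks and drop
# duplicates. collapse_addresses requires a single IP version, so split first.
def normalize(cidrs):
    nets = [ipaddress.ip_network(c) for c in cidrs]
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    return ([str(n) for n in ipaddress.collapse_addresses(v4)] +
            [str(n) for n in ipaddress.collapse_addresses(v6)])

# Report every overlapping (CIDR-a, CIDR-b) pair between two zones.
def overlapping(zone_a, zone_b):
    pairs = []
    for a in zone_a:
        for b in zone_b:
            na, nb = ipaddress.ip_network(a), ipaddress.ip_network(b)
            if na.version == nb.version and na.overlaps(nb):
                pairs.append((a, b))
    return pairs

print(normalize(["10.0.0.0/24", "10.0.0.0/23"]))       # ['10.0.0.0/23']
print(overlapping(["10.0.0.0/23"], ["10.0.1.0/24"]))   # overlap detected
print(overlapping(["10.0.0.0/24"], ["10.0.2.0/24"]))   # []
```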
CIDR Classification and Validation (Private/Reserved/Public)
Given an input CIDR in reserved or documentation ranges (e.g., 0.0.0.0/8, 127.0.0.0/8, 169.254.0.0/16, 192.0.2.0/24, ::/128, ::1/128, 2001:db8::/32) When saving a zone Then validation fails with code "CIDR_RESERVED_NOT_ALLOWED" and identifies the offending CIDR Given an input CIDR in public address space When saving without tag "public-ok" Then validation fails with code "CIDR_PUBLIC_REQUIRES_TAG" And when saving with tag "public-ok" Then the zone saves and the CIDR is annotated as "public" Given an input CIDR in 100.64.0.0/10 (CGNAT) or 198.18.0.0/15 (benchmark) When saving without tag "cgnat-ok" Then validation fails with code "CIDR_CGNAT_REQUIRES_TAG" And when saving with tag "cgnat-ok" Then the zone saves and the CIDR is annotated as "cgnat" Given CIDRs in RFC1918 (10/8, 172.16/12, 192.168/16) or IPv6 ULA (fc00::/7) When saving Then the zone saves and the CIDRs are annotated as "private"
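The classification codes above imply range tables roughly like the following. The lists mirror the examples in the criteria and are not exhaustive; a production validator would carry the full IANA special-purpose registries.

```python
import ipaddress

# Range tables assumed from the criteria's examples (illustrative, not complete).
RESERVED = [ipaddress.ip_network(c) for c in (
    "0.0.0.0/8", "127.0.0.0/8", "169.254.0.0/16", "192.0.2.0/24",
    "::/128", "::1/128", "2001:db8::/32")]
CGNAT = [ipaddress.ip_network(c) for c in ("100.64.0.0/10", "198.18.0.0/15")]
PRIVATE = [ipaddress.ip_network(c) for c in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")]

def classify(cidr):
    net = ipaddress.ip_network(cidr)
    for group, label in ((RESERVED, "reserved"), (CGNAT, "cgnat"),
                         (PRIVATE, "private")):
        if any(net.version == r.version and net.subnet_of(r) for r in group):
            return label
    return "public"   # public space requires the "public-ok" tag per policy

print(classify("127.0.0.0/8"))     # reserved -> CIDR_RESERVED_NOT_ALLOWED
print(classify("100.70.0.0/16"))   # cgnat    -> needs "cgnat-ok" tag
print(classify("10.1.0.0/16"))     # private  -> saves, annotated "private"
print(classify("8.8.8.0/24"))      # public   -> needs "public-ok" tag
```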
Environment Scoping and Tenant Isolation
Given a zone scoped to "prod" and a policy scoped to "staging" When attempting to bind the zone to the policy Then binding is rejected with code "ENV_MISMATCH" Given two tenants A and B When a user from tenant A lists or references zones Then only zones in tenant A are returned; referencing a zone ID from tenant B returns 404 Given policy evaluation in tenant A When a request originates from an IP allowed by a zone in tenant B Then the request is denied unless also allowed by a zone in tenant A
Versioned Policy Changes and Rollback
Given an existing policy version N When zones or bindings are updated and published Then a new version N+1 is created with a diff and audit record (actor, time, changes) And version N remains available for rollback Given version N+1 is active When a rollback to version N is initiated Then version N becomes active within 10 seconds and is propagated to enforcement points per SLA And an audit record of rollback is created Given a failed publish due to validation errors When publishing Then the active version remains unchanged and the publish is aborted with no partial updates
Policy Propagation SLA to Enforcement Points and Fail-Closed Behavior
Given a successfully published policy or zone change When measuring from publish time to receipt at all enforcement points Then 95th percentile propagation latency is <= 5 seconds and 99th percentile <= 10 seconds over at least 1000 events And all enforcement points report the new active version ID Given an enforcement point is unreachable When a policy change is published Then that enforcement point enters "degraded" state and enforces deny-by-default until it receives the update And an alert is emitted within 30 seconds identifying the lagging enforcement point
Zone Management UI and API
"As a network engineer, I want an intuitive UI and API to manage Safe Zones and IP ranges so that I can maintain access controls quickly and accurately during evolving field conditions."
Description

Provides an admin console and REST API to create, edit, and delete Safe Zones; attach CIDRs; assign labels; and map zones to roles and Lifeline scopes. Includes bulk import/export (CSV/JSON), inline validation with error highlighting, preview of affected source IPs, and change-review with optional two-person approval. Offers search, filtering, and history views with diffs between versions. API secured with service-to-service auth and rate limits, with idempotent operations for automation pipelines.

Acceptance Criteria
UI Create Safe Zone with Inline Validation and Preview
Given an authenticated org admin is on Zone Management > Create Zone When they enter a zone name and add CIDRs including invalid entries (e.g., "10.0.0.0/33", "abc") Then invalid CIDR fields are highlighted inline with specific messages and the Save button remains disabled And when all CIDRs are valid (IPv4/IPv6, up to 100 entries), labels are unique (<=10), and a name is provided Then the Save button becomes enabled And when the user clicks "Preview affected source IPs" Then a modal shows per-CIDR: total address count, overlap warnings, and up to 10 sample IPs, and the preview renders in under 2 seconds for up to 100 CIDRs And when the user saves the zone Then the new zone appears in the list within 2 seconds with correct name, labels, and CIDRs, and an audit entry records creator, timestamp, and payload summary
REST API S2S Auth, Rate Limits, and Idempotent Create/Update/Delete
Given a service client presents a valid JWT with audience "outagekit.api" and scope "zones.write" When it POSTs /v1/safe-zones with an Idempotency-Key header and a valid payload Then it receives 201 Created with JSON body including id, etag, and createdAt, and subsequent identical POSTs within 24h return 200 OK with the same body and header Idempotency-Replayed: true And when the client exceeds 100 write requests per minute per client id Then the API returns 429 Too Many Requests with Retry-After set in seconds And when a request has an invalid/expired JWT or missing scope Then the API returns 401 (invalid token) or 403 (insufficient scope) without side effects And when the client PATCHes /v1/safe-zones/{id} with If-Match: {etag} Then updates succeed with 200 OK and a new etag; a mismatched etag returns 412 Precondition Failed And when the client DELETEs /v1/safe-zones/{id} Then it receives 204 No Content; repeating DELETEs are idempotent and return 204; subsequent GET /v1/safe-zones/{id} returns 404
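The Idempotency-Key replay semantics can be sketched in-memory as below. The class and field names are illustrative; a real service would persist the key-to-response mapping with the 24-hour expiry, and handle auth and rate limits in middleware.

```python
import uuid

# Minimal sketch of idempotent zone creation: the first POST with a key
# creates the zone (201); replays with the same key return the stored body (200).
class ZoneAPI:
    def __init__(self):
        self.zones = {}
        self.idem = {}   # Idempotency-Key -> (status, body)

    def create_zone(self, idem_key, payload):
        if idem_key in self.idem:
            status, body = self.idem[idem_key]
            return 200, dict(body, idempotency_replayed=True)
        zone_id = str(uuid.uuid4())
        body = {"id": zone_id, "name": payload["name"], "etag": "v1"}
        self.zones[zone_id] = body
        self.idem[idem_key] = (201, body)
        return 201, body

api = ZoneAPI()
s1, b1 = api.create_zone("key-1", {"name": "NOC"})
s2, b2 = api.create_zone("key-1", {"name": "NOC"})
print(s1, s2, b1["id"] == b2["id"])   # 201 200 True
```

The stored response, not a re-executed create, is what makes automation pipelines safe to retry: a duplicate POST can never mint a second zone.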
Bulk Import/Export with Validation and Dry-Run
Given an admin uploads a CSV or JSON file conforming to the documented schema When "Dry Run" is selected and the import is executed Then the system validates all rows and returns a summary with counts of to-create, to-update, and rejected rows, plus per-row error messages, and no data is persisted And when "Commit" is selected with mode = All-or-nothing Then either all valid rows are applied and the summary shows 0 rejected, or no changes are applied if any row fails, with errors reported And when mode = Best-effort Then valid rows are applied and failed rows are skipped with detailed errors listed; a downloadable results report is provided And when the user exports zones as CSV or JSON with active filters Then only filtered zones are included with fields [id,name,labels,cidrs,roles,scopes,version,updatedAt] in a stable order, and the file is generated within 3 seconds for up to 10,000 zones
Map Zones to Roles and Lifeline Scopes
Given roles and Lifeline scopes exist When an admin assigns one or more roles and scopes to a zone and saves Then the mapping persists and is visible on the zone detail view And when calling GET /v1/safe-zones/{id} Then the response includes roles[] and scopes[] reflecting the UI selections And when filtering zones by a role or scope in the UI or via GET /v1/safe-zones?role=...&scope=... Then only zones with matching mappings are returned And removing a role/scope from a zone updates the mapping and appears in audit history
Search and Filter Zones by Name, Label, CIDR, Role, and Scope
Given a list of existing zones When a user types a search term (e.g., name or label) in the search box Then the results update within 300 ms after typing stops and matching is case-insensitive and diacritic-insensitive And when a user searches for a CIDR Then exact CIDR matches are returned; partial IP fragments do not match unless part of a label or name And when filters for labels, roles, and scopes are applied in combination Then the result set reflects the logical AND of selected filters And when no results match Then the UI displays a "No zones found" state with a clear option to clear filters
Change Review with Optional Two-Person Approval
Given the organization setting "Require two-person approval for zone changes" is enabled When User A proposes changes to a zone (create/edit/delete) Then a draft change request is created with status Pending Approval and includes a diff of proposed changes; User A cannot approve their own request And when User B (with approval permission and not the requester) approves Then the change is applied, status becomes Approved, and audit log records requester, approver, timestamps, and diff And when User B rejects Then no changes are applied and status becomes Rejected with an optional reason recorded And when the setting is disabled Then saving changes applies immediately without an approval step but still records an audit entry with the diff
History View with Diffs Between Versions
Given a zone with multiple saved versions When a user opens the History tab Then a chronological list of versions is displayed with user, timestamp, and action (create/edit/delete/approve) for each entry And when the user selects two versions to compare Then a diff view highlights added/removed/modified CIDRs, labels, and role/scope mappings And when the user opens a single version Then a read-only snapshot of that version is shown And the history list loads in under 1 second for up to 50 versions and supports export of the selected diff as JSON
Real-time Session Enforcement Middleware
"As an operations manager, I want Lifeline access to be automatically limited to approved locations so that any leaked token or credential cannot be used from untrusted networks."
Description

Adds gateway middleware that validates client source IP at Lifeline session creation and on each privileged call. Honors a trusted proxy list (X-Forwarded-For) and supports IPv4/IPv6, NAT, and CGNAT edge cases with configurable matching rules. Implements low-latency cache with short TTL, fail-closed defaults, and graceful degradation policies for known outages. Generates structured decision logs (allow/deny, matched zone, reason) and emits security alerts on zone violations or token use from non-approved networks.

Acceptance Criteria
Deny Session from Non-Approved IP at Creation
Given allowlist zones are configured and the evaluated client IP is not in any approved zone When a Lifeline session is created Then the middleware denies the request with HTTP 403, error_code "ip_zone_denied", reason "ip_not_in_allowlist" Given the evaluated client IP is in an approved zone When a Lifeline session is created Then the middleware allows the request and stamps the session with matched_zone_id Given fail-closed default is enabled When allowlist lookup times out (>200 ms) or errors Then session creation is denied with HTTP 503, error_code "policy_fail_closed", reason "allowlist_unavailable" Given the decision is computed When the response is returned Then added decision latency is <= 20 ms p95 and <= 50 ms p99 measured over 1,000+ requests
Enforce Per-Call Checks for Privileged Endpoints
Given an established Lifeline session When a privileged API call is made Then the middleware re-validates the current evaluated client IP against approved zones before forwarding upstream Given the evaluated client IP no longer matches an approved zone When a privileged API call is made Then the call is denied with HTTP 403, error_code "ip_zone_violation" and the session token is flagged so further privileged calls are blocked until re-authentication Given the evaluated client IP still matches an approved zone When a privileged API call is made Then added enforcement latency is <= 10 ms p95 and <= 25 ms p99
Trusted Proxy Handling for X-Forwarded-For
Given a chain of trusted proxies is configured When a request includes X-Forwarded-For Then the middleware extracts the client IP as the left-most valid IP preceding the first trusted proxy boundary and ignores untrusted headers Given X-Forwarded-For is present but the immediate sender is not in the trusted proxy list When a request is received Then the middleware ignores X-Forwarded-For and uses the network source IP for evaluation Given multiple X-Forwarded-For IPs including private/reserved ranges When extracting the client IP Then the middleware selects the first valid public IP; if none exists it uses the network source IP Given a spoofed or malformed X-Forwarded-For header When parsing occurs Then the middleware logs reason "invalid_xff" and proceeds using the network source IP without failing the request
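A sketch of the extraction rules as written above (use the network source IP unless the immediate peer is a trusted proxy, then take the left-most valid public IP from X-Forwarded-For, falling back on malformed input). `TRUSTED_PROXIES` and the function name are illustrative:

```python
import ipaddress

TRUSTED_PROXIES = {"203.0.113.10"}   # illustrative trusted proxy list

def evaluated_client_ip(peer_ip, xff_header):
    # If the immediate sender is not a trusted proxy, ignore XFF entirely.
    if peer_ip not in TRUSTED_PROXIES or not xff_header:
        return peer_ip, None
    for part in xff_header.split(","):
        part = part.strip()
        try:
            ip = ipaddress.ip_address(part)
        except ValueError:
            # Malformed entry: log reason and fall back to the network source.
            return peer_ip, "invalid_xff"
        if ip.is_global:   # first valid public IP, scanning left to right
            return part, None
    return peer_ip, None   # no public IP anywhere in the chain

print(evaluated_client_ip("198.51.100.7", "8.8.8.8"))            # untrusted peer
print(evaluated_client_ip("203.0.113.10", "10.0.0.5, 8.8.8.8"))  # public IP wins
print(evaluated_client_ip("203.0.113.10", "garbage, 8.8.8.8"))   # invalid_xff
```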
IPv4/IPv6 and NAT/CGNAT Matching Rules
Given zones contain IPv4 and IPv6 CIDRs and explicit IPs When matching occurs Then the middleware correctly matches IPv4, IPv6 (including compressed forms), and IPv4-mapped IPv6 client addresses Given NAT/CGNAT scenarios where only carrier ranges are approved When a client IP falls within an approved CGNAT range Then the request is allowed per policy and reason "matched_cgnat_range" Given an RFC1918 or ULA address appears as client IP via headers and no trusted proxies are present When evaluation occurs Then the request is denied by fail-closed policy with reason "unroutable_client_ip" Given overlapping zones exist When matching occurs Then the middleware applies longest-prefix match and records the matched_zone_id
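Longest-prefix matching with IPv4-mapped IPv6 unmapping might look like this sketch (zone IDs and ranges are illustrative; `2001:db8::` is a documentation prefix, as used elsewhere in these criteria):

```python
import ipaddress

ZONES = {
    "noc-wide": ipaddress.ip_network("10.0.0.0/8"),
    "noc-core": ipaddress.ip_network("10.1.2.0/24"),
    "eoc-v6":   ipaddress.ip_network("2001:db8:1::/48"),
}

def match_zone(client_ip):
    ip = ipaddress.ip_address(client_ip)
    if ip.version == 6 and ip.ipv4_mapped:   # ::ffff:10.1.2.3 -> 10.1.2.3
        ip = ip.ipv4_mapped
    best = None
    for zone_id, net in ZONES.items():
        if ip.version == net.version and ip in net:
            # Longest-prefix match: the most specific overlapping zone wins.
            if best is None or net.prefixlen > ZONES[best].prefixlen:
                best = zone_id
    return best

print(match_zone("10.1.2.3"))          # noc-core (more specific than noc-wide)
print(match_zone("::ffff:10.1.2.3"))   # noc-core (unmapped before matching)
print(match_zone("2001:db8:1::5"))     # eoc-v6
print(match_zone("192.0.2.1"))         # None -> deny by default
```

For thousands of CIDRs a radix/Patricia trie would replace the linear scan, but the longest-prefix rule is the same.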
Low-Latency Cache with Short TTL and Invalidation
Given zone configuration is cached with TTL=30s (configurable) When a change is made to allowlists Then 95% of enforcement decisions reflect the update within TTL + 2s Given the cache is warm When decisions are made Then cache hit rate is >= 95% and cache lookup latency is <= 2 ms p95 Given an administrator triggers explicit cache invalidation When the invalidation API is called Then all nodes purge relevant entries within 5s and subsequent decisions use fresh data
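A minimal TTL-cache sketch for zone-config lookups; the 30-second TTL matches the criterion, while `fetch` stands in for the real policy-store call and the class name is illustrative:

```python
import time

class TTLCache:
    def __init__(self, ttl_s=30):
        self.ttl_s = ttl_s
        self.store = {}   # key -> (value, expires_at)

    def get(self, key, fetch, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(key)
        if hit and now < hit[1]:
            return hit[0], True          # cache hit, no upstream call
        value = fetch(key)               # miss or expired: refetch
        self.store[key] = (value, now + self.ttl_s)
        return value, False

    def invalidate(self, key):
        self.store.pop(key, None)        # explicit purge; next get refetches

c = TTLCache(ttl_s=30)
fetch = lambda k: f"config-for-{k}"
print(c.get("tenant-a", fetch, now=0))    # ('config-for-tenant-a', False)
print(c.get("tenant-a", fetch, now=10))   # ('config-for-tenant-a', True)
print(c.get("tenant-a", fetch, now=31))   # expired -> refetched, False
```

The explicit `invalidate` path is what the cluster-wide purge API would call on each node to meet the 5-second invalidation target.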
Decision Logging and Security Alerts on Zone Violations
Given any allow or deny decision When the middleware processes a request Then it emits a structured JSON log with request_id, tenant_id, timestamp (UTC ISO8601), decision (allow/deny), reason_code, matched_zone_id (or null), evaluated_client_ip, proxy_chain, and latency_ms Given a deny due to zone violation or token use from non-approved networks When detected Then a security alert is emitted within 10s containing severity "high", tenant_id, token_id (hashed), matched_zone_id (or null), evaluated_client_ip, and reason_code, deduplicated to max 1 alert per token per 5 minutes Given logs are emitted When sampled over an hour Then >= 99% are parseable against the defined JSON schema
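The structured decision record could be built as below; the field names follow the criteria, while the sample values are illustrative:

```python
import json
from datetime import datetime, timezone

# Emit one JSON line per allow/deny decision, with the fields named in the
# acceptance criteria (matched_zone_id is null on deny-by-default).
def decision_log(request_id, tenant_id, decision, reason_code,
                 matched_zone_id, client_ip, proxy_chain, latency_ms):
    record = {
        "request_id": request_id,
        "tenant_id": tenant_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC ISO8601
        "decision": decision,
        "reason_code": reason_code,
        "matched_zone_id": matched_zone_id,
        "evaluated_client_ip": client_ip,
        "proxy_chain": proxy_chain,
        "latency_ms": latency_ms,
    }
    return json.dumps(record)

line = decision_log("req-123", "t-1", "deny", "ip_zone_violation",
                    None, "198.51.100.7", ["203.0.113.10"], 4)
print(line)
```

Because every record is a single schema-stable JSON object, the ">= 99% parseable against the defined JSON schema" check reduces to validating each line independently.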
Graceful Degradation During Allowlist Service Outage
Given the upstream allowlist source is degraded and fail-closed default is enabled When lookups fail Then decisions deny with HTTP 503, reason "allowlist_unavailable", and a Retry-After header set to <= 60s Given a known outage window is configured for a tenant with a temporary override zone When the allowlist source is unavailable Then the middleware applies the configured override zone for up to 30 minutes and logs reason "graceful_override_applied" Given degradation lasts beyond the override duration When decisions occur Then the system reverts to fail-closed behavior and emits an "override_expired" alert
Emergency Bypass with MFA and Auto-Expiry
"As an on-call incident commander, I want a temporary, auditable bypass when I’m outside pre-cleared networks so that I can restore service without compromising security."
Description

Provides a break-glass workflow allowing temporary access outside Safe Zones under strict controls: step-up MFA, mandatory justification, scope reduction, time-boxed expiry, and optional approver escalation. Sends real-time notifications to security and incident channels, displays prominent banners during bypass, and records full audit trails. Auto-revokes access at expiry or when the user returns to an approved zone, with post-incident review reports.

Acceptance Criteria
Bypass Initiation Requires Step-Up MFA
Given a user is outside all Safe Zones and requests an emergency bypass When the user initiates the bypass workflow Then the system requires step-up MFA using at least two distinct allowed factors (e.g., FIDO2 security key, TOTP, push) And the bypass is denied after 5 failed MFA attempts or if MFA is not completed within 3 minutes And the MFA outcome and factors used are recorded in the audit log
Mandatory Justification Capture
Given a user requests an emergency bypass outside Safe Zones When prompted for business justification Then the user must select a reason from a policy-defined list and enter a free-text justification of at least 20 characters And submission is blocked until both fields are provided And the justification and reason code are stored with the bypass session audit record
Time-Boxed Expiry and Auto-Revocation on Return to Safe Zone
Given a bypass session is approved and active When the session starts Then an expiry between 5 and 120 minutes is enforced (default 60 minutes) And the user is shown a countdown and receives a 5-minute pre-expiry warning When the expiry time is reached Then access tokens are revoked within 30 seconds and the session is terminated When the user returns to an approved Safe Zone during an active bypass Then the bypass ends within 60 seconds and standard Safe Zone controls re-apply
Reduced Access Scope Enforcement During Bypass
Given a bypass session is active When the user accesses OutageKit features Then the Bypass-Restricted policy is applied limiting permissions to read-only dashboards, incident viewing, and sending pre-approved communication templates And administrative settings, role management, network configuration, API keys, and allowlist changes are blocked with HTTP 403 and disabled UI controls And all blocked attempts are logged with user, action, resource, and timestamp
Optional Approver Escalation Policy
Given the organization requires approval for emergency bypass When the user submits a bypass request Then the request is routed to the designated approver group with user, IP, geo, reason, requested duration, and scope details And a single approver must approve within 2 minutes via a supported channel for access to be granted And if the request is denied or times out, bypass is not granted and the user is notified And approver identity, decision, and timestamps are recorded in the audit
Real-Time Alerts and On-Screen Bypass Banner
Given bypass lifecycle events (requested, approved, started, extended, ended, expired) When any such event occurs Then notifications are delivered to configured security and incident channels (email, chat, webhook, SMS) within 10 seconds including user, IP, geo, reason, scope, and expiry And during an active bypass, a non-dismissible banner is displayed on all pages indicating "Emergency Bypass Active," remaining time, scope limitations, and an "End Bypass" action
Audit Trail and Post-Incident Review Report
Given any emergency bypass session activity occurs When the session ends by expiry, manual end, or Safe Zone return Then an immutable audit record includes user ID, device fingerprint, IP, geolocation, MFA factors, justification, approver decision, timestamps, actions performed, and notifications sent And security roles can filter and export audit data as CSV or JSON And a post-incident review report is generated within 5 minutes summarizing timeline, scope, actions taken, and recommendations, available to security and compliance roles
Comprehensive Audit and Compliance Reporting
"As a compliance officer, I want detailed, exportable audit records and reports so that I can demonstrate controlled access to regulators and auditors."
Description

Captures immutable logs for all policy changes, access decisions, bypass events, and administrative actions with actor, IP, zone, timestamp, and outcome. Exposes role-restricted dashboards and export (CSV/JSON) with filters by user, site, time, and result. Supports SIEM forwarding (Syslog/CEF), retention policies, and tamper-evident storage. Includes prebuilt reports for SOC2/ISO27001 evidence and executive summaries of zone effectiveness and attempted violations.

Acceptance Criteria
Immutable Policy Change Logging
Given an authenticated admin creates, updates, or deletes an IP Safe Zone policy When the change is saved Then an immutable audit record is appended with fields: actor_id, actor_role, action_type, policy_id, before_value_hash, after_value_hash, request_ip, request_zone_id, timestamp_utc (ISO8601 ms), outcome, reason And the record contains prev_hash and entry_hash forming a verifiable hash chain across all policy-change records And the write is acknowledged only after durable replication to at least 2 storage nodes And attempts to modify or delete any historical audit record are rejected and a tamper_attempt event is logged with actor_id, ip, timestamp_utc And a daily integrity job recomputes the chain; on failure it logs integrity_status = "fail" and raises a high-severity alert within 60 seconds
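The hash-chain requirement above (prev_hash and entry_hash forming a verifiable chain, plus a recomputing integrity job) can be sketched in a few lines of stdlib Python. This is an illustrative in-memory model, not the production storage layer; the `AuditChain` name and record shape are assumptions for the example.

```python
import hashlib
import json

def _entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

class AuditChain:
    """Append-only audit log; each record carries prev_hash and entry_hash."""
    GENESIS = "0" * 64  # sentinel prev_hash for the first record

    def __init__(self):
        self.records = []

    def append(self, payload: dict) -> dict:
        prev = self.records[-1]["entry_hash"] if self.records else self.GENESIS
        record = {"payload": payload, "prev_hash": prev,
                  "entry_hash": _entry_hash(prev, payload)}
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the whole chain, as the daily integrity job would."""
        prev = self.GENESIS
        for rec in self.records:
            if rec["prev_hash"] != prev:
                return False
            if rec["entry_hash"] != _entry_hash(prev, rec["payload"]):
                return False
            prev = rec["entry_hash"]
        return True
```

Any edit to a historical payload breaks `verify()`, which is exactly the signal the integrity job would escalate as a high-severity alert.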
Access Decision and Bypass Event Logging
Given a user or service initiates a Lifeline session evaluated against IP Safe Zones When the decision engine returns allow, deny, challenge, or bypass Then an audit record is written for every decision with: decision_id, session_id, user_id (or client_id), source_ip, matched_zone_id (nullable), rule_id, evaluation_reasons[], latency_ms, timestamp_utc, outcome And for bypass events, the record also includes approver_id, approval_method, justification (non-empty), scope, expiration_utc, and a link to the related denied attempt And 100% of decisions are captured; if the logging backend is unavailable, events are durably queued locally (capacity >= 100,000 events) and forwarded on recovery; on overflow, a high-severity alert is emitted And system clocks across decision and logging services are synchronized within 1 second to preserve event ordering
Role-Restricted Audit Dashboards and Filtering
Given role-based access control is configured When a user with role SecurityAdmin or AuditViewer opens the Audit dashboard Then access is granted and an access event is audited; users with other roles receive HTTP 403 and a denial is audited And the dashboard supports filters by user_id, site/zone_id, time range (UTC), action_type, and outcome; combined filters return correct results And for a dataset of 1,000,000 records, filtered results render within 3 seconds and support sorting, pagination, and column selection And any dashboard action (view, filter, export, share) is itself audited with actor, timestamp_utc, and parameters And counts and samples shown in the UI match the underlying log store within 0.1% for the same filters
Exporting Audit Data to CSV and JSON
Given a user with export permission applies filters on the Audit dashboard When the user requests CSV export Then the file contains only records matching the filters with required columns: tenant_id, event_type, actor_id, user_id/client_id, source_ip, zone_id, rule_id, outcome, timestamp_utc, event_id, and conforms to RFC4180 (quoted, escaped, UTF-8, header row) When the user requests JSON export Then the output is NDJSON (one JSON object per line) with the same fields and UTC ISO8601 timestamps And for up to 1,000,000 records, streaming export completes within 2 minutes; larger exports run asynchronously and provide a downloadable link and email notification And every export creates an audit event with export_id, actor_id, filter_summary, format, record_count, and a SHA-256 checksum of the payload And exported record_count matches the UI count for the same filters
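The dual export formats above (RFC 4180 CSV with a header row, and NDJSON with one object per line) map directly onto Python's stdlib: the `csv` module handles quoting and escaping, and NDJSON is just one `json.dumps` per record. A minimal sketch, with the field list and function names assumed for illustration:

```python
import csv
import io
import json

FIELDS = ["tenant_id", "event_type", "actor_id", "source_ip",
          "zone_id", "rule_id", "outcome", "timestamp_utc", "event_id"]

def to_ndjson(records):
    """Yield one JSON object per line (NDJSON), streaming record by record."""
    for rec in records:
        yield json.dumps({f: rec.get(f) for f in FIELDS}, sort_keys=True) + "\n"

def to_csv(records) -> str:
    """CSV with a header row; the csv module applies RFC 4180-style quoting."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return buf.getvalue()
```

Because `to_ndjson` is a generator, the export can stream arbitrarily large result sets without buffering them in memory, which is what makes the 2-minute target for a million records plausible.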
SIEM Forwarding via Syslog/CEF
Given a SIEM destination is configured with host, port, TLS (TLS 1.2+), credentials, and format (RFC5424 Syslog or CEF) When forwarding is enabled Then all new audit events are forwarded within 5 seconds of commit with stable event_id and partition-ordered delivery And the connection validates the SIEM certificate chain and hostname; failures prevent transmission and are logged And disconnections trigger exponential backoff with jitter and at-least-once delivery semantics; duplicates (if any) carry the same event_id for de-duplication And the offline queue can buffer at least 5,000,000 events or 72 hours, whichever comes first; threshold breaches emit alerts And health metrics (backlog_size, last_forward_timestamp, failure_count, last_error) are exposed via API and visible on the dashboard And a "Send test event" action produces a verifiable synthetic log at the SIEM within 10 seconds

Retention Policies and Legal Hold
Given a default retention of 365 days and tenant-configurable retention per event_type (90–1825 days) When a SecurityAdmin updates retention settings Then the policy is stored, audited, and applied to new data immediately and retroactively to existing data within 24 hours And a daily purge job permanently deletes data older than retention while preserving records with legal_hold = true; purge actions write a summary audit with counts per event_type And setting or clearing legal hold requires SecurityAdmin role, a case_id, and justification; removing a hold requires dual approval (two distinct approvers) within 24 hours And purged data is no longer retrievable via UI, API, export, or SIEM replay And a verification job samples at least 1% of eligible records to confirm deletion and reports success/failure
Prebuilt Compliance Reports and Executive Summaries
Given prebuilt templates for SOC 2 and ISO 27001 evidence When a SecurityAdmin generates a report for a time window Then the report includes mapped controls, evidence references (event_ids/links), completeness metrics, and a signed snapshot hash; generation is audited And executive summaries display IP Safe Zones effectiveness: coverage (% of Lifeline sessions within zones), allowed vs denied, attempted violations by site, and time-to-alert for tamper checks; dashboards refresh within 5 minutes of new data And reports can be scheduled (daily/weekly/monthly), delivered securely (link with expiry), and exported to PDF and JSON; scheduled runs are audited And counts and metrics in reports reconcile with the underlying logs for the same time window within 0.5%
Zone Health Monitoring and Drift Detection
"As a network administrator, I want proactive checks and alerts on Safe Zone accuracy so that access remains secure and operational during network changes."
Description

Monitors Safe Zones for staleness, overlapping or conflicting CIDRs, unreachable site networks, and expiring entries. Performs scheduled verification (e.g., depot egress IP checks) and alerts owners of discrepancies. Suggests cleanups and consolidations, and supports maintenance windows for planned IP changes. Integrates with inventory sources to auto-update known site IPs and reduces false positives through suppression rules.

Acceptance Criteria
Scheduled Verification Flags Unreachable or Stale Safe Zones
Given a Safe Zone with verification set to TCP:443 and a 15-minute schedule, when three consecutive verification attempts fail within 5 minutes, then the zone status is set to Unreachable and owners are alerted within 2 minutes with failure evidence. Given a Safe Zone with no successful verification and no access events for 30 days, when the daily staleness job runs, then a Stale Zone alert is created with a cleanup suggestion and due date 7 days out. Given an Unreachable or Stale Zone alert, when an active maintenance window overlaps the detection time, then the alert is suppressed and logged with reason Maintenance Window. Given a previously Unreachable zone that passes verification, when the next verification succeeds, then the alert is auto-resolved and resolution is recorded in the audit log within 2 minutes.
Real-time Overlap and Conflict Detection for CIDRs
Given an allowlist update that adds 10.1.0.0/16 to Zone A, when the change is saved, then the system detects overlaps against existing CIDRs within 60 seconds for both IPv4 and IPv6 spaces. Given two overlapping CIDRs owned by different teams or zones, when detected, then a High-severity Conflict alert is created with both owners listed and the overlap range enumerated. Given exact duplicate CIDR entries in the same zone, when a save is attempted, then the save is blocked with a Duplicate Entry error and the existing entry is referenced. Given two adjacent CIDRs under the same owner that can be summarized losslessly, when detected, then a Low-severity Consolidation Suggestion is generated with the proposed aggregate and impacted entries listed.
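Overlap and duplicate checks like those above need no custom bit arithmetic: Python's `ipaddress` module provides `ip_network` objects with an `overlaps()` method that works for both IPv4 and IPv6. A minimal pairwise sketch (the `find_conflicts` name and tuple shapes are assumptions; a production scan over large allowlists would use an interval tree rather than O(n²) pairs):

```python
import ipaddress

def find_conflicts(entries):
    """entries: list of (zone, cidr_string). Returns (duplicates, overlaps)."""
    nets = [(zone, ipaddress.ip_network(cidr)) for zone, cidr in entries]
    duplicates, overlaps = [], []
    for i in range(len(nets)):
        for j in range(i + 1, len(nets)):
            (za, a), (zb, b) = nets[i], nets[j]
            if a.version != b.version:
                continue  # IPv4 and IPv6 are evaluated in separate address spaces
            if a == b:
                duplicates.append((za, zb, str(a)))
            elif a.overlaps(b):
                overlaps.append((za, zb, str(a), str(b)))
    return duplicates, overlaps
```

Exact duplicates map to the blocking "Duplicate Entry" error, while non-identical overlaps feed the High-severity Conflict alert with both owners listed.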
Expiring Entry Notification and Escalation
Given an allowlist entry with an expiration date, when it is 14 days before expiry, then the owner receives a notification via email and Slack with a renew link; repeat at 7 days and 1 day if unacknowledged. Given an expiring entry that is not acknowledged within 24 hours of the T-1 day notice, when the escalation policy runs, then the on-call rotation is paged once and the service owner is tagged. Given an entry reaches its expiration time without renewal, when the expiry job runs, then the entry is disabled within 1 minute, new sessions from that IP are denied, and an Expired Entry alert is issued. Given a disabled entry is renewed by the owner, when the renewal is confirmed, then the entry is re-enabled within 2 minutes and the associated incident is auto-resolved.
Inventory Integration Auto-Updates Site Egress IPs
Given the CMDB updates a depot's egress IP from 203.0.113.10 to 203.0.113.22 and marks the source as authoritative, when the next sync runs, then the Safe Zone allowlist is updated within 10 minutes and the change is versioned with who/when/why. Given an inventory-driven update occurs outside a maintenance window, when processed, then a Non-actionable Info alert is posted to owners indicating Auto-Update Applied and no drift alert is raised. Given an inventory source is unavailable, when a sync is attempted, then the system retries with exponential backoff for up to 30 minutes and does not produce drift alerts solely due to source unavailability. Given an inventory update would create a duplicate or overlap, when applied, then duplicates are de-duped automatically and overlaps trigger the standard conflict workflow with the inventory job identified as the actor.
Maintenance Window Suppresses Planned-Change Alerts
Given a maintenance window is scheduled for Zone B from 01:00–03:00 with a 15-minute buffer, when drift or unreachable conditions are detected during 00:45–03:15, then alerts are suppressed and logged with the window ID. Given a maintenance window includes a planned IP change set, when verification runs during the window, then comparisons use the planned set and do not emit Egress-IP-Mismatch alerts. Given the maintenance window ends, when the next verification cycle runs, then reconciliation executes within 5 minutes and any remaining mismatches generate standard alerts with post-change evidence. Given a suppressed alert during the window, when the window closes, then the system emits a single summary event of suppressed findings instead of retroactive paging.
False-Positive Reduction via Suppression Rules
Given a suppression rule for transient failures <5 minutes exists, when a zone experiences verification failures that recover within 5 minutes, then no external alert is sent and a Suppressed event is recorded with duration and reason. Given a suppression rule scoped to a site and cause Inventory Change exists, when an inventory-driven IP change is applied, then drift alerts are suppressed and an informational change notice is logged instead. Given a suppression rule has a max duration of 2 hours, when the triggering condition persists beyond 2 hours, then a full alert is emitted at 2 hours with context that suppression was exceeded. Given suppressed events occur, when daily reports are generated, then suppressed counts and reasons are included without generating notifications to end-users.
Cleanup Suggestions for Duplicate and Aggregatable CIDRs
Given two entries 10.0.0.0/25 and 10.0.0.128/25 exist under the same owner, when the consolidation job runs, then a suggestion is created within 2 minutes to replace them with 10.0.0.0/24 including risk summary and impacted entries. Given non-contiguous blocks (10.0.0.0/25 and 10.0.1.0/25) exist, when the job runs, then no aggregation suggestion is produced and a rationale of Non-contiguous is recorded. Given an owner reviews a suggestion, when Approve is clicked, then the system applies the aggregate, archives superseded entries, and maintains active sessions without disruption; audit records capture before/after states. Given a suggestion is created, when unit/integration tests run, then cases cover IPv4 and IPv6 summarization, duplicate detection, and owner-attribution consistency with a 100% pass rate on defined scenarios.
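The lossless-summarization rule above (merge 10.0.0.0/25 and 10.0.0.128/25 into 10.0.0.0/24, but leave non-contiguous blocks alone) is exactly what `ipaddress.collapse_addresses` computes. A sketch of the consolidation check, with the function name assumed for illustration; note that `collapse_addresses` expects networks of a single IP version, so IPv4 and IPv6 entries would be collapsed separately:

```python
import ipaddress

def suggest_consolidation(cidrs):
    """Return a lossless summarization of same-owner CIDRs, or None if nothing collapses."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    collapsed = list(ipaddress.collapse_addresses(nets))
    if len(collapsed) < len(nets):
        return [str(n) for n in collapsed]
    return None  # rationale: entries are non-contiguous or already minimal
```

A `None` result corresponds to the "Non-contiguous" rationale in the criterion; a shorter list becomes the proposed aggregate in the cleanup suggestion.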

Dual-Issue Tokens

Requires two distinct approvers to enable Lifeline mode or mint emergency tokens, with clear context, justifications, audience impact, and SLA nudges for on-call approvers. Prevents unilateral bypasses and keeps emergency access accountable and auditable.

Requirements

Dual-Approver Authorization
"As an on-call duty manager, I want high-risk actions to require two distinct approvals so that no single person can bypass safeguards during an incident."
Description

Enforces a two-distinct-approver workflow for enabling Lifeline mode and minting emergency tokens. Supports policy-based configuration of eligible approver roles, required sequence (A then B or any order), timeouts, and cancellation rules. Integrates with OutageKit incidents to attach context and ensures approvals can be actioned via web console, SMS, or IVR. Blocks self-approval and duplicate approvals by the same individual, records each decision with timestamp and method, and surfaces pending requests in the operator console.

Acceptance Criteria
Lifeline Mode Activation Requires Two Distinct Approvals
Given a Lifeline mode activation request is created with incident ID, justification, audience impact, and ETA fields completed When Approver A approves via any channel Then the system prevents activation until a different eligible Approver B approves within the configured window And blocks approval by the requester or by Approver A again And activates Lifeline mode immediately upon the second distinct approval And records both approvals with timestamp, approver ID, role, and method
Emergency Token Minting Requires Two Distinct Approvals
Given an emergency token minting request is created with incident linkage, justification, audience impact, token scope, and TTL When Approver A approves via any channel Then the system prevents minting until a different eligible Approver B approves within the configured window And blocks approval by the requester or by Approver A again And mints the token immediately upon the second distinct approval with the configured scope and TTL And records both approvals with timestamp, approver ID, role, and method
Approver Eligibility and Sequence Policy Applied
Given a policy defining eligible approver roles and a required sequence (A then B) or any order When any user attempts to approve a request Then the system validates the user’s current role against the policy and rejects ineligible roles with a clear reason And enforces the configured sequence (e.g., blocks B until A completes) when applicable And accepts approvals in any order when policy is set to any order And logs the policy snapshot (roles and sequence) used for each decision
Approval Timeout, Reminders, and Auto-Cancel
Given a configured approval timeout and reminder cadence When no second approval is received before timeout Then the request auto-cancels and no Lifeline mode or token is applied And requester and approvers are notified of cancellation with reason and timestamps And SLA nudges/reminders are sent to on-call approvers per cadence until completion or cancellation And all reminders and cancellations are logged with timestamps, delivery status, and channel
Multi-Channel Approvals with Full Audit Trail
Given approvers can act via web console, SMS, or IVR When an approver submits an approve or reject via any one channel Then the decision is processed once idempotently and reflected across all channels And the audit log records approver ID, role, decision, channel, timestamp, request ID, and incident ID And subsequent attempts via other channels show the finalized state and do not alter the decision And audit logs are immutable and exportable
Pending Requests Visibility in Operator Console
Given there are active dual-approval requests When an operator opens the Pending Approvals panel Then each request displays incident ID, requester, justification summary, required roles/sequence, elapsed time, SLA remaining, and current approver state And the operator can filter by incident, request type (Lifeline or Token), region, approver role, and SLA status And authorized users can cancel requests with a required reason; the cancellation is logged and notifications are sent And counts, badges, and request rows update in near real time (<=5 seconds) after any approval or cancellation
Self-Approval and Duplicate Approval Prevention
Given any dual-approval request When the requester attempts to approve their own request Then the system blocks the action with a clear error and logs the attempt And when the same approver attempts to approve the same request a second time Then the system blocks the duplicate with a clear error and logs the attempt And completion requires two unique user IDs recorded as approvers
Separation-of-Duties Enforcement
"As a security administrator, I want enforced separation-of-duties rules so that emergency approvals remain accountable and cannot be self-approved or rubber-stamped by close collaborators."
Description

Validates approver distinctness and role separation using IdP group membership and identity signals (SAML/OIDC/SCIM). Enforces constraints such as different teams/shifts, no approving one’s own request, and configurable conflict-of-interest rules. Provides policy authoring UI and API, with real-time checks during approval and clear error feedback. Ensures device and session trust requirements are met before an approval is accepted.

Acceptance Criteria
Distinct Dual Approver Enforcement
Given a pending Dual-Issue Token action requiring two approvals When Approver A approves and Approver B attempts to approve Then the system shall verify Approver A and Approver B have different immutable user IDs from the IdP and block if identical And the system shall block any attempt by the requester to approve their own request with error "Self-approval is not permitted" And the system shall mark the action executable only after two distinct approvals are recorded within the policy-defined approval window
Cross-Team and Cross-Shift Separation Rules
Given separation-of-duties policy requires different teams and shifts When the second approver attempts to approve Then the system shall confirm team attributes differ per configured IdP group membership; otherwise block with error "Approvers must be from different teams" And the system shall confirm shift identifiers or on-call rotations differ per configured source; otherwise block with error "Approvers must be on different shifts"
IdP Identity and Group Attribute Validation
Given an approval attempt When identity and group attributes are evaluated Then the system shall validate OIDC/SAML token signature, audience, and expiry and resolve subject ID And the system shall resolve group memberships via SCIM/IdP with a cache TTL of at most 5 minutes And if any identity or group source is unavailable or data is stale beyond TTL, the system shall deny the approval with reason "Attribute source unavailable" and log the dependency; no bypass is permitted And p95 evaluation latency for identity/attribute checks shall be ≤ 500 ms under normal load
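The 5-minute cache TTL for group memberships above implies a cache that treats stale entries as misses, forcing a re-fetch from the IdP (and, per the criterion, a deny if the source is unavailable). A minimal sketch with an injectable clock for testability; the `AttributeCache` name is an assumption:

```python
import time

class AttributeCache:
    """Group-membership cache honoring a max TTL; stale entries behave as misses."""
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # subject_id -> (groups, fetched_at)

    def put(self, subject_id, groups):
        self._store[subject_id] = (groups, self.clock())

    def get(self, subject_id):
        """Return cached groups, or None if absent or older than the TTL."""
        hit = self._store.get(subject_id)
        if hit is None:
            return None
        groups, fetched_at = hit
        if self.clock() - fetched_at > self.ttl:
            del self._store[subject_id]  # evict so the caller must re-fetch
            return None
        return groups
```

On a `None` return the caller re-resolves via SCIM/IdP; if that call fails, the approval is denied with "Attribute source unavailable" rather than served from stale data.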
Configurable Conflict-of-Interest Policy Authoring (UI and API)
Given an administrator with Policy Admin privileges When they create or edit a separation-of-duties policy via UI or API Then the system shall support rules including: not same requester, not same team, not same shift, not same manager (manager chain depth ≤ 2), and custom attribute comparisons And the system shall validate and block publish on errors, returning structured errors with field/line references And the system shall support draft/preview mode that evaluates sample approvals and returns allow/deny with matched rules without enforcing And a published policy version shall take effect for new approval attempts within 60 seconds and be versioned and retrievable for audit
Real-Time Violation Feedback and Auditable Decision Logging
Given an approver violates a separation rule or trust requirement When they attempt to approve via UI or API Then the response shall be returned within 1 second with HTTP 409 (API) or inline error containing rule ID, rule name, and remediation hint And an audit record shall be persisted within 2 seconds containing requester ID, approver ID, failed rule(s), evaluated attributes (redacted per policy), policy version, device/session posture result, and correlation ID And the system shall prevent any state change to the request and keep it pending for alternate approvers
Device and Session Trust Enforcement for Approvals
Given device and session trust policy is enabled When an approver initiates an approval action Then the system shall verify MFA freshness within policy (default ≤ 12 hours), valid device compliance attestation, and session network meets allowlist/geo policy And if any trust check fails, the approval shall be denied with a specific error indicating which trust condition failed and remediation steps And all trust checks shall be re-evaluated on every approval attempt (not solely at login) and recorded in audit with posture evidence hash
Justification & Impact Capture
"As an operations manager, I want approvers to submit structured justifications and impact details so that decisions are defensible and audit-ready."
Description

Requires structured justification fields for all high-risk actions, including reason, intended audience impact, scope, expected duration, and incident linkage. Presents templates and guidance to standardize input, auto-populates known incident data, and validates completeness before submission. Stores all inputs in an immutable, queryable audit record with change history and export capability to SIEM/compliance systems.

Acceptance Criteria
Mandatory Fields & Completeness Validation
Given a user initiates any high-risk action (enable Lifeline mode or mint emergency token) When the Justification & Impact form opens Then the following fields are present and required before submission: Reason (20–500 chars), Intended Audience Impact (select from taxonomy; optional free-text up to 200 chars), Scope (one or more: Services, Geography, Customer Class), Expected Duration (minutes 1–1440 or ISO-8601 interval), Incident Link (existing Incident ID or create new incident) Given the user attempts to submit with any required field missing or invalid When they click Submit Then the submission is blocked, invalid fields are highlighted with inline errors and an error summary, and the primary action remains disabled until all errors are resolved Given all inputs are valid When the user submits Then the system persists the record and returns a 201 Created with the record ID within 2 seconds at p95
Auto-Populate Incident Context
Given the form is launched from an Incident detail page When it loads Then Incident Link, Severity, Start Time, Affected Regions, and Service are auto-populated from the incident, with a visible “Last synced <timestamp>” indicator Given the user edits any auto-populated field When background re-sync occurs Then user-entered values are not overwritten without explicit user confirmation Given no incident context is available When the user searches for an incident to link Then typeahead returns matches by ID, title, or tag within 300 ms at p95 for a dataset of 50k incidents
Approval Payload Contains Structured Justification
Given a user submits a Dual-Issue Token request or enables Lifeline mode When approval requests are dispatched to two distinct approvers Then the full structured justification is included and visible in the approver UI and notification channels (SMS/email/IVR summaries truncated to 240 chars without losing required fields) Given an approver opens the request When required justification fields are missing or invalid Then the Approve action is disabled and the approver can return the request to the requester with a prefilled “complete justification” prompt Given both approvers approve When the action is executed Then the executed change record is linked to the justification record ID in the audit trail
Immutable Audit Record & Versioned Change History
Given a justification is submitted When it is stored Then it is written to an append-only audit log with a cryptographic hash of the payload and previous entry and a monotonic sequence number Given any post-submission edit occurs When the user saves changes Then a new version is created capturing who/when/what diffs; prior versions remain unchanged and retrievable; the hash chain verifies end-to-end integrity Given an auditor requests a specific record by ID When the API responds Then it returns the latest version plus ordered version history and a verification endpoint that returns a valid chain proof for the record
Queryable Audit & SIEM Export
Given an auditor uses the Audit UI or API When they filter by date range, action type, requester, approver, incident ID, or token ID and sort by timestamp Then results return within 2 seconds at p95 for up to 10k records, with pagination (page size 100) and stable cursors Given an export is initiated When exporting to JSON download, RFC5424 syslog, or Splunk HEC Then the system emits records including all justification fields, version, approvers, and timestamps within 60 seconds and updates delivery status to Success, Retry, or Failed Given a transient delivery failure occurs When retry logic runs Then exponential backoff with jitter is applied for up to 24 hours, operators can manually requeue exports, and all attempts are logged
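The retry behavior above ("exponential backoff with jitter ... for up to 24 hours") is commonly implemented as a "full jitter" schedule: each attempt sleeps a uniform random interval up to an exponentially growing (and capped) ceiling, until the total retry budget is spent. A sketch, with the generator name and defaults assumed for illustration:

```python
import random

def backoff_schedule(base=1.0, factor=2.0, cap=3600.0,
                     max_total=24 * 3600.0, rng=random.random):
    """Yield sleep intervals with full jitter until the retry budget is spent."""
    total, attempt = 0.0, 0
    while total < max_total:
        ceiling = min(cap, base * (factor ** attempt))
        delay = ceiling * rng()          # "full jitter": uniform in [0, ceiling)
        delay = min(delay, max_total - total)  # never exceed the 24h budget
        total += delay
        attempt += 1
        yield delay
```

A delivery worker would `time.sleep(delay)` between attempts and mark the export Failed once the generator is exhausted, at which point operators can manually requeue it.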
Templates & Guidance for Standardized Input
Given the action type is Lifeline enablement or emergency token minting When the form opens Then the appropriate justification template is applied with labeled sections and example phrasing aligned to taxonomy Given a user focuses any field When contextual help is available Then accessible field-level guidance and examples are displayed via help text or tooltip and can be dismissed Given keyboard-only or screen reader interaction When navigating the form Then all controls are reachable in a logical tab order, have correct ARIA labels, and the form meets WCAG 2.1 AA for form interactions
Scoped Emergency Tokens & TTL
"As a platform engineer, I want emergency tokens to be tightly scoped and time-limited so that elevated access is contained and automatically expires."
Description

Provides granular scoping when minting emergency tokens, limiting accessible resources, geographic areas, permitted actions, and maximum concurrency. Supports configurable TTLs, one-time use, pre-expiry reminders, and immediate revocation. Tokens are signed, auditable, and enforced across OutageKit services and APIs, with runtime checks and automatic expiry to minimize blast radius.

Acceptance Criteria
Scoped Access By Resource, Geography, and Actions
Given an emergency token is minted with scopes resources=[incidents.read, broadcast.send], geography=[County-A], actions=[read_incidents, send_broadcast] When requests using the token read incidents within County-A Then responses are 200 and results are filtered to County-A only When requests using the token read incidents outside County-A Then responses are 403 with error_code=SCOPE_GEO_DENIED and an audit event is recorded When requests attempt an endpoint not included in resources Then responses are 403 with error_code=SCOPE_RESOURCE_DENIED and no side effects occur When sending a broadcast to recipients outside geography scope Then the request is blocked with 403, no broadcast is created, and an audit event is recorded
Max Concurrency Limit Enforcement
Given an emergency token is minted with max_concurrent_sessions=2 And two active sessions are established using the token When a third session attempt is made anywhere in the system Then the attempt is rejected with 429 error_code=CONCURRENCY_LIMIT and no session is created When one of the two active sessions terminates Then a new session can be established within 5 seconds And active session count is enforced across all OutageKit services and APIs
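Enforcing `max_concurrent_sessions` across services ultimately reduces to an atomic check-and-increment on the set of active sessions for a token. A single-process sketch using a lock (a distributed deployment would hold this state in a shared store such as Redis; the class name is an assumption):

```python
import threading

class ConcurrencyLimiter:
    """Tracks active sessions for one token; rejects beyond max_concurrent."""
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self._active = set()
        self._lock = threading.Lock()

    def acquire(self, session_id: str) -> bool:
        """Return True if the session may start; False maps to HTTP 429 CONCURRENCY_LIMIT."""
        with self._lock:
            if len(self._active) >= self.max_concurrent:
                return False
            self._active.add(session_id)
            return True

    def release(self, session_id: str) -> None:
        with self._lock:
            self._active.discard(session_id)
```

Releasing a slot immediately frees capacity, which is what lets a new session establish "within 5 seconds" of a termination in the criterion above.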
Configurable TTL and Automatic Expiry
Given an emergency token is minted with ttl=30m (measured from issued_at) When the wall-clock reaches issued_at + 30m Then all OutageKit services reject further requests with 401 error_code=TOKEN_EXPIRED within 60 seconds And the token is marked expired in the audit trail with end_time set And cached authorizations are invalidated within 60 seconds of expiry
One-Time Use Token Consumption
Given an emergency token is minted with one_time_use=true When the token is used in the first successful authorized request Then the token is immediately invalidated and cannot be used again When any subsequent request presents the token Then it is rejected with 401 error_code=TOKEN_CONSUMED and an audit event is recorded When the first request using the token fails authentication/authorization Then the token is not consumed
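The TTL, revocation, and one-time-use rules above compose into a small ordered decision: unknown, revoked, expired, then consumed, with consumption recorded only on a successful authorization. An in-memory sketch with an injectable clock (the `TokenStore` name and error-code strings mirror the criteria but the implementation is assumed):

```python
import time

class TokenStore:
    """In-memory check for TTL expiry, revocation, and one-time-use consumption."""
    def __init__(self, clock=time.time):
        self.clock = clock
        self.tokens = {}  # token_id -> state dict

    def mint(self, token_id, ttl_seconds, one_time=False):
        self.tokens[token_id] = {"expires_at": self.clock() + ttl_seconds,
                                 "one_time": one_time, "consumed": False,
                                 "revoked": False}

    def revoke(self, token_id):
        if token_id in self.tokens:
            self.tokens[token_id]["revoked"] = True

    def authorize(self, token_id):
        """Return an error_code string, or None when the request is allowed."""
        t = self.tokens.get(token_id)
        if t is None:
            return "TOKEN_UNKNOWN"
        if t["revoked"]:
            return "TOKEN_REVOKED"
        if self.clock() >= t["expires_at"]:
            return "TOKEN_EXPIRED"
        if t["one_time"] and t["consumed"]:
            return "TOKEN_CONSUMED"
        if t["one_time"]:
            t["consumed"] = True  # consumed only on a successful authorization
        return None
```

Because `consumed` is set only after all other checks pass, a request that fails authentication or authorization does not burn the one-time token, as the last criterion requires.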
Pre-Expiry Reminder Notifications
Given an emergency token is minted with ttl=2h and reminder_window=10m When the time reaches expiry_minus_10m Then reminder notifications are sent to the token requester and both approvers via configured channels (email and SMS) within 1 minute And the reminder includes token_id, scopes, geography, actions, expires_at, and a revoke link And no reminders are sent if the token has already been revoked or expired And a notification delivery audit record is stored with success/failure per channel
Immediate Revocation Propagation
Given an emergency token is active When an authorized user triggers revoke Then all active sessions using the token are terminated within 30 seconds And subsequent API requests using the token are rejected with 401 error_code=TOKEN_REVOKED within 30 seconds across all services And no new sessions may be established with the token after revocation And a revocation audit event is recorded with actor, justification, timestamp, and affected scopes
Signature Verification and Auditable Usage
Given emergency tokens are issued as JWS-signed artifacts with kid referencing the active key When any OutageKit service receives a request with a token Then it verifies the signature against the current or valid previous key from JWKS and rejects on failure with 401 error_code=BAD_SIGNATURE And all successful and denied token uses create immutable audit entries including token_id, actor, action, resource, geography, timestamp, justification, and outcome When signing keys rotate Then previously issued tokens remain valid until their expiry and continue to verify against the rotated key set
Approver Context & Multichannel Notifications
"As an on-call approver, I want clear, actionable context delivered on any channel so that I can make fast, informed approval decisions without opening multiple tools."
Description

Bundles a concise context packet for approvers containing incident summary, justification snapshot, affected customer count/map, proposed scope, and SLA timing. Delivers actionable notifications via SMS, email, push, and IVR with secure deep links and code-based confirmation for low-connectivity scenarios. Tracks delivery and interaction status, retries intelligently, and localizes content by approver preference.

Acceptance Criteria
Approver Context Packet Assembled and Attached
Given an emergency access request is initiated When the approval packet is generated Then it includes: incident summary (<=500 chars), justification snapshot (<=300 chars), affected customer count, impact map (image or link), proposed scope, and SLA countdown timestamp And the packet renders correctly on web and mobile previews And per-channel payload limits are respected: email <=500KB, SMS <=1600 chars (segmented), IVR TTS <=45 seconds And no customer PII is included beyond aggregated counts and geospatial cluster visualizations
Secure Deep Link and Code-Based Confirmation
Given an approver receives a notification When the secure deep link is opened Then the approver is authenticated via SSO or time-bound magic token and shown the approval screen And the link is single-use, expires in 15 minutes, and is bound to approver ID and request ID And in low-connectivity scenarios the approver can confirm via a 6-digit code by SMS reply or IVR DTMF And code verification is rate-limited to 5 attempts/hour with lockout and alert after 3 failures
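A minimal sketch of the rate-limit and lockout rule, assuming a rolling one-hour window; the `CodeGuard` class is illustrative, not OutageKit's implementation:

```python
from collections import defaultdict

MAX_ATTEMPTS_PER_HOUR = 5     # hard cap on code entries per approver
LOCKOUT_AFTER_FAILURES = 3    # consecutive failures before lockout + alert

class CodeGuard:
    def __init__(self):
        self.attempts = defaultdict(list)   # approver_id -> attempt times (seconds)
        self.failures = defaultdict(int)    # approver_id -> consecutive failures

    def try_code(self, approver: str, now: float, code_ok: bool) -> str:
        # Prune attempts outside the rolling one-hour window.
        window = [t for t in self.attempts[approver] if now - t < 3600]
        self.attempts[approver] = window
        if self.failures[approver] >= LOCKOUT_AFTER_FAILURES:
            return "locked"        # a real system would also raise an alert here
        if len(window) >= MAX_ATTEMPTS_PER_HOUR:
            return "rate_limited"
        self.attempts[approver].append(now)
        if code_ok:
            self.failures[approver] = 0
            return "approved"
        self.failures[approver] += 1
        return "denied"
```

Usage: three wrong codes in a row lock the approver out, and even correct codes are refused after the fifth attempt inside an hour.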
Preference-Driven Multichannel Notification Delivery
Given approver channel and locale preferences are stored When a request is sent for approval Then notifications are dispatched according to the approver’s ranked preferences across SMS, email, push, and IVR And disabled or unavailable channels for that approver are automatically skipped And each message includes a secure deep link and clear code-entry fallback instructions And content is trimmed to channel constraints while preserving mandatory fields
Delivery and Interaction Tracking
Given notifications are dispatched When delivery and user interactions occur Then per-channel events are recorded: queued, sent, delivered, failed/bounced, opened, link-clicked, IVR answered, code-entered, approve/deny And all events include timestamps in UTC with millisecond precision, channel, approver ID, request ID, and correlation IDs And events are visible in the audit log UI and exportable as CSV/JSON And unreachable destinations (e.g., invalid phone/email) are flagged with remediation suggestions
Intelligent Retries and SLA-Based Escalation
Given an SLA countdown is attached to the approval request When no approval decision is received within 2 minutes Then a nudge is sent on the next preferred available channel And additional nudges are sent at T+4 and T+8 minutes unless a decision is recorded And at T+10 minutes the request escalates to the on-call backup approver with full context And retries cease immediately upon approval or explicit decline And quiet hours and on-call rules are honored according to configuration
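The retry timeline above can be sketched as a pure function over elapsed time. The offsets come from the criterion; the function itself is illustrative:

```python
NUDGE_OFFSETS_MIN = [2, 4, 8]   # progressive nudges at T+2, T+4, T+8 minutes
ESCALATE_AT_MIN = 10            # hand off to the on-call backup approver at T+10

def pending_actions(elapsed_min: float, decided: bool) -> list[str]:
    """Return the nudges/escalations still due given elapsed time and decision state."""
    if decided:
        return []   # retries cease immediately on approval or explicit decline
    due = [f"nudge@T+{t}m" for t in NUDGE_OFFSETS_MIN if t > elapsed_min]
    if elapsed_min < ESCALATE_AT_MIN:
        due.append(f"escalate@T+{ESCALATE_AT_MIN}m")
    return due

assert pending_actions(5, decided=False) == ["nudge@T+8m", "escalate@T+10m"]
assert pending_actions(3, decided=True) == []
```

Quiet-hours and on-call rules would filter this schedule further in a real scheduler, per the last clause of the criterion.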
Localization and Accessibility Compliance
Given an approver has a language and locale preference When messages and IVR prompts are generated Then all content (templates, numerals, dates, times) is localized to that locale And if a translation is missing, the system falls back to English with a logged warning And RTL languages render with correct directionality and punctuation And IVR uses the correct TTS voice per locale and includes phonetic handling for numbers and times And all text content meets WCAG 2.1 AA contrast and clarity guidelines in supported clients
Low-Connectivity IVR Fallback Flow
Given the approver’s device lacks reliable data When the IVR call is connected Then the IVR summarizes the context within 45 seconds and offers Approve/Decline options And the approver can enter the 6-digit code followed by # to confirm the action And the system plays a confirmation message and sends a follow-up SMS/email receipt And the action, call SID, and DTMF result are logged to the audit trail
SLA Nudges & Escalations
"As an incident commander, I want SLA-based reminders and escalations for pending approvals so that Lifeline mode and emergency access are not delayed."
Description

Applies SLA-aware reminders and escalation policies when approvals are pending, with time windows tailored to incident severity and audience impact. Sends progressive nudges across channels, escalates to secondary approvers or duty managers via on-call integrations (PagerDuty/Opsgenie), and pauses during quiet hours per policy. Captures response times, breach alerts, and provides analytics to tune SLAs and schedules.

Acceptance Criteria
Sev1 Lifeline request: 2-minute SLA nudge cadence
Given a Sev1 Lifeline-mode enable request requiring two distinct approvers and audience impact ≥ 10,000 accounts When the request is created during active hours and no approvals have been recorded Then Nudge #1 is sent via in-app + SMS to both primary approvers within 2 minutes of request creation And Nudge #2 is sent via email at T+5 minutes if neither approver has responded And Nudge #3 is sent via SMS at T+9 minutes if still no response And nudges cease immediately upon the second approval or upon a denial by any approver And all nudge events are logged with channel, recipient, and timestamp for analytics
Sev2 emergency token: escalation to secondary approver via PagerDuty
Given a Sev2 emergency token minting request pending with two primary approvers and a configured secondary-approver group in PagerDuty When no primary approver has acknowledged by T+10 minutes Then an escalation is created in PagerDuty targeting the secondary-approver on-call rotation with incident title containing request ID, justification summary, audience impact, and remaining SLA And duplicate escalations for the same request are suppressed for 15 minutes And when any escalated approver acknowledges in PagerDuty, acknowledgment is reflected in OutageKit within 30 seconds and further nudges pause And when the request is fully approved or denied, the PagerDuty incident is auto-resolved within 60 seconds
Sev3 low-impact request: quiet hours pause and deferment
Given a Sev3 approval request created during configured quiet hours for the approver's team When the request does not have a 'critical override' flag Then no nudges or escalations are sent during quiet hours And a deferred schedule is created to send Nudge #1 within 5 minutes of quiet hours ending And the pause is logged with reason 'quiet hours policy' and next planned send time And if a manual approval or denial occurs during quiet hours, all deferred nudges are canceled immediately
Progressive multichannel nudges with required context and compliance
Given a pending approval request with two approvers who have channel preferences (SMS, email, in-app) When nudges are sent Then the first nudge uses each approver's highest-priority reachable channel and includes: request type (Lifeline/Token), justification, audience impact band, remaining SLA time, and one-tap approve/deny links And subsequent nudges rotate channels without exceeding 4 total nudges per approver per request And SMS nudges respect opt-in/opt-out; if an approver opted out of SMS, no SMS is sent and an alternate channel is used And all approval links are signed and expire upon state change (approve/deny/cancel/timeout)
SLA breach: duty manager escalation via Opsgenie
Given a pending approval exceeding its SLA threshold without two approvals When the breach occurs Then a 'SLA Breach' alert is created in Opsgenie to page the duty manager on-call including severity and elapsed time And the duty manager can approve/deny via secure link; their action is recorded as 'escalated approver' in the audit trail And when the alert is acknowledged in Opsgenie, further nudges to primary approvers pause And when approvals are completed or the request is denied, the Opsgenie alert is auto-closed within 60 seconds
Telemetry and analytics available for SLA tuning
Given a set of approval requests over a reporting period When analytics for 'SLA Nudges & Escalations' are viewed Then the dashboard shows distributions of time-to-first-ack and time-to-final-approval by severity and audience impact band, nudge counts per request, channel conversion rates, and quiet-hours deferrals And filters allow slicing by team, approver, time-of-day, and integration (PagerDuty/Opsgenie) And CSV export includes event timestamps, channel used, outcome, and anonymized approver IDs And metrics freshness is under 5 minutes, with at least 90 days of history retained

Scoped Safe Mode

Applies least-privilege controls to Lifeline sessions—permitting essential actions (status updates, ETR confirmations, crew sync) while automatically blocking high-risk changes (role edits, integration reconfigs). Balances speed and safety when stakes are high.

Requirements

Safe Mode Session Scoping Engine
"As an incident commander, I want to activate Safe Mode for a specific outage so that my team can work quickly within a controlled scope without risking unrelated system changes."
Description

Defines and manages Lifeline Scoped Safe Mode sessions with explicit boundaries across who, what, where, and when. Supports activation per incident, region, or tenant with configurable TTL and auto-expiry on incident resolution. Enforces scope consistently across web console and API so only in-scope resources and operations are reachable. Integrates with RBAC/SSO to inherit identity while overlaying temporary least-privilege session policies. Provides triggers to auto-enable on declared major incidents or via API/CLI and guarantees deterministic deactivation with rollback to pre-session privileges.

Acceptance Criteria
Activate Safe Mode scoped to a single incident via Web Console
Given an authenticated user with authorization to activate Safe Mode And an open incident I-123 exists When the user enables Safe Mode for incident I-123 with TTL 60 minutes via the Web Console Then the system creates a Scoped Safe Mode session with scope=incident:I-123, ttl=60m, status=active, and a unique session_id And a session banner with session_id and scope appears in the console within 5 seconds And enforcement begins within 10 seconds of activation And an audit record is written capturing actor, scope, ttl, reason, channel=web, and timestamp
Activate Safe Mode scoped to a region via API/CLI
Given an authenticated API client with permission to manage Safe Mode And a defined region R-NE exists When a POST /safe-mode/sessions is made with scope=region:R-NE and ttl=30m Then the API returns 201 Created with session_id, scope, ttl, status=active And resources in region R-NE are marked in-scope and others out-of-scope within 10 seconds And an audit record is written with actor, scope, ttl, reason, channel=api, and timestamp
Enforce least-privilege overlay with RBAC/SSO identity inheritance
Given a user authenticated via SSO with baseline RBAC privileges including high-risk operations And a Safe Mode session is active with any scope When the user attempts high-risk operations (role edits, RBAC policy changes, integration reconfigurations) Then each attempt is denied with HTTP 403 and error_code=SAFE_SCOPE_BLOCKED_OPERATION and is logged And allowed Safe Mode operations (status updates, ETR confirmations, crew sync) succeed with 2xx responses And the user's effective permissions equal intersection of baseline privileges and Safe Mode allowlist for the session scope
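The "effective permissions equal the intersection of baseline privileges and the Safe Mode allowlist" rule can be sketched directly with set intersection; the allowlist contents below are illustrative:

```python
# Allowed Safe Mode operations, per the criterion above (illustrative set).
SAFE_MODE_ALLOWLIST = {"update_outage_status", "confirm_etr", "sync_crew"}

def effective_permissions(baseline: set[str], safe_mode_active: bool) -> set[str]:
    # Outside Safe Mode the RBAC/SSO baseline applies unchanged.
    if not safe_mode_active:
        return baseline
    # In Safe Mode, even admins lose anything outside the allowlist.
    return baseline & SAFE_MODE_ALLOWLIST

admin = {"update_outage_status", "confirm_etr", "role_edit", "integration_reconfigure"}
assert effective_permissions(admin, True) == {"update_outage_status", "confirm_etr"}
assert effective_permissions(admin, False) == admin
```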
Block out-of-scope access across Web Console and REST API
Given a Safe Mode session scoped to incident I-123 is active When the user navigates to incidents outside I-123 in the Web Console Then out-of-scope incidents are hidden or disabled and actionable controls are unavailable When the user calls API endpoints for out-of-scope resources Then the API responds 403 with error_code=SAFE_SCOPE_VIOLATION and includes session_id and scope in the response payload And all in-scope requests succeed with 2xx and responses include header X-Safe-Mode: active; session={session_id}
Auto-trigger and idempotent activation on Major Incident declaration
Given an integration publishes a Major Incident declaration MI-77 for region R-NE And a Safe Mode auto-trigger policy exists for major incidents When the event is received Then a Safe Mode session is activated with scope derived by policy (e.g., region:R-NE) within 30 seconds And if an equivalent active session already exists, no new session is created and an audit entry records idempotent activation And an audit record is written with trigger=major_incident_event and correlation_id of MI-77
Auto-expiry on incident resolution and TTL timeout with privilege rollback
Given a Safe Mode session scoped to incident I-123 with ttl=45m is active When incident I-123 is marked Resolved Then the session auto-deactivates within 60 seconds of the resolution event And the user's effective permissions revert to the pre-session snapshot across web and API And a deactivation audit record includes session_id, reason=incident_resolved, timestamp, and privileges_before/after When ttl expires without resolution Then the session auto-deactivates within 60 seconds of ttl expiry with reason=ttl_expired and identical rollback guarantees
Deterministic manual deactivation with full state cleanup
Given a Safe Mode session with session_id S-001 is active When an authorized actor calls DELETE /safe-mode/sessions/S-001 or clicks Deactivate in the console Then the API returns 200 OK and the session status becomes deactivated with ended_at timestamp And all Safe Mode enforcement stops within 10 seconds and effective permissions equal the pre-session snapshot And UI banners and indicators clear within 10 seconds and API responses include X-Safe-Mode: inactive And session caches, tokens, and in-memory state are purged across nodes within 60 seconds And a final audit record is written with session_id, actor, deactivation_channel, and checksum of restored privileges
Least-Privilege Action Allowlist
"As an operations manager, I want only essential actions to be available in Safe Mode so that responders can act fast without exposure to risky capabilities."
Description

Implements a centrally managed allowlist of essential actions permitted during Scoped Safe Mode, including status updates, ETR confirmations, crew assignment and sync, incident notes, and targeted customer notifications. Provides fine-grained operation-level controls (for example, update_outage_status, confirm_etr) with contextual constraints such as limiting actions to affected circuits or geographies. Ships with secure defaults, supports per-tenant overrides and templates, and mirrors UI controls with server-side enforcement to prevent client-side bypass.
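A hedged sketch of how such a decision might combine the operation allowlist with contextual scope, returning the reason codes used in the acceptance criteria; the `decide` function is illustrative, not the actual enforcement service:

```python
# Essential operations permitted in Scoped Safe Mode (illustrative).
ALLOWLIST = {"update_outage_status", "confirm_etr", "assign_crew", "sync_crew",
             "add_incident_note", "send_targeted_notification"}

def decide(operation: str, target_circuit: str, scope_circuits: set[str]) -> tuple[int, str]:
    # High-risk operations are blocked regardless of scope.
    if operation not in ALLOWLIST:
        return 403, "SAFE_MODE_BLOCK"
    # Essential operations still fail if the target is outside the session scope.
    if target_circuit not in scope_circuits:
        return 403, "SCOPE_VIOLATION"
    return 200, "ALLOWED"

assert decide("role_edit", "C1", {"C1"}) == (403, "SAFE_MODE_BLOCK")
assert decide("confirm_etr", "C9", {"C1"}) == (403, "SCOPE_VIOLATION")
assert decide("confirm_etr", "C1", {"C1"}) == (200, "ALLOWED")
```

Server-side enforcement means this check runs on every API call, mirroring (not trusting) whatever the UI disables.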

Acceptance Criteria
Safe Mode Allows Essential Operations
- Given Safe Mode is active for tenant T and incident I is in scope, When an authorized operations user performs update_outage_status on I, Then the action is accepted (HTTP 2xx), persisted, and visible in UI and API within 5 seconds.
- Given Safe Mode is active and incident I is in scope, When confirm_etr is submitted with a valid ETR timestamp, Then the ETR is saved, audit logged, and broadcast eligibility is updated; response HTTP 2xx.
- Given Safe Mode is active and incident I is in scope, When assign_crew and sync_crew are invoked with valid crew IDs, Then the crew linkage is updated and visible in crew views within 5 seconds; response HTTP 2xx.
- Given Safe Mode is active and incident I is in scope, When add_incident_note is submitted, Then the note is saved with author and timestamp and appears in incident timeline; response HTTP 2xx.
- Given Safe Mode is active and a customer segment S is derived from I’s affected area, When send_targeted_notification is invoked to S, Then only customers in S receive the message; response HTTP 2xx and delivery count matches S size (±0%).
High-Risk Operations Are Blocked in Safe Mode
- Given Safe Mode is active, When a user attempts role_edit, Then the server rejects with HTTP 403 and error_code=SAFE_MODE_BLOCK; no change occurs.
- Given Safe Mode is active, When a user attempts integration_reconfigure (e.g., rotate API key, change webhook URL), Then the server rejects with HTTP 403 and error_code=SAFE_MODE_BLOCK; no change occurs.
- Given Safe Mode is active, When a user attempts policy_delete or widen_policy_scope beyond safe defaults, Then the server rejects with HTTP 403 and error_code=SAFE_MODE_BLOCK; no change occurs.
- Given Safe Mode is active, When UI controls for blocked operations are rendered, Then they are disabled or hidden; attempts via API remain blocked.
Contextual Scoping by Circuit and Geography
- Given Safe Mode is active with scope limited to circuits {C1..Cn} and geographies {G1..Gm}, When update_outage_status targets incident J outside {C1..Cn} or {G1..Gm}, Then the server rejects with HTTP 403 and error_code=SCOPE_VIOLATION.
- Given Safe Mode is active with scope S, When confirm_etr is submitted for an incident outside S, Then the server rejects with HTTP 403 and error_code=SCOPE_VIOLATION.
- Given Safe Mode is active with scope S, When send_targeted_notification is requested for recipients outside S, Then recipients outside S are excluded; response includes excluded_count > 0 and decision reason=SCOPE_FILTERED.
- Given Safe Mode is active with scope S, When assign_crew is invoked for an incident outside S, Then the request is rejected with HTTP 403 and error_code=SCOPE_VIOLATION.
Server-Side Enforcement Prevents Client Bypass
- Given Safe Mode is active, When a client calls a blocked endpoint directly via API using a valid token, Then the server returns HTTP 403 with error_code=SAFE_MODE_BLOCK and includes policy_id and policy_version in the payload.
- Given Safe Mode is active, When a client tampers with the UI to re-enable a blocked button and submits the request, Then the server denies the request with HTTP 403 and logs reason=client_bypass_prevented.
- Given Safe Mode is active, When an internal service attempts a disallowed operation via service account, Then the same policy is enforced and the action is denied; audit entry includes service_account=true.
- Given Safe Mode is active, When an allowed operation is called with extra parameters to escalate privileges, Then those parameters are ignored or validated; only allowlisted fields are applied; response indicates ignored_fields if any.
Secure Defaults and Per-Tenant Policy Templates
- Given a new tenant with no overrides, When Safe Mode is first activated, Then the default allowlist (essential operations with scoped constraints) is applied; server returns policy_id=default and policy_version >= 1.
- Given a tenant admin with policy_manage permission, When they apply the "Utility-Local-Default" template and publish, Then the template becomes the active allowlist within 30 seconds; subsequent decisions reference the new policy_id and version.
- Given a tenant admin edits an override with invalid schema or unknown operations, When they attempt to publish, Then validation fails with HTTP 400, error_code=POLICY_INVALID, and the previous policy remains active.
- Given a tenant has an active override, When they revert to defaults, Then the default policy becomes active within 30 seconds and is recorded in audit logs with reason=revert_to_default.
Auditability and Decision Performance
- Given Safe Mode is active, When any allowed or denied operation is evaluated, Then an audit log is written containing tenant_id, user_or_service_id, operation, incident_id (if any), scope, decision, reason_code, policy_id, policy_version, and timestamp within 2 seconds.
- Given sustained load of 100 decisions per second per tenant, When decisions are evaluated, Then p95 decision latency <= 100 ms and p99 <= 250 ms measured at the enforcement service.
- Given Safe Mode is active, When metrics are scraped, Then counters for allowed_count and denied_count by operation and reason_code are exposed via /metrics and reflect the last minute’s activity within 10% of actual.
High-Risk Change Blocking and Guardrails
"As a platform owner, I want high-risk changes blocked in Safe Mode so that we avoid accidental configuration drift and security incidents during critical operations."
Description

Automatically blocks high-impact changes during Safe Mode, including role and permission edits, integration reconfiguration, API key and webhook management, notification template changes, and account-wide settings. Presents contextual guardrails in the UI with rationale and links to request temporary elevation. Enforces policy at the API layer to prevent scripted or third-party bypass and returns structured error codes suitable for automation handling.

Acceptance Criteria
Block Role and Permission Edits During Safe Mode
Given Safe Mode is active for the tenant And a user with Org Admin privileges attempts to edit a user's role or permissions via the web console When the user submits the change Then the action is blocked and no changes are persisted And a guardrail banner appears containing the phrase "Safe Mode blocks role and permission changes" and a link labeled "Request Temporary Elevation" And an audit log entry is created with eventType="safe_mode_block", target="roles_permissions", actor=<currentUserId>, outcome="blocked"
Given Safe Mode is active When a PATCH/PUT/DELETE request targets /v1/admin/roles or /v1/admin/users/{id}/permissions Then the API responds HTTP 403 with Content-Type application/json and body containing:
- error.code="SAFE_MODE_BLOCKED"
- error.target="roles_permissions"
- error.scope="tenant"
- correlationId (non-empty UUID)
- remediation.url (https link)
And no role or permission records are changed
Block Integration Reconfiguration and Secret Management During Safe Mode
Given Safe Mode is active And a user attempts to modify any integration configuration (e.g., provider credentials, endpoint URLs) via the web console When the user clicks Save Then the Save action is disabled with tooltip "Disabled in Safe Mode" And a guardrail panel explains the rationale and provides a "Request Temporary Elevation" link And an audit log entry records eventType="safe_mode_block", target="integrations_config", outcome="blocked" Given Safe Mode is active When an API call attempts to create/rotate/delete API keys or webhooks (POST/DELETE /v1/api-keys, /v1/webhooks) or update integration configs (PUT/PATCH /v1/integrations/*) Then the API returns HTTP 403 with JSON body error.code in {"SAFE_MODE_BLOCKED"}, error.target in {"api_keys","webhooks","integrations_config"}, and a correlationId And no new keys are issued, no secrets rotated, and no configuration values are changed
Block Notification Template Changes; Allow Read-only Preview
Given Safe Mode is active When a user opens Notification Templates in the console Then templates render in read-only mode (inputs disabled) And the Preview function works without saving changes And a guardrail message states "Template edits are blocked in Safe Mode" with a help link Given Safe Mode is active When the user attempts to save, publish, or delete a template Then the action is prevented, no drafts or versions are created, and an audit log eventType="safe_mode_block", target="notification_templates" is recorded Given Safe Mode is active When PUT/PATCH/DELETE is sent to /v1/notifications/templates/* Then the API responds HTTP 403 with error.code="SAFE_MODE_BLOCKED", error.target="notification_templates", and a remediation.url And no template content changes are persisted
Block Account-wide Settings Changes; Preserve Read Access
Given Safe Mode is active When a user navigates to Account Settings (e.g., time zone, escalation routing, severity matrices) Then all editable controls are disabled and display a lock icon with tooltip "Blocked by Safe Mode" And current values remain visible for read-only reference Given Safe Mode is active When the user attempts any save action on Account Settings Then no settings are changed and an audit event eventType="safe_mode_block", target="account_settings" is recorded Given Safe Mode is active When PUT/PATCH requests target /v1/settings/* Then the API returns HTTP 403 with error.code="SAFE_MODE_BLOCKED", error.target="account_settings", correlationId present And subsequent GET /v1/settings/* returns unchanged values
Allow Essential Lifeline Actions Under Safe Mode
Given Safe Mode is active When a dispatcher posts an incident status update via UI or POST /v1/incidents/{id}/status Then the request succeeds with HTTP 2xx and the incident timeline reflects the update within 2 seconds And an audit event eventType="safe_mode_allow", target="incident_status" is recorded Given Safe Mode is active When a user confirms or updates ETR via UI or POST /v1/incidents/{id}/etr Then the request succeeds with HTTP 2xx and outbound notifications are sent per policy Given Safe Mode is active When a crew sync is performed via UI or POST /v1/crews/sync Then the request succeeds with HTTP 2xx and crew locations/assignments update And no guardrail block banners are shown for these allowed actions
Enforce Policy at API Layer with Structured Error Contract
Given Safe Mode is active When any blocked endpoint is called (e.g., roles, permissions, integrations, api-keys, webhooks, notification templates, account settings) Then the response is HTTP 403 with headers:
- Content-Type: application/json; charset=utf-8
- X-Correlation-Id: <UUID>
And the JSON body includes fields:
- error.code = "SAFE_MODE_BLOCKED"
- error.message (human-readable reason)
- error.target (one of the canonical targets)
- error.scope = "tenant" or "org" as applicable
- remediation.url (HTTPS link to docs or elevation request)
- correlationId matching X-Correlation-Id
And the schema validates against the published OpenAPI spec for version >= 1.0 And automated clients can parse error.code and error.target to branch logic without string matching on error.message
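From the client side, this structured contract lets automation branch on `error.code` and `error.target` without matching human-readable messages. A minimal sketch (the handler name and return values are illustrative):

```python
import json

def handle_response(status: int, body: str) -> str:
    """Branch on the structured Safe Mode error contract, never on error.message."""
    if status != 403:
        return "proceed"
    err = json.loads(body).get("error", {})
    if err.get("code") == "SAFE_MODE_BLOCKED":
        # The canonical target tells automation which elevation path to queue.
        return f"queue-for-elevation:{err.get('target')}"
    return "fail"

body = json.dumps({"error": {"code": "SAFE_MODE_BLOCKED", "target": "webhooks"},
                   "correlationId": "c0ffee00-0000-4000-8000-000000000001"})
assert handle_response(403, body) == "queue-for-elevation:webhooks"
assert handle_response(200, "{}") == "proceed"
```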
Guardrail UI With Rationale and Elevation Request Link
Given Safe Mode is active When a user initiates any blocked high-risk action from the console Then a guardrail banner or modal appears containing:
- concise rationale mentioning "Safe Mode" and the blocked action
- "Request Temporary Elevation" link/button
- link to documentation "What is Safe Mode?"
And the primary destructive/action button is disabled When the user clicks "Request Temporary Elevation" Then a modal opens with prefilled context (action, target, current page) and a required justification text field (min 20 chars) And submitting the form sends POST /v1/elevation/requests with HTTP 202 Accepted returning requestId and estimated SLA minutes And an audit event eventType="elevation_request", target=<blockedTarget>, status="submitted" is recorded And the original blocked action remains blocked until an explicit override token is present
Timeboxed Break-Glass Elevation (Dual Approval)
"As a duty manager, I want a timeboxed break-glass option with dual approval so that rare but necessary high-risk actions can proceed safely and transparently."
Description

Provides a controlled, auditable path to temporarily elevate privileges within a Safe Mode session for a narrowly defined task. Requires dual approval with reason capture, maximum duration, and automatic reversion. Supports just-in-time policy creation with preapproved playbooks such as re-enabling a specific webhook for a region and emits alerts to security and compliance channels.

Acceptance Criteria
Dual-Approval Break-Glass Request Creation
Given a Safe Mode session and a user requires elevated privilege for a narrowly defined task When the user initiates a break-glass request Then the system requires selection of a preapproved playbook or a JIT custom scope, entry of reason, target resource(s), region, and requested duration And the system validates requested duration does not exceed the configured maximum (default 30 minutes) and the scope matches the selected playbook schema And a unique request ID is created and the request enters Pending Approval state
Separation of Duties for Approvals
Given a pending break-glass request When approvals are submitted Then the system requires approvals from two distinct approvers with the required approval role, neither being the requester nor each other And conflicting approver groups are disallowed per policy And if any approver rejects or if the approval window (10 minutes configurable) elapses, the request is marked Denied and no elevation is granted And upon receipt of the second approval, elevation activates immediately and the start timestamp is recorded
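The separation-of-duties rule reduces to a small predicate; a sketch, assuming exactly two required approvals (role and conflicting-group checks would layer on top):

```python
def approvals_valid(requester: str, approvers: list[str]) -> bool:
    """Two distinct approvers are required, neither of whom is the requester."""
    distinct = set(approvers)
    return len(approvers) == 2 and len(distinct) == 2 and requester not in distinct

assert approvals_valid("alice", ["bob", "carol"])        # valid dual approval
assert not approvals_valid("alice", ["alice", "bob"])    # requester cannot self-approve
assert not approvals_valid("alice", ["bob", "bob"])      # approvers must be distinct
```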
Timeboxed Elevation and Automatic Reversion
Given an approved elevation with duration D minutes When the clock reaches the expiry time Then all elevated permissions are revoked within 5 seconds And any in-flight or subsequent out-of-scope actions are denied with HTTP 403 and a user-visible notice of expiry And the session banner updates to reflect reversion and the end timestamp is recorded in the audit log
Scoped Permission Enforcement During Elevation
Given an elevation using a playbook that permits re-enabling one webhook for a specified region When the elevated user attempts an action outside the permitted endpoints, resources, or region Then the action is blocked with HTTP 403 and an audit entry including attempted action, resource, and reason Blocked:OutOfScope And actions within the permitted scope succeed and are logged with correlation to the elevation request ID
Security and Compliance Notifications
Given a break-glass request lifecycle event (Created, Approved, Denied, Activated, Revoked, Expired) When the event occurs Then notifications are sent to configured security and compliance channels within 15 seconds with request ID, requester, approvers, scope, reason, and timestamps And failed deliveries are retried at least 3 times with exponential backoff and failures are surfaced in system health
Audit Trail Export and Integrity
Given a completed break-glass request When an auditor exports the audit record for the request Then the export contains the full immutable event timeline (creation, approvals, activation, actions taken, revocation) with user IDs, roles, IPs, timestamps, and scope details And the export includes a verifiable integrity hash/signature And the export is retrievable via UI and API within 5 seconds for records from the last 90 days
Safe Mode Audit Trail and Telemetry
"As a compliance officer, I want comprehensive Safe Mode audit logs and metrics so that we can satisfy audits and improve our incident controls."
Description

Captures immutable, correlated audit logs for all Safe Mode actions, denials, approvals, and policy changes with session identifiers and actor metadata. Exposes real-time dashboards and standardized exports to SIEM platforms via webhook and CSV. Provides metrics such as time in Safe Mode, blocked-attempt counts, and elevation frequency to inform policy tuning and satisfy compliance reporting.

Acceptance Criteria
Immutable Safe Mode Audit Log Capture
Given Safe Mode is enabled for an incident and a user or system performs any action (status update, ETR confirmation, crew sync) or attempts any high-risk change (role edit, integration reconfig, policy change, elevation request) When the action is executed, blocked, approved, or denied Then an audit event is appended within 3 seconds (p95) containing: event_id (UUIDv4), occurred_at (ISO-8601 UTC), session_id, incident_id, actor_id, actor_type (human|system), actor_role, auth_method, mfa_used (boolean), action_type, action_payload_hash (SHA-256), outcome (allowed|blocked|approved|denied), reason_code, request_ip, user_agent, correlation_id, sequence And then audit storage is append-only: any API/UI attempt to update or delete an event returns 403 and no mutation occurs; a daily integrity job verifies a cryptographic hash chain across 100% of events And then all Safe Mode events are retained for at least 365 days in immutable (WORM or equivalent) storage
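The append-only guarantee above rests on a cryptographic hash chain that the daily integrity job can recompute. A minimal sketch of that check (field names such as `body` and `hash`, and the all-zero genesis value, are illustrative assumptions, not OutageKit's actual schema):

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting link for an empty chain

def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous link together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def verify_chain(events: list) -> bool:
    """Recompute every link; tampering with or deleting any event breaks
    that event's hash and every hash after it."""
    prev = GENESIS
    for e in events:
        if e["hash"] != chain_hash(prev, e["body"]):
            return False
        prev = e["hash"]
    return True
```

Because each link covers the previous hash, an attacker who rewrites one stored event would have to rewrite every later event as well, which WORM storage prevents.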
Session and Actor Correlation Across Channels
Given a Lifeline session spans multiple channels (SMS, web, IVR) and/or devices When Safe Mode-related events are generated across those channels Then all events belonging to the same session share the same session_id and carry a correlation_id that ties related cross-channel interactions And then querying by session_id for a time range returns a complete, chronologically ordered timeline (occurred_at, sequence) with no missing events And then actor metadata is captured for every event: actor_id, actor_type, actor_role at time of action, auth_method, mfa_used; fields are non-null and consistent with the identity provider claims
Real-Time Safe Mode Telemetry Dashboard
Given an operations manager opens the Safe Mode dashboard When new Safe Mode events are produced Then dashboard tiles and charts refresh at least every 5 seconds and reflect new events within 10 seconds (p95) And then the dashboard displays: active Safe Mode sessions, average time in Safe Mode (last 24h), per-session duration, blocked-attempt counts (total and by action_type), elevation request frequency, approval rate, time-to-approval p50/p95, and top reason_codes And then selecting a time range (last 1h/24h/7d/custom) updates all metrics and timelines with count discrepancies <= 1% versus a raw event query for the same filters And then a drill-down for any session shows its full event timeline with filters for action_type and outcome
Standardized Webhook Export to SIEM
Given a SIEM webhook destination is configured and enabled When a Safe Mode audit event is generated Then OutageKit sends an HTTPS POST within 5 seconds (p95) containing the event in JSON schema v1.0; headers include Content-Type: application/json and Idempotency-Key: <event_id> And then each request is signed with HMAC-SHA256 using the shared secret and includes X-Signature and X-Timestamp; receivers can verify signature and reject replays older than 5 minutes And then delivery uses at-least-once semantics with exponential backoff (initial 2s, max 5m, up to 12 attempts) and deduplication by event_id/Idempotency-Key; permanently failed events are moved to a dead-letter queue and alerting is triggered And then, in staging tests with a healthy destination (HTTP 2xx within 2s), >= 99.9% of events are delivered within 10 minutes over a rolling 24h window
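The signing and replay rules above can be sketched as follows. The exact message layout (timestamp, a dot, then the raw body) is an assumption; the criteria only require HMAC-SHA256 over the shared secret with X-Signature and X-Timestamp headers and a 5-minute replay window:

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_SECONDS = 300  # reject deliveries older than 5 minutes

def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Sender side: value for X-Signature, bound to X-Timestamp and the body."""
    msg = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: str, body: bytes, signature: str, now=None) -> bool:
    """Receiver side: replay-window check plus constant-time comparison."""
    now = time.time() if now is None else now
    if abs(now - float(timestamp)) > REPLAY_WINDOW_SECONDS:
        return False  # stale delivery; treat as a possible replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Binding the timestamp into the signed message (rather than only sending it as a header) prevents an attacker from replaying an old body under a fresh timestamp.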
CSV Compliance Export
Given a user requests a CSV export for a time range with optional filters (incident_id, session_id, action_type, outcome) When the export job completes Then the CSV includes a header and one row per event with columns: event_id, occurred_at (UTC), session_id, incident_id, actor_id, actor_type, actor_role, action_type, outcome, reason_code, request_ip, user_agent, correlation_id, sequence, payload_hash And then values are UTF-8 encoded, correctly escaped, and protected against CSV formula injection (cells beginning with =, +, -, @ are prefixed with '); timestamps are ISO-8601 UTC; booleans are true/false And then row count equals the number of events matching the filters; ordering is deterministic by occurred_at, then sequence; a SHA-256 checksum is provided and matches the file contents And then exports up to 1,000,000 rows complete within 2 minutes (p95) via async jobs with progress; larger exports are chunked with pagination cursors and support resumable download
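The formula-injection rule above (prefix cells that begin with =, +, -, or @ so spreadsheet applications treat them as text) can be sketched as a small sanitizer around the standard CSV writer:

```python
import csv
import io

FORMULA_PREFIXES = ("=", "+", "-", "@")

def sanitize_cell(value: str) -> str:
    """Neutralize spreadsheet formula injection with a leading apostrophe."""
    return "'" + value if value.startswith(FORMULA_PREFIXES) else value

def export_rows(rows: list, columns: list) -> str:
    """Write a header plus one sanitized row per event (in-memory sketch)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    for row in rows:
        writer.writerow(sanitize_cell(str(row.get(c, ""))) for c in columns)
    return buf.getvalue()
```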
Elevation and Approval Flow Capture and Metrics
Given Safe Mode restricts high-risk changes and an operator requests elevation or submits a policy change When the request is processed Then the audit trail records: request_created, approver_assigned, each approval/denial event (with approver_id), final decision, and activation, including reason, ticket_id, and approval_comment And then if two-person approval is configured, activation occurs only after approvals from two distinct approvers within the configured validity window; denials and timeouts are logged with outcome and reason_code; unauthorized attempts are blocked and logged as outcome=blocked And then dashboard metrics expose elevation request count, approval rate, median and p95 time-to-approval, and blocked high-risk attempts by action_type for the selected period, with counts matching raw event queries within 1%
Safe Mode UX Indicators and Guidance
"As a responder, I want obvious Safe Mode cues and guidance so that I understand what I can do right now and how to request additional access if needed."
Description

Introduces clear, persistent UI indicators when Safe Mode is active, including banners, iconography, and color state, with inline tooltips explaining allowed versus blocked actions and quick links to request elevation. Disables or hides restricted controls consistently and surfaces a compact checklist for essential workflows such as status update, ETR confirmation, and crew sync. Ensures accessibility and localization across web and mobile companion interfaces.

Acceptance Criteria
Persistent Safe Mode Banner Across Web and Mobile
Given the user is in Safe Mode, When any screen loads or the user navigates within the app (web or mobile), Then a persistent top-level banner labeled "Safe Mode" with the active scope is visible until Safe Mode is turned off. Given Safe Mode is toggled off, When the session state updates, Then the banner hides within 1 second and does not reappear unless Safe Mode resumes. Given the banner is visible, Then it uses the designated Safe Mode color token and icon, meets 4.5:1 contrast, and is tappable/clickable to open Safe Mode details. Given the device is offline or on a slow network, When the app launches, Then the banner renders from cached assets within 2 seconds.
Inline Tooltips for Allowed vs Blocked Actions
Given a control is disabled or hidden due to Safe Mode, When the user hovers, focuses, or long-presses the control or its placeholder, Then a tooltip appears within 300 ms stating "Blocked in Safe Mode" with the reason and any allowed alternatives. Given a control is allowed in Safe Mode, When the user hovers or focuses it, Then a tooltip indicates it is allowed and notes any limits. Given a tooltip is shown, Then it is keyboard and screen-reader accessible (focusable trigger, role=tooltip, aria-describedby), dismisses on Esc/blur, and does not obstruct critical content. Given the tooltip includes "Request elevation", When activated, Then the elevation request flow opens within 500 ms.
Quick Elevation Request from Blocked Control
Given a user encounters a blocked action in Safe Mode, When "Request elevation" is selected from a tooltip or banner, Then a modal/sheet opens with prefilled context (action, screen, timestamp) and a mandatory justification field. Given the request is submitted with a justification of at least 10 characters, When the backend returns 2xx, Then a success confirmation with a copyable request ID is displayed and the control remains blocked until elevation is granted. Given the request submission fails (4xx/5xx/timeout), Then an error message with retry is shown and no state changes occur. Given rate limiting of 3 requests per user per 15 minutes, When the limit is exceeded, Then a clear message shows the next available time.
Consistent Restriction of High-Risk Controls
Rule: In Safe Mode, high-risk controls (role edits, integration reconfiguration, delete outage, bulk import/export, API key rotation) are disabled or hidden across all entry points, including menus, quick actions, and context panels.
Rule: Allowed essentials (status update, ETR confirmation, crew sync) remain enabled and are visually marked as "Allowed in Safe Mode".
Rule: Invoking a blocked action via deep link or shortcut shows a non-destructive block dialog with reason and a request-elevation action; no state mutation occurs.
Rule: Each block event is written to the audit log with user, action, timestamp, and UI surface.
Essential Workflow Compact Checklist
Given Safe Mode is active, When a user opens the incident or console view, Then a compact checklist appears listing Status Update, ETR Confirmation, and Crew Sync with real-time completion states. Given a checklist item is completed via the app, Then its state updates to complete within 1 second and persists for the session and incident. Given the user dismisses the checklist, Then it can be recalled from the banner/details and the dismissal persists until Safe Mode ends. Given screen width is under 360 dp, Then the checklist collapses into an accessible overflow chip without truncation.
Accessibility Compliance for Safe Mode Indicators
Rule: All Safe Mode UI indicators and tooltips meet WCAG 2.2 AA for contrast (4.5:1 text, 3:1 UI components), focus visibility, and target size (44x44 px minimum on touch).
Rule: All interactive elements are operable via keyboard (Tab/Shift+Tab/Enter/Esc), maintain logical focus order, and expose correct ARIA roles/names/states.
Rule: Dynamic Safe Mode state changes are announced via aria-live="polite" without moving focus.
Rule: Mobile screen readers (TalkBack/VoiceOver) correctly read labels for banner, checklist, and tooltips.
Localization and Internationalization of Safe Mode UI
Rule: All Safe Mode strings (banners, tooltips, buttons, checklist) are externalized and localized for supported locales (en, es, fr, de) with no hard-coded text.
Rule: Layouts accommodate translated strings without clipping or overflow on viewports down to 320 px; truncation uses locale-appropriate ellipses with full text available on focus/tooltip.
Rule: Date/time and ETR formats respect user locale; right-to-left languages render correctly with mirrored icons and layout where applicable.
Rule: Fallback to English occurs only when a translation is missing, and the event is captured in telemetry.

Auto-Revoke Window

Every Lifeline session auto-expires after a configurable timebox, with one-click global recall and automatic rebind to SSO when it recovers. Eliminates lingering backdoors, reduces admin cleanup, and ensures emergency access ends when the crisis does.

Requirements

Configurable Auto-Expiry Policies
"As a security admin, I want to set and enforce Lifeline session time limits so that emergency access always ends automatically and cannot linger beyond our policy."
Description

Define and enforce timeboxed Lifeline session durations at organization, environment, and role levels with sensible defaults and allowed bounds. Support per-incident overrides with mandatory justification and audit capture. Display remaining time to users in-app and via API metadata, and support optional short extensions gated by policy. Ensure enforcement across OutageKit admin console and API tokens, with clear precedence rules and versioned policy histories. Handle clock drift via server-side TTL, and surface effective policy in admin UI for transparency.

Acceptance Criteria
Policy Precedence Across Org, Environment, and Role Levels
Given organization-, environment-, and role-level Lifeline expiry policies exist with allowed bounds and defaults When a Lifeline session is created for a user with Role R in Environment E Then the effective expiry duration is selected by precedence Role > Environment > Organization And the effective duration is clamped to the organization-level allowed bounds [min_duration, max_duration] And if no value is set at any layer, the organization default_duration is applied and recorded as source "default" And the effective policy source and computed expires_at (ISO 8601 UTC) are persisted server-side and exposed via API/admin UI
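The precedence and clamping rules above amount to a small resolver: take the first value found walking Role > Environment > Organization, fall back to the org default, then clamp to the allowed bounds. A sketch (the default of 60 minutes and bounds of 15–480 are hypothetical values, not OutageKit defaults):

```python
def effective_expiry(role=None, environment=None, organization=None,
                     default=60, bounds=(15, 480)):
    """Resolve a Lifeline expiry duration (minutes) and report its source layer."""
    for value, source in ((role, "role"),
                          (environment, "environment"),
                          (organization, "organization")):
        if value is not None:
            break
    else:
        value, source = default, "default"  # no layer set a value
    lo, hi = bounds
    return max(lo, min(hi, value)), source  # clamp to org-level allowed bounds
```

Recording the returned `source` alongside the computed `expires_at` is what lets the admin UI show why a given session got its duration.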
Per-Incident Override With Mandatory Justification and Audit
Given an active incident I and a user with permission to override Lifeline policies per incident When the user submits an override with a new expiry duration for incident I Then a non-empty justification is mandatory and the request is rejected if missing And the override is validated against organization allowed bounds; out-of-range values are rejected with a clear error And an audit record is created capturing incident_id, actor, timestamp (UTC), previous_policy, new_policy, scope, and justification And the override applies immediately to new sessions and to all active Lifeline sessions linked to incident I by recomputing server-side expires_at And the override appears as a new version in the policy history tagged to incident I
Remaining Time Visible In-App and Via API
Given a user has an active Lifeline session When the user views the admin console Then a countdown banner shows remaining time in mm:ss and updates at least once per second And GET /sessions/{id} returns expires_at (ISO 8601 UTC), ttl_seconds (integer), and policy_source And the UI countdown matches ttl_seconds within ±1 second And when ttl_seconds <= 0, the UI immediately disables Lifeline-gated actions and indicates expiry And after expiry, Lifeline-gated API calls return 401 with error_code "lifeline_expired"; ttl_seconds is never negative
Enforcement Across Admin Console Sessions and API Tokens
Given server-side TTL expiration for a Lifeline session When a user attempts any Lifeline-gated action after expiry Then the admin console session is revoked and the user is redirected to SSO re-auth within 5 seconds of expiry And all API tokens derived from the Lifeline session return 401 with error_code "lifeline_expired" within 5 seconds of expiry And token refresh/rotation endpoints refuse to refresh expired Lifeline tokens unless a valid policy-approved extension exists And revocation propagates across all active devices/browsers for the same user within 5 seconds
Optional Short Extensions Gated by Policy
Given a policy configured with extension settings (enabled, extension_window, max_extension_minutes, max_extensions_per_session, justification_required) When a user requests a Lifeline extension Then the request is allowed only if enabled and within the configured extension_window relative to current expires_at And the granted extension increases expires_at by no more than max_extension_minutes and does not exceed max_extensions_per_session And if justification_required is true, a non-empty justification is mandatory; otherwise it is optional And the system records an audit entry with actor, timestamp (UTC), prior_expires_at, new_expires_at, and justification And the session and API reflect updated ttl_seconds and expires_at within 2 seconds of approval And denied requests return 403 with a machine-readable reason code
Server-Side TTL and Clock Drift Handling
Given a client device with incorrect local time When a Lifeline session is created and used Then expiry is determined solely by server time; changing client time does not extend or shorten the session And API responses include server_now and expires_at in ISO 8601 UTC so clients can compute drift And if client-local time differs from server_now by more than 30 seconds, the UI displays a clock drift warning without altering enforcement And expirations occur at server-side expires_at ±1 second; no session remains active beyond its TTL
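The drift-handling rule above keeps enforcement entirely server-side: the client's clock only ever drives a warning banner, never the TTL. A minimal sketch of the session view the API could compute (the 30-second threshold comes from the criteria; the function and field names are assumptions):

```python
from datetime import datetime, timezone

DRIFT_WARNING_SECONDS = 30

def session_view(expires_at: datetime, server_now: datetime, client_now: datetime) -> dict:
    """TTL from server clocks only; ttl_seconds is floored at zero, never negative."""
    ttl = max(0, int((expires_at - server_now).total_seconds()))
    drift = abs((client_now - server_now).total_seconds())
    return {"ttl_seconds": ttl, "show_drift_warning": drift > DRIFT_WARNING_SECONDS}
```

Because `expires_at` and `server_now` are both server values, changing the client clock can alter the warning but never lengthen or shorten the session.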
Effective Policy Transparency and Versioned History
Given an admin views policy configuration for a user/environment When navigating to the policy panel in the admin UI Then the UI shows the effective expiry value, the source layer (role/environment/organization/incident_override/default), and the rationale (precedence) used And a versioned history is displayed with entries containing version_id, editor, timestamp (UTC), change summary, and field diffs And history entries are immutable; new changes create a new version rather than editing prior versions And GET /policies/effective?user_id={id}&environment_id={id} returns values matching the UI, including effective source and version_id
One-Click Global Recall
"As an incident commander, I want to recall all active Lifeline sessions with one click so that I can immediately close emergency access when the crisis subsides or risk is detected."
Description

Provide a guarded control and API endpoint to revoke all active Lifeline sessions instantly across the tenant. Propagate revocation within seconds to web sessions and API tokens with retries for partitioned nodes, and present a real-time impact summary (sessions revoked, endpoints pending). Support scope filters (entire org, environment, role) and dry-run mode for preview. Require confirmation with reason capture, ensure idempotency, and prevent immediate reissuance unless explicitly reauthorized. Log all actions to audit and incident timelines.

Acceptance Criteria
Tenant-Wide Instant Recall via UI
Given I am a tenant admin with recall permissions and active Lifeline sessions exist When I click the guarded "Recall All" control, confirm the action, and enter a non-empty reason Then 100% of active Lifeline web sessions and API tokens in the tenant are revoked within 10 seconds p95 and 30 seconds p100 And affected web sessions are forced to logout and API tokens are rejected on next request with HTTP 401/invalid_token And the recall button is disabled during execution to prevent duplicate submissions And re-clicking within 5 minutes does not create duplicate operations or side-effects (idempotent)
Idempotent API Recall Endpoint
Given an authorized client calls POST /v1/lifeline/recall with scope=tenant and a reason and Idempotency-Key=X When the request is processed Then the API returns 202 with operation_id and begins revocation And a subsequent identical request with Idempotency-Key=X returns the same operation result without triggering additional revocations And unauthorized or insufficiently scoped callers receive HTTP 403 And the endpoint enforces JSON schema validation and returns HTTP 400 for missing reason or invalid scope
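The idempotency contract above (same Idempotency-Key returns the same operation result without triggering additional revocations, missing reason or bad scope returns 400) can be sketched as an in-memory service; a real implementation would persist the key-to-result map, but the control flow is the same:

```python
class RecallService:
    """Stores the first result per Idempotency-Key; replays return it unchanged."""

    VALID_SCOPES = {"tenant", "environment", "role"}

    def __init__(self):
        self._operations = {}       # idempotency_key -> first response
        self.revocations_started = 0

    def recall(self, idempotency_key: str, scope: str, reason: str) -> dict:
        if not reason or scope not in self.VALID_SCOPES:
            return {"status": 400, "error": "invalid_request"}
        if idempotency_key in self._operations:
            return self._operations[idempotency_key]  # no second revocation
        self.revocations_started += 1
        result = {"status": 202, "operation_id": f"op-{self.revocations_started}"}
        self._operations[idempotency_key] = result
        return result
```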
Scope-Filtered Recall (Org/Environment/Role)
Given active sessions exist across multiple environments and roles When I configure filters (e.g., environment=staging, role=FieldOps) and initiate recall Then only sessions matching the selected filters are revoked; all others remain active And the impact summary reflects targeted_total, revoked_count, and pending_count for the filtered scope And clearing filters and selecting entire_org targets all sessions
Dry-Run Preview of Impact
Given there are active sessions matching selected filters When I enable Dry Run and execute recall Then no sessions or tokens are revoked And the impact summary returns the counts that would be targeted, revoked, and pending if executed And an audit entry is recorded with dry_run=true and no destructive changes And the UI requires explicit confirmation to proceed from Dry Run to Execute
Real-Time Impact Summary and Retry Behavior
Given a recall operation is running When I view the operation progress panel Then it displays targeted_total, revoked_count, pending_count, pending_by_reason (e.g., partitioned, unreachable), and last_updated timestamp And counts refresh at least once per second until completion or timeout And unreachable endpoints are retried with exponential backoff for up to 2 minutes before being marked pending=partitioned And the final state is success if revoked_count == targeted_total; otherwise partial with enumerated pending reasons
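The retry behavior above, exponential backoff bounded by a 2-minute budget before a node is marked pending=partitioned, can be sketched as a delay generator (the 1-second initial delay and doubling factor are illustrative assumptions; only the 2-minute cap is specified):

```python
def backoff_schedule(initial=1.0, factor=2.0, total_budget=120.0):
    """Yield retry delays, growing geometrically, until the time budget is spent."""
    delay, elapsed = initial, 0.0
    while elapsed + delay <= total_budget:
        yield delay
        elapsed += delay
        delay *= factor
```

A caller drains the generator, sleeping between attempts; when it is exhausted without a successful revocation, the endpoint is marked partitioned in the impact summary.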
Reissue Prevention Until Explicit Reauthorization
Given a recall has completed for scope X When a user or service requests a new Lifeline session/token within scope X Then issuance is blocked with HTTP 403 and error=LIFELINE_RECALL_ACTIVE until an authorized admin explicitly reauthorizes issuance And reauthorization via UI control or API immediately permits new issuance and records the reauthorization actor, timestamp, and scope And attempts to bypass via refresh/renew flows are also blocked until reauthorization
Comprehensive Audit and Incident Logging
Given any recall action (UI or API, dry-run or execute) When the action is initiated, progresses, and completes Then audit records capture timestamp, actor, actor_type, IP/client_id, scope filters, dry_run flag, reason text, idempotency_key, operation_id, targeted_total, revoked_count, pending_by_reason, outcome (success|partial|failed) And an incident timeline entry is posted with a human-readable summary and link to the operation detail And audit and timeline entries are immutable (WORM) and accessible only to authorized roles
SSO Recovery Auto-Rebind
"As an IT admin, I want Lifeline access to end automatically and users to be re-routed to SSO when it is healthy so that we eliminate temporary backdoors without manual cleanup."
Description

Continuously monitor IdP health (Okta, Azure AD, Google, generic OIDC/SAML) via webhooks and periodic checks with debounce to avoid flapping. On confirmed recovery, automatically invalidate Lifeline sessions, restore normal SSO flow, and prompt users to reauthenticate via SSO while preserving non-destructive in-progress work. Map Lifeline users back to their SSO identities for seamless context transfer. Provide admin controls for manual override and maintenance windows, and record all transitions in audit logs.

Acceptance Criteria
Debounced IdP Recovery Detection (Webhooks + Polling)
Given the IdP provider supports health webhooks and periodic health probes with a configured debounce window D seconds and N consecutive-success threshold When the system receives a recovery webhook or observes N consecutive successful probe results within D seconds Then the provider status transitions to Recovered only after the N-of-N successes are observed within the debounce window And transient recoveries that do not meet N-of-N within D seconds do not change the status And the decision includes the sampled timestamps, results, and applied debounce parameters in the system state
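The N-of-N-within-D debounce above can be sketched as a small detector: any failure resets the consecutive-success count, and successes outside the window age out, so a flapping IdP never reaches Recovered:

```python
from collections import deque

class RecoveryDetector:
    """Mark the IdP Recovered only after n consecutive successes within the window."""

    def __init__(self, n: int, window_seconds: float):
        self.n = n
        self.window = window_seconds
        self.samples = deque()  # timestamps of consecutive successful probes

    def observe(self, timestamp: float, success: bool) -> bool:
        if not success:
            self.samples.clear()  # a failure breaks the consecutive run
            return False
        self.samples.append(timestamp)
        # Drop successes that fell outside the debounce window D.
        while self.samples and timestamp - self.samples[0] > self.window:
            self.samples.popleft()
        return len(self.samples) >= self.n
```

Webhook-delivered recovery signals and periodic probe results both feed `observe`, so either source alone can complete the N-of-N run.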
Automatic Lifeline Session Invalidation on Confirmed Recovery
Given Lifeline mode is active for a tenant and one or more users hold active Lifeline sessions And the tenant’s IdP recovery is confirmed per the debounce rules When recovery is marked Recovered Then all active Lifeline sessions for that tenant are invalidated within 10 seconds And new authentications are routed to the standard SSO flow And users with an invalidated Lifeline session are prompted to reauthenticate on their next action And non-destructive in-progress work (e.g., draft outage reports, unsent updates, selected filters) is preserved for at least 30 minutes or until SSO reauthentication completes And upon successful SSO reauthentication, preserved work is restored and bound to the reauthenticated user context
Identity Rebind and Context Preservation After SSO Reauthentication
Given a user previously operated under a Lifeline session that is mapped to an SSO principal via subject/email/externalId When the user completes SSO reauthentication after recovery Then the user’s roles, team memberships, and resource scopes reflect the SSO assertion And prior, non-destructive in-progress work from the Lifeline session is associated to the SSO identity and restored without duplication or loss And no privileges exceed those asserted by SSO; reductions are applied immediately And ambiguous or missing mappings require user resolution before restoration, and the event is logged
Admin Override and Maintenance Window Controls
Given an admin with Auth Admin permission When the admin triggers Force Rebind Now for a tenant Then the system immediately executes the recovery workflow regardless of current debounce state and logs the action with actor identity and reason When the admin schedules a maintenance window with start/end times and targeted IdPs Then automatic rebind and auto-invalidation are suppressed during the window And end users see a maintenance banner explaining the temporary behavior And after the window ends, automatic detection and rebind resume
Comprehensive Audit Logging of State Transitions
Given any transition among states (SSO Degraded, Lifeline Active, Recovery Pending, Rebound, Regressed) When the system changes state, invalidates sessions, prompts reauthentication, or an admin invokes override/maintenance Then an audit record is created containing UTC timestamp, tenant, IdP provider, previous→new state, trigger source (webhook/probe/admin), debounce parameters, impacted session/user counts, actor identity/correlation IDs And audit records are immutable, queryable by time range and tenant, and exportable via API as JSON and CSV And if audit write initially fails, the system retries with backoff and emits an alert on repeated failure
Recovery Regression Handling and Rollback to Lifeline
Given recovery was confirmed and rebind executed for a tenant When the IdP health degrades again within 5 minutes (configurable) Then the system automatically re-enters Lifeline mode using the same debounce protections to avoid flapping And users who have not yet completed SSO reauthentication remain able to perform Lifeline-permitted critical actions And users are notified of the state change if org-level notifications are enabled
Multi-Provider and Per-Tenant Rebind Segmentation
Given a tenant has multiple IdPs (e.g., Okta, Azure AD, Google, generic OIDC/SAML) or multiple tenants share infrastructure When recovery is confirmed for only a subset of providers Then rebind and session invalidation occur only for sessions tied to recovered providers; others remain in Lifeline And provider statuses are displayed independently in admin views and exposed via API And where webhooks are unavailable for a provider, polling is used without impacting other providers
Graceful Termination & Work Preservation
"As a console user, I want a brief, clear wind-down when my Lifeline session ends so that I can save my work and avoid leaving the system in a bad state."
Description

On auto-expiry or recall, present a visible countdown (e.g., 60 seconds), auto-save drafts, and allow in-flight safe operations to complete while blocking new destructive actions. For API clients, return structured 401/403 responses with reason and retry-after guidance. Ensure backend operations are idempotent to avoid partial state. Provide clear UX messaging, accessibility compliance, and localized strings. Include configurable grace periods per policy with safeguards against indefinite extension.

Acceptance Criteria
Visible Countdown Prior to Auto-Expiry/Recall
- Given an authenticated user with an active Lifeline session and org policy gracePeriodSeconds=60, When the system triggers auto-expiry or a global recall, Then a persistent countdown UI appears within 500ms, initially showing 60 seconds remaining, and decrements every 1s with accuracy ±1s.
- And the countdown is visible across all app views and cannot be dismissed.
- And at T-10s the UI escalates via color and text to warn of imminent termination.
- And when the countdown reaches 0, Then the session is revoked, the user is redirected to SSO for automatic rebind if available (within 2s), else a sign-in prompt is shown.
- And audit events "session_countdown_started" and "session_revoked" are recorded with userId, sessionId, reason (auto_expiry|recall), and timestamps.
Auto-Save Drafts on Termination
- Given unsaved user inputs (forms, notes, configuration edits), When a countdown starts, Then the system auto-saves all drafts within 1s and saves deltas at most every 5s until termination.
- And auto-saved drafts are versioned, associated to the user, and recoverable upon successful re-auth within 5s of returning to the same context.
- And auto-save writes are idempotent (no duplicate drafts for the same resource) and pass integrity validation.
- And telemetry emits events "draft_autosaved" and "draft_restored" with correlation to sessionId.
Quiesce: Allow Safe Operations, Block Destructive Actions During Grace Window
- Given the grace window is active, When the user initiates a new destructive action (create/update/delete/broadcast), Then client controls are disabled and server attempts are rejected with HTTP 403 and body { code:"session_expiring", reason:"grace_window", retry_after: secondsRemaining }.
- And safe operations (read-only queries and non-mutating exports) initiated before or during the grace window continue to completion.
- And destructive operations already in flight at countdown start either complete atomically before T=0 or are rolled back with clear user feedback and no partial state.
- And upon T=0, all new requests are blocked until re-auth; the UI presents a single clear path to re-auth or exit.
API Error Contract for Expired/Recalled Sessions
- Given an API call with an expired token, When accessing a protected endpoint, Then the response is 401 with WWW-Authenticate: Bearer error="invalid_token" and an application/problem+json body { type, title, code:"token_expired", reason:"auto_expiry", retry_after: seconds }.
- Given an API call for a recalled session, When accessing a protected endpoint, Then the response is 403 with an application/problem+json body { type, title, code:"session_recalled", reason:"recall", retry_after: secondsRemaining, request_id } and a Retry-After header when retry is appropriate.
- And all protected endpoints conform to this schema and are covered by contract tests in OpenAPI/CI.
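The 401 contract above can be sketched as a response builder. The status, headers, and body fields mirror the criteria; the `about:blank` type URI and title text are placeholder assumptions, since the criteria name the fields but not their values:

```python
def expired_token_response(retry_after: int):
    """Build the (status, headers, body) triple for an expired Lifeline token."""
    headers = {
        "Content-Type": "application/problem+json",
        "WWW-Authenticate": 'Bearer error="invalid_token"',
    }
    body = {
        "type": "about:blank",            # assumed problem-type URI
        "title": "Session token expired", # assumed human-readable title
        "code": "token_expired",
        "reason": "auto_expiry",
        "retry_after": retry_after,
    }
    return 401, headers, body
```

The recalled-session case would differ only in status (403), `code` ("session_recalled"), `reason`, and the added `request_id` field and Retry-After header.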
Idempotent Backend to Prevent Partial State
- Given write endpoints accept Idempotency-Key, When duplicate requests with the same key occur within 24h, Then only one side effect is applied and subsequent responses return the original status and body.
- And multi-step operations are transactional; on failure or session expiry mid-process, state is fully rolled back with no partially visible records.
- And concurrent duplicate submissions do not produce duplicate side effects (verified across 10k test runs with anomaly rate ≤0.01%).
- And each write operation records a completion status of "committed" or "rolled_back" for auditability.
Accessible, Localized UX Messaging for Session End
- Given supported locales (e.g., en, es, fr), When the countdown starts, Then all strings are sourced from the i18n catalog with correct pluralization and numeral formatting; missing keys fall back to the default locale without breaking the UI.
- And the countdown and termination messages meet WCAG 2.2 AA (contrast ≥4.5:1, focus visible, keyboard operable), with an aria-live announcement of "Session ending in N seconds" not more often than every 5s.
- And on termination, focus moves to the primary re-auth action; screen readers announce the state change; no keyboard traps are introduced.
Configurable Grace Period with Safeguards Against Indefinite Extension
- Given an org admin policy setting Grace Period (min 15s, max 300s, default 60s), When the value is updated, Then it applies to new countdowns within 60s and is recorded in audit logs with actor and old/new values.
- And per-session extension is capped by policy (max one extension up to +120s), requires admin role, and attempts beyond the cap are rejected with clear messaging and a 403 from the API.
- And global recall respects a configured recallGracePeriod (0–120s); setting 0 shows a brief notice and revokes immediately; out-of-bounds values are rejected with 400 code="invalid_policy_value".
Comprehensive Audit Logging & Exports
"As a compliance officer, I want immutable logs of all Lifeline session events so that we can prove emergency access was controlled and time-bound during audits."
Description

Record all Lifeline lifecycle events—issuance, extension, override attempts, auto-expiry, recall, and SSO rebind—with actor, timestamp, reason, incident ID, device fingerprint, IP, and scope. Store logs in append-only, tamper-evident storage with configurable retention. Provide searchable UI, CSV/JSON export, and integrations to SIEMs (Splunk, Datadog) via webhook/syslog. Sign logs and include correlation IDs to tie events to incident timelines and user actions for compliance and forensics.

Acceptance Criteria
Lifecycle Event Capture and Field Completeness
Given any Lifeline session event of type issuance, extension, override_attempt, auto_expiry, global_recall, or sso_rebind When the event is processed Then exactly one audit record is appended within 1 second (p95) containing: event_type, actor, timestamp_utc (ISO8601 with milliseconds), reason (nullable), incident_id (nullable), device_fingerprint (nullable), ip, scope, correlation_id, and digital_signature that verifies against the active audit signing public key
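A sketch of assembling and verifying a record with the fields above. The criteria call for an asymmetric signature verified against a public key; HMAC-SHA256 stands in here only to keep the example stdlib-only, and the field canonicalization (sorted-key JSON) is an assumption:

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone

SIGNING_KEY = b"demo-signing-key"  # stand-in; production uses an asymmetric key pair


def build_audit_record(event_type, actor, *, reason=None, incident_id=None,
                       device_fingerprint=None, ip="", scope="", correlation_id=""):
    """Assemble the required fields, then sign the canonical JSON payload."""
    record = {
        "event_type": event_type,
        "actor": actor,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "reason": reason,
        "incident_id": incident_id,
        "device_fingerprint": device_fingerprint,
        "ip": ip,
        "scope": scope,
        "correlation_id": correlation_id,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["digital_signature"] = base64.b64encode(
        hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()).decode()
    return record


def verify_audit_record(record) -> bool:
    """Recompute the signature over everything except the signature itself."""
    unsigned = {k: v for k, v in record.items() if k != "digital_signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = base64.b64encode(
        hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()).decode()
    return hmac.compare_digest(expected, record["digital_signature"])
```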
Tamper-Evident Append-Only Storage and Verification
Given audit storage is configured as append-only When an update or delete is attempted on an existing record via any API or backend interface Then the operation is rejected and an integrity alert is recorded, and the record remains unchanged And a daily integrity verification job computes proofs over all records for the previous day and stores an exportable proof artifact When any stored record is altered out-of-band Then the Verify Integrity endpoint returns integrity_status = "fail" and identifies the earliest offending record And WORM/immutability is enforced for the configured retention period
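Tamper evidence over append-only records is commonly implemented as a hash chain (an assumed design here, not one the criteria mandate): each entry commits to the previous hash, so recomputing the chain identifies the earliest offending record, as the Verify Integrity behavior requires:

```python
import hashlib


def chain_records(payloads):
    """Append-only hash chain: each entry hashes its payload together with
    the previous hash, so any out-of-band alteration changes every later hash."""
    hashes, prev = [], b"genesis"
    for payload in payloads:
        h = hashlib.sha256(prev + payload).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes


def earliest_tampered(payloads, stored_hashes):
    """Recompute the chain and return the index of the first mismatch
    (the earliest offending record), or None if integrity passes."""
    for i, (h, s) in enumerate(zip(chain_records(payloads), stored_hashes)):
        if h != s:
            return i
    return None
```

The daily job would persist the final chain hash as the exportable proof artifact.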
Searchable Audit UI with Filters and Pagination
Given the audit UI and a dataset of at least 10 million records When the user filters by time range, event_type, actor, incident_id, ip, device_fingerprint, scope, and correlation_id, and optionally free-text on reason Then the first page of results returns within 2 seconds (p95) for up to 10,000 matching records, sorted by timestamp desc, with pagination (25/50/100 per page) And each row displays event_type, actor, timestamp_utc, reason, incident_id, device_fingerprint, ip, scope, correlation_id, and a signature verification badge And copy-to-clipboard controls are available for correlation_id and incident_id
CSV and JSON Export Fidelity
Given any applied filter in the audit UI covering up to 1,000,000 records When the user requests CSV export Then a downloadable RFC4180-compliant CSV is generated within 5 minutes with UTF-8 encoding, header row, and fields: event_type, actor, timestamp_utc, reason, incident_id, device_fingerprint, ip, scope, correlation_id, digital_signature When the user requests JSON export Then a downloadable JSON file is generated within 5 minutes containing an array of the same records and fields, with timestamp_utc in ISO8601 and digital_signature as base64 And exports over 1,000,000 records are segmented into multiple files and queued, with progress and success/failure status visible And exported record counts match the on-screen result count for the same filter
SIEM Integrations via Webhook and Syslog
Given Splunk HEC and Datadog Logs integrations are configured and enabled When a new audit record is appended Then a delivery attempt is made to each enabled destination within 5 seconds (p95) with field mapping preserving all specified fields and correlation_id And webhook deliveries are HMAC-SHA256 signed, include an idempotency key, treat 2xx as success, and retry non-2xx with exponential backoff for up to 24 hours before routing to a dead-letter queue and emitting an alert And syslog output conforms to RFC5424 over TCP+TLS with facility set to security/authorization and embeds the JSON payload And per-destination delivery success rate over a rolling 1-hour window is >= 99.5% excluding destination-reported outages, with metrics exposed for monitoring
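A sketch of the webhook delivery headers and the 24-hour retry budget described above; the header names, backoff base, and cap are assumptions:

```python
import hashlib
import hmac
import uuid


def sign_webhook(secret: bytes, body: bytes) -> dict:
    """Headers for one delivery attempt: an HMAC-SHA256 signature over the
    raw body plus an idempotency key the receiver can use to drop
    duplicates. Header names are illustrative."""
    return {
        "X-Signature": hmac.new(secret, body, hashlib.sha256).hexdigest(),
        "X-Idempotency-Key": str(uuid.uuid4()),
        "Content-Type": "application/json",
    }


def backoff_schedule(base_s: float = 1.0, cap_s: float = 3600.0,
                     budget_s: float = 24 * 3600):
    """Exponential retry delays until the 24h budget is spent; after the
    last delay the delivery would be routed to the dead-letter queue."""
    delays, total, delay = [], 0.0, base_s
    while total + delay <= budget_s:
        delays.append(delay)
        total += delay
        delay = min(delay * 2, cap_s)
    return delays
```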
Correlation ID Linking Across Incident Timeline and User Actions
Given an incident timeline view and the audit log view When a user opens an incident with ID X Then the timeline shows links to all audit records whose correlation_id appears on that incident’s events, and the count matches the audit search by correlation_id When a user clicks a correlation_id in the audit log view Then the app navigates to a filtered audit view showing all related audit records and provides a link to associated incident(s) And correlation_id is present in UI, exports, and SIEM payloads
Configurable Retention and Legal Hold Enforcement
Given an admin sets audit retention to R days When new records are written Then immutability prevents modification or deletion before R days elapse When R is increased Then existing and future records inherit the longer retention immediately When R is decreased Then existing records keep their original (longer) retention and only new records use the shorter period When records reach age R Then they become non-queryable within 24 hours and are permanently purged within 72 hours, with a signed purge report produced And a legal hold prevents purge regardless of age until removed, and all retention and hold changes are themselves audited
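The retention semantics above (raising R extends existing records, lowering R only affects new ones, and legal hold overrides age) fit in a small helper:

```python
def effective_retention_days(retention_at_write: int, current_policy_days: int) -> int:
    """Per the criteria: increasing retention applies to existing records,
    while decreasing it leaves existing records on their original, longer
    period. The max of the two captures both rules."""
    return max(retention_at_write, current_policy_days)


def may_purge(age_days: int, retention_at_write: int,
              current_policy_days: int, legal_hold: bool) -> bool:
    """Legal hold blocks purge regardless of age; otherwise purge once the
    record has outlived its effective retention."""
    if legal_hold:
        return False
    return age_days >= effective_retention_days(retention_at_write, current_policy_days)
```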
Admin Notifications & Alerting
"As a security lead, I want timely alerts about Lifeline session activity so that I can respond quickly and ensure emergency access is closed as soon as it’s no longer needed."
Description

Send real-time notifications to security admins and incident commanders on key Lifeline events: issuance, nearing expiry, recall executed, and SSO recovery detected. Support channels such as email, SMS, Slack/Teams with per-user preferences, quiet hours, localization, and rate limiting. Include actionable details (who, what, scope, time remaining) and deep links to the relevant console view. Provide delivery status and retries with fallback channels.

Acceptance Criteria
Notify on Lifeline Issuance
Given a Lifeline session is issued in OutageKit and Admin Notifications are enabled And the recipient is a security admin or incident commander with active channel preferences When the issuance event is recorded Then a notification is sent to each enabled channel within 30 seconds And the message includes issuer identity, session scope, issuance timestamp, expiration timebox, and a deep link to the Lifeline console view And SMS content is <= 500 characters including link; email/Slack/Teams include full details And sending honors per-user channel enablement and priority order
Notify on Nearing Expiry
Given a Lifeline session’s remaining time drops below the configured nearing-expiry threshold (default 10 minutes) And the recipient has not opted out of nearing-expiry alerts When the threshold is crossed Then exactly one nearing-expiry notification per recipient is sent within 30 seconds And the message includes current time remaining (minutes), session owner, scope, and a deep link And no additional nearing-expiry notifications are sent for the same session unless remaining time increases above the threshold and later crosses below it again
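The "exactly one notification, re-armed only if remaining time rises back above the threshold" rule is an edge trigger; a minimal sketch with illustrative names:

```python
class NearingExpiryAlert:
    """Fires once when remaining time first crosses below the threshold,
    and re-arms only after remaining time rises back above it (e.g. when
    the session is extended), per the criteria above."""

    def __init__(self, threshold_s: int = 600):  # default 10 minutes
        self.threshold_s = threshold_s
        self._armed = True

    def observe(self, remaining_s: int) -> bool:
        """Return True exactly when a nearing-expiry notification should be sent."""
        if remaining_s > self.threshold_s:
            self._armed = True       # time was added back; re-arm the trigger
            return False
        if self._armed:
            self._armed = False
            return True
        return False
```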
Notify on Recall Executed
Given an authorized user executes a global recall of active Lifeline sessions When the recall is confirmed by the system Then notifications are sent to all designated recipients on their enabled channels within 30 seconds And the message includes executor identity, count of sessions recalled, impacted users or groups, timestamp, and a deep link to the recall audit view And recipients only receive one notification per recall action
Notify on SSO Recovery Detected
Given a prior SSO outage resulted in active Lifeline sessions And the platform detects SSO recovery and auto-rebind completes When rebind is successful Then notifications are sent within 60 seconds to recipients per their preferences And the message includes SSO provider, systems rebound, number of sessions auto-terminated, timestamp, and a deep link to SSO health And duplicate notifications for the same recovery event are suppressed
Quiet Hours and Preferences
Given a recipient has defined quiet hours (local timezone) in their profile When a Lifeline notification is triggered during that window Then SMS and voice are suppressed unless the recipient enabled "override quiet hours" And email and Slack/Teams are delivered with silent mode where supported And a single digest summarizing suppressed notifications is delivered within 5 minutes after quiet hours end And all sends honor per-user channel enablement and priority order
Localization of Notifications
Given a recipient’s language and timezone preferences are set When any notification is sent Then the content is localized to the recipient’s language and date/time formats reflect their timezone and locale And if a translation is unavailable, the system falls back to the org default language; if still unavailable, it falls back to English And required placeholders (who, what, scope, time remaining, deep link) render correctly in localized templates
Delivery Reliability, Retries, Fallbacks, and Rate Limiting
Given a notification is dispatched When the primary channel fails or no delivery confirmation is received within 60 seconds Then the system retries up to 3 times with exponential backoff and then attempts the next channel in the recipient’s fallback order And per-channel status (Queued, Sent, Delivered, Failed) is updated in the console within 10 seconds of state change And identical notifications to the same recipient within a 2-minute window are deduplicated And per-recipient rate limiting ensures no more than 5 notifications per 10 minutes per event type; excess are coalesced into a single summary message with counts

SSO Health Sentinel

Continuously monitors IdP health and error rates to auto-offer Lifeline only when thresholds are met, then notifies admins and logs duration, users, and actions. Cuts confusion at login, speeds recovery decisions, and produces clean post-incident evidence.

Requirements

IdP Health Telemetry Collector
"As a tenant admin, I want OutageKit to continuously measure my IdP’s health and error rates so that we can detect SSO degradations promptly and objectively."
Description

Continuously collects and aggregates authentication health metrics from supported IdPs (e.g., Okta, Azure AD, Google Workspace, generic SAML/OIDC), including success/failure rates, error codes, latency, and endpoint availability. Supports polling, synthetic sign-in probes, and webhook/event ingestion where available. Provides rolling-window aggregation (1/5/15 minutes), baseline learning, per-tenant isolation, resilient retries/backoff, and time-series storage with retention policies. Ensures secure handling of credentials/secrets and aligns telemetry with OutageKit’s incident model for downstream actions.

Acceptance Criteria
Multi-IdP Metrics Ingestion and Normalization
Given tenants configured with Okta (REST polling + event webhook), Azure AD (Graph polling), Google Workspace (Admin SDK polling), and generic SAML/OIDC endpoints When the collector runs for 15 minutes and each IdP produces >= 200 authentication outcomes and >= 5 distinct error codes Then per minute per IdP the system persists: success_count, failure_count, success_rate, failure_rate, error_code_counts, latency_p50/p95/p99, endpoint_availability_percent, request_rate And ingestion-to-store p95 latency <= 15s And field names and units are normalized across IdPs per telemetry schema version And missing-minute records == 0
Synthetic Probe Scheduling and Capture
Given a synthetic sign-in probe configured per IdP with schedule 60s and a designated test account When network and IdP are healthy for 10 consecutive minutes Then probes execute every 60s ± 10s and record: outcome, error_code (if any), end-to-end latency_ms, endpoint, idp_type, tenant_id, is_test=true And probe result p95 ingest lag <= 10s And probe credentials are never transmitted to logging; logs contain redacted values only And a forced failure returns the upstream error reason verbatim in probe_error field
Rolling-Window Aggregation Accuracy and Timeliness
Given minute-level inputs for a tenant/IdP When 1, 5, and 15-minute windows roll Then 1-minute metrics equal the last minute's values; 5 and 15-minute counts are sums over the window; rates are successes/total; availability is 1 - (failed_checks/total_checks) And latency percentiles computed for 5 and 15-minute windows are within ±2% absolute of an exact offline reference And each window finalizes within 5s of the minute boundary
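The window math above can be sketched directly from the stated formulas (counts are sums over the window; rates are successes/total; availability is 1 − failed_checks/total_checks). Field names follow the criteria; the input record shape is an assumption:

```python
def aggregate_window(minutes):
    """Aggregate minute records of the assumed shape
    {successes, failures, failed_checks, total_checks}
    into a 5- or 15-minute window per the rules above."""
    succ = sum(m["successes"] for m in minutes)
    fail = sum(m["failures"] for m in minutes)
    failed_checks = sum(m["failed_checks"] for m in minutes)
    total_checks = sum(m["total_checks"] for m in minutes)
    total = succ + fail
    return {
        "success_count": succ,
        "failure_count": fail,
        "success_rate": succ / total if total else None,
        "failure_rate": fail / total if total else None,
        "availability": 1 - failed_checks / total_checks if total_checks else None,
    }
```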
Baseline Learning and Anomaly Scoring
Given 7 days of historical minute metrics with <20% missing data When the baseline job runs at 02:00 UTC Then it outputs per-tenant per-IdP per hour-of-week baselines for success_rate, failure_rate, latency_p95, endpoint_availability with mean and MAD And for current windows it computes z_score for each metric And baseline status is "insufficient_data" if missing data >= 20% And new baselines are versioned and take effect without collector restart
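A sketch of the baseline and anomaly score. The criteria name mean and MAD as the stored statistics; the 1.4826 scaling applied below is the usual consistency factor for MAD-based robust z-scores and is an assumption, as is the epsilon guard for flat baselines:

```python
import statistics


def baseline(values):
    """Baseline stats per the criteria: the mean plus the median absolute
    deviation (MAD) around it, stored per tenant/IdP/hour-of-week."""
    mean = statistics.fmean(values)
    mad = statistics.median(abs(v - mean) for v in values)
    return {"mean": mean, "mad": mad}


def z_score(value, bl):
    """Robust z-score of a current window against its baseline."""
    # 1.4826 makes MAD comparable to a standard deviation under normality
    # (an assumed convention); the epsilon guards a zero-MAD flat baseline.
    scale = 1.4826 * bl["mad"] if bl["mad"] else 1e-9
    return (value - bl["mean"]) / scale
```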
Per-Tenant Isolation and Data Partitioning
Given tenants A and B with separate API keys and secrets When events for both are ingested and queried using tenant A credentials Then zero records from tenant B are returned And all stored records include tenant_id and idp_type tags And synthetic probes for A never use credentials from B And an access attempt from tenant B to A's secrets is denied and audited
Resilient Retries and Backoff with Circuit Breaker
Given an IdP API returning 429 and intermittent 5xx for 5 minutes When the collector polls the API Then it retries up to 6 attempts per request with exponential backoff starting at 500ms, capped at 60s, with ±20% jitter And after 5 consecutive failures it opens a circuit for 120s, marks endpoint_unavailable=true, and emits retry metrics And no duplicate records are stored; once recovery occurs, cumulative counts match the IdP over the test window
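The retry and circuit-breaker parameters above can be sketched as follows; the `rng` and `clock` injection points exist only for testability and are not part of the spec:

```python
import random
import time


def backoff_delays(attempts: int = 6, base_s: float = 0.5, cap_s: float = 60.0,
                   jitter: float = 0.2, rng=random.random):
    """Per-request retry delays: exponential from 500ms, capped at 60s,
    with ±20% jitter, up to 6 attempts."""
    return [min(base_s * (2 ** i), cap_s) * (1 + jitter * (2 * rng() - 1))
            for i in range(attempts)]


class CircuitBreaker:
    """Opens after 5 consecutive failures and stays open for 120s, after
    which one half-open attempt is allowed through."""

    def __init__(self, threshold: int = 5, open_s: float = 120.0, clock=time.monotonic):
        self.threshold, self.open_s, self._clock = threshold, open_s, clock
        self._failures, self._opened_at = 0, None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            self._opened_at, self._failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self.threshold:
                self._opened_at = self._clock()
```

While the circuit is open the collector would mark endpoint_unavailable=true and skip polling, which also prevents duplicate records during the outage.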
Time-Series Storage, Retention, and Incident Model Alignment
Given time-series storage configured with 35-day retention for raw minutes and 180-day retention for 5/15-minute aggregates When data is ingested for 40 days Then raw-minute data older than 35 days is purged while 5/15-minute aggregates remain until 180 days, with zero orphaned indexes And each stored series conforms to telemetry schema v1: {tenant_id, idp_type, window, ts, metrics, source} And when success_rate_5m z_score <= -3 or endpoint_unavailable=true for >= 2 consecutive minutes Then the collector emits a normalized IncidentSignal(IdPHealthDegradation) with correlation_id, severity, affected_users_estimate, and idempotency_key, and it validates against the OutageKit incident model schema
Threshold Rules & Hysteresis Engine
"As a security admin, I want to define precise thresholds and policies for SSO health so that fallback actions only trigger when they are truly warranted."
Description

Configurable per-tenant policies that evaluate IdP telemetry against thresholds (e.g., error rate > X% over Y minutes, latency > Z ms) to determine degraded/outage states. Includes hysteresis and cool-downs to prevent flapping, maintenance window suppression, environment scoping (prod/non-prod), and multi-IdP awareness. Policies map to actions (offer Lifeline, notify admins, open incident) and support simulation/dry-run mode with auditability and versioning.

Acceptance Criteria
Degraded state on sustained IdP error-rate breach
Given tenant T has a policy: if IdP A error_rate > 20% for 5 consecutive minutes, set state=Degraded and actions=[Offer Lifeline, Notify Admins, Open Incident Sev3] And telemetry indicates IdP A authentication failures averaging 25%+ for 5 consecutive minutes When the evaluation cycle runs Then the engine marks IdP A state=Degraded for tenant T within 60 seconds of the 5-minute window closing And offers Lifeline on the next login attempt for tenant T users mapped to IdP A And sends admin notifications via configured channels within 60 seconds including {tenant_id, idp=A, state=Degraded, rule_id, threshold=20%/5m, observed=25%, correlation_id} And opens a Sev3 incident linked to correlation_id And writes an audit record with {policy_version, rule_id, start_time, state=Degraded, actions_emitted=[lifeline, admin_notify, incident_open]}
Outage state on high latency breach
Given tenant T has a policy: if IdP A p95_latency > 1500ms for 3 consecutive minutes, set state=Outage and actions=[Offer Lifeline, Notify Admins, Open Incident Sev2] And telemetry shows p95_latency for IdP A is ≥1600ms for 3 consecutive minutes When the evaluation cycle runs Then the engine marks IdP A state=Outage within 60 seconds of the 3-minute window closing And offers Lifeline to affected users at login And sends admin notifications containing {tenant_id, idp=A, state=Outage, metric=p95_latency, threshold=1500ms/3m, observed=1600ms, correlation_id} And opens a Sev2 incident associated with correlation_id And emits no duplicate notifications if the state remains Outage on subsequent cycles
Hysteresis and cool-down prevent state flapping
Given tenant T has thresholds: enter Degraded when error_rate > 20% for 5m; recover when error_rate < 10% for 10m; cool_down=15m And IdP A error_rate fluctuates between 18–22% minute-by-minute for 8 minutes When the evaluation cycle runs each 30 seconds Then the engine does not enter Degraded until the error_rate has been >20% for a contiguous 5-minute window And once Degraded, the engine remains Degraded until error_rate <10% for 10 consecutive minutes And only one set of actions is emitted on entry to Degraded and one on recovery And after recovery, the engine will not re-enter Degraded for at least 15 minutes even if error_rate briefly exceeds 20%
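The enter/recover/cool-down rules can be modeled as a small minute-granularity state machine; the class and state names are illustrative, and a production engine would evaluate on its 30-second cycle against windowed telemetry rather than per-minute samples:

```python
class HysteresisRule:
    """Enter Degraded after enter_m consecutive minutes above enter_pct;
    recover after exit_m consecutive minutes below exit_pct; then block
    re-entry for cooldown_m minutes, per the criteria above."""

    def __init__(self, enter_pct=20.0, enter_m=5, exit_pct=10.0, exit_m=10,
                 cooldown_m=15):
        self.enter_pct, self.enter_m = enter_pct, enter_m
        self.exit_pct, self.exit_m = exit_pct, exit_m
        self.cooldown_m = cooldown_m
        self.state = "Healthy"
        self._above = self._below = self._cooldown_left = 0

    def tick(self, error_rate_pct: float) -> str:
        """Feed one minute of error-rate telemetry; return the current state."""
        if self._cooldown_left > 0:
            self._cooldown_left -= 1
        if self.state == "Healthy":
            self._above = self._above + 1 if error_rate_pct > self.enter_pct else 0
            if self._above >= self.enter_m and self._cooldown_left == 0:
                self.state, self._above, self._below = "Degraded", 0, 0
        else:
            self._below = self._below + 1 if error_rate_pct < self.exit_pct else 0
            if self._below >= self.exit_m:
                self.state, self._cooldown_left = "Healthy", self.cooldown_m
                self._above = self._below = 0
        return self.state
```

Actions (offer Lifeline, notify, open incident) would be emitted only on the Healthy→Degraded and Degraded→Healthy transitions, which is what keeps them single-shot.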
Maintenance window suppression
Given IdP A has a maintenance window configured for 2025-08-12T01:00Z–02:00Z with suppress_actions=true and record_observations=true And during that window the error_rate rises to 90% for 10 minutes When the evaluation cycle runs Then the engine records the breach as suppressed with reason=maintenance and retains observed metrics And it does not change state or emit actions (no Lifeline, no admin notifications, no incident) And if the breach persists ≥5 minutes after 02:00Z, the engine evaluates normally and applies actions within 60 seconds
Environment scoping per tenant
Given tenant T has IdPs A_prod and A_nonprod and a policy scoped to environment=prod When A_nonprod exceeds the error-rate threshold Then no state transition or actions occur for tenant T And an audit entry is recorded with decision=skipped and reason=environment_scope When A_prod exceeds the same threshold Then the engine applies the state transition and actions per policy And all notifications and audit records include environment=prod
Multi-IdP awareness and action scoping
Given tenant T uses IdP A and IdP B with policies: (1) if either IdP is Degraded or Outage, actions=[Offer Lifeline, Notify Admins] scoped to that IdP; (2) if both IdP A and IdP B are Outage for ≥2 minutes, actions=[Open Incident Sev1] And IdP A is Degraded while IdP B is Outage When the evaluation cycle runs Then the engine offers Lifeline to users authenticating via A and via B respectively, with context indicating the affected IdP And sends separate admin notifications for A and B with their respective states And does not open the composite Sev1 incident When both IdP A and IdP B are Outage for 2 consecutive minutes Then the engine opens exactly one Sev1 incident for tenant T with references to both IdPs and correlation_id
Simulation mode with auditability and versioning
Given policy version v3 (draft) is set to simulation mode with proposed changes (error_rate threshold from 20% to 15%) And a telemetry replay is configured for 2025-08-01T00:00Z–2025-08-02T00:00Z When the simulation runs Then the engine evaluates state transitions and actions as if v3 were active but emits no real actions And writes audit entries for each hypothetical transition with {mode=simulation, policy_version=v3, rule_id, from_state, to_state, actions_would_emit, affected_user_count} And produces a comparison report vs active version v2 including deltas for {transitions_count, incidents_would_open, notifications_would_send} And stores the simulation artifact with run_id, checksum, and export URL
Adaptive Lifeline at Login
"As an operations manager, I want a clear fallback login option to appear only during SSO incidents so that I can access OutageKit quickly without confusing users during normal operation."
Description

Dynamically offers a limited-scope fallback authentication (e.g., email/SMS OTP, magic link, backup codes) on OutageKit login screens only when thresholds are breached, with clear user messaging and default suppression when SSO is healthy. Enforces RBAC-limited access during Lifeline sessions, configurable eligibility (roles/IPs), rate limiting, CAPTCHA, and session timeouts. Captures telemetry on offer/accept/decline events and integrates with branding, localization, and accessibility standards.

Acceptance Criteria
Auto-Offer Lifeline When IdP Degraded; Suppress When Healthy
- Given an IdP error threshold of 5% over a rolling 2-minute window is configured and the threshold is breached, When a user visits the OutageKit login page within 60 seconds of breach detection, Then the page displays a Lifeline panel with the configured fallback methods and a clear explanation, And the SSO button remains available but de-emphasized.
- Given the IdP error threshold is not breached for 2 consecutive minutes, When a user visits the login page, Then the Lifeline panel is not rendered.
- Given the IdP status transitions from breached to healthy, When 60 seconds have elapsed since recovery detection, Then the Lifeline panel is no longer shown on new sessions.
- Given the feature flag for Lifeline is disabled, When a user visits the login page, Then the Lifeline panel is never shown regardless of IdP status.
- Given any caching layer in front of the login page, When IdP status changes, Then the Lifeline visibility decision is evaluated server-side per request and not served stale from cache.
Eligibility Controls (Roles and IPs)
- Given eligibility rules are configured with a roles allowlist ["Ops Manager","NOC"] and an IP allowlist ["10.0.0.0/8","192.168.0.0/16"], When a user enters an identifier and the system resolves the user's role and request IP, Then the Lifeline panel is shown only if the user matches at least one allowed role and the IP matches an allowed CIDR.
- Given a user does not meet eligibility, When thresholds are breached, Then the Lifeline panel remains hidden and a generic message indicates SSO is required without revealing role/IP details.
- Given both allowlist and blocklist are configured, When evaluating eligibility, Then blocklist rules take precedence over allowlist rules.
- Given eligibility cannot be resolved for a user identifier, When thresholds are breached, Then the system defaults to not offering Lifeline and logs an eligibility_resolution_failed telemetry event.
RBAC-Limited Access During Lifeline Sessions
- Given a user authenticates via Lifeline, When the session is established, Then the session is tagged with auth_method=lifeline and a restricted RBAC scope is applied.
- Given a lifeline session accesses a permitted read-only endpoint (e.g., incident dashboard), When the request is made, Then the response is 200.
- Given a lifeline session attempts an admin or write operation (e.g., change RBAC, delete data, modify integrations), When the request is made, Then the response is 403 and the attempt is audited with reason=insufficient_scope.
- Given UI elements for restricted features, When rendered during a lifeline session, Then controls are hidden or disabled and show a tooltip indicating limited access.
- Given an API token is requested during a lifeline session, When the request is made, Then token issuance is denied with 403.
Abuse Protections: Rate Limiting and CAPTCHA
- Given OTP-based Lifeline is enabled, When a user requests OTP delivery, Then allow a maximum of 5 sends per 15 minutes per account identifier and 20 sends per hour per source IP, otherwise return 429 with a generic throttle message.
- Given OTP verification attempts, When a user submits codes, Then allow a maximum of 10 attempts per 30 minutes per account identifier, otherwise return 429 and require a CAPTCHA on the next attempt.
- Given consecutive OTP verification failures ≥ 3, When the next attempt is made, Then present a CAPTCHA challenge that must be solved before accepting the code.
- Given magic-link Lifeline is enabled, When a magic link is issued, Then the link is single-use with a 10-minute expiration and is invalidated immediately upon use.
- Given backup codes are used, When a valid backup code is redeemed, Then decrement the remaining count and prevent reuse of the same code.
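The per-identifier throttles above (e.g., 5 OTP sends per 15 minutes per account) are naturally a sliding-window rate limiter; an in-memory sketch, assuming a shared store such as Redis in production and a caller that maps a False result to a 429:

```python
import collections


class SlidingWindowLimiter:
    """Sliding-window limiter, e.g. limit=5, window_s=900 for OTP sends per
    account identifier. now_s is injected for testability; production would
    keep the windows in a shared store so all nodes enforce the same limit."""

    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self._events = {}  # key -> deque of event timestamps

    def allow(self, key: str, now_s: float) -> bool:
        q = self._events.setdefault(key, collections.deque())
        while q and now_s - q[0] >= self.window_s:
            q.popleft()                  # drop events outside the window
        if len(q) >= self.limit:
            return False                 # caller responds 429
        q.append(now_s)
        return True
```

Separate limiter instances would cover the per-IP (20/hour) and verification-attempt (10 per 30 minutes) rules with their own limit/window pairs.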
Session Management and Timeouts for Lifeline
- Given a lifeline session is active, When there is 10 minutes of inactivity, Then the session expires and the user is redirected to the login page with a session_expired message.
- Given a lifeline session is active, When 30 minutes have elapsed since session creation, Then the session expires regardless of activity (absolute timeout) and the user is redirected to login.
- Given the IdP has recovered to healthy, When a lifeline user initiates a new session after recovery, Then only SSO is presented and Lifeline is not offered.
- Given a web lifeline session approaches absolute timeout, When 60 seconds remain, Then the UI displays a non-modal countdown warning.
- Given a lifeline session is expired, When the user attempts any API call, Then the response is 401 with a WWW-Authenticate header indicating lifeline_session_expired.
Telemetry and Audit for Offer/Accept/Decline
- Given the login page renders, When the Lifeline panel is shown, Then emit a telemetry event lifeline_offer with fields: event_id, timestamp_ms, request_ip, user_agent, idp_status_snapshot, eligibility_state, correlation_id.
- Given a user selects a Lifeline method, When they complete authentication successfully, Then emit lifeline_accept with fields: user_id (or hashed_identifier if pre-auth), method, duration_ms, attempts_count, success=true, and write an audit log entry linked by correlation_id.
- Given a user declines Lifeline, When they continue with SSO, Then emit lifeline_decline with fields: hashed_identifier, reason (user_opted_for_sso), and no authentication is established.
- Given errors occur during Lifeline, When an error is shown to the user, Then emit lifeline_error with fields: code, message_key, retryable, and increment a metrics counter per code.
- Given telemetry is emitted, When 5 minutes have elapsed, Then the events are queryable in the admin audit UI and exportable via API endpoint /v1/audit with filters for time range, method, and outcome.
UX Compliance: Messaging, Branding, Localization, Accessibility
- Given Lifeline is offered, When the panel renders, Then the message uses plain language describing the issue and the limited-scope access, and includes a link to learn more.
- Given product branding is configured, When the Lifeline panel renders, Then typography, colors, and logo match the active theme tokens.
- Given localization files for en, es, and fr are installed, When the browser Accept-Language matches one of these, Then all Lifeline UI strings are displayed in that language with a fallback to en for missing keys.
- Given accessibility requirements, When tested with keyboard-only navigation and a screen reader, Then all interactive elements are reachable in a logical order, have ARIA labels, and meet WCAG 2.1 AA color contrast (≥ 4.5:1 for text).
- Given error messages are displayed, When validation fails (e.g., bad OTP), Then focus moves to the error, the error is announced by screen readers, and the message does not reveal sensitive details.
Admin Alerts & Escalations
"As an on-call admin, I want timely, actionable alerts about SSO degradation and recovery so that I can coordinate response and minimize disruption."
Description

Sends actionable, deduplicated notifications to configured channels (email, SMS, Slack/Teams, PagerDuty, webhooks) when IdP health degrades and when it recovers. Includes severity mapping, quiet hours, on-call schedules, and acknowledgment with auto-snooze. Messages contain current metrics, affected users/regions, Lifeline adoption, runbook links, and incident references. Supports per-tenant contact groups and localization.

Acceptance Criteria
IdP degradation triggers deduplicated multi-channel alert
Given the tenant’s IdP error rate or health metric crosses a configured severity threshold for the configured evaluation window When SSO Health Sentinel detects the threshold breach Then send an actionable alert to all enabled channels (email, SMS, Slack/Teams, PagerDuty, webhook) within 60 seconds of detection And emit no more than one alert per channel per incident key within the configured deduplication window And include a stable incident_id/dedupe_key that remains constant until recovery And deliver a webhook payload containing incident_id, tenant_id, severity, started_at (ISO-8601), metrics_snapshot, affected_segments, lifeline_adoption, and runbook_links And Slack/Teams messages include an actionable Acknowledge control and a View Runbook link And email subject lines contain the severity tag and incident_id
Recovery event closes incident and notifies recipients
Given an incident is active and the IdP metrics remain below the recovery threshold for the configured recovery window When SSO Health Sentinel determines recovery Then send a recovery/resolve notification to the same channels and contact groups as the originating alert within 60 seconds And include incident_id, total duration, recovery_time, peak metrics, affected user/region summary, and final lifeline_adoption And emit a PagerDuty resolve event correlating to the original trigger using the same routing key/dedup key And update the webhook payload with status='recovered' and close_reason='metrics_normalized' And suppress any further notifications for this incident after the recovery message is sent
Quiet hours and on-call routing enforcement
Given quiet hours and an on-call schedule are configured for the tenant When a new incident is detected during quiet hours with severity below Critical Then notify only the current on-call contact(s) per schedule and suppress non-urgent channels until quiet hours end And send a single quiet-hours digest summarizing suppressed alerts within 5 minutes after quiet hours end And when severity is Critical, bypass quiet hours and notify all critical channels immediately And all timestamps in notifications respect the tenant’s timezone configuration
Acknowledgment and auto-snooze behavior
Given an alert has been sent to one or more channels When an authorized admin acknowledges via Slack/Teams action, email link, SMS reply ('ACK'), or PagerDuty acknowledgement Then cease further notifications for that incident to the acknowledged contact group(s) and record ack user, channel, and timestamp in the audit log And start the configured auto-snooze timer for the incident And if metrics worsen to a higher severity tier or the snooze expires without recovery, send a renewed notification with updated severity and an escalation note And if the acknowledgment is cleared ('UNACK') before recovery, resume notifications according to the escalation policy
Severity mapping and time-based escalation
Given a tenant-defined severity mapping and escalation policy are configured When IdP metrics meet a mapped threshold Then compute severity according to the mapping and route to channels/policies as configured (e.g., Sev-1 => PagerDuty high-urgency + SMS; Sev-2 => Slack/Teams + email) And include the severity indicator in the message title/subject and payload And if no acknowledgment is received within the configured escalation timeout, escalate to the next responder or channel tier And do not exceed the configured maximum escalation depth, logging each escalation step with timestamp and target
Per-tenant contact groups and localization
Given per-tenant contact groups and preferred locales are configured When an alert or recovery notification is generated Then deliver only to recipients in the tenant’s selected contact group(s), with no cross-tenant leakage And localize message content to each recipient’s locale with fallback to English if a translation is unavailable And format dates/times and numbers according to locale and tenant timezone settings And verify that sample deliveries in at least two locales (e.g., en-US, es-ES) contain equivalent information and working links
Message content completeness and schema compliance
Given a notification (alert or recovery) is to be sent When composing the message for any channel Then include current metrics (error rate %, auth latency), affected users/regions summary, lifeline adoption %, runbook link(s) (HTTP 200), incident reference ID/URL, dedupe key, next check ETA, and support contact And reject the send if any required field is missing, retrying up to 3 times with exponential backoff and logging the failure And ensure webhook payload conforms to JSON schema version v1.2 and is HMAC-signed; receivers can validate the signature with the shared secret And ensure Slack/Teams formatting renders actionable controls and no markdown/HTML escapes are shown to end users
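The HMAC signing requirement above can be sketched for webhook receivers as follows. The spec only mandates that the payload is HMAC-signed and verifiable with the shared secret; the hash algorithm (SHA-256), hex encoding, and header conventions here are assumptions.

```python
# Minimal sketch of HMAC signing/verification for the v1.2 webhook payload,
# assuming SHA-256 over the raw request body with a hex-encoded digest.
import hashlib
import hmac
import json

def sign_payload(body: bytes, secret: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_signature(body: bytes, received_sig: str, secret: bytes) -> bool:
    expected = sign_payload(body, secret)
    # compare_digest avoids timing side channels on signature comparison
    return hmac.compare_digest(expected, received_sig)

secret = b"shared-secret"  # illustrative; in practice, a per-tenant secret
body = json.dumps({"schema": "v1.2", "status": "recovered",
                   "close_reason": "metrics_normalized"}).encode()
sig = sign_payload(body, secret)
assert verify_signature(body, sig, secret)
assert not verify_signature(body + b" ", sig, secret)  # any mutation fails
```

Receivers should always verify over the raw bytes they received, not a re-serialized copy, since JSON re-serialization can reorder keys and change whitespace.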
Incident Auto-Logging & Evidence
"As a compliance officer, I want an immutable, exportable audit of the SSO incident so that post-incident reviews and audits have reliable evidence."
Description

Automatically creates and updates an OutageKit incident when thresholds are met, capturing start/end times, severity changes, impacted authentication flows, and correlation to external IdP status pages. Records user-level events (attempts, errors, Lifeline usage) with PII minimization, immutable audit trail, and export (PDF/CSV/JSON). Provides post-incident timeline, metrics charts, and admin action logs to support compliance and root-cause analysis.

Acceptance Criteria
Auto-Create Incident on IdP Threshold Breach
Given SSO Health Sentinel detects an IdP auth error rate ≥ the configured threshold for the configured duration window, When this condition is first met and no active incident exists for the same tenant+IdP, Then an incident is created within 120 seconds with fields: incident_id, tenant_id, idp_identifier, start_time (UTC ISO 8601), initial_severity, detection_method=Sentinel, and a threshold_snapshot. Given a matching active incident exists for the same tenant+IdP, When additional threshold breaches occur, Then no duplicate incident is created and the existing incident is updated with latest metrics and breach windows. Given the incident is created, Then it appears in the Incident list with status=Active and tag="SSO Health Sentinel" and is queryable via API by incident_id.
Incident Progression, End Time, and Closure
Given an incident is Active, When the observed error rate remains below the configured recovery threshold for the configured recovery duration, Then the incident end_time is set to the first minute of the recovery window and the status transitions to Resolved within 120 seconds. Given an incident transitions through severity bands according to configured rules, When thresholds are crossed up or down, Then the severity_change is appended to the incident timeline with timestamp, old_severity, new_severity, and rationale. Given a Resolved incident, When a new threshold breach occurs within the configured cooling period, Then the incident is reopened (same incident_id), end_time is cleared, and a timeline entry records the reopen event; otherwise a new incident is created.
Impacted Authentication Flows Identification
Given an incident is Active, When auth traffic is processed, Then per flow type (e.g., Browser SSO, API token exchange, MFA challenge, Passwordless/OTP) the system records per-minute attempts, failures, success rate, and top error codes, and attaches these metrics to the incident. Given per-minute flow metrics are recorded, Then totals in the incident detail match backend counters within ±1% for the same period and flow type. Given flows with zero traffic, When an incident is Active, Then the UI/API explicitly reports zero values (not null) for those flows for the affected time buckets.
External IdP Status Correlation and Evidence Linkage
Given an incident is Active, When the configured IdP status page/API is reachable, Then the system polls at least every 5 minutes and stores timestamped snapshots relevant to the IdP components. Given status snapshots exist, When an IdP incident overlaps the OutageKit incident window by at least 10 minutes, Then a correlation record is attached with external incident id/url, first_seen, last_seen, and component mapping. Given correlation is attached, Then the incident detail displays the correlation and the export includes the linked evidence; if the status page is unavailable, a poll_failure event with error details is recorded at most once per 5 minutes.
User-Level Event Recording with PII Minimization
Given auth attempts occur during an incident, When user-level events are logged, Then user identifiers are stored as tenant-scoped salted hashes, emails/phones are masked (e.g., a****@d***.com, +1-***-***-1234), and IPs are truncated (/24 IPv4, /64 IPv6); no raw PII is stored. Given user-level events are recorded, Then each event contains: event_time (UTC ISO 8601), flow_type, outcome, error_code (if any), idp_identifier, incident_id, and lifeline_used flag; required fields presence ≥ 99.9% for events linked to the incident. Given retention is configured, When events exceed the retention period, Then PII-minimized records are purged according to policy while aggregate incident metrics remain; all purges are appended to the audit trail.
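The PII-minimization rules above (tenant-scoped salted hashes, masked emails/phones, truncated IPs) can be sketched like this. The masking shapes follow the examples in the criteria; the hashing scheme and salt handling are illustrative assumptions.

```python
# Hedged sketch of PII minimization: no raw identifiers survive storage.
import hashlib
import ipaddress

def hash_user_id(user_id: str, tenant_salt: str) -> str:
    # Tenant-scoped salt means the same user hashes differently per tenant,
    # preventing cross-tenant correlation.
    return hashlib.sha256((tenant_salt + user_id).encode()).hexdigest()

def mask_email(email: str) -> str:
    # e.g. alice@domain.com -> a****@d***.com, per the criteria's example
    local, domain = email.split("@", 1)
    tld = domain.rsplit(".", 1)[-1]
    return f"{local[0]}****@{domain[0]}***.{tld}"

def mask_phone(phone: str) -> str:
    # keep last four digits only; country-code formatting is illustrative
    return f"+1-***-***-{phone[-4:]}"

def truncate_ip(ip: str) -> str:
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 64   # /24 IPv4, /64 IPv6 per spec
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

print(mask_email("alice@domain.com"))   # a****@d***.com
print(truncate_ip("203.0.113.42"))      # 203.0.113.0/24
```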
Immutable Audit Trail with Tamper Evidence
Given incident and event records are written, When stored, Then they are appended-only and each entry includes a content hash and previous_hash forming a verifiable chain; daily root hashes are generated and stored for verification. Given an authorized user attempts to edit or delete an existing audit entry, Then the system denies mutation and records a denied_mutation event with actor, time, and reason; redactions are allowed only as append-only tombstones with scope and rationale. Given the verify endpoint is called for an incident, Then it returns verification_status=OK and the first mismatched index (if any); internal verification of the last 10,000 entries completes within 5 seconds for the test dataset.
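The tamper-evident chain above (content hash plus previous_hash per entry, with verification returning the first mismatched index) can be sketched as follows. Field names and the genesis value are illustrative assumptions; daily root hashes are omitted for brevity.

```python
# Minimal sketch of an append-only hash chain: mutating any stored entry
# breaks the chain at that index, which verification reports.
import hashlib
import json

GENESIS = "0" * 64  # illustrative previous_hash for the first entry

def entry_hash(content: dict, previous_hash: str) -> str:
    payload = json.dumps(content, sort_keys=True) + previous_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append(chain: list, content: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"content": content, "previous_hash": prev,
                  "hash": entry_hash(content, prev)})

def verify(chain: list):
    """Return (True, None) if intact, else (False, first mismatched index)."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        if entry["previous_hash"] != prev or \
           entry["hash"] != entry_hash(entry["content"], prev):
            return False, i
        prev = entry["hash"]
    return True, None

chain = []
append(chain, {"event": "incident_created", "incident_id": "INC-1"})
append(chain, {"event": "severity_change", "old": "Sev-2", "new": "Sev-1"})
assert verify(chain) == (True, None)
chain[0]["content"]["event"] = "tampered"   # simulate an illegal mutation
assert verify(chain) == (False, 0)          # detected at index 0
```

Because each hash covers the previous hash, an attacker who edits one entry must recompute every later hash, and the stored daily root hashes make even that detectable.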
Evidence Exports (PDF/CSV/JSON) with Timeline, Metrics, and Admin Actions
Given an incident exists, When an export is requested via UI or API, Then JSON, CSV, and PDF exports are available and generated within 15 seconds for incidents up to 100,000 events. Given exports are generated, Then they include: incident summary (id, tenant, IdP, start/end, severity history), timeline (threshold breaches, severity changes, reopen/resolve), flow metrics, correlation links and snapshots metadata, admin action logs, and audit verification hash. Given data minimization is required, Then exports contain only masked PII and comply with the PII policy; timestamps are UTC ISO 8601; CSV row counts equal the number of exported events; JSON validates against schema v1.0; PDFs render charts and include page numbers and a footer with incident_id and generation_time.
Tenant Config UI & API
"As a tenant owner, I want an intuitive UI and API to configure SSO Health Sentinel so that setup and ongoing adjustments are safe, fast, and auditable."
Description

Provides a secure UI and REST API for configuring IdP connections, threshold policies, Lifeline methods, admin contacts, and escalation rules. Includes credentials vaulting, field validation, test connections, preview/simulation of policies, role-based permissions, audit logs for configuration changes, and versioned rollback. Offers templates for common IdPs and integrates with existing OutageKit tenant and notification settings.

Acceptance Criteria
IdP Connection Setup via UI/API with Templates and Test Connection
Given a user with Config Admin role selects an "Okta" template in Tenant Config UI, When the IdP setup form opens, Then client ID, issuer URL, scopes, and redirect URI fields are pre-populated per template and remain editable. Given required fields (issuer URL, client ID, client secret) are populated with valid formats, When the user clicks "Validate & Test", Then field validation passes and a live test to the IdP completes within 10 seconds with status "Connected". Given invalid issuer URL or mismatched redirect URI, When "Validate & Test" is run, Then the form blocks save and displays inline errors specifying the failed field and reason. Given the same configuration is submitted via REST POST /tenants/{tenantId}/idp with valid payload, When called with a bearer token with scope idp:write, Then the API returns 201 Created with resource id and testConnection.status="Connected". Given an attempt to create a second active IdP of the same protocol without a unique name, When saving, Then the system rejects with 409 Conflict and message "IdP name must be unique per tenant".
Threshold Policy Definition and Simulation
Given a Config Admin opens Threshold Policies, When creating a policy with metric=login_error_rate, window=5m, threshold>=5%, min_samples=200, Then the UI enforces numeric ranges and units and prevents save if any value is missing or out of bounds. Given a valid policy is saved, When the user clicks "Simulate (last 24h)", Then the system renders a timeline highlighting predicted Lifeline activations with start/end times and counts, and shows "0 changes applied" to confirm read-only simulation. Given an API request POST /tenants/{id}/sso-policies/simulate with historicalRange=PT24H, When executed, Then the response includes activations[] with reason, start, end, and peakErrorRate, and does not change active policies. Given multiple policies exist, When they overlap, Then evaluation order follows explicit priority and the UI displays priority and conflict resolution "first-match wins".
Lifeline Methods Configuration and Eligibility Rules
Given a Config Admin selects Lifeline methods "Email OTP" and "Backup Admin Link", When saving, Then only the selected methods are enabled and ordered per drag-and-drop. Given Lifeline method Email OTP is enabled, When "Send test" is used for a specified user email, Then the user receives a one-time code within 60 seconds and the test result shows delivery provider response. Given a rule "Offer Lifeline only when policy X is active and user is in group 'Ops'", When simulated against a sample user not in group, Then the preview indicates "Not eligible" with evaluated rule conditions. Given API PUT /tenants/{id}/lifeline with a payload that validates against the methods and rules schema, When saved, Then response is 200 OK and subsequent GET returns the same configuration idempotently.
Admin Contacts and Escalation Rules Integrated with OutageKit Notifications
Given a Config Admin defines Level 1 On-call (SMS, Email) and Level 2 Duty Manager (Voice), When saving, Then contacts are validated against the OutageKit tenant contact directory and deduplicated by channel. Given an escalation rule "notify L1 immediately, escalate to L2 if unresolved after 15 minutes", When "Send test alert" is triggered, Then L1 receives SMS and Email within 60 seconds, and L2 receives a Voice call after 15 minutes unless the test is acknowledged. Given API GET /tenants/{id}/notification-settings, When called, Then it reflects the configured contacts and escalation rules used by SSO Health Sentinel. Given a contact is disabled in the OutageKit directory, When saving rules referencing that contact, Then the UI blocks save with a clear error and remediation link.
Credentials Vaulting and Secret Redaction
Given a client secret is entered and saved, When the form is re-opened, Then the secret field is redacted (••••) and the plaintext is not retrievable via UI. Given the IdP configuration is retrieved via API GET, When the response is returned, Then secret values are redacted and a flag "hasSecret": true indicates presence without exposing value. Given storage at rest, When inspecting the database or backups, Then secrets are encrypted using tenant-scoped keys managed by KMS and key rotations are logged; direct plaintext is not stored. Given "Rotate secret" action is invoked, When a new secret is submitted and tested successfully, Then the old secret is retired and an audit log records the rotation event without plaintext exposure.
Role-Based Permissions and API Scopes
Given a user with role "Config Admin", When accessing the Tenant Config UI and idp:write API, Then they can create, edit, test, and rollback configurations. Given a user with role "Viewer", When accessing the same, Then they can view configurations, run simulations, and export audit logs but cannot save; save controls are disabled and API write attempts return 403 Forbidden. Given a service account token with scopes idp:read and audit:read, When calling GET endpoints, Then responses succeed and POST/PUT/DELETE return 403 Forbidden. Given permission changes are applied, When the user refreshes the UI or obtains a new token, Then permissions take effect immediately and the change is captured in the audit log.
Audit Logs and Versioned Rollback
Given any configuration change via UI or API, When saved, Then an immutable audit entry is created capturing who, when, source (UI/API), IP, resource type, before/after diff, and optional reason, and is queryable by time range. Given an administrator selects a prior version and clicks "Rollback", When confirmed, Then a new version is created with the previous values, becomes active within 60 seconds, and an audit entry records the rollback linkage. Given concurrent edits, When a stale version is submitted, Then the system detects a version mismatch and rejects with 409 Conflict requiring refresh. Given an audit export is requested for a date range, When generated, Then a CSV or JSON file is downloadable with all fields including version IDs and diff summaries.

Rules Studio

Visual policy builder for credit calculation with tier multipliers, thresholds, grace periods, caps, and disaster exemptions. Versioned rules let you test changes on historical events before going live, preventing bill shock and rework. Clear previews show per-customer outcomes and total liability so operations and compliance agree before a single dollar moves.

Requirements

Drag-and-Drop Rule Builder
"As an operations manager, I want to assemble credit rules visually so that I can encode complex policies without code and reduce errors and rework."
Description

A node-based visual composer that lets users assemble credit policies using building blocks such as thresholds, tiered multipliers, grace periods, caps, customer class conditions, service territories, and disaster exemptions. The builder validates rule graphs in real time, prevents contradictory clauses, and converts the visual model into an executable, versioned DSL. Reusable sub-flows and templates accelerate policy creation across jurisdictions. Tight integration with OutageKit incident data (duration, affected accounts, cluster severity) enables on-canvas test inputs and instant preview of computed credits while designing.

Acceptance Criteria
Compose Tiered Multiplier Rule with Grace Period and Cap
Given a blank policy and a base credit rate of $10/hour And a Grace Period node set to 30 minutes And Threshold nodes at 2h and 6h with tier multipliers: <2h = 0x, 2–6h = 1.0x, >6h = 2.0x And a Cap node set to $200 When the nodes are connected to form a single path to Output Then the graph validates with no errors and Publish is enabled When a test incident duration of 7h (Residential) is entered on-canvas Then the previewed credit equals $50.00 When the Cap is changed to $45 Then the previewed credit equals $45.00 When the Grace Period is changed to 2h Then the previewed credit equals $30.00
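The arithmetic in the scenario above is internally consistent if the grace period is subtracted from the outage duration before the tier bands are applied to the net duration. A sketch under that assumption (band edges, rates, and the banded accumulation are taken from the scenario; the function shape itself is illustrative):

```python
# Hypothetical tiered credit calculation: credit = sum over tier bands of
# (hours in band) * base_rate * multiplier, computed on the duration net of
# the grace period, then capped.

def compute_credit(duration_h, base_rate=10.0, grace_h=0.5,
                   tiers=((2.0, 0.0), (6.0, 1.0), (float("inf"), 2.0)),
                   cap=200.0):
    net = max(0.0, duration_h - grace_h)   # grace period reduces billable hours
    credit, lower = 0.0, 0.0
    for upper, mult in tiers:              # tiers as (upper_bound_h, multiplier)
        band_hours = max(0.0, min(net, upper) - lower)
        credit += band_hours * base_rate * mult
        lower = upper
    return min(credit, cap)

# 7h outage, 30 min grace -> 6.5h net: 2h*0 + 4h*$10 + 0.5h*$20 = $50
print(compute_credit(7.0))                # 50.0
print(compute_credit(7.0, cap=45.0))      # 45.0 (cap binds)
print(compute_credit(7.0, grace_h=2.0))   # 30.0 (5h net: 3h in the 1.0x band)
```

This reproduces all three previewed values in the scenario ($50.00, $45.00, $30.00), which is a useful sanity check that the grace period is modeled as a deduction rather than a mere eligibility gate.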
Prevent Contradictory Clauses at Design Time
Given a rule graph under construction When two Cap nodes exist on any single execution path Then an error "Multiple caps on path" is shown on each Cap node within 100 ms and Publish is disabled When threshold bands overlap or are unsorted (e.g., 6h before 2h) Then an error "Invalid threshold ordering" is shown on the affected nodes and Publish is disabled When an edge creates a unit/type mismatch (e.g., Multiplier input fed by Boolean) Then an error "Type mismatch" is shown on the edge endpoints and the connection is rejected When the offending connection or node is corrected Then all related errors clear within 200 ms and Publish becomes enabled if no other errors remain
Versioned DSL Generation and Round‑Trip Fidelity
Given a valid visual rule graph When the user clicks Save Then a new immutable DSL artifact is generated with a monotonically increasing version identifier and timestamp And the artifact validates against the DSL schema with zero errors When the saved DSL is re-opened into the canvas Then the reconstructed graph matches the original in node types, parameters, edge topology, and evaluation order (deep-equality = true) When the DSL is executed by the rules engine with the same test inputs Then the outputs exactly match the canvas preview results (tolerance = 0)
Reusable Sub‑Flow Library and Template Instantiation
Given a selected group of nodes implementing a disaster exemption When the user saves the selection as a Sub‑Flow named "Disaster Exemption" v1.0 Then it appears in the Library sidebar and can be dragged into any policy When the Sub‑Flow is inserted into a new policy Then the instance references Library version v1.0 and executes identically to the original selection When the Library Sub‑Flow is updated to v1.1 Then existing instances prompt for upgrade; accepting upgrades the instance to v1.1, declining pins it to v1.0 And previewed results reflect the chosen version for each instance
On‑Canvas Incident Data Test and Liability Preview
Given an OutageKit incident with id INC‑123 (duration 3h12m, affected_accounts 1,250, severity High) When the incident is selected as test input on the canvas Then duration, affected_accounts, and severity fields auto-populate; other test fields remain unchanged When the user clicks Preview Then per-customer sample outcomes (>=10 customers) and total liability are computed and displayed within 2,000 ms (P95) And the total liability matches the offline engine run to within 0.1% and sampled customer credits match exactly
Customer Class and Territory Conditions Apply Correctly
Given a rule graph with condition nodes: customer_class = Residential and territory IN {North} And a test dataset with customers: Residential/North, Residential/South, Commercial/North, Commercial/South When Preview is executed for the dataset Then only Residential/North customers receive non-zero credits; all others receive $0 And the aggregate totals reflect the inclusion/exclusion logic exactly
Versioned Rule Lifecycle & Rollback
"As a compliance officer, I want versioned rules with effective dates and rollback so that we can audit and reverse changes safely when needed."
Description

End-to-end rule version management including create, clone, diff, annotate, and schedule effective/expiry windows by territory or customer segment. Supports draft, review, approved, and live states with the ability to pin runtime calculations to specific versions and to rollback instantly if issues arise. Diffs highlight logic changes and projected financial impact deltas. All versions are immutable and link to incidents and calculations for complete traceability.

Acceptance Criteria
Create and Clone Draft Rule Version
Given a user with Rule Editor permissions When they create a new rule version Then the version is saved in Draft state with a unique immutable version ID and timestamps And annotations (title, description, change reason) are required and saved And attempting to modify the saved logic payload is blocked and prompts creation of a new Draft version Given an existing version When the user clones it Then the clone inherits logic and metadata (excluding version ID and timestamps) And the clone starts in Draft state and can be scheduled independently
State Transitions and Approval Workflow
Given a Draft version When it is submitted for review Then its status changes to Review and a review request is logged with requester, time, and notes Given a version in Review When at least one approver from Compliance and one from Operations approve Then the version status changes to Approved and approvals are recorded with user, role, time, and comments Given an Approved version When it is promoted to Live Then the promotion is logged and no logic edits are permitted during or after promotion Given any non-Draft version When a user attempts to modify logic Then the system blocks the change and requires creating a new Draft version Given a Review rejection When a reviewer rejects with a reason Then the version returns to Draft and the rejection reason is recorded
Schedule Effective and Expiry Windows by Scope
Given a Draft or Approved version with defined scope (territory and/or customer segment) When the user schedules an effective start time and optional expiry time Then the system validates that no overlapping Live windows exist for the same scope and rule type And if an overlap exists, scheduling is blocked with a conflict message identifying the conflicting version(s) And if validation passes, the version is queued to become Live at the scheduled time for the specified scope(s) And after the expiry time, the version is no longer effective for that scope while the version record remains immutable And all schedule changes are logged with user, time, and before/after values
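The overlap validation above reduces to an interval-intersection check per scope, with an open-ended expiry treated as unbounded. A hedged sketch (record shapes and field names are illustrative assumptions):

```python
# Sketch of effective-window conflict detection: a new window for a scope is
# rejected if it intersects any existing Live window for the same scope.
from datetime import datetime

INF = datetime.max  # models "no expiry"

def overlaps(start_a, end_a, start_b, end_b):
    # Half-open intervals: windows that merely touch at a boundary do not conflict.
    return start_a < (end_b or INF) and start_b < (end_a or INF)

def find_conflicts(new_window, live_windows):
    """Return IDs of Live windows in the same scope that overlap the new one."""
    return [w["id"] for w in live_windows
            if w["scope"] == new_window["scope"]
            and overlaps(new_window["start"], new_window["end"],
                         w["start"], w["end"])]

live = [{"id": "v3", "scope": "North",
         "start": datetime(2024, 1, 1), "end": datetime(2024, 6, 1)}]
new = {"scope": "North", "start": datetime(2024, 5, 1), "end": None}
print(find_conflicts(new, live))  # ['v3'] -> scheduling blocked with this ID
```

Returning the conflicting version IDs directly supports the criterion that the conflict message identify which version(s) block the schedule.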
Diff View with Logic Changes and Financial Impact Delta
Given two selected rule versions and a historical time window and scope When the user opens the Diff view Then logic differences are highlighted at rule, condition, and parameter level And projected financial impact deltas are computed using historical incidents/calculations within the selected window and scope And the Diff displays total liability delta, number of affected customers, and the top 10 customers by absolute delta And an itemized CSV and a PDF summary can be exported And computations complete within 30 seconds for up to 100,000 incidents or the UI displays progress with an ETA and does not freeze
Pin Runtime Calculations to Specific Version
Given a calculation job (manual or incident-triggered) When a specific rule version ID is provided (pinned) Then the engine executes using exactly that version regardless of current Live versions And the job record stores the pinned version ID, commit hash, and scope used And rerunning the job with identical inputs and the same pinned version produces identical outputs And if the pinned version is not visible to the job's scope, the job is blocked with a clear error explaining the mismatch
Instant Rollback of Live Version
Given a Live version causing issues for a specific scope When an operator selects a prior Approved version for that scope and confirms rollback Then the selected version becomes Live for that scope within 60 seconds of confirmation And all new calculations after switchover use the rolled-back version And in-flight jobs complete with the version they started with and are labeled with that version ID And an audit event is emitted capturing initiator, reason, old/new version IDs, scope, and timestamps
End-to-End Traceability and Immutability
Given any rule version When viewing its audit and linkage details Then links to all associated incidents, calculations, diffs, approvals, schedules, and deployments are accessible And the version logic payload, annotations, and approvals are read-only and displayed with a cryptographic checksum And querying by calculation ID reveals the exact version and scope used And attempts to delete a version are blocked, with the action and user attempt logged
Historical Backtesting & Liability Preview
"As a finance analyst, I want to test rules on past events so that I can estimate total liability and avoid unexpected bill credits."
Description

A simulator that runs proposed rule changes against historical outages and customer impact data from OutageKit to quantify per-customer outcomes and aggregate liability before publishing. Supports scenario comparisons (baseline vs draft), sensitivity analysis on thresholds, and guardrails that block promotion when variance exceeds configurable limits. Generates exportable reports and dashboards for finance and compliance, with performance optimizations for large territories via sampling and parallelization.

Acceptance Criteria
Per-Customer Outcome Accuracy on Historical Replay
Given a historical outages dataset for a selected date range and a ruleset identical to the current baseline version When the simulator runs against the full customer population with a fixed random seed Then per-customer credit amounts equal the production ledger within 0.01 currency units for at least 99.5% of customers And any discrepancies greater than 0.01 are itemized with customer identifiers and rule-path explanations And outcomes correctly apply thresholds, tier multipliers, grace periods, caps, and disaster exemptions per the ruleset And the run summary reports processed count, skipped count (with reasons), and total liability with two-decimal precision
Aggregate Liability Consistency and Breakdown
Given per-customer outcomes from a completed simulation When aggregate liability is computed Then the aggregate equals the sum of all per-customer outcomes within 0.01 currency units And aggregates are available by territory, customer class, and outage event with each subtotal reconciling to the grand total And duplicate accounts and overlapping outage windows are deduplicated according to policy with counts of deduplications reported And totals accurately reflect inclusion or exclusion of disaster exemptions based on a user-visible toggle and are labeled accordingly And the UI and API return precomputed aggregations within 3 seconds p95 for datasets up to 5,000 outage events
Baseline vs Draft Comparison and Deltas
Given a selected baseline ruleset A and a draft ruleset B When a comparison run is executed Then the UI displays side-by-side metrics including total liability, affected-customer count, average credit per customer, and top 10 increases/decreases by segment And a per-customer delta table is available with absolute and percentage change, filterable by territory and customer class, and downloadable as CSV/XLSX And variance is displayed at both absolute and percentage levels and reconciles exactly with aggregate subtotals And the comparison snapshot is versioned with timestamp, ruleset IDs, dataset period, sample rate (if any), and a content hash
Threshold Sensitivity Analysis and Worst-Case Identification
Given a draft ruleset and a numeric parameter (e.g., threshold hours or multiplier) selected for sensitivity analysis When a sweep is configured with min, max, and step values Then the simulator executes the sweep and produces a curve of aggregate liability versus parameter value with at least N points where N = 1 + (max − min)/step And the analysis flags the worst-case (max liability) and best-case (min liability) within the sweep range And each point includes sample rate and error bounds when sampling is used, with 95% confidence intervals displayed And users can pin up to three parameter sets for side-by-side comparison and export the sensitivity results as CSV and PDF And a sampled sweep of up to 21 points completes within 5 minutes p95 at a 10% sample rate
Guardrail Enforcement on Excess Variance Before Publish
Given guardrail limits configured for absolute liability variance and percentage variance relative to baseline When a draft ruleset exceeds any configured guardrail in comparison results Then the Publish action is blocked and a banner lists the violated guardrails with measured variance values and limits And only users with role ComplianceAdmin can submit an override with a non-empty justification when the guardrail is marked Overridable And guardrails marked NonOverridable cannot be overridden by any role And all block and override events are written to an immutable audit log with user, timestamp, ruleset IDs, baseline snapshot ID, and comparison metrics And the API returns HTTP 403 with machine-readable error codes for blocked publish attempts
Finance and Compliance Reports and Dashboard Exports
Given a completed simulation or comparison When a user requests exports for finance or compliance Then the system generates per-customer and aggregate CSV/XLSX files and a PDF summary report within 2 minutes p95 for datasets up to 1,000,000 customers And each export includes metadata: ruleset IDs and versions, dataset period, sample rate, generator version, guardrail status, and content checksum And PII handling adheres to policy: compliance exports mask customer identifiers; finance exports show full identifiers only for permissioned roles And download links are pre-signed URLs with configurable expiry (default 7 days) and are revoked upon manual invalidation And scheduled exports can be configured and delivered via SFTP or encrypted email with delivery success/failure logged
Performance at Scale with Sampling and Parallelization
Given a territory with 1,000,000 customers, 12 months of history, and 5,000 outage events When a full-population simulation runs on a worker pool sized to 16 vCPUs Then wall-clock completion time is ≤ 10 minutes p95 and ≤ 12 minutes p99 And CPU utilization during compute phases is ≥ 70% and memory usage remains ≤ 75% of allocated per worker And a 10% sampling run completes in ≤ 2 minutes p95 with ≤ 1% absolute error on aggregate liability at 95% confidence, validated against the full run And runs are reproducible given a fixed random seed, and the seed is recorded in run metadata And progress reporting updates at least every 5 seconds with estimated time remaining accuracy within ±20%
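The sampled-run criterion above (≤ 1% absolute error at 95% confidence, reproducible from a fixed seed) can be sketched with a normal-approximation interval on a simple random sample. Everything below is illustrative: the data, the 10% rate, and the estimator are a sketch, not the product's actual sampler.

```python
# Hedged sketch of sampled liability estimation with a 95% confidence bound.
import math
import random

def estimate_total(credits, sample_rate=0.10, seed=42):
    rng = random.Random(seed)                  # fixed seed => reproducible run
    n = max(2, int(len(credits) * sample_rate))
    sample = rng.sample(credits, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    total = mean * len(credits)                # scale sample mean to population
    half_width = 1.96 * math.sqrt(var / n) * len(credits)  # 95% CI half-width
    return total, half_width

rng = random.Random(1)
population = [rng.uniform(0, 50) for _ in range(10_000)]   # synthetic credits
total, hw = estimate_total(population)
```

Recording the seed in run metadata, as the criterion requires, is what lets auditors re-derive the exact same sample and interval later.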
Per-Customer Outcome Explorer
"As a customer care lead, I want to preview individual customer credits with explanations so that agents can answer billing questions confidently."
Description

An interactive preview that surfaces expected credits for specific customers, accounts, or cohorts, including explanation traces that show which thresholds, grace periods, caps, and exemptions were applied. Highlights edge cases near thresholds and customers hitting caps. Supports secure search, PII masking in non-production, and CSV/PDF exports for agent playbooks and regulator responses. Integrates with OutageKit’s customer and incident views for one-click context switching from an outage cluster to affected customers’ credit previews.

Acceptance Criteria
Per-Customer Credit Preview with Explanation Trace
Given a user with appropriate access and a valid customer identifier, When they open the Outcome Explorer and select a rules version and incident/time window, Then the UI displays the computed credit amount in currency (2 decimals), the rules version ID, and the computation timestamp. And the line-item breakdown lists each applied rule component (thresholds, grace periods, tier multipliers, caps, exemptions) with input values, decision, and contribution amount. And an explanation trace shows the evaluation order and final decision path IDs. And recomputing the same customer, incident, and rules version returns identical results within rounding policy (±0.01 currency units). And if required inputs are missing, a non-blocking warning lists missing fields and the credit is marked "Indeterminate".
Edge Case Highlighting Near Thresholds and Caps
Given computed outcomes, Then any customer within 10% of a duration/amount threshold or within 5 minutes of a time-based threshold is flagged "Near Threshold". And any customer whose computed credit is limited by a policy cap is flagged "At Cap". And interacting with a flag reveals threshold/cap value, measured value, and delta. And list and export views support sorting and filtering by these flags.
Secure Search with PII Controls
Given an authenticated user, When they search by customer ID, account number, phone, email, or service address, Then results match allowed identifiers and sanctioned fuzzy match rules with a match-strength label. And RBAC restricts access; unauthorized users see zero results and an "Insufficient permissions" message. And in non-production, name, email, and phone are masked by default (e.g., John D., j***@example.com, ***-***-1234); account numbers show last 4 only. And all searches are audit-logged with user, criteria, timestamp, environment, and result count. And P95 latency is ≤ 2s for single-customer queries and ≤ 5s for cohort queries up to 10k customers.
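The masking formats given in the criteria (John D., j***@example.com, ***-***-1234, last 4 of account) can be sketched as follows; function names are illustrative and the real service would apply these per-field based on environment and entitlements:

```python
def mask_name(full_name: str) -> str:
    # "John Doe" -> "John D."
    parts = full_name.split()
    if len(parts) < 2:
        return full_name
    return f"{parts[0]} {parts[-1][0]}."

def mask_email(email: str) -> str:
    # "john@example.com" -> "j***@example.com"
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def mask_phone(phone: str) -> str:
    # keep only the last 4 digits: "555-867-5309" -> "***-***-5309"
    digits = "".join(c for c in phone if c.isdigit())
    return f"***-***-{digits[-4:]}"

def mask_account(account: str) -> str:
    # show last 4 characters only
    return "*" * max(len(account) - 4, 0) + account[-4:]
```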
Cohort Preview and Aggregates
Given a selected cohort (outage cluster, service area, or saved filter), Then the UI displays total customers, sum of expected credits, average credit, and counts of Near Threshold/At Cap flags. And per-customer rows are paginated (50/100/250 per page) with stable sorting across pages. And aggregate totals equal the sum across all pages within 0.01 currency units. And P95 initial cohort load time is ≤ 5s for up to 50k customers; pagination fetch P95 is ≤ 2s.
Rules Version Selection and Historical Replay
Given at least one draft and one production rules version, When a user selects a rules version and an as-of date/time, Then outcomes recompute using that version against historical incidents. And the UI displays a delta view versus production showing per-customer delta and cohort total delta. And the selected version ID and semantic tag are visible on-screen and embedded in exports. And recomputation completes in ≤ 3s P95 for a single customer and ≤ 8s P95 for cohorts up to 10k customers.
CSV and PDF Export with Metadata
Given the current selection (customer or cohort), When a user exports to CSV or PDF, Then files include all visible columns plus metadata (filters, rules version, environment, timestamp, user, and totals). And non-production exports apply PII masking; production exports honor user entitlements for unmasked fields. And CSV supports up to 100k rows via an asynchronous job with progress and completion notification; job completes within 10 minutes for 100k rows P95. And PDF supports up to 200 customers synchronously; generation completes within 60s P95 and preserves pagination and flags. And exported totals match on-screen totals within 0.1%.
Deep Link from Incident Cluster to Credit Previews
Given a user viewing an outage cluster, When they click "Preview credits", Then the Outcome Explorer opens with filters pre-populated to the cluster, time window, and default rules version. And the link is a signed URL that expires in 15 minutes and is valid only for the initiating user/session. And the transition occurs without re-authentication during an active SSO session; otherwise the user is prompted to login and is returned to the target view. And telemetry records navigation with a correlation ID across source and target. And the resulting cohort count matches the impacted-customer count shown in the outage cluster within 1%.
Compliance Workflow & Audit Trail
"As a compliance manager, I want an approval workflow and audit trail so that policy changes meet regulatory requirements and internal controls."
Description

Configurable multi-step approval workflow with role-based permissions (author, reviewer, approver, auditor) and mandatory sign-offs before a rule goes live. Captures immutable audit logs of edits, comments, approvals, and deployment events, with timestamps and user identity. Supports evidence exports for regulators, policy attachment storage, and links to external ticketing systems. Enforces segregation of duties and can require dual control for high-impact changes.

Acceptance Criteria
Enforce Multi-Step Role-Based Approval Before Go-Live
- Given a workflow defines steps Author → Reviewer → Approver → (optional) Auditor and a draft rule version exists, When the Author submits for review, Then the rule status changes to "In Review" and only users with the Reviewer role can record a review decision
- Given at least one Reviewer approval is recorded and Approver sign-offs are required per configuration (default 1), When an Approver records approval(s), Then the Deploy action remains disabled until all required Approver sign-offs are present
- Given required approvals are incomplete, When any user attempts to deploy the rule, Then the Deploy action is blocked and a message lists pending approvals by role
- Given all required approvals are recorded, When a user with Approver permission triggers deployment, Then the rule version transitions to "Active" and a deployment event is logged
Segregation of Duties Enforcement
- Given segregation of duties is enabled for the Compliance Workflow, When the same user attempts to perform more than one role among Author, Reviewer, and Approver on the same rule version, Then the action is blocked with the error "Segregation of duties violation" and the attempt is audit-logged
- Given a user authored a change, When that user attempts to approve the change, Then approval is prevented and the system prompts for assignment to a different Approver
- Given a user with the Auditor role is viewing a rule, When the Auditor attempts to modify rule content or approvals, Then the action is forbidden and recorded in the audit log
Dual Control for High-Impact Changes
- Given dual control is configured with an impact threshold and a rule change exceeds that threshold, When approvals are recorded, Then two distinct Approver users must approve before deployment is enabled
- Given the first Approver has already approved, When the same user account attempts to submit the second required approval, Then the approval is rejected with the message "Second approval must be from a different user"
- Given dual approvals are not met for a high-impact change, When any user attempts to deploy, Then deployment is blocked and the UI indicates "Dual control required: 2 approvals"
Immutable Audit Trail of Edits and Events
- Given any rule edit, comment, review decision, approval/rejection, or deployment occurs, When the event is committed, Then an audit entry is appended containing event type, UTC timestamp, actor user ID, role, entity/version ID, prior and new values (diff when applicable), and an optional comment
- Given audit entries exist, When a user attempts to modify or delete an existing audit entry via UI or API, Then the system returns 403 Forbidden and appends a tamper-attempt audit entry
- Given an Auditor requests the audit log for a rule version, When entries are retrieved, Then they are returned in chronological order with immutable sequence IDs and verifiable hashes for integrity checking
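The "immutable sequence IDs and verifiable hashes" requirement is commonly met with a hash chain, where each entry commits to its predecessor so any later tampering breaks verification. A minimal sketch (field names are illustrative, not the product's schema):

```python
import hashlib
import json

GENESIS = "0" * 64  # hash placeholder for the first entry's predecessor

def append_audit(log: list, event: dict) -> dict:
    """Append an entry whose hash covers the event and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"seq": len(log), "event": event, "prev_hash": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edit to a past entry fails verification."""
    prev = GENESIS
    for i, entry in enumerate(log):
        body = {k: entry[k] for k in ("seq", "event", "prev_hash")}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["seq"] != i or entry["prev_hash"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```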
Evidence Export for Regulators
- Given an Auditor selects a rule version and date range, When the Auditor requests an evidence export, Then a downloadable package is generated within 60 seconds containing the audit log (CSV), approval summary (PDF), deployment events, the rule definition at deployment, linked external ticket references, and all policy attachments
- Given an export package is generated, When it is downloaded, Then the package includes a manifest listing file names, counts, and SHA-256 checksums, and the checksums validate
- Given an export with identical parameters is requested within 24 hours, When the system serves the export, Then the content is identical to the prior export or returned from cache with matching checksums
Policy Attachment Storage and Integrity
- Given an Author or Approver uploads a policy attachment, When the file type is allowed (PDF, DOCX, PNG) and the size is ≤ 25 MB, Then the upload succeeds, the file is stored read-only, associated to the rule version, and a checksum is recorded and audit-logged
- Given a policy attachment is stored on an approved rule version, When a user attempts to replace or delete the attachment, Then the action is blocked and the user is instructed to create a new version to modify attachments
- Given an Auditor downloads an attachment, When the checksum is verified, Then it matches the stored checksum value
External Ticketing Link Requirement
- Given external ticketing integration is enabled and required for approvals, When an Author submits a rule for review, Then a valid linked ticket ID must be provided and validated via API before submission succeeds
- Given a rule is linked to an external ticket, When the external ticket status changes, Then the linked status in the rule reflects the latest external status within 5 minutes and an audit event is recorded
- Given the external ticket link is invalid or API validation fails, When a user attempts to approve or deploy the rule, Then the action is blocked with a clear error indicating the ticket validation failure and remediation steps
Disaster Exemption Data Ingestion
"As a regulatory analyst, I want automatic disaster exemptions applied to rules so that mandated waivers are honored consistently across territories."
Description

Automated ingestion of disaster declarations from authoritative sources (e.g., FEMA, state agencies) and internal operations flags to define exemption windows by geography and time. Includes territory mapping, conflict resolution, and manual overrides with expiry. Exemption artifacts are first-class inputs to the rules engine and preview tools, ensuring credits are suppressed or modified during declared events as required by regulation or policy.

Acceptance Criteria
FEMA Declaration Ingestion and Normalization
Given a new FEMA disaster declaration is published with county FIPS and start/end timestamps, When the ingestion job runs, Then the declaration is fetched, parsed, deduplicated by FEMA identifier, and stored as an exemption artifact within 15 minutes of source publish time. Given overlapping FEMA updates to an existing declaration, When re-ingested, Then the artifact is versioned, the window/geographies are updated atomically, and prior versions remain queryable for audits. Given timestamps and geographies in diverse formats, When normalized, Then all artifact windows are stored in UTC with source timezone preserved as metadata and pass daylight-saving boundary tests.
State Agency Declaration Ingestion and Normalization
Given a supported state agency source publishes a declaration via a configured feed, When the ingestion runs, Then the declaration is parsed with the configured extractor, mapped to counties/ZIPs/polygons, and stored as an exemption artifact with provenance=state and source URL. Given the state and FEMA declare for the same geography/time, When both artifacts exist, Then both are stored independently with provenance so conflict resolution can be applied downstream. Given a malformed or unavailable state source, When ingestion runs, Then the job retries up to 3 times with exponential backoff, emits an error event, and no partial artifact is created.
Territory Mapping to Service Areas and Customers
Given an exemption artifact with county FIPS and polygon geometry, When mapping to the utility's territories, Then all affected service areas and premises within the geometry are linked to the artifact using point-in-polygon or admin-boundary matching. Given a service area partially overlaps an exemption polygon, When mapping, Then only customers within the overlapping sub-geometry are marked exempt; customers outside are not. Given a premise with missing coordinates but known county/ZIP that matches the artifact, When mapping, Then the premise is marked exempt using fallback administrative mapping and flagged reduced_precision=true.
Conflict Resolution and Precedence Rules
Given overlapping artifacts from FEMA, state, internal operations flags, and manual overrides for the same geography/time, When determining the effective exemption, Then precedence is Manual Override > Internal Ops > State > FEMA, and windows are merged by earliest start and latest end unless a higher-precedence artifact narrows the window explicitly. Given two artifacts with conflicting start/end times, When computing the effective window, Then the system produces a single effective window per customer-geography and stores the derivation trace showing sources and precedence applied. Given a manual override that explicitly disables an exemption for a subset of customers, When effective calculation runs, Then that subset is excluded regardless of lower-precedence artifacts.
Manual Override Creation, Expiry, and Audit
Given a user with the Exemptions:Manage role, When creating a manual override, Then they must specify geography (polygon/ZIP/county), start/end timestamps (UTC), scope (include or exclude), and expiration, and the system validates required fields before saving. Given a manual override reaches its expiration, When the scheduler runs, Then the override automatically deactivates and is excluded from effective calculations without a deployment or manual action. Given any create/update/delete of a manual override, When the action completes, Then an immutable audit record is stored with user, timestamp, change diff, and justification.
Rules Engine and Preview Consumption
Given a customer is within an active exemption window, When the Rules Studio evaluates credit policies, Then credits are suppressed or modified per policy flags and the result includes exemption_reason and source_provenance. Given Operations previews a rules change on a historical event with exemptions, When running the preview, Then per-customer outcomes and total liability reflect the exemptions, with a banner indicating exemptions applied and dollar deltas. Given no exemptions apply, When evaluating, Then the engine yields the same results as with exemptions disabled (idempotence check).
Data Freshness, Backfill, and Monitoring
Given any configured source has not produced an update in 24 hours, When the monitoring job runs, Then an alert is sent to the configured channel(s) with source name and last successful fetch time, and the system continues retrying without blocking other sources. Given the source publishes retroactive amendments to a declaration, When backfill runs, Then the system reprocesses affected windows and geographies, updates effective calculations, and emits a data-change event that triggers dependent recalculations within 30 minutes. Given an ingestion job completes, When metrics are recorded, Then the dashboard shows counts ingested/updated/deleted, processing latency p50/p95, and failure rates for the last 24 hours.
Deterministic Calculation Engine & API
"As a platform engineer, I want a deterministic rules engine and API so that credit calculations are reliable, scalable, and traceable during major outages."
Description

A scalable, deterministic service that executes versioned rule graphs for batch and real-time calculations with idempotency, version pinning, and trace IDs for every evaluation. Meets defined SLOs for throughput and latency at peak incident volumes and exposes observability (metrics, logs, traces) for debugging. Provides APIs to compute credits by incident, customer, or cohort and integrates with OutageKit notifications to include credit estimates in outbound messages. Includes rate limiting, retries, and sandbox/production environments.

Acceptance Criteria
Deterministic Evaluation, Version Pinning, and Trace IDs
- Given a fixed rules graph version V and canonical input payload P for customer C, When I evaluate the graph twice without an idempotency key, Then the outputs (creditAmount, currency, breakdown, ruleVersionApplied) are identical and ruleVersionApplied equals V. - Given the same input P and version V evaluated on three separate service instances, When all evaluations complete, Then the outputs are identical across instances and timestamps/time zones do not affect results (UTC used). - Given floating-point calculations and tier thresholds in V, When an evaluation runs, Then rounding is applied deterministically to 2 decimal places in the configured currency and ties are resolved by banker's rounding. - Given an effectiveDate D instead of an explicit version, When V is the active version at D, Then ruleVersionApplied equals V and the output matches a run explicitly pinned to V. - Given any evaluation, When the response is returned, Then it includes a traceId (UUID v4 format) and an evaluationId (UUID v4 format).
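Deterministic two-decimal rounding with ties-to-even ("banker's") is exactly what Python's `decimal` module provides; using `Decimal` rather than binary floats also keeps results identical across instances. A minimal sketch (the function name is illustrative):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_credit(amount) -> Decimal:
    """Round a credit amount to 2 decimal places, ties to even.

    Accepts a string or Decimal; strings avoid binary float drift,
    which is what makes repeated evaluations byte-identical.
    """
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
```

For example, 2.345 rounds down to 2.34 (4 is even) while 2.355 rounds up to 2.36.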
Idempotency Keys, Safe Retries, and Rate Limiting
- Given a POST compute request with header Idempotency-Key=K, When the request is sent and then retried 3 times within 24 hours with the same K, Then the first response is 201 Created, subsequent responses are 200 OK with identical body, the same evaluationId, and header Idempotent-Replay=true, and no duplicate side effects occur. - Given a transient 5xx error on the first attempt, When the client retries the request with the same Idempotency-Key K, Then the successful retry returns the same evaluationId and output as if no error occurred. - Given a POST compute request without an Idempotency-Key, When it is sent, Then the service responds 400 Bad Request with error code MISSING_IDEMPOTENCY_KEY and no evaluation is executed. - Given a tenant API key limited to 1000 requests per minute with a burst of 200, When the client exceeds the limit, Then the service responds 429 Too Many Requests with headers X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After, and no evaluations are processed for the throttled requests. - Given the rate limit window resets, When the client resumes within limits, Then requests are accepted and processed normally.
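The replay semantics above (201 on first write, 200 with the identical body on retries, 400 without a key) can be modeled with an idempotency store keyed by the client-supplied header. This toy in-memory version is a sketch of the contract, not the service itself; class and field names are assumptions:

```python
import uuid

class ComputeService:
    """Minimal idempotency model: replays return the stored result."""

    def __init__(self):
        self._store = {}  # Idempotency-Key -> response body

    def compute(self, idempotency_key, payload: dict):
        """Return (status_code, body), mirroring the API contract above."""
        if idempotency_key is None:
            return 400, {"error": "MISSING_IDEMPOTENCY_KEY"}
        if idempotency_key in self._store:
            # Replay: same evaluationId, no duplicate side effects.
            return 200, self._store[idempotency_key]
        body = {"evaluationId": str(uuid.uuid4()),
                "creditAmount": payload.get("amount", 0)}
        self._store[idempotency_key] = body
        return 201, body
```

A production store would also persist request fingerprints and expire keys after the 24-hour window.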
Real-time Throughput and Latency SLOs at Peak Load
- Given a synthetic workload of 2,000 requests per second sustained for 10 minutes across three tenants, When invoking the real-time compute endpoint, Then p95 latency is <= 250 ms, p99 latency is <= 500 ms, and the HTTP error rate is < 0.1% during the window. - Given a spike to 5,000 requests per second for 60 seconds, When the system auto-scales, Then at least 99% of requests succeed, p99 latency remains <= 750 ms, and no request is queued for more than 1 second. - Given steady-state traffic at 200 RPS, When background maintenance tasks run, Then p95 latency and error rate remain within SLO and no more than 0.5% latency regression is observed versus baseline. - Given all responses, When headers are inspected, Then X-Trace-Id is present on 100% of successful and failed responses for correlation.
Batch Evaluation SLOs for Large Cohorts
- Given a batch job for 1,000,000 customer evaluations for incident I pinned to rules version V, When executed, Then total completion time is <= 50 minutes with average throughput >= 20,000 evaluations per minute. - Given per-record transient failures, When the job runs, Then each failed record is retried up to 3 times with exponential backoff and jitter, and final unprocessed record rate is <= 0.05%. - Given a worker node failure at 50% progress, When the job is restarted, Then it resumes from the last durable checkpoint and at most 2 minutes of work are reprocessed. - Given batch output, When sampling any 100 records, Then each record includes evaluationId, traceId, ruleVersionApplied, creditAmount, and currency, and totals reconcile with the sum of per-record credits.
Observability: Metrics, Structured Logs, and Distributed Traces
- Given the service is running in test mode with 100% tracing, When 1,000 evaluations are executed, Then at least 99% emit a distributed trace with spans named ingress, rule_eval, and persist, and each span includes traceId and duration. - Given the /metrics endpoint, When scraped, Then it exposes Prometheus metrics: outagekit_eval_requests_total{route,tenant,outcome,rule_version}, outagekit_eval_latency_seconds_bucket, outagekit_eval_errors_total{error_code}, and outagekit_queue_depth, all reporting non-negative values. - Given an evaluation completes, When logs are inspected, Then a structured JSON log line exists containing fields trace_id, evaluation_id, tenant_id, rule_version_applied, idempotency_key (if provided), latency_ms, outcome, and error_code (if any). - Given an evaluationId E, When querying logs and traces by E, Then corresponding entries are retrievable within 5 seconds of completion and share the same traceId across services.
Compute APIs and Notification Integration
- Given POST /v1/credits/customer with customerId=C and optional ruleVersion=V, When called, Then the response is 200 OK and includes evaluationId (UUID), traceId (UUID), ruleVersionApplied, creditAmount (decimal with 2 fractional digits), currency (ISO 4217), and breakdown (array of rule components). - Given POST /v1/credits/incident with incidentId=I and effectiveDate=D, When called, Then the response is 200 OK and ruleVersionApplied resolves from D, and if a cohort filter is provided via /v1/credits/cohort, Then the response also includes customerCount and totalCreditAmount fields. - Given an invalid incidentId or customerId, When called, Then the service returns 404 Not Found with error code RESOURCE_NOT_FOUND; Given an invalid payload schema, Then 400 Bad Request with error code VALIDATION_ERROR is returned. - Given OutageKit sends an incident update with includeCreditEstimate=true for customer C, When the notification is generated, Then the rendered SMS/email/IVR contains an "Estimated credit: $X.XX" line matching the engine's creditAmount for C and the additional end-to-end latency p95 is <= 300 ms. - Given the engine is temporarily unavailable, When a notification would include an estimate, Then the message falls back to "Estimate unavailable" and an operational alert is emitted; the pipeline retries estimation within 10 minutes and can send a follow-up update if enabled.
Sandbox vs Production Isolation and Promotion
- Given sandbox and production base URLs and credentials, When a compute request is sent to sandbox, Then resulting data and logs are visible only in sandbox observability and not in production, and vice versa. - Given a sandbox API key, When used against production endpoints, Then the request is rejected with 401 Unauthorized and error code INVALID_API_KEY_ENV; likewise, a production key is rejected on sandbox. - Given a rules version V published in sandbox, When promoted to production, Then an audit record is created containing actor, timestamp, diff summary, and promotionId, and V becomes available in production within 60 seconds. - Given the same input payload P evaluated in sandbox and production pinned to V after promotion, When results are compared, Then outputs are identical (creditAmount, currency, breakdown, ruleVersionApplied). - Given metrics, logs, and traces, When inspected, Then each record includes an env label (sandbox or prod) and no cross-environment data leakage is detected.

Impact Matcher

Accurately links outage clusters to customer accounts using GIS boundaries, AMI pings, and time windows, deduplicating overlapping reports to avoid double credits. Handles partial restorations with minute-level proration and service-degradation flags, so credits reflect real impact. Reduces manual reconciliation and keeps credits fair and defensible.

Requirements

GIS Boundary Association
"As an operations manager, I want customer accounts automatically associated to outage clusters using our GIS data so that impact counts and maps are accurate without manual matching."
Description

Implements a robust spatial join that links outage clusters to customer accounts using utility GIS assets (service territories, circuits/feeders, meter point coordinates). Uses polygon overlays with precedence rules to resolve gaps/overlaps, and falls back to geocoded service addresses when meter coordinates are missing. Caches topology and supports multiple GIS providers via adapters. Streams updates as cluster geometries evolve so the impacted account list stays live, powering OutageKit’s map, metrics, and notifications with accurate coverage counts.

Acceptance Criteria
Associate Cluster to Accounts via Meter Coordinates
Given a live outage cluster polygon and accounts with valid meter point coordinates in the same spatial reference, When the spatial join is executed, Then every account whose meter point lies within the cluster polygon is marked impacted. And accounts whose meter point lies on the polygon boundary are treated as impacted (boundary-inclusive containment). And accounts outside the polygon are not marked impacted. And the coverage count exactly equals the number of impacted accounts. And the join completes within 500 ms for a cluster intersecting up to 10,000 meter points.
Precedence Rules Resolve Overlaps and Gaps
Given accounts with meter points, circuit/feeder polygons, service territory polygons, and known overlaps/gaps, When spatial ambiguities occur during association, Then precedence is applied in order: meter point containment > circuit/feeder containment > service territory containment > nearest circuit/feeder centerline within 100 meters. And in polygon overlaps, the higher-precedence polygon determines association. And in polygon gaps, the nearest-centerline rule applies; if no centerline is within 100 meters, processing proceeds to the geocoded address fallback. And the applied rule is recorded per account association.
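The containment tests and precedence order can be sketched with a standard ray-casting point-in-polygon check. This is an illustration under assumptions: boundary handling is simplified, and the nearest-centerline and geocoded-address fallbacks are omitted; function and rule names are not the product's API:

```python
def point_in_polygon(pt, polygon) -> bool:
    """Ray-casting containment test; polygon is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def association_rule(meter_pt, cluster_poly, feeder_poly, territory_poly):
    """Return the first matching rule in precedence order, or None.

    Precedence sketch: meter-point containment > circuit/feeder
    containment > service-territory containment.
    """
    for rule, poly in (("meter_point", cluster_poly),
                       ("circuit_feeder", feeder_poly),
                       ("service_territory", territory_poly)):
        if poly and point_in_polygon(meter_pt, poly):
            return rule
    return None
```

The returned rule name would be recorded per account association, satisfying the audit requirement.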
Fallback to Geocoded Service Address When Meter Missing
Given an account missing meter coordinates or having invalid meter geometry, And a validated geocoded service address with location precision ≤ 30 meters, When joining accounts to a cluster, Then the geocoded point is used for containment tests in place of the meter point. And accounts with address precision worse than 30 meters are flagged low-precision and excluded from automatic impact unless they are within the service territory and within 150 meters of a circuit centerline. And a fallback_reason of geocoded_address is recorded for any impacted account determined via this method.
Incremental Streaming Updates on Cluster Geometry Change
Given a cluster geometry is updated (expand, contract, move) or its status changes, When the updated geometry is received by the association engine, Then impacted account additions and removals are computed as a diff against the prior version. And add/remove events are streamed to downstream consumers within 5 seconds p95 and 10 seconds p99. And no duplicate add/remove event is emitted for the same account and cluster version. And the map, metrics, and notifications reflect the new impacted set within 5 seconds p95.
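Computing additions and removals as a diff against the prior impacted set is a straightforward set difference. A minimal sketch (the event-tuple shape is illustrative):

```python
def impact_diff(prev: set, curr: set) -> list:
    """Diff two impacted-account sets into add/remove events.

    Emits one event per changed account; replaying an unchanged set
    yields an empty diff, so no duplicate events are produced for the
    same cluster version. Sorted for deterministic ordering.
    """
    adds = [("add", account) for account in sorted(curr - prev)]
    removes = [("remove", account) for account in sorted(prev - curr)]
    return adds + removes
```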
GIS Provider Adapter Compatibility and Cached Topology
Given configured GIS adapters for Esri Feature Service and PostGIS, When the system starts and topology cache warm-up runs, Then service territories and circuit/feeder layers are cached with a TTL of 15 minutes and invalidated on provider change events. And cold-start warm-up completes within 90 seconds for up to 500 service territories and 10,000 circuit segments. And joins read topology from cache and clusters from the live stream without provider-specific code paths. And switching providers at runtime yields zero failed joins and no more than 1 minute of reduced cache hit rate.
Auditability of Account-Cluster Associations
Given accounts have been associated or de-associated with a cluster, When querying the association audit API for a cluster/account pair, Then the response includes association_method, rule_applied, data_source_ids and versions, geometry_ids, associated_at and deassociated_at timestamps, and actor=engine. And 100% of association changes are retrievable within 60 seconds of occurrence. And audit records are retained for at least 13 months.
Performance and Scale Under Load
Given 100,000 customer accounts (≥50% with meter points), 2,000 circuit segments, and a sustained stream of 20 cluster updates per second, When the system runs under this load for 15 minutes, Then p95 per-cluster association latency is ≤ 1.5 seconds and p99 is ≤ 3 seconds. And CPU utilization remains < 75% and memory usage < 70% of allocated limits. And the streaming pipeline exhibits zero data loss and a join error rate < 0.1%.
AMI Ping Correlation
"As a reliability engineer, I want AMI ping status correlated with clusters so that we can confirm outages and restorations objectively and reduce false positives."
Description

Ingests AMI telemetry (last-heard, power status, voltage flags) and correlates meters to outage clusters to confirm energized/de-energized states. Applies latency-aware heuristics and vendor-specific adapters, rate limiting, and retry policies. Produces confidence scores and per-account state transitions that continuously update impact status and restoration confirmation, reducing false positives/negatives in OutageKit’s live views and credit pipeline.

Acceptance Criteria
Vendor AMI Payload Normalization
Given AMI payloads from multiple vendors with differing field names and encodings, When the payloads are processed by vendor-specific adapters, Then each record is normalized to the schema {meter_id, last_heard_ts (UTC ISO-8601), power_status ∈ {energized, de-energized, unknown}, voltage_flags[], vendor_id, source}. And records with missing meter_id or invalid timestamps are rejected with an explicit error code. And no record contains a last_heard_ts in the future; such records are rejected and logged. And adapter success rate is ≥ 99.0%, with rejection reasons emitted as metrics per vendor. And p95 normalization latency per record is ≤ 200 ms at 1k RPS sustained.
Latency-Aware Energization Inference
- Given a configurable freshness threshold T = 15 minutes, When last_heard_ts for a meter is older than T and the last known power_status was energized, Then the meter correlation state is set to unknown and does not contribute to de-energized counts
- Given last_heard_ts ≤ T and power_status indicates de-energized, When correlating to an active outage window, Then the meter correlation state is de-energized with freshness_flag = true
- Given last_heard_ts > T but voltage_flags include recent outage indicators within the past 5 minutes, When computing the meter state, Then the meter correlation state is de-energized with freshness_flag = false
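The latency-aware inference described in this criterion can be sketched as a small decision function. Signature and return shape are assumptions; the real engine would fold in more signals:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS = timedelta(minutes=15)  # configurable threshold T

def infer_state(last_heard, power_status, recent_outage_flag, now):
    """Return (correlation_state, freshness_flag) per the criteria above.

    Stale 'energized' readings decay to 'unknown'; a stale meter whose
    voltage_flags showed a recent outage indicator still counts as
    de-energized, but with freshness_flag = False.
    """
    fresh = (now - last_heard) <= FRESHNESS
    if fresh and power_status == "de-energized":
        return "de-energized", True
    if not fresh and recent_outage_flag:
        return "de-energized", False
    if not fresh and power_status == "energized":
        return "unknown", False
    return power_status, fresh
```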
Cluster Correlation and Membership
Given an outage cluster with a GIS polygon and active time window, When correlating a meter with known geocoordinates and feeder association, Then the meter is associated to the cluster if it is inside the polygon or on a feeder linked to the cluster, and the meter event time overlaps the cluster time window. And if the meter falls within multiple overlapping clusters, it is assigned to the cluster with the highest confidence score; ties are broken by smallest polygon area. And a meter has at most one active cluster membership at any point in time.
Confidence Scoring and Thresholded Actions
Given inputs {power_status signal, last_heard freshness, voltage_flags, spatial proximity, cluster density}, When computing a meter-to-cluster confidence, Then a numeric score in [0.0, 1.0] is produced and persisted per meter-cluster link with updated_at. And if the score is ≥ 0.80 for ≥ 3 distinct meters within a cluster within 5 minutes, the cluster gains status = "confirmed_by_ami". And if the score is ≤ 0.20 for ≥ 5 distinct meters spatially distributed across the cluster within 10 minutes after a restoration event, the cluster transitions to status = "restoration_likely" and triggers a verification sweep.
Per-Account State Transition Timeline and Debounce
Given meter correlation states over time, When a meter transitions between {de-energized, energized, unknown}, Then an AccountStateChanged event is emitted with {account_id, meter_id, from_state, to_state, transition_ts (minute precision), correlation_confidence}. And flapping transitions within 120 seconds are debounced into a single transition preserving the earliest transition_ts. And events are delivered exactly-once per meter per minute to the downstream topic. And p95 end-to-end latency from detection to publish is ≤ 60 seconds.
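The debounce rule (collapse flapping transitions within 120 seconds into one transition that keeps the earliest timestamp) can be sketched over a timestamped sequence. This is one plausible interpretation of the rule, not the product's implementation:

```python
def debounce(transitions, window_seconds=120):
    """Collapse flapping state transitions within the window.

    transitions: list of (ts_seconds, state) in time order.
    A change arriving within the window of the last emitted transition
    rewrites that transition's state, preserving its earliest timestamp;
    no-op repeats of the current state are dropped.
    """
    out = []
    for ts, state in transitions:
        if out and state == out[-1][1]:
            continue  # repeated state: no transition
        if out and (ts - out[-1][0]) < window_seconds:
            out[-1] = (out[-1][0], state)  # merge flap, keep earliest ts
        else:
            out.append((ts, state))
    return out
```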
AMI API Rate Limiting and Retry Policy
Given a vendor API limit of 100 RPS per token, When ingesting at peak load, Then the client maintains ≤ 90 RPS averaged over any rolling 60-second window. And on 429 or 5xx responses, exponential backoff with jitter is applied, with up to 3 retries per request. And the 5-minute success completion rate after retries is ≥ 99.0%. And a circuit breaker opens after 5 consecutive failures, with half-open probes resuming after 30 seconds.
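Exponential backoff with "full jitter" draws each delay uniformly from an exponentially growing window, which spreads retries from many clients apart in time. A minimal sketch (base, cap, and retry count here are illustrative defaults, not the product's configuration):

```python
import random

def backoff_delays(retries=3, base=0.5, cap=30.0, rng=random.random):
    """Delays for each retry attempt, in seconds, with full jitter.

    delay_i is drawn uniformly from [0, min(cap, base * 2**i)).
    rng is injectable so tests can be deterministic.
    """
    return [rng() * min(cap, base * (2 ** i)) for i in range(retries)]
```

Callers would sleep for `delay_i` before attempt `i + 1`, giving up after the configured retry budget.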
Temporal Impact Windowing
"As a billing analyst, I want per-account impact windows calculated across the outage lifecycle so that credits reflect the precise duration each customer was affected."
Description

Calculates per-account start and end timestamps for outage impact using a configurable precedence of signals (first verified report, AMI de-energized, SCADA/OMS event, cluster creation) and closure triggers (AMI restore, field confirmation, cluster dissolution). Handles partial restorations, time zone/DST, late-arriving data, and reprocessing to ensure each account’s impact window is accurate to the minute and remains consistent across map, messaging, and crediting flows.
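One way to sketch the precedence logic, assuming hypothetical signal names and a fixed ordering (the description above says the ordering is configurable):

```python
from datetime import datetime

# Illustrative precedence lists; names and order are assumptions.
OPEN_PRECEDENCE = ["verified_report", "ami_deenergized", "scada_event", "cluster_created"]
CLOSE_PRECEDENCE = ["ami_restore", "field_confirmation", "cluster_dissolved"]

def impact_window(signals):
    """signals: dict mapping signal name -> UTC datetime when observed.
    Returns (start, end); end is None while the outage is still open, or
    when a late-arriving close predates the open (left for reprocessing)."""
    start = next((signals[s] for s in OPEN_PRECEDENCE if s in signals), None)
    end = next((signals[s] for s in CLOSE_PRECEDENCE if s in signals), None)
    if start is not None and end is not None and end < start:
        end = None
    return start, end
```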

Acceptance Criteria
Report Deduplication Engine
"As a customer support lead, I want duplicate outage reports consolidated across channels so that we avoid double-crediting and keep metrics clean."
Description

Consolidates overlapping outage reports across SMS, web, IVR, and automated signals into a single incident per account/location. Uses fuzzy matching on identifiers (account, phone, address), temporal proximity, and cluster context to suppress duplicates and prevent double-crediting. Generates deterministic incident IDs, reason codes, and an override workflow for agents while preserving privacy through hashing/PII minimization.
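A deterministic incident ID with PII minimization could be derived roughly as follows; the field choices, salting scheme, and ID format are assumptions for illustration, not the actual OutageKit algorithm:

```python
import hashlib

def incident_id(account_id, location_key, window_start_iso, salt="tenant-salt"):
    """Deterministic incident ID: hash normalized identifiers so reruns map
    the same report to the same incident without storing raw PII.
    `salt` would be a per-tenant secret in practice (assumed here)."""
    payload = "|".join([
        hashlib.sha256((salt + str(account_id)).encode()).hexdigest()[:16],  # hashed, not raw
        location_key.strip().lower(),   # normalize before hashing
        window_start_iso,               # temporal bucket for the incident
    ])
    return "inc_" + hashlib.sha256(payload.encode()).hexdigest()[:20]
```

Because the ID is a pure function of normalized inputs, the same report arriving via SMS and IVR collapses to one incident, which is what prevents double-crediting.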

Acceptance Criteria
Minute-Level Credit Proration
"As a finance manager, I want credits prorated at the minute level according to tariff rules so that payouts are fair, consistent, and audit-ready."
Description

Computes credits per account at minute granularity based on the calculated impact window, with tariff rules for minimum durations, grace periods, caps, and rounding. Distinguishes full outages from partial restorations and integrates degradation adjustments. Produces an immutable ledger with invoice-ready line items and exposes exports/APIs for billing systems. Supports recomputation on rule/version changes with transparent diffs.
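A minimal proration sketch using `decimal` for audit-safe rounding; the grace period, minimum duration, and cap defaults are illustrative stand-ins, not real tariff rules:

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

def prorate_credit(start, end, rate_per_minute, *, grace_minutes=30,
                   min_minutes=60, cap=Decimal("100.00")):
    """Minute-granularity credit: no credit under min_minutes, subtract a
    grace period, round half-up to cents, cap the payout. All tariff
    parameters here are assumed defaults for illustration."""
    minutes = int((end - start).total_seconds() // 60)
    if minutes < min_minutes:
        return Decimal("0.00")
    billable = max(0, minutes - grace_minutes)
    credit = (Decimal(billable) * Decimal(str(rate_per_minute))
              ).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return min(credit, cap)
```

Using `Decimal` rather than floats matters for the immutable ledger: recomputation under a new rule version must reproduce cent-exact line items.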

Acceptance Criteria
Degradation Severity Flagging
"As a service quality manager, I want degradation levels flagged even when service isn't fully down so that partial-impact customers receive appropriate consideration and we can prioritize fixes."
Description

Detects and labels service degradation (e.g., low voltage, intermittent supply, reduced bandwidth) even when service is not fully down. Combines AMI/telemetry thresholds with user report cues to assign severity tiers that feed credit rules, prioritization, and customer messaging. Exposes indicators in OutageKit’s console with manual override and notes for field teams.

Acceptance Criteria
Credit Audit & Traceability
"As a compliance officer, I want a complete audit trail of every credit decision so that we can defend outcomes to regulators and customers."
Description

Maintains a tamper-evident lineage for every credit decision, including input sources and versions (GIS, AMI, reports), applied rules, timestamps, and reprocessing history. Provides human-readable “why” explanations, CSV/PDF exports, and role-based access. Enables defensible responses to customer disputes and regulatory audits while ensuring reproducibility across environments.

Acceptance Criteria

Credit Workbench

An approval console that summarizes credit totals by tier, region, and regulator, with sample accounts and outlier flags for one-click drill-down. Integrates with Dual-Approver Flow and Risk Scoring Gate to keep payouts safe, fast, and auditable. Supports bulk exceptions with required justifications so edge cases are handled consistently.

Requirements

Real-Time Credit Summaries Dashboard
"As an operations manager, I want a real-time view of credit totals by tier, region, and regulator so that I can spot trends and manage approvals without waiting on manual reports."
Description

Provide aggregated credit totals by tier, region, and regulator with configurable time windows, fast filters, and live refresh intervals. Normalize currencies and time zones, surface data freshness indicators, and mask PII by default. Include pivoting, saved views, and export to CSV/XLS. Link each aggregate to its underlying sample accounts and incidents for immediate traceability. Enforce performance SLAs for initial load and recompute, and degrade gracefully via cached snapshots if upstream systems are slow.

Acceptance Criteria
Outlier Detection & One-Click Drill-Down
"As a credit approver, I want outliers highlighted with one-click access to supporting details so that I can quickly validate or block atypical credits."
Description

Automatically flag anomalous credit totals using configurable statistical thresholds and business rules (e.g., >3σ, sudden deltas, regulator caps). Visually badge outliers and enable a single-click drill-down that opens sample accounts, incident context, duration-to-credit calculations, and recent changes. Provide reason-code tagging, suggested root causes, and quick actions (approve, hold, escalate) directly from the drill-down pane.

Acceptance Criteria
Dual-Approver Enforcement
"As a finance controller, I want enforced two-person approvals on higher-risk credit batches so that payouts remain controlled and auditable."
Description

Integrate with the Dual-Approver Flow to enforce two-person approval for thresholds by amount, region, or risk. Prevent self-approval, support delegate routing and escalation SLAs, and display approver lineage in the UI. Lock records during review to avoid collisions, and resume safely on reconnect. Send actionable notifications and require explicit confirmation steps for each approver before issuance.

Acceptance Criteria
Risk Scoring Gate Integration
"As a risk analyst, I want automated risk scoring with clear explanations and enforced gates so that high-risk credits are intercepted before approval."
Description

Call the Risk Scoring Gate for each batch and significant drill-down action, displaying scores, factor explanations, and model version. Block or route for manual review when scores exceed thresholds, with configurable policies by regulator. Allow controlled overrides with mandatory justification and evidence attachments. Capture all inputs/outputs for reproducibility and fall back safely if the scoring service is degraded.

Acceptance Criteria
Bulk Exception Processing with Required Justifications
"As an operations lead, I want to process bulk exceptions with required justifications so that edge cases are handled consistently and efficiently."
Description

Enable selection of multiple accounts or batches for exception handling with mandatory reason codes, free-text justification, and optional attachments. Validate against regulator-specific caps and business policies. Execute as an asynchronous job with progress tracking, partial success handling, deduplication, and idempotency. Record per-item outcomes and tie exceptions to subsequent approvals for a complete audit chain.

Acceptance Criteria
Regulator Rules Engine & Compliance Reporting
"As a compliance officer, I want regulator-specific rules enforced and exportable reports so that approvals remain compliant across jurisdictions."
Description

Maintain a configurable rules catalog per regulator defining eligibility, caps, rounding, retention, and reporting requirements. Validate credits against rules at summarize, review, and approve steps, blocking noncompliant actions. Generate regulator-ready reports (CSV/PDF) with required fields and schedules, including change logs and signatures. Support versioned rules, effective dates, and region mappings.

Acceptance Criteria
Immutable Audit Trail & Evidence Pack Export
"As an auditor, I want a complete immutable record with exportable evidence so that I can verify decisions without accessing live systems."
Description

Record a tamper-evident audit trail for every action: actor, timestamp, before/after totals, risk scores, justifications, attachments (with hashes), and approvals. Provide searchable logs and one-click export of an evidence pack (PDF/CSV + attachments manifest) with an immutable reference ID. Support API access for auditors and configurable retention and redaction policies to meet privacy obligations.

Acceptance Criteria

Billing Bridge

Reliable nightly export engine that delivers approved credit batches to billing via SFTP, API, or flat-file formats, with idempotent run IDs to prevent duplicates. Captures acknowledgments and variances from downstream systems, auto-retries failures, and alerts on mismatches. Shortens the path from event to customer make-good without weekend spreadsheet marathons.

Requirements

Nightly Scheduling & Approval Workflow
"As a billing operations manager, I want scheduled nightly exports with approvals so that credits go out reliably and in sync with finance controls without manual spreadsheets."
Description

Implements a configurable nightly job window that assembles approved credit events into exportable batches. Supports cutoff times, blackout periods (e.g., month-end freeze), per-tenant time zones, and minimum/maximum batch sizes. Includes an optional two-step approval (maker-checker) with role-based permissions and an override to defer or force-run. Provides a console view to preview batch composition and expected totals before dispatch, plus API endpoints to schedule, pause, or trigger ad hoc runs. Ensures exports align with finance cycles and avoids weekend spreadsheet work by automating preparation and handoff.

Acceptance Criteria
Idempotent Batch Export with Run IDs
"As a billing admin, I want idempotent run IDs for each export so that reruns and retries never create duplicate credits in downstream systems."
Description

Generates a deterministic run ID per tenant, schedule window, and payload hash, and tags every file, API request, and record with it. Maintains a run ledger to detect duplicates across retries and re-runs, guaranteeing at-most-once posting downstream. Supports safe reprocessing by reusing the same run ID and content checksum, with guardrails to block mutation of previously approved items. Exposes run state (pending/dispatched/acknowledged/reconciled) in the console and via API, enabling consistent recovery after failures.
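Such a run ID could be derived by hashing a canonical serialization of the payload alongside the tenant and schedule window; the ID format below is an assumption for illustration:

```python
import hashlib
import json

def run_id(tenant, window_start_iso, window_end_iso, payload):
    """Deterministic run ID per tenant, schedule window, and payload hash.
    Sorting keys canonicalizes the JSON so a rerun of the same batch
    content reuses the same ID, which is what enables at-most-once posting."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return f"run_{tenant}_{window_start_iso}_{window_end_iso}_{digest}"
```

A downstream run ledger keyed on this ID can then reject duplicate dispatches, and any mutation of previously approved items changes the digest and is caught as a mismatch.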

Acceptance Criteria
Multi-Channel Delivery Connectors (SFTP, REST API, Flat-File)
"As an integration engineer, I want SFTP, API, and flat-file delivery options with schema mapping so that we can connect to any billing platform without custom code for each tenant."
Description

Delivers batches through pluggable connectors: SFTP drop with folder conventions, resumable transfers, and optional PGP encryption; REST API with OAuth2 client credentials, mTLS, and configurable rate limits; and flat-file generation (CSV, PSV, fixed-width) with per-billing-system field mapping, data type coercion, and header/trailer control records. Provides per-connector success criteria and receipt handling, configurable retries, and environment-specific endpoints (test/prod). All connectors honor the run ID and include schema validation to prevent malformed exports.

Acceptance Criteria
Downstream Acknowledgment Capture & Reconciliation
"As a finance analyst, I want captured acknowledgments and automated reconciliation so that I can confirm posted credits and quickly spot discrepancies."
Description

Ingests acknowledgments via SFTP pickup, API callbacks, or polling, and matches them to run IDs and batch line items. Normalizes status codes (accepted, rejected, partial) and captures variance metrics (record count, total credit amount, per-reason buckets). Produces a reconciliation report and updates run state accordingly, with links to impacted incidents and customer accounts inside OutageKit. Supports configurable reconciliation timeouts and auto-escalation if no ack is received within SLA.

Acceptance Criteria
Automatic Retry with Exponential Backoff and Dead-Letter Queue
"As a support engineer, I want automatic retries with a dead-letter queue so that transient failures recover on their own and persistent issues are easy to diagnose and fix."
Description

Applies policy-driven retries for transient delivery and acknowledgment errors with exponential backoff, jitter, and a circuit breaker to protect downstream systems. Guarantees safe retry by resubmitting the same payload and run ID. Routes exhausted attempts to a dead-letter queue with rich error context (connector, endpoint, response, timestamp), and surfaces operator actions (retry, reroute, cancel) in the console and API. Emits observability metrics and logs for SRE monitoring.
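A compact sketch of the retry-then-dead-letter path, with an injectable `sleep` so the policy is testable; the callable names, policy numbers, and error-context fields are assumptions:

```python
import random
import time

def deliver_with_retries(send, payload, run_id, dead_letter, *,
                         max_attempts=4, base=1.0, cap=30.0, sleep=time.sleep):
    """Resubmit the same payload and run ID on failure; after the final
    attempt, route to a dead-letter queue with error context. All errors
    are treated as transient in this sketch (a real policy would classify)."""
    for attempt in range(max_attempts):
        try:
            return send(payload, run_id)
        except Exception as err:
            if attempt == max_attempts - 1:
                dead_letter.append({"run_id": run_id, "error": str(err),
                                    "attempts": max_attempts})
                return None
            # exponential backoff with full jitter between attempts
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```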

Acceptance Criteria
Variance Alerting & Mismatch Notifications
"As an operations manager, I want proactive alerts on batch variances so that we can resolve billing mismatches before they affect customers."
Description

Generates real-time alerts when acknowledgments indicate mismatches or when reconciliation detects deltas beyond configured thresholds. Notifies via email, Slack, SMS, and the OutageKit notifications hub, including run ID, variance summaries, and deep links to investigation views. Supports alert suppression windows, severity levels, and on-call routing. Provides a daily digest summarizing cleared and outstanding variances to shorten time-to-resolution.

Acceptance Criteria
Audit Trail & Compliance Logging
"As a compliance officer, I want a complete audit trail of exports so that we can meet audit requirements and investigate issues with full traceability."
Description

Records an immutable audit trail for every batch lifecycle event: approvals, payload snapshots (checksums), delivery artifacts (filenames, endpoints, signatures), acknowledgments, reconciliation results, and operator interventions. Stores logs in write-once storage with configurable retention and export to SIEM. Redacts PII where not required, supports role-based access and export of evidence packs for SOX/PCI audits. Enables traceability from original outage incidents to customer credits and their downstream posting status.

Acceptance Criteria

Liability Preview

Live dashboard projecting total credit exposure as outages evolve, broken down by jurisdiction, product, and customer segment. Sensitivity sliders let you test rule tweaks (e.g., cap adjustments) before committing, helping leadership balance fairness and financial impact. Prevents end-of-cycle surprises and improves cross-team decision-making during storms.

Requirements

Real-time Credit Exposure Aggregation
"As an operations finance lead, I want live aggregation of projected credits so that I can understand our total exposure and adjust decisions during a storm."
Description

Continuously compute projected outage credit liability by aggregating live incident clusters, affected account counts, and restoration ETAs from OutageKit’s SMS, web, IVR, and telemetry sources. Apply jurisdiction-, product-, and segment-specific credit rules including eligibility thresholds, duration-based prorating, caps, tiering, exclusions, and rounding. Support partial restorations, rolling time windows, multiple currencies and time zones, and tenant segregation. Produce totals and breakdowns by jurisdiction, product, and customer segment, updating within targeted latency under peak storm load with graceful degradation and automatic backfill when data recovers.

Acceptance Criteria
Policy Rule Engine & Versioned Catalog
"As a compliance manager, I want a versioned rule engine with effective dating so that policy changes are accurate, auditable, and aligned with jurisdictional regulations."
Description

Provide a versioned policy catalog and rule engine that models credit determination logic per jurisdiction, product, and customer segment. Support effective dating, future-dated changes, simulation-only drafts, and committed versions with immutable audit history. Express rules for thresholds, caps (absolute and percentage), tiered schedules, grace periods, force majeure exclusions, minimum payouts, and rounding. Validate rule integrity, detect conflicts across jurisdictions, and present a human-readable summary for leadership review and sign-off.

Acceptance Criteria
Scenario Modeling & Sensitivity Sliders
"As an operations leader, I want to model policy changes with sliders so that I can see financial impact before committing to updates."
Description

Deliver an interactive modeling panel with sliders and inputs for key policy parameters (e.g., cap amounts, threshold minutes, prorating curve). Recompute exposure deltas versus baseline in near real time using the live aggregation engine without altering committed rules. Allow users to name, save, compare, share, and annotate scenarios; visualize baseline vs scenario and confidence ranges; and highlight top cost drivers and impacted jurisdictions. Integrate with the approval workflow to promote a scenario to a proposed policy change.

Acceptance Criteria
Segmented Drilldowns & Filtering
"As a regional manager, I want to drill down by jurisdiction, product, and segment so that I can pinpoint the biggest drivers of liability and take targeted actions."
Description

Enable interactive drilldowns and filtering across jurisdiction, product, customer segment, incident cluster, and geography. Provide synchronized charts, tables, and map layers showing exposure totals, affected account counts, and per-account credit metrics. Support time-window filters, severity bands, and weather zones, along with cross-filter interactions, breadcrumb navigation, pagination for large result sets, and CSV export from any table view.

Acceptance Criteria
Data Freshness & Confidence Indicators
"As a duty manager, I want visibility into data freshness and confidence so that I can trust the numbers and know when to wait or escalate."
Description

Expose data freshness and confidence signals on every metric, including last update time, ingest latency, coverage percentage, and model confidence for auto-clustered incidents. Visually flag stale or incomplete segments, provide diagnostics via tooltips and detail panels, and display banner warnings when thresholds are breached. Fallback to the last good snapshot when live data is delayed and annotate calculations with assumptions to preserve decision confidence.

Acceptance Criteria
Roles, Approvals & Audit for Policy Changes
"As a product owner, I want role-based approvals and audit trails so that only authorized, reviewed policy changes affect live credits."
Description

Implement role-based access control and an approval workflow for moving from scenarios to live policy. Define roles (Viewer, Analyst, Approver, Admin) with granular permissions for viewing, modeling, approving, and publishing changes. Require rationale, projected impact, and attachments on submission; support multi-step approvals; record an immutable audit trail capturing who, what, when, and why; enable rollback to prior versions; and emit webhooks to finance and billing systems upon publish.

Acceptance Criteria
Threshold Alerts & Exposure Spike Notifications
"As a finance controller, I want threshold alerts on exposure spikes so that we can coordinate mitigations and communications in real time."
Description

Provide configurable alerts when projected exposure crosses static thresholds or exceeds defined growth rates. Allow scoping by jurisdiction, product, and customer segment with delivery to Slack, Microsoft Teams, email, and SMS. Include baseline comparisons, top contributors, and quick links back to the relevant dashboard view or scenario. Support quiet hours, deduplication, escalation policies, and on-call routing to minimize alert fatigue while ensuring timely action.

Acceptance Criteria

Credit Notifier

Automatically adds plain-language credit status to SMS, email, and IVR updates—pending, approved, or posted—with per-customer amounts when allowed. Reduces inbound “Will I get a credit?” calls and builds trust with transparent timelines. Syncs with the export status from Billing Bridge to close the loop for customers and call centers.

Requirements

Credit Status Determination Engine
"As an operations manager, I want credit eligibility and amounts calculated automatically per customer so that outbound updates are accurate, consistent, and require no manual intervention."
Description

Compute per-customer credit eligibility, amount, and status (pending, approved, posted) by correlating outage impact data with Billing Bridge exports and configurable business rules. Support multiple outages per account, proration, minimum/maximum caps, product bundles, and edge cases (e.g., overlapping incidents, partial service impact). Expose an idempotent service API for synchronous lookup by customer/account and incident, with batch processing for large events. Maintain deterministic rule versions for traceability and reproducibility, and update statuses in near real time as new data arrives.

Acceptance Criteria
Multi-channel Credit Message Injection
"As a customer receiving outage updates, I want clear credit information added to texts, emails, and calls so that I immediately understand my credit status without contacting support."
Description

Append plain-language credit status snippets to all outbound SMS, email, and IVR notifications without delaying core outage updates. Provide channel-aware templates that respect SMS character limits, email formatting, and IVR TTS phrasing, with localization and accessibility considerations. Include graceful fallbacks when amounts cannot be shown (e.g., “Credit pending”) or data is stale, and ensure the snippet can be toggled per template and incident. For IVR, generate a concise spoken phrase and optional DTMF menu to replay credit information.
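A channel-aware snippet builder with the "Credit pending" fallback might look like this; the exact wording and the 160-character SMS budget are illustrative assumptions:

```python
def credit_snippet(status, amount=None, channel="sms", limit=160):
    """Plain-language credit line appended to an outage update. Falls back
    to a no-amount phrase when the amount can't be shown (consent, stale
    data) and to a short fixed phrase when over the SMS budget."""
    if status == "posted" and amount is not None:
        text = f"A ${amount:.2f} credit has been posted to your account."
    elif status == "approved" and amount is not None:
        text = f"A ${amount:.2f} credit is approved and will appear on your next bill."
    else:
        text = "A service credit is pending review for this outage."
    if channel == "sms" and len(text) > limit:
        text = "Credit update pending."  # graceful fallback under the SMS budget
    return text
```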

Acceptance Criteria
Billing Bridge Sync & Reconciliation
"As a billing administrator, I want OutageKit to sync and reconcile credit exports with our billing system so that customers see accurate statuses and call center staff can trust what is communicated."
Description

Integrate with Billing Bridge to ingest export statuses and posting confirmations, mapping remote workflow states to Credit Notifier statuses. Support webhooks and scheduled polling with idempotency keys, deduplication, and exponential backoff retries. Reconcile daily to detect discrepancies between expected and posted credits, auto-heal where possible, and surface exceptions to an operations queue with context for resolution. Preserve a complete synchronization history for auditability.

Acceptance Criteria
Privacy & Consent Controls
"As a compliance officer, I want credit communications to honor consent and privacy rules so that we minimize risk while maintaining transparency with customers."
Description

Enforce consent and policy rules so that per-customer amounts are only included when permitted; otherwise, communicate status without amounts. Respect per-channel preferences and legal opt-outs, and suppress credit messaging for blocked contacts. Mask PII in logs, encrypt sensitive fields at rest and in transit, and limit data access by role. Provide configurable retention windows and automated redaction to meet compliance obligations.

Acceptance Criteria
Credit Timeline Estimation & SLA Messaging
"As a customer awaiting a credit, I want a clear expectation of when it will appear on my bill so that I don’t need to call for an update."
Description

Calculate and communicate expected credit timelines, such as when a credit should appear on the next bill, using customer-specific billing cycles, cutoff times, and holiday calendars. Generate friendly, localized phrasing with date specificity when possible and update the timeline dynamically as export and posting statuses change. Provide fallbacks when the timeline is uncertain to avoid overpromising.
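Assuming a simple monthly cycle (statement day 1-28) and a fixed posting cutoff, the "appears on which bill" estimate could be sketched as follows; holiday-calendar handling is omitted and `cutoff_days` is an assumed parameter:

```python
from datetime import date, timedelta

def credit_visible_on(posted: date, cycle_day: int, cutoff_days: int = 3) -> date:
    """Estimate the statement date a posted credit will appear on: the next
    occurrence of cycle_day at least cutoff_days after posting. Assumes
    cycle_day in 1..28 so every month has that day."""
    d = posted + timedelta(days=cutoff_days)
    while d.day != cycle_day:  # advance to the next statement date
        d += timedelta(days=1)
    return d
```

A credit posted just inside the cutoff lands on this cycle's bill; one posted after it rolls to the next cycle, which is exactly the distinction the customer-facing phrasing needs to make.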

Acceptance Criteria
Admin Configuration & Rules Management
"As a product owner, I want to configure rules and templates centrally so that we can adapt credit messaging quickly without code changes."
Description

Provide an admin UI and API to configure eligibility rules (e.g., duration thresholds), credit amount caps, channel-specific templates, localization, defaults, and toggles for including amounts. Include preview, test send, and sandbox modes to validate templates and rules before deployment. Support role-based access control, change history, versioning, and rollback to ensure safe, auditable updates during live incidents.

Acceptance Criteria
Audit, Monitoring, and Fail-safe Suppression
"As a support lead, I want visibility into what credit information was sent and automatic safeguards if data is unreliable so that agents can resolve issues confidently and customers aren’t misinformed."
Description

Capture end-to-end audit logs of computed credit statuses, template selections, and messages sent per channel, including timestamps and message IDs. Expose dashboards and alerts for sync lag, exception rates, and delivery failures to ensure operational health. Automatically suppress or downgrade credit messaging when data quality checks fail or sources are stale, falling back to generic language until integrity is restored.

Acceptance Criteria

GeoMention Sweep

Continuously ingests nearby social posts, 311 complaints, and local forums, extracting place names and coordinates to pin chatter onto your outage map in real time. Gives teams a single, live feed of rumor hotspots without manual tab-hopping.

Requirements

Real-time Source Connectors
"As an operations manager, I want GeoMention Sweep to pull relevant chatter from approved local sources in real time so that I can see potential outages without switching tabs."
Description

Continuously ingest public, authorized data streams from social platforms, municipal 311 systems, RSS/local forums, and other supported channels via compliant APIs and webhooks, honoring rate limits and geographic bounding boxes. Normalize events to a common schema (source, text, timestamp, permissible metadata, geo hints) and attach provenance for audit. Provide per-source enablement, keyword/place filters aligned to service territories, health checks with metrics, retries/backoff, and a dead-letter queue to ensure resilient, near-real-time ingestion.

Acceptance Criteria
Place Extraction & Geocoding
"As a network analyst, I want posts translated into precise map locations so that I can see exactly where customers are reporting issues."
Description

Extract place entities (addresses, landmarks, intersections, neighborhoods, utility asset IDs) from incoming posts using NER models and curated gazetteers, then geocode to precise coordinates or polygons with confidence scoring and ambiguity handling. Leverage context such as service area, language, and nearby terms to disambiguate similarly named places and handle abbreviations or misspellings. Tag each mention with location, accuracy/confidence, and failure reasons when unresolved to support continuous model improvement.

Acceptance Criteria
Relevance Scoring & Noise Filter
"As a dispatcher, I want irrelevant or spammy posts filtered out so that the feed highlights actionable reports."
Description

Evaluate each ingested post for outage relevance using keyword heuristics, ML classification, language detection, and source trust weighting to suppress spam, bots, promotions, and unrelated chatter. Provide configurable thresholds, per-utility tuning, safe/block lists, and quiet-hour policies. Persist scores and rationales for operator transparency, and redact sensitive content per policy before display or storage.

Acceptance Criteria
Hotspot Clustering
"As an outage lead, I want related posts grouped into hotspots so that I can quickly assess emerging problems and their spread."
Description

Aggregate geo-tagged mentions into spatio-temporal clusters and deduplicate near-identical posts to surface rumor hotspots in near real time. Integrate with OutageKit’s incident auto-clustering to link hotspots to known incidents and propose new incidents when configurable thresholds are exceeded. Expose tunable parameters for time window, distance radius, minimum mentions, and semantic similarity; output cluster centroid, extent, confidence, trend direction, and linkage to incidents.
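A greedy stand-in for the tunable spatio-temporal clustering described above (no semantic similarity or incident linkage); each mention is assumed to carry coordinates and a minute-resolution timestamp:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def hotspots(mentions, radius_km=1.0, window_min=30, min_mentions=3):
    """Greedy spatio-temporal grouping: a mention joins a cluster when it is
    within radius_km and window_min of the cluster's seed mention; clusters
    below min_mentions are dropped. Parameter defaults are assumptions."""
    clusters = []
    for m in sorted(mentions, key=lambda m: m["ts_min"]):
        for c in clusters:
            seed = c[0]
            if (haversine_km((m["lat"], m["lon"]), (seed["lat"], seed["lon"])) <= radius_km
                    and m["ts_min"] - seed["ts_min"] <= window_min):
                c.append(m)
                break
        else:
            clusters.append([m])
    return [c for c in clusters if len(c) >= min_mentions]
```

A production version would likely use density-based clustering (e.g. ST-DBSCAN) and emit centroid, extent, and trend per cluster; the greedy pass keeps the idea visible in a few lines.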

Acceptance Criteria
Live Map Overlay & Feed
"As an operations manager, I want a single live view of social chatter on the outage map so that I can interpret impact without juggling tools."
Description

Render pins and heatmaps for individual mentions and hotspots on the existing outage map with real-time updates, color-coded by confidence and linkage status. Provide a synchronized feed panel with filters (source, confidence, time, geography), search, time scrubbing, and click-to-zoom interactions between feed and map. Respect role-based access, allow per-layer visibility toggles, and maintain sub-3-second UI update latency at up to 1,000 events per minute.

Acceptance Criteria
Threshold Alerts & Escalation
"As a duty supervisor, I want automatic alerts for significant chatter spikes so that I can mobilize crews before call volumes surge."
Description

Enable rule-based alerts when chatter exceeds configurable thresholds (e.g., N mentions in M minutes within an area or near critical assets), shows accelerating trends, or appears in predefined high-risk zones. Deliver alerts via email, SMS, Slack/MS Teams, and in-app banners with deduplication windows, quiet hours, and escalation if unacknowledged. Include deep links to the map/cluster view and maintain an auditable log of alert evaluations and deliveries.

Acceptance Criteria
Compliance & Audit Trail
"As a compliance officer, I want clear controls and records of how social data is used so that our monitoring remains lawful and defensible."
Description

Ensure all ingestion and processing comply with platform terms and applicable regulations by using authorized APIs, honoring content usage policies, and limiting stored data to permitted fields with configurable retention. Redact personally identifiable information where required and provide per-source consent and retention settings. Maintain an immutable audit trail of data provenance, processing steps, configuration changes, and operator actions to support transparency and dispute resolution.

Acceptance Criteria

Mismatch Watch

Automatically flags when public chatter contradicts the live map or ETAs—e.g., “power’s out on Elm” where the cluster shows restored—highlighting likely misinformation and blind spots. Helps you correct fast, reduce confusion, and find gaps in telemetry.

Requirements

Real-time Chatter Ingestion
"As an operations manager, I want all relevant public chatter and inbound messages centralized in real time so I can spot contradictions with our live map and ETAs immediately."
Description

Continuously ingest public chatter and inbound customer messages from SMS replies, web forms, IVR transcripts, and social channels via connectors and webhooks, normalizing them into a common event schema with source, timestamp, language, and geo hints. Implement rate-limit handling, retries, deduplication, and near-real-time processing (<30s latency). Automatically detect language and apply PII redaction for names, phone numbers, and addresses before storage. Tag content for downstream NLP, link to existing OutageKit incident/cluster IDs when possible, and expose health metrics for each source. Seamlessly integrates with the existing OutageKit message bus and data lake to power contradiction detection and operator workflows.

Acceptance Criteria
Contradiction Detection Engine
"As a duty supervisor, I want the system to automatically flag statements that conflict with the live map or ETA so I don’t miss emerging issues or misinformation."
Description

Use NLP to extract outage claims, restoration statements, and ETA assertions from messages, then compare them against OutageKit’s live cluster states and ETA service to identify contradictions (e.g., chatter says “still out” while cluster is “restored,” or ETA messages diverge by >X minutes). Handle negation, uncertainty, and temporal language, align messages to the correct time window, and generate a structured mismatch record with evidence snippets and impacted clusters. Provide model versioning, rules fallbacks, and explainability metadata to support operator trust. Designed as a streaming service for low-latency flagging and scalable across regions.

Acceptance Criteria
Geo-Resolution and Context Windowing
"As a dispatcher, I want vague location mentions resolved to precise areas and clusters so I can verify whether the map reflects reality for the affected customers."
Description

Resolve ambiguous location mentions (e.g., street names, landmarks, neighborhoods like “Elm”) to service territories, grid assets, or map tiles using fuzzy matching, gazetteers, and historical chatter patterns. Apply disambiguation using sender metadata, proximity to active clusters, and time-of-day context. Associate each message to the most likely cluster(s) and define a configurable context window around the last map/ETA update to determine whether a contradiction is relevant. Provide confidence scores for geo resolution and fallbacks to operator-assisted selection when ambiguity remains.

Acceptance Criteria
Confidence Scoring and Thresholds
"As an operations lead, I want adjustable thresholds and scoring so the system suppresses noise while surfacing truly actionable mismatches quickly."
Description

Compute a composite confidence score for each flagged mismatch using factors such as source credibility, number of corroborating messages, semantic strength of the claim, geo certainty, and recency. Provide configurable thresholds per region, time window, and incident severity to control alert volume. Implement hysteresis and cooldowns to prevent alert flapping, plus escalation rules (e.g., trigger only when N unique sources corroborate within T minutes). Expose tuning controls in admin settings with previews of expected alert rates before applying changes.
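A minimal sketch of the composite score and the corroboration-based escalation rule. The weights and the "3 unique sources within 15 minutes" values stand in for the configurable N and T described above:

```python
from datetime import datetime, timedelta

# Assumed weights; in the product these would be configurable per region.
WEIGHTS = {"credibility": 0.3, "corroboration": 0.3,
           "semantic": 0.2, "geo": 0.1, "recency": 0.1}

def confidence(factors: dict) -> float:
    """Weighted composite of 0-1 factor scores, returned on a 0-1 scale."""
    return sum(WEIGHTS[k] * factors[k] for k in WEIGHTS)

def should_escalate(source_times: dict,
                    window: timedelta = timedelta(minutes=15),
                    min_sources: int = 3) -> bool:
    """Escalate only when N unique sources corroborate within T minutes."""
    if not source_times:
        return False
    newest = max(source_times.values())
    recent = {s for s, t in source_times.items() if newest - t <= window}
    return len(recent) >= min_sources
```

Hysteresis and cooldowns would wrap `should_escalate` so a score hovering near the threshold does not flap between alerting and silence.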

Acceptance Criteria
Triage Inbox and Operator Actions
"As a communications specialist, I want a focused inbox with one-click actions so I can correct ETAs and messaging quickly when the system detects contradictions."
Description

Deliver a dedicated Mismatch Watch inbox showing real-time flags with cluster context, map snapshot, ETA comparison, geo confidence, and source evidence. Support quick actions: acknowledge, mark false positive, reopen or split a cluster, request field verification, trigger a broadcast correction, or escalate to on-call. Enable batch operations, keyboard shortcuts, and SLA timers with aging indicators. Sync actions back to the live map and notification modules to keep customers updated and quickly reduce confusion.

Acceptance Criteria
Learning Feedback Loop
"As a product owner, I want the system to learn from operator decisions so accuracy improves over time and alert quality stays high."
Description

Capture operator outcomes (confirmed mismatch, false positive, corrected ETA, reopened cluster) as labels to continuously improve contradiction detection, geo resolution, and thresholds. Store rationales and features for offline evaluation, schedule periodic retraining, and support safe model promotion with A/B tests and rollback. Provide metrics dashboards (precision, recall, time-to-correction, alert volume) to guide tuning and demonstrate impact on call reduction and misinformation complaints.

Acceptance Criteria
Audit Trail and Reporting
"As a regulatory liaison, I want a complete audit and reporting view of mismatches and responses so I can demonstrate due diligence and identify process improvements."
Description

Maintain an immutable, searchable log of all flagged mismatches, evidence, decisions, timestamps, responsible users, and outbound notifications. Support exportable reports (CSV, PDF) and APIs for compliance and post-incident review, with filters by incident, region, time, and outcome. Apply privacy-by-design with PII redaction preserved in logs, configurable retention policies, and role-based access controls aligned with OutageKit’s existing permission model.

Acceptance Criteria

Influence Rank

Scores each rumor by reach, velocity, author credibility, and geographic spread to prioritize response. Keeps your team focused on the few narratives that can snowball into call spikes and media questions.

Requirements

Multi-Channel Rumor Ingestion and Normalization
"As an operations manager, I want all rumor mentions from SMS, web, and IVR normalized into a single stream so that I can analyze them quickly and reliably."
Description

Implement reliable ingestion of rumor-related content from SMS replies, web report forms, and IVR transcripts with optional connectors for email forwarding and social monitoring. Normalize payloads to a common schema with source, timestamp, language, geo hints, customer context, and content hash; deduplicate, detect language, and scrub PII that is not needed downstream. Geo-resolve messages using service addresses, network assets, or cell-tower approximations, and enrich with account or service area when available. Provide idempotent, at-least-once delivery with retry and backoff, schema versioning, and health metrics. Integrate with OutageKit’s existing intake pipeline so downstream clustering and scoring receive clean, timestamped, and geo-anchored items within two minutes of receipt.

Acceptance Criteria
Narrative Detection and Clustering
"As a communications lead, I want related messages clustered into narratives so that I can address the core rumor rather than individual reports."
Description

Classify incoming items as rumor candidates and group semantically similar items into narratives using NLP embeddings, temporal proximity, and geographic overlap. Reuse OutageKit’s existing incident auto-clustering infrastructure to share embeddings and storage, while adding rumor-specific features such as sentiment, claim type, and assertion strength. Maintain narrative lifecycle states (emerging, active, decaying), support merge and split operations, and measure cluster quality with cohesion and silhouette scores. Persist narrative IDs and exemplars for downstream scoring, UI display, and alerting, updating clusters in near real-time as new items arrive.

Acceptance Criteria
Influence Scoring Engine
"As an operations manager, I want an influence score that reflects reach, velocity, credibility, and spread so that I can prioritize responses to prevent call spikes."
Description

Compute a real-time influence score for each narrative using weighted components for reach, velocity, author credibility, and geographic spread. Model velocity as mentions per time window with exponential decay, reach as estimated audience size by channel, credibility as an input from the author reputation service, and spread as cross-area penetration and proximity to critical assets. Provide configurable weights, thresholds, and decay constants, returning a normalized 0–100 score with confidence. Refresh scores on a rolling window with a sub-two-minute latency SLA. Expose scores via internal API and event bus for UI ranking, alerts, and workflow automation.
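The weighted, decayed scoring described above can be sketched as follows. The component weights, the 30-minute decay half-life, and the normalization caps are illustrative assumptions, not the shipped model:

```python
from datetime import datetime, timedelta

HALF_LIFE = timedelta(minutes=30)  # assumed decay constant

def decayed_velocity(mention_times, now):
    """Mentions per window, exponentially discounting older mentions."""
    return sum(0.5 ** ((now - t) / HALF_LIFE) for t in mention_times)

def influence_score(mention_times, reach, credibility, spread, now,
                    weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Weighted 0-100 score from velocity, reach, credibility, and spread.
    credibility and spread are expected already normalized to [0, 1]."""
    w_v, w_r, w_c, w_s = weights
    v = min(decayed_velocity(mention_times, now) / 50.0, 1.0)  # cap: 50 eff. mentions
    r = min(reach / 10_000.0, 1.0)  # reach as estimated audience size
    raw = w_v * v + w_r * r + w_c * credibility + w_s * spread
    return round(100.0 * raw, 1)
```

Refreshing on a rolling window then amounts to recomputing `influence_score` with the current `now`, letting stale mentions decay out without explicit eviction.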

Acceptance Criteria
Author Credibility Graph
"As a communications analyst, I want source credibility maintained over time so that unreliable claims are deprioritized and trusted sources escalate faster."
Description

Maintain a reputation profile per source and author derived from historical accuracy, verification status, role (customer, employee, media, elected official), tenure, and prior escalations. Support trust propagation across related identifiers (phone numbers, accounts, emails) with safeguards against gaming and impersonation. Incorporate manual verifications and overrides with full audit trails and time-based decay. Integrate with CRM to ingest VIP lists and media contacts. Make credibility scores available to the scoring engine via low-latency lookup and enforce RBAC and data retention policies.

Acceptance Criteria
Geo-Spread Impact Modeling
"As a dispatcher, I want to see geographic spread and affected service areas so that I can coordinate field updates and target messaging."
Description

Map narrative mentions to service areas, feeders, and network assets to quantify geographic spread and likely impact. Generate heatmaps and compute cross-boundary propagation indicators, weighting spread by customer density and critical infrastructure proximity. Handle ambiguous or partial locations using fuzzy matching and tower triangulation heuristics. Integrate results into the influence score and OutageKit’s live impact map, enabling targeted outreach and circuit-specific messaging.

Acceptance Criteria
Prioritization Console and Alerts
"As a duty manager, I want a ranked console and proactive alerts of top rumors so that I can act within minutes and reduce misinformation."
Description

Provide a console that ranks narratives by influence score, trend, and confidence, with filters by geography, channel, and time. Display explainability cues showing factor contributions, sample messages, and key authors. Enable one-click assignment, tagging, and creation of response playbooks that link to OutageKit’s broadcast channels for text, email, and voice. Deliver threshold-based alerts to SMS, email, and chat tools (Slack or Teams) when influence crosses configured levels or accelerates rapidly, with on-call routing and quiet hours.

Acceptance Criteria
Analyst Feedback and Model Tuning
"As an analyst, I want to give feedback and tune the model so that rankings improve over time without engineering changes."
Description

Capture analyst feedback on narratives (true, false, misleading, out-of-scope) and allow safe adjustment of scoring weights via versioned configurations. Log feedback as labeled data to evaluate precision, recall, and lead time. Support A/B testing of weight sets and provide rollback to prior configurations. Surface calibration dashboards to track correlation between influence ranks and downstream outcomes such as call volume and media inquiries, enabling continuous improvement without code changes.

Acceptance Criteria

Rebuttal Studio

Generates plain-language, targeted replies and IVR snippets using dynamic tokens (area name, current ETA, credit status, map link), with tone presets and compliance guardrails. Speeds consistent, on-brand messaging while cutting back-and-forth edits.

Requirements

Dynamic Token Engine
"As an operations manager, I want to insert live tokens into replies so that messages always reflect current outage conditions without manual updates."
Description

Implements a robust token system for replies and IVR snippets that maps dynamic fields (e.g., {area_name}, {current_eta}, {credit_status}, {map_link}, {cause}, {crew_status}) to live outage data within OutageKit. Supports formatting (time windows, pluralization), conditional phrasing when values are missing, safe defaults, and validation before send. Provides an admin UI for token catalog management, test bindings against incidents, and security controls to prevent exposure of PII or internal identifiers. Ensures consistent, accurate, and up-to-date messaging while reducing manual edits and human error.
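A minimal sketch of token substitution with safe defaults for missing values. The token names mirror the examples above; the fallback copy and the `render` API are assumptions:

```python
import string

# Assumed fallback phrasing used when a live value is unavailable.
SAFE_DEFAULTS = {
    "current_eta": "an updated ETA shortly",
    "crew_status": "crews are being assigned",
}

def render(template: str, incident: dict) -> str:
    """Fill tokens from live incident data, substituting safe defaults
    (or an empty string) when a value is missing or None."""
    values = {}
    for _, token, _, _ in string.Formatter().parse(template):
        if token is None:          # literal text segment, no token
            continue
        values[token] = incident.get(token) or SAFE_DEFAULTS.get(token, "")
    return template.format(**values)

msg = render("Crews report {area_name} should be restored by {current_eta}.",
             {"area_name": "Elm District", "current_eta": None})
# "Crews report Elm District should be restored by an updated ETA shortly."
```

Validation before send would additionally reject templates containing tokens outside the approved catalog, which is how PII or internal identifiers stay out of customer-facing copy.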

Acceptance Criteria
Tone Presets & Style Guardrails
"As a communications lead, I want tone presets and style guardrails so that generated messages consistently match our brand voice and clarity standards."
Description

Provides selectable tone presets (e.g., reassuring, direct, formal, empathetic) and enforces brand style rules (reading level targets, banned phrases, required terminology). The generator adapts copy to the chosen tone while maintaining clarity and empathy for impacted customers. Includes real-time linting with suggestions, readability scoring, and auto-rewrites to meet guidelines. Centralized configuration supports organization-wide standards and per-channel nuances to ensure consistent, on-brand messaging with fewer review cycles.

Acceptance Criteria
Compliance Guardrails & Approval Workflow
"As a compliance officer, I want built-in guardrails and an approval workflow so that outgoing communications meet regulatory requirements and minimize risk."
Description

Integrates compliance checks that flag risky claims (e.g., guaranteeing exact restoration times), enforces required disclaimers by jurisdiction and channel, and restricts sensitive tokens based on incident context (e.g., credit eligibility rules). Configurable rule sets and policy packs govern what can be sent. Introduces a role-based approval workflow with reviewer assignments, change tracking, and e-signoff before broadcast. Captures a complete audit trail for each message to reduce regulatory risk and ensure accountability.

Acceptance Criteria
Multi-Channel Snippet Generation & Validation
"As an outreach coordinator, I want channel-specific snippets and validations so that messages fit each medium’s constraints without manual tweaking."
Description

Generates optimized snippets for SMS, email, and IVR from a single prompt or template, applying channel-aware constraints. SMS includes a character counter with split detection, link shortening, and opt-out compliance checks. Email includes subject, preheader, and body with token validation. IVR outputs SSML-ready phrasing, pronunciation controls, and duration estimates. Provides live previews, test sends, and test plays to ensure content fits each medium without manual rework.
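SMS split detection could be approximated as below. This is a simplification: it uses the GSM-7 basic character set only (extension characters such as `{`, `€` actually count as two septets), with the standard limits of 160 characters for a single message, 153 per segment when concatenated, and 70/67 for UCS-2:

```python
# GSM 03.38 basic character set (simplified: extension chars omitted).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

def sms_segments(text: str) -> int:
    """Return how many SMS segments a message will split into."""
    if all(c in GSM7_BASIC for c in text):
        limit, seg = 160, 153   # GSM-7: 160 single, 153 per concatenated part
    else:
        limit, seg = 70, 67     # UCS-2 fallback for non-GSM characters
    n = len(text)
    if n <= limit:
        return 1
    return -(-n // seg)         # ceiling division
```

A character counter in the editor would call `sms_segments` on every keystroke and warn when a draft crosses a segment boundary, since each extra segment is billed separately.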

Acceptance Criteria
Real-Time Data Binding & Fallback Messaging
"As an incident manager, I want real-time data binding with safe fallbacks so that messages remain accurate and reliable even if some data is unavailable."
Description

Binds tokens to the latest incident data at generation and send time, with snapshotting of resolved values for auditability. Implements freshness checks, safe fallbacks, and conditional copy when data is missing or stale (e.g., switch from ETA to status phrasing). Supports simulation mode with test incidents and sample data. Monitors binding failures and latency with alerts and retries to ensure timely, accurate communications even under partial data conditions.

Acceptance Criteria
Versioning, Audit Trail, and Reproducibility
"As a program manager, I want complete versioning and audit history so that we can prove what was sent and roll back or iterate on templates with confidence."
Description

Adds comprehensive versioning for templates and generated outputs, capturing who edited what, when, and why, along with the model settings, tone, policy checks, incident IDs, and resolved token values used. Supports diffing between versions, rollback, and export of immutable logs for audits. Ensures every broadcast can be reconstructed exactly as sent to support compliance inquiries and continuous improvement.

Acceptance Criteria

Evidence Cards

One-click, shareable artifacts—mini impact maps, restoration progress bars, timestamped ETAs, and source citations—that embed in posts, texts, or emails. Adds visual proof to your responses and reduces follow-up questions.

Requirements

One-click Card Generation
"As an operations manager, I want to generate an Evidence Card from an incident with one click so that I can share accurate, standardized status updates immediately without manual formatting."
Description

Add a single-action “Create Evidence Card” control within Incident and Cluster views to produce shareable artifacts (mini impact map, restoration progress bar, timestamped ETA, and source citations) from the current outage context. The generator assembles live incident data (AI clusters, reported counts across SMS/web/IVR, current restoration status, and mapped impact) and renders to responsive SVG/PNG and a lightweight web card. Output includes a short URL and a unique card ID linked back to the incident for traceability. The service should render quickly, queue gracefully under load, and degrade to text-only when mapping tiles or graphics are unavailable. It integrates with OutageKit’s incident pipeline, uses existing mapping layers, and logs creation events for auditability, enabling rapid, consistent responses that reduce manual formatting and follow-up.

Acceptance Criteria
Embeddable Links and Previews
"As a communications specialist, I want short, embeddable links and social previews for evidence cards so that posts, texts, and emails render visual proof consistently across channels."
Description

Provide short, secure share links and copy-ready embed codes that work across social, web CMS, SMS, and email. Generate Open Graph/Twitter Card metadata and an oEmbed endpoint so platforms render rich previews (thumbnail, title, status, timestamp). Offer iframe/img snippet copies, channel-aware links for SMS/email, and optional UTM parameters for campaign attribution. Each share links back to the source incident and displays a canonical URL to avoid duplicate shares. Integrates with OutageKit’s messaging module to insert cards directly into outbound texts and emails, ensuring consistent visual proof wherever updates are posted.

Acceptance Criteria
Live ETA and Progress Auto-Updates
"As an operations manager, I want evidence cards to auto-update ETAs and progress with clear timestamps so that recipients always see the latest information without additional outreach."
Description

Enable cards to reflect live restoration ETAs, crew status, and progress percentages without requiring new messages. Support two modes: Live (always shows latest status with “Last updated” timestamp) and Snapshot (frozen copy for records/compliance). Display change deltas (e.g., ETA moved by +15m) and propagate updates within seconds of incident changes. Maintain version history with IDs and allow reverting or pinning a specific version. Integrates with the incident state machine and notification scheduler to ensure recipients always see the freshest information while preserving a verifiable audit trail.

Acceptance Criteria
Source Citations and Data Lineage
"As a customer relations agent, I want cards to include clear source citations and data lineage so that I can demonstrate where information came from and reduce misinformation complaints."
Description

Include transparent data provenance on each card: counts of reports by channel (SMS/web/IVR), last crew note reference, map data source, and the exact timestamps for observations and ETAs. Provide a compact “Sources” panel with links to the incident’s activity log and change history. Surface confidence indicators (e.g., ETA confidence bands) and a standard disclaimer template to reduce misinformation disputes. Citations should be readable at small sizes and configurable per organization to meet regulatory or legal requirements. This improves trust, reduces inbound challenges, and anchors public statements to verifiable records.

Acceptance Criteria
Brandable Templates and Layouts
"As a brand manager, I want configurable templates and branding for evidence cards so that our public communications are on-brand and easy to produce at scale."
Description

Offer organization-level templates for evidence cards with configurable logo, colors, typography, and layout variants per card type (impact map, progress, ETA, citations). Provide a guided editor to preview cards across channels (mobile, email client, web) and enforce safe areas and minimum sizes for legibility. Allow saving defaults by organization and incident category, enabling one-click creation that matches brand guidelines. Integrate template tokens with the rendering engine so visual identity is consistent and maintainable across updates without code changes.

Acceptance Criteria
Access Control and Redaction
"As a compliance officer, I want access controls and redaction options for evidence cards so that sensitive data remains protected while still informing the public."
Description

Control who can view a card and what data is exposed. Support public, organization-only, and tokenized access with signed URLs and optional expiration. Provide redaction modes to suppress sensitive details (exact addresses, small-area counts) and apply privacy-preserving aggregation/jitter to impact maps. Enable IP allowlists for private stakeholder shares and log access events for audit. Integrate with incident-level permissions and legal hold policies so that shared artifacts respect compliance while still conveying necessary information to the public.

Acceptance Criteria
Channel-aware Fallbacks and Accessibility
"As a customer with accessibility or language needs, I want readable, localized cards with reliable fallbacks so that I can understand outage updates regardless of device or ability."
Description

Ensure cards communicate effectively even when rich media is blocked or bandwidth is limited. Provide SMS-optimized plain-text fallbacks (ETA, progress, short source note), email ALT text and text-only MIME parts, and high-contrast, colorblind-safe palettes. Add screen-reader labels for charts, keyboard focus order, and WCAG 2.1 AA compliance. Localize content (languages, time zones, numeric/date formats) and auto-select locale based on recipient or channel settings. This guarantees inclusive, reliable delivery of critical outage information across devices and audiences.

Acceptance Criteria

Stakeholder Router

Routes flagged items to the right owners (Comms, NOC, Field) with SLA timers, approval paths, and on-call escalations. Ensures the highest-risk rumors get actioned quickly and leaves an auditable trail for postmortems.

Requirements

Rule-based Routing Engine
"As a NOC lead, I want flagged items to route automatically to the correct team and person based on context so that issues are owned quickly and consistently."
Description

Deterministically routes flagged items from SMS, web, IVR, and monitoring inputs to the correct owner group (Comms, NOC, Field) and assignee using configurable rules based on severity, geography, asset, incident cluster, keywords, and source credibility. Supports routing strategies (round-robin, skills-based, least-loaded), fallbacks when owners are unavailable, and retries on transient delivery failures. Integrates with the stakeholder directory for group membership and contact methods, and exposes routing outcomes and rationale to the UI and API. Ensures idempotent processing, low-latency dispatch under load, and alignment with existing AI incident clusters to avoid duplicate work.

Acceptance Criteria
SLA Timers & Breach Alerts
"As an operations manager, I want SLA timers to start automatically and escalate before breach so that nothing slips and our response commitments are met."
Description

Applies per-queue and per-priority SLA definitions to routed items, starting timers at ingestion or first acknowledgment, pausing for approved states (e.g., awaiting field data), and resuming upon changes. Surfaces countdown clocks in list and detail views, emits pre-breach reminders, and triggers on-breach escalations and re-routing. Honors business hours, holidays, and regional calendars, with support for customer-specific SLA profiles. Captures SLA outcomes for reporting and trend analysis to improve staffing and process adherence.
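The pause/resume timer mechanics might look like the sketch below. The class name, API, and the idea of tracking accumulated elapsed time plus a running-since marker are illustrative assumptions (business-hours and holiday calendars are omitted):

```python
from datetime import datetime, timedelta

class SlaTimer:
    """Pause-aware SLA countdown for a routed item (illustrative sketch)."""

    def __init__(self, started_at: datetime, budget: timedelta):
        self.budget = budget
        self._elapsed = timedelta(0)
        self._running_since = started_at   # None while paused

    def pause(self, now: datetime):
        """E.g. item enters an approved waiting state like 'awaiting field data'."""
        if self._running_since is not None:
            self._elapsed += now - self._running_since
            self._running_since = None

    def resume(self, now: datetime):
        if self._running_since is None:
            self._running_since = now

    def remaining(self, now: datetime) -> timedelta:
        elapsed = self._elapsed
        if self._running_since is not None:
            elapsed += now - self._running_since
        return self.budget - elapsed
```

Pre-breach reminders would fire when `remaining(now)` drops below a configured warning margin, and on-breach escalation when it goes negative.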

Acceptance Criteria
Approval Workflow Paths
"As a communications director, I want multi-step approvals for public updates so that messages are accurate, compliant, and timely."
Description

Provides configurable, multi-step approval workflows for communications and operational actions, supporting serial and parallel steps, conditional branches by severity or audience, and time-boxed approvals with auto-escalation. Records approver identity, decision, and rationale, and links approvals to the originating item for end-to-end traceability. Integrates with publishing endpoints (SMS, email, voice updates) so that approved messages are released automatically, while rejected items return to authors with required revisions. Offers mobile-friendly approval prompts and one-tap decisions for speed during active incidents.

Acceptance Criteria
On-call Schedule Integration & Escalation
"As an on-call engineer, I want escalations to follow the live on-call schedule with acknowledgment tracking so that I’m reached quickly without being spammed when already engaged."
Description

Integrates with on-call systems (e.g., PagerDuty, Opsgenie, calendar-based rotations) to target the current primary and secondary for each function, respecting quiet hours, overrides, and handoffs. Implements channel escalation policies (SMS → push/email → voice) with acknowledgment tracking and automatic escalation on non-ack within defined windows. Provides rate-limiting and bundling to prevent alert fatigue during incident storms. Exposes real-time delivery and ack status in the console and via API for operational awareness.

Acceptance Criteria
Rumor Risk Scoring & Auto-Prioritization
"As a communications specialist, I want the system to elevate high-risk rumors automatically so that we address the most damaging items first."
Description

Scores flagged items for misinformation risk using signals such as volume, velocity, proximity to critical assets/customers, sentiment, and source credibility, leveraging existing AI clustering to aggregate context. Maps risk bands to routing priorities, SLA tiers, and mandatory approval paths for high-risk items. Provides transparent explanations and tunable thresholds so operators can calibrate sensitivity and override when necessary. Continuously learns from operator feedback to reduce false positives and improve time to action on the most impactful rumors.

Acceptance Criteria
Auditable Action Trail
"As a compliance manager, I want a complete, immutable action trail so that audits and postmortems have a reliable source of truth."
Description

Creates an immutable, append-only log of routing decisions, SLA changes, approvals, escalations, acknowledgments, and message publishes with timestamps, actors, and rationale. Enables filtered views in the console and export to CSV/JSON for postmortems, regulatory reviews, and customer reporting. Links audit entries to incident clusters and stakeholder identities to provide a complete chain of custody from report to resolution. Enforces retention policies and tamper-evident storage to preserve integrity.

Acceptance Criteria
Stakeholder Directory & Ownership Mapping
"As a service owner, I want ownership mappings tied to regions and assets so that incidents reach the right experts without manual triage."
Description

Maintains a directory of stakeholder groups and individuals (Comms, NOC, Field) with roles, skills, regions, assets, contact methods, and working hours to drive accurate routing. Supports dynamic ownership rules (e.g., feeder lines, neighborhoods, service tiers) and temporary overrides for events or staffing gaps. Syncs with HRIS/LDAP and imports from CSV to keep rosters current, with permissions that restrict who can edit routing-critical data. Provides APIs and UI to test and preview routing outcomes for a given item before activation.

Acceptance Criteria

Impact Analytics

Tracks rumor volume, sentiment, and deflection after responses, showing which rebuttals cooled hotspots and how quickly. Quantifies trust gains and reduces future call peaks with evidence-backed playbooks.

Requirements

Real-time Signal Ingestion & Normalization
"As an operations analyst, I want all inbound signals normalized in real time so that rumor volume and sentiment can be measured consistently across channels."
Description

Continuously ingest and normalize inbound signals from SMS, web reports, IVR transcripts, and optional social mentions into a unified schema with timestamps, geo/segment metadata, channel, language, and message IDs. Provide de-duplication, language detection, basic PII redaction, and idempotent processing with sub-10s end-to-end latency to feed Impact Analytics. Integrate with existing OutageKit message bus and data store, exposing a streaming topic and a backfill API for historical replay. Enforce rate limiting and error handling with dead-letter queues and observability (metrics, logs, alerts) to ensure reliable, complete data for rumor volume and sentiment calculations.

Acceptance Criteria
Rumor & Sentiment Classification
"As a communications lead, I want messages classified as rumor versus factual and scored for sentiment so that I can prioritize rebuttals and measure tone."
Description

Deploy an NLP pipeline that classifies messages as rumor vs factual report, assigns sentiment scores, and tags topic categories (e.g., cause, crew ETA, safety). Support multilingual inputs with language-aware models, provide confidence scores, and allow human-in-the-loop review and corrections within the OutageKit console. Maintain model versioning and threshold configuration, targeting at least 85% F1 for rumor detection and real-time scoring at ingestion throughput. Store labels and features in the analytics store to power dashboards and downstream attribution.

Acceptance Criteria
Response Impact Attribution Engine
"As an operations manager, I want to attribute changes in rumor and deflection to specific responses so that we can prove what worked and refine messaging."
Description

Link outbound communications (text, email, IVR announcements) to subsequent changes in rumor volume, sentiment, and repeat-contact deflection within matched geographies, segments, and time windows. Implement baseline forecasting and counterfactual controls to estimate incremental effect by rebuttal variant, with support for A/B tests and holdouts. Attribute cooling time and deflection percentages to specific rebuttals while adjusting for confounders such as restoration events or major updates. Surface metrics via API and dashboard widgets for evidence-backed reporting and optimization.

Acceptance Criteria
Hotspot Detection & Cooling Timeline
"As a regional communications lead, I want to see rumor hotspots and cooling times on a map so that I can direct rebuttals where they will have the greatest impact."
Description

Detect and visualize rumor hotspots by clustering signals spatially and temporally, generating live heatmaps and trend lines that update after each rebuttal. Track and display time-to-cool for each hotspot, annotate with the rebuttal used, and trigger alerts when thresholds for rumor volume, negative sentiment, or growth rate are exceeded. Integrate with OutageKit’s map and incident views, allowing drill-down by area, channel, and topic, and export snapshots for incident reviews.

Acceptance Criteria
Trust Score & Trendline Metrics
"As a VP of customer experience, I want a trust score and trendline so that I can report improvements and proactively address emerging gaps."
Description

Compute a configurable trust score by area and customer segment using inputs such as rumor-to-fact ratio, sentiment average, responsiveness latency, and deflection outcomes. Provide trendlines across incidents and comparative views (before/after communications, region vs region) with threshold-based alerts on trust dips. Expose metrics via dashboard, CSV export, and API for executive reporting and integration with BI tools.
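One plausible shape for the configurable composite, with each input normalized to [0, 1] before weighting. The normalizations (60-minute latency ceiling, sentiment mapped from [-1, 1]) and the default weights are assumptions, not the shipped formula:

```python
def trust_score(rumor_ratio: float, avg_sentiment: float,
                response_latency_min: float, deflection_rate: float,
                weights=(0.3, 0.25, 0.2, 0.25)) -> float:
    """0-100 trust score per area/segment from the inputs listed above."""
    w1, w2, w3, w4 = weights
    factual = 1.0 - min(rumor_ratio, 1.0)           # lower rumor-to-fact is better
    sentiment = (avg_sentiment + 1.0) / 2.0         # map [-1, 1] -> [0, 1]
    responsiveness = max(0.0, 1.0 - response_latency_min / 60.0)
    return round(100 * (w1 * factual + w2 * sentiment
                        + w3 * responsiveness + w4 * deflection_rate), 1)
```

Trendlines then come from recomputing the score per reporting window and segment; threshold alerts compare consecutive windows for dips.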

Acceptance Criteria
Evidence-backed Playbook Builder
"As a communications lead, I want evidence-backed rebuttal templates so that I can respond faster with messages proven to cool hotspots."
Description

Aggregate rebuttals and their measured impacts to recommend templates with expected effect size and median cooling time by scenario (e.g., downed tree, upstream provider, planned maintenance). Provide versioning, governance workflows for approval, and tagging. Enable one-click insertion of approved rebuttals into outbound campaigns within OutageKit, and continuously update recommendations based on new attribution data.

Acceptance Criteria
Privacy, Compliance, and Auditability
"As a compliance officer, I want privacy controls and an audit trail for Impact Analytics so that we meet regulatory requirements and maintain customer trust."
Description

Apply PII redaction and consent/opt-out enforcement across ingestion and analytics, run sentiment and rumor analysis on anonymized content wherever possible, and implement data retention controls aligned with policy. Provide a complete audit trail of classifier outputs, attribution decisions, configuration changes, and user actions, with role-based access controls and exportable logs to support SOC 2 and regulatory reviews.

Acceptance Criteria

Block Pulse

Live, block-level visualization that fuses AMI pings and inbound report deltas to animate minute-by-minute re‑energization. Shows percent restored per block and highlights stalls, so coordinators see real progress instantly and avoid sending crews to areas already coming back.

Requirements

Real-time Signal Fusion Pipeline
"As an operations manager, I want all AMI and customer report signals fused in near real time so that Block Pulse reflects true restoration progress without noise or delay."
Description

Build a streaming pipeline that ingests AMI meter pings and inbound outage report deltas (SMS, web, IVR), normalizes and deduplicates events, associates them to service points and blocks, computes minute-by-minute change states, and emits a unified stream with end-to-end latency under 60 seconds and resilience to backfill/replay. This enables accurate, timely inputs for Block Pulse visualization and stall detection. It integrates with existing OutageKit ingestion buses and identity maps, publishes to a block-pulse topic for UI/analytics subscribers, and exposes health metrics and alerts for data freshness and gaps.
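The normalization and deduplication step might look like the sketch below, which collapses repeat reports of the same state from the same endpoint inside a window and emits a per-block, time-ordered change stream; the field names and the five-minute window are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signal:
    source: str            # "ami" | "sms" | "web" | "ivr"
    endpoint: str          # meter ID or reporter contact
    block_id: str
    state: str             # "out" | "restored"
    ts: float              # epoch seconds

def fuse(signals, dedup_window_s=300):
    """Deduplicate and order signals per block (illustrative sketch)."""
    seen = {}    # (endpoint, block, state) -> last accepted ts
    fused = {}   # block_id -> accepted signals, time-ordered
    for sig in sorted(signals, key=lambda s: s.ts):
        key = (sig.endpoint, sig.block_id, sig.state)
        last = seen.get(key)
        if last is not None and sig.ts - last < dedup_window_s:
            continue  # repeat report inside the window
        seen[key] = sig.ts
        fused.setdefault(sig.block_id, []).append(sig)
    return fused
```

Keying the dedup on state (not just endpoint) preserves out-then-restored transitions from the same meter.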

Acceptance Criteria
Block Topology & Geospatial Index
"As a network GIS analyst, I want Block Pulse to use accurate block boundaries tied to meters and feeders so that restoration percentages per block are trustworthy."
Description

Maintain a canonical geospatial model of blocks (polygons) with relationships to feeders/segments and mapped service points, enabling fast spatial joins of incoming signals and aggregation by block. Provide import APIs for GIS shapefiles/GeoJSON, versioning, validation rules, and fallbacks for unmapped meters. Expose vector-tiled layers for performant rendering. This ensures precise block-level rollups and consistent highlighting, aligning with OutageKit’s existing mapping components and geocoding services.

Acceptance Criteria
Restoration Timeline Animation UI
"As a storm desk coordinator, I want to see blocks animate as they come back online so that I can quickly assess where restoration is accelerating or lagging."
Description

Deliver a map layer that animates minute-by-minute re-energization at the block level with controls for live mode, pause, scrub, and playback speed. Apply intuitive color scales by percent restored, show tooltips with counts, last-change timestamp, and confidence, and provide a legend and quick filters (feeder, region, priority circuits). The UI subscribes to the block-pulse stream and vector tiles, fits within the OutageKit console layout, and respects role-based access and performance budgets.

Acceptance Criteria
Stall Detection & Highlighting
"As a duty supervisor, I want stalled blocks to be highlighted automatically so that I can prioritize intervention before SLAs slip."
Description

Implement detection rules to flag blocks where restoration has stalled using configurable thresholds (e.g., no improvement for N minutes or restoration rate below X%). Visually outline stalled blocks, annotate with time-since-change and suspected causes (data sparse, probable upstream fault), and list them in a dedicated panel with sorting and acknowledgment. Generate optional notifications to the OutageKit alert bus. Thresholds and behaviors are configurable per utility tenant.
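The two rules could be expressed roughly as below; the N-minute plateau and X%-per-hour defaults are placeholders for the per-tenant configuration the requirement calls for:

```python
def is_stalled(history, now, no_progress_min=20, min_rate_pct_per_hr=5.0):
    """Flag a block as stalled (illustrative thresholds).

    history: time-ordered (epoch_minutes, percent_restored) samples.
    """
    if len(history) < 2:
        return False  # data sparse; annotated separately
    last_pct = history[-1][1]
    if last_pct >= 100.0:
        return False
    # Rule 1: no improvement for N minutes (find start of current plateau).
    t_last_change = history[-1][0]
    for t, p in reversed(history):
        if p != last_pct:
            break
        t_last_change = t
    if now - t_last_change >= no_progress_min:
        return True
    # Rule 2: restoration rate below X% per hour over the last hour.
    window = [(t, p) for t, p in history if now - t <= 60]
    if len(window) >= 2:
        (t0, p0), (t1, p1) = window[0], window[-1]
        if t1 > t0 and (p1 - p0) / (t1 - t0) * 60 < min_rate_pct_per_hr:
            return True
    return False
```

A block satisfying either rule would then be outlined and listed in the stalled-blocks panel.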

Acceptance Criteria
Percent Restored & Confidence Metrics
"As an operations manager, I want percent restored values with confidence indicators so that I can make decisions with the right level of certainty."
Description

Compute percent restored per block using known service points/AMI meters as the denominator and blend AMI and customer report signals with weighting. When AMI penetration is partial, infer denominator and trend using sampling and historical baselines, and surface a confidence score reflecting data completeness, recency, and signal agreement. Expose these metrics to the UI and APIs to prevent misinterpretation in sparse or noisy conditions.
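One plausible blending is sketched below; the AMI weighting, the agreement measure, and the confidence formula are assumptions for illustration, not a shipped OutageKit formula:

```python
def percent_restored(ami_restored, ami_meters, service_points,
                     reports_restored, reports_out, ami_weight=0.8):
    """Blend AMI and customer-report signals per block (sketch)."""
    ami_pct = 100.0 * ami_restored / ami_meters if ami_meters else None
    n_rpt = reports_restored + reports_out
    rpt_pct = 100.0 * reports_restored / n_rpt if n_rpt else None
    if ami_pct is None and rpt_pct is None:
        return None, 0.0
    if ami_pct is None:
        estimate, agreement = rpt_pct, 0.3   # reports only: low certainty
    elif rpt_pct is None:
        estimate, agreement = ami_pct, 0.6   # AMI only: moderate certainty
    else:
        estimate = ami_weight * ami_pct + (1 - ami_weight) * rpt_pct
        agreement = 1.0 - abs(ami_pct - rpt_pct) / 100.0
    # Confidence rises with AMI penetration and with signal agreement.
    coverage = min(1.0, ami_meters / service_points) if service_points else 0.0
    confidence = round(0.5 * agreement + 0.5 * coverage, 2)
    return round(estimate, 1), confidence
```

Surfacing the confidence alongside the estimate is what keeps sparse-data blocks from being misread as precise.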

Acceptance Criteria
Historical Replay & Export
"As a reliability engineer, I want to replay restoration and export data so that I can analyze tactics and produce compliance documentation."
Description

Persist block-level restoration time series and events to enable time-window replay of the Block Pulse animation, snapshots at arbitrary timestamps, and export of CSV/GeoJSON for post-event analysis. Provide console controls for selecting ranges and speeds, and APIs for programmatic access. This supports after-action reviews, training, and regulatory reporting with auditable data lineage.

Acceptance Criteria
Crew Dispatch Deconfliction Alerts
"As a field dispatcher, I want alerts when a block is restoring on its own so that I avoid sending crews where they’re not needed."
Description

Augment dispatch workflows with real-time advisories when a targeted block is trending to self-restore or has surpassed a configurable restoration threshold, prompting a review before assigning crews. Show rationale (trend, last change, confidence), allow overrides with reason capture, and log decisions for audit. This reduces unnecessary truck rolls and improves crew utilization by leveraging Block Pulse momentum signals.

Acceptance Criteria

Dark Pocket Finder

Automatically detects small, lingering outages inside otherwise restored zones using spatial outlier analysis. Estimates scope and likely causes (e.g., lateral fuse, single‑phase loss) and ranks pockets by customers affected, helping dispatch prioritize the fastest, highest‑impact fixes.

Requirements

Spatial Outlier Detection Engine
"As a distribution operations manager, I want the system to automatically detect small lingering outages within otherwise restored areas so that we can identify and address dark pockets before customers escalate."
Description

Continuously scans restored zones to detect small, lingering outage clusters using spatial-temporal outlier analysis over customer reports (SMS/web/IVR), AMI/meter pings, and recent switching events. Produces "pocket" objects with centroid, boundary polygon, detection timestamp, and confidence. Supports adaptive baselining by time of day and weather, configurable thresholds per utility, and deduplication with existing incident clusters. Targets near-real-time performance (initial detection within 3 minutes of zone restoration) with precision/recall goals and safeguards to filter noise. Exposes results via API and event bus for downstream ranking, mapping, and dispatch.
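A naive single-link clustering over still-dark points conveys the shape of pocket formation; it stands in for the production spatial-temporal outlier analysis, and the eps/min_size values are illustrative:

```python
def find_pockets(dark_points, eps=0.002, min_size=2):
    """Group still-dark service points into candidate pockets (sketch).

    dark_points: (lat, lon) of points still reporting out inside a zone
    marked restored. Clusters smaller than min_size are dropped as noise.
    """
    pockets, unassigned = [], list(dark_points)
    while unassigned:
        cluster = [unassigned.pop()]
        changed = True
        while changed:
            changed = False
            for p in list(unassigned):
                if any(abs(p[0] - q[0]) <= eps and abs(p[1] - q[1]) <= eps
                       for q in cluster):
                    cluster.append(p)
                    unassigned.remove(p)
                    changed = True
        if len(cluster) >= min_size:
            centroid = (sum(p[0] for p in cluster) / len(cluster),
                        sum(p[1] for p in cluster) / len(cluster))
            pockets.append({"centroid": centroid, "size": len(cluster)})
    return pockets
```

The resulting pocket objects (centroid plus size) would then gain boundary polygons, timestamps, and confidence before being published.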

Acceptance Criteria
Topology & Meter Data Integration
"As a platform administrator, I want OutageKit to ingest and keep current our network topology and meter status streams so that dark pocket detection has accurate connectivity and customer status context."
Description

Ingests and normalizes feeder topology (feeders, laterals, transformers, phase connectivity) and near-real-time meter status streams (AMI pings, last-gasp/restore), reconciling customer-to-asset relationships and geocoding accuracy. Provides resilient pipelines with schema validation, deduplication, backfill, and late-data handling. Maintains versioned topology snapshots to support cause inference and scope estimation. Ensures security, PII minimization, and role-based access. Supplies a consistent data layer that Dark Pocket Finder relies on for accurate connectivity, phase, and customer status context.

Acceptance Criteria
Pocket Scope Estimation
"As a dispatcher, I want each detected pocket to include an estimated customer count and geographic extent so that I can gauge impact and plan response appropriately."
Description

Calculates the estimated number of affected customers and geographic extent for each detected pocket using customer geocodes, topology relationships, and observed meter statuses. Accounts for multi-dwelling units, mixed-phase service, and incomplete topology by producing min/max bounds and a confidence score. Outputs include affected customer count, likely served feeder/lateral/transformer identifiers, and an uncertainty rationale. Results update as new signals arrive and are attached to the pocket entity for ranking, mapping, and communications.

Acceptance Criteria
Cause Inference Model
"As a trouble supervisor, I want likely cause predictions for each pocket so that I can send the right crew with the right materials on the first trip."
Description

Predicts the most likely cause category for each pocket (e.g., lateral fuse, single-phase loss, transformer failure, drop/service issue) and recommends crew type and materials. Combines domain rules with an ML model leveraging features such as phase imbalance patterns, protective device operations, weather/lightning, vegetation risk, asset age, and recent switching history. Produces top-N causes with confidence and explanation factors. Integrates with asset registry and incident history for continuous learning via labeled resolution codes and crew feedback.

Acceptance Criteria
Impact Ranking & Prioritization
"As a dispatcher, I want pockets ranked by impact and effort so that I can allocate crews to the fastest, highest-impact fixes first."
Description

Scores and orders detected pockets by customers affected, presence of critical facilities, estimated SAIDI/SAIFI impact, travel time from nearest available crew, and estimated time-to-restore. Supports configurable weighting and policy-based rules (e.g., critical care customers first) with tie-breakers and SLA flags. Updates rankings in real time as scope and crew availability change. Provides APIs and UI controls for sort/filter and emits prioritized task recommendations to the dispatch board.
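A weighted score like the sketch below illustrates the ordering; the field names and default weights are hypothetical, and policy rules such as critical-care-first would be layered on top as hard overrides:

```python
def rank_pockets(pockets, weights=None):
    """Order pockets by a weighted impact/effort score (illustrative).

    Each pocket dict carries: customers, critical_facilities,
    est_saidi_min, travel_min, est_restore_min.
    """
    w = weights or {"customers": 1.0, "critical": 50.0, "saidi": 0.5,
                    "travel_penalty": 0.2, "restore_penalty": 0.1}
    def score(p):
        return (w["customers"] * p["customers"]
                + w["critical"] * p["critical_facilities"]
                + w["saidi"] * p["est_saidi_min"]
                - w["travel_penalty"] * p["travel_min"]
                - w["restore_penalty"] * p["est_restore_min"])
    return sorted(pockets, key=score, reverse=True)
```

Re-running the sort as scope estimates and crew availability change gives the real-time ranking updates the requirement describes.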

Acceptance Criteria
Pocket Map Visualization
"As an operations manager, I want an at-a-glance map of dark pockets with key details so that I can monitor residual outages and communicate status confidently."
Description

Displays detected pockets on the live operations map with interactive boundaries, count badges, and cause/confidence chips. Supports hover/click for details (timeline, affected customers, assets), filters by priority, and overlays for weather and switching states. Optimized for desktop and tablet with accessible color/contrast, keyboard navigation, and responsive performance on large service territories. Syncs with communications modules to prevent conflicting ETAs and provides deep links to incident and ticket views.

Acceptance Criteria
Dispatch Integration & Ticketing
"As a dispatcher, I want to create or attach a work ticket directly from a detected pocket with prefilled details so that I can speed dispatch and reduce data entry errors."
Description

Enables one-click creation or attachment of work tickets from a pocket, pre-populating location, estimated scope, likely cause, recommended crew type, and priority. Integrates with common WFM/CMMS systems via secure APIs/webhooks, supports bidirectional status sync, idempotent retries, and deduplication across pockets/incidents. Captures an audit trail of actions and feedback for model improvement and enforces role-based permissions to protect sensitive customer data.

Acceptance Criteria

Smart Retask

Continuously scores remaining dark blocks by impact, crew proximity, drive time, and critical facility weightings to recommend the next best assignment. Enables one‑tap retasking as the heatmap changes, cutting windshield time and boosting restores per hour.

Requirements

Real-time Outage Scoring Engine
"As a dispatch supervisor, I want outages to be continuously scored by impact and travel time so that I can always see the most valuable next job."
Description

Continuously computes a composite priority score for every dark block/incident cluster using live inputs: customer impact, feeder/transformer scope, presence of critical facilities, current crew proximity, drive-time ETA with live traffic, estimated repair duration, SLA/regulatory penalties, and aging. Normalizes to a 0–100 score with timestamps and reasons, recalculating on event triggers (new reports, ETR changes, crew status/location changes) and at a bounded interval (≤60s). Exposes scores via internal API and pub/sub for the recommender and UI, handles data quality (deduplication, stale data detection, fallbacks), and guarantees performance at scale (p95 < 500ms per recompute cycle for 10k clusters).
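The normalization to 0–100 with attached reasons might look like this sketch; the factor caps, weights, and field names are assumptions, since the spec only fixes the scale, timestamps, and reasons:

```python
def priority_score(cluster, now_min):
    """Composite 0-100 priority with per-factor reasons (illustrative)."""
    factors = {
        # Each factor is clipped to [0, 1] before weighting.
        "impact":    min(cluster["customers_out"] / 1000.0, 1.0),
        "critical":  1.0 if cluster["critical_facility"] else 0.0,
        "proximity": max(0.0, 1.0 - cluster["drive_min"] / 120.0),
        "sla":       min(cluster["sla_penalty_usd"] / 10_000.0, 1.0),
        "aging":     min((now_min - cluster["reported_min"]) / 240.0, 1.0),
    }
    weights = {"impact": 0.35, "critical": 0.2, "proximity": 0.2,
               "sla": 0.15, "aging": 0.1}
    score = round(100.0 * sum(weights[k] * v for k, v in factors.items()), 1)
    # Reasons: factor names ordered by their weighted contribution.
    reasons = sorted(factors, key=lambda k: weights[k] * factors[k],
                     reverse=True)
    return {"score": score, "reasons": reasons, "computed_at_min": now_min}
```

Event triggers and the bounded interval would both funnel into this same deterministic recompute.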

Acceptance Criteria
Live Crew Telemetry & Constraints Ingestion
"As an operations manager, I want accurate live crew locations and constraints so that recommendations reflect who can actually take the job now."
Description

Integrates with AVL/MDT and mobile apps to ingest real-time crew location, status (available, en route, working), shift windows, skill tags, vehicle capabilities, and on‑hand materials. Computes road‑aware drive times via mapping provider with current traffic and restrictions. Degrades gracefully on telemetry loss (last‑known position with decay), enforces data freshness SLAs, and caches results to minimize API costs. Exposes a consistent crew state model for scoring and recommendations.
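For the "last-known position with decay" behavior, an exponential half-life is one plausible model; the half-life and usability floor below are assumed values:

```python
def position_confidence(age_s, half_life_s=300.0):
    """Decay confidence in a crew's last-known position (assumed model)."""
    return 0.5 ** (age_s / half_life_s)

def usable_position(last_fix, now_s, min_confidence=0.25):
    """Return the last fix only while its decayed confidence is usable."""
    conf = position_confidence(now_s - last_fix["ts"])
    if conf < min_confidence:
        return None  # too stale: fall back to depot or route assumptions
    return {**last_fix, "confidence": round(conf, 3)}
```

Downstream scoring would consume the attached confidence rather than treating every fix as equally fresh.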

Acceptance Criteria
Next‑Best Assignment Recommender
"As a dispatch supervisor, I want a clear ranked list of next‑best assignments so that I can allocate crews quickly and confidently."
Description

Generates ranked next‑best job recommendations per crew and globally by combining outage scores with crew constraints, travel time, workload, and skill matching. Produces deterministic ranked lists with configurable tie‑breakers, load balancing, and exclusion rules (e.g., safety or switching prerequisites). Returns machine‑readable explanations and confidence. Publishes updates to the console and APIs within seconds of inputs changing.

Acceptance Criteria
One‑Tap Retask Dispatch
"As a dispatch supervisor, I want to retask a crew with a single tap so that I can cut windshield time and increase restores per hour."
Description

Provides a supervisor UX to retask a crew with a single tap, showing the recommended job, ETA, travel time, and projected customer‑minutes restored versus current assignment. On confirm, pushes a dispatch order to the crew device (MDT/mobile) with turn‑by‑turn navigation, job packet, switching notes, and contact details. Supports undo/rollback, acknowledges receipt, and logs all actions for audit. Respects crew state and safety locks.

Acceptance Criteria
Critical Facility Weighting Configuration
"As an administrator, I want to configure critical facility weightings so that the system prioritizes restores that matter most to our community."
Description

Delivers an admin interface and API to manage critical facility types, geographies, and weightings that feed the scoring engine. Supports imports from GIS, scheduled weight changes (e.g., heat events), and real‑time adjustments with immediate effect on recommendations. Validates inputs, enforces allowed ranges, versions every change with auditability, and provides a sandbox to preview impacts before applying.

Acceptance Criteria
Re‑optimization Triggers & Thrash Control
"As a dispatch supervisor, I want the system to avoid excessive retasking so that crews are not whipsawed and safety and efficiency are maintained."
Description

Defines when and how Smart Retask recomputes and surfaces retask suggestions: event‑driven triggers, periodic refresh, and manual re‑optimize. Applies anti‑thrash policies including cooldown windows, assignment stickiness, and minimum benefit thresholds before suggesting a retask. Batches minor changes, supports quiet hours, and provides override capabilities for supervisors.
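The cooldown and minimum-benefit gates can be captured in a small predicate; the 30-minute cooldown and 15% benefit floor are hypothetical defaults, since the spec leaves both configurable:

```python
def should_suggest_retask(current, candidate, last_retask_min, now_min,
                          cooldown_min=30, min_benefit_pct=15.0):
    """Anti-thrash gate before surfacing a retask (illustrative policy).

    current/candidate carry projected customer-minutes restored.
    """
    # Cooldown: never whipsaw a crew that was just retasked.
    if now_min - last_retask_min < cooldown_min:
        return False
    # Stickiness: the candidate must beat the current job by a real margin.
    base = current["projected_cust_min"]
    if base <= 0:
        return True
    gain_pct = 100.0 * (candidate["projected_cust_min"] - base) / base
    return gain_pct >= min_benefit_pct
```

Supervisor overrides and quiet hours would bypass or tighten this gate rather than replace it.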

Acceptance Criteria
Explainability & Audit Trail
"As a compliance officer, I want explainable recommendations and a complete audit trail so that we can justify decisions and pass audits."
Description

Captures and exposes human‑readable rationales for every recommendation, highlighting the contributing factors and deltas versus the current plan. Maintains an immutable audit trail of recommendations, actions taken (who, when, what changed), and outcomes (actual restore time, travel time). Supports export, filtering, and retention policies to meet regulatory and operational review needs.

Acceptance Criteria

Auto‑Split Clusters

When partial restores create divergent conditions, automatically splits an incident into child segments with their own ETRs, audiences, and map extents. Keeps messages accurate at the block level, reducing “we’re still out” callbacks and preserving a clean audit trail.

Requirements

Real-time Split Trigger Detection
"As an operations manager, I want the system to automatically detect when an outage has diverging restoration states so that incidents split at the right time without me manually investigating every report."
Description

Continuously analyzes inbound reports (SMS, web, IVR), device telemetry, and operator notes to detect divergent restoration states inside a single incident. Implements configurable thresholds (e.g., percent restored, spatial density, feeder boundaries) and time windows to determine when a split is warranted. Uses streaming aggregation and geospatial clustering to flag sub-areas exhibiting materially different status or ETR confidence. Emits a proposed split plan with rationale and confidence score, integrates with alerting, and exposes controls for auto or manual approval. Must operate within sub-minute latency under peak report volume and degrade gracefully if data sources are delayed.

Acceptance Criteria
Automated Child Incident Creation
"As a duty supervisor, I want child incidents to be created automatically with the right defaults and lineage so that I can manage each segment independently without losing the overall context."
Description

Upon approval (auto or manual), creates child incidents from the parent with inherited metadata (source, tags, cause, crews) and unique identifiers. Initializes each child with its own ETR, status, scope, and communication channels while preserving a parent-child linkage for roll-up analytics. Ensures idempotency, race-safe execution, and retry on partial failures. Validates constraints (min size, distance, redundancy) before commit. Updates parent to reflect split lineage and closes or limits parent messaging to avoid confusion. Provides hooks for post-create processors (e.g., ETR recalculation, audience reassignment).

Acceptance Criteria
Audience Re-segmentation & Subscription Reassignment
"As a customer communications lead, I want impacted customers automatically moved to the correct child incident so that each person receives accurate, non-duplicative updates for their specific situation."
Description

Reassigns impacted customers and subscribers from the parent incident to the appropriate child based on service address, meter location, network topology, or geocoded report location. Preserves user preferences (channel, language, quiet hours) and consent while preventing duplicate or conflicting notifications. Backfills missed messages relevant to the new child and schedules future updates accordingly. Provides reconciliation for unmatched subscribers and a safe fallback to parent if assignment cannot be determined. Runs incrementally as new data arrives and supports bulk rollback on merge.

Acceptance Criteria
Map Extent Recalculation & Visualization
"As a field coordinator, I want the map to automatically show accurate boundaries for each split segment so that crews and stakeholders can see exactly where conditions differ."
Description

Computes precise polygon extents for each child incident using geospatial clustering, street/block boundaries, and network topology overlays. Renders children with distinct colors and legends on the live map, supports quick zoom-to-child, and shows parent boundaries for context. Updates extents in real time as telemetry or reports change, and exposes clear visual cues when an address falls near a boundary. Ensures performance on web and mobile, including tile caching and progressive rendering. Provides accessibility-compliant styles and printable snapshots for briefings.

Acceptance Criteria
Message Consistency & ETR Personalization
"As a customer, I want messages that reflect the exact status of my area so that I’m not confused by updates for nearby blocks that don’t apply to me."
Description

Generates and sends plain-language updates per child incident with segment-specific ETRs, causes, and safety notices. Prevents cross-talk by deduplicating across channels and suppressing parent messages that conflict with child status. Supports templated narratives with variables for local context (landmarks, blocks) and confidence qualifiers. Integrates with SMS, email, and voice pipelines with per-channel rate limiting and fallbacks. Maintains consistent cadence SLAs and escalates when a child lacks a current ETR, prompting an operator action or automated estimation.

Acceptance Criteria
Audit Trail & Split Provenance
"As a compliance officer, I want a complete audit trail of why and how an incident was split so that I can demonstrate due diligence and reconstruct decisions during reviews."
Description

Records an immutable timeline of split-related events including trigger signals, thresholds applied, algorithms and versions used, operator approvals, child creation details, audience moves, ETR changes, and merges. Provides searchable, exportable logs with correlation IDs linking parent and children. Supports compliance retention policies and redaction of PII while preserving event integrity. Surfaces a human-readable narrative and a machine-readable JSON for downstream BI and regulatory reporting.

Acceptance Criteria
Manual Oversight, Override, and Merge Controls
"As an incident supervisor, I want to review and fine-tune proposed splits and merge them later if conditions converge so that operational control remains with the team while benefiting from automation."
Description

Offers a supervisory workflow to preview proposed splits, adjust boundaries, edit child metadata, set or override ETRs, and choose audiences before committing. Provides one-click merge of children back into the parent (or into another child) with proper audience reversion, message suppression, and lineage updates. Includes role-based access, draft mode, validation warnings, and what-if impact summaries (who will be notified, message changes, map updates). Ensures actions are reversible with bounded-time undo and are captured in the audit trail.

Acceptance Criteria

Block ETA Tuner

Learns restoration velocity from recent AMI rebounds, feeder topology, and crew check‑ins to adjust ETAs per block with confidence bands. Flags low‑confidence areas for review, helping leaders set realistic expectations without overpromising.

Requirements

Unified Restoration Signal Ingest
"As an operations data engineer, I want a reliable, unified feed of AMI rebounds, topology, and crew updates so that the ETA tuner has accurate, timely inputs for block-level estimates."
Description

Ingests and normalizes real-time AMI rebound events, feeder topology snapshots, and crew check-in updates into a single, time-aligned stream keyed at the block level. Includes deduplication, latency buffering, clock skew correction, schema validation, and failure retries to ensure reliable inputs for ETA tuning. Enriches events with feeder/transformer relationships and outage ticket references to support per-block rollups and partial restoration detection. Emits clean telemetry to the ETA engine via a durable queue with at-least-once delivery.

Acceptance Criteria
Block ETA Learning Engine
"As an outage lead, I want automatically tuned ETAs per block so that customers get realistic timelines that update as field conditions change."
Description

Trains and runs an online model that estimates restoration time per block by learning recent restoration velocity from AMI rebounds, feeder topology constraints, switching plans, and crew proximity/check-ins. Produces a median ETA and confidence bands per block, continuously recalculating as new signals arrive. Supports cold-start and fallback rules (historical averages, neighbor block inference) and handles partial restorations and multi-crew parallel work. Exposes a versioned API for querying current ETAs and reasons.
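An empirical-quantile version of the median-plus-bands output is sketched below; the 10th/90th-percentile band, the per-percent scaling for partial restorations, and the minimum band width are illustrative assumptions:

```python
import statistics

def block_eta(recent_restore_min, remaining_pct, min_band_min=15.0):
    """Median ETA with a band from recent restoration velocity (sketch).

    recent_restore_min: observed minutes-to-restore for comparable nearby
    blocks (derived from AMI rebounds). remaining_pct scales the estimate
    for a partially restored block.
    """
    if not recent_restore_min:
        return None  # cold start: fall back to historical averages
    scale = remaining_pct / 100.0
    samples = sorted(m * scale for m in recent_restore_min)
    median = statistics.median(samples)
    lo = samples[max(0, int(0.1 * (len(samples) - 1)))]
    hi = samples[int(0.9 * (len(samples) - 1))]
    # Enforce a minimum band width to avoid overprecision.
    if hi - lo < min_band_min:
        pad = (min_band_min - (hi - lo)) / 2.0
        lo, hi = max(0.0, lo - pad), hi + pad
    return {"eta_min": median, "band_min": (lo, hi)}
```

The online model would replace the raw sample pool with learned velocities, but the median/band contract stays the same.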

Acceptance Criteria
Confidence Scoring & Flagging
"As a duty manager, I want low-confidence ETAs to be clearly flagged with reasons so that I can review and correct them before they are broadcast."
Description

Computes confidence scores and statistical intervals for each block ETA using recency and volume of AMI rebounds, crew check-in frequency, topology complexity, and historical error. Applies thresholds to flag low-confidence blocks, attaches machine-readable reason codes, and surfaces them to review workflows and dashboards. Provides configurable minimum widths for confidence bands to prevent overprecision.

Acceptance Criteria
Review Queue & Manual Tuner
"As an operations supervisor, I want to quickly review and tune ETAs for problematic blocks so that communications remain accurate without overpromising."
Description

Provides an operator UI and API to review flagged blocks, inspect inputs and model rationale, and manually adjust ETAs and confidence bands with notes. Supports per-block overrides, batch adjustments by feeder or neighborhood, approval workflows, and automatic expiry of overrides when restoration signals arrive. Includes a customer-impact preview and audit trail of changes for compliance and postmortems.

Acceptance Criteria
Channel-Aware ETA Publishing
"As a communications coordinator, I want tuned ETAs to propagate to all channels with clear confidence language so that customers know what to expect."
Description

Publishes block-level ETAs and confidence phrasing to SMS, email, and IVR with language tailored to each channel and customer preference. Ensures rate-limited updates, deduping, and safe timing windows to avoid notification fatigue. Supports rescinding or revising messages when ETAs are tuned, includes confidence language and next-update expectations, and limits scope to impacted customers within each block.

Acceptance Criteria
Model Quality & Drift Monitor
"As a product owner, I want ongoing accuracy and drift monitoring so that we can trust the Block ETA Tuner and improve it over time."
Description

Continuously evaluates ETA accuracy by comparing predictions to actual restoration times inferred from AMI rebounds and closeout events. Tracks MAE, calibration of confidence bands, and per-feeder error trends; alerts when drift or undercoverage exceeds thresholds. Provides daily dashboards, weekly summaries, and version comparisons to guide model updates and operational policies.
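The MAE and band-calibration check reduces to a small aggregation over closed incidents; the record shape and the 80% target coverage below are assumed defaults for illustration:

```python
def eta_quality(records, target_coverage=0.8):
    """MAE and confidence-band calibration from closed incidents (sketch).

    records: dicts with predicted_min, band (lo, hi), and actual_min,
    where actual_min is inferred from AMI rebounds/closeout events.
    """
    if not records:
        return None
    mae = sum(abs(r["predicted_min"] - r["actual_min"])
              for r in records) / len(records)
    covered = sum(1 for r in records
                  if r["band"][0] <= r["actual_min"] <= r["band"][1])
    coverage = covered / len(records)
    return {
        "mae_min": round(mae, 1),
        "band_coverage": round(coverage, 2),
        # Undercoverage means the bands are too narrow for the stated confidence.
        "drift_alert": coverage < target_coverage,
    }
```

Slicing the same computation per feeder yields the per-feeder error trends the requirement calls for.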

Acceptance Criteria

Restore Replay

Timeline playback of block‑by‑block re‑energization with markers for switching ops and crew actions. Supports post‑incident reviews, regulatory evidence, and training by showing exactly when and where power returned, exportable as clips or reports.

Requirements

Unified Restoration Timeline Model
"As an operations manager, I want all restoration, switching, and crew events unified into a single, time‑ordered timeline so that I can replay restoration accurately without cross‑checking multiple systems."
Description

Define a canonical, ordered event timeline that merges re‑energization confirmations (SCADA/AMI), OMS switching steps, crew mobile updates, and citizen reports into a single, time‑synchronized model keyed to GIS blocks/feeders. Normalize disparate timestamps, deduplicate overlapping signals, compute confidence levels, and bind events to OutageKit incident clusters. Provide near‑real‑time ingestion pipelines, idempotent processing, and APIs to query by incident, feeder, substation, or geography. Persist full provenance and versioning to enable accurate replay, rollback, and auditability.

Acceptance Criteria
Interactive Map Playback Controls
"As a dispatcher, I want to scrub through a timeline on the map and see blocks re‑energize with speed controls and filters so that I can understand when and where power returned."
Description

Deliver a web‑based, map‑centric playback experience that animates block‑by‑block re‑energization over time. Include play/pause, step, and variable speeds (0.5×–16×), a time scrubber with zoomable windows, and spatial filters by incident, feeder, substation, or crew. Visually differentiate energized states, show cumulative customers restored, and maintain the time cursor across map and panel views. Optimize for smooth scrubbing (<250 ms latency) on incidents with up to tens of thousands of events, with client‑side caching and progressive loading for reliability on constrained networks.

Acceptance Criteria
Switching & Crew Action Markers
"As a field supervisor, I want clearly labeled markers for switching and crew actions along the replay so that I can see which actions led to restoration milestones."
Description

Overlay timeline markers for switching operations and crew actions with geospatial anchors and device references. Provide iconography, tooltips, and a details drawer showing operator, device IDs, action types, notes, photos, and links back to OMS/field sources. Enable filtering by action type and crew, show causal relationships to subsequent re‑energization events, and ensure accessibility via keyboard navigation and ARIA labels.

Acceptance Criteria
Clip & Report Export
"As a compliance analyst, I want to export a bounded replay and a summarized report so that I can document restoration for regulators and stakeholders."
Description

Enable export of selected time ranges as (a) MP4 clips with timecode, legend, and watermark, (b) secure, read‑only interactive web links, and (c) PDF/CSV reports summarizing restoration by block/feeder/device with counts and timestamps. Support brand theming, optional redactions, and captions, with background job processing, progress indicators, retention policies, and web/API endpoints for automation.

Acceptance Criteria
Audit‑Ready Evidence & Provenance
"As a regulatory affairs manager, I want verifiable, tamper‑evident replay artifacts with documented provenance so that our evidence stands up to audits and inquiries."
Description

Produce tamper‑evident outputs with cryptographic hashes, signed timestamps, and immutable audit logs. Record full provenance (data sources, versions, timezone, filters) and chain‑of‑custody metadata in a verification manifest attached to each export. Provide an integrity‑check endpoint and read‑only archives suitable for regulatory submission and post‑incident review.

Acceptance Criteria
Role‑Based Access & Redaction Controls
"As an operations administrator, I want granular permissions and redaction options for replays and exports so that we can share insights safely while meeting privacy requirements."
Description

Enforce granular RBAC for viewing playback, inspecting markers, and creating exports, integrated with OutageKit SSO/IdP. Provide policy‑driven redaction of customer PII and sensitive device identifiers, time‑limited share links with expiration, approval workflows for external sharing, and comprehensive access logs for compliance.

Acceptance Criteria

Why This ETA

Explainable confidence breakdown that shows what’s driving the ETA score—telemetry health, crew distance and drive time, switching complexity, weather severity, and historical variance—with data freshness timers. Builds trust, speeds approvals, and equips Comms with clear talking points when confidence is low.

Requirements

Factorized ETA Confidence Scoring
"As an operations manager, I want a transparent ETA confidence score with factor contributions so that I can quickly judge reliability and decide whether to approve or request further investigation."
Description

Build a service that computes an ETA confidence score (0–100) and per-factor contributions using inputs from telemetry health and recency, crew distance and drive-time, switching complexity, weather severity, and historical variance. Normalize each factor by asset class, territory, and incident type, and apply configurable, versioned weights that can be tuned without redeploys. Handle missing or stale inputs via fallbacks and uncertainty penalties, and propagate error states. Expose the score, factor weights, and contribution deltas via an internal API and event stream. Integrate with the OutageKit Incident Service to recompute on state changes and attach the breakdown to each active incident. Emit thresholds (e.g., low-confidence flags) to drive UI indicators and broadcast rules. All computations must be deterministic, timestamped, and traceable to their input snapshots.
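The per-factor structure might look like the sketch below, where each factor carries a 0–1 health value and an age, stale inputs are penalized, and missing inputs contribute nothing; the weights, 30-minute staleness horizon, and low-confidence threshold are assumed defaults (the spec makes weights configurable and versioned):

```python
def eta_confidence(factors, weights=None):
    """Factorized 0-100 ETA confidence with contributions (illustrative).

    factors: name -> {"value": 0..1 health score, "age_s": staleness in s}.
    """
    weights = weights or {"telemetry": 0.3, "crew": 0.25, "switching": 0.2,
                          "weather": 0.15, "history": 0.1}
    contributions = {}
    for name, w in weights.items():
        f = factors.get(name)
        if f is None:
            contributions[name] = 0.0  # missing input: full uncertainty penalty
            continue
        # Linear freshness decay; a factor is worthless after 30 minutes.
        freshness = max(0.0, 1.0 - f["age_s"] / 1800.0)
        contributions[name] = w * f["value"] * freshness
    score = round(100.0 * sum(contributions.values()), 1)
    return {"score": score,
            "contributions": {k: round(v, 3) for k, v in contributions.items()},
            "low_confidence": score < 50.0}
```

Because the output carries each factor's contribution, the same payload drives both the UI breakdown and the low-confidence broadcast rules.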

Acceptance Criteria
Real-time Data Freshness Timers
"As a duty supervisor, I want per-factor freshness timers so that I know when inputs are stale and can trigger refreshes or caveat customer communications."
Description

Implement per-factor freshness tracking with last-updated timestamps, SLA windows, and countdown timers for telemetry, crew location, switching plans, weather feeds, and historical baselines. Display staleness indicators and warnings, and degrade confidence contributions when inputs exceed freshness thresholds. Orchestrate background refresh jobs and retries for each data source, with circuit breakers and exponential backoff. Surface freshness metadata through the same API used by the confidence service and publish updates on the event bus. Integrate with ingestion connectors to mark partial updates and with the UI to render timers and tooltips. Provide configuration for freshness thresholds by region and asset class.
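A minimal sketch of the per-factor freshness tracking and staleness degradation described above; the SLA windows and the 50% penalty are illustrative assumptions:

```python
# Per-factor freshness against SLA windows, with confidence degradation
# once an input goes stale. SLA values and penalty are illustrative.
SLA_SECONDS = {"telemetry": 300, "crew_location": 120, "weather": 900}

def freshness(last_updated_ts, now_ts, source):
    """Return (countdown_seconds, is_stale) for one data source."""
    remaining = SLA_SECONDS[source] - (now_ts - last_updated_ts)
    return max(0, remaining), remaining < 0

def degraded(contribution, is_stale, penalty=0.5):
    """Halve a factor's confidence contribution after its SLA is breached."""
    return contribution * penalty if is_stale else contribution
```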

Acceptance Criteria
Explainability Breakdown UI
"As a dispatcher, I want a clear breakdown UI that explains what is driving the ETA confidence so that I can brief leadership and customers consistently."
Description

Create an incident console panel that visualizes the confidence breakdown with weighted bars and plain-language reasons for each factor (e.g., "Crew 12 minutes away via Route 4" or "Telemetry stale: last ping 27m ago"). Include color-coded confidence states, per-factor freshness badges, and hover/click tooltips that expand to show underlying evidence and timestamps. Provide copy-to-clipboard for a short summary, responsive layouts for tablet and mobile, and accessible semantics (WCAG AA, keyboard navigation, ARIA labels). Subscribe to event updates to live-refresh without page reloads and show skeleton loaders during recompute. Deep-link from each factor to relevant source views with role-aware access control.

Acceptance Criteria
Channel-ready Narrative Generation
"As a communications manager, I want channel-ready explanations of the ETA so that updates are clear, consistent, and fit each channel’s constraints."
Description

Produce concise, channel-specific narratives (SMS, email, IVR) that explain the ETA confidence in plain language using the factor breakdown and freshness metadata. Use configurable templates with localization and character-count limits, automatically omitting unavailable factors and inserting caveats when confidence is low or data is stale. Expose a simple API for the Broadcast service to fetch narratives on demand or via webhook triggers, with caching and idempotency. Provide fallbacks for IVR (SSML) and ensure generated text is compliant with tone and policy guidelines. Include trace IDs for auditability and link back to the incident and breakdown snapshot.
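As a hedged sketch of the channel-specific rendering above, the template string, the low-confidence caveat wording, and the 160-character SMS limit are all illustrative assumptions:

```python
# Channel-specific narrative with an SMS character cap and a caveat
# inserted when confidence is low. Wording and limits are illustrative.
SMS_LIMIT = 160

def render_narrative(channel, eta_window, confidence, area):
    text = f"Power in {area}: estimated restoration {eta_window}."
    if confidence < 50:
        text += " Estimate may change; we will update you."
    if channel == "sms" and len(text) > SMS_LIMIT:
        text = text[: SMS_LIMIT - 1] + "\u2026"  # truncate with ellipsis
    return text
```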

Acceptance Criteria
Override Governance and Audit Logging
"As an incident commander, I want override actions and rationales to be auditable so that we maintain accountability and can review decisions post-incident."
Description

Enable authorized personnel to override ETAs and adjust factor weights for specific incidents with required rationale entry. Capture immutable audit logs of inputs, outputs, user actions, and configuration versions at the time of change. Provide a timeline view showing who changed what and when, with diffs of factor contributions and the resulting confidence score. Emit audit events to the governance pipeline and support export (CSV/JSON) for post-incident reviews. Enforce approval workflows based on incident severity, and block broadcasts until required approvals are met when confidence falls below policy thresholds.

Acceptance Criteria
Privacy, RBAC, and Redaction Controls
"As a privacy officer, I want sensitive location and asset details to be restricted and obfuscated so that we protect crews and infrastructure while still sharing useful context."
Description

Apply least-privilege access to the breakdown so sensitive details (exact crew GPS, asset identifiers, switching steps) are restricted to authorized roles. Redact or obfuscate sensitive values in UI and APIs (e.g., bucketed crew distances, generalized locations) while preserving usefulness. Provide per-tenant policy configuration and default safe settings, with server-side enforcement and audit of access attempts. Ensure narratives never expose restricted data and include automatic redaction in copy/export paths. Integrate with OutageKit IAM for roles, groups, and SSO claims, and log all accesses with correlation IDs for incident forensics.
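The bucketing and masking above might look like this minimal sketch; the bucket edges, role names, and masking format are hypothetical:

```python
# Policy-driven redaction: bucket exact crew distances and mask asset
# identifiers for unauthorized roles. Edges and roles are illustrative.
def bucket_distance(miles):
    """Replace an exact crew distance with a coarse bucket."""
    for edge, label in [(1, "< 1 mi"), (5, "1-5 mi"), (15, "5-15 mi")]:
        if miles < edge:
            return label
    return "> 15 mi"

def redact_asset_id(asset_id, role):
    """Show full identifiers only to authorized roles."""
    if role in {"ops_admin", "dispatcher"}:
        return asset_id
    return asset_id[:2] + "***"
```

Enforcing these transforms server-side, before the payload reaches the UI or export paths, is what keeps copy/export from leaking restricted values.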

Acceptance Criteria

Adaptive Wording

Auto-tailors message phrasing and specificity to the confidence score. High confidence: precise ETA with firm language. Medium: short window with softer qualifiers. Low: broader ranges and more frequent check‑ins. Keeps messages honest across SMS, email, web, and IVR to reduce overpromising and callbacks.

Requirements

Confidence-to-Tone Mapping Engine
"As an operations manager, I want messages to automatically match their certainty level so that customers receive firm or cautious wording aligned with our true confidence and we avoid overpromising."
Description

Implements a rules- and model-driven engine that converts incident confidence scores into tiered messaging intents (high, medium, low) with corresponding phrasing strength, specificity, and ETA granularity. Thresholds are configurable per organization, with defaults aligned to industry best practices. The engine selects firm language and precise ETAs at high confidence, softer qualifiers and short windows at medium confidence, and broad ranges with explicit uncertainty at low confidence, including mandatory follow-up commitments. It normalizes inputs from ETA predictors and incident inference, handles edge cases such as missing or rapidly changing confidence, and produces a channel-agnostic intent payload consumed by SMS, email, web, and IVR templates to keep language consistent and honest across all surfaces.
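The tier mapping above can be sketched as a small pure function; the 70/40 thresholds and the intent-payload fields are illustrative defaults, configurable per organization:

```python
# Map a 0-100 confidence score to a channel-agnostic messaging intent.
# Thresholds and payload fields are illustrative assumptions.
def tone_intent(score):
    if score >= 70:
        return {"tier": "high", "eta_style": "point", "qualifier": None}
    if score >= 40:
        return {"tier": "medium", "eta_style": "short_window",
                "qualifier": "approximately"}
    return {"tier": "low", "eta_style": "broad_range",
            "qualifier": "current estimate", "followup_required": True}
```

Downstream SMS, email, web, and IVR templates would consume this one payload, which is what keeps wording consistent across channels.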

Acceptance Criteria
Channel-Specific Template Library
"As a communications lead, I want channel-optimized templates that reflect the selected tone so that customers get clear, consistent updates whether they read a text, email, website banner, or hear an IVR message."
Description

Provides a managed library of templates for SMS, email, web status cards, and IVR, each optimized for channel constraints and best practices. Includes tokenized placeholders (ETA, cause, area, ticket), automated handling of character limits and segmentation for SMS, subject-line guidance for email, responsive content blocks for web, and SSML/voice prompts for IVR. Templates consume the intent payload from the mapping engine to render appropriate qualifiers and time windows. Supports brand voice configuration, time zone-aware formatting, and accessibility requirements (readability grade targets, screen-reader hints, and TTS pacing). Ensures consistent semantics across channels while adhering to their unique delivery constraints.

Acceptance Criteria
Confidence-Driven Cadence Scheduler
"As an ops supervisor, I want the system to increase update frequency when certainty is low and taper it when certainty is high so that customers stay informed without being spammed."
Description

Adjusts notification frequency and content cadence based on current confidence levels and their rate of change. At low confidence, schedules more frequent check-ins with explicit uncertainty; at medium, schedules periodic windows; at high, minimizes noise while issuing firm updates and closure confirmations. Honors quiet hours, per-channel rate limits, subscriber preferences, and regulatory opt-out rules. De-duplicates messages when there is no material change, and auto-escalates cadence when confidence drops or incident scope expands. Exposes configuration at the org and incident levels and integrates with broadcast pipelines to ensure timely, right-sized communication.
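A minimal sketch of the cadence logic above, assuming illustrative intervals (not defaults shipped with OutageKit) and an immediate escalation when confidence degrades:

```python
# Update interval (minutes) by confidence tier, halved when confidence
# drops between updates. All values are illustrative assumptions.
CADENCE_MINUTES = {"high": 60, "medium": 30, "low": 10}

def next_interval(tier, previous_tier=None):
    """Return minutes until the next update, escalating on a drop."""
    interval = CADENCE_MINUTES[tier]
    order = ["low", "medium", "high"]
    if previous_tier and order.index(tier) < order.index(previous_tier):
        interval = max(5, interval // 2)  # confidence dropped: speed up
    return interval
```

Quiet hours, rate limits, and opt-out rules would be applied after this base interval is computed.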

Acceptance Criteria
Editorial Policy and Preview Console
"As a communications manager, I want to configure and preview wording rules per confidence tier so that our outbound language is compliant, on-brand, and ready before an incident hits."
Description

Delivers an admin UI within OutageKit for defining phrase banks and qualifiers by confidence tier and channel, with live previews for SMS, email, web, and IVR. Allows simulation of incidents with different confidence and ETA inputs to see rendered messages before publishing. Includes versioning, approval workflows, and audit trails to ensure changes are reviewed and traceable. Provides linting against banned words and readability targets, and enforces required placeholders (e.g., time window at low confidence). Enables safe iteration on tone policies without code changes and promotes consistent brand voice.

Acceptance Criteria
Safety and Compliance Guardrails
"As a risk and compliance officer, I want built-in guardrails that stop absolute or misleading claims at lower confidence so that we reduce legal exposure and customer complaints."
Description

Applies automatic safeguards to prevent overpromising and noncompliant language. Enforces tier-based restrictions (e.g., blocks absolute statements like “guaranteed” at medium/low confidence), inserts required disclaimers and guidance, and validates that ETAs match allowed specificity for the tier. Screens for sensitive information, profanity, and prohibited claims, and ensures accessibility and localization requirements are met. Produces actionable errors or auto-rewrites with compliant phrasing while logging violations for audit. Integrates with templates and the editorial console to provide real-time feedback during authoring and at send time.
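The tier-based restriction above might be enforced with a lint pass like this sketch; the banned-phrase list is illustrative:

```python
import re

# Block absolute claims at medium/low confidence. Patterns illustrative.
BANNED_AT_LOW_CONFIDENCE = [
    r"\bguaranteed\b",
    r"\bdefinitely\b",
    r"\bwill be restored by\b",
]

def lint_message(text, tier):
    """Return matched violations; an empty list means the message may send."""
    if tier == "high":
        return []
    return [p for p in BANNED_AT_LOW_CONFIDENCE
            if re.search(p, text, re.IGNORECASE)]
```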

Acceptance Criteria
Multilingual and Locale Style Pack
"As a regional operator, I want adaptive wording to work in our local language with culturally appropriate qualifiers so that customers clearly understand updates without misinterpretation."
Description

Adds first-class internationalization for adaptive wording across supported locales. Maintains translation memories keyed by confidence tier and channel, with locale-appropriate qualifiers and politeness levels. Handles date/time, number, and time-zone formatting per locale, and produces IVR SSML in the correct language and voice. Supports fallbacks when a locale lacks a specific phrase, and flags untranslated or noncompliant strings in the editorial console. Ensures that honesty of tone and specificity rules carry accurately across languages, not just literal translations.

Acceptance Criteria
Feedback Loop and Continuous Tuning
"As a product owner, I want data on how different wordings perform so that we can tune phrasing and thresholds to reduce callbacks and improve trust."
Description

Captures downstream signals such as customer reply keywords, IVR inputs, email engagement, callback rates, and complaint tags to measure clarity and overpromising. Provides A/B testing of phrasing within the same confidence tier and measures lift in callback reduction and customer sentiment. Feeds aggregated metrics back to adjust thresholds, qualifiers, and cadence defaults, with human-in-the-loop approvals. Exposes dashboards and alerts when wording correlates with increased confusion or complaints, enabling data-informed iteration of the adaptive wording policies.

Acceptance Criteria

Confidence Heatmap

Geospatial overlay that color-codes confidence for each cluster or block, with drill‑downs to see top uncertainty drivers. Helps NOC and dispatch spot fragile ETAs, retask crews to raise confidence in red zones, and brief leadership with a single, at-a-glance view.

Requirements

Confidence Scoring Engine
"As a NOC analyst, I want a reliable confidence score per outage cluster so that I can quickly identify fragile ETAs and focus attention where it will reduce risk."
Description

Compute a normalized 0–100 confidence score and uncertainty band per outage cluster and per map block by fusing multi-source signals: customer report density and conflict ratio (SMS, web, IVR), model variance from incident auto-clustering, age and freshness of last update, ETA adherence drift, crew proximity and status, network/alarm telemetry stability, historical fix-time reliability, and weather severity. Produce an explainable output that includes a ranked list of drivers with contribution percentages. Recalculate on a rolling cadence (≤60 seconds) and upon signal changes; support backfill and recomputation for a given time window. Expose a versioned API endpoint and event stream for downstream consumers (heatmap renderer, alerts). Integrate with existing clustering service and data lake, applying time-decay weighting and deduplication. Provide configuration for per-utility weighting and calibration against historical ground truth to minimize false reds/greens. Ensure resilience with graceful degradation when a feed drops and clear quality flags in the payload.

Acceptance Criteria
Geospatial Heatmap Overlay
"As an operations manager, I want an at-a-glance confidence heatmap so that I can spot red zones and brief stakeholders without digging into raw incident lists."
Description

Render a performant, colorblind-safe heatmap overlay that encodes confidence across the service area at multiple zoom levels, aggregating from block to cluster to region. Provide an interactive legend with quantile and fixed-threshold modes, tooltips on hover, and click-to-open detail drawers. Support WebGL-based tile rendering backed by server-side vector and raster tiles, targeting 60 FPS on modern hardware with an acceptable fallback on low-power devices. Enable layer toggles (confidence, ETA spread, report density), pinning of areas, and cross-filtering with active incidents. Respect map projections, basemap themes, and masking to the utility footprint. Include client- and edge-caching, tile versioning, and real-time updates via SSE/WebSocket without full refresh. Ensure accessibility with color palettes meeting contrast guidelines and provide a pattern overlay for very low confidence to aid monochrome printing.

Acceptance Criteria
Uncertainty Drivers Drill-down
"As a dispatcher, I want to see the specific drivers of low confidence so that I can take targeted actions to stabilize ETAs in problem areas."
Description

Provide a contextual drill-down panel for any selected cluster or map block that lists the top uncertainty drivers with contribution percentages and underlying evidence (e.g., conflicting customer reports, sparse telemetry, ETA variance). Include raw metrics, last-seen timestamps, and a mini timeline showing confidence changes and key events (crew arrival, new alarms). Offer actionable guidance to reduce uncertainty such as requesting targeted customer confirmations, prioritizing meter pings, or validating crew status. Allow sorting and filtering by driver type and link out to source records for auditability. Persist the last 24 hours of driver attribution for post-incident review.

Acceptance Criteria
Threshold Alerts and Red Zone Watchlist
"As an on-call incident lead, I want automatic alerts for low-confidence areas so that I can intervene before customer communications drift from reality."
Description

Enable configurable alerts when confidence drops below defined thresholds for selected areas, clusters, or asset groups, with hysteresis to prevent flapping. Provide multi-channel delivery (in-app, email, SMS, Slack/MS Teams) with routing based on on-call schedules and region ownership. Include a Red Zone watchlist view that aggregates all active low-confidence areas, shows time-in-state, and deduplicates overlapping alerts. Support acknowledgment, snooze, and escalation policies, with full audit logging. Integrate with the heatmap so alerts deep-link to the exact selection and snapshot state at trigger time.
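The hysteresis behavior above can be sketched as a tiny state check: enter the alert state below one threshold but leave it only above a higher one, so a score oscillating near the line does not flap. Both threshold values are illustrative:

```python
# Hysteresis for threshold alerts. Trigger below 40, clear only above 50.
TRIGGER_BELOW, CLEAR_ABOVE = 40, 50

def is_alerting(score, was_alerting):
    """Return the new alert state given the current confidence score."""
    if was_alerting:
        return score <= CLEAR_ABOVE  # hold the alert through the dead band
    return score < TRIGGER_BELOW
```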

Acceptance Criteria
Crew Retask Recommendations
"As a dispatch supervisor, I want data-driven retask suggestions that raise confidence in fragile areas so that I can improve ETA reliability with minimal disruption to the plan."
Description

Generate ranked recommendations to retask or stage nearby crews to maximize expected confidence uplift in red zones, respecting operational constraints (skills, shift limits, travel time, safety, priority incidents). Use a simple uplift estimator that ties driver sensitivity (e.g., crew presence, fresh meter reads) to projected confidence improvement and ETA tightening. Present what-if scenarios with estimated impact, travel ETA, and opportunity cost, and allow one-click handoff to Dispatch with a human-in-the-loop approval. Log decisions and outcomes to refine the estimator over time.

Acceptance Criteria
Snapshot and Briefing Exports
"As a communications lead, I want exportable, time-stamped heatmap snapshots so that I can brief leadership and external stakeholders quickly and consistently."
Description

Allow users to capture timestamped snapshots of the confidence heatmap with legend, selected areas, top risks, and ETA spread, exportable as PNG/PDF and shareable links with expiring tokens. Support scheduled exports (e.g., hourly during major events) and inclusion in automated leadership briefings. Store snapshots with metadata in secure object storage with retention policies and redaction of PII. Ensure visual fidelity for print and dark/light themes, and embed a disclaimer with data freshness and confidence scale definitions.

Acceptance Criteria

Promise Guard

Policy guardrail that blocks overly precise ETAs when confidence is below thresholds, suggesting safer windows or ‘under investigation’ status. Integrates with Risk Scoring Gate and Dual‑Approver Flow so high‑risk changes get the right scrutiny before going public.

Requirements

Confidence-to-Precision Rules Engine
"As an operations manager, I want ETA precision to be automatically adjusted based on confidence so that customers receive realistic expectations and we avoid over‑promising during uncertain incidents."
Description

Implements a deterministic policy engine that maps incident confidence signals to permissible ETA precision and wording, blocking or transforming overly specific ETAs when confidence is below configured thresholds. Ingests inputs such as model confidence from AI clustering, historical forecast accuracy by region and asset class, current incident severity, cluster size, and variance. Applies configurable floors, ceilings, and gradient rules to convert point ETAs into expandable time windows or an "under investigation" status. Normalizes time windows (e.g., round to 30/60-minute blocks), enforces minimum and maximum window widths, and guarantees consistent formatting across locales and time zones. Handles both auto-generated and operator-entered ETAs, always gating manual entries through the same policy. Provides safe defaults when any input signal is missing, ensures evaluation latency under 50 ms per request at p95, and degrades gracefully to a conservative message if evaluation fails. Exposes a pure function API for synchronous checks in the publish path and supports batch evaluation for preview screens.
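A sketch of the confidence-to-precision mapping described above: a point ETA becomes a window whose width grows as confidence falls, with edges rounded outward to 30-minute blocks. The window widths, the 25-point floor, and the rounding granularity are illustrative defaults, not OutageKit's shipped policy:

```python
from datetime import datetime, timedelta

def guarded_eta(point_eta, confidence):
    """Return (window_start, window_end), or None for 'under investigation'.

    Widths and thresholds are illustrative policy defaults.
    """
    if confidence < 25:
        return None  # too uncertain to publish any window
    width = 60 if confidence >= 70 else 120 if confidence >= 40 else 240
    half = timedelta(minutes=width // 2)

    def floor_30(dt):  # round down to the nearest 30-minute block
        return dt - timedelta(minutes=dt.minute % 30, seconds=dt.second,
                              microseconds=dt.microsecond)

    start = floor_30(point_eta - half)
    end = point_eta + half
    floored_end = floor_30(end)
    if floored_end != end:
        floored_end += timedelta(minutes=30)  # round the end edge outward
    return start, floored_end
```

Because the function is pure and deterministic, it fits the synchronous publish-path check and batch preview use the requirement calls for.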

Acceptance Criteria
Multi-Channel Enforcement Gate
"As a communications specialist, I want all outbound channels to respect ETA guardrails automatically so that no customer sees an over‑precise promise regardless of where they receive updates."
Description

Introduces a pre-publish enforcement layer that intercepts all outbound customer updates (SMS, email, web status page, IVR/TTS, push, and partner webhooks) to apply Promise Guard decisions consistently. Performs preflight validation to detect disallowed precision (e.g., exact timestamps) and transforms messages to policy-compliant windows or "under investigation" templates before dispatch. Ensures channel-specific formatting, including localized time zones, 12/24‑hour formats, natural-language dates, and IVR-appropriate phrasing and prompt length. Provides idempotency via a message key to prevent duplicate sends, queues messages awaiting approval, and fails safely by substituting a conservative message if any dependency is unavailable. Integrates with the existing broadcast service through a standardized middleware interface and exposes observability (structured logs, metrics, traces) for policy hits, blocks, and transforms.

Acceptance Criteria
Policy Configuration Console
"As a policy administrator, I want to configure and safely roll out confidence thresholds and ETA precision rules so that guardrails reflect our operational risk appetite without disrupting live communications."
Description

Delivers an admin UI and API to author, version, and schedule Promise Guard policies, including per-region, per-incident-type, and per-channel thresholds and precision mappings. Supports draft, review, and publish states with effective-date scheduling and environment separation (staging vs. production). Provides simulation against historical incidents to preview the impact of policy changes, with side-by-side diffs of original vs. guarded messages. Enforces role-based access control, validation of threshold ranges, and conflict detection across overlapping scopes. Enables import/export of policy JSON for CI/CD workflows and records change history with author, rationale, and rollback to previous versions.

Acceptance Criteria
Risk & Dual-Approver Integration
"As an incident lead, I want high‑risk, low‑confidence ETAs to require dual approval so that sensitive updates get the right oversight before reaching customers."
Description

Integrates Promise Guard with the existing Risk Scoring Gate and Dual‑Approver Flow to ensure that high-risk or low-confidence updates receive human scrutiny before publication. Consumes a normalized risk score and combines it with confidence evaluations to determine when two approvals are required, blocking publication until both approvals are captured from authorized roles. Surfaces suggested copy and rationale to approvers, supports time-bound approvals with expirations, provides escalation to on-call approvers after SLA breach, and logs all actions for auditability. Supports override with required justification and ensures the final, published message reflects the approved, policy-compliant content across all channels.

Acceptance Criteria
ETA Suggestion and Reason Codes
"As a dispatcher, I want the system to suggest safer ETA windows with clear reasons when confidence is low so that I can publish informative updates quickly without risking inaccurate promises."
Description

Generates safer ETA windows and explanatory reason codes when confidence is insufficient for precise promises. Uses historical MTTR distributions by asset class and region, real-time signals such as crew dispatch status and weather, and incident severity to propose percentile‑based windows (e.g., P60–P85). Produces concise, channel-optimized copy and human-readable rationales (e.g., "Limited field reports; estimate may widen") for operator review with one‑click apply. Allows controlled adjustments (widen/narrow within policy bounds) and previews impact across channels and locales. Falls back to "under investigation" with a clear reason when data is too sparse or contradictory.
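The percentile-based window above might be computed like this sketch over historical restoration-time samples; the nearest-rank method and the 10-sample minimum are illustrative assumptions:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over durations in minutes."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def suggested_window(mttr_samples, low_p=60, high_p=85):
    """Return a (P60, P85) window, or None when data is too sparse."""
    if len(mttr_samples) < 10:
        return None  # fall back to "under investigation"
    return percentile(mttr_samples, low_p), percentile(mttr_samples, high_p)
```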

Acceptance Criteria
Audit & Analytics Dashboard
"As a operations director, I want visibility into how Promise Guard changes our messaging and outcomes so that I can tune policies and demonstrate reduced risk and call volume."
Description

Captures and visualizes Promise Guard activity and outcomes, including counts of blocked and transformed messages, approval rates and times, override frequency, and downstream accuracy deltas between promised vs. actual restoration times. Correlates guardrail interventions with reductions in inbound calls and misinformation complaints to quantify impact. Provides cohort breakdowns by region, incident type, severity, and channel, real-time widgets for live incidents, weekly digest emails, CSV export, and a read-only API. Ensures PII-safe logging and configurable retention policies, with access controlled by roles and least-privilege principles.

Acceptance Criteria

Calibration Lab

Replay past incidents to compare predicted vs actual restoration times, tune model weights by feeder/region, and track lift in accuracy over time. Creates a defensible calibration record for regulators and lets ops iterate without risking live traffic.

Requirements

Incident Replay Timeline
"As an operations analyst, I want to replay past incidents with synchronized maps and event streams so that I can understand how predictions evolved and where they deviated from reality."
Description

Reconstruct and replay past outage incidents on a time-synced timeline that shows incoming reports (SMS, web, IVR), AI cluster formation, predicted ETAs, and actual restoration events. Allows filtering by date range, feeder, region, weather, and severity. Synchronizes map and event stream, with speed controls for quick scans or frame-by-frame analysis. Integrates with OutageKit’s incident store and telemetry, using the same geospatial layers as Live Map to ensure apples-to-apples comparisons. Outcome: a safe, offline environment to observe model behavior end-to-end without affecting live notifications.

Acceptance Criteria
Prediction vs Actual Metrics Dashboard
"As an operations manager, I want a dashboard comparing predicted and actual restoration times with error metrics so that I can quantify bias and accuracy by feeder and region."
Description

Interactive dashboard that overlays predicted restoration times against actual restoration for the selected replay scope, computing error metrics (MAE, RMSE, MAPE, P50/P90 error) and bias by time bucket. Supports slicing by feeder, region, asset class, cause code, and weather. Displays distribution charts, calibration curves, and confusion views for categorical statuses. Exposes downloadable CSV and API for metric export. Integrates into the Calibration Lab session so analysts can pin snapshots to a calibration record.
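The dashboard's core error metrics over predicted vs. actual restoration times can be sketched as follows (durations in minutes; sample values in the usage below are illustrative):

```python
import math

def error_metrics(predicted, actual):
    """Compute MAE, RMSE, MAPE (%), and signed bias over paired series."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n
    bias = sum(errors) / n  # positive: predictions ran later than actual
    return {"mae": mae, "rmse": rmse, "mape": mape, "bias": bias}
```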

Acceptance Criteria
Regional Weight Tuning & Versioning
"As a data scientist, I want to tune model weights by feeder and region with versioning and what-if simulations so that I can improve ETA accuracy without risking live traffic."
Description

Provide controls to adjust model weights, feature importances, and rule overrides at feeder/region granularity, with guardrails and constraints. Includes what-if simulation to preview changes on the replay set before saving. Every change creates a versioned configuration with metadata, diff view, owner, and rationale. Compatible with both statistical models and ML pipelines via adapter layer. Writes configurations to a central registry referenced by staging and production environments.

Acceptance Criteria
Batch Backtesting & Lift Tracking
"As a modeling lead, I want to run batch backtests and track lift over time so that I can validate improvements before promoting changes."
Description

Offline pipeline to run backtests over a selectable historical window and cohort definitions, executing k-fold or time-based cross-validation. Produces lift metrics versus baseline and prior versions, including accuracy at ETA thresholds, on-time rate, coverage, and time-to-first-ETA. Supports parallelization and queuing to handle large archives. Results roll up into trend lines to track improvement over time and are attached to the calibration record for auditability.

Acceptance Criteria
Audit Trail & Regulator Export
"As a compliance officer, I want an immutable calibration record with exportable evidence so that I can demonstrate a defensible process to regulators."
Description

Immutable calibration record that captures datasets used, model version, parameter changes, approvals, metrics, and replay evidence. Provides tamper-evident timestamps and user identity. One-click export to regulator-ready PDF and CSV bundles with methodology notes and metric definitions. Access controlled via roles and redaction rules for PII. Integrates with OutageKit’s logging and SSO to meet compliance requirements.

Acceptance Criteria
Safe Promotion & Rollback Guardrails
"As a release manager, I want controlled promotion with guardrails and automatic rollback so that changes improve outcomes without causing customer impact."
Description

Workflow to promote a calibrated configuration from lab to staging and then production, gated by metric thresholds, required approvals, and automatic canarying by region. Monitors live performance post-promotion and triggers auto-rollback if drift or error thresholds are exceeded. Includes blast-radius limits and freeze windows to avoid peak-event changes. Provides clear status indicators and notifications to stakeholders.

Acceptance Criteria

Confidence Webhooks

Real‑time API and webhook stream exposing score, band, drivers, and recommended phrasing to external systems (IVR, website, municipal portals). Ensures every touchpoint shares the same confidence signal and wording, cutting contradictory messages.

Requirements

Real-time Webhook Delivery Engine
"As a partner developer, I want to receive confidence updates within seconds via webhooks so that my systems always display the most current and consistent status."
Description

Implements a high-throughput, low-latency dispatcher that pushes confidence events (score, band, drivers, recommended phrasing) to registered external endpoints in near real time. Supports configurable retry with exponential backoff and jitter, timeouts, and circuit breaking to protect the platform and partner systems. Ensures at-least-once delivery semantics within a target p95 end-to-end latency of ≤3 seconds and provides per-tenant throttling to prevent noisy neighbors. Integrates natively with OutageKit’s incident pipeline so updates are emitted immediately when confidence changes or recommended phrasing is refreshed.
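The retry schedule described above (exponential backoff with jitter, capped) can be sketched as below, using the "full jitter" strategy; the base delay and cap values are illustrative:

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, seed=None):
    """Return one randomized delay (seconds) per retry attempt.

    Delay n is drawn uniformly from [0, min(cap, base * 2**n)].
    """
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * (2 ** n)))
            for n in range(attempts)]
```

Jitter spreads retries from many failing endpoints over time instead of synchronizing them, which is what protects both the platform and partner systems.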

Acceptance Criteria
Versioned Confidence Schema & Pull API
"As a product integrator, I want a stable, versioned schema and pull API so that I can parse and display confidence data reliably across updates."
Description

Defines a stable, versioned schema for confidence data, including fields for incident_id, tenant_id, score (0–1, a normalized form of the console's 0–100 score), confidence_band, top drivers with weights, recommended_phrasing keyed by channel and locale, affected services/areas, model_version, event_type, and timestamps. Provides a REST pull API (e.g., GET /v1/confidence/{incident_id}?channel=sms&locale=en-US) with ETag/If-None-Match support for efficient polling and backward-compatible evolution via semantic versioning. Ensures external systems can parse and render a consistent confidence signal and text even as the underlying models and fields evolve.

Acceptance Criteria
Endpoint Registration & Event Filtering Console
"As a utility operations admin, I want to register endpoints and filter which events they receive so that each touchpoint only gets relevant, consistent messages."
Description

Delivers an admin UI and API for tenants to register webhook endpoints, manage secrets, and configure filters by event type, severity, geography, service category, channel, and locale. Includes test delivery, sample payload preview, and health status indicators for each endpoint. Enables per-endpoint delivery policies (max concurrency, retry profile) and message shaping to ensure each touchpoint receives only relevant confidence updates and recommended phrasing.

Acceptance Criteria
Secure AuthN/Z & Payload Signing
"As a security officer, I want signed, authenticated deliveries with least-privilege access so that external integrations meet our security and compliance standards."
Description

Enforces OAuth 2.0 client credentials for the pull API, HMAC SHA-256 signatures for webhook payload integrity, and per-endpoint secret rotation with automatic grace periods. Adds IP allowlisting, rate limiting, and least-privilege access scopes at tenant and endpoint levels. Stores secrets using hardware-backed encryption and audits all administrative actions, ensuring Confidence Webhooks meet enterprise security and compliance expectations.
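The HMAC SHA-256 payload signing above reduces to a short sketch using Python's standard library; the payload shown in the test is illustrative, and constant-time comparison prevents timing attacks on verification:

```python
import hashlib
import hmac

def sign(payload: bytes, secret: bytes) -> str:
    """Produce a hex HMAC-SHA256 signature for a webhook payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, secret: bytes, signature: str) -> bool:
    """Constant-time check of a received signature."""
    return hmac.compare_digest(sign(payload, secret), signature)
```

During secret rotation, receivers would accept signatures from either the old or new secret for the grace period the requirement describes.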

Acceptance Criteria
Localization & Channel-Optimized Phrasing
"As a communications manager, I want localized, channel-optimized phrasing so that IVR, SMS, and web present clear, consistent messages to all customers."
Description

Generates and delivers recommended phrasing tailored to channel constraints and audience expectations, including SMS character limits, IVR SSML/speakable formatting, and web long-form variants. Supports multiple locales with fallback rules and tenant-specific terminology, plus safeguards for tone, clarity, and removal of PII. Ensures every touchpoint uses consistent, understandable wording aligned with the confidence signal.
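The locale fallback and SMS fitting could look like the sketch below. The phrasing dictionary shape and the fallback order (exact locale, then language only, then tenant default) are assumptions for illustration.

```python
# Hypothetical phrasing structure: channel -> locale -> text.
def pick_phrasing(phrasings: dict, channel: str, locale: str,
                  default_locale: str = "en-US") -> str:
    """Fall back from exact locale to language-only to the tenant default."""
    by_locale = phrasings.get(channel, {})
    language_only = locale.split("-")[0]
    for key in (locale, language_only, default_locale):
        if key in by_locale:
            return by_locale[key]
    raise LookupError(f"no phrasing configured for channel={channel}")

def fit_sms(text: str, limit: int = 160) -> str:
    """Trim to a single SMS segment, leaving room for a trailing ellipsis."""
    return text if len(text) <= limit else text[: limit - 3] + "..."
```

IVR variants would go through an SSML formatter instead of `fit_sms`, but the same locale fallback applies.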

Acceptance Criteria
Ordering, Idempotency & De-duplication
"As a platform engineer, I want idempotent, ordered events so that downstream systems can process updates without duplicates or state corruption."
Description

Provides per-incident sequencing, idempotency keys, and de-duplication windows to guarantee ordered processing and safe replays for downstream systems. Includes event timestamps and sequence numbers to handle out-of-order arrivals gracefully, plus guidance for consumers on idempotent handling. Reduces contradictory displays by ensuring each system applies the latest confidence update exactly once.
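The consumer guidance above amounts to: track the highest sequence number seen per incident and drop anything at or below it. A minimal sketch, with the event field names assumed for illustration:

```python
# Idempotent consumer sketch: duplicates (replays) and stale out-of-order
# arrivals are safe no-ops, so each confidence update applies exactly once.
def apply_update(state: dict, event: dict) -> bool:
    """Return True only when the event actually advanced local state."""
    incident = event["incident_id"]
    last_seq = state.get(incident, {}).get("seq", -1)
    if event["seq"] <= last_seq:
        return False                  # replayed or stale event: ignore
    state[incident] = {"seq": event["seq"], "score": event["score"]}
    return True
```

Persisting `state` transactionally alongside any downstream side effects is what makes replays from the provider genuinely safe.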

Acceptance Criteria
Observability, Metrics & Replay
"As a support engineer, I want delivery metrics, logs, and replay so that I can diagnose failures and re-deliver missed events quickly."
Description

Captures delivery logs with request/response metadata, correlation IDs, and outcome codes; exposes dashboards for latency (p50/p95/p99), success rates, and endpoint health; and emits alerts on failure spikes and SLA breaches. Includes a dead-letter queue and self-service replay for selected events or time ranges, with guardrails to prevent consumer overload. Enables rapid diagnosis and recovery from missed or delayed confidence updates.

Acceptance Criteria

Critical Watch

Threshold-based alerts when confidence drops near critical facilities or VIP customers. Triggers NOC pings, suggests nearest-crew boosts, and prompts Comms to adjust cadence—keeping high-stakes stakeholders informed before frustration spikes.

Requirements

Proximity Confidence Thresholds
"As an operations manager, I want the system to trigger alerts when outage confidence drops near critical facilities or VIP accounts so that I can intervene before service degradation escalates."
Description

Implements a real-time rules engine that continuously evaluates outage incident confidence scores from the AI clustering model against tiered thresholds within defined distances of critical facilities and VIP accounts. Supports polygon and point geofences, variable radius by tier, time-windowed trend detection, and hysteresis to prevent flapping. Enriches alerts with impacted cluster ID, confidence trajectory, estimated affected meters/customers, and map links. Configurable per utility/ISP with versioned rule sets and safe defaults. Exposes APIs and an admin UI for threshold configuration. Publishes events to the notification bus for downstream routing and recommendations.
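The hysteresis behavior that prevents flapping can be sketched as a two-threshold state machine. The threshold values here are illustrative; in the product they would come from the tenant's versioned rule set.

```python
# Hysteresis sketch: enter the alerting state when confidence falls below
# a trigger threshold, and leave it only after recovering above a higher
# clear threshold, so scores oscillating inside the band don't flap.
def next_state(alerting: bool, score: float,
               trigger: float = 0.50, clear: float = 0.65) -> bool:
    if not alerting:
        return score < trigger      # fire only on a genuine drop
    return score <= clear           # stay latched until clearly recovered
```

Time-windowed trend detection would layer on top of this, e.g. requiring the score to stay below `trigger` for N consecutive evaluations before firing.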

Acceptance Criteria
VIP & Facility Registry Sync
"As a data steward, I want Critical Watch to continuously sync an accurate list of VIPs and critical facilities with geo-coordinates and tiers so that alerts target the right stakeholders."
Description

Provides a secure, automated registry of VIP customers and critical facilities with geospatial attributes and alert tiers. Ingests from CRM/OMS/AMI/asset systems via SFTP CSV, REST webhooks, and scheduled pulls, with deduplication, validation, and conflict resolution. Stores contact routes, on-call contacts, and facility polygons/coordinates with RBAC. Supports change history, soft deletes, and data quality alerts. Keeps the map and rules engine in sync so proximity thresholds evaluate against the latest entities.

Acceptance Criteria
NOC Escalation & Pings
"As an on-call NOC engineer, I want Critical Watch to page the right rotation with concise, actionable details when thresholds are crossed so that I can triage quickly and meet SLAs."
Description

Routes threshold crossings to the correct NOC/on-call rotation with actionable context. Integrates with PagerDuty, Opsgenie, Slack, Microsoft Teams, email, and SMS; supports acknowledgment/resolve loops, escalation policies, and per-tier SLAs. Includes rate limiting and de-duplication by incident and entity. Ensures delivery with retries and fallbacks, and records acknowledgments for audit and analytics.

Acceptance Criteria
Nearest-Crew Boost Recommendations
"As a field supervisor, I want the system to recommend the nearest qualified crew to reinforce response near critical sites so that downtime and stakeholder impact are minimized."
Description

Generates data-driven recommendations to temporarily boost the nearest qualified crew to a threatened critical site or VIP area. Consumes live crew locations, availability, skills, truck inventory, and current work orders from WFM/AVL APIs. Estimates travel time and impact reduction, ranks options, and presents one-click dispatch suggestions in the OutageKit console. Supports handoff to existing dispatch systems and records accepted/declined decisions for learning.
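The ranking step alone might be sketched as below. Real travel times, skills, and truck inventory would come from the WFM/AVL APIs; straight-line distance here is a stand-in assumption for the travel-time estimate, and the crew record shape is hypothetical.

```python
import math

def rank_crews(crews: list, site: dict, required_skill: str) -> list:
    """Eligible = available and qualified for the work; nearest first."""
    def distance(crew):
        return math.hypot(crew["lat"] - site["lat"], crew["lon"] - site["lon"])
    eligible = [c for c in crews
                if c["available"] and required_skill in c["skills"]]
    return sorted(eligible, key=distance)
```

The console would then present the top-ranked options as one-click dispatch suggestions and log which were accepted or declined.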

Acceptance Criteria
Adaptive Comms Cadence Prompts
"As a communications lead, I want smart prompts and prefilled updates for VIPs and critical facilities when risk rises so that expectations are managed and complaints are reduced."
Description

Monitors high-stakes incidents and prompts the communications team with adaptive update cadence guidance and prefilled plain-language messages targeted to VIPs and critical facilities. Aligns with OutageKit’s broadcast channels (SMS, email, voice) without over-notifying the general population. Provides suggested time-to-next-update, audience segmentation, and message templates with placeholders for ETAs and cause. Tracks sent updates and suppresses redundant messages.

Acceptance Criteria
Smart Suppression & Cooldowns
"As a NOC manager, I want suppression and cooldown controls that reduce duplicate or noisy alerts while preserving critical signals so that the team stays focused on what matters."
Description

Reduces alert fatigue by applying debounce windows, per-entity cooldowns, hysteresis bands, and batch grouping across multiple threshold crossings from the same incident. Supports manual snooze with reason codes, emergency bypass for severe events, and clear explanations of why an alert was suppressed. Provides per-tier maximum alert frequencies and integrates with routing to avoid duplicate pings across channels.
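A minimal sketch of the per-entity cooldown with emergency bypass; the window length and the rule that severe events always pass are illustrative assumptions.

```python
# Per-entity cooldown sketch: suppress repeat alerts for the same entity
# inside the cooldown window, unless the event is severe enough to bypass.
def should_alert(last_sent: dict, entity_id: str, now: float,
                 cooldown_s: float, severe: bool = False) -> bool:
    previous = last_sent.get(entity_id)
    if not severe and previous is not None and now - previous < cooldown_s:
        return False                # suppressed: still inside the window
    last_sent[entity_id] = now
    return True
```

A production version would also record *why* an alert was suppressed (window, snooze, tier cap) so the explanation can be surfaced to operators.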

Acceptance Criteria
Alert Audit and SLA Dashboard
"As a product owner, I want a dashboard and exportable audit trail of Critical Watch activity so that I can tune thresholds and demonstrate SLA compliance."
Description

Captures an immutable audit trail of threshold evaluations, alerts, acknowledgments, suppressions, and communications, with rule versioning and configuration snapshots. Presents dashboards for time-to-alert, time-to-acknowledge, false-positive rate, alert volume by tier, and crew recommendation acceptance. Exposes exports and APIs for compliance and continuous tuning, with retention policies and privacy controls.

Acceptance Criteria

Product Ideas

Innovative concepts that could enhance this product's value proposition.

Two-Key Broadcast Guard

Enforce two-person approval for mass updates and ETR changes with scoped roles and time-limited overrides. Prevents fat-finger blasts and satisfies audit requirements.

Storm SSO Lifeline

Provide emergency token login when SSO fails, tied to hardware keys and IP allowlists. Keeps OutageKit reachable during storms without weakening security.

Credit Pulse

Auto-calc bill credits from outage duration and service tier; export approved batches to billing nightly. Cuts manual spreadsheets and speeds make-goods.

Rumor Radar

Scan local social feeds and 311 logs for outage rumors, geo-cluster the mentions, and flag mismatches with the live map. Suggest targeted rebuttal texts.

Partial Restore Heatmap

Detect partial restores using report deltas and AMI pings; animate block-by-block re-energization. Helps coordinators retask crews faster.

ETA Confidence Gauge

Show ETA confidence based on telemetry, crew proximity, and history; color-code messages and dashboards. Reduces overpromising and angry callbacks.
