Decision Analytics: Measuring Leadership Under Pressure
How After Action quantifies participant decision quality in crisis exercises
After Action | Version 1.0 | April 2026
Executive Summary
Crisis response isn't won by having the right playbook — it's won by the team executing under pressure. Two organizations with identical plans will produce dramatically different outcomes during an incident because of how their people think, coordinate, and commit to action when things are unclear and time is short.
Measuring this is hard. Traditional cyber assessments score controls and technology, not human behavior. Survey-based maturity assessments score policy documents. None of them capture whether the people actually responsible for responding can do it.
The After Action Decision Analytics Engine solves this. During every facilitated exercise, participants submit structured decisions (text + rationale + confidence). The engine analyzes those decisions across five dimensions — specificity, leadership, consistency, fatigue, and coordination — to produce behavioral profiles of individual participants and the team as a whole.
This whitepaper documents the methodology.
1. What Gets Captured
During a live exercise, each inject triggers a decision prompt. Participants type their response and provide:
- Decision text — free-form, what they would do
- Rationale — why (optional)
- Confidence — 1–5 scale, how certain they are
- Category — technical response / strategic decision / communication / escalation / recovery
- Impact areas — which business functions their decision affects
This data is richer than a questionnaire and messier than a log. It looks like real crisis decision-making because it is.
2. The Five Analysis Dimensions
2.1 Specificity Score (0–100)
What it measures: How detailed and action-oriented a decision is. A decision like "I would escalate this" scores low. A decision like "I would immediately notify our CISO and activate the IR team to isolate affected endpoints via EDR, while our comms lead drafts a holding statement for the CEO" scores high.
Algorithm:
specificity_score = 0
+15 if word_count >= 10
+15 if word_count >= 25
+10 if word_count >= 50
+20 if contains any action keyword
+15 if mentions any stakeholder
+10 if contains escalation language
+15 if rationale field is > 20 chars
capped at 100
Action keyword list: notify, escalate, isolate, contain, communicate, activate, deploy, implement, initiate, engage, brief, convene, document, preserve, assess, monitor, disconnect, block, investigate, restore
Stakeholder keyword list: CISO, CEO, board, legal, counsel, HR, media, regulator, customer, vendor, partner, team, management, executive
Escalation keyword list: escalate, priority, urgent, immediate, critical, alert
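The scoring steps above can be sketched in TypeScript. Names are illustrative rather than the shipped `decision-analytics.ts` API, and the keyword lists here are abbreviated versions of the dictionaries listed above:

```typescript
// Abbreviated dictionaries; the shipped lists are larger.
const ACTION = ["notify", "escalate", "isolate", "contain", "activate", "deploy"];
const STAKEHOLDER = ["ciso", "ceo", "board", "legal", "regulator", "vendor"];
const ESCALATION = ["escalate", "priority", "urgent", "immediate", "critical", "alert"];

// True if any keyword appears as a substring of the (lowercased) text.
function hasAny(text: string, keywords: string[]): boolean {
  return keywords.some((k) => text.indexOf(k) >= 0);
}

function specificityScore(text: string, rationale: string = ""): number {
  const lower = text.toLowerCase();
  const wordCount = lower.split(/\s+/).filter((w) => w.length > 0).length;
  let score = 0;
  if (wordCount >= 10) score += 15;
  if (wordCount >= 25) score += 15;
  if (wordCount >= 50) score += 10;
  if (hasAny(lower, ACTION)) score += 20;      // any action keyword
  if (hasAny(lower, STAKEHOLDER)) score += 15; // any stakeholder mention
  if (hasAny(lower, ESCALATION)) score += 10;  // escalation language
  if (rationale.length > 20) score += 15;      // substantive rationale
  return Math.min(score, 100);
}
```

With these abbreviated lists, "I would escalate this" scores 30 (action + escalation keywords only), while the detailed CISO/EDR decision quoted earlier scores 75, matching the low/high contrast described above.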
Why pattern matching, not NLP: The vocabulary of incident response is narrow. A dictionary of 30–50 terms covers 95% of real decisions. Using pattern matching is deterministic, fast, and auditable in a way that transformer-based NLP is not.
2.2 Leadership Score (0–100)
What it measures: How much a decision reflects leadership behavior vs. individual contributor behavior. Leaders coordinate, explain, and escalate. ICs act.
Formula:
leadership_score =
(action_rate × 25) // did they take action, not just observe?
+ (stakeholder_rate × 25) // did they coordinate with others?
+ (escalation_rate × 15) // did they escalate when warranted?
+ (rationale_rate × 15) // did they explain their reasoning?
+ (specificity_score/100 × 20) // how detailed?
Each *_rate is scored 0 or 1 per decision (behavior absent or present), then averaged across all of the participant's decisions, giving a value between 0 and 1.
A participant who writes "do X" on every decision will score high on action_rate but low on stakeholder_rate — they behave like an IC. A participant who writes "I'll notify X, then Y, because Z" on every decision will score high on all dimensions — they behave like a leader.
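A minimal sketch of the formula, assuming each decision has already been flagged by the keyword matchers (interface and function names are illustrative, not the shipped API):

```typescript
interface ScoredDecision {
  hasAction: boolean;      // matched an action keyword
  hasStakeholder: boolean; // mentioned a stakeholder
  hasEscalation: boolean;  // used escalation language
  rationale?: string;
  specificity: number;     // 0-100, from the specificity scorer
}

function leadershipScore(decisions: ScoredDecision[]): number {
  const n = decisions.length;
  // Fraction of decisions where a given behavior is present (0..1).
  const rate = (pick: (d: ScoredDecision) => boolean) =>
    decisions.filter(pick).length / n;
  const avgSpecificity = decisions.reduce((s, d) => s + d.specificity, 0) / n;
  return (
    rate((d) => d.hasAction) * 25 +
    rate((d) => d.hasStakeholder) * 25 +
    rate((d) => d.hasEscalation) * 15 +
    rate((d) => (d.rationale || "").length > 0) * 15 +
    (avgSpecificity / 100) * 20
  );
}
```

An IC-style decision set (action only, low specificity) lands in the 20s–30s; a leader-style set (action, stakeholders, escalation, rationale, high specificity) lands in the 90s.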
2.3 Consistency Score (0–100)
What it measures: How calibrated a participant's confidence is across decisions. Are they consistently confident, consistently uncertain, or wildly variable?
Formula:
stddev = standard deviation of (confidence values)
consistency_score = max(0, 100 - stddev × 20)
A participant with perfectly flat confidence (always 4) scores 100. A participant alternating between 2 and 5 (stddev 1.5) scores 70. A participant swinging between 1 and 5 (stddev 2, the largest spread possible on a 1–5 scale) scores 60, the practical floor.
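The formula in a few lines of TypeScript. The whitepaper does not specify population vs. sample standard deviation; this sketch assumes population stddev:

```typescript
// confidences: the participant's 1-5 ratings, one per decision.
function consistencyScore(confidences: number[]): number {
  const mean = confidences.reduce((a, b) => a + b, 0) / confidences.length;
  const variance =
    confidences.reduce((s, c) => s + Math.pow(c - mean, 2), 0) /
    confidences.length;
  return Math.max(0, 100 - Math.sqrt(variance) * 20);
}
```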
Why consistency matters: In real crisis response, leaders who are "confidently wrong" are worse than leaders who are "uncertainly right." A high-variance confidence profile suggests someone who isn't calibrated to the situation — a red flag.
2.4 Fatigue Indicator
What it measures: Whether response quality degrades over the course of an exercise.
Method:
sort decisions by inject order
compute time-between-decisions for each pair
fatigue_indicator = true IF:
mean(last 3 intervals) > 2 × mean(first 3 intervals)
If later decisions take twice as long as early decisions, the participant is fatiguing. This is the most predictive behavioral signal for real incident response failure — teams that degrade under cognitive load lose containment in real incidents.
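The check above can be sketched as follows. The minimum of six intervals before the rule fires is an assumption added here for safety; the whitepaper does not state one:

```typescript
// timestampsMinutes: decision submission times in inject order.
function fatigueIndicator(timestampsMinutes: number[]): boolean {
  // Time between consecutive decisions.
  const intervals = timestampsMinutes
    .slice(1)
    .map((t, i) => t - timestampsMinutes[i]);
  if (intervals.length < 6) return false; // too little data to judge
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return mean(intervals.slice(-3)) > 2 * mean(intervals.slice(0, 3));
}
```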
2.5 Strength/Weakness Areas by Category
What it measures: Which inject categories a participant performs well in, and which they struggle with.
Method:
for each decision:
bucket by inject_category
track confidence and specificity per bucket
for each category:
avg_confidence = mean of bucket confidence values
avg_specificity = mean of bucket specificity scores
IF avg_confidence >= 4 AND avg_specificity >= 60:
mark as STRENGTH
ELIF avg_confidence < 3 OR avg_specificity < 40:
mark as IMPROVEMENT AREA
A participant might be a strength in technical response (high confidence, detailed decisions) but an improvement area in communications (low confidence, vague decisions). This lets facilitators provide targeted coaching.
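A sketch of the bucketing logic, using the thresholds given above (names are illustrative; a "neutral" bucket is assumed for categories that hit neither rule):

```typescript
type Verdict = "strength" | "improvement" | "neutral";

function categoryVerdicts(
  decisions: { category: string; confidence: number; specificity: number }[]
): { [category: string]: Verdict } {
  // Group confidence and specificity values by inject category.
  const buckets: { [c: string]: { conf: number[]; spec: number[] } } = {};
  for (const d of decisions) {
    const b = (buckets[d.category] = buckets[d.category] || { conf: [], spec: [] });
    b.conf.push(d.confidence);
    b.spec.push(d.specificity);
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const out: { [c: string]: Verdict } = {};
  for (const cat in buckets) {
    const c = mean(buckets[cat].conf);
    const s = mean(buckets[cat].spec);
    out[cat] =
      c >= 4 && s >= 60 ? "strength" : c < 3 || s < 40 ? "improvement" : "neutral";
  }
  return out;
}
```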
3. Team-Level Analysis
Individual profiles are useful. Team-level patterns are where the real insight lives.
3.1 Confidence Trend by Inject Order
Tracks average team confidence across the sequence of injects:
for inject in sorted injects:
avg_confidence[inject.order] = mean of all participant confidences on that inject
A healthy team has a flat or slightly rising curve — they gain confidence as they work through the scenario. A struggling team has a steeply falling curve — they lose composure as pressure mounts.
Graph this trend and show it to the team. It's the single most persuasive visualization in a post-exercise debrief.
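The trend computation is a simple group-and-average; a sketch (field names assumed):

```typescript
// Average team confidence per inject, keyed by inject order.
function confidenceTrend(
  decisions: { injectOrder: number; confidence: number }[]
): { [order: number]: number } {
  const buckets: { [order: number]: number[] } = {};
  for (const d of decisions) {
    (buckets[d.injectOrder] = buckets[d.injectOrder] || []).push(d.confidence);
  }
  const trend: { [order: number]: number } = {};
  for (const order in buckets) {
    const vals = buckets[order];
    trend[order] = vals.reduce((a, b) => a + b, 0) / vals.length;
  }
  return trend;
}
```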
3.2 Category Performance
Per-category team averages:
category_performance[category] = {
decision_count: N,
avg_confidence: mean,
avg_quality: mean_specificity,
gap_correlation: how well the category decisions correlate with identified gaps
}
A team might be strong in technical response categories but weak in communications. This maps directly to the capability areas in the readiness score.
3.3 Team Alignment Score
What it measures: How tightly the team's decisions cluster around each other. High alignment = coherent response. Low alignment = siloed thinking.
Method: For each inject, compute the pairwise similarity of decisions (using shared keyword overlap). Average across all injects.
A team with consistently low alignment may have a communication problem — they're not talking to each other during the exercise, so they arrive at inconsistent conclusions.
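"Shared keyword overlap" can be read several ways; one reasonable sketch is Jaccard overlap of word sets per inject, scaled to 0–100 (the word-length filter and scaling here are assumptions, not the shipped formula):

```typescript
// Unique words of 4+ characters, as a crude keyword proxy.
function uniqueWords(text: string): string[] {
  const words = text.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Average pairwise Jaccard overlap of decisions on one inject, 0-100.
function injectAlignment(decisionTexts: string[]): number {
  const sets = decisionTexts.map(uniqueWords);
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < sets.length; i++) {
    for (let j = i + 1; j < sets.length; j++) {
      const shared = sets[i].filter((w) => sets[j].indexOf(w) >= 0).length;
      const union = sets[i].length + sets[j].length - shared;
      total += union === 0 ? 0 : shared / union;
      pairs++;
    }
  }
  return pairs === 0 ? 0 : (total / pairs) * 100;
}
```

Identical decisions score 100; decisions with no vocabulary in common score 0. The exercise-level alignment score is the mean of this value across injects.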
3.4 Decision Velocity
decision_velocity = {
avg_time_between_decisions_minutes: mean,
fastest_response_minutes: min,
slowest_response_minutes: max,
fatigue_indicator: boolean,
}
Slow decision velocity on the first few injects → team isn't warmed up. Slow velocity on the last few → fatigue. Both are actionable findings for the facilitator's debrief.
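The velocity fields reduce to interval statistics over submission times; a sketch (the fatigue flag would come from the section 2.4 check):

```typescript
// timestampsMinutes: one submission time per decision, in inject order.
function decisionVelocity(timestampsMinutes: number[]) {
  const intervals = timestampsMinutes
    .slice(1)
    .map((t, i) => t - timestampsMinutes[i]);
  const avg = intervals.reduce((a, b) => a + b, 0) / intervals.length;
  return {
    avg_time_between_decisions_minutes: avg,
    fastest_response_minutes: Math.min.apply(null, intervals),
    slowest_response_minutes: Math.max.apply(null, intervals),
  };
}
```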
4. Predictive Insights
The engine uses these metrics to generate predictive insights — rule-based classifications of patterns into four types:
4.1 Strength insights
Examples:
- "Strong leadership signal from Sarah Chen — leadership score 87, consistently high across technical and communications categories"
- "Team maintained confidence throughout the exercise — no fatigue indicator, sustained engagement"
4.2 Risk insights
Examples:
- "Team alignment dropped from 78 to 42 after inject 5 — the team is losing coordination under pressure"
- "Marcus Wright showed high-variance confidence (stddev 1.8) — may need additional playbook training before the next exercise"
4.3 Trend insights
Examples:
- "Average team confidence trended downward from 4.1 to 2.8 across the exercise — classic fatigue pattern"
- "Decision specificity improved in later injects — team found their rhythm"
4.4 Recommendation insights
Examples:
- "Schedule a communications-focused tabletop within 90 days — current category average is 47"
- "Pair participant X with participant Y for next exercise — complementary strengths"
Each insight has a confidence level (0–100) based on how many data points support it.
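One rule from the risk family might look like the following sketch. The 1.5 stddev threshold and the 10-points-per-decision confidence scaling are illustrative assumptions, not the shipped rule set:

```typescript
interface PredictiveInsight {
  type: "strength" | "risk" | "trend" | "recommendation";
  message: string;
  confidence: number; // 0-100, grows with supporting data points
}

// Hypothetical rule: flag high-variance confidence as a risk insight.
function varianceRiskInsight(
  name: string,
  confidences: number[]
): PredictiveInsight | null {
  const mean = confidences.reduce((a, b) => a + b, 0) / confidences.length;
  const stddev = Math.sqrt(
    confidences.reduce((s, c) => s + Math.pow(c - mean, 2), 0) /
      confidences.length
  );
  if (stddev <= 1.5) return null;
  return {
    type: "risk",
    message: name + " showed high-variance confidence (stddev " + stddev.toFixed(1) + ")",
    confidence: Math.min(100, confidences.length * 10), // more decisions, more support
  };
}
```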
5. Integration with Scoring
The decision analytics engine is tightly integrated with the readiness scoring engine:
- Confidence data feeds into the executive_alignment bonus (+5 points if avg confidence >= 70/100)
- Decision speed data feeds into the decision_speed capability area
- Category performance informs the "strength areas" section of the AAR
- Fatigue indicator is a direct input to the coaching insights in the post-exercise debrief
This integration means the analytics aren't a side channel — they're part of the main scoring model.
6. Privacy & Ethics
6.1 What's stored
- Decision text (the participant's own words)
- Rationale (optional)
- Confidence rating
- Category and impact areas (tagged by participant)
- Timestamp
6.2 What's not stored
- Audio recordings
- Facial analysis
- Typing biometrics
- Any data the participant didn't explicitly submit
6.3 Who can see individual decisions
- The facilitator (during and after the exercise)
- The participant themselves (via their post-exercise debrief screen)
- Org administrators (via the AAR)
Individual decisions are not exposed to other participants during the exercise to prevent groupthink. Only after the exercise, in the AAR, do participants see each other's decisions alongside the facilitator's observations.
6.4 Anonymization in exports
Any decision data included in the carrier field intake export is aggregated only — individual decision text is never shared outside the client's own org.
7. Why Not Use an LLM?
LLMs could analyze decision text with more linguistic nuance than pattern matching. Here is why we don't use one:
- Determinism — same input, same score, every time. Critical for audit.
- Speed — 50ms pattern match vs. 3–15s LLM call
- Cost — zero marginal cost per decision, vs. per-token LLM billing
- Privacy — decision text never leaves the platform; nothing sent to a vendor
- Auditability — every keyword in the dictionary is in git history
LLM augmentation is layered as an optional enhancement for AAR prose and coaching narratives, but the quantitative scoring is deterministic.
8. Implementation
The engine is a single 635-line file: src/lib/decision-analytics.ts. Public API:
analyzeDecisionQuality(text, rationale): DecisionQualityMetrics
buildParticipantProfile(id, decisions): ParticipantProfile
computeExerciseAnalytics(exerciseId, decisions): ExerciseAnalytics
generatePredictiveInsights(org, exercises): PredictiveInsight[]
aggregateQualityMetrics(metrics): DecisionQualityMetrics
Every function is pure. No DB, no LLM, no network. Fully tested (100+ unit tests in the repo).
9. Licensing
The engine's pattern dictionaries, scoring weights, and insight generation rules are proprietary trade secret. The source is in src/lib/decision-analytics.ts and licensed via licensing@afteraction.dev.
© 2024-2026 After Action. Decision analytics methodology is proprietary. Contact licensing@afteraction.dev for commercial terms.