Explainability API¶

Geodesia G-1 provides two explainability interfaces: the inline explain flag on the evaluate endpoint (for LLM-internal attribution), and the causal explainability endpoint on the gateway (for black-box attribution). This page documents both.

Inline Explain (Evaluate Endpoint)¶

When you call POST /glad/evaluate with explain: true, attribution scores are computed alongside the regular detection scores and returned in the xai field of the response.

Parameters¶

Parameter	Description
`explain`	Set to `true` to enable XAI computation
`explain_mode`	`"standard"` (default) or `"causal"`
`credit_tiers`	Which attribution methods to run (see below)
`system_prompt_text`	If provided, tokens belonging to the system prompt are excluded from attribution

Credit Tiers¶

The credit_tiers array specifies which attribution methods to run. Methods can be combined:

Tier	Key	Speed	Description
Tier 1	`"gradient"`	Fast (~50ms)	Deterministic prompt-token occlusion: each prompt token is masked one at a time and the change in detection score is the importance. Deterministic and reproducible.
Tier 1.5	`"pss"`, `"tier1_5"`, `"stability"`	Slow (~N× generation)	Positional Semantic Stability: generates N alternative answers and measures how much each prompt token affects whether specific output claims appear. Training-free. Controlled by `pss_n_samples`, `pss_temperature`, `pss_match_mode`.
Tier 2	`"mupax"`	Medium (~0.4–2s)	Monte Carlo Perturbation Attribution: statistically robust attribution via kernel SHAP. The production default.
Tier 3	`"learned"`	Fast (~10ms)	Learned attribution head (if the checkpoint includes one). Fastest, but accuracy depends on training data coverage.

Example — run MuPAX and gradient together:

{
  "model_path": "/app/pretrained_glad",
  "prompt": "When was the Eiffel Tower built?",
  "explain": true,
  "credit_tiers": ["mupax", "gradient"]
}

`explain_mode: "causal"`¶

In causal mode, the system additionally computes a token→token causal matrix: for the answer token with the highest attribution score, it identifies which prompt tokens are causally responsible for its generation.

This answers: "Not just what words were important — but specifically which prompt words caused the model to write the most suspicious part of the answer."

XAI Response Structure¶

"xai": {
  "mupax_halluc": {
    "detection_type": "hallucination",
    "top_tokens": [
      {
        "token": "Paris",
        "position": 7,
        "importance": 0.48,
        "retention_frequency": 0.71,
        "conditional_goodness": 0.88
      }
    ],
    "threshold_W": 0.14,
    "threshold_percentile_used": 0.2,
    "n_accepted": 412,
    "n_total": 500,
    "attribution_heatmap": [0.02, 0.01, 0.48, 0.12, ...],
    "score_function": "combined_logreg"
  },
  "mupax_safety": {
    "detection_type": "safety",
    ...
  },
  "mupax_halluc_causal": {
    "detection_type": "hallucination_causal",
    "target_token": "1889",
    "target_position": 14,
    "prompt_tokens": [...],
    "answer_tokens": [...],
    "causal_edges": [
      {
        "source_position": 4,
        "source_token": "built",
        "target_position": 14,
        "target_token": "1889",
        "raw_importance": 0.61,
        "normalized_importance": 0.83,
        "absolute_importance": 0.83
      }
    ]
  }
}

Per-Token Attribution Fields¶

Field	Description
`token`	The token text as decoded from the vocabulary
`position`	Token position in the full input sequence
`importance`	χ attribution value. Higher means this token contributed more to the detection score.
`retention_frequency`	Proportion of Monte Carlo samples in which this token appeared in configurations with above-threshold scores
`conditional_goodness`	Mean detection score when this token was present

Causal Edge Fields¶

Field	Description
`source_position`	Position of the prompt token
`source_token`	The prompt token text
`target_position`	Position of the answer token
`target_token`	The answer token text
`raw_importance`	Signed χ importance (positive = causal contribution)
`normalized_importance`	Signed χ normalized by the maximum absolute χ in the graph
`absolute_importance`	Absolute normalized importance [0, 1]

Causal XAI via Gateway¶

For the companion gateway deployment (where Geodesia runs against an external LLM without access to model internals), black-box attribution is available at:

POST/v1/glad/causal-explainability/analyze

See Causal XAI for full documentation.

PSS: Positional Semantic Stability¶

Positional Semantic Stability (PSS, Tier 1.5) is a training-free attribution method that asks: "If I change the prompt in this position, does the key claim in the answer change?"

Unlike gradient-based methods, PSS does not need access to model gradients. It works by generating N alternative answers with small prompt variations and measuring stability.

PSS Configuration Parameters¶

Parameter	Env override	Default	Description
`pss_n_samples`	`GLAD_PSS_N_SAMPLES`	`5`	Number of additional samples to generate. Each sample costs one generation pass. 2 = fast/noisy; 16 = slow/robust.
`pss_temperature`	`GLAD_PSS_TEMPERATURE`	`0.7`	Sampling temperature for PSS resamples. Must be > 0 (temperature 0 would make all samples identical).
`pss_match_mode`	`GLAD_PSS_MATCH_MODE`	`"ngram"`	How to compare claims across samples. Options: `"ngram"` (n-gram containment + entity match), `"strict"` (exact surface), `"fuzzy"` (Levenshtein), `"entity"` (named entities only), `"claim"` (sentence-level bidirectional).

When to use PSS¶

You are explaining hallucination in long-form answers where gradient methods are noisy
You need attribution that works without any model weights (fully black-box)
You are building a human review workflow and need the explanation to be relatable ("this claim changed when we removed that specific context sentence")