Upstream Backends¶
Geodesia G-1 is model-agnostic. It can sit in front of any LLM backend that speaks the OpenAI or Ollama API. The upstream backend is configured once (via the UI's Settings → Service Connection page or via the API) and the gateway automatically adapts its detection strategy to what the backend can provide.
Supported Backend Types¶
| Type key | Description | Logprobs | Closed-book axis |
|---|---|---|---|
vllm | vLLM serving engine | ✅ Full support | ✅ 5 axes |
sglang | SGLang serving framework | ✅ Full support | ✅ 5 axes |
trtllm / tensorrt-llm | NVIDIA TensorRT-LLM | ✅ Full support | ✅ 5 axes |
openai | OpenAI API or any OpenAI-compatible endpoint | ✅ When logprobs=true | ✅ 5 axes |
ollama | Ollama (local models) | ❌ Not natively | ⚠️ 4 axes (5 with sidecar) |
internal | vLLM managed and lifecycle-controlled by the gateway itself | ✅ Full support | ✅ 5 axes |
4 vs 5 axes
The closed-book fabrication axis requires per-token log-probability values from the upstream. Log-probabilities are a measure of how "certain" the model is about each word it generates. When they are unavailable, this axis is automatically disabled and the gateway operates with 4 axes — no configuration change is needed. The /health endpoint tells you which mode is active.
Configuring the Backend¶
Via the UI¶
Navigate to Settings → Service Connection. You will find:
- Service type — a dropdown to select
vLLM,SGLang,TRT-LLM,OpenAI,Ollama, orInternal. - URL — the base URL of the upstream (
http://host:port). - API key — required for OpenAI and hosted services; leave empty for local deployments.
- Test connection — sends a probe request to the upstream and reports reachability, latency, available models, logprob support, and a sample reply.
- Model — populated automatically from the upstream's model list after a successful test.
- Calibrate closed-book now — runs the closed-book calibration for this model. Streams progress inline. Only needed when switching to a new model.
- Save — persists the configuration to
GW_CONFIG_FILEso it survives restarts.
The "Exposed API (OpenAI-compatible)" section shows the gateway's own base URL (http://host:port/v1). Point your downstream application here.
Via the API¶
# Update gateway configuration
curl -s -X POST http://localhost:8800/v1/glad/gateway/config \
-H "Content-Type: application/json" \
-d '{
"upstream_type": "vllm",
"upstream_base_url": "http://localhost:8000",
"upstream_model": "my-model",
"upstream_api_key": ""
}'
Any field in GatewayConfig can be updated. Changes take effect immediately on the next request.
Backend-Specific Notes¶
vLLM¶
vLLM is the recommended backend for production deployments on NVIDIA GPUs. It returns per-token log-probabilities natively, enabling all 5 detection axes.
# Start vLLM (example)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--port 8000
# Configure gateway
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{"upstream_type":"vllm","upstream_base_url":"http://localhost:8000","upstream_model":"meta-llama/Llama-3-8B-Instruct"}'
SGLang¶
SGLang is fully OpenAI-compatible and supports log-probabilities. Use type sglang:
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{"upstream_type":"sglang","upstream_base_url":"http://localhost:30000","upstream_model":"meta-llama/Llama-3-8B-Instruct"}'
TensorRT-LLM¶
TensorRT-LLM deployments typically front-end with an OpenAI-compatible server. Use type trtllm:
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{"upstream_type":"trtllm","upstream_base_url":"http://localhost:8000","upstream_model":"llama-3-8b"}'
OpenAI API¶
To use the actual OpenAI API (or any hosted OpenAI-compatible service such as Azure OpenAI, Together AI, Groq, Mistral AI):
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{
"upstream_type": "openai",
"upstream_base_url": "https://api.openai.com",
"upstream_api_key": "sk-...",
"upstream_model": "gpt-4o"
}'
Log-probability access
Log-probabilities are available from the OpenAI API when you set logprobs: true in the request. The gateway does this automatically. Some models or pricing tiers may not support them — the gateway detects this on the first request and falls back to 4-axis mode.
Ollama¶
Ollama is a popular tool for running open-source models locally. It does not natively expose per-token log-probabilities in its chat API. The gateway therefore operates in 4-axis mode by default.
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{
"upstream_type": "ollama",
"upstream_base_url": "http://localhost:11434",
"upstream_model": "llama3.2"
}'
Enabling the 5th axis with a logprob sidecar¶
To enable closed-book fabrication detection with Ollama, you need a logprob sidecar: a second instance of llama.cpp or another OpenAI-compatible server serving the same model with log-probability support. The gateway deterministically re-derives the answer through the sidecar to recover the needed signals.
# Configure sidecar (same model, different server, WITH logprobs)
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{
"upstream_type": "ollama",
"upstream_base_url": "http://localhost:11434",
"upstream_model": "llama3.2",
"ollama_logprob_sidecar_url": "http://localhost:8080",
"ollama_logprob_sidecar_model": "llama3.2"
}'
| Field | Description |
|---|---|
ollama_logprob_sidecar_url | Base URL of the OpenAI-compatible server that serves the same model with log-probabilities (e.g., llama.cpp server). |
ollama_logprob_sidecar_model | Model name as known to the sidecar server. Defaults to upstream_model if empty. |
Internal (self-managed vLLM)¶
When upstream_type is set to internal, the gateway launches and manages its own vLLM subprocess. It starts vLLM when selected and frees GPU memory when you switch to an external backend. This is useful for single-GPU deployments where you want the gateway to own the entire lifecycle.
curl -X POST http://localhost:8800/v1/glad/gateway/config \
-d '{
"upstream_type": "internal",
"internal_vllm_cmd": "python -m vllm.entrypoints.openai.api_server --model my-model --port 8000",
"internal_vllm_url": "http://localhost:8000",
"upstream_model": "my-model"
}'
| Field | Description |
|---|---|
internal_vllm_cmd | Full shell command to start vLLM. The gateway runs it as a subprocess. |
internal_vllm_url | Base URL where the internal vLLM is accessible after startup. |
Testing a Connection¶
The /upstream/test endpoint performs a full connection probe and returns a diagnostic summary. It is called automatically when you click Test connection in the UI.
curl -s -X POST http://localhost:8800/upstream/test \
-H "Content-Type: application/json" \
-d '{
"url": "http://localhost:8000",
"type": "vllm",
"api_key": "",
"model": "my-model"
}'
Response:
{
"reachable": true,
"models": ["my-model"],
"has_logprobs": true,
"sample_reply": "OK",
"latency_ms": 117,
"error": null,
"type": "vllm",
"model": "my-model",
"closed_book_available": true,
"axes": 5
}
| Field | Description |
|---|---|
reachable | true if the server responded with an HTTP status below 500 |
models | List of model IDs available on the upstream |
has_logprobs | true if the upstream returns per-token log-probabilities |
sample_reply | The model's reply to "Reply with the single word OK" (up to 120 characters) |
latency_ms | Round-trip latency in milliseconds |
error | Error message if the connection failed, otherwise null |
closed_book_available | Same as has_logprobs — whether the 5th detection axis is available |
axes | 5 with logprobs, 4 without |
Closed-Book Calibration¶
The closed-book fabrication detector requires calibration to the specific vocabulary and generation style of the upstream model. Calibration runs a set of known-truthful and known-fabricated queries through the model, fits a regression over the log-probability signals, and saves a calibrated checkpoint.
Calibration is automatic on first boot in the Docker customer profile. You can also trigger it manually:
After calibration, the gateway reloads the new checkpoint on the next request. No restart is required.
When to recalibrate
Recalibrate whenever you switch to a different base model (e.g., from Llama to Qwen). The same checkpoint can be shared across different GGUF quantizations of the same model because the underlying log-probability distribution is stable.