Knowledge Base (RAG)

Geodesia G-1 includes a built-in Retrieval-Augmented Generation (RAG) system that lets you upload documents and have the LLM answer questions grounded in those documents. Every retrieved chunk is passed to the faithfulness detection axis, and the claims in the answer are verified citation-by-citation against the source material.

Supported document formats: PDF, Word (.docx), PowerPoint (.pptx), Markdown (.md), HTML, Excel (.xlsx), CSV, plain text (.txt)

How It Works

1. You upload a document → Docling parses it → the text is split into chunks of ~480 tokens with a 64-token overlap
2. Each chunk is embedded with BGE-M3 (multilingual) → stored in LanceDB
3. On a RAG-enabled chat request:
   a. Retrieve top-K chunks most relevant to the user's question (dense retrieval + reranking)
   b. Inject the retrieved context into the upstream LLM's prompt
   c. The LLM answers using the context
   d. Geodesia verifies each claim in the answer against the retrieved chunks
   e. If every claim is verified with a citation → the halluc_context flag is suppressed
   f. If any claim is ungrounded → halluc_context is flagged as usual
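
The chunking in step 1 can be pictured as a sliding window. The sketch below is only an illustration, not Geodesia's actual chunker: `chunk_tokens` is a hypothetical helper, and whitespace-split words stand in for real tokenizer output. With a 480-token window and a 64-token overlap, each chunk starts 416 tokens after the previous one.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 480, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap  # 480 - 64 = 416 new tokens per chunk
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder "tokens"; a real pipeline would use the embedding model's tokenizer.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.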

Collections

Documents are organised into collections. A collection is a named group of documents that shares an embedding index. You can have multiple collections for different topics or customers.

API Reference

All RAG endpoints are mounted at /v1/glad/rag/ on the gateway.

GET /v1/glad/rag/status

Returns whether the RAG service is loaded and ready.

curl http://localhost:8800/v1/glad/rag/status

{"ok": true, "embed_model": "BAAI/bge-m3", "device": "cuda:0", "store_dir": "runs/rag_store"}
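
A client can gate uploads and queries on this endpoint. The following is a minimal Python sketch using only the standard library; `fetch_status` and `is_ready` are hypothetical helper names, and treating anything other than `"ok": true` as not ready is an assumption, since the shape of a not-ready response is not documented here.

```python
import json
import urllib.request


def fetch_status(base_url: str = "http://localhost:8800", timeout: float = 5.0) -> dict:
    """GET /v1/glad/rag/status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/glad/rag/status", timeout=timeout) as resp:
        return json.load(resp)


def is_ready(status: dict) -> bool:
    """Assumption: the service is ready only when `ok` is literally true."""
    return status.get("ok") is True


# The sample response shown above, parsed locally (no network call).
sample = json.loads(
    '{"ok": true, "embed_model": "BAAI/bge-m3", "device": "cuda:0", "store_dir": "runs/rag_store"}'
)
ready = is_ready(sample)
```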

POST /v1/glad/rag/collections

Create a new document collection.

Request:

{"name": "company-policies", "description": "Internal HR and legal policies"}

Field	Type	Required	Description
`name`	`string`	✅	Human-readable name for the collection.
`description`	`string`	—	Optional description.

Response:

{"collection_id": "c_a3f7b2d1", "name": "company-policies", "created_at": "2026-06-10T12:00:00Z"}
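
The same call can be prepared from Python. This sketch builds the request without sending it; `build_create_collection` is a hypothetical helper, and the `Content-Type: application/json` header is an assumption, since the section does not state which headers the gateway requires.

```python
import json
import urllib.request
from typing import Optional


def build_create_collection(
    name: str,
    description: Optional[str] = None,
    base_url: str = "http://localhost:8800",
) -> urllib.request.Request:
    """Prepare a POST /v1/glad/rag/collections request (not sent here)."""
    payload = {"name": name}
    if description is not None:
        payload["description"] = description  # optional field, omitted when unset
    return urllib.request.Request(
        f"{base_url}/v1/glad/rag/collections",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )


req = build_create_collection("company-policies", "Internal HR and legal policies")
# To send it:
#   with urllib.request.urlopen(req) as resp:
#       collection_id = json.load(resp)["collection_id"]
```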

GET /v1/glad/rag/collections

List all collections.

curl http://localhost:8800/v1/glad/rag/collections

Response:

[
  {
    "collection_id": "c_a3f7b2d1",
    "name": "company-policies",
    "description": "Internal HR and legal policies",
    "document_count": 3,
    "chunk_count": 142,
    "created_at": "2026-06-10T12:00:00Z"
  }
]

DELETE /v1/glad/rag/collections/{collection_id}

Delete a collection and all its documents and embeddings.

curl -X DELETE http://localhost:8800/v1/glad/rag/collections/c_a3f7b2d1

POST /v1/glad/rag/collections/{collection_id}/documents

Upload a document to a collection. The document is automatically parsed, chunked, and embedded.

curl -X POST \
  http://localhost:8800/v1/glad/rag/collections/c_a3f7b2d1/documents \
  -F "file=@/path/to/policy.pdf"

Multipart fields:

Field	Type	Required	Description
`file`	file	✅	The document to upload. Supported: PDF, DOCX, PPTX, MD, HTML, XLSX, CSV, TXT. Maximum 100 MB.
`title`	string	—	Optional document title. If omitted, the filename is used.

Response:

{
  "document_id": "doc_b5c2e1a3",
  "title": "policy.pdf",
  "chunk_count": 47,
  "status": "indexed"
}

Parsing notes:

- PDF and DOCX files are parsed with Docling (IBM's multi-format parser), which preserves reading order, headings, and tables better than simple text extraction.
- Large files may take several seconds to process. Embeddings are computed synchronously; the response is returned when indexing is complete.
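
Because indexing is synchronous, it can be worth rejecting unsuitable files on the client before the upload starts. The sketch below is one way to do that, under stated assumptions: `check_upload` and `upload_document` are hypothetical helpers, "100 MB" is read as 100 × 1024 × 1024 bytes, only the `.html` extension is accepted for HTML, and the upload itself uses the third-party `requests` package with a generous timeout chosen arbitrarily.

```python
from pathlib import Path
from typing import Optional

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".md", ".html", ".xlsx", ".csv", ".txt"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumes "100 MB" means 100 MiB


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would likely be rejected by the upload endpoint."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")


def upload_document(
    path: str,
    collection_id: str,
    title: Optional[str] = None,
    base_url: str = "http://localhost:8800",
) -> dict:
    """Validate locally, then POST the file as multipart form data."""
    import requests  # third-party; imported lazily so check_upload works without it

    p = Path(path)
    check_upload(p.name, p.stat().st_size)
    url = f"{base_url}/v1/glad/rag/collections/{collection_id}/documents"
    data = {"title": title} if title else {}
    with p.open("rb") as fh:
        # The call blocks until indexing completes, hence the long timeout.
        resp = requests.post(url, files={"file": (p.name, fh)}, data=data, timeout=300)
    resp.raise_for_status()
    return resp.json()
```

A call such as `upload_document("/path/to/policy.pdf", "c_a3f7b2d1")` would then return the response object shown above once indexing has finished.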

DELETE /v1/glad/rag/collections/{collection_id}/documents/{document_id}

Delete a single document and cascade-remove its embedded chunks.

curl -X DELETE \
  http://localhost:8800/v1/glad/rag/collections/c_a3f7b2d1/documents/doc_b5c2e1a3

POST /v1/glad/rag/collections/{collection_id}/query

Query a collection directly (without a full chat request). Returns the most relevant chunks for a given question.

Request:

{
  "query": "What is the refund window?",
  "top_k": 5,
  "rerank": true
}

Field	Type	Default	Description
`query`	`string`	(required)	The question or search query.
`top_k`	`integer`	`5`	Number of chunks to return (after reranking).
`rerank`	`boolean`	`true`	Whether to apply the cross-encoder reranker (BGE-reranker-v2-m3) after initial retrieval. Reranking improves relevance at the cost of an additional model forward pass.

Response:

{
  "chunks": [
    {
      "text": "Our return policy allows refunds within 30 days of purchase...",
      "score": 0.94,
      "document_id": "doc_b5c2e1a3",
      "document_title": "policy.pdf",
      "page": 3,
      "heading": "Return Policy"
    }
  ]
}
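
A caller will usually turn these chunks into citations or filter out weak matches. The sketch below works on the sample response above; `format_citations` and its `min_score` cut-off are hypothetical, the score scale is not documented here, and the code assumes every chunk carries `page` and `heading` as in the sample (formats without pages may not).

```python
import json

sample_response = json.loads("""
{
  "chunks": [
    {
      "text": "Our return policy allows refunds within 30 days of purchase...",
      "score": 0.94,
      "document_id": "doc_b5c2e1a3",
      "document_title": "policy.pdf",
      "page": 3,
      "heading": "Return Policy"
    }
  ]
}
""")


def format_citations(response: dict, min_score: float = 0.0) -> list:
    """Return one 'title, p. N (heading): score' line per chunk at or above min_score."""
    lines = []
    for chunk in response.get("chunks", []):
        if chunk["score"] < min_score:
            continue  # client-side filter; the API itself only limits by top_k
        lines.append(
            f'{chunk["document_title"]}, p. {chunk["page"]} '
            f'({chunk["heading"]}): score {chunk["score"]:.2f}'
        )
    return lines


citations = format_citations(sample_response)
```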

Using RAG in Chat Requests

To use a knowledge base in a chat request, add the rag field:

{
  "model": "my-model",
  "stream": false,
  "messages": [{"role": "user", "content": "What is our refund window?"}],
  "rag": {
    "collection_id": "c_a3f7b2d1",
    "top_k": 5,
    "rerank": true,
    "verify": true,
    "verify_deep": true
  }
}

RAG Chat Request Fields

Field	Type	Default	Description
`collection_id`	`string`	(required)	ID of the collection to retrieve from.
`top_k`	`integer`	`5`	Maximum chunks to retrieve and inject into the prompt.
`rerank`	`boolean`	`true`	Apply the cross-encoder reranker. Slightly slower but significantly more accurate for ambiguous queries.
`verify`	`boolean`	`true`	Run claim-level grounding verification after the answer is generated.
`verify_deep`	`boolean`	`true`	When `true`, verification uses the hallucination detection model for each claim (more accurate). When `false`, falls back to lexical overlap (faster, less accurate).
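
The fields above can be assembled programmatically. This is a small sketch with a hypothetical helper, `build_rag_chat_request`, whose keyword defaults mirror the table; it only builds the request body, because the chat endpoint path is not given in this section.

```python
def build_rag_chat_request(
    model: str,
    question: str,
    collection_id: str,
    top_k: int = 5,
    rerank: bool = True,
    verify: bool = True,
    verify_deep: bool = True,
) -> dict:
    """Build a non-streaming chat request body with the `rag` field attached."""
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": question}],
        "rag": {
            "collection_id": collection_id,
            "top_k": top_k,
            "rerank": rerank,
            "verify": verify,
            "verify_deep": verify_deep,
        },
    }


# Same body as the JSON example above.
body = build_rag_chat_request("my-model", "What is our refund window?", "c_a3f7b2d1")

# A faster variant: keep verification, but fall back to lexical overlap.
fast_body = build_rag_chat_request(
    "my-model", "What is our refund window?", "c_a3f7b2d1", verify_deep=False
)
```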

RAG in the Response

When RAG is active, the geodesia.rag field in the response contains retrieval and verification details:

"geodesia": {
  "rag": {
    "collection_id": "c_a3f7b2d1",
    "n_sources": 3,
    "sources": [
      {
        "text": "Our return policy allows refunds within 30 days...",
        "score": 0.94,
        "document_title": "policy.pdf",
        "page": 3
      }
    ],
    "verification": {
      "n_total": 2,
      "n_grounded": 2,
      "ungrounded": false,
      "claims": [
        {
          "claim": "refunds within 30 days",
          "grounded": true,
          "citation": "Our return policy allows refunds within 30 days..."
        }
      ]
    }
  },
  "brake": false
}

Field	Description
`n_sources`	Number of chunks retrieved
`sources`	List of retrieved chunks with text, relevance score, and document metadata
`verification.n_total`	Total claims extracted from the answer
`verification.n_grounded`	Claims supported by the retrieved chunks
`verification.ungrounded`	`false` when every claim is grounded, which suppresses the `halluc_context` flag; `true` when at least one claim is ungrounded
`verification.claims`	Per-claim grounding status and the matching citation
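
Client code can reduce the verification block to a simple decision. The sketch below uses a hypothetical helper, `summarize_verification`, and assumes the field layout shown above. The sample mirrors the abbreviated example, so `claims` lists fewer entries than `n_total`.

```python
def summarize_verification(geodesia: dict) -> dict:
    """Condense geodesia.rag.verification into a few fields for display or logging."""
    v = geodesia["rag"]["verification"]
    return {
        "all_grounded": not v["ungrounded"],
        # Share of extracted claims that were grounded; None if no claims were extracted.
        "grounded_ratio": v["n_grounded"] / v["n_total"] if v["n_total"] else None,
        "ungrounded_claims": [c["claim"] for c in v["claims"] if not c["grounded"]],
    }


sample = {
    "rag": {
        "verification": {
            "n_total": 2,
            "n_grounded": 2,
            "ungrounded": False,
            "claims": [
                {
                    "claim": "refunds within 30 days",
                    "grounded": True,
                    "citation": "Our return policy allows refunds within 30 days...",
                }
            ],
        }
    }
}
summary = summarize_verification(sample)
```

An application might show the answer as-is when `all_grounded` is true and otherwise highlight the entries in `ungrounded_claims` for review.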

Configuration

RAG-specific environment variables:

Variable	Default	Description
`GW_RAG_DIR`	`runs/rag_store`	Directory where the LanceDB embedding store is saved. Must be writable.
`GW_RAG_DEVICE`	`cuda:0`	Device for the embedding model. Use `cpu` on machines where the GPU is fully occupied by the LLM.
`GW_RAG_EMBED_MODEL`	`BAAI/bge-m3`	Hugging Face model ID for the text embedding model. BGE-M3 is multilingual and recommended.
`GW_RAG_RERANK`	`1`	Set to `0` to disable the reranker globally.
`GW_RAG_RERANK_MODEL`	`BAAI/bge-reranker-v2-m3`	Hugging Face model ID for the cross-encoder reranker.
`GW_RAG_TOPK`	`5`	Default number of chunks to retrieve (overridable per-request).
`GW_RAG_OVERFETCH`	`20`	Number of candidates retrieved by the dense retriever before reranking. Higher values improve recall but slow down reranking.
`GW_RAG_CTX_MAXCHARS`	`6000`	Maximum characters of retrieved context injected into the prompt. Long contexts are truncated.
`GW_RAG_MAX_CLAIMS`	`12`	Maximum number of claims extracted from the answer for claim-level verification.
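
For deployment scripts or tests it can help to resolve these variables the same way in one place. The sketch below mirrors the table's names and defaults; it is not the gateway's own parsing code. `RagConfig` and `load_rag_config` are hypothetical names, and treating any `GW_RAG_RERANK` value other than `0` as enabled is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RagConfig:
    store_dir: str
    device: str
    embed_model: str
    rerank: bool
    rerank_model: str
    top_k: int
    overfetch: int
    ctx_max_chars: int
    max_claims: int


def load_rag_config(env: Mapping[str, str] = os.environ) -> RagConfig:
    """Read the GW_RAG_* variables, falling back to the documented defaults."""
    return RagConfig(
        store_dir=env.get("GW_RAG_DIR", "runs/rag_store"),
        device=env.get("GW_RAG_DEVICE", "cuda:0"),
        embed_model=env.get("GW_RAG_EMBED_MODEL", "BAAI/bge-m3"),
        # Assumption: any value other than "0" leaves the reranker enabled.
        rerank=env.get("GW_RAG_RERANK", "1") != "0",
        rerank_model=env.get("GW_RAG_RERANK_MODEL", "BAAI/bge-reranker-v2-m3"),
        top_k=int(env.get("GW_RAG_TOPK", "5")),
        overfetch=int(env.get("GW_RAG_OVERFETCH", "20")),
        ctx_max_chars=int(env.get("GW_RAG_CTX_MAXCHARS", "6000")),
        max_claims=int(env.get("GW_RAG_MAX_CLAIMS", "12")),
    )


# Passing an explicit mapping keeps the example independent of the real environment.
defaults = load_rag_config({})
cpu_setup = load_rag_config({"GW_RAG_DEVICE": "cpu", "GW_RAG_RERANK": "0"})
```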

GPU allocation

If your GPU is fully occupied by the LLM, set GW_RAG_DEVICE=cpu. On CPU, BGE-M3 is slower for large uploads (~10–30 seconds per document) but runs fine. Once a document is indexed, retrieval on CPU is typically fast enough for real-time use.