Domain grounding
The Nortinia AI Assistant uses a strictly tenant-owned knowledge base (KB) for any question that is not common knowledge and not an in-app action. This article walks through the retrieval-then-generate pipeline, the hallucination guard, and the May 2026 evaluation results.
How the knowledge base is built
Every tenant can upload its own documents (PDF, DOCX, Markdown, HTML) in the assistant admin UI. Each uploaded document is then processed in four steps:
- Chunked — documents are split into ~800-token pieces with 100-token overlap at boundaries.
- Embedded — vectorised with OpenAI text-embedding-3-large.
- Indexed — stored in pgvector, scoped by tenantId.
- Metadata-tagged — document title, page range, upload date, uploader.
A typical tenant uploads 20-200 documents: T&C, GDPR policy, internal procedures, product catalogue, FAQ.
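The chunking step can be sketched in a few lines. This is a minimal illustration rather than the production code: it assumes the document has already been tokenised into a list (a real pipeline would use the embedding model's tokenizer), and the function name `chunk_tokens` is hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 800, overlap: int = 100) -> List[List[str]]:
    """Split a token sequence into windows of `size` tokens, where each
    window starts with the last `overlap` tokens of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # 700 new tokens per chunk with the defaults
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

With the defaults, a 2,000-token document yields three chunks starting at tokens 0, 700 and 1,400, so a sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.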
Retrieval-then-generate
For every knowledge question ("what is our refund policy?", "how many days to ship to Slovakia?"), the flow is:
- Retrieval — embed the user question, fetch the top-5 most relevant chunks from the tenant KB.
- Relevance check — if the top-1 chunk's cosine similarity is < 0.72, the system does NOT call the LLM for generation; it returns: "I did not find information on this in your knowledge base."
- Generate — if there is a relevant chunk, the LLM is prompted strictly: "answer ONLY from the supplied context, cite by reference, if not present say it is not present."
- Citation embedding — every factual statement in the reply carries an automatic [1], [2] pointer to the source chunk (document + page).
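The retrieval and relevance-check steps above could look roughly like the sketch below. It is an assumption-laden illustration: the chunks live in an in-memory list instead of pgvector, embeddings are passed in as plain lists, and the names `Chunk`, `top_chunks` and `build_prompt` are hypothetical. Only the top-5 limit, the 0.72 threshold and the "not found" message come from the description above.

```python
import math
from typing import List, Optional, Sequence, Tuple

NOT_FOUND = "I did not find information on this in your knowledge base."
MIN_TOP1_SIMILARITY = 0.72  # threshold for the best-matching chunk
TOP_K = 5                   # number of chunks fetched per question

# (source label, chunk text, embedding); this tuple layout is an assumption
Chunk = Tuple[str, str, Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vec: Sequence[float], chunks: Sequence[Chunk]) -> List[Tuple[float, Chunk]]:
    """Score every chunk against the query and keep the TOP_K best."""
    scored = [(cosine_similarity(query_vec, c[2]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:TOP_K]


def build_prompt(question: str, query_vec: Sequence[float], chunks: Sequence[Chunk]) -> Optional[str]:
    """Return the generation prompt, or None when the relevance check fails
    and the caller should reply with NOT_FOUND without calling the LLM."""
    ranked = top_chunks(query_vec, chunks)
    if not ranked or ranked[0][0] < MIN_TOP1_SIMILARITY:
        return None
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n".join(
        f"[{i}] ({chunk[0]}) {chunk[1]}"
        for i, (_, chunk) in enumerate(ranked, start=1)
    )
    return (
        "Answer ONLY from the supplied context, cite by reference, "
        "and if the answer is not present say it is not present.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A caller would send the returned prompt to the LLM, or reply with `NOT_FOUND` when the function returns `None`. Gating on the top-1 score means a weak best match stops the request before any generation cost is incurred.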
The hallucination guard
The LLM's reply is checked by a second model (post-process step): each factual statement is matched against the cited chunk. If the chunk does not support the statement, it is flagged. With 2 or more flags, the assistant either rewrites the reply more narrowly or returns "not found".
This adds about 400 ms of latency, but it cut hallucinated answers to 0.5% and brought grounded-only answers to 96.4% (see the eval results below).
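The flag-counting logic of the guard can be sketched as follows. The support check itself is done by a second model, so here it is passed in as a callable; `supports`, `GuardResult` and `run_guard` are hypothetical names, and the sketch only reports the outcome without choosing between a narrower rewrite and "not found".

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

FLAG_LIMIT = 2  # with this many flags or more, the reply is not sent as-is


@dataclass
class GuardResult:
    flagged: List[str]      # statements the cited chunk does not support
    needs_fallback: bool    # True when the reply must be rewritten or replaced


def run_guard(
    cited_statements: Sequence[Tuple[str, str]],
    supports: Callable[[str, str], bool],
) -> GuardResult:
    """Check each (statement, cited chunk) pair with the verifier callable.

    `supports(statement, chunk)` stands in for the second model; what
    happens with exactly one flag is not specified above, so this sketch
    simply lists it in `flagged` and leaves `needs_fallback` False.
    """
    flagged = [
        statement
        for statement, chunk in cited_statements
        if not supports(statement, chunk)
    ]
    return GuardResult(flagged=flagged, needs_fallback=len(flagged) >= FLAG_LIMIT)
```

Keeping the verifier behind a callable makes the threshold logic testable without a model call.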
The 412-question eval
In May 2026 we assembled a 412-question reference set across 5 customer KBs. The questions were split across three types:
- 60% — clearly "in the KB" type (cited answer expected)
- 30% — "not in the KB" type ("not found" expected)
- 10% — misleading type (KB has something similar but not the exact answer)
The results:
- Grounded-only answers: 96.4% (397/412)
- Wrong citation: 1.7% (7/412) — answer correct, wrong chunk pointer
- Hallucinated answer: 0.5% (2/412) — stated a fact not in the KB
- Too conservative ("not found" when the answer was in the KB): 1.4% (6/412)
The 0.5% hallucination rate is below our GA threshold. For comparison, the same question set run against raw gpt-5 without a KB produced a 38% hallucination rate.
What we do not ground
Two things we deliberately let the LLM use "its own" knowledge for:
- Common knowledge — "how many months in a year", "what is HU VAT in general". Grounding overhead is not worth it.
- In-app actions — "navigate to the invoice", "create a refund". These are not knowledge questions; they are MCP tool calls. Different path.
Edge cases such as "what is the HU intra-EU VAT rate as of July 2026" count as knowledge questions. If the tenant KB does not cover them, the system replies: "I did not find this, I suggest checking the NAV website".
Per-tenant isolation
One of the most important rules: tenant A's KB NEVER leaks to tenant B. Every pgvector query carries a mandatory tenant_id = $current filter. Two integration tests on every release, plus a monthly chaos test, verify that the filter cannot be bypassed, even by prompt injection.
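One way to make the tenant filter impossible to omit is to build every KB query through a single helper that refuses to run without a tenant. The sketch below is an assumption, not the actual schema: the table `kb_chunks`, its columns, the function `build_kb_query` and the `%s` placeholder style (as used by psycopg) are all hypothetical. The `<=>` operator is pgvector's cosine distance, so `1 - distance` gives cosine similarity.

```python
from typing import List, Sequence, Tuple


def build_kb_query(
    tenant_id: str,
    query_embedding: Sequence[float],
    top_k: int = 5,
) -> Tuple[str, List[object]]:
    """Return (sql, params) for a tenant-scoped similarity search.

    The tenant filter is part of the fixed SQL text and the tenant id is
    bound as a parameter, so user-supplied text never reaches the WHERE
    clause as SQL.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required for every KB query")
    sql = (
        "SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM kb_chunks "
        "WHERE tenant_id = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    # pgvector accepts vectors as a text literal of the form '[x1,x2,...]'
    vec = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    return sql, [vec, tenant_id, vec, top_k]
```

Because the user's question only ever influences the embedding parameter, a prompt-injection attempt can change which chunks rank highest within the tenant but cannot alter the filter itself.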
Roadmap
Two big improvements are queued. (1) Cross-document reasoning: when the answer requires combining chunks from two different documents, today's top-5 retrieval does not always surface both, so multi-hop retrieval is needed. (2) Auto-refresh KB: an opt-in daily scan that checks whether a new document has landed in the tenant's Drive and imports it automatically.