aichat18 May 2026EN

Multi-language, one knowledge base — how the AI Chat translates

Source written in one language, multilingual embedding, query in any language. Glossary override for brand and domain terms. Three tested languages: HU, EN, DE.

Multi-language, one knowledge base — how AI Chat translates

Most multilingual chat systems fail in two ways: either they translate everything at the CMS layer (per-language hand-edited duplicate content, expensive, slow), or they translate everything mid-conversation (expensive in token billing, and domain terms drift). Nortinia AI Chat takes a third path: source written in one language, translation-aware retrieval.

The concept: embed once, query many

The tenant writes the knowledge base in a single source language — on home markets that is typically Hungarian. It can be a product description, a FAQ, an ToS clause, or an internal knowledge-base entry.

The pipeline:

Embedding — every document chunk is embedded in the source language with a multilingual model (text-embedding-3-large or similar). The model encodes semantics, not surface words. A Hungarian "szállítási idõ" embedding sits close in vector space to an English "delivery time" embedding.
Query — the visitor can ask in any language. The question is embedded in the visitor's language. The vector search runs in the same space.
Generation — the top-k chunks return to the model in the source language, and the model composes the response language. The prompt explicitly says: "User language is Hungarian / English / German — answer in that."

Key: we never pre-translate the knowledge base content. The model translates the meaning back and forth during the conversation.

The glossary override for domain terms

There are places where generic translation is not enough. Legal, regulatory, or brand-specific terms must appear in a fixed form. The per-tenant glossary handles that:

{
  "glossary": {
    "hu": {
      "ÁSZF": "ÁSZF",
      "GDPR-megfelelõség": "GDPR-megfelelõség",
      "Pro csomag": "Pro csomag"
    },
    "en": {
      "ÁSZF": "Terms of Service (ÁSZF)",
      "GDPR-megfelelõség": "GDPR compliance",
      "Pro csomag": "Pro plan"
    },
    "de": {
      "ÁSZF": "AGB",
      "GDPR-megfelelõség": "DSGVO-Konformität",
      "Pro csomag": "Pro-Tarif"
    }
  }
}

At prompt-build time, the glossary entries are passed as strict instructions. The model knows: if the term "ÁSZF" comes up, render it in English as "Terms of Service (ÁSZF)", not as "general conditions".

The three languages we tested most

Hungarian, English, German. 96% of total fleet traffic runs on these. Accuracy (response correctness by human eval) looks like this:

HU → HU: 94.2% (the source language, naturally best)
HU → EN: 91.8% (generic translation is very good here)
HU → DE: 88.3% (without the glossary override it was 79% — because of compounded domain terms)

The corner case: Hungarian compound nouns

One feature of Hungarian is unbounded noun compounding. "Ügyfélelégedettségfelmérés" — that is a real word, but the tokenizer does not recognize it. Embedding quality drops on these (they are rare in the training corpus).

What we do: before indexing, the tenant's source text is run through a pre-processor pass that splits overly long compound nouns into parts. The result: it becomes "ügyfélelégedettség-felmérés", which the tokenizer handles naturally. In an early chunk-accuracy test, this pass improved embedding-search accuracy by 6.4% on Hungarian source.

The boring discipline

A few things the tenant must hold to for this to work well:

One source language per document group. Do not mix HU + EN text in a single FAQ entry — the embedding will be noisy.
Glossary for every brand term. Product names, plan names, legal abbreviations must be listed.
Human-eval testing. Quarterly, 50 questions per language pair, random sample — if something regresses, we see it in time.

The lesson

Multi-language in AI chat is not free. But you save 80% of the cost of the dual-edited-knowledge-base model if you do it right. The remaining 20% is glossary + human-eval — a quarter's work that bakes into the system.