Multi-language, one knowledge base: how AI Chat translates
Most multilingual chat systems fail in two ways: either they translate everything at the CMS layer (per-language hand-edited duplicate content, expensive, slow), or they translate everything mid-conversation (expensive in token billing, and domain terms drift). Nortinia AI Chat takes a third path: source written in one language, translation-aware retrieval.
The concept: embed once, query many
The tenant writes the knowledge base in a single source language; in our home market that is typically Hungarian. The content can be a product description, an FAQ, a ToS clause, or an internal knowledge-base entry.
The pipeline:
- Embedding: every document chunk is embedded in the source language with a multilingual model (text-embedding-3-large or similar). The model encodes semantics, not surface words, so the embedding of Hungarian "szállítási idő" sits close in vector space to that of English "delivery time".
- Query: the visitor can ask in any language. The question is embedded in the visitor's language, and the vector search runs in the same space as the indexed chunks.
- Generation: the top-k chunks are passed to the model in the source language, and the model composes the response in the visitor's language. The prompt explicitly says: "User language is Hungarian / English / German. Answer in that."
Key: we never pre-translate the knowledge base content. The model translates the meaning back and forth during the conversation.
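The retrieval step above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the three-dimensional vectors are hand-made stand-ins for real embeddings, the Hungarian chunk texts are invented examples, and the names cosine and top_k are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Source-language chunks, embedded once at indexing time.
# Toy vectors (assumption): a real index would hold model-produced embeddings.
index = [
    ("A szállítási idő 2-3 munkanap.", [0.9, 0.1, 0.0]),   # delivery time
    ("A Pro csomag havi díjas.", [0.1, 0.9, 0.1]),          # Pro plan billing
    ("Az ÁSZF a láblécben érhető el.", [0.0, 0.2, 0.9]),    # where to find the ToS
]

# Stand-in for the embedding of the English question "What is the delivery time?"
query_vec = [0.8, 0.2, 0.1]

chunks = top_k(query_vec, index, k=1)
print(chunks)
```

The point of the sketch is that the query vector and the chunk vectors are compared directly; no translation step sits between the English question and the Hungarian chunk.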
The glossary override for domain terms
There are places where generic translation is not enough. Legal, regulatory, or brand-specific terms must appear in a fixed form. The per-tenant glossary handles that:
{
  "glossary": {
    "hu": {
      "ÁSZF": "ÁSZF",
      "GDPR-megfelelőség": "GDPR-megfelelőség",
      "Pro csomag": "Pro csomag"
    },
    "en": {
      "ÁSZF": "Terms of Service (ÁSZF)",
      "GDPR-megfelelőség": "GDPR compliance",
      "Pro csomag": "Pro plan"
    },
    "de": {
      "ÁSZF": "AGB",
      "GDPR-megfelelőség": "DSGVO-Konformität",
      "Pro csomag": "Pro-Tarif"
    }
  }
}
At prompt-build time, the glossary entries are passed as strict instructions. The model knows: if the term "ÁSZF" comes up, render it in English as "Terms of Service (ÁSZF)", not as "general conditions".
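One way the glossary could be turned into prompt instructions is sketched below. The function name build_system_prompt, the LANG_NAMES mapping, and the exact instruction wording are assumptions made for illustration; only the glossary shape and the language line come from the text above.

```python
def build_system_prompt(glossary, user_lang, lang_names):
    """Compose the language instruction plus one strict rule per glossary term."""
    lines = [f"User language is {lang_names[user_lang]}. Answer in that language."]
    terms = glossary.get(user_lang, {})
    if terms:
        lines.append("Render these source terms exactly as given:")
        for source_term, fixed_form in terms.items():
            lines.append(f'- "{source_term}" -> "{fixed_form}"')
    return "\n".join(lines)

# Trimmed copy of the per-tenant glossary shown above.
glossary = {
    "en": {"ÁSZF": "Terms of Service (ÁSZF)", "Pro csomag": "Pro plan"},
    "de": {"ÁSZF": "AGB", "Pro csomag": "Pro-Tarif"},
}
LANG_NAMES = {"hu": "Hungarian", "en": "English", "de": "German"}  # hypothetical helper

prompt = build_system_prompt(glossary, "en", LANG_NAMES)
print(prompt)
```

Keeping the rules as explicit source-to-target pairs means the same retrieved Hungarian chunk can be answered in English or German with different fixed renderings, without touching the indexed content.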
The three languages we tested most
Hungarian, English, German. 96% of total fleet traffic runs on these. Accuracy (response correctness by human eval) looks like this:
- HU → HU: 94.2% (the source language, naturally best)
- HU → EN: 91.8% (generic translation is very good here)
- HU → DE: 88.3% (without the glossary override it was 79%, because of compound domain terms)
The corner case: Hungarian compound nouns
One feature of Hungarian is unbounded noun compounding. "Ügyfélelégedettségfelmérés" (customer satisfaction survey) is a real word, but the tokenizer does not recognize it as a unit. Embedding quality drops on such words because they are rare in the training corpus.
What we do: before indexing, the tenant's source text goes through a pre-processing pass that splits overly long compound nouns into parts. The word above becomes "ügyfélelégedettség-felmérés", which the tokenizer handles naturally. In an early chunk-accuracy test, this pass improved embedding-search accuracy by 6.4% on Hungarian source text.
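The text does not say how the split points are found, so the following is only a sketch under a stated assumption: a small lexicon of known compound heads (KNOWN_HEADS) and a length threshold (MIN_LENGTH), both hypothetical. A real pre-processor would more likely rely on a morphological analyzer.

```python
import re
import unicodedata

KNOWN_HEADS = ("felmérés", "szolgáltatás")  # hypothetical lexicon of compound heads
MIN_LENGTH = 20                              # hypothetical "overly long" threshold

def split_compound(word):
    """Insert a hyphen before a known head if the word is long enough."""
    if len(word) < MIN_LENGTH:
        return word
    lowered = word.lower()
    for head in KNOWN_HEADS:
        if lowered.endswith(head) and len(word) > len(head):
            cut = len(word) - len(head)
            return f"{word[:cut]}-{word[cut:]}"
    return word

def preprocess(text):
    """Normalize to NFC, then apply split_compound to every word in the text."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\w+", lambda m: split_compound(m.group(0)), text)

print(preprocess("Ügyfélelégedettségfelmérés indul."))
```

Short words and words without a listed head pass through unchanged, so the pass only touches the long compounds it knows how to split.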
The boring discipline
A few things the tenant must hold to for this to work well:
- One source language per document group. Do not mix HU + EN text in a single FAQ entry, or the embedding will be noisy.
- Glossary for every brand term. Product names, plan names, legal abbreviations must be listed.
- Human-eval testing. Quarterly, 50 questions per language pair, random sample. If something regresses, we see it in time.
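The quarterly check in the last point could be scored along these lines. The baseline figures are the accuracy numbers quoted earlier; the reviewer verdicts, the 2-point tolerance, and all function names are hypothetical and exist only to make the sketch runnable.

```python
import random

def sample_questions(pool, n=50, seed=None):
    """Draw a random sample of n questions (or the whole pool if smaller)."""
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))

def accuracy(verdicts):
    """Percentage of answers a human reviewer judged correct."""
    return 100.0 * sum(verdicts) / len(verdicts) if verdicts else 0.0

def regressions(current, baseline, tolerance=2.0):
    """Language pairs whose score fell more than `tolerance` points below baseline."""
    return sorted(
        pair for pair, score in current.items()
        if pair in baseline and baseline[pair] - score > tolerance
    )

baseline = {"HU-HU": 94.2, "HU-EN": 91.8, "HU-DE": 88.3}

# Hypothetical reviewer verdicts for one quarter (True = judged correct).
verdicts = {
    "HU-HU": [True] * 47 + [False] * 3,
    "HU-EN": [True] * 46 + [False] * 4,
    "HU-DE": [True] * 41 + [False] * 9,
}

current = {pair: accuracy(v) for pair, v in verdicts.items()}
flagged = regressions(current, baseline)
print(current, flagged)
```

With 50 questions per pair, a single verdict moves the score by 2 points, so a tolerance smaller than that would flag ordinary sampling noise as a regression.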
The lesson
Multi-language in AI chat is not free. But you save 80% of the cost of the dual-edited-knowledge-base model if you do it right. The remaining 20% is glossary + human-eval: a quarter's work that gets baked into the system.