AI stack decisions: the 3-layer architecture we'd build today

Not a single model — a stack: Opus for synthesis (5-10%), Haiku for fast answers (80%), embeddings for memory. Measured outcome: -77% cost, -67% latency.

The most common mistake we see in AI rollouts: the team decides to “use GPT-4”, and everything (chat, summarization, embeddings, classification) runs on a single model. Six weeks later they are looking at a EUR 3,200/month bill and 4-7 second response times. The fix isn't a different model. The fix is a stack.

The three layers

When we draw a stack plan for a client at Nortinia consulting, we sketch three layers on the whiteboard — none of them optional, each one suited to a different job.

Layer 1 — Synthesis (Claude Opus / GPT-4 class)

This is everything that has to “think” hard: long-document summarization, legal rewrites, multi-step reasoning, tool-use orchestration, agent coordination. It is expensive and slow (~2-6s), but you can't get the quality elsewhere. Typical share of traffic is 5-10%, yet it accounts for 60-70% of the bill.

Anti-pattern: making Opus answer “Thanks, that helped!” in the chatbot. In money terms, you pay EUR 18-25 per million tokens where Haiku would handle it for EUR 0.80.
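To make the anti-pattern concrete, here is a minimal arithmetic sketch using the per-million-token prices quoted above. The workload (10,000 acknowledgement messages of about 150 tokens each) is a made-up assumption for illustration, not a measured figure.

```python
# Prices per million tokens in EUR, taken from the figures quoted in the text.
OPUS_PRICE_RANGE = (18.0, 25.0)
HAIKU_PRICE = 0.80


def cost_eur(tokens: int, price_per_million: float) -> float:
    """Cost in EUR of processing `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def overpay_factor(top_price: float, fast_price: float) -> float:
    """How many times more the top tier costs for the same tokens."""
    return top_price / fast_price


# Hypothetical workload: 10,000 short acknowledgements of ~150 tokens each.
tokens = 10_000 * 150
low, high = (cost_eur(tokens, p) for p in OPUS_PRICE_RANGE)
fast = cost_eur(tokens, HAIKU_PRICE)

print(f"top tier:  EUR {low:.2f} to {high:.2f}")
print(f"fast tier: EUR {fast:.2f}")
```

At these prices the top tier costs roughly 22 to 31 times as much for the same tokens, whatever the workload size.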

Layer 2 — Fast answer (Claude Haiku / GPT-4o-mini class)

This is 80% of the chat: simple Q&A, FAQ, fixed-format extraction, classification, light routing. Latency is 400-900ms, and the price is roughly a tenth of the top tier or less. The rule of the game: put everything here that can go here, and only promote upward on the evidence of a measurement.

Across Nortinia tenants the average split is 73% Haiku, 18% Opus, 9% embedding/other. At one client that came to us with no layering in place, the split was 4% Haiku / 91% Opus.
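The "default to the fast layer, promote only on evidence" rule can be sketched as a tiny router. This is an illustration under stated assumptions, not Nortinia's implementation: the task-type names and the `PROMOTED_TASKS` set are hypothetical, and the actual model calls are left out.

```python
from collections import Counter
from enum import Enum


class Layer(Enum):
    SYNTHESIS = "synthesis"  # Opus / GPT-4 class
    FAST = "fast"            # Haiku / GPT-4o-mini class


# Task types promoted to the synthesis layer. Each entry should be backed by a
# measurement showing the fast layer was not good enough. Names are illustrative.
PROMOTED_TASKS = {"long_doc_summary", "legal_rewrite", "agent_orchestration"}


def route(task_type: str) -> Layer:
    """Default to the fast layer; promote only explicitly listed task types."""
    return Layer.SYNTHESIS if task_type in PROMOTED_TASKS else Layer.FAST


def traffic_split(task_types: list[str]) -> dict[str, float]:
    """Share of requests per layer, in percent, for monitoring the split."""
    counts = Counter(route(t).value for t in task_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {layer: 100 * n / total for layer, n in counts.items()}
```

Logging the output of `traffic_split` over a day of traffic gives the kind of Haiku/Opus percentage quoted above, which makes drift toward the expensive layer visible early.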

Layer 3 — Memory and retrieval (embeddings + vector DB)

This is the work that doesn't happen inside the chat call: document chunking, embedding, similarity search. Use text-embedding-3-small or bge-m3, at about a thousandth of the cost per call. Here the question isn't latency, it's what you index and how you refresh it. The most common mistake: the team indexed everything once, 18 months ago, and the knowledge base has drifted from the index ever since.
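One way to guard against the stale-index problem is a periodic check that lists documents needing re-embedding. The sketch below assumes you can read a last-modified timestamp from the source system and store a last-indexed timestamp per document; the `Doc` type, the field names and the 30-day default are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    modified_at: datetime            # last change in the source system
    indexed_at: Optional[datetime]   # last time it was embedded; None = never


def needs_reindex(doc: Doc, now: datetime, max_age: timedelta) -> bool:
    """True if the doc was never indexed, changed since indexing, or is too old."""
    if doc.indexed_at is None:
        return True
    if doc.modified_at > doc.indexed_at:
        return True
    return now - doc.indexed_at > max_age


def stale_docs(docs: list[Doc], now: datetime,
               max_age: timedelta = timedelta(days=30)) -> list[str]:
    """IDs of documents that should be re-chunked and re-embedded."""
    return [d.doc_id for d in docs if needs_reindex(d, now, max_age)]
```

Running this on a schedule and re-embedding only the returned IDs keeps refresh cost proportional to what actually changed.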

The decision matrix

In the cost / quality / latency triangle, every new feature gets sorted by five questions:

  1. Is the answer a sentence or a chain of reasoning? — one sentence → layer 2
  2. Does it need to call external tools mid-response? — multiple calls → layer 1
  3. Real-time UX or async batch? — batch → always the cheaper model
  4. Structured output (JSON) or free prose? — Haiku does JSON reliably if the schema is tight
  5. Retrieval or generation? — retrieval → embedding + RAG, don't search with an LLM

This matrix has to be filled in before implementation. Doing it after hurts.
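The five questions can be written down as a small function so the answer is recorded per feature. This is a sketch, not a definitive rule engine: the `Feature` fields are hypothetical names, and the order in which the questions are checked (retrieval first, then batch, then tool calls) is an assumption, since the list above does not say which question wins when they conflict.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    needs_reasoning_chain: bool  # Q1: a chain of reasoning rather than a sentence
    tool_calls: int              # Q2: external tool calls made mid-response
    is_batch: bool               # Q3: async batch rather than real-time UX
    tight_json_schema: bool      # Q4: structured output with a tight schema
    is_retrieval: bool           # Q5: retrieval rather than generation


def recommend_layer(f: Feature) -> tuple[int, str]:
    """Return (layer number, reason). The precedence order is an assumption."""
    if f.is_retrieval:
        return 3, "retrieval: embedding + RAG, don't search with an LLM"
    if f.is_batch:
        return 2, "batch: always the cheaper model"
    if f.tool_calls > 1:
        return 1, "multiple tool calls mid-response"
    if f.needs_reasoning_chain:
        return 1, "chain of reasoning"
    if f.tight_json_schema:
        return 2, "structured output with a tight schema"
    return 2, "short answer: default to the fast layer"


faq_bot = Feature(needs_reasoning_chain=False, tool_calls=0, is_batch=False,
                  tight_json_schema=False, is_retrieval=False)
layer, reason = recommend_layer(faq_bot)
print(f"layer {layer}: {reason}")
```

Storing the returned reason next to the feature ticket gives a record of why a feature was placed on a given layer, which helps when a later measurement argues for promotion.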

Typical measured outcome

On a mid-size B2B SaaS (40k chats/month), after we introduced stack-level routing:

  • Monthly model bill: EUR 1,840 → 410 (-77%)
  • p50 latency: 2.1s → 0.7s (-67%)
  • Chat CSAT: 4.1 → 4.4 (faster answer matters more than “slightly smarter”)

The stack-level decision isn't AI strategy. It's engineering hygiene.

Takeaway

Pick a stack, not a model. Draw the three layers before you write code, and fill the decision matrix before every new feature. The LLM market shifts every 6 months; the architecture pattern will hold for 3 years.
