AI stack decisions: the 3-layer architecture we'd build today
The most common mistake we see in AI rollouts: the team decides to âuse GPT-4â, and everything â chat, summarization, embeddings, classification â runs on a single model. Six weeks later, a EUR 3,200/month bill and 4-7 second response times. The fix isn't a different model. The fix is a stack.
The three layers
When we draw a stack plan for a client at Nortinia consulting, we sketch three layers on the whiteboard â none of them optional, each one suited to a different job.
Layer 1 â Synthesis (Claude Opus / GPT-4 class)
This is everything that has to âthinkâ hard: long-document summarization, legal rewrites, multi-step reasoning, tool-use orchestration, agent coordination. Expensive, slow (~2-6s), but you can't get the quality elsewhere. Typical share of traffic: 5-10%. It also accounts for 60-70% of the bill.
Anti-pattern: making Opus answer âThanks, that helped!â on the chatbot. In money: you pay EUR 18-25 per million tokens where Haiku would solve it for EUR 0.80.
Layer 2 â Fast answer (Claude Haiku / GPT-4o-mini class)
This is 80% of the chat: simple Q&A, FAQ, fixed-format extraction, classification, light routing. Latency 400-900ms, price one tenth of the top tier. The rule of the game: put everything here that can go here â and only promote upward on the evidence of a measurement.
Across Nortinia tenants the average split is 73% Haiku, 18% Opus, 9% embedding/other. At a client where no layering was done, this was 4% Haiku / 91% Opus on entry.
Layer 3 â Memory and retrieval (embeddings + vector DB)
What you don't call from the chat: document chunking, embedding, similarity search. text-embedding-3-small or bge-m3, a thousandth of the cost per call. Here the question isn't latency, it's what you index and how you refresh it. The most common mistake: they indexed everything once 18 months ago, and the knowledge base has drifted ever since.
The decision matrix
In the cost / quality / latency triangle, every new feature gets sorted by five questions:
- Is the answer a sentence or a chain of reasoning? â one sentence â layer 2
- Does it need to call external tools mid-response? â multiple calls â layer 1
- Real-time UX or async batch? â batch â always the cheaper model
- Structured output (JSON) or free prose? â Haiku does JSON reliably if the schema is tight
- Retrieval or generation? â retrieval â embedding + RAG, don't search with an LLM
This matrix has to be filled in before implementation. Doing it after hurts.
Typical measured outcome
On a mid-size B2B SaaS (40k chats/month), after we introduced stack-level routing:
- Monthly model bill: EUR 1,840 â 410 (-77%)
- p50 latency: 2.1s â 0.7s (-67%)
- Chat CSAT: 4.1 â 4.4 (faster answer matters more than âslightly smarterâ)
The stack-level decision isn't AI strategy. It's engineering hygiene.
Takeaway
Pick a stack, not a model. Draw the three layers before you write code, and fill the decision matrix before every new feature. The LLM market shifts every 6 months; the architecture pattern will hold for 3 years.