assistant | 4 May 2026 | EN

Token budget at tenant level — soft cap, hard cap, circuit breaker

Soft cap at 90% (mutations become preview-only), hard cap at 100% (read-only). The tenant admin is notified and the operator on-call is paged.


LLM usage cost in the Nortinia AI Assistant is accounted per tenant and bounded by per-tenant daily and monthly caps. This article describes the soft / hard cap system, the circuit-breaker behaviour, and the operator alert path.

Why tenant-scoped

A large customer (5,000 users) burns 80M tokens a day; a small one (50 users) burns 200k. Per-user limits do not work at that spread: a big tenant would need thousands of them configured, and they are not fair either, because a big tenant can absorb its own power user inside its overall budget. The tenant is the natural unit of accounting because the tenant is what gets invoiced.

The three caps

Daily soft cap

Every tenant has a daily token budget by contract (e.g. 5M tokens/day on the STANDARD tier). When usage hits 90%, the system does two things:

Notify the tenant admin — push and email: "you are at 90% of today's budget".
Mutation tools switch to preview-only — every write/update/delete tool now only previews; it does not execute. The user sees what would happen, but to commit they have to wait (until tomorrow) or have an admin raise the cap.

Read-only tools (search, KB lookup, audit query) continue unchanged.

Daily hard cap

When usage hits 100%:

The assistant becomes fully read-only. All mutations off; reads still go.
Pager-level alert to our operator (Netorigo-side on-call).
SMS to the tenant admin (on top of the soft-cap notify, to be sure it lands).
In-chat warning to the user — "today's budget is used up, read-only until tomorrow morning; ask your admin for a cap raise".

The hard cap stays in effect until the daily counter resets at the next 00:00 UTC, so the lockout lasts at most 24 hours.
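Taking the 00:00 UTC reset as the rule, how long a tenant stays read-only depends on when the cap was hit. The following is a minimal sketch of that calculation; the function name `next_daily_reset` is hypothetical and not part of the product.

```python
from datetime import datetime, timedelta, timezone


def next_daily_reset(now: datetime) -> datetime:
    """Return the first 00:00 UTC strictly after `now` (hypothetical helper)."""
    now_utc = now.astimezone(timezone.utc)
    midnight_today = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today + timedelta(days=1)


# Example: the hard cap is hit at 21:30 UTC.
hit = datetime(2026, 5, 4, 21, 30, tzinfo=timezone.utc)
reset = next_daily_reset(hit)
print(reset.isoformat())  # 2026-05-05T00:00:00+00:00
print(reset - hit)        # 2:30:00
```

A cap hit late in the UTC day therefore clears quickly, while one hit just after midnight lasts almost the full day.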

Monthly cap

Same soft/hard logic at month scale. This catches a tenant that stays under the daily cap every day but is still on track to exceed its monthly budget.

The circuit-breaker decision tree

Before every tool call the system checks:

IF tenant.daily_used / tenant.daily_cap < 0.9
  → run tool normally
ELSE IF tenant.daily_used / tenant.daily_cap < 1.0
  → if tool is mutation: preview-only
  → if tool is read: run normally
ELSE
  → if tool is mutation: refuse, suggest admin contact
  → if tool is read: run normally

The monthly cap runs the same check in parallel; the stricter side wins.
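The decision tree, combined with the "stricter side wins" rule for the monthly cap, could be sketched in Python as follows. This is an illustration under assumptions, not the production code: the names `Decision`, `cap_level` and `decide` are hypothetical, and caps are assumed to be positive integers.

```python
from enum import IntEnum


class Decision(IntEnum):
    RUN = 0           # execute the tool normally
    PREVIEW_ONLY = 1  # show what would happen, do not commit
    REFUSE = 2        # refuse and suggest contacting the admin


def cap_level(used: int, cap: int) -> Decision:
    """Map one budget (daily or monthly) to a restriction level."""
    # Integer comparison avoids float rounding right at the 90% boundary.
    if used * 10 < cap * 9:
        return Decision.RUN
    if used < cap:
        return Decision.PREVIEW_ONLY
    return Decision.REFUSE


def decide(is_mutation: bool, daily_used: int, daily_cap: int,
           monthly_used: int, monthly_cap: int) -> Decision:
    """Check both budgets before a tool call; the stricter result wins."""
    if not is_mutation:
        return Decision.RUN  # read tools run at every level
    return max(cap_level(daily_used, daily_cap),
               cap_level(monthly_used, monthly_cap))


# A mutation at 92% of the daily budget, with the monthly budget far from its cap.
print(decide(True, 4_600_000, 5_000_000, 10_000_000, 150_000_000).name)  # PREVIEW_ONLY
```

Ordering the levels as integers makes "the stricter side wins" a plain `max`, so a further budget scope can be added as one more `cap_level` call.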

Operator alert path

When a tenant hits the hard cap, on the Netorigo side:

PagerDuty alert — on-call engineer gets an incident within 5 minutes.
Auto-context — the alert includes the tenant name, the last-24h usage graph, and a link to the billing surface.
Default action — the on-call engineer can raise the tenant cap by 20% for 24h with one click (emergency top-up). This top-up appears as a separate line on the customer's monthly invoice.
Escalation — if a tenant hits hard cap twice within 3 days, the account manager gets an auto-ticket to talk to the customer about a tier upgrade.
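The two numeric rules in this path, the 20% emergency top-up and the "twice within 3 days" escalation, can be sketched as below. The function names are hypothetical, and treating the 3-day window as a trailing window that includes the current hit is an assumption; the article does not define the boundary.

```python
from datetime import datetime, timedelta, timezone

ESCALATION_WINDOW = timedelta(days=3)  # "twice within 3 days"


def emergency_cap(daily_cap: int) -> int:
    """Daily cap after the one-click 20% top-up (integer arithmetic)."""
    return daily_cap + daily_cap * 20 // 100


def needs_account_manager_ticket(hard_cap_hits: list[datetime], now: datetime) -> bool:
    """True if at least two hard-cap hits fall inside the trailing window."""
    # Assumption: the window is inclusive and ends at `now`.
    recent = [t for t in hard_cap_hits if timedelta(0) <= now - t <= ESCALATION_WINDOW]
    return len(recent) >= 2


now = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
print(emergency_cap(5_000_000))                                             # 6000000
print(needs_account_manager_ticket([now - timedelta(days=2), now], now))   # True
print(needs_account_manager_ticket([now - timedelta(days=5), now], now))   # False
```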

What the tenant admin sees

The tenant admin UI has a /usage dashboard with:

Daily graph — last 30 days, hourly breakdown
Per-user breakdown — who burns the most
Per-tool breakdown — which tools cost the tokens (often long-form generation)
Trend forecast — "at this rate you will be at x% by month end"
Cap raise button — goes straight into the tier upgrade flow
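The article does not say how the trend forecast is computed. One simple way to produce an "x% by month end" figure is a linear extrapolation of month-to-date usage, sketched here. The function name, the 150M monthly cap and the choice to count today as a full elapsed day are all assumptions for illustration.

```python
import calendar
from datetime import date


def month_end_forecast_pct(used_so_far: int, monthly_cap: int, today: date) -> float:
    """Project month-end usage as a percentage of the monthly cap."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = used_so_far / today.day  # assumes today counts as a full day
    projected = daily_rate * days_in_month
    return 100 * projected / monthly_cap


# 40M tokens used by 10 May against a hypothetical 150M monthly cap.
pct = month_end_forecast_pct(40_000_000, 150_000_000, date(2026, 5, 10))
print(f"at this rate you will be at {pct:.1f}% by month end")
```

A linear projection overreacts early in the month, when a single heavy day dominates the average, so a real forecast would likely smooth over a longer window.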

The most common cause

80% of hard-cap hits trace to one pattern: a huge bulk export that someone requests in chat ("export every 2025 invoice for me"). The LLM loads 50k+ records into context and the token count jumps.

In March 2026 we added a guard: when the planned context exceeds 10MB, the tool does not run and instead offers an alternative ("this export goes via standard CSV download, not through the LLM"). Hard-cap hits dropped 60% after that.
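A size guard of this kind can be sketched as a check that runs before the tool loads anything into context. The names below are hypothetical, the byte estimate is assumed to be supplied by the caller, and reading "10MB" as 10,000,000 bytes is also an assumption.

```python
MAX_PLANNED_CONTEXT_BYTES = 10 * 1000 * 1000  # assumption: 10MB in decimal units

CSV_FALLBACK = "This export goes via standard CSV download, not through the LLM."


def check_export(planned_context_bytes: int) -> tuple[bool, str]:
    """Return (allowed, message); the message is empty when the tool may run."""
    if planned_context_bytes > MAX_PLANNED_CONTEXT_BYTES:
        return False, CSV_FALLBACK
    return True, ""


# 50k records at a hypothetical 400 bytes each is 20MB of planned context.
allowed, message = check_export(50_000 * 400)
print(allowed)  # False
print(message)
```

Because the check runs on the planned size rather than on tokens already spent, the oversized request costs the tenant nothing.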

What this means for the customer

The tenant knows what it spends. No invoice surprise. The 90% soft cap is always an early warning. The hard cap is not a punishment; it is a safety net — and the emergency raise is one on-call click. The assistant does not stop entirely even at hard cap; only the risky operations wait.

What we are building next

Per-user caps (alongside the tenant cap) and per-team caps (an intermediate level), so that one team inside a tenant cannot drain the budget for another. We are also building an analyst view that shows which feature combination gives the tenant the best cost / value ratio.