Assistant · 11 May 2026

Voice vs. text — when humans pick which, and why

Voice and text modes in the Nortinia AI Assistant share one tool set. The only question is which is more convenient at that moment.


The Nortinia AI Assistant works in two modes on the same surface: text chat and live voice. Both use the same tool catalogue, both leave the same audit trail, both enforce the same scopes. The difference is not in capability. It is in when each mode is the right call.

When voice wins

Mobile pickers

Typing on a phone is slow. A warehouse picker carrying a scanner is not going to fill out an 8-field form with two thumbs. Voice wins outright here: "waybill, three pallets, Pécs warehouse, tomorrow morning" — done. The system shows the preview immediately; the user confirms with a nod (or "yes").

Executives on the move

The finance director asks from the car: "What is the total of invoices closed since yesterday?" The answer comes as a spoken summary with the visual breakdown underneath: heard in the car, reviewed on screen later at the office.

Hands-busy contexts

Medical practice (just washed hands), shop counter (packaging), production line (holding a tool). Voice is not a luxury here — it is the only usable interface.

Quick lookups

"Where is Péter Kovács's invoice?" — three seconds by voice. By hand: open search, type, wait, click. Voice wins on latency.

When text wins

Long lists

"Give me the 50 biggest transactions last week" — reading 50 items aloud is absurd. The system automatically switches to text and voice only summarises: "here are the 50 items in a table, click through for details".

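The long-list rule above can be sketched as a small decision function. This is a minimal illustration, not Nortinia's actual implementation: the cut-off `MAX_SPOKEN_ITEMS` and the names `Reply` and `plan_reply` are assumptions made for the example, since the article does not say where the threshold lies.

```python
from dataclasses import dataclass

# Hypothetical threshold; the article gives no exact cut-off.
MAX_SPOKEN_ITEMS = 5


@dataclass
class Reply:
    spoken: str       # what the voice channel says
    table: list[str]  # rows shown on screen (empty if nothing is shown)


def plan_reply(items: list[str]) -> Reply:
    """Read short results aloud; move long ones to text with a spoken pointer."""
    if len(items) <= MAX_SPOKEN_ITEMS:
        return Reply(spoken=", ".join(items), table=[])
    return Reply(
        spoken=f"Here are the {len(items)} items in a table, click through for details.",
        table=list(items),
    )
```

With 50 transactions the function returns all 50 rows for the screen and a single spoken sentence pointing at the table.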
Formal approvals

Approving a HUF 4M refund deserves a formal button press, not a voice confirmation. An explicit click is clearer evidence in the audit trail than a voice sample. For compliance reasons, every mutation above HUF 1M must go through text and a button press.
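As a rough sketch, the compliance rule is a single threshold check. The HUF 1M limit comes from the text above. The names `Confirmation` and `required_confirmation` are hypothetical, and the example assumes that mutations at or below the limit may still be confirmed by voice.

```python
from enum import Enum

# Threshold stated in the article: mutations above HUF 1M need text + button.
FORMAL_APPROVAL_THRESHOLD_HUF = 1_000_000


class Confirmation(Enum):
    VOICE_OR_BUTTON = "voice_or_button"  # assumed default for small mutations
    BUTTON_ONLY = "button_only"          # text mode plus an explicit click


def required_confirmation(amount_huf: int) -> Confirmation:
    """Return which confirmation a mutation of this size must use."""
    if amount_huf > FORMAL_APPROVAL_THRESHOLD_HUF:
        return Confirmation.BUTTON_ONLY
    return Confirmation.VOICE_OR_BUTTON
```

The comparison is strict, so a mutation of exactly HUF 1M would still accept a voice confirmation under this reading of the rule.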

Meetings and screen sharing

During a meeting you share your screen, and your colleague sees what you ask and what you get back. Voice is disruptive here because the rest of the room hears it too. Text is visually discreet, and the colleague can scroll back through the dialogue later.

Precise parameter entry

"New product, name PRO-2026-Q4-EU, price 199990, VAT 27%, category electronics." Passing this in text is faster and lower-error than spelling it out by voice.

The bidirectional handoff

Most users do not pick one mode. They start by voice in the morning in the car ("summarise last night's orders") and continue the same conversation in text at the laptop.

Continuity works because every chat session runs under one conversation ID regardless of whether the user is on voice or text. Next to the waveform there is a discreet "text" toggle: one click and the same context continues in text mode (the voice history is added as a transcript).
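One way to picture this continuity is a conversation record whose ID and history outlive any mode switch. The sketch below is illustrative only: `Conversation`, `Turn`, `add_turn`, and `switch_mode` are assumed names, not the product's data model.

```python
from dataclasses import dataclass, field
from typing import Literal
from uuid import uuid4

Mode = Literal["voice", "text"]


@dataclass
class Turn:
    mode: Mode  # channel the turn arrived on
    role: str   # "user" or "assistant"
    text: str   # for voice turns, the transcript


@dataclass
class Conversation:
    conversation_id: str = field(default_factory=lambda: uuid4().hex)
    mode: Mode = "text"
    turns: list[Turn] = field(default_factory=list)

    def add_turn(self, role: str, text: str) -> None:
        # Each turn remembers the channel that was active when it was added.
        self.turns.append(Turn(mode=self.mode, role=role, text=text))

    def switch_mode(self, mode: Mode) -> None:
        # Only the channel changes; the ID and the history stay as they are.
        self.mode = mode
```

A user who starts with `Conversation(mode="voice")`, adds a turn, and then calls `switch_mode("text")` keeps the same `conversation_id`, and the earlier voice turn remains in `turns` as a transcript.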

The reverse works too: during a text chat the user clicks the mic icon and the next prompt comes in by voice. The bot replies in text or voice depending on the user's preference (preference is persistent, overridable per session).
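The preference rule can be read as a simple fallback: a per-session override wins when it is set, otherwise the stored preference applies. The function name `resolve_reply_mode` and its parameters are assumptions for illustration.

```python
from typing import Optional

VALID_MODES = {"voice", "text"}


def resolve_reply_mode(stored_preference: str,
                       session_override: Optional[str] = None) -> str:
    """Pick the reply channel: the session override beats the stored preference."""
    mode = session_override if session_override is not None else stored_preference
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reply mode: {mode!r}")
    return mode
```

For example, `resolve_reply_mode("text", session_override="voice")` returns `"voice"` for that session and leaves the stored preference untouched.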

The hybrid use-case

"Tell me how tomorrow's shipping plan looks" — by voice. The bot summarises by voice (3 sentences) and shows the details in text (a table). The user hears the gist, sees the detail. This hybrid mode is the most common, and the most popular in user feedback.

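A hybrid reply can be modelled as one payload with two parts: a short spoken summary and the full detail for the screen. The following is a sketch under assumptions: `HybridReply` and `build_hybrid_reply` are hypothetical names, and the sentence splitting is deliberately naive.

```python
import re
from dataclasses import dataclass

# The article mentions a three-sentence spoken summary.
MAX_SPOKEN_SENTENCES = 3


@dataclass
class HybridReply:
    spoken: str       # read aloud
    rows: list[dict]  # rendered as a table in the text channel


def build_hybrid_reply(summary: str, rows: list[dict]) -> HybridReply:
    # Naive split on '.', '!' or '?' followed by whitespace; enough for a sketch,
    # but it would mis-split abbreviations such as "e.g. this".
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    spoken = " ".join(sentences[:MAX_SPOKEN_SENTENCES])
    return HybridReply(spoken=spoken, rows=rows)
```

The spoken part is capped at three sentences, while the rows pass through unchanged so that the user can see the full detail.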
The choice in practice

No strategy required. The user uses whatever is convenient at the moment. The system is set up so both channels deliver the same thing and switching is frictionless.

In our measurements, the typical power user runs 70% text, 25% voice, 5% hybrid. Mobile-first users run 60% voice, 35% text, 5% hybrid. Neither is wrong: the system supports both.

What we are building next

Three things: (1) ambient mode, where the voice channel stays open in the background and the bot speaks only for important events ("new high-priority refund landed"); (2) multilingual voice within the same session (HU prompt, EN reply, etc.); (3) a voice onboarding mode for new users, in which the bot gives a 90-second spoken tour of the surface at first login.