Four layers.
One real call.
The architecture below is what happens when Rosa, calling from Houston in Spanish, reaches Maya at a clinic that operates in English. Same pipeline, regardless of which doorway the call comes through.
Stream audio in, generate a per-call speaker embedding, render in the listener's ear before the speaker has finished.
Glossary-aware NMT runs token-by-token. Late corrections append to the transcript instead of rewriting it.
Every utterance ships with a C2PA-signed manifest. PII redaction is designed to run before logs hit disk, under a redaction policy that is per-tenant and audit-logged.
One pipeline, six adapters: contact center, telephony, AI voice agents, CRM, EHR, browser. Same model, different doorway.
Rosa says “Buenas tardes.” Two-thirds of a second later, Maya hears it in English, in Rosa's own voice: same warmth, same hesitation, just a different language.
Maya picks up. The pipeline is already there.
Twelve hundred milliseconds between Rosa's first ring and a fully provisioned, language-detected, region-pinned session ready to stream voice in both directions. Here's how that 1.2 seconds breaks down.
Their words, in your voice. Your words, in theirs.
TransVoix doesn't replace voices with synthetic narration. It clones each speaker from a few seconds of audio, then carries that voice across the language barrier in both directions, so customers and agents recognize each other through accent, cadence and warmth, not just words.
Maya speaks English in her own voice. Rosa hears Spanish in Maya's voice.
Rosa speaks Spanish in her own voice. Maya hears English in Rosa's voice.
What we don't do: read a translation back in a generic voice. The cloned voice is ephemeral, generated per-call, never stored, never re-used. Voice consent & ethics →
How we hit sub-second.
Most translation pipelines are slow because each stage waits for the last to finish. TransVoix streams every stage in parallel: a 50ms chunk of speech is already being translated while the next chunk is being captured, and the first cloned syllable lands in the listener's ear before the speaker has finished their sentence.
End-to-end p50 across our 30 production language pairs is 740ms. For comparison, a typical international call already has 200–400ms of network latency before any translation happens. Per-stage budgets above are illustrative engineering allocations, not benchmarked stage timings.
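The chunk-wise flow described above can be sketched with Python generators. This is a minimal illustration, not the TransVoix implementation: the stage names, the event log, and the integer "chunks" are all assumptions, and generators only interleave stages rather than running them in true parallel. What it does show is the key property: the first chunk is played before the second is even captured.

```python
from typing import Iterator, List

CHUNK_MS = 50  # chunk size quoted in the text; everything else is illustrative


def capture(total_ms: int, log: List[str]) -> Iterator[int]:
    """Yield the start offset of each captured chunk (stand-in for audio)."""
    for start in range(0, total_ms, CHUNK_MS):
        log.append(f"capture {start}")
        yield start


def translate(chunks: Iterator[int], log: List[str]) -> Iterator[int]:
    """Hypothetical incremental translation: handles one chunk at a time."""
    for start in chunks:
        log.append(f"translate {start}")
        yield start


def synthesize(chunks: Iterator[int], log: List[str]) -> Iterator[int]:
    """Hypothetical voice rendering: emits audio as soon as a chunk arrives."""
    for start in chunks:
        log.append(f"play {start}")
        yield start


def run(total_ms: int) -> List[str]:
    """Drive the chained stages and return the order in which events occurred."""
    log: List[str] = []
    for _ in synthesize(translate(capture(total_ms, log), log), log):
        pass
    return log
```

For a 150 ms utterance, `run(150)` logs `play 0` before `capture 50`: no stage waits for the whole utterance. A production system would presumably run the stages concurrently (threads, async tasks, or separate services) rather than in one interleaved loop.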
Eighteen seconds in, Rosa clarifies what she meant. We don’t rewrite the past — we append a correction that’s timestamped, visible, and part of the record forever.
We don't edit the past. We add to it.
Streaming translators get faster by guessing, and slower by walking the guesses back. TransVoix never rewrites what was said. When a later word changes the meaning of an earlier one, we issue a correction that's appended to the transcript, timestamped, and visible to both parties. Nothing disappears.
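One way to picture "append, never rewrite" is an append-only log in which a correction is a new entry that points back at the one it amends. The sketch below is an assumption about shape only: the `Entry` fields, the `Transcript` class, and the sample sentences are hypothetical and not taken from the product.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: an entry cannot be mutated once written
class Entry:
    seq: int                        # position in the transcript
    t_ms: int                       # timestamp relative to call start
    speaker: str
    text: str
    corrects: Optional[int] = None  # seq of the entry this one amends, if any


class Transcript:
    """Append-only transcript: entries are added, never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def say(self, t_ms: int, speaker: str, text: str) -> Entry:
        return self._append(t_ms, speaker, text, None)

    def correct(self, t_ms: int, target_seq: int, text: str) -> Entry:
        if not 0 <= target_seq < len(self._entries):
            raise IndexError(f"no entry with seq {target_seq}")
        target = self._entries[target_seq]
        return self._append(t_ms, target.speaker, text, target_seq)

    def _append(self, t_ms: int, speaker: str, text: str,
                corrects: Optional[int]) -> Entry:
        entry = Entry(len(self._entries), t_ms, speaker, text, corrects)
        self._entries.append(entry)
        return entry

    @property
    def entries(self) -> Tuple[Entry, ...]:
        """Read-only view, so callers cannot rewrite history."""
        return tuple(self._entries)
```

After a correction, the original entry is still there with its original text and timestamp; the correction sits beside it with a `corrects` pointer, so both parties can see what was first said and what it was amended to.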
6 live · 9 in beta · 13 on the roadmap.
Voice-to-voice across 30 bidirectional pairs. Glossary support for industry-specific vocabulary. LLM-Judge benchmark: 4.72 / 5 average.
At 4:02, the call ends. The cloned voices are deleted from the inference pool. The transcript ships with a C2PA-signed manifest binding every word to its source.
Every word, signed.
Every translated utterance ships with a C2PA-signed manifest binding it to the speaker, the source audio hash, the model version, and the timestamp. If a recording shows up later in court, in compliance review, or in the press, you can prove what was said, who said it, and that nothing in between was altered.
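To make the binding concrete, here is a simplified sketch of a manifest that ties an utterance to its speaker, source audio hash, model version, and timestamp. It is deliberately not C2PA: real C2PA manifests use certificate-based signatures, whereas this example substitutes an HMAC with a shared key from the Python standard library. The field names and functions are assumptions for illustration.

```python
import hashlib
import hmac
import json
from typing import Dict


def build_manifest(speaker_id: str, source_audio: bytes, model_version: str,
                   timestamp: str, translated_text: str) -> Dict[str, str]:
    """Collect the fields the text says each utterance is bound to."""
    return {
        "speaker": speaker_id,
        "source_audio_sha256": hashlib.sha256(source_audio).hexdigest(),
        "model_version": model_version,
        "timestamp": timestamp,
        "text": translated_text,
    }


def sign(manifest: Dict[str, str], key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding (stand-in for a C2PA signature)."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(manifest: Dict[str, str], signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(manifest, key), signature)
```

Changing any bound field after signing, whether the text, the timestamp, or the audio hash, makes `verify` return `False`, which is the property that lets a recording be checked later.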
We’ve been describing one call through Twilio. The pipeline is the same regardless of how the audio arrives — what changes is the adapter, not the model.
Six surfaces. One conversation.
TransVoix lives where your conversations already happen. The pipeline above is the same regardless of how a call enters or exits; what changes is the adapter, not the model. See full integration matrix on /for-business →
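The "adapter changes, model doesn't" idea can be sketched as one shared pipeline function behind a small adapter interface. Everything here is hypothetical: the `Adapter` protocol, the in-memory adapter, and the placeholder `shared_pipeline` are illustrative names, not the TransVoix integration API.

```python
from typing import Iterable, Iterator, List, Protocol


class Adapter(Protocol):
    """What a surface (telephony, browser, CRM, ...) must provide."""
    name: str

    def inbound(self) -> Iterator[bytes]: ...
    def outbound(self, audio: bytes) -> None: ...


class InMemoryAdapter:
    """Test double standing in for any real surface."""

    def __init__(self, name: str, frames: Iterable[bytes]) -> None:
        self.name = name
        self._frames = list(frames)
        self.sent: List[bytes] = []

    def inbound(self) -> Iterator[bytes]:
        return iter(self._frames)

    def outbound(self, audio: bytes) -> None:
        self.sent.append(audio)


def shared_pipeline(frame: bytes) -> bytes:
    """Placeholder for the translation pipeline; identical for every adapter."""
    return frame.upper()


def run_call(adapter: Adapter) -> int:
    """Pump frames from the adapter through the shared pipeline and back out."""
    count = 0
    for frame in adapter.inbound():
        adapter.outbound(shared_pipeline(frame))
        count += 1
    return count
```

Two adapters fed the same frames produce the same output, because the only code that differs between surfaces is how audio gets in and out.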
Where your data lives, and how it leaves.
TransVoix processes voice in the region you choose, is designed to redact PII before any log touches disk, and never trains on customer audio. A customer-VPC deployment of the whole pipeline is on the roadmap: the architecture stays the same and only the operator changes.
Regional processing
US region today (us-east-1). Multi-region (EU, APAC) is on the architecture roadmap — when it ships, audio won't cross regions and cloned voice models will be pinned to the source region's keys.
PII redaction in-flight
The transcription layer is designed to redact card numbers, SSNs, account IDs and named entities before logs hit disk. Redaction policy is per-tenant and audit-logged.
Customer-controlled deployment
Roadmap: a customer-VPC deployment of the same pipeline so you keep the audio, the keys, and the audit log. Available today: tenant isolation via RLS, BAA at pilot onboarding, regional pinning.
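As a rough picture of redaction before logs hit disk, the sketch below scrubs text before it reaches a log sink. It is an assumption-heavy simplification: the regular expressions cover only card-like digit runs and US SSN formats, the `RedactingLog` class is hypothetical, and account IDs and named entities are left out because they would need tenant-specific formats and an entity-recognition model.

```python
import re
from typing import List

# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# US Social Security number in the common 3-2-4 hyphenated form.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace card numbers and SSNs with fixed placeholders."""
    text = SSN_RE.sub("[SSN]", text)
    return CARD_RE.sub("[CARD]", text)


class RedactingLog:
    """Log sink that only ever stores redacted text."""

    def __init__(self) -> None:
        self.lines: List[str] = []  # stand-in for the on-disk log

    def write(self, text: str) -> None:
        # Redaction happens before the append, so raw PII is never stored.
        self.lines.append(redact(text))
```

The design point is the ordering inside `write`: the raw string never reaches the sink, so there is nothing to clean up afterwards.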
See it on a real call.
30 days · 500 minutes · 3–5 agents · integration scoped to your stack. Your own traffic, your own glossary, your own metrics. The pilot is free; we agree the scope together on a discovery call first.