How it works

Four layers.
One real call.

The architecture below is what happens when Rosa, calling from Houston in Spanish, reaches Maya at a clinic that operates in English. Same pipeline, regardless of which doorway the call comes through.

The architecture · p50 740ms

↓ VOICE IN

Integration

CCT · TEL · CRM · EHR

Trust

C2PA · PII · residency

Language

Translate · correct

Voice

Capture · clone

↓ VOICE OUT

One real call · CALL_8F3E REC

+0.00s L01

Ring

Twilio SIP trunk · es-MX detected

+1.42s L01

Maya: "Hello"

Speaker embedding warmed · ready

+2.84s L02

Rosa hears it · Spanish

Cloned in Maya's voice · 740ms p50

+18.0s L02

Correction appended

"my account" → "my mother's account"

+4:02 L03

Manifest signed

C2PA · 47 utterances · 2 corrections

Five events · Four minutes · One conversation

01 Voice

Capture & clone

Stream audio in, generate a per-call speaker embedding, render in the listener's ear before the speaker has finished.

02 Language

Translate & correct

Glossary-aware NMT runs token-by-token. Late corrections append to the transcript instead of rewriting it.

03 Trust

Attest & redact

Every utterance ships with a C2PA-signed manifest. PII redaction is designed to run before logs hit disk, and the redaction policy is per-tenant and audit-logged.

04 Integration

Where it plugs in

One pipeline, six adapters: contact center, telephony, AI voice agents, CRM, EHR, browser. Same model, different doorway.

LAYER 01 · Voice

Rosa says “Buenas tardes.” About three-quarters of a second later, Maya hears it in English, in Rosa's own voice: same warmth, same hesitation, just a different language.

Before the first word · 1.2 seconds

Maya picks up. The pipeline is already there.

Twelve hundred milliseconds pass between Rosa's first ring and a fully provisioned, language-detected, region-pinned session ready to stream voice in both directions. Here's how those 1.2 seconds break down.

+0.00s INGRESS SIP INVITE arrives at TransVoix media bridge

+0.14s AUTH Tenant matched · region pinned to us-east-1

+0.31s SESSION Agent embedding (agt_maya_8f3e) warmed in GPU pool

+0.62s LANG DETECT Caller audio classified · es-MX (confidence 0.97)

+1.20s READY Both legs joined · pipeline streaming · p50 740ms
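The timeline above lists cumulative offsets; the time spent in each step is the difference between consecutive offsets. A minimal sketch of that arithmetic, with the offsets copied from the timeline (the names `SETUP_EVENTS` and `step_durations` are illustrative, not TransVoix identifiers):

```python
# Cumulative offsets in seconds, copied from the setup timeline above.
SETUP_EVENTS = [
    ("INGRESS", 0.00),
    ("AUTH", 0.14),
    ("SESSION", 0.31),
    ("LANG DETECT", 0.62),
    ("READY", 1.20),
]


def step_durations(events):
    """Return (step, milliseconds since the previous event) for every event after the first."""
    out = []
    for (_, prev_t), (name, t) in zip(events, events[1:]):
        out.append((name, round((t - prev_t) * 1000)))
    return out


durations = step_durations(SETUP_EVENTS)
for name, ms in durations:
    print(f"{name:<12} {ms:>4} ms")
```

Read this way, the largest share of the 1.2 seconds (580 ms) falls between language detection and READY, when both call legs are joined.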

Voice cloning · bidirectional

Their words, in your voice. Your words, in theirs.

TransVoix doesn't replace voices with synthetic narration. It clones each speaker from a few seconds of audio, then carries that voice across the language barrier in both directions, so customers and agents recognize each other through accent, cadence and warmth, not just words.

Maya · Agent

English (US)

VOICE → VOICE

Rosa · Caller

hearing Spanish (MX)

Maya speaks English in her own voice. Rosa hears Spanish in Maya's voice.

Rosa · Caller

Spanish (MX)

VOICE → VOICE

Maya · Agent

hearing English (US)

Rosa speaks Spanish in her own voice. Maya hears English in Rosa's voice.

What we don't do: read a translation back in a generic voice. The cloned voice is ephemeral, generated per-call, never stored, never re-used. Voice consent & ethics →

Latency budget

How we hit sub-second.

Most translation pipelines are slow because each stage waits for the last to finish. TransVoix streams every stage in parallel: a 50ms chunk of speech is already being translated while the next chunk is being captured, and the first cloned syllable lands in the listener's ear before the speaker has finished their sentence.
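The difference between batch and streaming processing can be sketched with Python generators. This is a simplified illustration, not the TransVoix pipeline: the three stage functions are hypothetical, chunks are plain integers, and generators interleave work on a single thread rather than running stages truly in parallel (a real system would likely use threads or async tasks). What it does show is that the first chunk is rendered before the last chunk has even been captured.

```python
from typing import Iterator, List, Tuple

# (stage, chunk index) in the order the work actually happens.
events: List[Tuple[str, int]] = []


def capture(n_chunks: int) -> Iterator[int]:
    """Hypothetical capture stage: emits one chunk at a time."""
    for i in range(n_chunks):
        events.append(("capture", i))
        yield i


def translate(chunks: Iterator[int]) -> Iterator[int]:
    """Hypothetical translate stage: handles each chunk as soon as it arrives."""
    for i in chunks:
        events.append(("translate", i))
        yield i


def render(chunks: Iterator[int]) -> None:
    """Hypothetical render stage: consumes translated chunks one by one."""
    for i in chunks:
        events.append(("render", i))


render(translate(capture(3)))
print(events)
```

Because each stage pulls one chunk at a time from the stage before it, chunk 0 passes through all three stages before chunk 1 is captured.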

Stage | Budget (p50) | What happens
Capture | ≤ 50ms | Streaming VAD on 20ms frames; chunked at phrase boundaries.
Speech → Text | ≤ 180ms | Multilingual ASR with overlap windowing for late corrections.
Translate | ≤ 220ms | Glossary-aware NMT, streaming token-by-token.
Voice Clone | ≤ 160ms | Pre-warmed speaker embedding; first phoneme out at 80ms.
Render & Stream | ≤ 130ms | 48kHz output spliced back into the existing call leg.
End-to-end | ≤ 740ms | p50 measured across our 30 production language pairs.

End-to-end p50 across our 30 production language pairs is 740ms. For comparison, a typical international call already has 200–400ms of network latency before any translation happens. Per-stage budgets above are illustrative engineering allocations, not benchmarked stage timings.
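The per-stage allocations in the table above add up to the 740ms end-to-end figure. A short sketch that checks the sum and flags any stage that exceeds its allocation; the budgets are copied from the table, while the `over_budget` helper and the sample measurements are made up for illustration:

```python
# Per-stage p50 budgets in milliseconds, copied from the table above.
STAGE_BUDGET_MS = {
    "capture": 50,
    "speech_to_text": 180,
    "translate": 220,
    "voice_clone": 160,
    "render_and_stream": 130,
}
END_TO_END_BUDGET_MS = 740


def over_budget(measured_ms):
    """Return {stage: overrun in ms} for every stage slower than its budget."""
    return {
        stage: ms - STAGE_BUDGET_MS[stage]
        for stage, ms in measured_ms.items()
        if ms > STAGE_BUDGET_MS[stage]
    }


# The stage budgets should account for the whole end-to-end budget.
assert sum(STAGE_BUDGET_MS.values()) == END_TO_END_BUDGET_MS

# Hypothetical measurements for one utterance (invented numbers).
sample = {
    "capture": 42,
    "speech_to_text": 195,
    "translate": 210,
    "voice_clone": 150,
    "render_and_stream": 120,
}
print(over_budget(sample))
```

In this invented sample the utterance as a whole stays under 740ms even though one stage overruns, which is why per-stage and end-to-end checks are worth keeping separate.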

LAYER 02 · Language

Eighteen seconds in, Rosa clarifies what she meant. We don’t rewrite the past — we append a correction that’s timestamped, visible, and part of the record forever.

Append-only correction

We don't edit the past. We add to it.

Streaming translators get faster by guessing, and slower by walking the guesses back. TransVoix never rewrites what was said. When a later word changes the meaning of an earlier one, we issue a correction that's appended to the transcript, timestamped, and visible to both parties. Nothing disappears.

Live transcript · call_8f3e REC

00:14.220 Rosa

Necesito hablar sobre mi cuenta.

I need to talk about my account.

00:18.880 Rosa

…la cuenta de mi madre, en realidad.

…my mother's account, actually.

00:19.120 ↳ Correction · "my account" → "my mother's account"

00:23.410 Maya

Of course. Can you give me her account number?

Por supuesto. ¿Me puede dar el número de cuenta?
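The transcript above behaves like an append-only log: a correction is a new entry that points back at an earlier one, and the earlier entry is never touched. A minimal sketch of such a structure, assuming a simple in-memory list (the `Entry` and `Transcript` classes are hypothetical, not the TransVoix data model):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: an entry cannot be modified once created
class Entry:
    timestamp: str
    speaker: str
    text: str
    corrects: Optional[int] = None  # index of the earlier entry this one corrects


class Transcript:
    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, timestamp: str, speaker: str, text: str) -> int:
        """Add an utterance and return its index."""
        self._entries.append(Entry(timestamp, speaker, text))
        return len(self._entries) - 1

    def correct(self, timestamp: str, target: int, old: str, new: str) -> int:
        """Append a correction that refers to entry `target`; the original stays as-is."""
        original = self._entries[target]
        if old not in original.text:
            raise ValueError(f"{old!r} not found in entry {target}")
        note = f'"{old}" → "{new}"'
        self._entries.append(Entry(timestamp, original.speaker, note, corrects=target))
        return len(self._entries) - 1

    @property
    def entries(self) -> Tuple[Entry, ...]:
        return tuple(self._entries)  # read-only view for callers

    def correction_count(self) -> int:
        return sum(1 for e in self._entries if e.corrects is not None)


t = Transcript()
first = t.append("00:14.220", "Rosa", "I need to talk about my account.")
t.append("00:18.880", "Rosa", "…my mother's account, actually.")
t.correct("00:19.120", first, "my account", "my mother's account")
```

After the correction, the first entry still reads "I need to talk about my account."; the revised meaning lives in a third entry that references it.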

Language coverage

6 live · 9 in beta · 13 on the roadmap.

Voice-to-voice across 30 bidirectional pairs. Glossary support for industry-specific vocabulary. LLM-Judge benchmark: 4.72 / 5 average.

See all languages →

LAYER 03 · Trust

At 4:02, the call ends. The cloned voices are deleted from the inference pool. The transcript ships with a C2PA-signed manifest binding every word to its source.

Trust · provenance

Every word, signed.

Every translated utterance ships with a C2PA-signed manifest binding it to the speaker, the source audio hash, the model version, and the timestamp. If a recording shows up later in court, in compliance review, or in the press, you can prove what was said, who said it, and that nothing in between was altered.

✓ Verified · C2PA v1.4

{
  "speaker_id": "agt_maya_8f3e",
  "source_hash": "sha256:9c3a…b21f",
  "model": "tvx-clone-v3.2",
  "language": "en-US → es-MX",
  "timestamp": "2026-04-27T14:22:08Z",
  "corrections": 2,
  "signed_by": "transvoix.ai"
}

What's in the manifest

Speaker identity binds the cloned voice to a verified agent or caller, not a free-floating impersonation.

Source hash is a fingerprint of the original audio — proof the translation is downstream of a real utterance, not synthesized from text.

Model version pins the exact model that produced the output, so the same input, re-verified six months later, produces the same output. Reproducible, not magic.

Corrections count tells you whether the line you're reading is the first take or a revision — and lets you replay the full append-only history.
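The idea behind the fields above (hash the source audio, bind it to metadata, sign the result) can be sketched with the Python standard library. This is not the C2PA manifest format or its signing scheme: C2PA relies on certificate-based signatures, whereas this sketch substitutes a symmetric HMAC-SHA256 with a hard-coded demo key purely to show the tamper-evidence property. The field names follow the example manifest; the helper functions and the key are assumptions.

```python
import hashlib
import hmac
import json

# Hypothetical demo key; a real signer would use a private key held in a KMS or HSM.
DEMO_KEY = b"demo-signing-key"


def build_manifest(audio: bytes, speaker_id: str, model: str,
                   language: str, timestamp: str, corrections: int) -> dict:
    """Bind the metadata to a SHA-256 fingerprint of the source audio."""
    return {
        "speaker_id": speaker_id,
        "source_hash": "sha256:" + hashlib.sha256(audio).hexdigest(),
        "model": model,
        "language": language,
        "timestamp": timestamp,
        "corrections": corrections,
    }


def _canonical(manifest: dict) -> bytes:
    # Sorted keys and fixed separators so the same manifest always serializes identically.
    return json.dumps(manifest, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sign(manifest: dict, key: bytes = DEMO_KEY) -> str:
    return hmac.new(key, _canonical(manifest), hashlib.sha256).hexdigest()


def verify(manifest: dict, signature: str, audio: bytes, key: bytes = DEMO_KEY) -> bool:
    """True only if the audio matches the recorded hash and the manifest is unmodified."""
    expected_hash = "sha256:" + hashlib.sha256(audio).hexdigest()
    return (manifest.get("source_hash") == expected_hash
            and hmac.compare_digest(sign(manifest, key), signature))


audio = b"\x00\x01fake-pcm-bytes"  # stand-in for real audio
manifest = build_manifest(audio, "agt_maya_8f3e", "tvx-clone-v3.2",
                          "en-US → es-MX", "2026-04-27T14:22:08Z", 2)
signature = sign(manifest)
print(verify(manifest, signature, audio))                        # untouched manifest
print(verify({**manifest, "corrections": 0}, signature, audio))  # altered field
```

Changing any field, or presenting different audio, makes verification fail, which is the property the manifest relies on.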

LAYER 04 · Integration

We’ve been describing one call through Twilio. The pipeline is the same regardless of how the audio arrives — what changes is the adapter, not the model.

Integrations · where it plugs in

Six surfaces. One conversation.

TransVoix lives where your conversations already happen. The pipeline above is the same regardless of how a call enters or exits; what changes is the adapter, not the model. See the full integration matrix on /for-business →

CCT

Contact Center

Genesys · Five9 · Talkdesk

TEL

Telephony

Twilio · Vonage · SIP trunks

AIV

AI Voice Agents

Vapi · Retell · custom LLM

CRM

Salesforce · HubSpot · Zendesk

EHR

Epic · Cerner · athenahealth

BRW

Browser

Chrome extension · WebRTC
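"Different adapter, same pipeline" is a familiar pattern: each surface normalizes its input into a common stream of audio frames, and one shared function consumes that stream. A rough sketch, assuming a hypothetical `Adapter` interface; none of these class or function names come from TransVoix, and the pipeline body is a placeholder:

```python
from typing import Iterable, Iterator, List, Protocol


class Adapter(Protocol):
    """Hypothetical interface: anything that can yield audio frames."""
    name: str

    def frames(self) -> Iterator[bytes]: ...


class SipAdapter:
    name = "telephony"

    def __init__(self, packets: Iterable[bytes]) -> None:
        self._packets = packets

    def frames(self) -> Iterator[bytes]:
        yield from self._packets


class BrowserAdapter:
    name = "browser"

    def __init__(self, blobs: Iterable[bytes]) -> None:
        self._blobs = blobs

    def frames(self) -> Iterator[bytes]:
        # Illustrative normalization step: skip empty blobs.
        for blob in self._blobs:
            if blob:
                yield blob


def run_pipeline(adapter: Adapter) -> List[str]:
    """Placeholder for the shared capture, translate and clone stages."""
    return [f"processed {len(frame)} bytes" for frame in adapter.frames()]


print(run_pipeline(SipAdapter([b"ab", b"cde"])))
print(run_pipeline(BrowserAdapter([b"ab", b"", b"cde"])))
```

Both calls go through the same `run_pipeline`; only the adapter that feeds it differs.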

Data handling · residency

Where your data lives, and how it leaves.

TransVoix processes voice in the region your tenant is pinned to (us-east-1 today), is designed to redact PII before any log touches disk, and never trains on customer audio. A customer-VPC deployment of the same pipeline is on the roadmap: the architecture stays the same; only the operator changes.

Regional processing

US region today (us-east-1). Multi-region (EU, APAC) is on the architecture roadmap — when it ships, audio won't cross regions and cloned voice models will be pinned to the source region's keys.

today: us-east-1 · roadmap: eu · apac

PII redaction in-flight

The transcription layer is designed to redact card numbers, SSNs, account IDs and named entities before logs hit disk. Redaction policy is per-tenant and audit-logged.

redact_policy=per_tenant
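Redacting before logging, under a per-tenant policy, might look roughly like the sketch below. The regular expressions are deliberately simple illustrations (real card and SSN detection needs checksum validation and more formats), and the tenant names, `TENANT_POLICY` table and `log_line` helper are all hypothetical:

```python
import re
from typing import List

# Illustrative patterns only; production-grade detection would need more care.
PATTERNS = {
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13 to 16 digits, optional separators
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical per-tenant policy: which kinds of PII each tenant redacts.
TENANT_POLICY = {
    "clinic_demo": ("card", "ssn"),
    "retail_demo": ("card",),
}


def redact(text: str, tenant: str) -> str:
    """Replace every match of the tenant's configured patterns with a placeholder."""
    for kind in TENANT_POLICY[tenant]:
        text = PATTERNS[kind].sub(f"[{kind.upper()}]", text)
    return text


def log_line(text: str, tenant: str, sink: List[str]) -> None:
    """Only the redacted form is ever handed to the sink."""
    sink.append(redact(text, tenant))


log: List[str] = []
line = "Card 4111 1111 1111 1111, SSN 123-45-6789."
log_line(line, "clinic_demo", log)
log_line(line, "retail_demo", log)
print(log)
```

The point of routing every write through `log_line` is that the raw text never reaches storage; the policy decides what gets masked, not whether masking runs.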

Customer-controlled deployment

Roadmap: a customer-VPC deployment of the same pipeline so you keep the audio, the keys, and the audit log. Available today: tenant isolation via RLS, BAA at pilot onboarding, regional pinning.

deploy=managed · vpc=roadmap

See it on a real call.

30 days · 500 minutes · 3–5 agents · integration scoped to your stack. Your own traffic, your own glossary, your own metrics. The pilot is free; we agree on the scope together on a discovery call first.

Start a pilot → See pricing
