Live Transcription WebSocket API

This documentation explains how to connect to the live transcription WebSocket API, push audio, and receive real-time transcription or translation. There are two main roles: sender and listener.

Roles

Sender

  • Push audio to the server and optionally see transcription.
  • Requires sender_token.
  • Only one sender per live session.
  • Audio format: PCM 16-bit, 16 kHz, base64-encoded.

Listener

  • Receive real-time transcription/translation only.
  • Requires listener_token.
  • Multiple listeners allowed.

WebSocket URL

To send audio data:

wss:///ws/live/?token_sender=<token_sender>

To read the transcription:

wss:///ws/live/?token_sender=<token_listener>

  • <token_sender>: the token issued for the sender when the live session was created.
  • <token_listener>: the token issued for listeners when the live session was created.
  • Use wss:// for secure WebSocket connections.
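
As an illustration, a minimal connection sketch in Python using the third-party websockets package (the host placeholder <host> and the token value are assumptions to be substituted with your live session's values, not part of the API):

import asyncio
import websockets

# Placeholder host and token -- substitute the values of your live session.
LISTENER_URL = "wss://<host>/ws/live/?token_sender=<token_listener>"

async def open_connection(url: str) -> None:
    # Open the WebSocket and print the server's first message.
    async with websockets.connect(url) as ws:
        greeting = await ws.recv()
        print("Server says:", greeting)

asyncio.run(open_connection(LISTENER_URL))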

Connection Flow

Step 1: Connect

  • The client opens a WebSocket connection to the URL.
  • The server responds with a JSON message confirming the connection.

Step 2: Role-specific handling

Listener

  • Simply receives transcription data, in the format described under Transcription Format below.

Sender

  • Must send a configuration JSON before sending audio.
  • Server responds with:
{
  "status": "ASR process started"
}
  • Once the model is loaded, the server sends:

{
  "status": "started"
}

Configuration JSON

Example configuration:

{
  "type": "config",
  "min_buffer": 2.0,
  "max_buffer": 4.0,
  "max_chars": 40,
  "max_lines": 2
}
  • min_buffer: Minimum audio buffer in seconds.
  • max_buffer: Maximum audio buffer in seconds.
  • max_chars: Maximum characters per subtitle line.
  • max_lines: Maximum lines per subtitle block.

For live sessions using an audio stream (HLS, RTMP, RTSP), the audio URL is predefined in the session. The sender still sends the config, but pushing audio is not required.
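
As a sketch of the sender-side handshake (Python with the websockets package; the URL is a placeholder, and the two recv() calls assume the status messages arrive in the order described above):

import asyncio
import json
import websockets

SENDER_URL = "wss://<host>/ws/live/?token_sender=<token_sender>"  # placeholder

CONFIG = {
    "type": "config",
    "min_buffer": 2.0,   # minimum audio buffer, in seconds
    "max_buffer": 4.0,   # maximum audio buffer, in seconds
    "max_chars": 40,     # maximum characters per subtitle line
    "max_lines": 2,      # maximum lines per subtitle block
}

async def start_session() -> None:
    async with websockets.connect(SENDER_URL) as ws:
        await ws.recv()                    # connection greeting
        await ws.send(json.dumps(CONFIG))  # send the configuration JSON
        print(await ws.recv())             # expected: {"status": "ASR process started"}
        print(await ws.recv())             # expected: {"status": "started"}

asyncio.run(start_session())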

Sending Audio

  • Audio must be PCM 16-bit, 16 kHz.
  • Encode each raw PCM chunk as base64 before sending it in a JSON message:
{
  "type": "audio",
  "data": "<base64-encoded PCM chunk>"
}
  • To indicate the end of transmission:
{
  "type": "FINISH"
}
  • Closing the WebSocket also stops the ASR process.
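
For illustration, a sketch of pushing base64-encoded PCM chunks over an already-open sender connection (ws is an open WebSocket as in the previous sketch; pcm_chunks is assumed to yield raw 16-bit, 16 kHz mono PCM byte chunks):

import base64
import json

async def push_audio(ws, pcm_chunks) -> None:
    for chunk in pcm_chunks:
        await ws.send(json.dumps({
            "type": "audio",
            "data": base64.b64encode(chunk).decode("ascii"),
        }))
    # Signal the end of transmission.
    await ws.send(json.dumps({"type": "FINISH"}))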

Transcription Format

The transcription is a JSON object with the following fields:

  • type: "transcription" or "translation"
  • start: start time of the segment, in seconds (float)
  • end: end time of the segment, in seconds (float)
  • text: transcribed text of the segment (string)
  • language: language of the output text

Example

Server sends transcription updates:

{
  "type": "<transcription>",
  "start": 0.0,
  "end": 3.0,
  "text": "Hello world",
  "language": "en"
}

Server sends translation updates:

Translation updates are sent only if the live session was created with translation enabled. The server sends updates for every translated language:

{
  "type": "<translation>",
  "start": 0.0,
  "end": 3.0,
  "text": "Bonjour le monde",
  "language": "fr"
}
{
  "type": "<translation>",
  "start": 0.0,
  "end": 3.0,
  "text": "Ciao mondo",
  "language": "it"
}
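
A minimal listener sketch (Python with the websockets package, placeholder URL) that prints transcription and translation updates as they arrive:

import asyncio
import json
import websockets

LISTENER_URL = "wss://<host>/ws/live/?token_sender=<token_listener>"  # placeholder

async def listen() -> None:
    async with websockets.connect(LISTENER_URL) as ws:
        async for raw in ws:
            try:
                message = json.loads(raw)
            except json.JSONDecodeError:
                continue  # e.g. the initial connection greeting
            if message.get("type") in ("transcription", "translation"):
                print(f"[{message['language']}] "
                      f"{message['start']:.1f}-{message['end']:.1f}s: {message['text']}")

asyncio.run(listen())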

Summary of Flow

Sender (microphone)

  1. Connect WebSocket with sender_token.
  2. Receive "Connexion established".
  3. Send configuration JSON.
  4. Receive status "ASR process started": means the config was received and the ASR process is starting.
  5. Receive status "started": means the transcription model is loaded and ready to transcribe audio.
  6. Push audio chunks.
  7. Receive transcription messages.
  8. Send type "FINISH" or close the WebSocket.
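
Putting these steps together, a compact end-to-end sender sketch (placeholder URL; read_microphone_chunks() is a hypothetical generator yielding 16-bit, 16 kHz PCM byte chunks; in a real client, pushing audio and reading transcriptions would run concurrently):

import asyncio
import base64
import json
import websockets

SENDER_URL = "wss://<host>/ws/live/?token_sender=<token_sender>"  # placeholder

async def run_sender(pcm_chunks) -> None:
    async with websockets.connect(SENDER_URL) as ws:
        print(await ws.recv())                           # 2. connection greeting
        await ws.send(json.dumps({"type": "config",      # 3. configuration JSON
                                  "min_buffer": 2.0, "max_buffer": 4.0,
                                  "max_chars": 40, "max_lines": 2}))
        print(await ws.recv())                           # 4. "ASR process started"
        print(await ws.recv())                           # 5. "started"
        for chunk in pcm_chunks:                         # 6. push audio chunks
            await ws.send(json.dumps({"type": "audio",
                                      "data": base64.b64encode(chunk).decode("ascii")}))
        await ws.send(json.dumps({"type": "FINISH"}))    # 8. end of transmission
        async for raw in ws:                             # 7. remaining transcription messages
            print(raw)

# asyncio.run(run_sender(read_microphone_chunks()))  # read_microphone_chunks() is hypothetical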

Sender (audio stream)

  1. Connect WebSocket with sender_token.
  2. Receive "Connexion established".
  3. Send configuration JSON.
  4. The server automatically starts transcription from the predefined audio URL (no audio push needed); receive transcription messages.
  5. Send type "FINISH" or close the WebSocket.

Listener

  1. Connect WebSocket with listener_token.
  2. Receive "Connexion established".
  3. Listen for transcription messages.
  4. Disconnect when done.

Notes

  • Only one sender is allowed per live session.
  • Multiple listeners can connect simultaneously.
  • Supports both live microphone input and audio stream URLs (HLS, RTMP, RTSP).
  • Audio encoding and chunking must respect 16-bit PCM, 16 kHz, and base64 encoding.
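
As an illustration of the encoding requirement, a small helper (assuming NumPy and audio already captured or resampled to 16 kHz mono float samples) that produces the expected base64-encoded 16-bit PCM:

import base64
import numpy as np

def float_to_pcm16_b64(samples: np.ndarray) -> str:
    # samples: float values in [-1.0, 1.0], 16 kHz, mono.
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("ascii")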
