Live Transcription WebSocket API

This documentation explains how to connect to the live transcription WebSocket API, push audio, and receive real-time transcription or translation. There are two main roles: sender and listener.

Roles

Sender

  • Push audio to the server and optionally see transcription.
  • Requires sender_token.
  • Only one sender per live session.
  • Audio format: PCM 16-bit, 16 kHz, base64-encoded.

Listener

  • Receive real-time transcription/translation only.
  • Requires listener_token.
  • Multiple listeners allowed.

WebSocket URL

To send audio data:

wss:///ws/live/?token_sender=<token_sender>

To read the transcription:

wss:///ws/live/?token_sender=<token_listener>

  • <token_sender>: the token issued for the sender when the live session was created.
  • <token_listener>: the token issued for listeners when the live session was created.
  • Use wss:// for secure WebSocket connections.
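
As an illustration, a minimal connection sketch in Python using the third-party websockets package (the host placeholder <host> and the token value are assumptions to be substituted with your live session's values, not part of the API):

import asyncio
import websockets

# Placeholder host and token -- substitute the values of your live session.
LISTENER_URL = "wss://<host>/ws/live/?token_sender=<token_listener>"

async def open_connection(url: str) -> None:
    # Open the WebSocket and print the server's first message.
    async with websockets.connect(url) as ws:
        greeting = await ws.recv()
        print("Server says:", greeting)

asyncio.run(open_connection(LISTENER_URL))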

Connection Flow

Step 1: Connect

  • The client opens a WebSocket connection to the URL.
  • The server responds with a JSON message confirming the connection.

Step 2: Role-specific handling

Listener

  • Simply receives transcription data, in the format described under Transcription Format below.

Sender

  • Must send a configuration JSON before sending audio.
  • Server responds with:
{
  "status": "ASR process started"
}
  • Once the model is loaded, the server sends:

{
  "status": "started"
}

Configuration JSON

Example configuration:

{
  "type": "config",
  "min_buffer": 2.0,
  "max_buffer": 4.0,
  "max_chars": 40,
  "max_lines": 2
}
  • min_buffer: Minimum audio buffer in seconds.
  • max_buffer: Maximum audio buffer in seconds.
  • max_chars: Maximum characters per subtitle line.
  • max_lines: Maximum lines per subtitle block.

For live sessions using an audio stream (HLS, RTMP, RTSP), the audio URL is predefined in the session. The sender still sends the config, but pushing audio is not required.
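
As a sketch of the sender-side handshake (Python with the websockets package; the URL is a placeholder, and the two recv() calls assume the status messages arrive in the order described above):

import asyncio
import json
import websockets

SENDER_URL = "wss://<host>/ws/live/?token_sender=<token_sender>"  # placeholder

CONFIG = {
    "type": "config",
    "min_buffer": 2.0,   # minimum audio buffer, in seconds
    "max_buffer": 4.0,   # maximum audio buffer, in seconds
    "max_chars": 40,     # maximum characters per subtitle line
    "max_lines": 2,      # maximum lines per subtitle block
}

async def start_session() -> None:
    async with websockets.connect(SENDER_URL) as ws:
        await ws.recv()                    # connection greeting
        await ws.send(json.dumps(CONFIG))  # send the configuration JSON
        print(await ws.recv())             # expected: {"status": "ASR process started"}
        print(await ws.recv())             # expected: {"status": "started"}

asyncio.run(start_session())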

Sending Audio

  • Audio must be PCM 16-bit, 16 kHz.
  • Encode each raw PCM chunk as base64 before sending it in a JSON message:
{
  "type": "audio",
  "data": "<base64-encoded PCM chunk>"
}
  • To indicate the end of transmission:
{
  "type": "FINISH"
}
  • Closing the WebSocket also stops the ASR process.
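
For illustration, a sketch of pushing base64-encoded PCM chunks over an already-open sender connection (ws is an open WebSocket as in the previous sketch; pcm_chunks is assumed to yield raw 16-bit, 16 kHz mono PCM byte chunks):

import base64
import json

async def push_audio(ws, pcm_chunks) -> None:
    for chunk in pcm_chunks:
        await ws.send(json.dumps({
            "type": "audio",
            "data": base64.b64encode(chunk).decode("ascii"),
        }))
    # Signal the end of transmission.
    await ws.send(json.dumps({"type": "FINISH"}))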

Transcription Format

The transcription is a JSON object with the following fields:

  • type: "transcription" or "translation"
  • start: start time of the segment, in seconds (float)
  • end: end time of the segment, in seconds (float)
  • text: transcribed text of the segment (string)
  • language: language of the output text

Example

Server sends transcription updates:

{
  "type": "<transcription>",
  "start": 0.0,
  "end": 3.0,
  "text": "Hello world",
  "language": "en"
}

Server sends translation updates:

Translation updates are sent only if the live session was created with translation enabled. The server sends updates for every translated language:

{
  "type": "<translation>",
  "start": 0.0,
  "end": 3.0,
  "text": "Bonjour le monde",
  "language": "fr"
}
{
  "type": "<translation>",
  "start": 0.0,
  "end": 3.0,
  "text": "Ciao mondo",
  "language": "it"
}
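
A minimal listener sketch (Python with the websockets package, placeholder URL) that prints transcription and translation updates as they arrive:

import asyncio
import json
import websockets

LISTENER_URL = "wss://<host>/ws/live/?token_sender=<token_listener>"  # placeholder

async def listen() -> None:
    async with websockets.connect(LISTENER_URL) as ws:
        async for raw in ws:
            try:
                message = json.loads(raw)
            except json.JSONDecodeError:
                continue  # e.g. the initial connection greeting
            if message.get("type") in ("transcription", "translation"):
                print(f"[{message['language']}] "
                      f"{message['start']:.1f}-{message['end']:.1f}s: {message['text']}")

asyncio.run(listen())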

Summary of Flow

Sender (microphone)

  1. Connect WebSocket with sender_token.
  2. Receive "Connexion established".
  3. Send configuration JSON.
  4. Receive status "ASR process started": means the config was received and the ASR process is starting.
  5. Receive status "started": means the transcription model is loaded and ready to transcribe audio.
  6. Push audio chunks.
  7. Receive transcription messages.
  8. Send type "FINISH" or close the WebSocket.
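
Putting these steps together, a compact end-to-end sender sketch (placeholder URL; read_microphone_chunks() is a hypothetical generator yielding 16-bit, 16 kHz PCM byte chunks; in a real client, pushing audio and reading transcriptions would run concurrently):

import asyncio
import base64
import json
import websockets

SENDER_URL = "wss://<host>/ws/live/?token_sender=<token_sender>"  # placeholder

async def run_sender(pcm_chunks) -> None:
    async with websockets.connect(SENDER_URL) as ws:
        print(await ws.recv())                           # 2. connection greeting
        await ws.send(json.dumps({"type": "config",      # 3. configuration JSON
                                  "min_buffer": 2.0, "max_buffer": 4.0,
                                  "max_chars": 40, "max_lines": 2}))
        print(await ws.recv())                           # 4. "ASR process started"
        print(await ws.recv())                           # 5. "started"
        for chunk in pcm_chunks:                         # 6. push audio chunks
            await ws.send(json.dumps({"type": "audio",
                                      "data": base64.b64encode(chunk).decode("ascii")}))
        await ws.send(json.dumps({"type": "FINISH"}))    # 8. end of transmission
        async for raw in ws:                             # 7. remaining transcription messages
            print(raw)

# asyncio.run(run_sender(read_microphone_chunks()))  # read_microphone_chunks() is hypothetical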

Sender (audio stream)

  1. Connect WebSocket with sender_token.
  2. Receive "Connexion established".
  3. Send configuration JSON.
  4. The server automatically starts transcription from the predefined audio URL (no audio push needed); receive transcription messages.
  5. Send type "FINISH" or close the WebSocket.

Listener

  1. Connect WebSocket with listener_token.
  2. Receive "Connexion established".
  3. Listen for transcription messages.
  4. Disconnect when done.

Notes

  • Only one sender is allowed per live session.
  • Multiple listeners can connect simultaneously.
  • Supports both live microphone input and audio stream URLs (HLS, RTMP, RTSP).
  • Audio encoding and chunking must respect 16-bit PCM, 16 kHz, and base64 encoding.
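
As an illustration of the encoding requirement, a small helper (assuming NumPy and audio already captured or resampled to 16 kHz mono float samples) that produces the expected base64-encoded 16-bit PCM:

import base64
import numpy as np

def float_to_pcm16_b64(samples: np.ndarray) -> str:
    # samples: float values in [-1.0, 1.0], 16 kHz, mono.
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype(np.int16)
    return base64.b64encode(pcm16.tobytes()).decode("ascii")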
