Live Transcription WebSocket API
This documentation explains how to connect to the live transcription WebSocket API, push audio, and receive real-time transcription or translation. There are two main roles: sender and listener.
Roles
Sender
- Push audio to the server and optionally see transcription.
- Requires sender_token.
- Only one sender per live session.
- Audio format: PCM 16-bit, 16 kHz, base64-encoded.
Listener
- Receive real-time transcription/translation only.
- Requires listener_token.
- Multiple listeners allowed.
WebSocket URL
To send audio data:
wss://<host>/ws/live/?token_sender=<token_sender>
To receive transcription:
wss://<host>/ws/live/?token_listener=<token_listener>
- <token_sender>: the sender token of the created live session.
- <token_listener>: the listener token of the created live session.
- Use wss:// for secure WebSocket connections.
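As an illustration, here is a minimal Python sketch for opening a connection in either role. The websockets library, the <host> placeholder, and the connect_as helper are assumptions for the example, not part of this API:

import asyncio
import websockets

async def connect_as(param: str, token: str):
    # <host> is a placeholder: replace it with your deployment's host.
    url = f"wss://<host>/ws/live/?{param}={token}"
    async with websockets.connect(url) as ws:
        # The server's first message confirms the connection.
        print(await ws.recv())

# Sender:   asyncio.run(connect_as("token_sender", "<token_sender>"))
# Listener: asyncio.run(connect_as("token_listener", "<token_listener>"))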
Connection Flow
Step 1: Connect
- Client opens a WebSocket connection to the URL.
- Server responds with a JSON confirmation message ("Connexion established").
Step 2: Role-specific handling
Listener
- Simply receives transcription data, in the format described in the Transcription Format section below.
Sender
- Must send a configuration JSON before sending audio.
- Server responds with:
{
  "status": "ASR process started"
}
- Once the model is loaded, server sends:
{
  "status": "started"
}
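To make this handshake concrete, here is a hedged Python sketch of the sender's start-up sequence. It assumes an already-open connection ws from the websockets library (as in the earlier connection sketch):

import json

async def sender_handshake(ws):
    # First message from the server confirms the connection.
    print(await ws.recv())  # "Connexion established"
    # The sender must send its configuration before any audio.
    config = {
        "type": "config",
        "min_buffer": 2.0,
        "max_buffer": 4.0,
        "max_chars": 40,
        "max_lines": 2,
    }
    await ws.send(json.dumps(config))
    # Two status messages follow: config accepted, then model ready.
    print(await ws.recv())  # {"status": "ASR process started"}
    print(await ws.recv())  # {"status": "started"}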
Configuration JSON
Example configuration:
{
  "type": "config",
  "min_buffer": 2.0,
  "max_buffer": 4.0,
  "max_chars": 40,
  "max_lines": 2
}
- min_buffer: Minimum audio buffer in seconds.
- max_buffer: Maximum audio buffer in seconds.
- max_chars: Maximum characters per subtitle line.
- max_lines: Maximum lines per subtitle block.
For live sessions using an audio stream (HLS, RTMP, RTSP), the audio URL is predefined in the session. The sender still sends the config, but pushing audio is not required.
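For such stream-based sessions, a sender therefore reduces to sending the config and reading results. A minimal sketch under the same assumptions as above (websockets library, placeholder URL):

import asyncio
import json
import websockets

async def stream_sender(url: str):
    async with websockets.connect(url) as ws:
        await ws.recv()  # connection confirmation
        await ws.send(json.dumps({"type": "config", "min_buffer": 2.0,
                                  "max_buffer": 4.0, "max_chars": 40,
                                  "max_lines": 2}))
        # No audio push: the server pulls audio from the session's stream URL
        # and begins sending transcription messages on its own.
        async for message in ws:
            print(json.loads(message))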
Sending Audio
- Audio must be PCM 16-bit, 16 kHz.
- Encode each raw PCM chunk as base64 before sending:
{
  "type": "audio",
  "data": "<base64-encoded PCM chunk>"
}
- To indicate the end of transmission:
{
  "type": "FINISH"
}
- Closing the WebSocket also stops the ASR process.
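Here is a sketch of the audio push loop in Python; pcm_source is a hypothetical iterable of raw 16-bit/16 kHz PCM byte chunks, not part of this API:

import base64
import json

async def push_audio(ws, pcm_source):
    for chunk in pcm_source:
        # Raw bytes are not valid JSON, so each chunk is base64-encoded.
        await ws.send(json.dumps({
            "type": "audio",
            "data": base64.b64encode(chunk).decode("ascii"),
        }))
    # Signal end of transmission; closing the socket would also stop the ASR.
    await ws.send(json.dumps({"type": "FINISH"}))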
Transcription Format
Each transcription message is a JSON object with these fields:
- type: transcription or translation
- start: start time of the segment in seconds (float)
- end: end time of the segment in seconds (float)
- text: transcribed text of the segment (string)
- language: language of the output text
Example
Server sends transcription updates:
{
  "type": "transcription",
  "start": 0.0,
  "end": 3.0,
  "text": "Hello world",
  "language": "en"
}
Server sends translation updates only if the live session was created with translation enabled. It sends one update per translated language:
{
  "type": "translation",
  "start": 0.0,
  "end": 3.0,
  "text": "Bonjour le monde",
  "language": "fr"
}
{
  "type": "translation",
  "start": 0.0,
  "end": 3.0,
  "text": "Ciao mondo",
  "language": "it"
}
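A listener loop that distinguishes the two message types might look like this sketch (again assuming the websockets library; the URL is a placeholder):

import asyncio
import json
import websockets

async def listen(url: str):
    async with websockets.connect(url) as ws:
        await ws.recv()  # connection confirmation
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "transcription":
                print(f"[{msg['start']:.1f}-{msg['end']:.1f}] {msg['text']}")
            elif msg.get("type") == "translation":
                print(f"({msg['language']}) {msg['text']}")

# asyncio.run(listen("wss://<host>/ws/live/?token_listener=<token_listener>"))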
Summary of Flow
Sender (microphone)
- Connect WebSocket with sender_token.
- Receive "Connexion established".
- Send configuration JSON.
- Receive status "ASR process started": the configuration was received and the ASR process is starting.
- Receive status "started": the transcription model is loaded and ready to transcribe audio.
- Push audio chunks.
- Receive transcription messages.
- Send the {"type": "FINISH"} message or close the WebSocket.
Sender (audio stream)
- Connect WebSocket with sender_token.
- Receive "Connexion established".
- Send configuration JSON.
- Receive transcription messages: the server starts transcription automatically (no audio push needed).
- Send the {"type": "FINISH"} message or close the WebSocket.
Listener
- Connect WebSocket with listener_token.
- Receive "Connexion established".
- Listen for transcription messages.
- Disconnect when done.
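Putting the microphone-sender steps together, a hedged end-to-end sketch reusing sender_handshake and push_audio from the earlier examples (host, token, and pcm_source are placeholders, and the concurrent send/receive pattern is one possible client design, not mandated by the API):

import asyncio
import websockets

async def run_sender(url: str, pcm_source):
    async with websockets.connect(url) as ws:
        await sender_handshake(ws)  # confirm, configure, wait for "started"
        # Push audio and read transcription concurrently, until the server
        # closes the connection after the FINISH message.
        async def read_results():
            async for message in ws:
                print(message)
        await asyncio.gather(push_audio(ws, pcm_source), read_results())

# asyncio.run(run_sender("wss://<host>/ws/live/?token_sender=<token_sender>", chunks))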
Notes
- Only one sender is allowed per live session.
- Multiple listeners can connect simultaneously.
- Supports both live microphone input and audio stream URLs (HLS, RTMP, RTSP).
- Audio encoding and chunking must respect 16-bit PCM, 16 kHz, and base64 encoding.