WebSocket Bi-Directional Audio Streaming
Welcome to our WebSocket audio streaming documentation! This guide explains how to integrate real-time, bi-directional audio streaming between our voice platform and your WebSocket endpoint.
What Does This Enable?
Our WebSocket integration allows you to:
- Receive live audio streams from ongoing phone calls
- Send audio responses back in real-time
- Access call metadata and participant information
- Build custom voice AI agents and experiences
- Create interactive voice response systems
The audio flows both ways: we send you the caller's voice along with call information, and you can send audio back that plays to the caller instantly.
How It Works
The integration uses WebSocket events to manage the audio stream lifecycle:
Events We Send to You:
- Connection handshake and stream initialization
- Real-time audio chunks every 100ms
- Touch-tone (DTMF) key presses
- Stream termination signals

Events You Send to Us:
- Audio responses to play to the caller
- Stream control commands (mark, clear)
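The event lists above suggest a handler that routes each incoming JSON message by its event field. The following is a minimal sketch in Python, assuming each WebSocket text frame carries one JSON object; the name `route_event` is hypothetical and not part of the platform API.

```python
import json

# Event names this guide describes as sent from the platform to your endpoint.
INBOUND_EVENTS = {"connected", "start", "media", "dtmf", "mark", "stop"}


def route_event(raw_message: str) -> str:
    """Parse one WebSocket text frame and return the name of its event.

    Raises ValueError if the frame is not a JSON object or the event is unknown.
    """
    message = json.loads(raw_message)  # JSONDecodeError is a ValueError subclass
    if not isinstance(message, dict):
        raise ValueError("expected a JSON object")
    event = message.get("event")
    if event not in INBOUND_EVENTS:
        raise ValueError(f"unexpected event: {event!r}")
    return event
```

A real handler would dispatch on the returned name to event-specific code rather than only returning it.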
Audio Format Specifications
All audio data uses the following format:
- Encoding: µ-law (audio/x-mulaw), also known as G.711 PCMU
- Sample Rate: 8000 Hz
- Bit Rate: 64 kbps
- Bit Depth: 8-bit
- Format: Base64 encoded PCM mono audio
Important: When sending audio to us, the payload size must be a multiple of 160 bytes (160, 320, 480, 800, etc.). At 8000 Hz and 8 bits per sample, 160 bytes is 20 ms of audio. Non-compliant sizes may cause audio gaps during playback.
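One way to satisfy the 160-byte rule is to pad the final frame with silence before encoding. The sketch below assumes the size rule applies to the raw µ-law bytes before base64 encoding, and it pads with 0xFF, the µ-law code for a zero-amplitude sample. The function names are hypothetical; buffering until a full frame is available is an equally valid design.

```python
import base64

FRAME_BYTES = 160    # 20 ms of audio at 8000 Hz, 8 bits per sample
ULAW_SILENCE = 0xFF  # mu-law code that decodes to a zero-amplitude sample


def pad_to_frame_multiple(audio: bytes) -> bytes:
    """Pad mu-law audio with silence so its length is a non-zero multiple of 160."""
    remainder = len(audio) % FRAME_BYTES
    if audio and remainder == 0:
        return audio
    # Empty input is padded up to one full frame of silence.
    padding = FRAME_BYTES - remainder
    return audio + bytes([ULAW_SILENCE]) * padding


def encode_payload(audio: bytes) -> str:
    """Return the base64 string to place in media.payload."""
    return base64.b64encode(pad_to_frame_multiple(audio)).decode("ascii")
```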
Events We Send to Your Endpoint
1. Connected Event
The connected event acts as the handshake response and sets expectations between the client and server. It is the first message sent after the WebSocket connection is established.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, connected. |
Payload:
{
  "event": "connected"
}
2. Start Event
The start message contains metadata about the Stream and is sent immediately after the connected message. It is only sent once at the start of the Stream.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, start. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message. |
| start | An object containing Stream metadata. |
| start.streamSid | The unique identifier of the Stream. |
| start.accountSid | The unique identifier of the Account for which the Stream was created. |
| start.callSid | The unique identifier of the Call for which the Stream was started. |
| start.from | The number from which the call originated. |
| start.to | The account's number to which the call was placed. |
| start.direction | The direction of the call (inbound or outbound). |
| start.mediaFormat | An object containing the format of the payload in the media messages. |
| start.mediaFormat.encoding | The encoding of the data in the upcoming payload. Value is always audio/x-mulaw, also known as G.711 µ-law (PCMU). |
| start.mediaFormat.sampleRate | The sample rate in hertz of the upcoming audio data. Value is always 8000. |
| start.mediaFormat.bitRate | The bit rate of the audio data in kilobits per second. Value is always 64. |
| start.mediaFormat.bitDepth | The number of bits used to represent each sample. Value is always 8. |
| start.customParameters | An object containing the custom parameters that were set when defining the Stream. |
| streamSid | The unique identifier of the Stream. |
Payload:
{
  "event": "start",
  "sequenceNumber": "1",
  "start": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "from": "XXXXXXXXXX",
    "to": "XXXXXXXXXX",
    "direction": "outbound",
    "mediaFormat": {
      "encoding": "audio/x-mulaw",
      "sampleRate": 8000,
      "bitRate": 64,
      "bitDepth": 8
    },
    "customParameters": {
      "FirstName": "Jane",
      "LastName": "Doe",
      "RemoteParty": "Bob"
    }
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
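Since the start event arrives once per Stream, it is a natural place to capture per-call state. The sketch below reads the fields shown in the payload above into a small record; the `StreamInfo` class and `parse_start` function are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StreamInfo:
    """Per-stream state taken from the start event."""
    stream_sid: str
    call_sid: str
    account_sid: str
    caller: str
    callee: str
    direction: str
    sample_rate: int
    custom_parameters: dict = field(default_factory=dict)


def parse_start(raw_message: str) -> StreamInfo:
    """Build a StreamInfo from the JSON text of a start event."""
    message = json.loads(raw_message)
    if message.get("event") != "start":
        raise ValueError("not a start event")
    start = message["start"]
    return StreamInfo(
        stream_sid=start["streamSid"],
        call_sid=start["callSid"],
        account_sid=start["accountSid"],
        caller=start["from"],
        callee=start["to"],
        direction=start["direction"],
        sample_rate=start["mediaFormat"]["sampleRate"],
        # Treated as optional here; that is an assumption of this sketch.
        custom_parameters=start.get("customParameters", {}),
    )
```

Keeping the streamSid from this record lets you address every media, mark, and clear message you send later to the right Stream.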
3. Media Event
This message type encapsulates the raw audio data.
Note that a media message is sent to your endpoint every 100 ms.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, "media". |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message. |
| media | An object containing media metadata and payload. |
| media.chunk | The chunk for the message. The first message will begin with 1 and increment with each subsequent message. |
| media.timestamp | Presentation timestamp in milliseconds from the start of the stream. |
| media.payload | Raw audio encoded packets in base64. |
| streamSid | The unique identifier of the Stream. |
Payload:
{
  "event": "media",
  "sequenceNumber": "3",
  "media": {
    "chunk": "1",
    "timestamp": "5",
    "payload": "no+JhoaJjpz..."
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
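Most speech and audio libraries expect linear PCM rather than µ-law, so a common first step is to decode the payload. The sketch below base64-decodes media.payload and expands each byte to a 16-bit sample using the standard G.711 µ-law expansion. The function names are hypothetical, and the little-endian 16-bit output format is a choice made for this example.

```python
import base64
import json
import struct


def ulaw_byte_to_pcm16(code: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media(raw_message: str) -> bytes:
    """Return the audio of a media event as little-endian 16-bit PCM."""
    message = json.loads(raw_message)
    ulaw = base64.b64decode(message["media"]["payload"])
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

If each message carries exactly 100 ms of audio, the decoded payload would be 800 bytes of µ-law, which expands to 1600 bytes of 16-bit PCM.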
4. DTMF Event
A dtmf message is sent when someone presses a touch-tone number key in the inbound stream, typically in response to a prompt in the outbound stream.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, dtmf. |
| streamSid | The unique identifier of the Stream. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message. |
| dtmf.digit | The number-key tone detected. |
An example dtmf message is shown below. The dtmf.digit value is 1, indicating that someone pressed the 1 key on their handset.
Payload:
{
  "event": "dtmf",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "sequenceNumber": "5",
  "dtmf": {
    "digit": "1"
  }
}
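A dtmf message like the one above can drive a simple menu. The sketch below maps the detected digit to an action name; the `MENU` table, the action names, and the `handle_dtmf` function are all hypothetical and stand in for whatever your application does with a key press.

```python
import json

# Hypothetical menu: which action each key triggers is up to your application.
MENU = {"1": "sales", "2": "support", "0": "operator"}


def handle_dtmf(raw_message: str) -> str:
    """Return the action for the digit in a dtmf event, or a fallback action."""
    message = json.loads(raw_message)
    digit = message["dtmf"]["digit"]
    # Unknown keys fall back to replaying the prompt.
    return MENU.get(digit, "repeat_prompt")
```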
5. Mark Event
After your endpoint sends a media message, it can send a mark message with a label. When playback of that media is complete, we send a mark message back to your endpoint with the same label in mark.name, indicating that the media has been played.
If your endpoint (the WebSocket server) sends a clear message, we empty the audio buffer and send back a mark message for each mark that was still pending.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, "mark". |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message. |
| streamSid | The unique identifier of the Stream. |
| mark | An object containing the mark metadata. |
| mark.name | A custom value. We send back the mark.name that you specified in the mark message you sent. |
Payload:
{
  "event": "mark",
  "sequenceNumber": "4",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "mark": {
    "name": "mark label"
  }
}
6. Stop Event
This message indicates that the Stream has stopped or the call has ended.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, stop. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message. |
| stop | An object containing Stream metadata. |
| stop.accountSid | The Account identifier that created the Stream. |
| stop.callSid | The Call identifier that started the Stream. |
| stop.reason | The reason for ending the Stream. |
| streamSid | The unique identifier of the Stream. |
Payload:
{
  "event": "stop",
  "sequenceNumber": "5",
  "stop": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "reason": "The caller disconnected the call"
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
Events You Send to Us
1. Media Event
The payload must be base64-encoded audio/x-mulaw mono audio with a sample rate of 8000 Hz. The audio can be of any length, provided the payload size is a multiple of 160 bytes (160, 320, 800, 4000, etc.).
If the payload is not a multiple of 160 bytes, audio gaps might occur when it is played over the call.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, "media". |
| streamSid | The SID of the Stream that should play the audio. |
| media | An object containing the media payload. |
| media.payload | Raw mulaw/8000 audio encoded in base64. |
| media.chunk | The chunk for the message. The first message will begin with 1 and increment with each subsequent message. |
Payload:
{
  "event": "media",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "media": {
    "payload": "a3242sa...",
    "chunk": 1
  }
}
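A longer audio buffer can be split into several media messages with an incrementing chunk counter. The sketch below assumes the input is already µ-law audio whose length is a multiple of 160 bytes, so every slice stays a multiple of 160. The function name and the `frames_per_message` parameter are hypothetical; its default of 5 frames (800 bytes, or 100 ms) is an arbitrary choice for this example.

```python
import base64
import json
from typing import Iterator

FRAME_BYTES = 160  # smallest payload unit described in this guide


def build_media_messages(
    stream_sid: str, audio: bytes, frames_per_message: int = 5
) -> Iterator[str]:
    """Yield JSON media messages whose payloads are multiples of 160 bytes."""
    if not audio or len(audio) % FRAME_BYTES != 0:
        raise ValueError("audio length must be a non-zero multiple of 160 bytes")
    step = FRAME_BYTES * frames_per_message
    # The last slice may be shorter than step but is still a multiple of 160.
    for chunk, offset in enumerate(range(0, len(audio), step), start=1):
        payload = base64.b64encode(audio[offset:offset + step]).decode("ascii")
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload, "chunk": chunk},
        })
```

Because this is a generator, the length check runs when the first message is requested, not when the function is called.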
2. Mark Event
Send a mark event after a media event to be notified when playback of the audio you sent has completed. We send back a mark event with a matching name when the audio ends (or if there is no audio buffered).
Your WebSocket server also receives a mark event for each pending mark if the buffer is cleared with a clear event.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, "mark". |
| streamSid | The SID of the Stream that should receive the mark. |
| mark | An object containing mark metadata and payload. |
| mark.name | A name of your choosing that lets you recognize the mark event when it is sent back to you. |
Payload:
{
  "event": "mark",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "mark": {
    "name": "my label"
  }
}
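To know which audio has finished playing, an endpoint can remember each mark name it sends and tick it off when the platform echoes it back. The sketch below is one way to do that; the `MarkTracker` class and its method names are hypothetical.

```python
import json


class MarkTracker:
    """Remembers mark names sent to the platform until they are echoed back."""

    def __init__(self, stream_sid: str) -> None:
        self.stream_sid = stream_sid
        self.pending = []  # mark names sent but not yet echoed, in send order

    def build_mark(self, name: str) -> str:
        """Record the name as pending and return the mark message to send."""
        self.pending.append(name)
        return json.dumps({
            "event": "mark",
            "streamSid": self.stream_sid,
            "mark": {"name": name},
        })

    def on_mark(self, raw_message: str) -> bool:
        """Handle an echoed mark; return True if it matched a pending name."""
        name = json.loads(raw_message)["mark"]["name"]
        if name in self.pending:
            self.pending.remove(name)
            return True
        return False
```

An empty `pending` list then means that all audio followed by a mark has either been played or been cleared.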
3. Clear Event
Send a clear message to interrupt audio that you have already sent in media messages. This empties all buffered audio and causes any pending mark messages to be sent back to your WebSocket server.
| Property | Description |
|---|---|
| event | Describes the type of WebSocket message. In this case, "clear". |
| streamSid | The SID of the Stream in which you wish to interrupt the audio. |
Payload:
{
"event": "clear",
"streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
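A typical use of clear is barge-in: the caller starts speaking while your audio is still playing, so you discard what is buffered. The sketch below builds the clear message and returns the mark names you should expect to be echoed back as a result. The function name and the `pending_marks` argument are hypothetical; how you detect that the caller is speaking is outside the scope of this example.

```python
import json


def interrupt_playback(stream_sid: str, pending_marks: list) -> tuple:
    """Return the clear message to send and the marks expected to be echoed."""
    clear_message = json.dumps({"event": "clear", "streamSid": stream_sid})
    # Per the description above, marks still queued are sent back after a clear.
    return clear_message, list(pending_marks)
```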
Typical Stream Lifecycle
Here's how a typical interaction flows:
- Connection Establishment
  - A WebSocket connection is established between our platform and your endpoint.
  - We send the connected event to complete the handshake.
- Stream Initialization
  - We send the start event with stream metadata.
- Audio Streaming
  - We send media events with base64 encoded audio from the caller.
  - You send media events with base64 encoded audio to play to the caller.
- Stream Termination
  - We send the stop event when the stream stops or the call ends.
  - You can send the clear event at any time to discard audio that is still buffered.
- End of Input
  - You send the mark event after your media to mark the end of your audio.
  - We send the mark event back when all media before it has been played.
Example Workflow
Our System → Your Endpoint:
connected → start → media → media → media → dtmf → media → stop
Your Endpoint → Our System:
media → mark → media → mark → clear
Best Practices
- Stream IDs: Ensure that the streamSid is unique for each stream and consistent across all events for a given stream
- Sequence Numbers: Track sequence numbers to detect any dropped messages
- Audio Quality: Base64 encoding is used for audio data to ensure compatibility and ease of transmission
- Latency: Process and respond to audio quickly to maintain natural conversation flow
- Error Handling: Implement reconnection logic and handle connection drops gracefully
- Mark Synchronization: The mark event is used to synchronize the end of input between the client and vendor
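The sequence-number practice above can be sketched as a small gap detector. It assumes sequenceNumber arrives as a numeric string, as in the example payloads, and that messages without the field (such as connected) should be ignored; the `SequenceChecker` class is a hypothetical name.

```python
import json


class SequenceChecker:
    """Counts messages that appear to be missing, based on sequenceNumber."""

    def __init__(self) -> None:
        self.expected = 1  # the first numbered message has a value of 1

    def check(self, raw_message: str) -> int:
        """Return how many messages were skipped before this one (0 if none)."""
        message = json.loads(raw_message)
        if "sequenceNumber" not in message:
            return 0  # e.g. the connected event carries no sequence number
        received = int(message["sequenceNumber"])
        missing = max(received - self.expected, 0)
        # Never move backwards if a duplicate or late message arrives.
        self.expected = max(self.expected, received + 1)
        return missing
```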
Technical Requirements
- WebSocket connection support (wss://)
- Base64 encoding/decoding capability
- µ-law (G.711) audio codec support
- JSON parsing and generation
- Real-time processing capability (<500ms latency recommended)
Support Resources
Need help integrating? We're here for you:
- Call us
- WhatsApp us
- Email Support
Ready to build amazing voice experiences? Start integrating today!