
WebSocket Bi-Directional Audio Streaming

2025-08-18 • by Team Astra • Tags: API docs, websocket, streaming, audio, real-time


Welcome to our WebSocket audio streaming documentation! This guide explains how to integrate real-time, bi-directional audio streaming between our voice platform and your WebSocket endpoint.

What Does This Enable?

Our WebSocket integration allows you to:

- Receive live audio streams from ongoing phone calls
- Send audio responses back in real time
- Access call metadata and participant information
- Build custom voice AI agents and experiences
- Create interactive voice response (IVR) systems

The audio flows both ways: we send you the caller's voice along with call information, and you can send audio back that plays to the caller instantly.


How It Works

The integration uses WebSocket events to manage the audio stream lifecycle:

Events We Send to You:

- Connection handshake and stream initialization
- Real-time audio chunks every 100 ms
- Touch-tone (DTMF) key presses
- Stream termination signals

Events You Send to Us:

- Audio responses to play to the caller
- Stream control commands (mark, clear)
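
To make the event flow concrete, here is a minimal, illustrative server sketch in Python. It assumes the third-party websockets package (pip install websockets) and a recent version that accepts a single-argument handler; the host, port, and print statements are placeholders, and the sections below describe what a real handler for each event would do.

import asyncio
import json

import websockets  # assumption: pip install websockets (recent version)

async def handle_stream(ws):
    """Receive one call's stream events and dispatch on the event type."""
    async for raw in ws:
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "connected":
            print("Handshake received")  # always the first message
        elif event == "start":
            print("Stream started:", msg["start"]["streamSid"])
        elif event == "media":
            pass  # caller audio arrives here roughly every 100 ms
        elif event == "dtmf":
            print("Key pressed:", msg["dtmf"]["digit"])
        elif event == "mark":
            print("Playback finished for:", msg["mark"]["name"])
        elif event == "stop":
            print("Stream stopped")
            break

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())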


Audio Format Specifications

All audio data uses the following format:

- Encoding: µ-law (audio/x-mulaw), also known as G.711 PCMU
- Sample Rate: 8000 Hz
- Bit Rate: 64 kbps
- Bit Depth: 8-bit
- Format: Base64-encoded PCM mono audio

Important: When sending audio to us, the payload size must be at least 160 bytes and a multiple of 160 bytes (320, 480, 800, and so on). Non-compliant sizes may cause audio gaps during playback.
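
As a rough sketch of meeting that requirement, the helper below pads an outgoing µ-law buffer to the next multiple of 160 bytes with 0xFF (µ-law silence) before base64-encoding it. The function names are illustrative, not part of the API.

import base64

FRAME_BYTES = 160  # 160 bytes of 8-bit µ-law at 8000 Hz = 20 ms of audio

def pad_to_frame(ulaw: bytes) -> bytes:
    """Pad µ-law audio with 0xFF (silence) up to a multiple of 160 bytes."""
    remainder = len(ulaw) % FRAME_BYTES
    if remainder:
        ulaw += b"\xff" * (FRAME_BYTES - remainder)
    return ulaw

def encode_payload(ulaw: bytes) -> str:
    """Base64-encode a padded µ-law buffer for use in a media message."""
    return base64.b64encode(pad_to_frame(ulaw)).decode("ascii")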


Events We Send to Your Endpoint

1. Connected Event

The connected event acts as a handshake response and sets expectations between the client and server. It is the first message sent after the WebSocket connection is established.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, connected. |

Payload:

{
  "event": "connected"
}

2. Start Event

The start message contains metadata about the Stream and is sent immediately after the connected message. It is only sent once at the start of the Stream.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, start. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| start | An object containing Stream metadata. |
| start.streamSid | The unique identifier of the Stream. |
| start.accountSid | The unique identifier of the Account for which the Stream was created. |
| start.callSid | The unique identifier of the Call for which the Stream was started. |
| start.from | The number from which the call was originated. |
| start.to | The number to which the call was originated. |
| start.direction | The direction of the call (inbound or outbound). |
| start.mediaFormat | An object containing the format of the payload in the media messages. |
| start.mediaFormat.encoding | The encoding of the data in the upcoming payload. Value is always audio/x-mulaw, also known as G.711 µ-law (PCMU). |
| start.mediaFormat.sampleRate | The sample rate in hertz of the upcoming audio data. Value is always 8000. |
| start.mediaFormat.bitRate | The number of bits used to represent one second of audio. Value is always 64 kbps. |
| start.mediaFormat.bitDepth | The number of bits used to represent each sample. Value is always 8. |
| start.customParameters | An object containing the custom parameters that were set when defining the Stream. |
| streamSid | The unique identifier of the Stream. |

Payload:

{
  "event": "start",
  "sequenceNumber": "1",
  "start": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "from": "XXXXXXXXXX",
    "to": "XXXXXXXXXX",
    "direction": "outbound",
    "mediaFormat": {
      "encoding": "audio/x-mulaw",
      "sampleRate": 8000,
      "bitRate": 64,
      "bitDepth": 8
    },
    "customParameters": {
      "FirstName": "Jane",
      "LastName": "Doe",
      "RemoteParty": "Bob"
    }
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
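
For illustration, a start handler might capture the metadata you will need later (streamSid, media format, custom parameters). The StreamState container below is hypothetical, not part of the API.

from dataclasses import dataclass, field

@dataclass
class StreamState:
    """Hypothetical per-call state kept while a Stream is active."""
    stream_sid: str = ""
    call_sid: str = ""
    sample_rate: int = 8000
    custom: dict = field(default_factory=dict)

def handle_start(msg: dict, state: StreamState) -> None:
    """Record Stream metadata from the start event."""
    start = msg["start"]
    state.stream_sid = start["streamSid"]
    state.call_sid = start["callSid"]
    state.sample_rate = start["mediaFormat"]["sampleRate"]
    state.custom = start.get("customParameters", {})
    # e.g. greet the caller using state.custom.get("FirstName")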

3. Media Event

This message type encapsulates the raw audio data.

Please note that a media message is sent to your endpoint every 100 ms.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, media. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| media | An object containing media metadata and the payload. |
| media.chunk | The chunk number of the message. The first message begins with 1 and increments with each subsequent message. |
| media.timestamp | Presentation timestamp in milliseconds from the start of the stream. |
| media.payload | Raw audio packets encoded in base64. |
| streamSid | The unique identifier of the Stream. |

Payload:

{
  "event": "media",
  "sequenceNumber": "3",
  "media": {
    "chunk": "1",
    "timestamp": "5",
    "payload": "no+JhoaJjpz..."
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
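
A minimal sketch of consuming inbound media: decode the base64 payload and append the raw µ-law bytes to a per-stream buffer. The inbound_audio dictionary is an illustrative helper, not part of the API.

import base64

inbound_audio: dict[str, bytearray] = {}  # streamSid -> raw µ-law bytes

def handle_media(msg: dict) -> None:
    """Decode one inbound media message and buffer its audio."""
    stream_sid = msg["streamSid"]
    ulaw = base64.b64decode(msg["media"]["payload"])  # 8-bit µ-law at 8000 Hz
    inbound_audio.setdefault(stream_sid, bytearray()).extend(ulaw)
    # Each 100 ms message carries 800 bytes (8000 samples/s x 0.1 s x 1 byte).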

4. DTMF Event

A dtmf message is sent when someone presses a touch-tone number key in the inbound stream, typically in response to a prompt in the outbound stream.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, dtmf. |
| streamSid | The unique identifier of the Stream. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| dtmf.digit | The number-key tone detected. |

An example dtmf message is shown below. The dtmf.digit value is 1, indicating that someone pressed the 1 key on their handset.

Payload:

{
  "event": "dtmf",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "sequenceNumber": "5",
  "dtmf": {
    "digit": "1"
  }
}
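
As an illustration only, a DTMF handler might map digits to menu options like this (the options themselves are invented for the example):

def handle_dtmf(msg: dict) -> None:
    """React to a touch-tone key press from the caller."""
    digit = msg["dtmf"]["digit"]
    if digit == "1":
        print("Caller chose option 1: connect to support")
    elif digit == "0":
        print("Caller chose option 0: repeat the menu")
    else:
        print(f"Unhandled digit: {digit}")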

5. Mark Event

After your endpoint sends a media message, it can send a mark message with a label. When playback of that media completes, we send a mark message back to your endpoint with the same label in mark.name, indicating that the audio has been played.

If your endpoint (the WebSocket server) sends a clear message, we empty the audio buffer and send back mark messages for any marks that were still outstanding.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, mark. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| streamSid | The unique identifier of the Stream. |
| mark | An object containing the mark metadata. |
| mark.name | A custom value. We send back the mark.name you specified in your mark message. |

Payload:

{
  "event": "mark",
  "sequenceNumber": "4",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "mark": {
    "name": "mark label"
  }
}
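
One way to track playback completion is to keep a set of outstanding labels per stream and discard each label as its mark event comes back. The pending_marks structure below is a hypothetical helper, not part of the API.

pending_marks: dict[str, set[str]] = {}  # streamSid -> labels awaiting playback

def handle_mark(msg: dict) -> None:
    """Record that a labelled chunk of audio finished playing (or was cleared)."""
    stream_sid = msg["streamSid"]
    pending_marks.get(stream_sid, set()).discard(msg["mark"]["name"])
    if not pending_marks.get(stream_sid):
        print(f"All queued audio for {stream_sid} has been played")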

6. Stop Event

This message indicates when the Stream has stopped, or the call has ended.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, stop. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| stop | An object containing Stream metadata. |
| stop.accountSid | The unique identifier of the Account that created the Stream. |
| stop.callSid | The unique identifier of the Call that started the Stream. |
| stop.reason | The reason for ending the Stream. |
| streamSid | The unique identifier of the Stream. |

Payload:

{
  "event": "stop",
  "sequenceNumber": "5",
  "stop": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "reason": "The caller disconnected the call"
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
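
On stop you would typically release any per-stream state. A minimal sketch, reusing the illustrative buffers from the earlier examples:

def handle_stop(msg: dict) -> None:
    """Tear down per-stream state once the call or Stream has ended."""
    stream_sid = msg["streamSid"]
    print(f"Stream {stream_sid} stopped: {msg['stop']['reason']}")
    inbound_audio.pop(stream_sid, None)   # buffer from the media example
    pending_marks.pop(stream_sid, None)   # tracker from the mark example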

Events You Send to Us

1. Media Event

The payload must be audio/x-mulaw mono audio at a sample rate of 8000 Hz, encoded in base64. The audio can be of any duration.

Please note that the payload of the media you send should be at least 160 bytes and a multiple of 160 bytes (320, 800, 4000, and so on). If the payload is not a multiple of 160 bytes, audio gaps might occur when it is played over the call.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, media. |
| streamSid | The SID of the Stream that should play the audio. |
| media | An object containing the media payload. |
| media.payload | Raw mulaw/8000 audio encoded in base64. |
| media.chunk | The chunk number of the message. The first message begins with 1 and increments with each subsequent message. |

Payload:

{
  "event": "media",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "media": {
    "payload": "a3242sa...",
    "chunk": 1
  }
}
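
A minimal sketch of sending audio back to the caller, assuming ws is an open connection from an asyncio WebSocket library and ulaw_audio is already µ-law at 8000 Hz; pad_to_frame is the illustrative 160-byte padding helper from the format section above.

import base64
import json

async def send_audio(ws, stream_sid: str, ulaw_audio: bytes, chunk: int = 1) -> None:
    """Send one outbound media message with a payload padded to a multiple of 160 bytes."""
    message = {
        "event": "media",
        "streamSid": stream_sid,
        "media": {
            "payload": base64.b64encode(pad_to_frame(ulaw_audio)).decode("ascii"),
            "chunk": chunk,
        },
    }
    await ws.send(json.dumps(message))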

2. Mark Event

Send a mark event after a media event to be notified when the audio you sent has finished playing. We send back a mark event with a matching name when the audio ends (or if there is no audio buffered).

Your WebSocket server also receives an incoming mark event if the buffer was cleared using the clear event.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, mark. |
| streamSid | The SID of the Stream that should receive the mark. |
| mark | An object containing mark metadata. |
| mark.name | A name of your choosing that helps you recognize the corresponding mark event we send back. |

Payload:

{
  "event": "mark",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "mark": {
    "name": "my label"
  }
}
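
Continuing the sketch, you might register a label locally and send the mark right after the media message; pending_marks is the hypothetical tracker introduced in the earlier mark example.

import json

async def send_mark(ws, stream_sid: str, label: str) -> None:
    """Ask to be notified when all audio queued so far has finished playing."""
    pending_marks.setdefault(stream_sid, set()).add(label)
    await ws.send(json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": label},
    }))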

3. Clear Event

Send a clear message if your server wants to interrupt audio that it has already sent in media messages. This empties all buffered audio and causes any outstanding mark messages to be sent back to your WebSocket server.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, clear. |
| streamSid | The SID of the Stream in which you wish to interrupt the audio. |

Payload:

{
  "event": "clear",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
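
This is what enables barge-in: when the caller starts speaking over queued playback, send a clear to drop the remaining audio. A minimal sketch:

import json

async def interrupt_playback(ws, stream_sid: str) -> None:
    """Drop any audio we queued that has not yet been played to the caller."""
    await ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))
    # Mark events will come back for any labels that were still outstanding.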

Typical Stream Lifecycle

Here's how a typical interaction flows:

  1. Connection Establishment
     - We establish a WebSocket connection with your endpoint.
     - We send the connected event to initiate the handshake.

  2. Stream Initialization
     - We send the start event with stream metadata.

  3. Audio Streaming
     - We send media events with base64-encoded audio data.
     - We receive media events from your endpoint with base64-encoded audio data.

  4. Stream Termination
     - We send the stop event to terminate the stream.
     - We handle the clear event from your endpoint to reset the stream.

  5. End of Input
     - We send the mark event when all media from your bot has been played.
     - We handle the mark event from your endpoint to denote the end of input.

Example Workflow

Our System โ†’ Your Endpoint:

connected → start → media → media → media → dtmf → media → stop

Your Endpoint โ†’ Our System:

media → mark → media → mark → clear

Best Practices

  1. Stream IDs: Ensure that the streamSid is unique for each stream and consistent across all events for a given stream

  2. Sequence Numbers: Track sequence numbers to detect any dropped messages

  3. Audio Quality: Base64 encoding is used for audio data to ensure compatibility and ease of transmission

  4. Latency: Process and respond to audio quickly to maintain natural conversation flow

  5. Error Handling: Implement reconnection logic and handle connection drops gracefully

  6. Mark Synchronization: The mark event is used to synchronize the end of input between the client and vendor
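
As one way to apply the sequence-number advice, a receiver can remember the last value seen per stream and flag gaps. The last_seq dictionary is an illustrative helper, not part of the API.

last_seq: dict[str, int] = {}  # streamSid -> last sequenceNumber seen

def check_sequence(msg: dict) -> None:
    """Warn if a message appears to have been dropped or reordered."""
    stream_sid = msg.get("streamSid")
    seq = int(msg.get("sequenceNumber", 0))
    if not stream_sid or seq == 0:  # e.g. the connected event has neither field
        return
    expected = last_seq.get(stream_sid, 0) + 1
    if seq != expected:
        print(f"Sequence gap on {stream_sid}: expected {expected}, got {seq}")
    last_seq[stream_sid] = seq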


Technical Requirements

  • WebSocket connection support (wss://)
  • Base64 encoding/decoding capability
  • µ-law (G.711) audio codec support
  • JSON parsing and generation
  • Real-time processing capability (<500ms latency recommended)

Support Resources

Need help integrating? We're here for you:

- 📞 Call us
- 💬 WhatsApp us
- 📧 Email Support


Ready to build amazing voice experiences? Start integrating today!