
WebSocket Bi-Directional Audio Streaming

2025-08-18 • by Team Astra • Tags: API docs, websocket, streaming, audio, real-time


Welcome to our WebSocket audio streaming documentation! This guide explains how to integrate real-time, bi-directional audio streaming between our voice platform and your WebSocket endpoint.

What Does This Enable?

Our WebSocket integration allows you to:

- Receive live audio streams from ongoing phone calls
- Send audio responses back in real time
- Access call metadata and participant information
- Build custom voice AI agents and experiences
- Create interactive voice response (IVR) systems

The audio flows both ways: we send you the caller's voice along with call information, and you can send audio back that plays to the caller instantly.


How It Works

The integration uses WebSocket events to manage the audio stream lifecycle:

Events We Send to You:

- Connection handshake and stream initialization
- Real-time audio chunks every 100 ms
- Touch-tone (DTMF) key presses
- Stream termination signals

Events You Send to Us:

- Audio responses to play to the caller
- Stream control commands (mark, clear)
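
To make the event flow concrete, here is a minimal, illustrative server sketch in Python. It assumes the third-party websockets package (pip install websockets) and a recent version that accepts a single-argument handler; the host, port, and print statements are placeholders, and the sections below describe what a real handler for each event would do.

import asyncio
import json

import websockets  # assumption: pip install websockets (recent version)

async def handle_stream(ws):
    """Receive one call's stream events and dispatch on the event type."""
    async for raw in ws:
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "connected":
            print("Handshake received")  # always the first message
        elif event == "start":
            print("Stream started:", msg["start"]["streamSid"])
        elif event == "media":
            pass  # caller audio arrives here roughly every 100 ms
        elif event == "dtmf":
            print("Key pressed:", msg["dtmf"]["digit"])
        elif event == "mark":
            print("Playback finished for:", msg["mark"]["name"])
        elif event == "stop":
            print("Stream stopped")
            break

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())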


Audio Format Specifications

All audio data uses the following format:

- Encoding: µ-law (audio/x-mulaw), also known as G.711 PCMU
- Sample Rate: 8000 Hz
- Bit Rate: 64 kbps
- Bit Depth: 8-bit
- Format: Base64-encoded PCM mono audio

Important: When sending audio to us, the payload size must be at least 160 bytes and a multiple of 160 bytes (320, 480, 800, and so on). Non-compliant sizes may cause audio gaps during playback.
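
As a rough sketch of meeting that requirement, the helper below pads an outgoing µ-law buffer to the next multiple of 160 bytes with 0xFF (µ-law silence) before base64-encoding it. The function names are illustrative, not part of the API.

import base64

FRAME_BYTES = 160  # 160 bytes of 8-bit µ-law at 8000 Hz = 20 ms of audio

def pad_to_frame(ulaw: bytes) -> bytes:
    """Pad µ-law audio with 0xFF (silence) up to a multiple of 160 bytes."""
    remainder = len(ulaw) % FRAME_BYTES
    if remainder:
        ulaw += b"\xff" * (FRAME_BYTES - remainder)
    return ulaw

def encode_payload(ulaw: bytes) -> str:
    """Base64-encode a padded µ-law buffer for use in a media message."""
    return base64.b64encode(pad_to_frame(ulaw)).decode("ascii")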


Events We Send to Your Endpoint

1. Connected Event

The connected event acts as a handshake response and sets expectations between the client and server. It is the first message sent after the WebSocket connection is established.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, connected. |

Payload:

{
  "event": "connected"
}

2. Start Event

The start message contains metadata about the Stream and is sent immediately after the connected message. It is only sent once at the start of the Stream.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, start. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| start | An object containing Stream metadata. |
| start.streamSid | The unique identifier of the Stream. |
| start.accountSid | The unique identifier of the Account for which the Stream was created. |
| start.callSid | The unique identifier of the Call for which the Stream was started. |
| start.from | The number from which the call was originated. |
| start.to | The number to which the call was originated. |
| start.direction | The direction of the call (inbound or outbound). |
| start.mediaFormat | An object containing the format of the payload in the media messages. |
| start.mediaFormat.encoding | The encoding of the data in the upcoming payload. Value is always audio/x-mulaw, also known as G.711 µ-law (PCMU). |
| start.mediaFormat.sampleRate | The sample rate in hertz of the upcoming audio data. Value is always 8000. |
| start.mediaFormat.bitRate | The number of bits used to represent one second of audio. Value is always 64 kbps. |
| start.mediaFormat.bitDepth | The number of bits used to represent each sample. Value is always 8. |
| start.customParameters | An object containing the custom parameters that were set when defining the Stream. |
| streamSid | The unique identifier of the Stream. |

Payload:

{
  "event": "start",
  "sequenceNumber": "1",
  "start": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "from": "XXXXXXXXXX",
    "to": "XXXXXXXXXX",
    "direction": "outbound",
    "mediaFormat": {
      "encoding": "audio/x-mulaw",
      "sampleRate": 8000,
      "bitRate": 64,
      "bitDepth": 8
    },
    "customParameters": {
      "FirstName": "Jane",
      "LastName": "Doe",
      "RemoteParty": "Bob"
    }
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
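
For illustration, a start handler might capture the metadata you will need later (streamSid, media format, custom parameters). The StreamState container below is hypothetical, not part of the API.

from dataclasses import dataclass, field

@dataclass
class StreamState:
    """Hypothetical per-call state kept while a Stream is active."""
    stream_sid: str = ""
    call_sid: str = ""
    sample_rate: int = 8000
    custom: dict = field(default_factory=dict)

def handle_start(msg: dict, state: StreamState) -> None:
    """Record Stream metadata from the start event."""
    start = msg["start"]
    state.stream_sid = start["streamSid"]
    state.call_sid = start["callSid"]
    state.sample_rate = start["mediaFormat"]["sampleRate"]
    state.custom = start.get("customParameters", {})
    # e.g. greet the caller using state.custom.get("FirstName")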

3. Media Event

This message type encapsulates the raw audio data.

Please note that a media message is sent to your endpoint every 100 ms.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, media. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| media | An object containing media metadata and the payload. |
| media.chunk | The chunk number of the message. The first message begins with 1 and increments with each subsequent message. |
| media.timestamp | Presentation timestamp in milliseconds from the start of the stream. |
| media.payload | Raw audio packets encoded in base64. |
| streamSid | The unique identifier of the Stream. |

Payload:

{
  "event": "media",
  "sequenceNumber": "3",
  "media": {
    "chunk": "1",
    "timestamp": "5",
    "payload": "no+JhoaJjpz..."
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
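
A minimal sketch of consuming inbound media: decode the base64 payload and append the raw µ-law bytes to a per-stream buffer. The inbound_audio dictionary is an illustrative helper, not part of the API.

import base64

inbound_audio: dict[str, bytearray] = {}  # streamSid -> raw µ-law bytes

def handle_media(msg: dict) -> None:
    """Decode one inbound media message and buffer its audio."""
    stream_sid = msg["streamSid"]
    ulaw = base64.b64decode(msg["media"]["payload"])  # 8-bit µ-law at 8000 Hz
    inbound_audio.setdefault(stream_sid, bytearray()).extend(ulaw)
    # Each 100 ms message carries 800 bytes (8000 samples/s x 0.1 s x 1 byte).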

4. DTMF Event

A dtmf message is sent when someone presses a touch-tone number key in the inbound stream, typically in response to a prompt in the outbound stream.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, dtmf. |
| streamSid | The unique identifier of the Stream. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| dtmf.digit | The number-key tone detected. |

An example dtmf message is shown below. The dtmf.digit value is 1, indicating that someone pressed the 1 key on their handset.

Payload:

{
  "event": "dtmf",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "sequenceNumber": "5",
  "dtmf": {
    "digit": "1"
  }
}
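
As an illustration only, a DTMF handler might map digits to menu options like this (the options themselves are invented for the example):

def handle_dtmf(msg: dict) -> None:
    """React to a touch-tone key press from the caller."""
    digit = msg["dtmf"]["digit"]
    if digit == "1":
        print("Caller chose option 1: connect to support")
    elif digit == "0":
        print("Caller chose option 0: repeat the menu")
    else:
        print(f"Unhandled digit: {digit}")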

5. Mark Event

After your endpoint sends a media message, it can send a mark message with a label. When playback of that media completes, we send a mark message back to your endpoint with the same label in mark.name, indicating that the audio has been played.

If your endpoint (the WebSocket server) sends a clear message, we empty the audio buffer and send back mark messages for any marks that were still outstanding.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, mark. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| streamSid | The unique identifier of the Stream. |
| mark | An object containing the mark metadata. |
| mark.name | A custom value. We send back the mark.name you specified in your mark message. |

Payload:

{
  "event": "mark",
  "sequenceNumber": "4",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "mark": {
    "name": "mark label"
  }
}
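
One way to track playback completion is to keep a set of outstanding labels per stream and discard each label as its mark event comes back. The pending_marks structure below is a hypothetical helper, not part of the API.

pending_marks: dict[str, set[str]] = {}  # streamSid -> labels awaiting playback

def handle_mark(msg: dict) -> None:
    """Record that a labelled chunk of audio finished playing (or was cleared)."""
    stream_sid = msg["streamSid"]
    pending_marks.get(stream_sid, set()).discard(msg["mark"]["name"])
    if not pending_marks.get(stream_sid):
        print(f"All queued audio for {stream_sid} has been played")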

6. Stop Event

This message indicates when the Stream has stopped, or the call has ended.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, stop. |
| sequenceNumber | Number used to keep track of message sending order. The first message has a value of 1 and is incremented for each subsequent message. |
| stop | An object containing Stream metadata. |
| stop.accountSid | The unique identifier of the Account that created the Stream. |
| stop.callSid | The unique identifier of the Call that started the Stream. |
| stop.reason | The reason for ending the Stream. |
| streamSid | The unique identifier of the Stream. |

Payload:

{
  "event": "stop",
  "sequenceNumber": "5",
  "stop": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "reason": "The caller disconnected the call"
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
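
On stop you would typically release any per-stream state. A minimal sketch, reusing the illustrative buffers from the earlier examples:

def handle_stop(msg: dict) -> None:
    """Tear down per-stream state once the call or Stream has ended."""
    stream_sid = msg["streamSid"]
    print(f"Stream {stream_sid} stopped: {msg['stop']['reason']}")
    inbound_audio.pop(stream_sid, None)   # buffer from the media example
    pending_marks.pop(stream_sid, None)   # tracker from the mark example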

Events You Send to Us

1. Media Event

The payload must be audio/x-mulaw mono audio at a sample rate of 8000 Hz, encoded in base64. The audio can be of any duration.

Please note that the payload of the media you send should be at least 160 bytes and a multiple of 160 bytes (320, 800, 4000, and so on). If the payload is not a multiple of 160 bytes, audio gaps might occur when it is played over the call.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, media. |
| streamSid | The SID of the Stream that should play the audio. |
| media | An object containing the media payload. |
| media.payload | Raw mulaw/8000 audio encoded in base64. |
| media.chunk | The chunk number of the message. The first message begins with 1 and increments with each subsequent message. |

Payload:

{
  "event": "media",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "media": {
    "payload": "a3242sa...",
    "chunk": 1
  }
}
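
A minimal sketch of sending audio back to the caller, assuming ws is an open connection from an asyncio WebSocket library and ulaw_audio is already µ-law at 8000 Hz; pad_to_frame is the illustrative 160-byte padding helper from the format section above.

import base64
import json

async def send_audio(ws, stream_sid: str, ulaw_audio: bytes, chunk: int = 1) -> None:
    """Send one outbound media message with a payload padded to a multiple of 160 bytes."""
    message = {
        "event": "media",
        "streamSid": stream_sid,
        "media": {
            "payload": base64.b64encode(pad_to_frame(ulaw_audio)).decode("ascii"),
            "chunk": chunk,
        },
    }
    await ws.send(json.dumps(message))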

2. Mark Event

Send a mark event after a media event to be notified when the audio you sent has finished playing. We send back a mark event with a matching name when the audio ends (or if there is no audio buffered).

Your WebSocket server also receives an incoming mark event if the buffer was cleared using the clear event.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, mark. |
| streamSid | The SID of the Stream that should receive the mark. |
| mark | An object containing mark metadata. |
| mark.name | A name of your choosing that helps you recognize the corresponding mark event we send back. |

Payload:

{
  "event": "mark",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "mark": {
    "name": "my label"
  }
}
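
Continuing the sketch, you might register a label locally and send the mark right after the media message; pending_marks is the hypothetical tracker introduced in the earlier mark example.

import json

async def send_mark(ws, stream_sid: str, label: str) -> None:
    """Ask to be notified when all audio queued so far has finished playing."""
    pending_marks.setdefault(stream_sid, set()).add(label)
    await ws.send(json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": label},
    }))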

3. Clear Event

Send a clear message if your server wants to interrupt audio that it has already sent in media messages. This empties all buffered audio and causes any outstanding mark messages to be sent back to your WebSocket server.

| Property | Description |
| --- | --- |
| event | Describes the type of WebSocket message. In this case, clear. |
| streamSid | The SID of the Stream in which you wish to interrupt the audio. |

Payload:

{
  "event": "clear",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
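
This is what enables barge-in: when the caller starts speaking over queued playback, send a clear to drop the remaining audio. A minimal sketch:

import json

async def interrupt_playback(ws, stream_sid: str) -> None:
    """Drop any audio we queued that has not yet been played to the caller."""
    await ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))
    # Mark events will come back for any labels that were still outstanding.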

Typical Stream Lifecycle

Here's how a typical interaction flows:

  1. Connection Establishment
     - We establish a WebSocket connection with your endpoint.
     - We send the connected event to initiate the handshake.

  2. Stream Initialization
     - We send the start event with stream metadata.

  3. Audio Streaming
     - We send media events with base64-encoded audio data.
     - We receive media events from your endpoint with base64-encoded audio data.

  4. Stream Termination
     - We send the stop event to terminate the stream.
     - We handle the clear event from your endpoint to reset the stream.

  5. End of Input
     - We send the mark event when all media from your bot has been played.
     - We handle the mark event from your endpoint to denote the end of input.

Example Workflow

Our System โ†’ Your Endpoint:

connected → start → media → media → media → dtmf → media → stop

Your Endpoint โ†’ Our System:

media → mark → media → mark → clear

Best Practices

  1. Stream IDs: Ensure that the streamSid is unique for each stream and consistent across all events for a given stream

  2. Sequence Numbers: Track sequence numbers to detect any dropped messages

  3. Audio Quality: Base64 encoding is used for audio data to ensure compatibility and ease of transmission

  4. Latency: Process and respond to audio quickly to maintain natural conversation flow

  5. Error Handling: Implement reconnection logic and handle connection drops gracefully

  6. Mark Synchronization: The mark event is used to synchronize the end of input between the client and vendor
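
As one way to apply the sequence-number advice, a receiver can remember the last value seen per stream and flag gaps. The last_seq dictionary is an illustrative helper, not part of the API.

last_seq: dict[str, int] = {}  # streamSid -> last sequenceNumber seen

def check_sequence(msg: dict) -> None:
    """Warn if a message appears to have been dropped or reordered."""
    stream_sid = msg.get("streamSid")
    seq = int(msg.get("sequenceNumber", 0))
    if not stream_sid or seq == 0:  # e.g. the connected event has neither field
        return
    expected = last_seq.get(stream_sid, 0) + 1
    if seq != expected:
        print(f"Sequence gap on {stream_sid}: expected {expected}, got {seq}")
    last_seq[stream_sid] = seq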


Technical Requirements

  • WebSocket connection support (wss://)
  • Base64 encoding/decoding capability
  • µ-law (G.711) audio codec support
  • JSON parsing and generation
  • Real-time processing capability (<500ms latency recommended)

Support Resources

Need help integrating? We're here for you:

- 📞 Call us
- 💬 WhatsApp us
- 📧 Email Support


Ready to build amazing voice experiences? Start integrating today!