The Realtime API enables all paid developers to build low-latency, multimodal experiences in their apps.

OpenAI has introduced a public beta of the Realtime API, which lets developers build fast speech-to-speech experiences into their applications. Like ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already available in the API.
OpenAI is also adding audio input and output to the Chat Completions API to support use cases that do not require the Realtime API’s low latency. With this update, developers can send text or audio inputs to GPT-4o and have the model respond with text, audio, or both.
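To give a sense of the new audio support, here is a minimal sketch using the openai Python SDK. The model name matches the ‘gpt-4o-audio-preview’ announced below, and the parameters (modalities, audio voice and format) follow OpenAI’s published documentation for the beta, so they may change:

```python
# Minimal sketch: audio output from the Chat Completions API, assuming
# the openai Python SDK and an OPENAI_API_KEY environment variable.
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],              # request both a transcript and audio
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
)

# The spoken reply arrives base64-encoded alongside the text.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("reply.wav", "wb") as f:
    f.write(wav_bytes)
```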
Why the Realtime API?
Developers have long used voice experiences to connect with their users in contexts such as language apps, educational software, and customer support. With the Realtime API, and soon with audio in the Chat Completions API, developers no longer need to stitch together several separate models to power these experiences.
Instead, you can create lifelike conversational experiences with a single API call.
How Does the Realtime API Work?
Previously, to build a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model such as Whisper, pass the text to a language model for reasoning, and then play the model’s output through a text-to-speech model.
This approach often lost emotion, emphasis, and accent along the way, and it introduced noticeable latency.
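For comparison, that older three-model chain looked roughly like the sketch below, using the openai Python SDK (file names are placeholders):

```python
# Rough sketch of the older chained approach: speech -> text -> speech.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's speech with Whisper.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. Run the text through a chat model for the actual reasoning.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Convert the model's text reply back to speech.
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=reply.choices[0].message.content
)
speech.write_to_file("assistant_reply.mp3")
```

Each hop in this chain adds round-trip latency, which is exactly what the Realtime API is designed to avoid.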
The Chat Completions API lets developers handle this entire process with a single API call, though it remains slower than human conversation. The Realtime API improves on this by streaming audio inputs and outputs directly, producing more lifelike conversational experiences, and it handles interruptions automatically, much like Advanced Voice Mode in ChatGPT.
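As a rough illustration, here is what a bare-bones Realtime API client might look like over a raw WebSocket. This is only a sketch based on the beta documentation, assuming the third-party websockets library; event names such as response.create and response.done come from the beta docs, and a real voice client would also stream microphone audio in via input_audio_buffer.append events:

```python
# Sketch of a minimal Realtime API session over WebSocket, assuming the
# `websockets` library (the header keyword is `extra_headers` in older
# versions of the library and `additional_headers` in newer ones).
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask for a response; a full voice client would first stream
        # microphone chunks in via `input_audio_buffer.append` events.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])          # e.g. response.text.delta
            if event["type"] == "response.done":
                break

asyncio.run(main())
```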
Realtime API: Availability & Pricing
The Realtime API is available to all paid developers in the public beta. The audio capabilities of the Realtime API are powered by the new GPT-4o model ‘gpt-4o-realtime-preview’.
Audio in the Chat Completions API will arrive in the coming weeks through a new model called ‘gpt-4o-audio-preview’, which lets developers send text or audio to GPT-4o and receive responses as text, audio, or both.
Pricing
The Realtime API works with both text and audio tokens.
Text
- Input tokens – $5 per 1 million
- Output tokens – $20 per 1 million
Audio
- Input tokens – $100 per 1 million
- Output tokens – $200 per 1 million
This is equivalent to around $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be priced the same.
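As a quick sanity check, the per-minute figures follow from the token prices; note that the tokens-per-minute rates below are implied by the prices, not stated by OpenAI:

```python
# Back-of-the-envelope: implied audio token throughput from the prices above.
in_price = 100 / 1_000_000    # dollars per audio input token
out_price = 200 / 1_000_000   # dollars per audio output token

print(0.06 / in_price)   # ≈ 600 input tokens per minute of audio
print(0.24 / out_price)  # ≈ 1200 output tokens per minute of audio
```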
Over the next few days, developers can start building with the Realtime API in the Playground, with help from OpenAI’s documentation and reference client.
Also Read:
- Now, You Can Make Conversations on ChatGPT With Advanced Voice Mode
- What is OpenAI o1?: All You Need to Know
Stay Tuned to The Future Talk for more AI news and insights!