OpenAI’s new voice AI model gpt-4o-transcribe lets you add speech to your existing text apps in seconds

March 20, 2025 11:21 AM



OpenAI’s voice AI models have gotten it into trouble before with actor Scarlett Johansson, but that isn’t stopping the company from continuing to advance its offerings in this category.

Today, the ChatGPT maker unveiled three all-new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. They are available initially through its application programming interface (API) for third-party software developers to build their own apps atop, as well as on a custom demo site, OpenAI.fm, that individual users can access for limited testing and fun.

Moreover, the gpt-4o-mini-tts model’s voices can be customized from several presets via text prompt to change their accent, pitch, tone, and other vocal qualities, including conveying whatever emotions the user asks for. That should go a long way toward addressing concerns that OpenAI is deliberately imitating any particular person’s voice (the company previously denied that was the case with Johansson, but pulled down the ostensibly imitative voice option anyway). Now it’s up to the user to decide how they want their AI voice to sound when speaking back.

In a demo with VentureBeat delivered over video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, a user could get the same voice to sound like a cackling mad scientist or a calm, zen yoga teacher.
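As a rough illustration of that text-only steering, the sketch below builds two request bodies that share one preset voice and differ only in the style prompt. It deliberately stops short of sending anything. The endpoint URL, the field names (`voice`, `input`, `instructions`) and the preset name `coral` are assumptions that do not appear in this article, so check them against OpenAI's current API reference before relying on them.

```python
import json

# Assumed endpoint for OpenAI's text-to-speech API; shown for context only,
# this sketch never sends a request.
SPEECH_ENDPOINT = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, style: str, voice: str = "coral") -> dict:
    """Build a JSON-serializable body for a text-to-speech request.

    The field names and the default preset voice are assumptions,
    not details confirmed by the article.
    """
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,          # the preset voice stays the same
        "input": text,           # the words to be spoken
        "instructions": style,   # free-text description of the delivery
    }


line = "Your order has shipped."
mad_scientist = build_speech_request(line, "Speak like a cackling mad scientist.")
yoga_teacher = build_speech_request(line, "Speak like a calm, zen yoga teacher.")

print(json.dumps(mad_scientist, indent=2))
print(json.dumps(yoga_teacher, indent=2))
```

Only the `instructions` string changes between the two bodies, which mirrors what Harris demonstrated: one voice, two very different deliveries.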

Discovering and refining new capabilities within GPT-4o base

The models are variants of the existing GPT-4o model, which OpenAI launched in May 2024 and which currently powers the ChatGPT text and voice experience for many users. The company took that base model and post-trained it with additional data to make it excel at transcription and speech. It didn’t specify when the models might come to ChatGPT.

“ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,” Harris said.

The gpt-4o-transcribe family is meant to supersede OpenAI’s two-year-old open source Whisper speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents, and at varying speech speeds, across 100+ languages.

The company posted a chart on its website showing just how much lower the gpt-4o-transcribe models’ error rates are at identifying words across 33 languages, compared to Whisper — with an impressively low 2.46% in English.
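For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. It is an illustration of the usual definition only: the article does not describe how OpenAI normalizes text or scores its benchmarks, and the lowercasing and whitespace splitting here are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six reference words.
wer = word_error_rate("tell me about my last orders",
                      "tell me about my last order")
print(f"{wer:.2%}")
```

On this scale, the 2.46% figure quoted for English corresponds to roughly one word error in every 40 reference words.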

“These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” said Harris.

Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer “diarization,” or the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one voice (or possibly multiple voices) as a single input channel and respond to all inputs with a single output voice in that interaction, however long it takes.

The company is further hosting a competition for the general public to find the most creative examples of using its demo voice site OpenAI.fm and share them online by tagging the @openAI account on X. The winner is set to receive a custom Teenage Engineering radio with OpenAI logo, which OpenAI Head of Product, Platform Olivier Godement said is one of only three in the world.

An audio applications gold mine

The enhancements make the models particularly well-suited for applications such as customer call centers, meeting note transcription, and AI-powered assistants.

Impressively, the company’s Agents SDK, launched last week, also allows developers who have already built apps atop its text-based large language models, such as the regular GPT-4o, to add fluid voice interactions with only about “nine lines of code,” according to a presenter during an OpenAI YouTube livestream announcing the new models.

For example, an e-commerce app built atop GPT-4o could now respond to turn-based user questions like “tell me about my last orders” in speech with just seconds of tweaking the code by adding these new models.
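One way to picture why so little code is needed: a turn-based text app can be left untouched and wrapped with a transcription step in front and a speech step behind. The sketch below shows that chained shape with stand-in functions. It is not the Agents SDK's API, and every name in it (`VoiceTurn`, `fake_transcribe`, `fake_answer`, `fake_synthesize`) is hypothetical; in a real app the two audio stubs would call a speech-to-text model and a text-to-speech model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """Wraps an existing text handler with speech in and speech out."""
    transcribe: Callable[[bytes], str]   # audio -> user text
    answer: Callable[[str], str]         # the existing text-only app logic
    synthesize: Callable[[str], bytes]   # reply text -> audio

    def run(self, audio_in: bytes) -> bytes:
        question = self.transcribe(audio_in)
        reply = self.answer(question)
        return self.synthesize(reply)


# Hypothetical stand-ins so the sketch runs without any network access.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the spoken words


def fake_answer(question: str) -> str:
    if "last orders" in question:
        return "Your last order was a pair of headphones."
    return "Sorry, I did not catch that."


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are synthesized audio


turn = VoiceTurn(fake_transcribe, fake_answer, fake_synthesize)
audio_out = turn.run(b"tell me about my last orders")
print(audio_out.decode("utf-8"))
```

The text handler in the middle never learns that audio is involved, which is why an existing app needs few changes to gain a voice interface.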

“For the first time, we’re introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,” Harris said.

Still, for those devs looking for low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.

Pricing and availability

The new models are available immediately via OpenAI’s API, with pricing as follows:

• gpt-4o-transcribe: $6.00 per 1M audio input tokens (~$0.006 per minute)

• gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (~$0.003 per minute)

• gpt-4o-mini-tts: $0.60 per 1M text input tokens, $12.00 per 1M audio output tokens (~$0.015 per minute)
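To put the approximate per-minute figures above in concrete terms, the following sketch estimates the cost of a given amount of audio. The rates are copied from the list above; they are OpenAI's rounded estimates rather than exact token-based billing, so treat the results as ballpark numbers. The function name `estimate_cost` is introduced here for illustration.

```python
from decimal import Decimal

# Approximate USD per minute of audio, as quoted in the pricing list.
RATE_PER_MINUTE = {
    "gpt-4o-transcribe": Decimal("0.006"),
    "gpt-4o-mini-transcribe": Decimal("0.003"),
    "gpt-4o-mini-tts": Decimal("0.015"),
}


def estimate_cost(model: str, minutes) -> Decimal:
    """Return the approximate USD cost for `minutes` of audio."""
    if model not in RATE_PER_MINUTE:
        raise KeyError(f"no per-minute rate listed for {model!r}")
    # Going through str() keeps float inputs such as 2.5 exact in Decimal.
    return RATE_PER_MINUTE[model] * Decimal(str(minutes))


# Estimated cost of one hour of audio with each model.
for name in RATE_PER_MINUTE:
    print(f"{name}: ${estimate_cost(name, 60):.2f} per hour")
```

At these rates, an hour of audio comes to about $0.36 to transcribe with gpt-4o-transcribe, $0.18 with gpt-4o-mini-transcribe, and $0.90 to generate with gpt-4o-mini-tts.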

However, they arrive at a time of fiercer-than-ever competition in the AI transcription and speech space. Dedicated speech AI firm ElevenLabs offers its new Scribe model, which supports diarization and boasts a similarly reduced (though not quite as low) error rate of 3.3% in English, with pricing of $0.40 per hour of input audio (about $0.0067 per minute, roughly equivalent to gpt-4o-transcribe).

Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection, based entirely on the user’s instructions rather than any preset voices. The pricing of Octave TTS isn’t directly comparable, but there is a free tier offering 10 minutes of audio, and costs increase from there between

Meanwhile, more advanced audio and speech models are also coming to the open source community, including one called Orpheus 3B which is available with a permissive Apache 2.0 license, meaning developers don’t have to pay any costs to run it — provided they have the right hardware or cloud servers.

Industry adoption and early results

Several companies have already integrated OpenAI’s new audio models into their platforms, reporting significant improvements in voice AI performance, according to testimonials shared by OpenAI with VentureBeat.

EliseAI, a company focused on property management automation, found that OpenAI’s text-to-speech model enabled more natural and emotionally rich interactions with tenants.

The enhanced voices made AI-powered leasing, maintenance, and tour scheduling more engaging, leading to higher tenant satisfaction and improved call resolution rates.

Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI’s speech recognition model.

This increase in accuracy has allowed Decagon’s AI agents to perform more reliably in real-world scenarios, even in noisy environments. The integration process was quick, with Decagon incorporating the new model into its system within a day.

Not all reactions to OpenAI’s latest release have been warm. Ben Hylak (@benhylak), co-founder of app analytics software company Dawn AI and a former Apple human interfaces designer, posted on X that while the models seem promising, the announcement “feels like a retreat from real-time voice,” suggesting a shift away from OpenAI’s previous focus on low-latency conversational AI via ChatGPT.

Additionally, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details on the new models several minutes before the official announcement, listing the names of gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe. The leak was credited to @StivenTheDev, and the post quickly gained traction.

But looking ahead, OpenAI plans to continue refining its audio models and is exploring custom voice capabilities while ensuring safety and responsible AI use. Beyond audio, OpenAI is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.
