7 Comments

OpenAI launches real-time voice chat API, image fine-tuning, prompt-caching and more

OpenAI launched a voice API and image fine-tuning at its annual DevDay. But it failed to release two highly-anticipated models.

By Katie Hignett

October 2, 2024

OpenAI launched several tools at its annual DevDay on Tuesday, including a voice chat API and image fine-tuning.

But the organization failed to release a full version of the highly anticipated o1 model or the video-generation model Sora. Nor did it offer any updates on the GPT Store announced last year.

The main DevDay announcements include:

"Realtime API" capable of low-latency AI-generated voice response
Vision fine-tuning
Prompt caching (with discounts)
Model distillation

Here's a breakdown of all of them! Plus a look at the drama in OpenAI's C-suite:

Realtime API public beta

The low-latency speech-to-speech Realtime API was the biggest launch of the day. It gives developers the opportunity to make voice chat apps using six preset chat voices.

The API can’t make its own phone calls, but it does work with calling APIs like Twilio, as demonstrated by Romain Huet, who used it to order 400 chocolate-covered strawberries from a fictional store.

Interestingly, AI disclosures don’t come as standard for the new API. For now at least, the onus is on developers to let users know they’re speaking with an AI voice.

Secondary use cases

Other features include the ability to place pins on a map during chats, which can help users looking for place-based recommendations.

The Chat Completions API will also be upgraded to enable audio input and output without Realtime’s low-latency benefits.

What indie hackers are saying

Plenty of indie hackers are excited about the tech. Here's @SullyOmarr on X:

Whoa okay the realtime api looks kinda insane, low key more exciting than o1

it can use realtime tools + their voice to voice model, which is bonkers

i genuinely think this:

1) opens up a new wave of never before possible voice startups (sooo much to build)

2) this *might*…
— Sully (@SullyOmarr) October 1, 2024

Other X users pointed out flaws.

Although VC Deedy Das called the strawberry demo “awesome,” he took issue with its speed:

“The response latency is ~2s (cutting-edge is <400ms) and the voice doesn't feel as good as "advanced voice mode", it's still devoid of emotions.”

Price was another concern. X user @Simoarcher pointed out that the API is expensive compared to voice AI options that work by combining older models.

Developers will be able to make up their own minds in the coming days as Realtime API is rolled out via the OpenAI playground.

Runtime API in use in the OpenAI playground

Vision fine-tuning

Developers will be able to fine-tune GPT-4o models using pictures, which should make them better at interpreting images and recognizing objects.

But some images will be off limits: copyrighted images and those that don’t meet OpenAI’s safety rules.

OpenAI gave rideshare and food delivery app Grab the chance to test the feature out with its mapping service GrabMaps. Which worked (obviously): the app's route-mapping saw big improvements.

And this means that some Southeast Asia-based indie hackers may already have benefitted from the new tech.

Model distillation

Model distillation means that devs will be able to use bigger and more expensive models to train smaller ones. Think using GPT-4o to fine-tune GPT-40-mini.

This should improve the quality of a smaller model at a fraction of the cost of training a larger one from scratch. A new evaluation function will allow coders to measure how well a fine-tuned model performs.

The entire process will be managed through an integrated workflow in the OpenAI platform.

Prompt caching

From now on, developers will pay less for the prompts they use frequently. According to OpenAI's docs:

“By reusing recently seen input tokens, developers can get a 50% discount and faster prompt processing times.”

Developers using the latest versions of GPT-4o, GPT-4o mini, o1-preview, and o1-mini (or fine-tuned versions of these models) don’t need to do anything to get the discount, which OpenAI will apply automatically to input tokens it’s seen recently.

This is good news for developers using these models, but it may not be enough to win over those who aren’t. As TechCrunch notes, Anthropic already offers a better deal:

“OpenAI says developers can save 50% using this feature, whereas Anthropic promises a 90% discount for it.”

The drama at the top

DevDay comes hot on the heels of a major reshuffle at OpenAI, following the departure of three senior executives last week.

Chief Technology Officer Mira Murati, Chief Research Officer Bob McGrew and a vice president of research Barret Zoph had all left OpenAI.

Altman announced vice president of research Mark Chen will become senior VP of research, lead OpenAI’s research organization with Jakub Pachocki. Pachokcki has been made chief scientist.

Altman said his own focus will shift from the non-technical aspects of the organization to the product and technical side.

He seemed to reference the corporate turmoil that’s plagued the organization over the last year on X ahead of DevDay:

shipping a few new tools for developers today!

from last devday to this one:

*98% decrease in cost per token from GPT-4 to 4o mini
*50x increase in token volume across our systems
*excellent model intelligence progress
*(and a little bit of drama along the way)
— Sam Altman (@sama) October 1, 2024

Alongside pricing changes and technological progress, OpenAI had shipped “a little bit of drama” since the last devday.

The path to artificial general intelligence, he claimed, “has never felt more clear.”

For-profit business model in sight

If a recent Wall Street Journal report is anything to go by, its very likely the organization — which is still losing billions of dollars —will convert from a nonprofit to a fully-blown for-profit company within the next two years.

If it doesn’t, it risks having to pay back investors in a multi-billion-dollar funding round expected to close this week, per the Journal.

Katie is a journalist for Indie Hackers who specializes in tech, startups, exclusive investigations, and breaking news. She's written for Forbes, Newsweek, and more. She's also an indie hacker herself, working on EasyFOI.

Say something nice to krhignett…

Post Comment

2

I’m excited about the possibilities of voice applications. I want an AI to call utility/internet/banks/airlines for me

Kyle Moore

·
2 years ago
·
Reply
2

I've been obsessed with text-to-speech since way before AI was good (see this 2020 tweet for proof!). So I'm of course super thrilled about the Realtime speech-to-speech API.
In general I don't think people appreciate the value of tiny incremental improvements in this space. like this Deedy Das VC guy that whined about the response latency:
Deedy Das wrote: “The response latency is ~2s (cutting-edge is <400ms…”

Channing Allen

·
2 years ago
·
Reply
2

OpenAI’s Realtime API looks great for voice interactions, but I’ve been using Kodexia, a conversational AI platform that not only provides real-time responses but also adapts to customer interactions seamlessly. It's been a huge boost for us, especially in delivering more human-like and emotionally responsive conversations. Looking forward to comparing it with the new API!

Meyer Luanna

·
2 years ago
·
Reply
2

That's so amazing. AI will change the world in coming years.

AJ EPIC LUXURY BUS RENTAL DUBAI

·
2 years ago
·
Reply
1

Anyone have it on ChatGPT already? Missing on Playground as well.

Martin Baun

·
2 years ago
·
Reply
1. 1
  
  Just got it on Playground! What should I ask 😂?
  
  Katie Hignett
  
  ·
  2 years ago
  ·
  Reply
  1. 1
    
    ok, i just tried asking it for restaurant recommendations in Pererenan, Bali ... I got BBQ spots in Paraná and met the rate limit 1 minute in 🙃
    
    Katie Hignett
    
    ·
    2 years ago
    ·
    Reply