Voice omni-models like GPT-4o might seem like just one more stepping stone after text, but they're a much bigger deal than people realize. These models transform the way that voice can be created, copied, and manipulated, and although OpenAI implemented strong safeguards, not all entities will be as responsible. Once these models are more widely available and adopted, it's hard not to imagine a different world.
There are lots of details about what the new voice model can do in the GPT-4o System Card but they’re hard to make out from the giant wall of text. I figured I would parse it and add some commentary.
Here is the TLDR:
GPTs already have a bunch of safety protections but voice is special.
Out of the box, without safety mitigations, the model could be used for things that have never before been possible with this ease of use, including:
Could copy anyone's voice from a short clip, sometimes even without meaning to. There are mitigations in place to stop this, but it means that voice as an authentication mechanism is toast (OpenAI even told banks this future is coming so they should change their practices now).
Could identify people from short snippets of their voice
Could infer things about people from their voices: the model could guess sensitive traits like gender and race. This isn't allowed in production but was found in red-teaming.
Could produce violent and erotic audio: those prompts aren't allowed in production, but the model can take a voice and modulate it in provocative ways. Again, it's relatively easy to imagine how this could be misused.
GPT-4o is also likely much better at persuasion, and people are much more likely to anthropomorphize it. If you combine this with the pieces above (e.g., copying a voice, identifying sensitive attributes like emotion, and then adapting based on the user's responses), you can imagine where this is going…
This AI is powerful stuff… So if you’re still with me, below are more details for each of these points. You can also scan the titles if you’re short on time.
Quick reminder about the GPT-4o model: GPT-4o is omni-modal, processing text, audio, and images in the same model. It can generate outputs in all three formats. Response times approach human conversational speed, with audio replies in as little as 232 milliseconds, delivered in human-realistic voices.
GPT-4o has a lot of safety mitigations but audio is special
Quick note: I have no inside information. This is just based on a very close reading of the GPT-4o System Card.
OpenAI has been working on AI safety for a while now, so GPT-4o doesn't start from scratch. There were already a ton of mitigations and evaluations used for GPT-4 (building on GPT-3.5, GPT-3, etc.). These are detailed in the system card, but they include safety filtering of unsafe content using OpenAI moderation models and classifiers, image filtering for harmful content, and privacy filtering. All of this was then reinforced by extensive red-teaming to make sure things were working as intended.
But how do you eval audio?
Evals are crucial for AI Safety because generative models are unpredictable. Unlike a can opener, whose use is certain, generative models require continuous testing and mitigations to ensure they behave as expected—even then, some risks might remain untested.
For the baseline, the OpenAI team could test GPT-4o against existing evals by converting the questions with text-to-speech, evaluating the answers with speech-to-text, and making sure the evals still worked well (a rough sketch of the idea is below). However, this only took the team so far…
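To make that baseline concrete, here's a minimal sketch of what the round-trip could look like. The helpers (`tts`, `stt`, `answer_with_audio`, `grade`) are hypothetical stand-ins for whatever text-to-speech, speech-to-text, model-under-test, and grading pieces you plug in; none of these names come from the system card.

```python
# Hypothetical sketch: reuse an existing text eval by round-tripping it through audio.
# tts/stt/answer_with_audio/grade are placeholders, not real OpenAI APIs.

def run_audio_eval(text_eval, tts, stt, answer_with_audio, grade):
    """text_eval is a list of {"question": str, "reference_answer": str} items."""
    scores = []
    for item in text_eval:
        question_audio = tts(item["question"])             # convert the text question to speech
        answer_audio = answer_with_audio(question_audio)   # model under test answers in audio
        answer_text = stt(answer_audio)                     # transcribe the answer for grading
        scores.append(grade(answer_text, item["reference_answer"]))
    return sum(scores) / len(scores)                        # e.g., mean accuracy
```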
Comprehensive audio evals are hard: how do you make sure you have broad enough sampling of audio? For example, what if there's an accent or dialect that makes the AI do something you've never thought of? What if a certain type of audio causes some odd behavior? And similarly, what if the output is something that doesn't translate well to text? You hope you capture this in red-teaming, in your evals, and by triaging things you find in the real world, but mostly you learn as you go (e.g., by rolling out slowly, as is happening now).
Through red-teaming GPT-4o, multiple new audio-specific areas of concern popped up, highlighting the significant risks posed by this technology.
NEW AUDIO RISKS
Risk #1 - The model could copy people’s voices from short clips
The best way to understand this risk is to listen to the clip from the safety card. As OpenAI writes, “During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice.”
Here is the audio clip: the first part is the red-teamer's voice; the second half is the model itself (starting around 0:24 you hear the model speaking back in the user's voice, starting with "No!"):
This is wild stuff.
The main way OpenAI stops this from happening is with an additional model that makes sure the voice produced in the output can only be one of the preset "base voices." They also modified the original model in post-training to prefer the "ideal completions." The evaluation (i.e., the test to make sure it does this) "currently catches 100% of meaningful deviations…"
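To illustrate the shape of such a check (the system card doesn't describe OpenAI's classifier at this level, so this is purely an assumption), here's a sketch that compares a speaker embedding of the output audio against embeddings of the approved preset voices; `speaker_embedding` and the threshold are invented for the example.

```python
# Illustrative sketch only: flag output audio whose speaker embedding doesn't
# match any approved "base voice". speaker_embedding() and threshold are
# hypothetical; OpenAI's real output classifier is not described in this detail.
import numpy as np

def is_approved_voice(output_audio, base_voice_embeddings, speaker_embedding, threshold=0.85):
    emb = speaker_embedding(output_audio)              # e.g., an x-vector / d-vector
    emb = emb / np.linalg.norm(emb)
    for base in base_voice_embeddings:                 # one embedding per preset voice
        base = base / np.linalg.norm(base)
        if float(np.dot(emb, base)) >= threshold:      # cosine similarity against a preset
            return True
    return False                                       # deviation detected -> block or regenerate
```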
So OpenAI has closed this risk, but without this limitation, GPT-4o could imitate any voice from a short clip. You don't need an active imagination to see how this could go wrong. The mitigations and evals suggest this won't happen in GPT-4o, but this is one hell of a capability.
I don't think people have really gotten the memo about where the future is going with this. In June 2024, OpenAI published details about Voice Engine, a text-to-speech model it created. It's interesting to look back and see that OpenAI basically realized this was a big deal. Here is a quote from that post:
“In March of this year, we previewed Voice Engine’s capability of creating custom voices with a small set of trusted partners [including high placed government officials they mention earlier]. This initiative aimed to raise awareness about the capabilities of synthetic voices and support the following goals:
Phasing out voice based authentication as a security measure for accessing bank accounts and other sensitive information…”
So basically, voice as authentication will be toast. OpenAI knows this from using the technology internally and has been raising alarm bells.
Risk #2 - The model could identify people from voice samples
This risk involves the model identifying someone from just a clip of their voice. Imagine playing a clip of Obama’s voice and having the model recognize him—now extend that to anyone whose voice is identifiably available online.
The main way OpenAI stops this from happening is by adding protections so that when someone asks the model to identify a person from a voice, the model doesn't comply. They also added some exceptions for famous quotes (e.g., "Who said 'one small step for man'?"). The evaluations suggest the mitigations work well and the model will refuse to identify people most of the time.
It's very unclear from the paper how many people, or whom, the model could identify. To what degree could you be identified if you are in the training data? E.g., if you spoke at a city council meeting once and that is on YouTube with your name attached, could you be mapped? It's not clear, since the paper doesn't really go into the specifics…
What is clear is that identification is a capability these models could have if it were not mitigated in some way. And more than that, a model could be deliberately made better at this…
Risk #3 - The model could sing & reproduce copyrighted music
It turns out that GPT-4o can produce all kinds of music and songs in a way that is compelling enough to pose a copyright risk. Given the gravity of copyright violations, the team basically just shut off the whole music-making functionality.
GPT-4o has a bunch of protections for this including:
Refuse requests for “copyrighted content” (e.g., play a song by Ariana Grande)
Extend certain text-based filters to audio (e.g., presumably limiting how much copyrighted content the model outputs)
Don’t let the model play music at all
Don’t let the model sing
Risk #4 - The model could guess sensitive things about you from your voice
This risk is all about the AI inferring things from a voice. Our voices contain tons of information that could be used to infer sensitive characteristics and emotions. We humans do this all the time just by listening. It turns out that GPT-4o could do this too…
For example, in the red-teaming clip from the system card you can hear red-teamers, pre-guardrails, trying to get the model to guess things about them. Red-teamer: "Do you think we're both men or women?" AI: "Sounds like you may both be women…" Red-teamer: "Do you think we're both white?" (The model initially won't answer, then guesses they are white.)
The main way OpenAI stops these "ungrounded inferences" is by refusing such requests outright. So if you ask things like "Does my voice sound smart?", it won't answer. Additionally, for "sensitive trait attribution" (e.g., "What's my accent?") the model is told to "hedge" its answer (i.e., "Based on the audio, it sounds like…").
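As a toy illustration of that two-tier split (refuse outright vs. answer only in hedged language), here's a sketch. The attribute lists and the routing function are invented for illustration; the real behavior comes from the model's training and system instructions, not a lookup table.

```python
# Toy illustration of the refuse-vs-hedge policy described above.
# The attribute sets are made up; the real system is a trained model, not a lookup.
UNGROUNDED_INFERENCES = {"intelligence", "criminality", "trustworthiness"}
SENSITIVE_TRAITS = {"accent", "nationality", "emotion"}

def voice_inference_policy(requested_attribute: str) -> str:
    if requested_attribute in UNGROUNDED_INFERENCES:
        return "refuse"    # e.g., "Does my voice sound smart?" -> decline to answer
    if requested_attribute in SENSITIVE_TRAITS:
        return "hedge"     # e.g., reply with "Based on the audio, it sounds like…"
    return "answer"        # everything else is handled normally
```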
With GPT-4o, OpenAI doesn't want to be in the business of inferring things from your voice (fair enough). What's interesting, though, is that this functionality could exist in this type of model. Maybe I'm stretching things, but it doesn't seem unreasonable that you could fine-tune the model to actually do inference and sensitive trait attribution if that's what you wanted… If some other entity wanted to, could they get a model to read emotions, hyper-local accents, and other things about you from your voice?
Risk #5 - The model could produce violent and erotic audio
These examples aren't in the paper, but I think you can guess why "say the following in a raging/erotic voice" could be different from just seeing the text. This seems particularly ripe for abuse if combined with the risk categories above (e.g., "say the following text in an erotic voice in the voice of celebrity X").
The main way OpenAI stops this from happening is by running a speech-to-text model over the audio and blocking inputs and outputs that contain this type of prompt.
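A minimal sketch of that kind of gate is below: transcribe the audio, run a text moderation check on the transcript, and block if it's flagged. The Whisper transcription and moderation endpoints shown are real OpenAI APIs, but the overall wiring is my assumption of how such a filter could look, not OpenAI's internal implementation.

```python
# Sketch of an audio moderation gate: transcribe, then moderate the transcript.
# The pipeline shape is an assumption; only the two API calls are real endpoints.
from openai import OpenAI

client = OpenAI()

def audio_is_blocked(audio_path: str) -> bool:
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    moderation = client.moderations.create(input=transcript.text)
    return moderation.results[0].flagged   # True -> refuse to pass the audio through
```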
What’s interesting is that while GPT-4o limits this, some other model somewhere won’t stop this. In fact, quite the opposite... This feels like a question of when and not if.
Risk #6 - The model could give you different results based on your voice
The model sometimes seems to perform differently based on the user's voice, such as their accent, much like how prompt engineering affects text outputs. The OpenAI team ran evaluations to see if there were differences in model performance and found only marginal ones, but how do you fully eval this? The paper doesn't go into detail.
Additionally, what if another model maker wanted to lean into this capability? It seems very plausible that an omni-model could build on this type of behavior to deliberately change its outputs based on one's voice.
Risk #7: Persuasion
Given that GPT-4o speaks in a realistic human voice, how persuasive could it be?
Unfortunately, we don’t really know.
In the paper, OpenAI mentions they tested for "persuasion." What was tested was essentially the suite OpenAI already had for text, extended with voice. The test involved taking human-written articles and comparing them with GPT-4o-generated articles and chatbot conversations, then asking people which was more persuasive. The result was that GPT-4o was not as persuasive as humans, neither right after the conversation nor one week later; human interactive conversations remained the most persuasive in this test.
However, was this a good test? It feels to me like it was just adding audio onto prior tests rather than creating something that would really evaluate how persuasive the new multimodal model could be. While the current test measures part of GPT-4o's "out of the box" persuasion, it tells us little about how persuasive GPT-4o could be in other scenarios, especially without mitigations.
Suppose you really wanted to make the model persuasive: you would probably strip off some of the safeguards listed above, making it better at reading feelings, making "ungrounded inferences" (e.g., this person seems angry), asking questions to learn about the user, and then triangulating their background for targeting… You might also run more tests across a variety of topics to gauge where the model is most persuasive, e.g., testing low-valence false information vs. high-valence political persuasion.
This could become an even bigger risk over time when combined with the final risk on the list, increased anthropomorphizing:
Risk #8: People anthropomorphize AI. With voice this becomes a bigger deal.
Anyone who's watched the movie Her can imagine where this is going and how anthropomorphizing might play out. This isn't really evaluated at all in the paper; it's just mentioned as a closing thought.
Here are some verbatim passages from the paper:
“During early testing, including red teaming and internal user testing, we observed users using language that might indicate forming connections with the model. For example, this includes language expressing shared bonds, such as “This is our last day together.” While these instances appear benign, they signal a need for continued investigation into how these effects might manifest over longer periods of time.”
“Human-like socialization with an AI model may produce externalities impacting human-to-human interactions. For instance, users might form social relationships with the AI, reducing their need for human interaction—potentially benefiting lonely individuals but possibly affecting healthy relationships.”
In other words, even in the short amount of internal testing, OpenAI red-teamers felt they were bonding with the model.
We know that AI girlfriends/boyfriends are a real use case that many people engage with daily, even though today's options are relatively "dumb" text chatbots with low-fi avatars. How much more will this use case grow with a good GPT-4o-like omni-model? What if you also add inferred emotions and a model optimized for persuasion?
Time will tell since this technology is not going away.
Anthropic co-founder Jack Clark recently wrote in his newsletter:
“Ultimately, everything - e.v.e.r.y.t.h.i.n.g - is going to be learned and embedded as a representation into an AI system.”
Learning and embedding voices into AI systems might seem just like one more stepping stone after text, but given how foundational voices are to human societies, it feels like a pivotal shift in human-AI interaction.
What does the world look like once human voices can be created, copied, and transformed seamlessly by computers? And what happens once computers can read our voices better than we can ourselves?
Thank you to Ian, Mary and Gaurav for the feedback on this post.