On Saturday, an Associated Press investigation revealed that OpenAI’s Whisper transcription tool is producing fabricated text in medical and business settings, despite warnings against such use. The AP interviewed more than a dozen software engineers, developers, and researchers who found that the model regularly invents text that speakers never said, a phenomenon known in the AI field as “confabulation” or “hallucination.”
When it released Whisper in 2022, OpenAI claimed that the tool’s speech transcription accuracy approached “human-level robustness.” But researchers at the University of Michigan told the AP that Whisper produced false text in 80 percent of the public meeting transcripts they examined. Another developer, unnamed in the AP report, claimed to have found invented content in nearly all of the 26,000 test transcriptions he examined.
Fabrication poses particular risks in medical settings. Despite OpenAI’s warnings against using Whisper in “high-risk domains,” more than 30,000 healthcare workers now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among 40 health systems using an AI copilot service from medical technology company Nabla that is built on Whisper and fine-tuned for medical terminology.
Nabla acknowledges that Whisper can confabulate, but its tool also reportedly erases the original audio recordings “for data security reasons.” That could compound the problem, since doctors cannot verify a transcript’s accuracy against the source material. And deaf or hearing-impaired patients may be especially affected by mistaken transcripts, because they have no way of knowing whether the text in their medical records matches what was actually said.
Whisper’s potential problems extend beyond medicine. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found that Whisper added nonexistent violent content and racial commentary to neutral speech. They found that 1 percent of the samples contained “hallucinated phrases or entire sentences that did not exist in any form in the underlying audio,” and that 38 percent of those included explicit harms, such as perpetuating violence, making up inaccurate associations, or implying false authority.
In one example from the study cited by the AP, when a speaker described “two other girls and one lady,” Whisper added fictional text specifying that they “were Black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it as, “He took a big piece of a cross, a teeny, small piece ... I’m sure he didn’t have a terror knife so he killed a number of people.”
An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings and is actively studying how to reduce fabrications, incorporating feedback into updates of the model.
Why Whisper confabulates
What makes Whisper unsuitable for high-risk domains is its tendency to sometimes confabulate, or plausibly make up, inaccurate output. The AP reports that “researchers aren’t sure why Whisper and similar tools cause hallucinations,” but that isn’t quite true. We know exactly why Transformer-based AI models like Whisper behave the way they do.
Whisper is based on technology designed to predict the next most likely token (chunk of data) to appear after a set of tokens provided by the user. For ChatGPT, input tokens come in the form of text prompts. For Whisper, the input is tokenized audio data.
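To make that pipeline concrete, here is a minimal sketch using the open-source openai-whisper Python package. The “base” model size and the audio file name are illustrative assumptions; the AP report does not describe any particular code or deployment.

```python
# Minimal sketch: transcribing an audio file with the open-source
# openai-whisper package (pip install openai-whisper).
# "meeting.mp3" is a placeholder file name used for illustration.
import whisper

# Load one of the released checkpoints; "base" is a smaller, faster variant.
model = whisper.load_model("base")

# The audio is converted into tokens, and the decoder then predicts the most
# likely next text token, one after another -- the same next-token prediction
# that can yield fluent text with no basis in the underlying audio.
result = model.transcribe("meeting.mp3")

print(result["text"])
```

Because the decoder is trained to produce the most plausible continuation rather than a verified one, noisy or ambiguous audio can still come out as confident-sounding text.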