Sept. 23, 2024, 9:23 a.m.
For one German reporter, the statistical underpinnings of a large language model meant his many bylines were wrongly warped into a lengthy rap sheet.
When German journalist Martin Bernklau typed his name and location into Microsoft’s Copilot to see how his articles would be picked up by the chatbot, the answers horrified him. Copilot’s results asserted that Bernklau was an escapee from a psychiatric institution, a convicted child abuser, and a conman preying on widowers. For years, Bernklau had served as a courts reporter and the AI chatbot had falsely blamed him for the crimes whose trials he had covered.
The accusations against Bernklau weren’t true, of course, and are examples of generative AI’s “hallucinations.” These are inaccurate or nonsensical responses to a prompt provided by the user, and they’re alarmingly common. Anyone attempting to use AI should always proceed with great caution, because information from such systems needs validation and verification by humans before it can be trusted.
But why did Copilot hallucinate these terrible and false accusations?
Copilot and other generative AI systems like ChatGPT and Google Gemini are large language models (LLMs). The underlying information processing system in LLMs is known as a deep learning neural network, which uses a large amount of human language to “train” its algorithm. From the training data, the algorithm learns the statistical relationship between different words and how likely certain words are to appear together in a text. This allows the LLM to predict the most likely response based on calculated probabilities; LLMs do not possess actual knowledge.
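To make the idea of predicting from word statistics concrete, here is a deliberately tiny sketch. It is not how Copilot or any production LLM is built (those use deep neural networks, not raw counts), and the three-sentence corpus and function names are invented for illustration. It simply counts which word follows which and turns the counts into probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real training corpora hold hundreds of billions of words.
corpus = [
    "the reporter covered the trial",
    "the reporter covered the verdict",
    "the reporter covered the trial",
]

def train_bigrams(sentences):
    """Count how often each word is immediately followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn the raw follow-on counts for `word` into probabilities."""
    following = counts[word]
    total = sum(following.values())
    return {w: c / total for w, c in following.items()}

counts = train_bigrams(corpus)
probs = next_word_probabilities(counts, "the")
most_likely = max(probs, key=probs.get)
print(most_likely, probs)
```

The model picks whichever word most often followed "the" in its training text. It has no notion of whether the resulting sentence is true, only of which continuation is statistically likely.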
The data used to train Copilot and other LLMs is vast. While the exact details of the size and composition of the Copilot or ChatGPT corpora are not publicly disclosed, Copilot incorporates the entire ChatGPT corpus plus Microsoft’s own specific additional data. The predecessors of ChatGPT 4, ChatGPT 3 and 3.5, are known to have used “hundreds of billions of words.” Copilot is based on ChatGPT 4, which uses a larger corpus than ChatGPT 3 or 3.5. While we don’t know exactly how many words this is, the jumps between successive versions of ChatGPT have tended to be orders of magnitude in size. We also know that the corpus includes books, academic journals, and news articles.
And herein lies the reason that Copilot hallucinated that Bernklau was responsible for heinous crimes. Bernklau had regularly reported on criminal trials involving abuse, violence, and fraud, and his stories were published in national and international newspapers. His articles were presumably included in the language corpus, and they use specific words relating to the nature of the cases. Because Bernklau spent years reporting in courts, when Copilot is asked about him, the most probable words associated with his name relate to the crimes he has covered.
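The mechanism can be illustrated with a small, hypothetical example. The reporter "jane_doe" and the article snippets below are invented, and the counting scheme is a simplification rather than a description of how Copilot processes bylines. The point is that a purely statistical association does not record whether a name belongs to the author of a text or to its subject.

```python
from collections import Counter

# Hypothetical bylined court reports; "jane_doe" is an invented reporter.
articles = [
    {"byline": "jane_doe", "text": "defendant convicted of fraud"},
    {"byline": "jane_doe", "text": "court hears abuse case"},
    {"byline": "jane_doe", "text": "defendant convicted of fraud again"},
]

def words_associated_with(name, docs):
    """Count the words in every document that carries `name`,
    ignoring the role the name plays (author vs. subject)."""
    counts = Counter()
    for doc in docs:
        if doc["byline"] == name:
            counts.update(doc["text"].split())
    return counts

assoc = words_associated_with("jane_doe", articles)
print(assoc["fraud"], assoc["abuse"])
```

Crime vocabulary ends up strongly tied to the reporter's name purely because her byline sits next to it, which is the kind of correlation the article describes.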
A news report on Bernklau’s situation (in German).
This was not the first case of its kind, and we will probably see more in years to come. In 2023, U.S. talk radio host Mark Walters sued OpenAI, the company which owns ChatGPT. Walters hosts a show called Armed American Radio, which explores and promotes gun ownership rights in the U.S. The LLM had hallucinated that Walters had been sued by the Second Amendment Foundation (SAF), a U.S. organization that supports gun rights, for defrauding the foundation and embezzling its funds. This came after a journalist had queried ChatGPT about a real and ongoing legal case concerning the SAF and the Washington state attorney general.
Walters had never worked for SAF and was not involved in the case between SAF and Washington state in any way. But because the foundation has similar objectives to Walters’ show, one can deduce that the text content in the language corpus built up a statistical correlation between Walters and the SAF which caused the hallucination.
Correcting these issues across the entire language corpus is nearly impossible. Every single article, sentence, and word included in the corpus would need to be scrutinized to identify and remove biased language. Given the scale of the dataset, this is impractical.
The hallucinations that falsely associate people with crimes, such as in Bernklau’s case, are even harder to detect and address. To permanently fix the issue, Microsoft would need to remove Bernklau’s name as author of the articles in the corpus to break the connection.
To address the problem, Microsoft has engineered an automatic response that is given when a user prompts Copilot about Bernklau’s case. The response details the hallucination and clarifies that Bernklau is not guilty of any of the accusations. Microsoft has said that it continuously incorporates user feedback and rolls out updates to improve its responses and provide a positive experience.
How Copilot responded to questions about Martin Bernklau from a U.S. user, September 22, 2024.
There are probably many other examples that are yet to be discovered. It is impractical to try to address every individual issue. Hallucinations are an unavoidable byproduct of how the underlying LLM works.
As users of these systems, the only way for us to know that output is trustworthy is to interrogate it for validity using established methods. As my own research has shown, this could include finding three independent sources that agree with assertions made by the LLM before accepting the output as correct. For the companies that own these tools, like Microsoft or OpenAI, there is no real proactive strategy that can be taken to avoid these issues. All they can really do is react to the discovery of similar hallucinations.
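The three-source rule of thumb can be written down as a simple check. This is only a sketch under stated assumptions: the source records, the claim strings, and the idea of approximating "independent" as "from distinct publishers" are all hypothetical, and deciding whether a source really supports a claim still takes human judgment.

```python
def is_corroborated(claim, sources, required=3):
    """Return True if at least `required` distinct publishers support `claim`.

    Independence is approximated here by distinct publisher names,
    which is an assumption of this sketch.
    """
    supporting_publishers = {
        source["publisher"] for source in sources if claim in source["supports"]
    }
    return len(supporting_publishers) >= required

# Hypothetical sources a reader has checked by hand.
checked_sources = [
    {"publisher": "Paper A", "supports": {"claim_1"}},
    {"publisher": "Paper A", "supports": {"claim_1"}},  # same outlet, counted once
    {"publisher": "Paper B", "supports": {"claim_1", "claim_2"}},
    {"publisher": "Agency C", "supports": {"claim_1"}},
]

print(is_corroborated("claim_1", checked_sources))
print(is_corroborated("claim_2", checked_sources))
```

Two reports from the same outlet count only once, so a claim repeated by a single publisher does not pass the check.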