Why Did Google’s AI Overviews Fail?
“Google will do the googling for you.” That was the promise the tech giant made earlier this month when it announced it would launch its AI-powered search tool. The new feature, AI Overviews, sits above the usual search results and serves up key information and links in quick, AI-generated summaries.
There is a catch, though: AI systems are inherently unreliable. Within days of AI Overviews’ launch in the US, users began posting examples of responses that were strange, to put it mildly. It claimed that former US president Andrew Johnson earned university degrees between 1947 and 2012, despite dying in 1875, and it advised users to eat at least one small rock a day and to add glue to pizza.
Liz Reid, head of Google Search, said on Thursday that the company has been making technical improvements to the system to reduce the likelihood of inaccurate answers, including better mechanisms for detecting nonsensical queries. It is also limiting the use of satirical, humorous, and user-generated content in responses, since that kind of material can lead to misleading advice.
How Do AI Overviews Work?
To understand why AI-powered search engines get things wrong, it helps to look at how they have been optimized to work. We know that AI Overviews uses a new generative AI model in Gemini, Google’s family of large language models (LLMs), that has been customized for Google Search. That model has been integrated with Google’s core web ranking systems, which it uses to pull relevant results from the company’s index of web pages.
Most LLMs produce fluent-sounding text by predicting the next word (or token) in a sequence, which leaves them prone to fabrication: each word is chosen on a purely statistical basis, with no grounding in fact. That is where hallucinations come from. The Gemini model in AI Overviews likely sidesteps part of this problem by using retrieval-augmented generation (RAG), a technique that lets an LLM consult specific sources outside its training data, such as particular web pages, according to Chirag Shah, an expert in online search at the University of Washington.
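To make the “statistical, not factual” point concrete, here is a minimal sketch in Python, using an invented toy distribution rather than anything from Gemini, of how a language model picks each next token by sampling from a probability distribution, so a fluent but false continuation can easily win out:

```python
# Toy illustration: an LLM chooses each next token from a probability
# distribution learned from text, not from a database of verified facts.
import random

# Hypothetical next-token probabilities after the prompt
# "The capital of Australia is" (numbers invented for illustration).
next_token_probs = {
    "Sydney": 0.55,    # common in web text, but wrong
    "Canberra": 0.40,  # correct
    "Melbourne": 0.05,
}

def sample_next_token(probs):
    """Sample one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

# A fluent-sounding completion is sampled whether or not it is true.
print("The capital of Australia is", sample_next_token(next_token_probs))
```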
A major advantage of RAG is that the answers it gives to a user’s query should be more current, accurate, and relevant than those from a regular model that relies only on its training data. The technique is commonly used to try to prevent LLM hallucinations.
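As a rough illustration of the idea, not Google’s implementation, a RAG pipeline first retrieves a handful of documents related to the query and then asks the model to answer using only those documents. In the sketch below, the naive word-overlap retriever and the generate() stub are hypothetical placeholders:

```python
# Minimal retrieval-augmented generation (RAG) sketch: fetch documents
# related to the query, then ground the model's answer in those documents.
DOCUMENTS = [
    "Official store hours: open 9am to 5pm, Monday through Friday.",
    "Holiday notice: the store is closed on public holidays.",
    "Unrelated blog post about sourdough baking.",
]

def retrieve(query, docs, k=2):
    """Rank documents by naive word overlap with the query and keep the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def generate(prompt):
    """Placeholder for the LLM call a real system would make."""
    return f"[LLM answer would be generated from this prompt of {len(prompt)} characters]"

query = "What time does the store open?"
context = "\n".join(retrieve(query, DOCUMENTS))
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {query}"
print(generate(prompt))
```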
Why Does It Return Bad Answers?
RAG has its limits, though. For a RAG-based LLM to succeed, it must both retrieve the right information and generate the right response from it; if either stage fails, the result is a bad answer. Shah believes AI Overviews suggested glue on pizza because the user had asked about cheese not sticking to pizza, and a joke post about adding glue merely sounded related. Retrieval went wrong, and “the generation part of the process doesn’t question that,” he says.
When a RAG system encounters conflicting information, such as an old version of a policy manual and a newer edition, it cannot tell which one to rely on. It may blend information from both to produce a misleading answer.
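Reusing the same naive word-overlap scoring as the earlier sketch, the contrived example below shows how two conflicting versions of a policy can look equally relevant, so both land in the prompt with nothing telling the model which one is current (the documents are invented for illustration):

```python
# Contrived example: conflicting sources are both retrieved, and the prompt
# gives the model no signal about which version is authoritative.
DOCUMENTS = [
    "2019 policy manual: employees may carry over 5 vacation days.",
    "2024 policy manual: employees may carry over 10 vacation days.",
]

def retrieve(query, docs):
    """Naive word-overlap scoring; both versions match the query equally well."""
    q_words = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)

query = "How many vacation days can employees carry over?"
context = "\n".join(retrieve(query, DOCUMENTS))
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {query}"
# With both versions in the context and no ranking or date cues, the model
# may answer with either figure, or blend them into something false.
print(prompt)
```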
“The large language model generates fluent language based on the provided sources, but fluent language is not the same as correct information,” says Suzan Verberne, a Leiden University professor specialising in natural-language processing.
The more specific a topic is, the higher the chance of misinformation in a large language model’s output, she adds: “This is a problem in the medical domain, but also education and science.”
A Google spokesperson said that when AI Overviews returns inaccurate results, it is usually because there are few high-quality web sources relevant to the query, or because the query most closely matches content on satirical or humorous sites.
The spokesperson said that the vast majority of AI Overviews provide high-quality information and that many of the bad answers came in response to uncommon queries. They added that AI Overviews containing potentially harmful, obscene, or otherwise unacceptable content appeared in fewer than one in every 7 million unique queries. Google has been removing AI Overviews from certain queries in line with its content policies.
It’s Not Just About Bad Training Data
The glue-on-pizza gaffe shows AI Overviews pointing users to an unreliable source, but the system can also generate misinformation from a source that is factually correct. Melanie Mitchell, an AI researcher at the Santa Fe Institute in New Mexico, googled “How many Muslim presidents has the US had?” AI Overviews responded: “Barack Hussein Obama is the only president of the United States who is Muslim.”
That answer is incorrect: Barack Obama is not Muslim. But AI Overviews drew it from a chapter in an academic book titled “Barack Hussein Obama: America’s First Muslim President?” According to Mitchell, the AI system misread the text and missed its point entirely. “There are a few problems here for the AI; one is finding a good source that’s not a joke, but another is interpreting what the source is saying correctly,” she adds. “This is something that AI systems have trouble doing, and it’s important to note that even when it does get a good source, it can still make errors.”
Can the Problem Be Fixed?
Ultimately, AI systems are not trustworthy, and as long as they use probability to generate text word by word, the risk of hallucination will remain. AI Overviews will likely improve as Google tweaks it behind the scenes, but we can never be certain it will be 100% accurate.
Google says it is adding triggering restrictions for queries where AI Overviews were proving less helpful, and it has already added more “triggering refinements” for health-related queries. The company could also build a safeguard into its information retrieval process that flags risky queries and has the system decline to answer them, says Verberne. A Google spokesperson says the company does not intend to show AI Overviews for sensitive or explicit content, or for queries that suggest a vulnerable situation.
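One way to picture the kind of safeguard Verberne describes is a simple gate in front of the AI answer that declines to trigger an overview for flagged query types; the blocked-topic list and keyword check below are purely illustrative, not Google’s actual policy machinery:

```python
# Illustrative safeguard sketch: decline to generate an AI overview for
# query categories the system is not confident it can answer safely.
BLOCKED_TOPICS = {"self-harm", "dosage", "overdose"}  # invented example list

def should_answer(query):
    """Return False if the query touches a blocked topic (toy keyword check)."""
    words = set(query.lower().split())
    return words.isdisjoint(BLOCKED_TOPICS)

def answer(query):
    if not should_answer(query):
        return "No AI overview shown; fall back to standard search results."
    return "An AI overview would be generated here."

print(answer("safe medication dosage for a child"))
```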
Techniques such as reinforcement learning from human feedback can also improve the quality of LLM answers. Similarly, LLMs could be trained to recognise when a question cannot be answered, and it would help to teach them to carefully assess the quality of a retrieved document before generating a response. As Verberne puts it, “Proper instruction helps a lot!”
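A sketch of the kind of check Verberne is pointing at might assess each retrieved document before answering and abstain when nothing passes the bar; the quality_score() heuristic below is a stand-in for whatever classifier or reranker a real system would use:

```python
# Sketch: assess retrieved documents before answering, and abstain when
# nothing is good enough. quality_score() is a placeholder heuristic.
def quality_score(doc):
    """Toy heuristic: longer, non-joke documents score higher."""
    penalty = 0.5 if "lol" in doc.lower() or "joke" in doc.lower() else 0.0
    return min(len(doc) / 200, 1.0) - penalty

def answer_or_abstain(query, retrieved_docs, threshold=0.4):
    """Answer only from documents that clear the quality threshold."""
    good = [d for d in retrieved_docs if quality_score(d) >= threshold]
    if not good:
        return "No reliable source found; declining to answer."
    # A real system would now pass the query and the vetted documents to the model.
    return f"Answer grounded in {len(good)} vetted source(s)."

docs = [
    "lol just use glue so the cheese sticks",
    "Cheese adheres better when the sauce is slightly thicker and the pizza "
    "rests for a minute or two after baking, according to food-science guidance.",
]
print(answer_or_abstain("Why doesn't cheese stick to my pizza?", docs))
```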
Google has added a label to AI Overviews answers that reads “Generative AI is experimental,” but Shah says it should go further, making it much clearer that the feature is in beta and stressing that it is not yet ready to provide fully reliable answers. “Until it’s no longer beta—which it currently definitely is, and will be for some time—it should be completely optional. It should not be forced on us as part of core search.”