New Studies Challenge Google’s Claims on Gemini AI Models’ Capabilities

Gemini AI Models: Gemini 1.5 Pro and 1.5 Flash, two of Google’s most prominent generative AI models, have received a lot of praise for their capacity to handle and analyze massive volumes of data, including summarizing long documents and searching scenes in film footage. New evidence, however, indicates that these models fail to deliver as promised.

Gemini AI Models Struggling with Long Contexts

According to two research that looked at how well Google’s Gemini models deal with big datasets, 1.5 Pro and 1.5 Flash frequently have trouble providing accurate answers regarding lengthy texts. Only 40% to 50% of the time did the models get the questions right in the document-based tests. “While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” said Marzena Karpinska, a postdoc at UMass Amherst.

Context Window Limitations

The amount of input data that a model can consider before generating an output is called its context window. According to Google, the latest Gemini versions can manage 2 million tokens, which is about 1.4 million words, two hours of video, or 22 hours of audio. The models’ capacity to reason and provide answers across such vast contexts is, however, constrained, as demonstrated by practical tests.

Research Findings

In one study, researchers tested the models with true/false statements about recent fiction books. Gemini 1.5 Pro answered correctly only 46.7% of the time, while Flash managed just 20%. These results suggest that the models are less effective than a random guess at understanding long texts.

A second study focused on the models’ ability to reason over videos. The researchers created a dataset of images and questions, finding that Flash struggled significantly. In one test, Flash correctly transcribed only 50% of handwritten digits from a series of images, with accuracy dropping to 30% for more digits.

Overpromising Capabilities

Both studies indicate that Google may have overpromised the capabilities of its Gemini models. While other models, like OpenAI’s GPT-4 and Anthropic’s Claude 3.5, also performed poorly, Google has heavily marketed the context window as a key feature.

Industry Implications

Generative AI is under increasing scrutiny as businesses and investors grow frustrated with the technology’s limitations. Surveys by Boston Consulting Group found that half of the C-suite executives surveyed do not expect substantial productivity gains from generative AI and worry about potential mistakes and data compromises.

Calls for Better Benchmarks

Researchers like Karpinska and Michael Saxon of UC Santa Barbara argue that better benchmarks and third-party critiques are needed to validate claims about generative AI capabilities. Current benchmarks, such as “needle in the haystack,” often cited by Google, only measure information retrieval and not the ability to answer complex questions.