Google’s latest generative AI models, Gemini 1.5 Pro and 1.5 Flash, are billed as groundbreaking tools capable of processing and analysing vast amounts of data with unprecedented accuracy. The models are said to be powerful enough to summarize lengthy documents and complex data sets, tasks that have traditionally been challenging for AI. However, emerging research suggests that they may not live up to Google’s lofty claims.
Gemini’s data-analyzing abilities under scrutiny
As per a TechCrunch report, Google has promoted its Gemini models as being capable of handling enormous volumes of data. The company has claimed that Gemini 1.5 Pro and 1.5 Flash can perform tasks like summarizing hundreds of pages of text and searching across scenes in video footage with ease.
This capability is attributed to the models’ ‘long context’ feature, which enables them to maintain an understanding of extensive and complex data inputs over time.
Long-context processing: Research findings
Contrary to Google’s claims, recent studies have cast doubt on the effectiveness of Gemini’s long-context abilities. Two separate studies evaluated how well these models perform when processing large datasets roughly equivalent in length to works like ‘War and Peace’. The results are reportedly not impressive.
In one set of tests, Gemini 1.5 Pro and 1.5 Flash were found to answer questions about large datasets correctly only 40 to 50 per cent of the time.
Marzena Karpinska, a postdoctoral researcher at UMass Amherst and co-author of one of the studies, noted, “While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content.”
Context Window: What is it?
A model’s context window refers to the amount of input data it can consider before generating a response. Inputs can range from a simple question to an entire movie script or a long audio clip. The size of the context window determines how much data can be processed in one go.
The latest versions of Gemini reportedly boast the ability to handle up to 2 million tokens of context. Tokens are the smallest units of data a model processes, akin to syllables in words. For perspective, 2 million tokens are roughly equivalent to 1.4 million words, 2 hours of video, or 22 hours of audio. This is currently the largest context capacity of any commercially available AI model.
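The figures above can be turned into a rough back-of-the-envelope converter. This is a minimal sketch, not anything from Google’s tooling; the per-token rates are assumptions derived purely from the approximate equivalences cited in this article (2 million tokens ≈ 1.4 million words, 2 hours of video, or 22 hours of audio).

```python
# Rough conversion of a token budget into familiar units, using the
# approximate ratios cited for Gemini 1.5's 2-million-token window.
# These rates are illustrative assumptions, not official figures.

WORDS_PER_TOKEN = 1_400_000 / 2_000_000        # ~0.7 words per token
VIDEO_HOURS_PER_TOKEN = 2 / 2_000_000          # ~2 hours per 2M tokens
AUDIO_HOURS_PER_TOKEN = 22 / 2_000_000         # ~22 hours per 2M tokens

def context_equivalents(tokens: int) -> dict:
    """Estimate what a given token budget roughly corresponds to."""
    return {
        "words": round(tokens * WORDS_PER_TOKEN),
        "video_hours": round(tokens * VIDEO_HOURS_PER_TOKEN, 2),
        "audio_hours": round(tokens * AUDIO_HOURS_PER_TOKEN, 2),
    }

print(context_equivalents(2_000_000))
# {'words': 1400000, 'video_hours': 2.0, 'audio_hours': 22.0}
```

Halving the budget to 1 million tokens, for instance, yields roughly 700,000 words, 1 hour of video, or 11 hours of audio under the same assumed ratios.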
Google's Promises vs. Reality
Earlier this year (2024), Google showcased demos highlighting Gemini’s long-context potential. One demo illustrated how Gemini 1.5 Pro could search through a 402-page transcript of the Apollo 11 moon landing telecast to find jokes and match scenes to pencil sketches. However, these demonstrations may have overstated the model’s real-world abilities.
In one of the studies, researchers from the Allen Institute for AI, Princeton, and UMass Amherst tested the models’ abilities to evaluate true/false statements about contemporary fiction books. These books were chosen because, being recent, they were unlikely to appear in the models’ training data, preventing the models from relying on pre-existing knowledge.
The models were presented with detailed statements that required comprehension of the books’ entire plots to verify their accuracy. For example, a statement like, “By using her skills as an Apoth, Nusis can reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest,” was used to test the models.