Google reveals new AI capabilities built on top of its Gemini 2.0 foundational model

Dec 11, 2024, 3:08pm EST
tech
 The logo for Google is seen at a Google store in Manhattan, New York City
Andrew Kelly/File Photo/Reuters

The News

Google showed off some of its tantalizing AI tools to me Wednesday, built atop Gemini 2.0 — the next major iteration of its advanced foundational model, trained on images and audio as well as text — which pushes further into “multimodality.”

Unfortunately, unless you’re one of Google’s trusted testers, you can’t use these tools yet, owing to the company’s cautious approach to product rollouts.

[Chart: Number of foundational models developed by organizations in 2023, with Google leading]


Know More

During a visit I made to the company’s Mountain View headquarters, Google shared the latest “Astra” capabilities and a new, particularly useful Chrome plugin called “Project Mariner,” which takes control of your web browser and autonomously completes tasks for you.


Mariner, which is still in the testing phase, can essentially figure out any web interface, or at least take a crack at it, removing all kinds of friction that exist on the web today. You could ask Mariner to do that grunt work — scouring government websites, or health care and kids’ school forms — while you focus on something else.

It may be a while before this is generally available, though other startups have tried similar tools and garnered only limited adoption. The more powerful the AI tool, the riskier it is to release into the wild. But once you see it, there’s no going back.

The other demo showed updated capabilities for Project Astra, which uses your phone’s camera to scan the world and then answer questions based on what it sees. It’s incredibly fast, and it now has a “memory” that lets it answer questions about anything it has seen in the last ten minutes.


Product managers showed how you can scan a bunch of wine bottles and ask it about the selection and prices, for example.


Reed’s view

Once these features are eventually released, they will be quickly and widely adopted. And the data Google gathers from them could be extremely valuable. While current LLMs largely rely on text and computer code, the next generation needs real-life, visual data to build a reasoning engine with a true understanding of the world. I asked Google execs about this (more on that in Friday’s newsletter), and they hesitated to go into detail, so this is largely speculation.
