Courtesy of Gemini Robotics

Google unveiled Gemini Robotics earlier this morning, a new family of AI models built on Gemini 2.0, which it released late last year. The accompanying research paper shows off some astonishing capabilities, like AI robots that can fold origami, offering a glimpse into the future of AI.

“Gemini 2.0’s embodied reasoning capabilities make it possible to control a robot without it ever having been trained with any robot action data,” the paper reads. “It can perform all the necessary steps, perception, state estimation, spatial reasoning, planning and control, out of the box. Whereas previous work needed to compose multiple models to this end.”

In other words: you build the physical robot, and Google’s AI models will make it work.

This is an area we’ve been watching closely. We wrote last month about Apptronik, a humanoid robotics startup that partnered with Google DeepMind. Jeff Cardenas, the company’s CEO, told me he believed the final roadblock to building humanoid robots had been cleared: powerful, general-purpose AI models capable of understanding the physical world.

Apptronik made a bet that the trajectory of AI software capabilities would eventually enable robotics. So it decided to build the hardware platform (the custom-designed digits with fine motor skills, for instance) and wait for a company to come along with an AI model that could control its Apollo robots. That is essentially what DeepMind unveiled today.

Granted, it is still in the research phase, and it will be a while before we have Rosey the Robot. But the key here is that Google has built a foundation model that uses the real world to learn how to reason. Large language models have shown some ability to “reason” when given text inputs, but it’s also very easy to show the limitations of text-based reasoning.

DeepMind’s leadership has long believed that the path to artificial general intelligence is multimodal. Put simply, if we want AI models that can reason like humans, they need to learn from the same inputs humans use. The human brain remains a marvel. Whatever our neurons are doing as our brains develop, the process is built on a foundation of patterns represented in the physical world, even before we learn language and can read text.

I recently asked Google DeepMind co-founder Demis Hassabis whether he believed Gemini would evolve into the same model that powers robotics and self-driving cars, and he said, essentially, yes. Today’s paper shows pretty clearly how that will happen. It’s inevitable that Alphabet’s robotaxi company, Waymo, will eventually replace its current technology with a single AI model, likely some future version of Gemini.

An important thing to note: multimodal AI requires far more compute than text-based AI. An image breaks down into many more “tokens” than a comparable passage of text (a rough back-of-the-envelope comparison appears at the end of this piece). This is partly why Google has been pushing the limits of AI “context” windows, or how much information models can hold onto before they forget what they were doing.

As companies like Google, Amazon, Meta, and Microsoft spend hundreds of billions of dollars on new data centers packed with powerful AI processors, we will see an acceleration of AI technology that uses a model of the world to reason. These models will enable robotics and autonomous vehicles, but they will be the same ones that write code and produce research reports. If that seems odd, think about how well you’d be able to reason without ever having a sense of smell, taste, or touch.
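To put rough numbers on the “more tokens in images” point, here is a minimal, illustrative sketch. It assumes a ViT-style image encoder that turns each 16×16-pixel patch into one token and a text tokenizer averaging roughly 0.75 English words per token; neither figure comes from Google’s paper, they are just common ballpark values.

```python
# Back-of-the-envelope comparison of text vs. image token counts.
# Assumptions (illustrative only, not from the Gemini Robotics paper):
# - patch-based image tokenization: one token per 16x16-pixel patch
# - English text averages ~0.75 words per token

def text_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Approximate token count for a block of English text."""
    return round(word_count / words_per_token)

def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Approximate token count for an image under patch-based tokenization."""
    return (width // patch) * (height // patch)

if __name__ == "__main__":
    paragraph = text_tokens(200)        # a ~200-word paragraph -> ~267 tokens
    photo = image_tokens(1024, 1024)    # one 1024x1024 image -> 4096 tokens
    print(f"~{paragraph} tokens for the paragraph")
    print(f"~{photo} tokens for the image")
    print(f"the image costs ~{photo / paragraph:.0f}x more tokens")
```

Under these assumptions a single photo consumes roughly fifteen times the context of a substantial paragraph, which is why longer context windows matter so much once video and camera feeds enter the picture.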