The Scene
As big tech companies scramble to gain an edge in the AI race, they all have different ideas about how far away the finish line is.
That’s influencing their strategies in big and small ways. For Google DeepMind CEO Demis Hassabis and his parent company, it has led to several methodical calculations. First, text alone would not get you to the endgame faster. They made a bet that reasoning would come through multimodal training data. Another part of that strategy was focusing on increasing the “context window” of models, or the ability to process more data in an exchange.
That meant starting a little bit slower, which has made Google seem slightly behind where it really was. In the edited conversation below, Hassabis addressed that perception and the tricks up his company’s sleeve to gain an AI edge.
The View From Demis Hassabis
This conversation has been edited for length and clarity.
Reed Albergotti: It’s often a critique that all the AI companies have the same technology, but I imagine there is a unique flavor to each. The multimodal aspects of Gemini probably put it in that category, right?
Demis Hassabis: You touched on one of the major things. From the beginning, I said Gemini was multimodal, and it’s still, by far, the best multimodal model. The reason we did that is we wanted it to be useful for things like Project Astra, our universal assistant. A proper universal digital assistant needs to understand the spatial-temporal context that you’re in, not just language, but understand the real world. We have made great progress with that, and Gemini is at the heart of it.
Also, we’re imagining a new project, Project Mariner, which would involve AI agents using the Chrome browser, understanding human interfaces, and then acting on your behalf and completing useful tasks. Finally, we’re very excited about robotics. For robotics, you also need exactly those attributes to understand an embodied intelligence, you need to understand the world around you.
It’s very interesting combining those technologies, those ideas, with the Gemini foundation models. I think that’s a new era that we’re very well poised for, and we just released a new thinking model yesterday. The applications area, AI for science, has always been my passion, too, and the reason I got into AI in the first place, with AlphaFold obviously being the prime example of that. But it’s just the beginning. I think we’ll be able to use AI to eventually cure most, if not all, diseases.
Because this model was trained on this multimodal approach from the beginning, do you see it becoming the generalized model for robotics and drug discovery? Is it all one?
Eventually, it will be as we get towards AGI and beyond. But for robotics, for sure. It’s exactly the kind of base model for Astra and robotics. An Astra thing, like smart glasses that can understand the world, is very similar to how you’d want a robot to process the world around you. But the biology stuff, that’s still more specialized systems on top of the general-purpose algorithms.
So AlphaFold isn’t just a straightforward application of transformers. It needs its own bespoke architecture that’s suited for that particular problem. A lot of scientific, deep problems are going to be like that for a little while until the general models become truly, fully general. We’re still a little bit away from that.
There’s been this narrative lately about progress plateauing in AI. Why do you think people insist that progress is plateauing?
The whole debate is: Are we still in an exponential, or is it a sigmoid? Because it looks like an exponential while you’re in the middle of the sigmoid, and very few things in nature or science continue to be an exponential forever. But the feeling people are getting is that some aspects, like the pre-training systems and the pre-training aspects of modeling, are not as straightforwardly scaling as in the early days. But you wouldn’t expect it to. There were incredible leaps in the early days on just the pre-trained base models.
Now the focus has changed to post-training and, more recently, inference-time compute (thinking, or processing), and each of those has its own scaling attributes. We believe there’s still a lot to be gained from pre-training innovations. We’re still getting very substantial gains that are worth investing in at a massive scale, as we’re seeing everyone doing, but it’s not as straightforward as it was before. That’s partially because you’re running out of real-world tokens and data. Although of course, there’s the question of creating synthetic data, which we’ve always done with our gaming work, and in AlphaFold, too.
I got to look at some of these projects, like Mariner — they would be such a time saver. Could they be a source of data?
Yeah, of course. You could utilize anything with the right permissions for data training, especially for improving those systems themselves. I think there’s still a lot of real-world data left to gain. There’s also still more video data. We’re only using a frame a second, or something. You could definitely up-sample that to 50 frames a second and get much better granularity on things.
There are really interesting, super-realistic game simulations now. Eventually, you want data that includes not just passive videos, but also what actions people are taking so you can mimic or model that. Then you start going towards what we call an ‘interactive world’ model, like Genie, which is sort of like text-to-game at some point, right? You text something in, and then you have an interactive video, and then eventually that’ll be a game. It harks back to my early career, but I would love to one day complete the circle with the modern AI on that.
If you were ‘you’ as a kid today, or maybe even a year or two in the future when we have these agents — how do you think your life would be different?
Someone asked me this question the other day: What would I be doing if I were 16 now, like when I started off on Theme Park? I guess there’s a lot of opportunity with societies of agents or multi-agent interactions. Maybe they’re competing, or perhaps they’re cooperating. There are also interesting tie-ins with crypto, digital currency, and agents. Because if you think about agents doing lots of stuff for you, maybe that is a good use case. Or maybe there’s a good use case there combined with crypto technologies. So if I were a young kid starting out today, those would be two of the things I’d be looking at.
I had this profound experience recently where I was staying up late trying to write code using AI (I’m not a coder), and I kept running up against token limits. But with what you and the other hyperscalers are building, there will be a massive compute increase that will raise those limits. How different is it going to be when you have unlimited tokens and memory?
That’s another big area where the Gemini models have a competitive advantage: the context windows. In our research work, we’ve gotten beyond 10 million tokens of context, and eventually, that will be effectively unlimited. That’s sort of the equivalent of working memory for us, just a ginormous one. But I think we probably also need a type of episodic memory… where you don’t store everything, you just store the important things. Then it would also be much more efficient to retrieve, right?
At the moment, having a massive context window is a kind of brute-force solution. We need a more elegant solution. That’s one of the things that’s probably still missing and we’re researching heavily. One of the nice things about our 2.0 Flash models and our smaller models is how performant they are for the speed, latency, and cost. You could imagine a world in a couple of years where we have this massive abundance of very efficient but pretty capable, small agents that are doing all sorts of things for you, perhaps negotiating with each other, all sorts of things. In the end, the whole way the web works will change.
What does that world look like when you can essentially automate anybody, you can automate their life, in a way?
What I hope with these digital assistants, firstly, is it enriches your life. It frees up a lot of your time to do things that you want. It recommends things so you’re never wanting for a great book, or [picks] the right wine for your dinner. I think that will be the first step, and then we’ll see where it progresses from there.
But imagine if you had an assistant that you really trusted and that was working for you. Think what you might want to do with that, in terms of preserving your own brain space and attention. At the moment, we’re kind of deluged with information bombardment, not from AI, just from normal algorithms, right? Social media algorithms and so on.
It would be great if you had a digital assistant that sort of dipped into that river of information and just picked out the things that you were interested in. Or, say you wanted to learn things, it would select those things out, and not distract you with all the other things. So it could be really good, actually, for some of the issues that we have to do with today’s modern technologies, like attention and breaking your flow.
So, congratulations on your Nobel Prize. Looking at your AlphaFold history, you picked out this thing that only a small subset of scientists were trying to solve. Is there something else that you are now pursuing?
That was very special. There aren’t many problems like the protein folding problem, right? I came across that as an undergrad, and I just found it was a fascinating problem. It would unlock so many things, and it’s perfectly suited, in many ways, to the AI techniques that we have.
We have lots of other things. I’d love to design a room-temperature superconductor, assuming it is possible with physics. I think our AI systems are perfectly suited to find new materials. We’ve done some preliminary work on that with our GNoME project, which we published in Science [Magazine] a couple of years ago. And we’re building up to our AlphaFold moment, I would say, for material design.
Another: Can you design the optimal battery? We could, I think, with AI, get to optimal designs of various materials, like “the perfect battery,” not just a better one. And all of those things would help with climate and energy.
Obviously, we’re working on fusion, on containing plasma in fusion reactors. And then there are fundamental areas in physics. We’re helping our quantum computing friends with error correction, collaborating with the Quantum AI group at Google. And then there’s extending what we’ve done with AlphaFold into drug discovery itself. As you know, protein design and protein structure are only small parts of the whole problem of drug discovery. So we’re pushing hard with Isomorphic and other things into the more chemistry part of that. I’ve probably left a lot of things out there.
Have you picked one?
Well, they’re all very exciting. I guess we’d love to cure one of the big diseases. Then probably the material science one. I think it’s the next thing on the production line of innovation. It is maybe at AlphaFold One level, and we’re building up to AlphaFold Two level.
That’ll be really interesting to watch. If there’s anything you can say about what Isomorphic has coming down the pike, I would love to know.
Yeah, it’s going really well. We’ve got great partnerships with Eli Lilly and Novartis. We’ve got over a dozen drug programs, and we hope to have our first AI-designed drug in clinical trials by the end of the year. So it’s going very well.
I want to ask you a little bit about chips. Do you think that the TPU is a big advantage for you?
I think so. We use both TPUs and GPUs. The way I would describe it is TPUs are super well-specialized and well-designed for things like large model training. It’s hard to beat them for specific tasks like that, where you know the architecture, and you know what you’re trying to build at a massive scale. GPUs we use a lot in our science work, where we’re still experimenting with architectures more and it’s not so much about the scaling. They’re sort of slightly more general purpose. They have different strengths and weaknesses. But for us, obviously, it’s great that we have the TPU line. It’s a competitive advantage for us.
Is that well suited for the test-time compute era?
We’re working on inference-specific chips that are based on the same TPU designs, but more for inference. We have these, sort of, light chips, we call them, that are very efficient for serving. And so, of course, that’s super useful, but it turns out that inference-time processing is now also super important for capability. It’s also a capability now. So then both the serving chips and the specialized, small models that are very performant become extra good, not just for serving billions of users at scale, but also for thinking for longer, right? So per second of thinking, how much processing can you actually do? And how much searching over the thinking space can you do?
How much do the current constraints with chips — knowing that they’re going to change soon and over time — hold back the things you’re able to roll out product-wise?
Sometimes you have the “victim of your own success” problem, where if you build a very performant model, like 2.0 Flash, everyone wants it. Which is great, but then suddenly you only have a set amount of chips. You need more for serving. Probably all of the leading labs are wrestling with those issues.
We’re always trying to drive up the efficiency of it. We have these cool dynamic pools of chips, which, at any one time, can switch to different things within a few seconds. This is something that Google infrastructure does very well, having to serve billions of users every day. And [there are] some amazing engineers who work on this kind of routing problem and scheduling problem. Then, of course, the other thing we need compute for is not just the training and serving of the large models, but also working on the next innovations. You don’t have to do that at full-model scale, but it turns out you need to do it at some reasonable scale because the extrapolation doesn’t really hold beyond 2x, 3x, or 4x of scaling. Something that works on a tiny model won’t necessarily scale to 100x that size. So, you need a reasonable amount of compute to try out all of your algorithmic ideas.
It sounds like this race is heavily dependent on compute. Amazon has Annapurna, they seem to be doing well. How much do you look at your competitors in this race?
You have to plan for hardware quite a few years out. So we have our own plans. We have our own design lines. Another advantage for us is we’re full stack. We’re probably the only company that goes from the bare metal to the chips to the data centers. And we also have the algorithms and the products. So there’s a feedback loop between knowing where the algorithms are going, and then what chips would you design? Compute is a key ingredient, so is data, so are talented researchers who come up with the algorithmic ideas. They’re all equally important. And of course, you need to track what everyone else is doing as well, and make sure you’re sizing that correctly, taking into account budgets, and forecasts as well as competitors and what they’re doing.
If you do the math, three years into the future, did you make a change in the chip architecture based on this new market dynamic?
We make changes to it all the time. We work very closely with the chip design team. And we have AlphaChip, which is a chip design AI that helps with some aspects of the chip design. Routing the transistors and the wires on the chip itself is a very complex problem. So we’re working two, three years out because of the time it takes a new chip to go from design to manufacture at scale. That’s a continual process. Obviously, you’ll start seeing the fruits of the most recent race stuff probably in a couple of years when the next generation of TPUs comes out.
Can we expect a bit of a jump then? More than we’ve seen in the last three years?
For sure, especially in terms of serving chips, which have come more into focus in the last couple of years. And this is billions of users being served. If you imagine a universal assistant that goes huge and is also a billion-user product, you’re going to need a lot of chips, right?
With the multimodal approach that you took early on with Gemini, were there trade-offs in terms of the speed of rolling it out that you made in order to play that long game?
It’s harder to get it to work because you’re adding in multimodal tokens. One of the big challenges of training large general models is you could add more data in a specific domain, let’s say, to play chess, right? You could add 10 million games of chess to your database. What you want is to get good at chess or good at multimodal without harming the other capabilities. Language, for example.
And if you get it right, it sometimes turns out that the additional capability, like mathematics or coding, actually helps your general reasoning capabilities. But it’s a careful balance. Does that specialized data actually help raise all the boats, or is it kind of a trade-off, where, if you start specializing the network to be good at that, you end up weakening yourself in the more general systems?
That’s where the tool use part comes in. It’s like, at which point should you actually call a separate tool rather than try to incorporate that in the single brain, the single giant model? Multimodal, we always felt, was a key part of the model understanding the world because ultimately, we want a world model, not just a language model. And then you want to do planning and reasoning over the world model, not just a language model, or, originally, our game models. We have some cool things in the pipeline using the multimodal capabilities.