
The Scene
AUSTIN, TX — Amazon’s gamble to take on Nvidia, which has driven the e-commerce giant’s biggest investment yet this year, is also a bet on the startup Anthropic.
By itself, Amazon’s five nanometer Trainium 2 microprocessor is not as powerful as Nvidia’s latest AI chip, coveted by companies like OpenAI and xAI for its ability to train the next generation of powerful AI models.
But Amazon hopes its homemade silicon, dreamed up by Annapurna Labs, an Israeli chip startup the company acquired in 2015 for $350 million, will be used to build the most powerful computer in the world — dubbed “Project Rainier.”
Amazon’s success or failure is not riding on the raw power of each individual chip, but on a meticulously planned vertical integration in which entire data centers, down to each screw, copper wire and cooling fan, are engineered to squeeze every ounce of compute power from hundreds of thousands of Trainium 2 chips.
“We take vertical integration to an extreme,” said Rami Sinno, director of engineering for Annapurna, during a tour of the chipmaking facility. “This concept of power and power efficiency permeates everything we do.”
If the plan works, it won’t just be a win for Amazon, but also for Anthropic, the AI company behind the Claude AI chatbot. It has become a favorite among professional software developers and “vibe coders,” whose main gripe with the tool is the rate limits that cut users off to keep costs under control.
Anthropic is Amazon’s most important customer and has agreed to use Rainier to train the next version of Claude, making it more capable and cost-effective and providing more of the coveted “tokens” for Claude’s users.
Anthropic, boosted by an $8 billion investment from Amazon on its way to a $60 billion valuation, used Google Tensor Processing Units and Nvidia GPUs to train the previous versions of its Claude models.
Two people familiar with the matter told Semafor that the company’s agreement to use Amazon’s custom chips is separate from Amazon’s decision to invest in the company.
However Anthropic came to its decision, it is a win for Amazon; luring a leading foundation model company away from Nvidia is not easy.
Since 2006, Nvidia has been improving and adding functionality to CUDA, a powerful software platform that allows AI researchers and other programmers to run nearly any machine-learning algorithm or AI model on Nvidia GPUs.
Because of CUDA’s head start, competing with Nvidia is incredibly difficult.
Know More
Anthropic may also benefit from diversifying away from Nvidia, which has faced shortages, frustrating companies like OpenAI and Microsoft. Compute efficiency has become increasingly important in the AI industry, as companies have had difficulty satisfying soaring demand for the technology.
AI models require the world’s biggest computers for training — but companies have found ways to increase the capabilities of models during the inference phase, when models are responding to individual prompts. That trend, known as “test-time compute,” has drastically increased the need for data centers.
Even so, Amazon has faced questions from detractors about whether it can entice the AI world to use its custom-made chips.
Amazon says its Trainium chips have found a market. “Every single chip that we build and deliver has customers waiting for it,” said Sinno.
According to Gadi Hutt, director of product and customer engineering for Annapurna, the collaboration between the two companies began before Amazon invested in Anthropic.
In an interview at the Austin design and testing facility, Hutt recalled one of the earliest interactions he had with Anthropic, not long after the San Francisco research firm was founded in 2021.
Annapurna gave Anthropic researchers one of its first-generation Trainium chips so that they could “take it for a spin” over the weekend. Before the weekend was over, an Anthropic employee had discovered a performance-sapping flaw in the chip’s compiler, the software that translates AI code into instructions the processor can execute.
“That was just one weekend of work that proved to us this is a super strong team that we were really eager to continue working with,” Hutt said. “It took some time on the business side.”
Step Back
AI researchers, while brilliant, aren’t usually well versed in the ins and outs of the actual silicon used to do the trillions of calculations that make their work possible.
Tom Brown, co-founder and chief compute officer at Anthropic, told Semafor he has spent his career bending the world’s most powerful computers to his will, without ever seeing them up close.
“To my great shame, I’ve been training big models for around 10 years and I’ve still never been to one of the physical data centers,” he said.
But that has not stopped Brown and his colleagues from dissecting the inner workings of powerful AI chips down to the core software that controls them.
Anthropic, Brown says, has hired skilled engineers who know how to reverse engineer Nvidia GPUs to get at their instruction set architecture, the low-level interface that defines the operations the hardware can execute. It is so core to how the chips work that Nvidia tries to obscure the information to keep competitors from seeing it.
By gaining access to the information, Anthropic can better optimize its models to run or train more efficiently. “But it’s really annoying to do that when they’re trying to obfuscate it,” Brown said.
He said one key benefit to switching to Trainium 2 was the fact that Amazon agreed to open up its instruction set, removing a pain point and allowing better optimization.
Working with Trainium chips required a learning curve, Brown said. “We’re the only lab that has more than one chip that we design for because it’s a huge cost to do that, but then once you do it, it means that now you’ve paid this big upfront cost, you can reap the benefits,” he said.
While few companies have the talent and resources to take advantage of that level of access, Anthropic and a handful of other firms can use it to help improve the chip.
If Anthropic continues to train its models using Trainium chips, a side benefit will be that the models will likely run most efficiently using Amazon’s architecture, turning many of Anthropic’s customers into de facto customers of Amazon Web Services.

Now What?
When compute clusters get as big as Rainier, where an undisclosed number of chips, in the hundreds of thousands, are networked together, tiny optimizations that would normally have no impact are suddenly amplified to meaningful levels.
On a tour through an Annapurna Labs chip testing area, Sinno explained how moving components around by a minuscule amount can increase electrical efficiency.
The effort is like Tetris for engineering geniuses: move everything as close together as possible, reducing the distance each electron has to travel, while finding creative ways to carry heat away from the chips to keep them from overheating.
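The physics behind that game of engineering Tetris can be sketched with a back-of-the-envelope calculation. The numbers below are illustrative assumptions, not figures from Amazon or Annapurna: signals in copper travel at roughly two-thirds the speed of light, so every centimeter shaved off a trace saves a fraction of a nanosecond, and those fractions compound across hundreds of thousands of chips exchanging data constantly.

```python
# Back-of-the-envelope sketch of why shrinking physical distance matters.
# The propagation speed is a rough textbook figure for signals in copper
# (about two-thirds the speed of light); it is an assumption for
# illustration, not a number from Amazon or Annapurna.
SIGNAL_SPEED_M_PER_S = 2e8

def propagation_delay_ns(distance_m: float) -> float:
    """One-way signal delay, in nanoseconds, over a wire of the given length."""
    return distance_m / SIGNAL_SPEED_M_PER_S * 1e9

# Moving a component 10 cm closer saves roughly half a nanosecond per hop.
saved_per_hop = propagation_delay_ns(0.10)

print(f"~{saved_per_hop:.2f} ns saved per 10 cm removed from a signal path")
```

Half a nanosecond sounds negligible, but multiplied across every chip-to-chip hop in every step of a months-long training run, it is the kind of margin the engineers above are chasing.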
A rack of Trainium 2 chips is essentially a furnace, with hot air billowing out of it at high speed.
During training runs of massive foundation models, so much data is traveling back and forth between GPUs that the potential to improve the speed of the connections between them has spawned entire companies.
The goal of reducing latency as much as possible makes one of Project Rainier’s distinct characteristics puzzling: It plans to divide a single compute cluster into multiple buildings, connecting them with a high-speed networking technology Amazon calls Elastic Fabric Adapter.
“We don’t disclose the exact architecture, but you can imagine it’s so huge, it will require multiple buildings,” Hutt said. These multiple buildings will act as one computer, he said, allowing a model training run that performs as if the entire compute cluster were under one roof without having to break up the training into separate parts.
“The architecture will allow customers like Anthropic to train across the entire cluster,” he said.

Reed’s view
Anthropic’s leap into the Trainium ecosystem, no matter why it happened, is a mutually beneficial arrangement.
Anthropic’s Claude, while not as well known as ChatGPT, has a lot of street cred in the AI world. Its flagship model has become a favorite among software developers for its ability to generate high quality computer code.
It’s doubtful Anthropic, which is in a tight race with other foundation model companies, would have agreed to train Claude on subpar chips. Even with the investment money, its decision is a validation. And if the next version of Claude remains state of the art, Amazon will be taking a victory lap.
There’s no beating Nvidia, and AWS’ Nvidia offerings will no doubt remain popular. But Amazon doesn’t need to beat Nvidia. It just needs Trainium to be successful enough to lure some customers away and reduce its reliance on Nvidia chips, which are so coveted they are prone to shortages.

Room for Disagreement
Business Insider, citing internal documents, reported that Amazon has struggled to find customers for its chips.
“Last year, the adoption rate of Trainium chips among AWS’s largest customers was just 0.5% of Nvidia’s GPUs, according to one of the internal documents. This assessment, which measures usage levels of different AI chips through AWS’s cloud service, was prepared in April 2024. Inferentia, another AWS chip designed for a type of AI task known as inference, was only slightly better, at 2.7% of the Nvidia usage rate.”

Notable
- Bloomberg’s article on Trainium chips nicely captures the atmosphere of the Austin facility.