THE SCENE

In the burgeoning field of generative AI, the term “synthetic data” has become a lightning rod, dividing those who think it’s the savior of the industry from those who say it will destroy AI through what researchers describe as “model collapse.”

Data, and lots of it, is a necessary ingredient in generative AI. But “real” data, or content created by humans, is fraught with problems. Much of it is copyrighted, and it contains other troublesome material, like racial bias, inaccurate information, and pornography.

Yet synthetic data, which is generated by machine learning based on the patterns and properties of real-world data, can also be thorny: it can miss the nuances of human-created content, repeat human biases, and be difficult to validate for accuracy. If those shortcomings make it into large language models, one theory goes, it could create a vicious cycle in which models keep producing worse synthetic data that then gets fed back into newer models, creating an algorithmic version of Idiocracy.

In many ways, the fate of synthetic data sits at the center of the biggest questions facing generative AI. With artists, novelists, and even comedians claiming AI companies have illegally used copyright-protected material to train models, synthetic data could be a work-around. And synthetic data, which doesn’t require potentially costly licenses, may be necessary to make the next leap in capability, because there isn’t enough data to keep improving models, especially in specialized areas of knowledge like biotech and drug discovery.

In exclusive interviews with Semafor, Microsoft researchers offered new insight into the role synthetic data will play in the development of new AI models, and it may not be what people feared or hoped. “There is a lot of confusion out there,” said Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research.

The disorganized, vast pool of online words that makes up the internet is what made ChatGPT incredibly smart. Researchers believe it’s also partly why it hallucinates and sometimes goes off the rails. But what if AI models could do the same learning from less data that was more organized and targeted, like synthetic data?

Microsoft put the theory to the test. The result was Phi 1.5, an open-source AI model that is a tiny fraction of the size of GPT-4 and yet has many of the same core capabilities.

The idea behind Phi was to get at the essence of how GPT-4, the AI model that powers ChatGPT, learned, and then use that knowledge to create a dataset capable of teaching those lessons to a smaller model in a more direct and efficient way.

“The first question that I want to ask is, ‘What were the minimum ingredients that were needed for this intelligence to emerge?’” said Bubeck.

Microsoft used the larger model to create a kind of curriculum to teach a smaller model, which researchers there called “textbooks.” The author of those textbooks was GPT-4, which was prompted to stay laser-focused on data that researchers thought would lead to the best capabilities. In doing so, GPT-4 created a dataset that did away with its encyclopedic knowledge of the web, like a parent teaching children while sheltering them from the harsh and confusing realities of the world.

Researchers then made that child demonstrate its thinking with a series of exercises.

“Just like human beings, after you’ve read the textbook, you don’t really know anything yet. You have to put this knowledge into action,” Bubeck said. “You have to do the exercises. The jump in capabilities was huge after this fine-tune.”
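To make the “textbook” idea concrete, here is a minimal sketch of how a large model can be prompted to generate narrowly scoped, lesson-style synthetic data for training a smaller model. It assumes the OpenAI Python client and a GPT-4 endpoint; the topics, prompt wording, and output file are illustrative placeholders, not Microsoft’s actual Phi recipe.

```python
# Minimal sketch: prompting a large model to produce "textbook-style"
# synthetic training data on narrowly scoped topics. Illustrative only --
# the topics, prompts, and file format are assumptions, not the Phi recipe.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical list of targeted topics the curriculum should cover.
topics = [
    "how Python list comprehensions work",
    "basic reasoning about fractions",
]

def generate_textbook_passage(topic: str) -> str:
    """Ask the large model for a short, self-contained lesson on one topic."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Write a clear, self-contained textbook-style lesson. "
                           "Stay strictly on topic; no filler or web trivia.",
            },
            {"role": "user", "content": f"Write a short lesson about {topic}."},
        ],
    )
    return response.choices[0].message.content

# Collect the generated lessons into a dataset a smaller model could train on.
with open("synthetic_textbook.jsonl", "w") as f:
    for topic in topics:
        passage = generate_textbook_passage(topic)
        f.write(json.dumps({"topic": topic, "text": passage}) + "\n")
```

The same pattern could, in principle, be repeated to produce the “exercises” mentioned above, generating practice problems and worked answers that are then used to fine-tune the smaller model.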