THE SCENE

In the burgeoning field of generative AI, the term “synthetic data” has become a lightning rod, dividing those who think it’s the savior of the industry from those who say it will destroy AI through what researchers describe as “model collapse.”

Data, and lots of it, is a necessary ingredient in generative AI. But “real” data, or content created by humans, is fraught with problems. Much of it is copyrighted, and it contains other troublesome material, like racial bias, inaccurate information, and pornography.

Yet synthetic data, which is generated by machine learning based on the patterns and properties of real-world data, can also be thorny: it can miss the nuances of human-created content, repeat human biases, and be difficult to validate for accuracy. If those shortcomings make it into large language models, one theory goes, it could create a vicious cycle in which models keep producing worse synthetic data that then gets fed back into newer models, creating an algorithmic version of Idiocracy.

In many ways, the fate of synthetic data sits at the center of the biggest questions facing generative AI. With artists, novelists, and even comedians claiming AI companies have illegally used copyright-protected material to train models, synthetic data could be a work-around. And synthetic data, which doesn’t require potentially costly licenses, may be necessary to make the next leap in capability, because there isn’t enough data to keep improving models, especially in specialized areas of knowledge like biotech and drug discovery.

In exclusive interviews with Semafor, Microsoft researchers offered new insight into the role synthetic data will play in the development of new AI models, and it may not be what people feared or hoped. “There is a lot of confusion out there,” said Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research.

The disorganized, vast pool of online words that makes up the internet is what made ChatGPT incredibly smart. Researchers believe it’s also partly why it hallucinates and sometimes goes off the rails. But what if AI models could do the same learning from less data that was more organized and targeted, like synthetic data?

Microsoft put the theory to the test. The result was Phi 1.5, an open-source AI model that is a tiny fraction of the size of GPT-4 and yet has many of the same core capabilities.

The idea behind Phi was to get at the essence of how GPT-4, the AI model that powers ChatGPT, learned, and then use that knowledge to create a dataset capable of teaching those lessons to a smaller model in a more direct and efficient way.

“The first question that I want to ask is, ‘What were the minimum ingredients that were needed for this intelligence to emerge?’” said Bubeck.

Microsoft used the larger model to create a kind of curriculum to teach a smaller model, which researchers there called “textbooks.” The author of those textbooks was GPT-4, which was prompted to stay laser-focused on data that researchers thought would lead to the best capabilities. In doing so, GPT-4 created a dataset that did away with its encyclopedic knowledge of the web, like a parent teaching children while sheltering them from the harsh and confusing realities of the world.

Researchers then made that child demonstrate its thinking with a series of exercises.

“Just like human beings, after you’ve read the textbook, you don’t really know anything yet. You have to put this knowledge into action,” Bubeck said. “You have to do the exercises. The jump in capabilities was huge after this fine-tune.”
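To make the “textbook” idea concrete, here is a minimal sketch of how a large model can be prompted to generate narrowly scoped, lesson-style synthetic data for training a smaller model. It assumes the OpenAI Python client and a GPT-4 endpoint; the topics, prompt wording, and output file are illustrative placeholders, not Microsoft’s actual Phi recipe.

```python
# Minimal sketch: prompting a large model to produce "textbook-style"
# synthetic training data on narrowly scoped topics. Illustrative only --
# the topics, prompts, and file format are assumptions, not the Phi recipe.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical list of targeted topics the curriculum should cover.
topics = [
    "how Python list comprehensions work",
    "basic reasoning about fractions",
]

def generate_textbook_passage(topic: str) -> str:
    """Ask the large model for a short, self-contained lesson on one topic."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Write a clear, self-contained textbook-style lesson. "
                           "Stay strictly on topic; no filler or web trivia.",
            },
            {"role": "user", "content": f"Write a short lesson about {topic}."},
        ],
    )
    return response.choices[0].message.content

# Collect the generated lessons into a dataset a smaller model could train on.
with open("synthetic_textbook.jsonl", "w") as f:
    for topic in topics:
        passage = generate_textbook_passage(topic)
        f.write(json.dumps({"topic": topic, "text": passage}) + "\n")
```

The same pattern could, in principle, be repeated to produce the “exercises” mentioned above, generating practice problems and worked answers that are then used to fine-tune the smaller model.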