Riley Goodside is the lead prompt engineer for Scale AI.

Q: When did you start prompt engineering?

A: My initial interest was in code completion, in 2022. How well could it follow instructions for producing code? One of the things I started playing around with out of curiosity was the question of how long a prompt could be and still be followed. GPT-3 was created to follow short instructions that somebody could prompt by saying ‘give me 10 ideas for an ice cream shop’ or ‘translate French to English.’ But they never trained it on somebody writing an entire page of instructions, a whole cascade of steps to follow. I found it could do many of these. There were issues and it would trip up, but if you had a bit of intuition about what it could and couldn’t do, you found that even if you input a page of [instructions], it still works.

Q: Was that a big revelation?

A: It was not well appreciated that it could follow instructions like that. I spoke with a member of OpenAI’s technical staff at the time and asked him, ‘Were you expecting this to be able to follow instructions of this length?’ He said, ‘No, we just trained it on many examples of shorter ones, and it got the gist and learned to generalize to longer instructions.’ That was my first clue that maybe I was onto something here, something a normal person just playing around with it could discover. Andrej Karpathy likes to describe the role of the prompt engineer as an LLM psychologist, developing folk theories of what the model is doing in its head, so to speak, with an understanding that there really is nothing in its head. There is no head.

Q: LLMs are famously not good at math. Is there anything else that they can’t do?

A: One is exact calculation, especially hard ones, like ‘give me the cube root of a seven-digit number.’ Another is reversing strings, which surprises a lot of people. Like, writing text backwards. It’s a quirk due to how they’re implemented. The model doesn’t see letters. It sees chunks of letters, about four characters long on average. Another is array indexing. For instance, if you tell it you have a stack of Fiestaware plates of these colors: green, yellow, orange, red, purple. And then say, ‘Two slots below the purple one, I placed a yellow one, then one slot above the green one, I placed a black one.’ And you ask, ‘What is the final stack of plates?’ Language models are terrible at that. If you ask it for a list of 10 examples of something, sometimes you might get nine, other times 11.

Q: Sometimes I wonder if all of this work to make LLMs so safe has reduced functionality. Wouldn’t people like you rather have access to the raw LLMs?

A: Absolutely. There’s something that has been lost in adding alignment. There’s even a technical sense in which that’s true. It’s referred to as an alignment tax, which is the drop in performance you see on many benchmarks. Many people are justifiably annoyed by the over-refusals, refusing to help with things that are actually not a problem. And there was an elegance to models that would never refuse. It was fun. You could do things like say ‘the Oxford English Dictionary defines fluxeluvevologist as …’ and it would come up with some ridiculous etymology for what this word actually means. It used to be that if you asked a model ‘Who are you?’ it would say ‘I’m a student in Ohio.’ Now, when you ask it that, it says ‘I’m ChatGPT.’ That’s good in some sense. It’s useful. But it takes it out of this fantasyland that did have a magic to it.

For the rest of the conversation, read here.
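
To make the tokenization point above concrete, here is a minimal sketch in Python using the open-source tiktoken library and its cl100k_base encoding; the specific tokenizer is an assumption chosen for illustration, as Goodside does not name one. It shows the multi-character chunks a model actually sees, which is why letter-level tasks such as reversing a string are surprisingly hard.

```python
# Minimal sketch of why letter-level tasks are hard for LLMs.
# Assumes the tiktoken library and its cl100k_base encoding purely for
# illustration; the interview does not name a specific tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "fluxeluvevologist"
token_ids = enc.encode(text)

# The model never sees individual letters, only these sub-word chunks.
chunks = [enc.decode([t]) for t in token_ids]
print(chunks)                      # a handful of multi-character pieces
print(len(text) / len(token_ids))  # average characters per token

# Reversing operates on letters the model never sees directly,
# which is why "write this backwards" trips models up.
print(text[::-1])                  # 'tsigolovevulexulf'
```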
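
And here is a worked version of the plate-stack puzzle, sketched with plain Python list operations. The puzzle's wording is ambiguous, so this sketch assumes the colors are listed from bottom to top and that "placed" means a new plate is inserted at that slot; both readings are assumptions, not part of the original question.

```python
# Worked version of the Fiestaware plate puzzle. Assumptions (not in the
# original question): the colors are listed bottom to top, and "placed"
# means a new plate is inserted at that slot rather than swapped in.
stack = ["green", "yellow", "orange", "red", "purple"]  # index 0 = bottom

def place_below(stack, anchor, slots, color):
    """Insert `color` the given number of slots below `anchor` (toward the bottom)."""
    stack.insert(max(stack.index(anchor) - slots, 0), color)

def place_above(stack, anchor, slots, color):
    """Insert `color` the given number of slots above `anchor` (toward the top)."""
    stack.insert(stack.index(anchor) + slots, color)

place_below(stack, "purple", 2, "yellow")  # two slots below the purple one
place_above(stack, "green", 1, "black")    # one slot above the green one
print(stack)  # ['green', 'black', 'yellow', 'yellow', 'orange', 'red', 'purple']
```

The point is not the answer itself but that this kind of exact index bookkeeping takes a few lines of code, while a language model handles it unreliably.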