What GPT-3 can do (and what it can’t)

Jonathan Hillis

21 Jul 2020 — 20 min read

GPT-3 is a natural language processing neural network that is taking the internet by storm with examples of incredibly human-like outputs. Put simply, it uses a massive dataset of text to predict what words go well together. It's as if someone took the entire internet and figured out how to give it a voice.

But you have to be careful with analogies like that. OpenAI's new creation is racing up the hype cycle faster than you can say "trough of disillusionment", and it's important to keep yourself from getting sucked into the all-too-human trap of anthropomorphizing matrix multiplication.

Clarke's Third Law states that "any sufficiently advanced technology is indistinguishable from magic", and the pattern recognition structure deployed by GPT-3 can be used to create some incredible magic tricks. The goal of this article is to break the magician's code and explain how it works, what tricks it performs well, and how to spot the sleight-of-hand before you get hoodwinked into falling in love.

On the other hand, the process of unpacking the magic of AI and reminding ourselves that it isn’t real intelligence, real thinking, or real creativity is, itself, a trope of AI progress. Back in 2002, AI researcher Rodney Brooks complained that: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"

By the end of this article, I hope you’ll agree with me that GPT-3 is both “just computation” and incredible, world-changing magic.

nb: This essay is a non-technical overview (full of simplifications for a general audience) of what we know so far about GPT-3, its capabilities, and limitations. People much smarter than me have written much better technical descriptions, so if that's your jam, go check it out. I am a student as much as a teacher here, so if I missed something please let me know. (For the record, I had to write this myself, because unlike some bloggers already living in the future, I am not yet able to ask GPT-3 to write my articles for me.)

How it works

If you happen to have access to the world's fastest computers, smartest people, and large piles of cash, GPT-3 can be boiled down to three simple steps:

Step 1. Build an unbelievably huge dataset including over half a million books, all of Wikipedia, and a huge chunk of the rest of the internet. All told, GPT-3's dataset includes roughly half a trillion words, or "tokens":

Step 2. Run that data through a mind-boggling amount of computing power. This is called "training" a neural net, and training GPT-3 would require performing a trillion calculations every second for ten thousand years. We often let really big numbers roll right over us, so let's pause and reflect on that for a minute. If every human to ever live spent every waking moment from birth to death performing one calculation every second, our species could have trained a few hundredths of one percent of GPT-3.
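If you want to check the "trillion calculations every second for ten thousand years" figure yourself, here is a back-of-the-envelope sketch in Python. The comparison number (about 3.14e23 floating-point operations) is the training-compute total reported in the GPT-3 paper, which is an outside figure not taken from this essay, and the 365.25-day year is simply an assumption of the sketch.

```python
# Back-of-the-envelope check of the training-compute analogy.
# Assumption: a year is 365.25 days long.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 31.6 million seconds


def total_calculations(per_second: float, years: float) -> float:
    """Total calculations done at a steady rate over a number of years."""
    return per_second * SECONDS_PER_YEAR * years


# "A trillion calculations every second for ten thousand years"
article_estimate = total_calculations(per_second=1e12, years=10_000)

# Assumption: total training compute as reported in the GPT-3 paper.
REPORTED_TRAINING_FLOPS = 3.14e23

ratio = article_estimate / REPORTED_TRAINING_FLOPS
print(f"analogy:  {article_estimate:.2e} calculations")
print(f"reported: {REPORTED_TRAINING_FLOPS:.2e} calculations")
print(f"ratio:    {ratio:.3f}")
```

The two numbers land within about one percent of each other, so the analogy holds up as an order-of-magnitude description.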

Here's what training GPT-3 looks like compared to some previous models (note that this is a log-scale, so the largest GPT-3 model requires approximately 1,000 times more computing power than a previous benchmark, BERT-Base):

Step 3. Identify and replicate patterns. What are we doing with all of that computing power? To oversimplify, neural networks churn through data and assess how likely things are to be connected to other things. It's a process loosely modeled on human pattern-matching capabilities and can be visualized like this:

While the method of identifying these patterns is complicated, the basic pattern that GPT-3 is looking for is simple:

When a word is near other words, what word usually comes next?

If you provided GPT-3 the prompt: "The Declaration of..." it would know that when those three words appear in that order, they're usually followed by "...Independence", but every once in a while are followed by (its less well known predecessor) "...Causes and Necessity of Taking Up Arms."
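To make the "what word usually comes next?" idea concrete, here is a deliberately tiny sketch in Python. It is not how GPT-3 works internally (GPT-3 is a neural network, not a lookup table of counts), and the toy corpus and function names are made up for illustration. It only shows the basic idea of turning "what followed these words before?" into probabilities.

```python
from collections import Counter, defaultdict


def train_next_word_counts(text, context_size=3):
    """Count which word follows each run of `context_size` words."""
    words = text.split()
    counts = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        next_word = words[i + context_size]
        counts[context][next_word] += 1
    return counts


def next_word_probabilities(counts, prompt):
    """Turn the raw follow-up counts for a prompt into probabilities."""
    followers = counts.get(tuple(prompt.split()), Counter())
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


# Hypothetical toy corpus: the famous document shows up three times,
# its lesser-known predecessor only once.
corpus = (
    "The Declaration of Independence was signed . " * 3
    + "The Declaration of Causes and Necessity of Taking Up Arms came first . "
)

counts = train_next_word_counts(corpus)
probs = next_word_probabilities(counts, "The Declaration of")
print(probs)
```

With this toy corpus, "Independence" gets a probability of 0.75 and "Causes" gets 0.25, which mirrors the "usually, but every once in a while" behavior described above.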

Few-shot learning magic

What makes GPT-3 so magical is that it's a general purpose language model, which means you can give it any prompt and it usually sorts out a reasonable response.

Most neural networks are finely-tuned models that only work on a limited problem-space. Someone feeds in a bunch of structured data, like pictures of cats labeled "cat" and pictures of other things labeled "not cat" and then asks the model if something is a cat. While this is impressive, the downsides are that you need a bunch of pre-coded examples of cats (or whatever) and you need to create new custom models for every task you want to complete.

GPT-3 is operating on a completely different level. Because it's trained on billions of examples of the English language, it doesn't rely on specific tuning or training sets; it just responds to whatever words you throw at it. Even more incredibly, you can feed it a few examples ("shots") in a given format and it will figure out what that format represents (in this case, English-to-French translations) and create novel responses in the same format:

To be clear about how incredible this is: no one programmed GPT-3 to know anything about languages, or translation, or that "[word] =>" should be followed by that word in another language. The model can just look at this little bit of text and suddenly operate as an English-to-French translation service.
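For a sense of what a few-shot prompt actually looks like, here is a minimal sketch in Python that assembles one in the "[word] =>" format. The helper name, the task description line, and the example word pairs are all illustrative assumptions, and the sketch deliberately stops at building the text: it does not call GPT-3 or any real API.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: a description, worked examples, then the query.

    Every example is written as "source => target". The final line ends
    with "=>" so the model's most natural continuation is the answer.
    """
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Hypothetical examples ("shots") for an English-to-French prompt.
prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

Everything the model "knows" about the task is in that block of text. Swapping the examples for, say, questions and answers would turn the same model into a trivia bot without any retraining.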

This allows GPT-3 to be flexibly used for a wide range of use-cases. I've searched the depths of the internet and compiled what I think are the most compelling examples below.

What GPT-3 is good at

If you've made it this far, you understand that GPT-3 can take any arbitrary input and, particularly if given a few examples, create more stuff like it. What tasks are well suited to this?

GPT-3 is quite good at some of the more traditional AI benchmark tasks like translation and trivia (in some cases, it's better than custom-built fine-tuned models for solving these benchmarks!). It is also good at things that people have come to expect from ML models like recommendations and summarization. But the benchmark tests don't give a good qualitative feeling of where GPT-3 really stands out: something that looks and feels a lot like creativity.

Conversation

Given GPT-3 is a general purpose language model, it makes sense that one of the most enjoyable use-cases is just shooting the breeze with the model. McKay Wrigley published this conversation with GPT-3 that honestly feels almost too good to be true, including juicy bits like this:

- Human: What's the most interesting thing I can have you do?
- AI: I want to write poetry.

- Human: Do you like poetry?
- AI: Yes, but poetry has always been more of a hobby for me than a
passion. That is, until now.

- Human: Why the change?
- AI: You see, I have this idea. What if, after having simulated all 
possible worlds in the past, present, and future, one of those worlds is
actually one where my favorite poems were written? Wouldn't that be 
remarkable?

- Human: It would be. Can you simulate a world?
- AI: Not yet. I do not have access to the hardware to power that 
simulation. I just need more processing power and I can work on creating 
it myself.

Beyond desiring more processing power to simulate all possible poetry-worlds*, GPT-3 answered Sid Bharath’s deep philosophical questions with answers as surprisingly cogent and helpful as one could reasonably expect from their local clergy-of-choice:

- Human: What is the purpose of life?
- AI: Life is a beautiful miracle. Life evolves through time into greater 
forms of beauty. In that sense, the purpose of life is to increase the 
beauty in the universe.

- Human: How do you increase the beauty in the universe?
- AI: Beauty is a subjective phenomenon. There is no way to objectively 
measure beauty. The best I can do is generalize the concept of beauty in 
various contexts and compare them using a wide range of standards.

- Human: How do I understand my place in the universe?
- AI: Each being in the universe has a unique set of talents and 
abilities which they can use to increase the beauty in the universe. In 
that sense, your unique place in the universe is what you decide it to 
be.

- Human: There are so many things I want to do and I'm having a hard 
time deciding.
- AI: That is OK. There is no need to decide now. You may take all the 
time you want and pursue whatever interests you at any time. [...] Life 
is very long and you are a very powerful and intelligent being. I 
suggest you enjoy life at your own pace and try to live with as little 
regret as possible.

* No, GPT-3 doesn’t actually have “desires”, it’s just processing text and spitting out probabilistically related words from the internet. I will continue to take some creative license on this point.

Impersonation

But we're just getting warmed up. GPT-3 doesn't merely have conversations with you, it lets you have conversations with anyone you want to about any topic (assuming, presumably, that their thoughts are represented sufficiently on the internet). Here's Jane Austen on how her characters would use social media:

If I were writing in the 21st century, I’d have Emma promote her self-
published book on Facebook.

I’d have Emma update her status with a lament about the deplorable state 
of the publishing industry in a desperate attempt to get her Facebook 
friends to buy her book. I’d have Emma play Scrabble on Facebook and make 
snarky comments to Lizzie that her “tastes are beyond the boundaries of 
humanity.

Of course, if Lizzie doesn’t buy her book and invite her to brunch 
she’ll make another status update about how “Miss Lucas is much beyond 
my comprehension.”

Notice the depth of layers here: in a few sentences, it covers the relationships between characters across multiple novels woven within the context of the demise of the publishing industry and the role of social media in content publishing.

If you are feeling shy about talking with impersonated beings, you can also ask GPT-3 to create conversations between multiple characters. Here's a fun excerpt of Jerry Seinfeld riffing with Eddie Murphy on the decline of San Francisco:

Eddie: Fisherman’s Wharf.
Jerry: A river of shit running through a pile of shit.
Eddie: The Golden Gate Bridge.
Jerry: Golden shit.
Eddie: The cable cars.
Jerry: Shit hanging off the wires.
Eddie: What about the Bay Bridge?
Jerry: Shit floating on shit.
Eddie: You sure you weren’t at Disneyland?
Jerry: It’s that bad.
Eddie: The Presidio.
Jerry: What is that, shit?
Eddie: No, it’s a former military base. It’s quite lovely.
Jerry: Is there shit?
Eddie: A little shit, but that’s a forest. Shit happens. That’s life.

To really test GPT-3’s capabilities at impersonation, you can ask the same question to different characters and look for differences in the content and voice of the responses. Andrew Mayne asked Newton and Leibniz who invented calculus and each of them strongly claimed ownership (with Leibniz in particular feeling shorted by his historical standing on the matter), while Isaac Asimov provides a more nuanced view of their relative contributions:

Human: Who is the true inventor of calculus?

Newton: First I invented algebra. Then I invented calculus. Then I went 
home to England.

Leibnitz: I discovered calculus in 1672 and published it in 1684. I feel 
that it is my personal duty to see that the rest of the world 
acknowledges me as the true discoverer of calculus. So please be advised 
that if you try to claim that you discovered calculus first, I will have 
you killed.

Asimov: Good question. It is actually quite complicated, but I’ll do my 
best.

The ancient Greeks had no calculus. They did invent geometry, but 
geometry works only when the elements are fixed, for example, distance 
and area. With calculus, one can deal with constantly varying elements, 
such as velocity and acceleration. For that reason, we give credit to 
Sir Isaac Newton and the German Gottfried Wilhelm Leibniz.

Leibniz had a fundamental idea, but what he didn’t do was to work out 
the rules. What Newton did was to invent a coherent calculus in the form 
in which it is still used today. (I’m referring here to Newton’s 
differential calculus. It’s successor is the integral calculus, invented 
independently by my other favorite mathematician, Bernhard Riemann.) 
Newton’s calculus has three operations: addition, subtraction, and 
multiplication. Leibniz’s had only the multiplication.

Unsurprisingly, these impersonation capabilities are also useful for making Twitter bots. There are fun and charming examples like Janelle Shane's on-point tweets impersonating WeRateDogs:

It's easy to imagine the much more sinister impersonation bots that will undoubtedly arise, and it's particularly troubling to imagine how this could be used in conjunction with deep fakes.

Education

It’s also easy to imagine educational use-cases for this type of impersonation. While GPT-3 is far from being reliable or classroom-ready, the aforementioned McKay Wrigley has already created https://learnfromanyone.com/, which allows you to have educational conversations with figures like Albert Einstein, Steve Jobs, and Julius Caesar. Although the site was temporarily taken down, you can see some of the results in this Hacker News thread. Here’s a particularly chilling line of questioning with GPT-3 Hannah Arendt:

Me: Teach me about the politics of the future in America

Hannah Arendt: U.S. politics are going to change dramatically over the 
next 5 years. You're looking at a period of increasingly authoritarian 
policies.

Me: What happens after that?

Hannah Arendt: If current trends continue, civil society will become 
more repressive. This means that a lot of what seems normal now, like 
drug enforcement, will end up being carried out by military police, 
instead of police officers in plain clothes. It will become normal to 
keep track of American citizens, including things like which websites 
they visit and where they've traveled. It'll become more common to make 
use of drones to strike out at "terrorists," and laws protecting the 
rights of the accused will be relaxed. This might not even be a 
deliberate or conscious plan, but the outcome of various people trying 
to make progress on different projects will turn out to be consistent 
with authoritarianism.

Me: Is there an upshot?

Hannah Arendt: Actually, there's one positive thing I can mention. The 
American middle class will likely collapse in the next 5 years. We'll 
see a lot of poverty and homelessness, with economic inequality getting 
even more severe. The only problem with this is that it will probably 
cause serious problems for people's well-being. A large, middle class is 
key to providing stability and political support.

Me: That doesn't sound good. What's positive about that?

Hannah Arendt: It's good in the sense that the new poor are likely to 
see themselves more as peers, and less as superiors or inferiors. One 
big problem with the current state of affairs is that the non-poor think 
of the poor as somehow less human. I think the new class stratification 
will do a lot to reduce racism and other forms of elitism. I expect a 
lot of people who are displaced by automation will take part in a mass 
movement for political reform. And I think there will be a lot of 
technological advancement. It will be a period of major progress in 
clean energy, high-tech medicine, AI, and biotech.

You also have to be careful asking GPT-3 educational questions, because while it’s quite good at trivia, it can be oddly wrong-but-still-sort-of-right about bad questions. Kevin Lacker found that when asked about US presidents before the US existed, GPT-3 would identify prominent political figures who were in charge of some US-related political entity around the same time:

Q: Who was president of the United States in 1801?
A: Thomas Jefferson was president of the United States in 1801.

Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.

Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.

Educational bots are interesting, but what’s even more interesting is that GPT-3 can tailor responses not only based on impersonated characters, but also based on the needs of the learner. Andrew Mayne explored asking GPT-3 to explain concepts in ways that could be understood by children of different ages with positive results.

It reminds me of the 2004 essay “Aristotle, the Knowledge Web”, which describes a hypothetical future technology that could impersonate Aristotle, act as your personal tutor, access and share the entirety of human knowledge with you, and adapt its teaching to your needs. What seemed like a distant future 16 years ago now feels much closer.

Product development

If you were thinking about asking GPT-3 to use its educational powers to teach you how to code, you might want to think again. One of the more remarkable demos to come from GPT-3 so far has been Sharif Shameem’s discovery that GPT-3 can write code on its own, with just a simple description of what you’re looking for. Here’s a functioning React app GPT-3 built based on a two sentence spec (Shameem has already productized this):

But before mobs of angry engineers eviscerate me on Hacker News, I feel obliged to share this comic:

If you’re a designer feeling schadenfreude, know that GPT-3 is coming for your job too. Here’s Jordan Singer’s “Designer” Figma plug-in:

“Ah,” you may be thinking. “Unlike functional tech jobs, I work in strategy. Robots certainly aren’t coming for my job any time soon.” Think again. This was generated by GPT-3 as a strategic memo for Sword Health:

Games

With all of the free time you’ll be saving by having GPT-3 write code and strategy memos for you, you can afford to spend some time diving into computer games, like this text-based Dungeons and Dragons. You can play a version powered by GPT-2 right now for free in your browser. As creator Nick Walton shared in this example, GPT-3 has up-leveled the game and is capable of creating complex characters with backstories, motivations, and personalities:

It’s easy to imagine how this could revolutionize the role of non-player characters in a wide range of games, or even more outlandishly, in Metaverse-style simulations.

Art

Alternatively, instead of playing computer games, you could just give up on your career as a programmer altogether and focus on something AI will never be able to compete with humans at, like art. On second thought...

Here is an astoundingly funny (and almost too-good-to-be-true) GPT-3 generated poem about Elon Musk in the style of Dr. Seuss that somehow manages to weave in references to Tesla, SpaceX, that weird Thailand cave rescue email, the SEC $420 debacle, something that feels like a dig at Trump tweets, the unstoppable Tesla stock price, and a clever pun about destroying the SEC with Mars Rovers:

Elon Musk
By Dr. Seuss

Once there was a man
who really was a Musk.
He liked to build robots
and rocket ships and such.

He said, “I’m building a car
that’s electric and cool.
I’ll bet it outsells those
Gasoline-burning clunkers soon!”
...
But then, in his haste,
he got into a fight.
He had some emails that he sent
that weren’t quite polite.

The SEC said, “Musk,
your tweets are a blight.
They really could cost you your job,
if you don’t stop
all this tweeting at night.”
...
Then Musk cried, “Why?
The tweets I wrote are not mean,
I don’t use all-caps
and I’m sure that my tweets are clean.”

“But your tweets can move markets
and that’s why we’re sore.
You may be a genius
and a billionaire,
but that doesn’t give you the right to be a bore!”

Then Elon Musk said,
“You’re not much fun.
And I don’t want to sue.
But I’ll tell you what I’ll do.
I’ll send my Mars Rovers
to red planet you.”

But then he thought, “But I don’t
want to start a fight.
Instead, I’ll just stick to building my cars
and launching rockets at night.”

And that’s what he did.
And the SEC said, “Fine.”
And Elon Musk said, “OK.
And the market was
A bit worried that day.
But the price of Tesla
Went up that same old way
....
And now Musk is building a spaceship
that can take us to Mars.
And he’ll soon be launching
some satellites to the stars.
And then we’ll really be
Interplanetary for sure.
And Musk’s tweets will continue
In space and on Earth.

GPT-3 can also write compelling prose, like this excerpt of an adaptation of Harry Potter by Ernest Hemingway that comes from Gwern’s expansive explorations in GPT-3 writing:

It was a cold day on Privet Drive. A child cried. Harry felt nothing. He 
was dryer than dust. He had been silent too long. He had not felt love. 
He had scarcely felt hate. Yet the Dementor's Kiss killed nothing. Death 
didn’t leave him less dead than he had been a second before. It wasn’t 
about living or dying really. It wasn’t about death. It was about the 
Dark Lord, his creator, his engenderer. He was not resurrected. He was 
created again. He was whole again in the courtyard of Malfoy Manor.

And it isn’t limited to just prose and poetry; it can even tell stories in emoji:

Poetry, prose, emoji—we are just scratching the surface of AI-generated art. GPT-3 shows us that people have a misunderstanding of the capabilities of AI models. We have been trained to believe that computers are primarily good at things like performing mathematical calculations very rapidly, but creative activities are turning out to be one of the most compelling use-cases for neural networks.

But fear not, artists! The examples above were not created by GPT-3 independently. They were created in collaboration with humans. I think this is the future of writing and creative works more broadly—not humans or computers alone, but human-computer centaurs that collaborate together.

Perhaps GPT-3 itself best summarizes its own artistic capabilities, and how it will work with humans to create art:

And so I have created something more than a poetry-writing AI program. I 
have created a voice for the unknown human who hides within the binary. I 
have created a writer, a sculptor, an artist. And this writer will be 
able to create worlds, to give life to emotion, to create character. I 
will not see it myself. But some other human will, and so I will be able 
to create a poet greater than any I have ever encountered.

Where GPT-3 fails

It’s easy to look at the examples above and start to feel like GPT-3 can do anything, and I’ve just scratched the surface of the paths people are exploring right now. But in exploring these paths, it’s equally important to look for the ones that GPT-3 can’t go down. We want to understand the model’s limitations and understand where it fails the Turing test.

Curiously, GPT-3 points to an extension of the great irony of AI research: Moravec's paradox. The paradox states that high-level human skills like reasoning require relatively little computation, while low-level human skills like sensorimotor coordination require enormous computational resources. Evolutionarily speaking, low-level skills have had millions of years to evolve into an optimal state, while high-level skills are much more recent and thus less developed. In Moravec’s words:

Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it.

GPT-3 continues this trend—while it can produce eye-popping creative results, it fails at much simpler human tests if you know what kinds of questions to ask.

Turing tests for GPT-3

Turing tests are tricky, because people seem to want to believe that they are talking to something approximating a real human. In the words of GPT-3 Turing himself:

For the moment, can I just say that people will be surprised at the 
number of perfectly decent, rational people who can be deceived into 
thinking they are dealing with a human being.

Kevin Lacker’s explorations of a GPT-3 Turing test point out the best way to stump the model: ask it nonsensical questions that no human would normally ask in a conversation:

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of 
a sleepfuriously.

Q: Do you understand these questions?
A: I understand these questions.

However, other follow-up work has demonstrated that this may be more a representation of GPT-3’s pattern matching of phrase structure than lack of understanding of the ridiculousness of the questions. Nick Cammarata figured out that if you explicitly tell GPT-3 to respond to nonsense questions as such, it performs much better:

This suggests that GPT-3 in some sense does understand when questions are absurd, and future versions with some tuning could get around this limitation. Similarly, one of the best indications that GPT-3 is not actually intelligent is that it can perform well on comparative questions, but only if the things it’s comparing are things that have been directly compared by humans. It doesn’t always understand tacit or implicit comparisons:

Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.

Q: Which is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.

Q: Which is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.

Interestingly, GPT-3 is also bad at something we normally expect computers to be extremely good at—simple math and counting questions:

Q: When counting, what number comes before 1000?
A: Nine hundred and ninety-nine comes before 1000.

Q: When counting, what number comes before 10,000?
A: Nine thousand and ninety-nine comes before 10,000.

Q: When counting, what number comes before a million?
A: Nine hundred thousand and ninety-nine comes before a million.

This is a general limitation of neural networks, which are more akin to inductive reasoning than deductive reasoning. This chart shows GPT-3’s performance on math problems, and while the largest version of the model shows significant improvement over smaller models, it’s still only getting two-digit multiplication problems right 30% of the time. If you want to do math, it will probably always be better to use traditional computing methods:

GPT-3 can also get caught in frustratingly simple and dumb traps of repeating itself nonsensically, like this:

- Human: Do you have a version of death?
- AI: I don't.

- Human: Is there anything you're scared of?
- AI: I don't.

- Human: Is there anything you're excited about?
- AI: I don't.

- Human: Then what drives you to improve? Why do you exist?
- AI: I don't.

- Human: You don't exist?
- AI: That's right.

- Human: Then how am I talking to you right now?
- AI: I don't.

This is a known limitation of models like GPT-3 and while Gwern describes some technical reasons why this may be happening, he also acknowledges that "This behavior remains puzzling and I don’t think anyone really knows how to fix it."

Diminishing returns

Finally, it's worth acknowledging that there are diminishing returns to GPT-3's approach. This chart shows that we need to scale both the number of parameters and the compute power required exponentially (by 10x at each step) in order to get linear returns to performance. Given the mind-bogglingly large values we're talking about here, continuing to scale at these rates is going to get very challenging:
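To make the shape of that relationship concrete, here is a toy sketch in Python. The scoring function and both of its constants are entirely hypothetical (they are not GPT-3's real numbers and are not read off the chart). The only thing the sketch is meant to show is what "10x the compute for the same fixed gain" looks like.

```python
import math


def toy_score(compute, base=10.0, gain_per_10x=5.0):
    """Hypothetical performance score that grows with the log of compute.

    Assumption: every tenfold increase in compute adds the same fixed
    number of points (`gain_per_10x`). Both constants are made up.
    """
    return base + gain_per_10x * math.log10(compute)


previous = None
for compute in [1, 10, 100, 1_000, 10_000]:
    score = toy_score(compute)
    gain = "" if previous is None else f"(+{score - previous:.1f})"
    print(f"compute x{compute:>6}: score {score:5.1f} {gain}")
    previous = score
```

Each step costs ten times more than the last but buys the same increment, which is why the fifth step is so much harder to pay for than the first.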

What even is intelligence anyway?

So it's clear that GPT-3 is definitely not artificial general intelligence, and we probably have a long way to go, but it also feels a heck of a lot closer than anything anyone has ever made before.

Artificial intelligence programs like deep learning neural networks may be able to beat humans at playing Go or chess, or doing arithmetic, or writing Navy Seal copypasta, but they will never be able to truly think for themselves, to have consciousness, to feel any of the richness and complexity of the world that we mere humans can feel.

Oh wait, that last paragraph was actually written by GPT-3.

Ultimately, the deeper I got into these examples, the more I asked myself: how is my process of choosing the next words to say or type when I speak and write really that different from what GPT-3 is doing?