GPT-3 can’t save you today. GPT-3 by OpenAI (still in private beta) dazzled the world in 2020 with incredible demos: generating SQL queries from plain English, writing code, and doing what looks like machine comprehension. And comprehension is hard even for humans (think of the SAT and GRE). We covered this exciting news in our article GPT-3 Past, Present, Future, and covered NLP fundamentals in our Getting Started with NLP article. It is just as important to follow up and write about what GPT-3 cannot do: its limitations, weaknesses, and mistakes. That is a bit hard without full access to GPT-3 yet, but we still did some early investigation and found insights to present in this article.
We were also able to get hands-on with the GPT-3 Playground. First let’s talk about Sam Altman, then about what really happens inside GPT-3.
This article is a work in progress and under continuous development. Any feedback is welcome. We write all kinds of programming, data science, machine learning, and deep learning articles. To support us, please subscribe on Medium, clap, or, if you feel like it, leave a comment below.
All machine learning models are defined and limited by the data they consume. After all, GPT-3 learned from a huge slice of sensible, random, factual, and fictional content on the internet: forums, Wikipedia, Twitter, etc.
GPT-3 and Silicon Valley Startups
Sam Altman, formerly of Y Combinator and now a leader at OpenAI, tweeted about the GPT-3 release, and also asked everyone to cool down when it got over-hyped.
First, his Twitter thread calling GPT-3 mind-blowingly cool — just a quote here, and not as interesting as the thread that followed. Two days later came his Twitter thread on GPT-3 being over-hyped. Find Sam on Twitter and Medium at @sama.
Trying out GPT-3 playground
We were able to get access to test GPT-3; this is what we found.
To get good results you need to give GPT-3 good prompts and examples. That matters for few-shot learners, which require only a few examples to get started (a warm-up). Normally models need much more data; few-shot models can extrapolate quickly from just a few examples and appear intelligent.
Right now, trial and error seems to be the fastest way to get reasonable results from GPT-3.
TODO: Harry Potter rap
Formatting is important. GPT-3 currently expects examples to follow certain spacing conventions (prompt on one line, expected result on the next) and label conventions (questions labeled Q:, answers labeled A:; speakers labeled A, B, C when generating scripts). The examples you provide should mimic the examples OpenAI provides for each task type.
To illustrate how much task-specific examples matter, consider this scenario: Q: What comes after 99? Or: Q: What comes after 90? Because GPT-3 learns from text on the internet, the answer is not necessarily the numerically correct one. But if instead of a Q&A task we frame it as plain text synthesis, giving the prompt 98, 99, 100 and asking it to generate the next two or three items, it may stay on the right track and say 101, 102. If we want to get it right reliably, our example can count from 0, 1 … all the way to 98, 99, 100 before asking GPT-3 to generate, and it will almost certainly continue the sequence correctly.
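The counting trick above is just prompt construction. Here is a minimal sketch of how such a prompt might be assembled before sending it to the model (the helper name is ours, not part of any API):

```python
def build_counting_prompt(start: int, end: int) -> str:
    """Return a comma-separated run of numbers ending with a trailing
    comma, inviting the model to continue the sequence."""
    return ", ".join(str(n) for n in range(start, end + 1)) + ","

# A long, unambiguous numeric pattern makes "101, 102, ..." the most
# likely continuation, instead of whatever the web associates with "99".
prompt = build_counting_prompt(0, 100)
```

The longer and more regular the pattern, the less room GPT-3 has to wander off to internet trivia about the number 99.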
In our observations, along with reports on Twitter and on OpenAI’s tech blog, we found that GPT-3 is brittle to rephrasing. The answer may vary wildly with the prompt-and-example combination, or with the order of words being jumbled.
A machine learning model is only as good as the data you feed it. In this case, GPT-3, like most models, learns from the internet. We are more formal, proper, and respectful in academic writing, but casual, even hostile, when anonymous online. Maybe our writing is shorter due to constraints like tweet length; maybe we use more emojis and hashtags on the internet. It is important to know what data GPT-3 is trained on, and to know that the data inherently carries bias.
Let that sink in for a moment: GPT-3 learned its “intelligence” from the internet, a diverse yet divisive, chaotic space that contains a large part of modern knowledge as well as verbal abuse, false information, speculation, and instigation. Or maybe the internet is just more colloquial, casual, and unrestrained than we expect from a source of true intelligence, or the smartest language model under the sun.
Natural language models tend to use corpora, collections of documents such as Wikipedia. The Wikimedia Foundation calls on contributors from more languages internationally, because the current articles have an American-European, English-speaking bias. Bias does not mean bad; it can simply mean imbalanced data. There are just more English-language articles on Wikipedia, written by authors living in English-speaking and European countries, than articles in other languages like Chinese or Japanese.
Because machine learning models learn patterns from data, they tend to predict the common, likely scenario and have trouble predicting underrepresented labels. A model cannot predict results it rarely sees; it tends to choose the more likely one.
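A toy illustration of that majority bias (the data here is made up for the example): a “model” that only learns label frequencies will always pick the majority class and never the rare one.

```python
from collections import Counter

# Imbalanced toy training data: the rare label is underrepresented.
training_labels = ["common"] * 8 + ["rare"] * 2

def most_likely(labels):
    """Predict by frequency alone, like a majority-class baseline.
    Counter.most_common(1) returns [(label, count)] for the top label."""
    return Counter(labels).most_common(1)[0][0]

print(most_likely(training_labels))  # prints "common"
```

No matter how often you query this baseline, “rare” never comes out — the same dynamic, at a much larger scale, pulls a language model toward the internet’s most frequent answers.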
Your GPT-3 output is as good as your prompts
It is not hard to learn the Playground format for various tasks. In Q&A, you label lines Q: and A:; in chat, you label them Human: and AI:. Not too bad. You also want to give it quite a few examples, and good ones. It is sensitive to typos, which can drastically change the result. For example, asked to go from a paragraph tag <p></p> to a link tag, it will literally output <link></link> rather than the anchor tag <a></a> a web developer would expect. Text-to-command is not as magical as “it can write code on the fly”; you will want to prompt it with the right code. Asked to chat about making smoothies, its first thought is banana, perhaps because banana is a common base for smoothies. Does it tend to pick the average, or the most common (the mode, the highest frequency) result? Asked who created GPT-3, it said Ian Goodfellow, perhaps because GPT-3 often appears near the word “generative”, which in turn co-occurs with “created by Ian Goodfellow” (Goodfellow invented generative adversarial networks). Next, let’s discuss cost.
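Those Q:/A: conventions are easy to script. A minimal sketch of building a few-shot Q&A prompt in the format the Playground examples use (the function name is ours, for illustration only):

```python
def qa_prompt(examples, question):
    """Build a few-shot Q&A prompt with Q:/A: labels, ending with an
    open "A:" line for the model to complete."""
    lines = []
    for q, a in examples:
        lines += [f"Q: {q}", f"A: {a}"]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)

demo = qa_prompt([("What comes after 9?", "10")], "What comes after 99?")
```

A chat prompt works the same way with Human:/AI: labels; the point is that the labels and line breaks are part of what the model conditions on, which is why typos in them can derail the output.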
GPT-3 takes the Turing test
There’s an interesting article discussing how the model performs on the Turing test. Deep neural networks are great function approximators that consume data samples; the statistics and probabilities of the training data can influence or bias the model. It appears intelligent at human tasks but lacks true human intelligence. Sophisticated models appear more human, but still have a long way to go — even if they might perform far better at machine comprehension (think about how much we may dislike the SAT and GRE reading-comprehension sections; the model can surely do Q&A better than many of us, including our own writers).
OpenAI intends to charge for GPT-3 access. Early beta testers have already been notified that a pricing model will be in place October 1st (read about the pricing model here). It is not the Louis Vuitton of models, but it does carry a startup-level price tag of $100+ per month, with various levels of support.
Hmm, given that generative models still output seemingly sensible but disjointed phrases — uh, no thanks. That being said, the parameter controls really are amazing.
The GPT-3 API pricing model is definitely facing criticism on all kinds of tech blogs.
What GPT-3 can do very well: easily modify parameters
Weakness associated with generative models
Ethics and philosophy
GPT-3: who’s responsible for the mistakes of generated text?
Play with GPT-3: here’s a parody account, Wisdom of GPT-3: https://twitter.com/ByGpt3
OpenAI gives out fellowship scholarships https://jobs.lever.co/openai/90311c53-38a6-467d-98ca-2d2735fa1a8a
Inherent Limitation of Generative Models
Generative models can appear coherent without achieving comprehension, without achieving true intelligence.
The reality is that testers sometimes have to click the generate button multiple times to sample results and choose the best (funniest, most coherent, or most impressive) of the few. Generative models can produce different results from the same prompt because the output tokens are sampled randomly, with parameters like temperature controlling how much randomness is allowed.
Prompting the model can introduce some initial bias
Inability to override well-known concepts, for example Stanford … existing concepts like Harry Potter
It cannot learn from current or future data; it is a pre-trained model. If a new concept comes out today, GPT-3 will not know about it.
State-of-the-art natural language models require a lot of data to train. GPT-3 essentially learned from the entire internet: webpages, Wikipedia pages, books, … The internet is filled with cats, memes, and tweets; not everything is editorial, and there is likely a Western bias (Wikipedia research has shown a Western, English bias is hard to avoid because the articles with the most traffic are written in English).
GPT-3 Limitations Mentioned in the GPT-3 Paper
The GPT-3 paper’s authors literally spell it out for us — section 5 of the paper helps us understand GPT-3’s limitations.
These are just some excerpts; the section is fairly long, and research on the model is ongoing, with researchers examining it continuously. Here’s a taste of some of the limitations: “It has notable weaknesses in text synthesis and several NLP tasks.” “In text synthesis, GPT-3 … repeat themselves semantically at the document level … conflict themselves …” “Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficult with ‘common sense physics’ …” “GPT-3 has several structural algorithmic limitations, ….”
Ethics of GPT-3 Model
Who’s responsible when GPT-3 goes rogue? What if it generates biased and offensive comments? What if the content generated is inaccurate and causes economic loss?
Here’s a thinking exercise: who is responsible for unlawful, offensive, discriminatory, biased, or fake content generated by a model? Years ago, Microsoft created a Twitter bot called Tay.ai. The female persona learned from tweets on Twitter and from interacting with users. Quickly, Tay went on racist, misogynist rampages… well, it learned from the internet, an offensive place that can quickly turn toxic under the veil of anonymity. In this Guardian article, Microsoft is “deeply sorry” for its offensive Twitter AI.
The large number of parameters and the time and resources required to train GPT-3 make it more of a black box, just like many existing models, pre-trained embeddings, and deep neural networks.
Because the model is a few-shot learner, the few prompts and examples provided can greatly influence its output, even inverting right and wrong. For example, if the provided sorting examples say biased is better than just and ignorant is better than curious, it may well generate violence is better than peace, because the provided examples have an inverted moral order. If the prompts can change the output like this, is the fault truly the model’s?
One of the coolest demos uses GPT-3 to generate code. Who will be responsible for that code? Will there be a dedicated reviewer? Pair programming and code reviews work best when there is two-way communication: questions are asked and solutions are defended. As a black box, how can GPT-3 explain and defend its choices? Some worry that programming mistakes could cost companies substantial money.
Limitations of GPT-2
While we don’t fully understand GPT-3’s strengths and weaknesses yet, we can gain some insight from its predecessor GPT-2, a smaller model.
Machine comprehension is an interesting task for GPT-3. With its superior performance compared to predecessors and competitors, can it potentially be better than humans? One of our writers joked that GPT-3 would score higher on reading comprehension than she would on standardized tests like the SAT and GRE.
GPT-3 takes the Turing test. TODO
Does GPT-3 Know Time, Tense?
GPT-3 will likely reply with answers that reflect the consensus of the present time, but can it perceive past, present, and future and distinguish among the three? Likely not; it cannot predict the likely next scenario, or reason about it. Notably, when asked where Steve Jobs is, GPT-3 once answered “in the Apple HQ.” (We checked again in 2021; it has gotten smarter and no longer says that.) What is perceived as machine comprehension may not have reached the level of understanding that, since Mr. Jobs passed away, he is no longer with us. Maybe it answered “in the Apple HQ” because the internet includes many articles in which Mr. Jobs answered interview questions or launched products at Apple HQ, and that is the location the language model most strongly associates with Jobs.
GPT-3 won’t be cheap
Though still in private beta (application required), GPT-3 has notified its users that it will likely start charging in October 2020. And sure enough, tech blog posts have emerged across the internet: many dislike and disapprove of the pricing model.
Pretrained models offer a starting point for NLP but often require customization. As more beta testers and developers use GPT-3 it will get better, but real-world use will pull it in all kinds of directions, so customization is still much needed. How customizable is GPT-3?
OpenAI is not open
According to MIT Technology Review, OpenAI previously portrayed itself as an open initiative — AI for the benefit of humanity — yet initially withheld GPT-3’s predecessor, GPT-2. Now it is charging an exclusive-club price for access to the GPT-3 API. MIT Technology Review also reported that OpenAI is giving only Microsoft access to its underlying code, configurations, and training data. Perhaps the limiting feature of GPT-3 is its licensing as well as its premium pricing.
Next steps for GPT-3
It will be very interesting to customize and fine-tune GPT-3 models in the future and watch them improve within an organization, or as a shared model across all users. It’d also be interesting to see new tools that explain, interpret, or visualize the GPT-3 model and its outputs.
Right now, what’s amazing is that GPT-3 works smartly out of the box and does not need task-specific fine-tuning. In fact, amazingly, it is not trained on a specific task at all. But can it become a domain expert? How do we make it a specialist, not a generalist?