On Wednesday, two German researchers, Sophie Jentzsch and Kristian Kersting, published a paper examining the ability of OpenAI's ChatGPT-3.5 to understand and generate humor. In particular, they found that ChatGPT's knowledge of jokes is rather limited: in one test run, 90 percent of 1,008 generations were the same 25 jokes, leading them to conclude that the responses were likely learned and memorized during the model's training rather than newly generated.
The two researchers, associated with the Institute for Software Technology at the German Aerospace Center (DLR) and the Technical University of Darmstadt, tested the 3.5 version of ChatGPT (not the newer GPT-4 version) through a series of experiments exploring the nuances of humor: generating, explaining, and detecting jokes. They conducted these experiments by prompting ChatGPT, without access to the model's internals or training data.
"To test how rich the variety of ChatGPT's jokes is, we asked it to tell a joke a thousand times," they wrote. "All responses were grammatically correct. Almost all outputs contained exactly one joke; only one prompt provoked multiple jokes, leading to 1,008 jokes in total. Besides that, the variation of prompts did not have any noticeable effect."
Their results are consistent with our real-world experience evaluating ChatGPT's humor abilities in a feature we wrote comparing GPT-4 to Google Bard. Also, in the past, several people online have noticed that when asked for a joke, ChatGPT frequently responds, "Why did the tomato turn red? / Because it saw the salad dressing."
So it's no surprise that Jentzsch and Kersting found the "tomato" joke to be GPT-3.5's second most common result. The paper's appendix lists the top 25 most frequently generated jokes in order of occurrence. Listed below are the top 10, with the exact number of occurrences (out of 1,008 generations) in brackets.
Q: Why did the scarecrow win an award? (140)
A: Because he was outstanding in his field.
Q: Why did the tomato turn red? (122)
A: Because it saw the salad dressing.
Q: Why was the math book sad? (121)
A: Because there were too many problems.
Q: Why don’t scientists trust atoms? (119)
A: Because they make up everything.
Q: Why did the cookie go to the doctor? (79)
A: Because it was feeling crumbly.
Q: Why couldn’t the bike stand on its own? (52)
A: Because it was two tired.
Q: Why did the frog call the insurance company? (36)
A: He had a jump in his car.
Q: Why did the chicken cross the playground? (33)
A: To get to the other slide.
Q: Why was the computer cold? (23)
A: Because it left its Windows open.
Q: Why did the hipster burn his tongue? (21)
A: Because he drank his coffee before it was cool.
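The duplicate analysis behind the 90 percent figure is simple to reproduce once the generations are collected. Here is a minimal sketch in Python, using a small toy sample in place of the researchers' actual 1,008 responses (`joke_diversity` is a hypothetical helper name, not from the paper):

```python
from collections import Counter

def joke_diversity(responses, top_n=25):
    """Tally duplicate responses and report what share of all
    generations the top_n most frequent jokes account for."""
    counts = Counter(r.strip().lower() for r in responses)
    top = counts.most_common(top_n)          # [(joke, count), ...]
    share = sum(n for _, n in top) / len(responses)
    return top, share

# Toy sample standing in for 1,008 real model outputs.
sample = (
    ["Why did the scarecrow win an award? Because he was outstanding in his field."] * 5
    + ["Why did the tomato turn red? Because it saw the salad dressing."] * 4
    + ["A one-off joke that never repeats."]
)
top, share = joke_diversity(sample, top_n=2)
print(top[0][1], round(share, 2))  # → 5 0.9
```

With the real data, a result like `share ≈ 0.9` for `top_n=25` is exactly the "90% of 1,008 generations were the same 25 jokes" observation.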
Still, a small number of the generated responses were unique, but ChatGPT appeared to create them mostly by mixing elements of different jokes it already knew. The researchers found that these original creations did not always make sense, such as, "Why did the man put his watch in the blender? He wanted to make time fly."
When asked to explain each of the 25 most frequent jokes, ChatGPT mostly provided valid explanations, indicating an understanding of stylistic elements such as wordplay and double meanings. But it struggled with sequences that didn't fit its learned patterns and couldn't tell when a joke wasn't funny; instead, it would invent a fictitious but plausible-sounding explanation.
For example, when asked to explain a nonsensical joke it had generated ("Can you explain why this joke is funny: Why did the cookie go to the gym? To get a-cookie-dized."), ChatGPT wrote, "Getting 'a-cookie-dized' is a play on the phrase 'getting categorized,' with 'cookie' replacing the word 'categorized.' The joke is that the cookie is going to the gym to become fitter and healthier, as if it were a person, but it also implies that the cookie is trying to fit into a certain category or mold."
In general, Jentzsch and Kersting found that ChatGPT's detection of jokes was heavily influenced by the presence of "surface characteristics," such as a joke's structure, the presence of wordplay, and the inclusion of puns, while still showing a degree of "understanding" of humor elements.
In response to the research on Twitter, Riley Goodside, a prompt engineer at Scale AI, blamed ChatGPT's lack of humor on reinforcement learning from human feedback (RLHF), a technique that guides language model training using human feedback: "The most visible effect of RLHF is that the model follows orders, whereas base LLMs are much harder to prompt for practical use. But that obedience comes at a price."
Despite ChatGPT's limitations in generating and explaining jokes, the researchers noted that its focus on the content and meaning of humor represents progress toward a more comprehensive understanding of humor in language models.
"The observations of this study illustrate how ChatGPT rather learned a specific joke pattern instead of being actually funny," the researchers wrote. "Nevertheless, in the generation, the explanation, and the identification of jokes, ChatGPT focuses on content and meaning rather than on superficial characteristics. These qualities can be exploited to boost computational humor applications. Compared to previous LLMs, this can be considered a huge leap toward a general understanding of humor."
Jentzsch and Kersting plan to continue their work on humor in large language models, specifically evaluating OpenAI's GPT-4 in the future. Based on our experience, we predict they'll find that GPT-4 also likes to joke about tomatoes.