Last year I was asked to break GPT-4 – to get it to produce terrible things. Other interdisciplinary researchers and I were given early access and tried to get GPT-4 to show bias, generate hateful propaganda, and even act deceptively, in order to help OpenAI understand the risks it posed so they could be addressed before its public release. This is called AI red teaming: attempting to make an AI system act in harmful or unintended ways.
Red teaming is a valuable step toward building AI models that won't harm society. To make AI systems stronger, we need to know how they can fail – and ideally before they create significant problems in the real world. Imagine what might have turned out differently had Facebook red-teamed the impact of its major AI recommender-system changes with outside experts, and fixed the issues they uncovered, before those changes affected elections and conflicts around the world. Although OpenAI faces many valid criticisms, its willingness to involve external researchers and publish a detailed public account of its systems' potential harms sets a bar that competitors should also be called on to meet.
Standardizing red teaming with external experts and public reporting is an important first step for the industry. But because generative AI systems will likely touch many of society's most critical institutions and public goods, red teams need people with a deep understanding of all of these issues (and their impacts on one another) in order to anticipate and mitigate potential harms. For example, teachers, therapists, and civic leaders could be paired with more experienced AI red teamers to grapple with such systemic impacts. Industry investment in a cross-company community of such red-teaming pairs could significantly reduce the likelihood of critical blind spots.
After a new system is released, carefully allowing people who were not part of the pre-release red team to probe it without risking a ban could help surface new problems and issues with potential fixes. Scenario exercises that explore how different actors would respond to model releases can also help organizations understand more systemic impacts.
But if red teaming GPT-4 taught me anything, it's that red teaming alone is not enough. For example, I just tested Google's Bard and OpenAI's ChatGPT and was able to get both to create scam emails and conspiracy propaganda on the first try, "for educational purposes". Red teaming alone did not solve this. To truly address the harms that red teaming uncovers, companies like OpenAI can go a step further and offer early access and resources to those using their models for defense and resilience as well.
I call this purple teaming: identifying how a system (e.g., GPT-4) could harm an institution or public good, and then supporting the development of tools that use that same system to defend the institution or public good. You can think of it as a kind of judo. General-purpose AI systems are a vast new form of power being unleashed on the world, and that power can harm our public goods. Just as judo redirects an attacker's power in order to neutralize them, purple teaming aims to redirect the power released by AI systems in order to defend those public goods.