[Note: this is written in a op-ed style to be a writing sample.]
In November 2023, Stanford researcher David Thiel made a startling discovery. The popular text-to-image AI model, Stable Diffusion 1.5, had been trained on a dataset containing child sexual abuse material (CSAM). His discovery sparked a public outcry. Forbes and CNN wrote negative articles, and the non-profit which hosted the dataset took it offline temporarily to remove the offending content.
However, the damage had already been done. Stable Diffusion 1.5 was released as an open model, so existing copies could not be retracted. Today, you can still access the model online, through uploads by random strangers.
This case highlights a more general problem. Without clear standards for training data, AI developers risk embedding harmful content into their systems. And a lack of transparency about how developers filter data creates risks for researchers and users. AI developers need to be proactively transparent around data practices, before more problems emerge.
Data-guzzlers
The central challenge with data arises out of how AI models are developed. AI systems, such as Stable Diffusion image models and OpenAI's GPT-4.5, are more grown than made. While software engineers build traditional software by writing specific instructions in code, AI developers instead train AI models by feeding them large amounts of data. The AI learns to understand and predict patterns present in that data. The more data you give the AI system, the better it is at internalizing those patterns and producing useful output.
Seeking to make the best model possible, developers compile massive datasets, often copying most of the public internet. However, the internet contains harmful or sensitive material, like CSAM, copyrighted content, and leaked evaluations. The AI models can then internalize and amplify these harmful outputs.
One particularly prominent example of sensitive data is copyrighted content. AI companies like OpenAI use copyrighted data from the internet, often without compensation agreements. In response, several media companies have sued OpenAI for copyright infringement, in part alleging that ChatGPT sometimes repeats parts of its articles word for word.
AI companies have responded by paying media outlets for the licensed use of their data. In May 2024, OpenAI paid News Corp an estimated $250 million for their content, which includes the Wall Street Journal. And, in a recent paper, fellow AI developer Anthropic stated that they respect website's instructions for whether they allow crawling.
Cheating on the test ... and getting away with it:
Data contamination poses a more subtle challenge for evaluating model quality. Currently, one of the primary ways that researchers and policymakers assess the strength of a released model is by looking at benchmarks, which are sets of easily-graded questions or tasks. These benchmarks are most accurate if AI models have not seen the questions -- or the answers -- beforehand. With publicly accessible datasets, however, there are no strict guarantees that developers will not train on that data.
Because of this, some researchers include "canary strings" -- long sequences of random digits -- in their benchmark data. If a model is able to repeat the canary, it means that it has seen the data on the benchmark.
This approach has revealed that AI models such as OpenAI's GPT-4 and Anthropic's Claude were trained on data explicitly meant to be excluded from training. Both models have been shown to know the canary string for "BigBench," one of the largest and most important benchmarks for large language models (LLMs). While OpenAI notes this in their technical report on GPT-4 (footnote 5), Anthropic has never publicly admitted to have included BigBench data in their training set.
Unlike CSAM and copyright, there has been no big backlash to this contamination, and no corresponding transparency. This leaves researchers in a bit of a tricky spot, because it's unclear whether benchmarks or other sensitive papers will end up in the training data. And there is no reliable way yet to determine that a given dataset has not been trained on. For instance, it is easy to train a model not to reproduce a canary string.
This lack of transparency poses real threats. Recent research from Anthropic suggests that AI systems learn unexpected lessons from the data they are trained on. Documents which detail training procedures can lead models to fake being trained further. Seeing papers which claim that AIs cheat a lot on exams make them more likely to cheat themselves.
Ideally, safety papers with sensitive content would not end up in the training data if they made models more likely to behave dangerously. If filtering out such papers is not possible, AI companies should be upfront about this fact, so that benchmark creators and researchers could choose which data they share more carefully.
Transparency: the time is now:
Stable Diffusion-1.5's case demonstrates both the problem and potential solutions. A couple of months after Thiel broke the news, the leading AI companies partnered with Thorn, an anti-CSAM organization, to establish a set of guidelines to combat CSAM. Data filtration was the first principle on the list. Still, this progress only came after a harmful model had been released and couldn't be shut down.
AI companies should avoid repeating the mistake. They need to be proactively transparent about the types of sensitive data that they are using. If companies cannot guarantee the exclusion of certain types of data (like files with canary strings), they should state this explicitly, so that researchers and policymakers can make an informed decision.
It's time to be transparent about sensitive data, before the next harmful dataset becomes embedded in systems we cannot recall.