Seattle Giant Amazon Stumbled On Huge Cache Of Child Sex Abuse Images

Amazon quietly told child-safety officials it had uncovered hundreds of thousands of suspected child-sex-abuse images and related files inside datasets the company had assembled to train artificial-intelligence models. The company says it stripped out the material before any model training began, but it has not said where those files came from or how they got into its training pool. Child-protection groups warn that this kind of missing provenance can make reports useless for investigators and can leave real victims unidentified. The revelation is now fueling sharper questions about how big tech companies build and vet the massive data corpora behind generative AI.

Bloomberg’s reporting

According to a report by Bloomberg, Amazon’s automated systems last year flagged hundreds of thousands of pieces of content as matching known child-sex-abuse material, and the company reported those matches to the National Center for Missing & Exploited Children. Those tips made up the bulk of the more than 1 million AI-related reports NCMEC received in 2025, a sharp break from prior years. Bloomberg notes that child-safety officials and researchers view the sheer scale of Amazon’s reports as a striking outlier that still needs a clear explanation.

What child-safety groups recorded

Data from NCMEC shows generative-AI-related reports surged into the hundreds of thousands in just the first half of 2025, a steep rise from earlier periods. NCMEC leaders told reporters that many of the reports arrived without crucial origin information, which one official bluntly labeled “inactionable” for law enforcement trying to track suspects or locate children. Advocates say that combination of huge volume and thin provenance is exactly what keeps police from turning raw data into real-world cases.

Amazon’s response

An Amazon spokesperson told reporters the company runs candidate training data through automated hashing systems that compare files against databases of known child sexual abuse material, removes any matches before model training, and deliberately uses an over-inclusive threshold so that possible CSAM is less likely to slip through. In a statement cited by Engadget, Amazon said it is “not aware of any instances” in which its models have generated child-sex-abuse material and that roughly 99.97% of the reports came from scanning non-proprietary training data. The company also acknowledged that it often lacks provenance metadata that would make some of its own reports truly actionable for investigators.

Why experts are alarmed

Researchers warn that the near-exposure of AI systems to exploitative material, even when that material is removed before training, highlights a deeper structural risk in how datasets are collected. They say this episode underscores how easy it is for highly sensitive content to be swept into giant data troves and how that can normalize techniques that digitally alter or sexualize images of minors. Reporting in The Guardian and other outlets has already documented that generative tools can produce realistic child sexual imagery and complicate forensic analysis. Safety advocates argue that the mix of vast, rapidly assembled datasets and loose provenance controls is precisely the blind spot regulators have been warning about.

Calls for transparency and standards

Policy researchers and academics say the Amazon disclosures heighten the need for stronger provenance rules and mandatory transparency from AI developers, a point emphasized in analysis by Stanford Cyberlaw. Experts there and elsewhere caution that raw report totals can mislead if they are not paired with context about deduplication, human review and the technical workings of automated scanning tools. Advocates argue that regulators and Congress should move toward standardized disclosure requirements so that, when suspected CSAM is detected, law enforcement receives enough detail to actually pursue a case.

What’s next

National outlets including the Los Angeles Times and broadcast partners such as CBS News have amplified the Bloomberg report, prompting a new round of questions from lawmakers and child-safety advocates. Amazon has said it plans to publish broader data and methodology next month, a release that could help NCMEC and investigators piece together what happened and how. In the meantime, technologists and safety experts say the message is hard to miss: provenance checks, meaningful human review and tougher industry standards are no longer optional for what companies like to call “responsible AI.”

For now, the core facts remain straightforward. Amazon reported and removed material it believes matched known child-sex-abuse content, and child-protection officials say they need much better provenance to act on those tips. Investigators, legislators and watchdog groups are likely to keep pressing Amazon and its peers about how AI training datasets are collected, screened and audited in the first place.