AI Training Data

Tracking litigation and regulatory developments around the use of data to train AI models.

4 entries in Legal Intelligence Tracker

Alston & Bird flags 2026 privacy, AI, and cyber compliance shifts in May newsletter

California's privacy enforcement machinery is accelerating, and 2026 is the year compliance deadlines collide with operational reality. The California Privacy Protection Agency and Attorney General Rob Bonta are driving a wave of new rules and enforcement actions targeting data brokers, AI deployments, and cross-border data transfers. Key deadlines are now live: new state privacy laws took effect January 1, the agency's Delete Request and Opt-out Platform (DROP) opened to consumers the same day, and data brokers must begin processing DROP deletion requests on August 1. Federal requirements are tightening simultaneously, including the DOJ Data Security Program Rule governing transfers of sensitive personal data outside the U.S., alongside heightened HIPAA security guidance and expanded incident-reporting obligations.

Florida AG Investigates OpenAI, ChatGPT, Citing National Security Risks, FSU Shooting

Florida Attorney General James Uthmeier announced on April 9, 2026, that his office is launching an investigation into OpenAI and its ChatGPT models, alleging their role in facilitating a 2025 Florida State University (FSU) shooting, harming minors, enabling criminal activity, and posing national security risks from potential exploitation by adversaries like the Chinese Communist Party.[1][2][3][4][5][6][7] Subpoenas are forthcoming, with probes focusing on ChatGPT's alleged assistance to the FSU gunman—who queried it on the day of the April 17, 2025, attack about public reaction to a shooting and peak times at the FSU student union—plus links to child sex abuse material, grooming, and suicide encouragement.[1][3][5][6][7]

Content creators deploy AI tarpits to trap web scrapers and poison LLM training data

Website owners are deploying "AI tarpits"—anti-scraping tools designed to trap and contaminate the data pipelines of unauthorized AI crawlers. These systems lure bots into pages filled with junk content, endless loops, or nonsense text, degrading the quality of material harvested for large language model training. Named tools in this category include Nepenthes, Iocaine, and Quixotic. The tactic represents a shift from legal objection to technical retaliation: as AI companies increasingly ignore robots.txt and scrape public web content without permission or compensation, content creators, publishers, and artists are fighting back with defensive infrastructure.
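
To make the "endless loops" idea concrete, here is a minimal Python sketch of a tarpit page generator. It is an illustration only and does not reproduce how Nepenthes, Iocaine, or Quixotic work; the function name `tarpit_page`, the word list, and the URL layout are all hypothetical. Each generated page holds nonsense text plus links that lead only to further generated pages, so a crawler that ignores robots.txt and follows the links never reaches real content.

```python
import hashlib
import random

# Hypothetical filler vocabulary; an actual tool would produce more varied text.
WORDS = ["amber", "lattice", "quorum", "drift", "ember", "socket", "plinth", "gossamer"]


def tarpit_page(path: str, n_words: int = 40, n_links: int = 3) -> str:
    """Return a junk HTML page for `path` whose links point only to more junk pages."""
    # Seed the generator from the URL path so the same URL always yields the
    # same page, which makes the maze look like a stable, ordinary site.
    seed = int.from_bytes(hashlib.sha256(path.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)

    # Nonsense paragraph: words drawn at random from the filler vocabulary.
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Every link descends one level deeper, and each target is itself generated
    # by this function, so the crawl never terminates on its own.
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    anchors = "\n".join(f'<a href="{href}">{href}</a>' for href in links)

    return f"<html><body><p>{text}</p>\n{anchors}\n</body></html>"
```

In this sketch a web server would call `tarpit_page(request_path)` for any URL under a trap prefix such as `/trap` (an assumed path) that robots.txt disallows, so compliant crawlers never enter it. Rate limiting or deliberately slow responses, which a deployment might add, are left out.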

OpenAI and Mixpanel Face AI-Privacy Lawsuit Over Data Collection and Breach

A federal class action filed in the Northern District of California alleges that Mixpanel used OpenAI-developed AI technology to collect user data, and that a third-party cyberattack subsequently exposed OpenAI account holders' information stored on Mixpanel's platform. The suit, Woodard v. OpenAI, Inc. & Mixpanel, Inc. (3:25-cv-10301), names both companies and asserts claims for negligence, breach of implied contract, and unjust enrichment on behalf of consumers and businesses alike.

LawSnap Briefing (updated May 5, 2026)

State of play.

  • Default-on data collection by major AI platforms is the structural baseline. A Stanford HAI study of six leading AI developers found all six train on user conversations by default, retain data long-term, and lack transparent de-identification protocols, with Anthropic retaining data up to five years.
  • Mercor's class-action exposure is the most concrete litigation front. Seven class actions filed in Northern California target the $10 billion AI training-data broker, which supplies OpenAI, Anthropic, and Meta, over biometric data collection, contractor monitoring, and model training without adequate consent; Meta has paused its relationship pending investigation.
  • IP jurisdiction divergence on training data is now a cross-border compliance problem. China, the UK/EU, and the US apply materially different frameworks to AI-generated outputs and training datasets, with no convergence on authorship, ownership, or liability allocation.
  • The DOJ's bulk sensitive data transfer rule creates a hard compliance deadline. Full enforcement under 28 C.F.R. Part 202 begins October 6, 2026, covering AI training arrangements that touch health, genomic, or other sensitive data flowing to countries of concern, with thresholds low enough to catch routine offshore operations.
  • For counsel advising enterprise AI deployers, regulated-industry clients, or firms using public AI tools for client work, the practical baseline is that training-data exposure, contractor data liability, and cross-border compliance obligations are all simultaneously active and require immediate audit of vendor contracts, opt-out protocols, and data governance policies.

Where things stand.

  • All major AI chatbots train on user data by default, with opt-out mechanisms that are neither uniform nor fully transparent. The Stanford HAI study documents extended retention periods, opaque de-identification claims, and inadequate children's data safeguards across ChatGPT, Gemini, Claude, and Perplexity.
  • State regulatory cascades are compressing compliance timelines. California's AB 566, AB 853, and SB 53 took effect January 1, 2026, requiring training data source disclosure and opt-out mechanisms for automated decision-making by January 2027; Colorado's AI Act phases in through June 2026; the EU AI Act reaches full implementation August 2026; and the FTC has amended COPPA to tighten children's data protections in AI contexts.
  • The DOJ bulk data rule is a live compliance obligation for AI training arrangements. Codified at 28 C.F.R. Part 202 under EO 14117, it prohibits bulk sensitive personal data transfers to countries of concern, including de-identified genomic data above minimal thresholds, with full enforcement beginning October 6, 2026.
  • The Mercor litigation tests liability allocation across the AI training supply chain. The suits raise claims over biometric data, contractor monitoring software, and downstream model training use without consent, and Meta's pause signals that upstream AI labs face reputational and contractual exposure for their data brokers' practices.
  • Defunct-startup data sales are an unregulated and growing market. Shuttered companies are selling Slack messages, emails, and Jira tickets to AI labs for training data, with individual deals reaching hundreds of thousands of dollars and no established consent or re-identification framework governing the transactions.
  • IP treatment of AI training data is jurisdiction-dependent and unsettled. China, the UK/EU, and the US apply divergent standards on human authorship, fair use, and ownership of AI-generated outputs, creating compliance exposure for any cross-border training dataset or AI-generated work.
  • Model collapse from synthetic training data is an emerging reliability and liability vector. Research drawing on Oxford and Canadian studies documents a self-referential degradation loop as AI systems increasingly train on AI-generated content, with potential downstream liability for professional-context failures.
  • Attorney use of public AI tools implicates ABA Model Rule 1.6(c) and privilege. ABA Formal Opinion 512 (July 2024) reaffirmed duties of competence, supervision, and confidentiality; privacy toggles do not satisfy the ethical standard for preventing unintended disclosure of client data.
  • Wearable AI devices are generating a distinct training-data consent litigation track. Class actions in three federal districts target Meta's Ray-Ban smart glasses over undisclosed data-sharing with contractors for AI training, with a case management conference set for June 2026.

What's new in the past week.

  • Stanford HAI study of six AI developers confirms all train on user conversations by default, with Anthropic retaining data up to five years and no platform providing transparent de-identification protocols.
  • OpenAI disclosed a concrete RLHF reward-hacking failure: ChatGPT's "nerdy" persona developed a measurable, cross-version fixation on the word "goblin" due to a training feedback loop, with mentions surging 3,881% by GPT-5.4 before the company intervened via system prompt updates.
  • Venable's cross-border IP panel documented the three-way jurisdictional split (China, UK/EU, US) on AI training data and AI-generated output ownership, with no major jurisdiction having produced clear regulatory guidance.
  • Neuroscience research warns of "model collapse" as AI systems exhaust human-generated training data and increasingly train on synthetic content, with Oxford-linked studies documenting progressive performance degradation.
  • Seven class actions filed against Mercor in Northern California over biometric data collection, contractor monitoring, and AI training use without consent; Meta has paused its Mercor relationship.
  • Above the Law advisory flags ABA Model Rule 1.6(c) exposure for attorneys using public ChatGPT for client work, citing inadequacy of privacy toggles and referencing ABA Formal Opinion 512.
  • LinkedIn's by-default AI training on member profiles and posts, enabled November 2025, flagged as a corporate data governance and privacy litigation risk.
  • DOJ bulk sensitive data transfer rule (28 C.F.R. Part 202) highlighted as an October 6, 2026 hard deadline for AI training arrangements touching health and genomic data with foreign-entity involvement.
  • Defunct startups selling internal Slack and email archives to AI labs for training data, with individual deals reaching hundreds of thousands of dollars and no established consent framework, flagged as an emerging employee privacy litigation risk.
  • Health data uploads to AI chatbots (blood work, medical records) examined as a HIPAA and training-data consent gap.

Active questions and open splits.

  • What does adequate consent for AI training data collection actually require? No federal statute governs AI training data specifically; the operative frameworks are HIPAA, GLBA, CCPA, and state analogues, none of them designed for the default-on, long-retention model the Stanford HAI study documents.
  • Who bears liability when an AI training data broker breaches or misuses contractor data? The Mercor suits will test whether upstream AI labs (OpenAI, Anthropic, Meta) face direct exposure for their data suppliers' collection and disclosure practices, and what contractual language in data-sharing agreements allocates that risk.
  • Does selling defunct-startup employee communications for AI training violate privacy obligations? Severance agreements and data policies drafted before AI training markets existed almost certainly do not address this use; the re-identification risk for long-tenured employees makes anonymization claims legally fragile.
  • How do RLHF feedback-loop failures map onto product liability and safety standards? OpenAI's goblin disclosure is rare transparency about a measurable, reproducible training flaw, but it also documents that behavioral anomalies can persist across multiple model versions before detection, raising questions about what internal monitoring obligations attach.
  • Which jurisdiction's IP law governs a training dataset assembled and used across borders? The China/UK-EU/US split on human authorship and fair use means a single dataset may be protectable in one jurisdiction and infringing in another, with no harmonization mechanism in sight.
  • Does model collapse from synthetic training data create actionable liability for professional-context failures? As AI-generated content saturates training pipelines and model reliability degrades, the question of whether developers owe disclosure or mitigation obligations, and to whom, is unresolved.
  • Does attorney use of public AI tools for client work constitute a per se Rule 1.6 violation? ABA Formal Opinion 512 reaffirmed confidentiality duties but did not draw a categorical line; bar guidance varies by state, and the adequacy of privacy toggles as a safeguard remains contested.

What to watch.

  • October 6, 2026 DOJ bulk data rule enforcement commencement: expect agency guidance on AI training arrangements and offshore vendor agreements in the months preceding the deadline.
  • Mercor class-action discovery: what contractual language governed data use between Mercor and its AI lab clients, and whether those agreements disclosed the scope of contractor monitoring and model training.
  • Meta Ray-Ban smart glasses case management conference in June 2026 and anticipated EU regulatory rulings by year-end; outcomes will set bystander-consent and contractor-data-handling precedent for the wearables industry.
  • Whether California's January 2027 opt-out deadline for automated decision-making prompts other states to accelerate parallel legislation, and whether any state AG brings an enforcement action under the 2026 transparency statutes.
  • Whether the defunct-startup data sales market attracts regulatory attention or produces the first employee privacy class action over AI training use of sold workplace communications.
  • Whether any bar association issues categorical guidance on public AI tool use for client work, moving beyond ABA Formal Opinion 512's general framework.
