A frontier large language model represents tens to hundreds of millions of dollars of compute, terabytes of training data, and years of engineering refinement. The model weights are, in commercial terms, the most valuable asset of the company that produced them. They are also surprisingly hard to protect in practice. The threat of model theft has moved from theoretical to operational, with documented attempts and successful extractions across multiple categories of models.
The defensive landscape is uneven, and the trajectory of regulation and best practice is still being written.
The categories of model theft
Three distinct attack patterns exist:
Direct weight exfiltration. Theft of the actual model weights, a file or files representing the complete trained network. Requires access to the systems where the model is stored or served. Most damaging because it gives the attacker exactly what the original developer has.
Model extraction (also called model stealing). The attacker queries the deployed model through whatever API is available and uses the responses to train a substitute model that approximates the target’s behaviour. This does not yield the exact weights but produces a functional clone. Pioneered against machine-learning-as-a-service prediction APIs (Tramèr et al. 2016) and now well-developed for LLMs.
Distillation. A specific form of extraction where the attacker explicitly trains a smaller "student" model to imitate the larger "teacher" model. The technique is widely used legitimately (model compression) and increasingly used adversarially.
A fourth, lower-level concern: training-data extraction, where the attacker recovers individual training examples from the model. Demonstrated against GPT-2 (Carlini et al. 2021) and continues to be a research topic. Less about stealing the model than about extracting sensitive information embedded in it.
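To make the extraction category concrete, the following is a minimal sketch of the equation-solving idea associated with Tramèr et al., assuming the simplest possible target: a logistic-regression model whose API returns full-precision confidence scores. The names make_target and extract_logistic are hypothetical, and the secret parameters in any demonstration would be made up. Extracting an LLM is far more expensive and only approximate; the sketch only illustrates why returned confidences leak parameters.

```python
import math


def make_target(weights, bias):
    """Build a stand-in for a deployed prediction API (hypothetical).

    The caller only sees predict(x) -> confidence, never weights or bias.
    """
    def predict(x):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return predict


def logit(p):
    """Inverse of the sigmoid: recovers w.x + b from a confidence score."""
    return math.log(p / (1.0 - p))


def extract_logistic(predict, dim):
    """Recover weights and bias with dim + 1 queries.

    Querying the zero vector reveals the bias; querying each unit vector
    reveals one weight plus the bias.
    """
    bias = logit(predict([0.0] * dim))
    weights = []
    for i in range(dim):
        probe = [0.0] * dim
        probe[i] = 1.0
        weights.append(logit(predict(probe)) - bias)
    return weights, bias
```

Under these assumptions, calling extract_logistic(make_target([0.5, -1.25, 2.0], 0.3), 3) returns the secret parameters up to floating-point error. Rounding or withholding confidence scores breaks this exact attack, which is one reason output perturbation and restricted endpoints appear among the defences later in the article.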
Documented incidents
The publicly known cases give a sense of the landscape:
LLaMA leak (March 2023). Meta’s first-generation LLaMA model was distributed under a research-only licence, with access granted on approval. Within about a week of the restricted release, the weights were posted on 4chan and rapidly mirrored across torrent sites and Hugging Face. The leak appears to have originated with someone who had been granted access rather than with an external intrusion; Meta subsequently released LLaMA 2 under a more permissive licence, arguably partly in response.
Mistral 7B early leak (August 2023). Pre-release weights reportedly circulated before the model was released openly shortly afterward.
Accidental early releases of other open-weight models (Mosaic / DBRX, Falcon, and others). Multiple incidents were reported through 2023-2024 of models intended for staged release leaking earlier than planned.
Closed-API extraction research. Carlini et al. ("Stealing Part of a Production Language Model," 2024) demonstrated that the logit-bias and log-probability features of production LLM APIs could be used, through carefully constructed queries, to extract structural information about the model. The paper recovered the hidden dimension of GPT-3.5-class models and the full final projection matrix of OpenAI’s smaller Ada and Babbage models. OpenAI subsequently restricted the API.
Distillation of frontier models. The training of competitive open-weight models on datasets generated by closed models is widely understood to have happened across 2023-2024. Specific instances are hard to confirm because the practice is somewhere between "questionable" and "violating ToS" depending on the case. OpenAI’s terms of service prohibit using ChatGPT outputs to train competing models; enforcement is necessarily limited.
Insider-driven theft attempts. Several US companies have alleged in court filings that departing engineers exfiltrated model weights or training data. The filings provide some public detail, though most disputes are settled or sealed.
Why model weights are hard to protect
The structural difficulties:
The model is a file. Once an attacker has read access, they can copy it. Standard file-system access controls work, but in a complex training environment many people and many automated systems need read access at various points.
The model is queryable. Even without access to weights, the deployed model is a function the attacker can interact with. Any API exposes information; aggregating enough queries can produce a functional clone.
The model can be embedded in other systems. Once integrated into a product, the weights may be distributed alongside the product to customer systems. Edge deployments and on-device inference particularly create distribution.
Traditional DRM does not work well. The model weights, by their nature, must be loaded into memory and computed against. Cryptographic protection of weights at rest works; protection during inference is much harder.
The provenance of derivative work is hard to prove. If a competitor releases a model that performs similarly to yours, proving they trained it on your output, distilled from your model, or copied your weights is technically difficult.
Defences against direct exfiltration
The most tractable category. Standard infosec hygiene:
Access controls on storage. Model weights stored with strong access controls; audit logging on every read; alerts on unusual access patterns.
Encryption at rest. Weights encrypted with keys that require active authentication to use. Particularly relevant for weights distributed in client devices or cloud appliances.
Watermarking. Embed identifying patterns in the model weights or behaviour that allow you to identify your model if it appears elsewhere. Active research area; multiple techniques (parameter-level watermarks, behaviour-level watermarks, training-data watermarks). Some are robust to fine-tuning; some are not.
Insider threat programs. A plurality of confirmed model-theft incidents involve insiders or people with legitimate access. Standard insider-threat tooling (DLP, behavioural monitoring, exit interviews) applies.
Air-gap and physical security for high-value training environments. The frontier-model labs (Anthropic, OpenAI, Google DeepMind, Meta FAIR, Microsoft Research) maintain secure compute environments with substantial physical and procedural protections. Smaller organisations training competitive models often have weaker controls.
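The access-control and audit-logging items above can be sketched in a few lines. This is a minimal illustration only, assuming checkpoints are ordinary files and that every read goes through a wrapper you control; the function names, the JSON-lines log format, and the allowlist are all hypothetical. Real deployments would typically log at the storage layer and would not re-hash a multi-gigabyte checkpoint on every read.

```python
import hashlib
import json
import time


def sha256_file(path, chunk_size=1 << 20):
    """Hash a checkpoint in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_read(log_path, user, checkpoint_path, timestamp=None):
    """Append one audit record (who, what, when, content hash) as a JSON line."""
    entry = {
        "time": time.time() if timestamp is None else timestamp,
        "user": user,
        "checkpoint": checkpoint_path,
        "sha256": sha256_file(checkpoint_path),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def unexpected_reads(log_path, allowed_users):
    """Return audit records whose user is not on the allowlist."""
    flagged = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            entry = json.loads(line)
            if entry["user"] not in allowed_users:
                flagged.append(entry)
    return flagged
```

Recording the content hash alongside each read also gives a later investigation a way to tie a leaked file back to a specific checkpoint, which complements the watermarking item above.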
Defences against extraction and distillation
Harder. The defences are about raising attacker cost, not preventing extraction:
Rate limiting and query monitoring. Detect query patterns characteristic of extraction attempts (large query volumes, systematic exploration of input space, queries from known research groups).
Output perturbation. Add noise to model outputs to make extraction harder. Trade-off: degrades legitimate use.
Watermarking outputs. Embed statistical signatures in generated text that allow you to detect outputs used to train derivative models. Kirchenbauer et al.’s "A Watermark for Large Language Models" (2023) at arxiv.org/abs/2301.10226 is the foundational paper. OpenAI and Google have publicly discussed deploying watermarks; effectiveness varies.
Access control on high-value APIs. Embedding APIs, fine-tuning APIs, and other endpoints that disclose more model information are more tightly controlled than completion APIs.
Legal protection through ToS. The contracts used by major LLM API providers explicitly prohibit using outputs to train competing models. Enforcement is partial but real.
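The output-watermarking item can be illustrated with the detection side of a green-list scheme in the spirit of Kirchenbauer et al. The sketch below is a simplification under stated assumptions: the paper seeds a random vocabulary partition from the previous token, whereas here each (previous token, token) pair is hashed directly with a secret key. The names is_green, z_score, and detect are hypothetical. If a fraction gamma of the vocabulary is "green" at each step, unwatermarked text should contain roughly gamma * T green tokens out of T scored tokens, and the z-score measures how far the observed count deviates from that.

```python
import hashlib
import math


def is_green(prev_token, token, key, gamma):
    """Decide pseudo-randomly, from a keyed hash, whether token is green.

    Simplified stand-in for the paper's seeded vocabulary partition.
    """
    data = f"{key}:{prev_token}:{token}".encode("utf-8")
    value = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return value / 2**64 < gamma


def z_score(green_count, total, gamma):
    """One-proportion z-statistic: (s - gamma*T) / sqrt(T*gamma*(1-gamma))."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def detect(tokens, key, gamma=0.5):
    """Score a token sequence; large positive values suggest a watermark."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score")
    pairs = list(zip(tokens, tokens[1:]))
    green = sum(is_green(prev, tok, key, gamma) for prev, tok in pairs)
    return z_score(green, len(pairs), gamma)
```

A detector would flag text whose score exceeds some threshold chosen for an acceptable false-positive rate. Paraphrasing and further training on the outputs both weaken the signal, which is consistent with the observation above that effectiveness varies.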
Defences against training-data extraction
A separate concern. The risk: a deployed model leaks individual training examples, including sensitive personal data, copyrighted text, or proprietary information.
Differential privacy in training. Adds noise during training to bound information leakage. Costly but produces provable guarantees. Production deployments are limited.
Data filtering. Remove sensitive examples from training data before training. Standard practice; never complete.
Output filtering. Detect and refuse outputs that match training-data segments verbatim. Extraction attacks in the style of Carlini et al. can circumvent exact-match filters, but filtering still raises the bar.
Membership inference defences. Adversarial training against attackers attempting to determine whether a specific example was in the training set.
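For the differential-privacy item, the core mechanism usually cited is the DP-SGD gradient step: clip each example's gradient to a fixed norm, sum, add Gaussian noise scaled to that norm, and average. The sketch below shows only that step, in plain Python with hypothetical function names; it omits the privacy accounting that turns the noise level into an actual (epsilon, delta) guarantee, so it should not be read as a complete private training loop.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def dp_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng=random):
    """Clip per-example gradients, sum, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * max_norm, so no
    single example can shift the sum by more than the noise scale implies.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    total = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(clipped) for n in noisy]
```

The cost mentioned above shows up in two places: per-example gradients are more expensive to compute than a batch gradient, and the added noise slows or degrades learning.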
The IP and legal landscape
Several developments through 2024-2025 matter:
Trade-secret protection of model weights is increasingly tested in court. Insider-departure disputes involving AI developers test the boundaries.
Patent protection of model architectures and training techniques is an active area.
Copyright treatment of model weights is unsettled. The broader "AI output copyright" debates, including disputes over outputs derived from public-domain material, frame parts of this.
Trade-secret-style protection appears to be the most operationally effective. Patents have not been heavily asserted yet.
The EU AI Act and related regulations require certain transparency about training data and capabilities; the conflict between transparency and IP protection is a central tension.
What organisations should do
For frontier-model developers:
Treat model weights as the highest-tier intellectual property. Access controls equivalent to source code or financial systems.
Implement watermarking for both weights (where feasible) and outputs.
Monitor for extraction attempts at the API layer.
Maintain forensic visibility: what model checkpoints exist, where they are stored, who has accessed them, and when.
Track legal protections in your jurisdiction. Trade-secret status often requires demonstrable protective measures.
For organisations deploying AI:
Understand the licence terms of the models you use. Open-weight, research-only, commercial-restricted, and fully open licences each have different implications.
Treat fine-tuned model weights as containing potentially sensitive information from your fine-tuning data. Protect accordingly.
Audit access to inference endpoints; rate-limit aggressively if extraction is a concern.
Recognise the risk of building business-critical systems on closed APIs that may change unpredictably; weight-availability planning is part of architecture.
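As an illustration of the rate-limiting advice for inference endpoints, here is a minimal sliding-window limiter keyed by API key. It is a sketch under simple assumptions: a single process, timestamps passed in by the caller, and a hypothetical class name. It catches raw volume only; detecting systematic exploration of the input space needs additional analysis of query content.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_queries per api_key in any window_seconds span."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # api_key -> timestamps of allowed queries

    def allow(self, api_key, now):
        """Return True and record the query if the key is under its limit."""
        events = self._events[api_key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_queries:
            return False
        events.append(now)
        return True
```

In use, the endpoint would call allow(api_key, time.time()) before serving each request and log refusals, so that sustained pressure against the limit becomes visible in the same audit trail as ordinary access.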
The deeper observation: models are software, weights are software artifacts, and they require the same supply-chain and access-control discipline as any other valuable software asset. The treatment will mature; the risks are present today.
