Adversarial Examples: Tricking ML Models with Imperceptible Changes

In 2013, a group of researchers led by Christian Szegedy demonstrated something unsettling: a deep neural network could be made to misclassify an image by adding a small, carefully chosen pixel-level perturbation. A year later, Ian Goodfellow and colleagues published the now-famous illustration: a network that correctly classified an image of a panda would, after such a perturbation, classify the same image as a gibbon with high confidence. The perturbation was imperceptible to humans. The phenomenon, known as adversarial examples, has been among the most-studied and least-resolved problems in machine learning security for over a decade. It now has direct operational relevance as ML models are deployed in security-critical roles.

The state of adversarial robustness in 2026 is one of partial defences, ongoing arms races, and a growing recognition that perfect robustness against an adaptive attacker is probably impossible with current model architectures.

The classical attack

The original Goodfellow et al. paper "Explaining and Harnessing Adversarial Examples" (2014) at arxiv.org/abs/1412.6572 introduced the Fast Gradient Sign Method. The intuition is that the gradient of the model’s loss with respect to the input tells you which direction to push pixels to maximally confuse the classifier; a small step in that direction frequently flips the prediction.
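As a minimal sketch of the idea, the snippet below applies one FGSM step to a toy logistic-regression classifier in NumPy. The update is x_adv = x + epsilon * sign(gradient of the loss with respect to x). The weights `w`, the input `x` and the budget `epsilon` are made-up illustrative values, and logistic regression stands in for a deep network only because its input gradient can be written by hand; this is not the paper's experimental setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, epsilon):
    """One FGSM step against a logistic-regression classifier.

    x: input vector, y: true label (0 or 1), w and b: model parameters.
    """
    p = sigmoid(w @ x + b)      # model's probability of class 1
    grad_x = (p - y) * w        # gradient of the cross-entropy loss w.r.t. x
    return x + epsilon * np.sign(grad_x)

# Illustrative values only (assumptions, not taken from any real system).
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.3, -0.1, 0.2])
y = 1

x_adv = fgsm(x, y, w, b, epsilon=0.25)

clean_is_class_1 = bool(sigmoid(w @ x + b) > 0.5)
adv_is_class_1 = bool(sigmoid(w @ x_adv + b) > 0.5)
print(clean_is_class_1, adv_is_class_1)
```

On this toy input the single step flips the predicted class while changing no feature by more than `epsilon`.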

Subsequent attacks (Projected Gradient Descent (PGD), the Carlini-Wagner attacks, momentum methods) refined the technique. The general principle is the same: gradient-based optimisation finds inputs that the model misclassifies even though they are visually indistinguishable from correctly classified examples.
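A rough sketch of how PGD differs from a single gradient-sign step: it takes several smaller steps and, after each one, projects the input back into the allowed perturbation budget (here an L-infinity ball of radius `epsilon` around the original input). The toy logistic-regression model and all numeric values are assumptions chosen for illustration, and the random starting point often used in practice is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd(x, y, w, b, epsilon, step_size, n_steps):
    """Iterated gradient-sign steps with projection onto the L-inf ball."""
    x_adv = x.copy()
    for _ in range(n_steps):
        p = sigmoid(w @ x_adv + b)
        grad_x = (p - y) * w                      # input gradient of the loss
        x_adv = x_adv + step_size * np.sign(grad_x)
        # Projection: keep every feature within epsilon of the original.
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
    return x_adv

# Illustrative values only (assumptions).
w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.3, -0.1, 0.2])
y = 1

x_pgd = pgd(x, y, w, b, epsilon=0.25, step_size=0.05, n_steps=10)
pgd_is_class_1 = bool(sigmoid(w @ x_pgd + b) > 0.5)
print(pgd_is_class_1, np.max(np.abs(x_pgd - x)))
```

For a linear model like this one the iterations end up at the same point a single full-size step would reach; the extra steps matter on non-linear models, where the gradient direction changes as the input moves.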

Three classes of attack matter operationally:

White-box attacks, where the attacker has the model weights and can compute gradients directly. Easiest from the attacker’s perspective; most damaging against open-weight models.

Black-box attacks, where the attacker has only query access. Harder but feasible; transfer attacks (training a surrogate model on observed inputs and outputs, then crafting adversarial examples that often transfer to the target) are the dominant technique.

Physical-world attacks, where the perturbation is added to a real-world object (printed sticker, fabric pattern, eyeglass frames) rather than digital pixels. Demonstrated against face-recognition systems, traffic-sign recognition in autonomous driving, and surveillance object detectors.
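The transfer technique described under black-box attacks can be sketched end to end on a toy problem. In the snippet below the attacker never reads the target's weights: it only calls a hypothetical `target_predict` function, fits a surrogate to the observed labels, computes a gradient-sign perturbation on the surrogate, and then checks whether the perturbed input also fools the target. The hidden weights, the number of queries and the perturbation budget are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden target model. The attacker code below only calls target_predict.
_W_TARGET = np.array([1.0, -2.0, 0.5])

def target_predict(X):
    """Query-only access: returns hard labels, no scores or gradients."""
    return (X @ _W_TARGET > 0).astype(float)

def train_surrogate(X, y, lr=0.5, epochs=300):
    """Fit a logistic-regression surrogate to observed query/label pairs."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

rng = np.random.default_rng(0)
queries = rng.normal(size=(500, 3))   # attacker-chosen probe inputs
labels = target_predict(queries)      # observed outputs
w_s, b_s = train_surrogate(queries, labels)

x = np.array([0.3, -0.1, 0.2])
y = target_predict(x[None, :])[0]     # target's label for the clean input

# Gradient-sign step computed on the surrogate only.
grad_x = (sigmoid(w_s @ x + b_s) - y) * w_s
x_adv = x + 0.25 * np.sign(grad_x)

clean_label = target_predict(x[None, :])[0]
adv_label = target_predict(x_adv[None, :])[0]
print(clean_label, adv_label)
```

The attack transfers here because the surrogate recovers roughly the same decision boundary as the target; with real models transfer is less reliable and usually takes more queries and a closer architectural match.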

What this means in practice

Several deployed systems have been demonstrated vulnerable in research:

Tesla Autopilot was fooled in 2019 (Tencent Keen Security Lab research) by small stickers placed on the road surface that caused the lane-keeping system to steer into the oncoming lane.

Face-recognition systems have been bypassed by adversarial eyeglass frames (CMU 2016 research) and by adversarial makeup patterns (subsequent work).

Object detectors used in surveillance and autonomous driving have been demonstrated bypassable by printed adversarial patterns.

Spam filters are routinely evaded by adversarial perturbations to email content; the cat-and-mouse here predates modern deep learning by decades.

Malware classifiers can be evaded by carefully constructed perturbations to PE binaries that preserve functionality.

Voice-recognition systems can be triggered by audio inputs imperceptible to humans, such as inaudible ultrasonic commands (the "DolphinAttack" line of work).

The threat model for each is different (autonomous vehicles vs. content moderation vs. malware classification), but the underlying mechanism is shared.

Adversarial examples against LLMs

The more recent and growing concern is adversarial inputs against language models. The pattern differs from image-classifier attacks but the family resemblance is clear:

Jailbreak prompts. Carefully constructed text inputs that induce the model to produce content it has been trained to refuse. The "DAN" prompts of 2023 and the AutoDAN automated jailbreak framework of 2024 both fall in this category.

Universal adversarial suffixes. Sequences of seemingly random tokens that, when appended to a wide range of prompts, induce harmful outputs. Andy Zou et al.’s "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023) at arxiv.org/abs/2307.15043 demonstrated suffixes that transferred across multiple aligned models.

Indirect prompt injection. The category covered in the separate prompt-injection post; structurally similar in that small input perturbations cause large behaviour changes.

Adversarial-perturbation attacks against LLM safety classifiers. The auxiliary models used to filter inputs and outputs in LLM applications are themselves vulnerable to adversarial examples in the classical sense.

Defences and their limits

Adversarial training. Train the model on adversarial examples generated during training so that it learns to be robust. The most effective defence in practice. Costly: roughly 5-10x the compute of standard training. Robustness gains are real but bounded; adaptive attacks designed against adversarially trained models still succeed at meaningful rates. Madry et al.’s "Towards Deep Learning Models Resistant to Adversarial Attacks" (2017) at arxiv.org/abs/1706.06083 is the foundational paper.
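A minimal sketch of the adversarial training loop, under heavy simplifying assumptions: the model is logistic regression, the inner attack is a single gradient-sign step rather than the multi-step PGD used in the Madry et al. formulation, and the data are two synthetic Gaussian blobs. Each epoch first perturbs the training inputs against the current parameters and then takes an ordinary gradient step on the perturbed batch. With a k-step inner attack each update costs roughly k+1 gradient computations, which is where the extra compute comes from.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(X, y, w, b, epsilon):
    """Gradient-sign perturbation of every row of X against the current model."""
    p = sigmoid(X @ w + b)
    grad_X = (p - y)[:, None] * w[None, :]
    return X + epsilon * np.sign(grad_X)

def train(X, y, epsilon=0.0, lr=0.1, epochs=200):
    """Gradient descent; when epsilon > 0, each step trains on attacked inputs."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        X_batch = fgsm_batch(X, y, w, b, epsilon) if epsilon > 0 else X
        p = sigmoid(X_batch @ w + b)
        w -= lr * X_batch.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))

# Synthetic two-class data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(1.5, 1.0, (n, 2)), rng.normal(-1.5, 1.0, (n, 2))])
y = np.concatenate([np.ones(n), np.zeros(n)])

w_rob, b_rob = train(X, y, epsilon=0.3)
clean_acc = accuracy(X, y, w_rob, b_rob)
robust_acc = accuracy(fgsm_batch(X, y, w_rob, b_rob, 0.3), y, w_rob, b_rob)
print(clean_acc, robust_acc)
```

The sketch evaluates on its own training data, which is acceptable for showing the mechanics but not for measuring robustness; a real evaluation would use held-out data and a stronger, adaptive attack.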

Certified defences. Mathematical proofs that the model’s prediction does not change within a specified perturbation budget. Randomised smoothing is the most successful technique. Provides genuine guarantees within tight bounds; struggles to scale to realistic threat models.
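To make randomised smoothing concrete, here is a simplified sketch. The smoothed classifier predicts by majority vote over Gaussian-noised copies of the input, and the L2 radius within which that vote cannot change is sigma * Phi^-1(p), where p is a lower bound on the probability of the winning class and Phi^-1 is the inverse standard normal CDF (the result from Cohen et al., 2019, in its two-class form). Two simplifications are assumptions of this sketch: it uses a loose Hoeffding bound rather than the tighter Clopper-Pearson interval, and it reuses the same samples to pick the class and estimate p. The base classifier is a hypothetical toy.

```python
import math
import numpy as np
from statistics import NormalDist

def smoothed_predict(base_classifier, x, sigma, n_samples, rng, num_classes=2):
    """Majority vote of the base classifier under Gaussian input noise."""
    noise = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    votes = np.array([base_classifier(x + d) for d in noise])
    counts = np.bincount(votes, minlength=num_classes)
    top = int(np.argmax(counts))
    return top, counts[top] / n_samples

def estimate_l2_radius(p_hat, n_samples, sigma, alpha=0.001):
    """Radius from a one-sided Hoeffding lower bound on the top-class probability."""
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    if p_lower <= 0.5:
        return 0.0          # abstain: no radius can be claimed
    return sigma * NormalDist().inv_cdf(p_lower)

# Hypothetical base classifier: a linear rule on two features.
base_classifier = lambda v: int(v[0] + v[1] > 0)

x = np.array([1.0, 1.0])
rng = np.random.default_rng(0)
top, p_hat = smoothed_predict(base_classifier, x, sigma=0.5, n_samples=1000, rng=rng)
radius = estimate_l2_radius(p_hat, n_samples=1000, sigma=0.5)
print(top, radius)
```

In this toy the true distance from `x` to the decision boundary is about 1.41, so a radius below that is consistent. It also illustrates the tightness problem: the guarantee covers noticeably less than the region where the prediction is actually stable.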

Input transformations. Compressing, denoising, or randomising inputs before passing to the model. Often defeats simple attacks; routinely defeated by adaptive attacks that account for the transformation.
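One simple input transformation is bit-depth reduction, sometimes described as a form of feature squeezing. The sketch below quantises inputs in [0, 1] to a small number of levels, which erases perturbations that stay inside a quantisation bin. It also shows the weakness: a perturbation chosen to cross a bin boundary survives. The function name and all values are illustrative assumptions.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Quantise values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

x = np.array([0.20, 0.55, 0.80])

# A small perturbation that stays inside each quantisation bin.
x_small = x + np.array([0.01, -0.01, 0.01])

# A perturbation chosen with the transformation in mind: it pushes the
# first feature across a bin boundary.
x_adaptive = x + np.array([0.02, 0.0, 0.0])

squeezed = reduce_bit_depth(x, bits=3)
small_removed = np.array_equal(squeezed, reduce_bit_depth(x_small, bits=3))
adaptive_removed = np.array_equal(squeezed, reduce_bit_depth(x_adaptive, bits=3))
print(small_removed, adaptive_removed)
```

The first perturbation disappears after quantisation and the second does not, which is the small-scale version of an adaptive attack accounting for the transformation.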

Detection. Train a separate classifier to identify adversarial inputs. The same vulnerability applies: the detector itself can be adversarially attacked.

Ensembles. Multiple models, each making a prediction; consensus required. Increases attacker cost; does not eliminate vulnerability since transferability between models is high.
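A consensus rule of this kind can be sketched in a few lines. The helper below returns a label only when at least `min_agreement` models agree, and otherwise abstains by returning `None` so the caller can fall back to a safe default or human review. The three threshold "models" are hypothetical stand-ins for independently trained classifiers.

```python
from collections import Counter

def consensus_predict(models, x, min_agreement):
    """Return the majority label if enough models agree, else None (abstain)."""
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label if count >= min_agreement else None

# Hypothetical stand-in models, each a simple rule on two features.
models = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]

unanimous = consensus_predict(models, (0.9, 0.8), min_agreement=3)
disputed = consensus_predict(models, (0.9, 0.2), min_agreement=3)
print(unanimous, disputed)
```

Requiring unanimity raises the bar for the attacker but also increases abstentions on legitimate inputs, so the threshold is a trade-off rather than a free win.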

Architectural choices. Some model architectures (sparse models, capsule networks, vision transformers in some configurations) appear to be more robust than others. The advantage is small.

The honest summary: there is no general-purpose defence that defeats an adaptive attacker. The defensive value comes from raising attacker cost, narrowing the space of practical attacks, and making attacks detectable through monitoring.

The 2026 operational picture

Several trends matter for organisations deploying ML:

Robustness has been added to model evaluation suites. The MITRE ATLAS framework at atlas.mitre.org catalogues adversarial-ML attack techniques and is becoming the canonical reference for ML threat modelling. NIST has published guidance on adversarial ML, including the NIST AI 100-2 taxonomy of attacks and mitigations. AI red-teaming exercises (covered in a separate post) routinely include adversarial testing.

Production systems are deploying detection-style defences: behavioural monitoring of ML system inputs to flag anomalous patterns, rate limiting and authentication on ML inference endpoints, and ensemble voting, where one model’s prediction must be confirmed by others before action is taken.

The regulated-industry picture is evolving. Under the EU AI Act (Article 15), high-risk systems must achieve an "appropriate level of accuracy, robustness and cybersecurity", language that explicitly contemplates adversarial robustness. Compliance documentation is starting to include adversarial-evaluation results.

Security research budgets are growing. Dedicated venues such as IEEE SaTML, together with ML-security tracks at established conferences such as USENIX Security and IEEE S&P, now provide peer-reviewed homes for adversarial ML research.

Practical recommendations

For organisations deploying ML in security-relevant contexts:

Threat-model the ML system explicitly. The MITRE ATLAS framework gives a vocabulary. Consider both white-box and black-box attacker capabilities.

Evaluate robustness at deployment time. Standard adversarial benchmarks (RobustBench, ImageNet-A, AdvBench for LLMs) provide reference points.

Assume an adaptive adversary. Robustness against fixed benchmarks does not imply robustness against an attacker who knows your defences.

Bound the consequences. ML systems should not have unfettered ability to take action; human-in-the-loop or capability-bounded design limits the damage from successful attacks.

Monitor for unusual patterns. Adversarial attacks typically leave statistical fingerprints on inference patterns even when they evade single-input detection.

Consider the threat economy. Adversarial-ML attacks require effort; the rate at which attackers expend that effort is bounded by the value of the target. Most low-stakes ML deployments will not face sophisticated adversarial attacks. Some will.
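The monitoring recommendation above can be illustrated with a small stateful check. Query-based attacks tend to send many near-identical inputs while probing a decision boundary, so one possible fingerprint is an unusual number of recent queries lying very close to the current one. The `QueryMonitor` class, its thresholds and the synthetic traffic below are all hypothetical; a production version would at least track queries per client and use a distance measure suited to the input type.

```python
import numpy as np
from collections import deque

class QueryMonitor:
    """Flags a query when too many recent queries are near-duplicates of it."""

    def __init__(self, window=200, distance_threshold=0.1, max_close=5):
        self.recent = deque(maxlen=window)   # sliding window of past queries
        self.distance_threshold = distance_threshold
        self.max_close = max_close

    def observe(self, x):
        x = np.asarray(x, dtype=float)
        close = sum(
            1 for past in self.recent
            if np.linalg.norm(x - past) < self.distance_threshold
        )
        self.recent.append(x)
        return close >= self.max_close

rng = np.random.default_rng(0)

# Benign traffic: unrelated inputs spread across the feature space.
benign = QueryMonitor()
benign_flags = [benign.observe(rng.uniform(size=20)) for _ in range(100)]

# Probing traffic: tiny variations around a single base input.
attacked = QueryMonitor()
base = rng.uniform(size=20)
attack_flags = [
    attacked.observe(base + 1e-3 * rng.normal(size=20)) for _ in range(100)
]

print(any(benign_flags), any(attack_flags))
```

A check like this looks at the sequence of queries rather than any single input, which is why it can catch probing that per-input detectors miss. An attacker who spreads queries across accounts or adds larger decoy noise can still evade it.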

The deeper position: ML in 2026 has a real adversarial-robustness problem and a partial set of defences. Deploying ML in safety- or security-critical applications requires acknowledging both. The research community continues to produce both attacks and defences; the gap may close over the next decade or it may persist. Operating today requires planning as if the gap persists.
