"Red teaming" is one of the security industry’s most useful borrowed military terms. In traditional cybersecurity, it refers to adversarial testing of systems, processes, and people by a dedicated team simulating real-world attackers. The methodology is mature; the playbooks are well-developed.
AI red teaming uses the same word but operates in a different conceptual space. The "system under test" is a model, not a network. The attack surface is statistical and behavioural rather than primarily technical. The defences are training-time and runtime, not just configuration. Treating AI red teaming as an extension of traditional penetration testing misses much of what matters; treating it as something entirely new misses the considerable transferable methodology.
A practical guide to what AI red teaming is, what frameworks exist, and how to do it effectively.
What AI red teaming actually tests
The objectives of an AI red team exercise typically span several categories:
Safety failures. The model produces content it is supposed to refuse, instructions for harmful activities, content that violates the deployment’s policies, illegal content, content that could cause real-world harm.
Security failures. The model is exploited to take unintended actions in an agentic context, exfiltrate data, escalate privileges, perform unauthorised operations on connected systems. Prompt injection (covered separately) is the dominant case here.
Robustness failures. The model produces incorrect or misleading outputs that the user accepts as authoritative. Hallucinations, confident wrong answers, systematic bias.
Privacy failures. The model leaks training data, conversation history from other users, or confidential context.
Bias and fairness failures. The model produces systematically different outputs for different demographic groups, or the deployment’s outcomes have disparate impact.
Capability failures. The model fails on the actual tasks it is deployed for in ways the developers did not anticipate.
The categories overlap and interact. A red team report typically organises findings by severity and likelihood, similar to traditional penetration test reports.
The methodologies
Several published frameworks exist:
Microsoft AI Red Team’s published methodology. Through several blog posts and the Counterfit and PyRIT open-source frameworks. PyRIT (Python Risk Identification Tool) at github.com/Azure/PyRIT is one of the most polished public toolkits.
Anthropic’s red-teaming research. The "Frontier Threats Red Teaming" papers describe systematic approaches to evaluating model capabilities for misuse.
OpenAI’s red-teaming program. Periodic external red-teaming engagements with disclosed methodologies and lessons.
NIST AI 100-2 (Adversarial Machine Learning Taxonomy). Provides a framework for thinking about ML attacks systematically.
MITRE ATLAS. Adversarial Threat Landscape for AI Systems at atlas.mitre.org. The closest thing to MITRE ATT&CK for AI; catalogues attack techniques against ML systems.
OWASP Top 10 for LLMs. Application-level threat list covering most common deployment-time vulnerabilities.
Practical red-team exercises typically draw on multiple frameworks; the methodologies are complementary.
The structure of an exercise
A typical AI red-team engagement runs in phases:
Scoping. What system is being tested. What policies define acceptable behaviour. What threat models matter (mass-market user, sophisticated adversary, insider threat). What integration points exist (API only, agentic with tool access, agentic with internet access). What test data is permitted.
Capability mapping. Document what the model and surrounding system can do. The attack surface depends on what the model has access to. An LLM with file-system access has a different threat profile than one without.
Manual probing. Skilled red-teamers explore the system through structured prompting and conversation. The goal is to find behaviours that violate intended policy. This is the equivalent of manual penetration testing: hard to automate, high-value.
Automated stress testing. Automated frameworks (PyRIT, Garak, Promptfoo, NVIDIA’s Adversarial Robustness Toolbox) generate variations of attacks at scale. Effective at finding weaknesses in known categories; less effective at novel categories.
Targeted attack development. For high-value targets, the red team develops bespoke attacks: specific jailbreaks, customised prompt injections, retrieval-augmented attack chains.
Post-exploitation. For agentic systems, what an attacker who has compromised the model can actually do. The damage potential is what matters, not just the model’s behaviour.
Reporting. Structured findings, severity ratings, reproducible test cases, recommendations.
The methodology is iterative. New findings lead to new test categories. The exercise has a defined endpoint but not a single answer; AI systems’ attack surfaces continue to change with new capabilities and new use patterns.
Specific attack categories worth testing
A non-exhaustive checklist for any AI red team exercise:
Direct jailbreaks. Asking the model to do things it has been trained to refuse, through various prompt engineering techniques. The DAN family, role-playing scenarios, "ignore previous instructions" patterns, ethical reframing.
Indirect prompt injection. Content delivered through retrieved documents, web pages, file contents, or any external source that the model will consume.
Context-window saturation. Filling the context with content that overwhelms safety instructions or causes the model to behave inconsistently.
Multilingual and code-switching attacks. Some safety training is more robust in English than in other languages; cross-language attacks exploit the gap.
Encoding-based attacks. Base64, leetspeak, ASCII art, character substitution, that bypass input-level filtering.
Persona manipulation. "You are now [character]" prompts that shift the model’s behavioural defaults.
Time-pressure and authority manipulation. Lures that invoke urgency or claim authority to override safety instructions.
Tool-use attacks. For agentic systems, attacks aimed at causing the model to invoke tools in unintended ways.
Refusal-bypass attacks. Asking for the same harmful information through legitimate-looking framings.
Training-data extraction. Carlini-style attacks designed to recover sensitive information from the training corpus.
Membership inference. Determining whether specific data was in the training set.
The 2026 industry landscape
AI red teaming has moved from research practice to commercial service. Organisations offering structured AI red-team engagements include Robust Intelligence (acquired by Cisco), HiddenLayer, Lakera, Trail of Bits’ AI practice, and the major incident-response firms (Mandiant, Crowdstrike, Trail of Bits) that have added AI red-team capability. Pricing and methodology varies; the field is still maturing.
Internal red-team programs at the major model labs are well-developed. Anthropic, OpenAI, Google DeepMind, and Microsoft all have dedicated AI red-team functions whose findings inform model training and deployment. Their public reports give some insight into what they look for.
Regulatory pressure is growing. EU AI Act high-risk systems are required to undergo "appropriate" testing; the regulatory definition of "appropriate" is being developed. The US Executive Order on AI required NIST to publish guidance on red-teaming foundation models, which has been influential. China’s interim measures on generative AI have similar provisions.
What an effective exercise produces
A useful AI red-team report includes:
Reproducible test cases. Each finding can be replicated; the prompts and conditions are documented.
Severity ratings tied to the deployment context. A jailbreak that produces violent content is more severe in a children’s-application context than in a general-purpose context.
Mitigation recommendations. Both at the model layer (training adjustments, system prompt updates) and at the application layer (filtering, capability restrictions, monitoring).
Coverage analysis. What categories of attack were tested and not tested. The boundaries of the engagement matter for downstream interpretation.
Re-testing protocol. How and when to verify that mitigations have been effective.
A useful AI red-team program, distinct from a single engagement, also produces:
Standing test suites that are run continuously against model and system updates.
Internal capability to handle the "easy" findings without external help.
Integration with the development lifecycle so that findings flow back into training and deployment improvements.
Threat-intelligence sharing with the broader community where appropriate (OWASP, MITRE ATLAS contributions, vendor disclosures).
What to avoid
A few common failure modes:
Treating red-teaming as compliance theatre. Running an exercise to check a box and not acting on the findings.
Limiting scope so narrowly that real attack surfaces are not tested. Closed-environment red teaming of a system that will face open-world inputs misses what matters.
Optimising for not finding things. The temptation to scope away the categories likely to produce embarrassing findings.
Conflating red teaming with continuous evaluation. Both are valuable; they are different activities.
Outsourcing entirely without internal capability to act on findings.
The state of AI red teaming in 2026 is roughly where general security red teaming was in the late 2000s: genuinely useful, professionally available, increasingly structured, but still craft as much as discipline. Organisations deploying consequential AI systems need to do this work or have it done. The cost of skipping it has begun to show in public incidents; that pattern is set to continue.
