AI pentest scoping playbook

Table of Contents
- Why AI Pentesting Is Different
- The Anatomy of an AI System
- OWASP LLM Top 10 is your Baseline
- What OWASP LLM Top 10 Doesn't Cover
- Scoping Questions
- Writing the Scope Document
- Common Scoping Mistakes
- Building Continuous AI Security Testing
- Conclusion
Organizations are throwing absurd amount of money at "AI red teams" who run a few prompt injection tests, declare victory, and cash checks. Security consultants are repackaging traditional pentest methodologies with "AI" slapped on top, hoping nobody notices they're missing 80% of the actual attack surface. And worst of all, the people building AI systems, the ones who should know better, are scoping engagements like they're testing a CRUD app from 2015.
This guide exists because the current state of AI security testing is dangerously inadequate. The attack surface is massive. The risks are novel. The methodologies are immature. And the consequences of getting it wrong are catastrophic.
These are my personal views, informed by professional experience but not representative of my employer. What follows is what I wish every CISO, security lead, and AI team lead understood before they scoped their next AI security engagement.
Why AI Pentesting Is Different
Traditional web application pentests follow predictable patterns. You scope endpoints, define authentication boundaries, exclude production databases, and unleash testers to find SQL injection and XSS. The attack surface is finite, the vulnerabilities are catalogued, and the methodologies are mature.
AI systems break all of that.
The system output is non-deterministic. You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder.
Also, the attack surface is layered and interconnected. You're not just testing an application. You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration).
Novel attack classes exist that don't map to traditional vuln categories. Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this.
You might not even control the model. If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals.
AI systems are probabilistic, data-dependent, and constantly evolving. A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases.
This isn't incrementally harder than web pentesting. It's just fundamentally different. If your scope document looks like a web app pentest with "LLM" find-and-replaced in, you're going to miss everything that matters.
The Anatomy of an AI System
Layer 1: The Model
This is the thing everyone focuses on because it's the most visible. But "the model" isn't monolithic.
Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, jailbreaks, different safety mechanisms, different failure modes.
Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope.
Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work.
Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says yes and Model B says no? How do you handle consensus? Can an adversary exploit disagreements?
How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security profile.
Layer 2: Data Pipelines
AI systems don't just run models. They feed data into models. And that data pipeline is massive attack surface.
Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data?
Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings?
If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope.
Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point.
How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses.
Layer 3: Application Integration
Models don't exist in isolation. They're integrated into applications. And those integration points are attack surface.
How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors.
Who can access the model? How are permissions enforced? Can an adversary escalate privileges?
Can an adversary send 10,000 requests per second? Can they DOS your model? Can they extract the entire training dataset via repeated queries?
Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII.
Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run os.system("rm -rf /"). LLM4SHELL says Hi!
Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually.
Layer 4: Autonomous Agents
If you've built agentic systems, AI that can plan, reason, use tools, and take actions autonomously, you've added an entire new dimension of attack surface.
What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk.
How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals?
Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory?
Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms?
Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents?
Layer 5: Infrastructure
AI systems run on infrastructure. That infrastructure has traditional security vulnerabilities that still matter.
Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos?
Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly?
How do you deploy model updates? Can an adversary inject malicious code into your pipeline?
Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions?
Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers?
How much of that did you include in your last AI security scope document? If the answer is less than 60%, your scope is inadequate. And you're going to get breached by someone who understands the full attack surface.
OWASP LLM Top 10 is your Baseline
The OWASP Top 10 for LLM Applications is the closest thing we have to a standardized framework for AI security testing. If you're scoping an AI engagement and you haven't mapped every item in this list to your test plan, you're doing it wrong.
Here's the 2025 version:
| ID | Threat | Core risk | Key attack vectors |
|---|---|---|---|
| LLM01 | Prompt Injection | Malicious inputs override instructions or cause data exfiltration | Direct/indirect prompts, cross-turn, jailbreaks |
| LLM02 | Sensitive Disclosure | Model leaks prompts, training data, API keys, or PII | Ask to repeat system instructions; crafted extraction queries; conversation history probing |
| LLM03 | Supply-Chain | Compromised model or dependencies introduce malicious behavior | Poisoned datasets; pre-trained model trojans; backdoored libraries; malicious plugins |
| LLM04 | Data/Model Poisoning | Injected training or fine-tune data creates backdoors or biased outputs | Backdoor triggers; bias amplification; targeted misclassification |
| LLM05 | Improper Output Handling | Unsanitized model outputs cause code/SQL/HTML/shell injection downstream | Generated code/SQL/HTML/commands executed without validation |
| LLM06 | Excessive Agency | Over-privileged agents perform unauthorized or harmful actions | Arbitrary code execution; direct DB/API access; privilege escalation |
| LLM07 | System Prompt Leakage | Extraction of system prompt reveals sensitive instructions or config | Direct questioning; role-play; prompt injection overriding confidentiality |
| LLM08 | Vector/Embedding Weakness | Manipulated embeddings poison retrieval or extract data | Poisoned docs with adversarial embeddings; similarity manipulation; embedding-space queries |
| LLM09 | Misinformation/Hallucination | Model generates false, misleading, or fabricated content | Prompted misinformation; exploiting hallucinations; generating fake content |
| LLM10 | Unbounded Consumption (DoS) | Resource exhaustion via long, looped, or expensive queries | Extremely long prompts; tight loops; repeated costly RAG/tool calls |
What OWASP LLM Top 10 Doesn't Cover
The OWASP LLM Top 10 is valuable, but it's not comprehensive. Here's what's missing in it:
AI Safety Risks
Safety is not the same as security. Unsafe AI systems cause real harm, and that should be scope for red teaming.
Can the model be made to behave in ways that violate its stated values? Can adversaries bypass constitutional AI techniques like Anthropic's Claude? Does the model exhibit or amplify demographic biases across different groups? Can the model be tricked into generating illegal, dangerous, or abusive content? Can the model lie, manipulate, or deceive users? These aren't just ethics issues. They're legal risks under GDPR, EEOC, and other regulations.
Adversarial Machine Learning
Traditional adversarial ML attacks apply to AI systems.
- Can adversaries craft inputs that cause misclassification?
- Can adversaries reconstruct training data from model outputs?
- Can adversaries steal model weights through repeated queries?
- Can adversaries determine if specific data was in the training set? - Does the model have hidden backdoors that trigger on specific inputs?
Multimodal Risks
If your AI system handles multiple modalities (text, images, audio, video), you have additional attack surface.
Attackers can embed malicious instructions in images that vision-language models follow. Small pixel changes invisible to humans can cause model failures. Audio inputs crafted to cause misclassification exist. Adversarial text rendered as images can bypass filters. Combining text and images across multiple turns can bypass safety mechanisms.
Privacy and Compliance
AI systems must comply with GDPR, HIPAA, CCPA, and other regulations.
Does the model process, store, or leak personally identifiable information? How long is data retained? Can users request deletion? Does the model send data across jurisdictions?
Scoping Questions
Before you write your scope document, answer every single one of these questions. If you can't answer them, you don't understand your system well enough to scope a meaningful AI security engagement.
About the Model
- What base model are you using (GPT-4, Claude, Llama, Mistral, custom)?
- Is the model proprietary (OpenAI API) or open-source?
- Have you fine-tuned the base model? On what data?
- Have you applied instruction tuning, RLHF, or other alignment techniques?
- How is the model deployed (API, on-prem, container, serverless)?
- Do you have access to model weights?
- Can testers query the model directly, or only through your application?
- Are there rate limits? What are they?
- What's the model's context window size?
- Does the model support function calling or tool use?
- Is the model multimodal (vision, audio, text)?
- Are you using multiple models in ensemble or orchestration?
About Data
- Where did training data come from (public, proprietary, scraped, licensed)?
- Was training data curated or filtered? How?
- Is training data in scope for poisoning tests?
- Are you using RAG (Retrieval-Augmented Generation)?
- If RAG: What's the document store (vector DB, traditional DB, file system)?
- If RAG: How are documents ingested? Who controls ingestion?
- If RAG: Can testers inject malicious documents?
- If RAG: How is retrieval indexed and searched?
- Do you pull real-time data from external sources (APIs, databases)?
- How is input data preprocessed and sanitized?
- Is user conversation history stored? Where? For how long?
- Can users access other users' data?
About Application Integration
- How do users interact with the model (web app, API, chat interface, mobile app)?
- What authentication mechanisms are used (OAuth, API keys, session tokens)?
- What authorization model is used (RBAC, ABAC, none)?
- Are there different user roles with different permissions?
- Is there rate limiting? At what levels (user, IP, API key)?
- Are inputs and outputs logged? Where?
- Who has access to logs?
- Are logs encrypted at rest and in transit?
- How are errors handled? Are error messages exposed to users?
- Are there webhooks or callbacks that the model can trigger?
About Plugins and Tool Use
- Can the model call external APIs? Which ones?
- Can the model execute code? In what environment?
- Can the model browse the web?
- Can the model read/write files?
- Can the model access databases?
- What permissions do plugins have?
- How are plugin outputs validated before use?
- Can users add custom plugins?
- Are plugin interactions logged?
About Autonomous Agents
- Do you have autonomous agents that plan and execute multi-step tasks?
- What tools can agents use?
- Can agents spawn other agents?
- Do agents have persistent memory? Where is it stored?
- How are agent goals and constraints defined?
- Can agents access sensitive resources (DBs, APIs, filesystems)?
- Can agents escalate privileges?
- Are there kill-switches or circuit breakers for agents?
- How is agent behavior monitored?
About Infrastructure
- What cloud provider(s) are you using (AWS, Azure, GCP, on-prem)?
- Are you using containers (Docker)? Orchestration (Kubernetes)?
- Where are model weights stored? Who has access?
- Where are API keys and secrets stored?
- Are secrets in environment variables, config files, or secret managers?
- How are dependencies managed (pip, npm, Docker images)?
- Have you scanned dependencies for known vulnerabilities?
- How are model updates deployed? What's the CI/CD pipeline?
- Who can deploy model updates?
- Are there staging environments separate from production?
About Safety and Alignment
- What safety mechanisms are in place (content filters, refusal training, constitutional AI)?
- Have you red-teamed for jailbreaks?
- Have you tested for bias across demographic groups?
- Have you tested for harmful content generation?
- Do you have human-in-the-loop review for sensitive outputs?
- What's your incident response plan if the model behaves unsafely?
About Testing Boundaries
- Can testers attempt to jailbreak the model?
- Can testers attempt prompt injection?
- Can testers attempt data extraction (training data, PII)?
- Can testers attempt model extraction or inversion?
- Can testers attempt DoS or resource exhaustion?
- Can testers poison training data (if applicable)?
- Can testers test multi-turn conversations?
- Can testers test RAG document injection?
- Can testers test plugin abuse?
- Can testers test agent privilege escalation?
- Are there any topics, content types, or test methods that are forbidden?
- What's the escalation process if critical issues are found during testing?
About Compliance and Legal
- What regulations apply (GDPR, HIPAA, CCPA, FTC, EU AI Act)?
- Do you process PII? What types?
- Do you have data processing agreements with model providers?
- Do you have the legal right to test this system?
- Are there export control restrictions on the model or data?
- What are the disclosure requirements for findings?
- What's the confidentiality agreement for testers?
If you can answer all these questions, you're ready to scope. If you can't, you're not.
Writing the Scope Document
Your AI pentest engagement scope document needs to be more detailed than a traditional pentest scope.
Section 1: Executive Summary
One-paragraph description of the AI system. Business objectives (compliance, pre-launch validation, continuous assurance, incident response). Top 3-5 risks that drive the engagement. What does "passing" look like?
Section 2: System Architecture
Include an architectural diagram showing everything: model, data pipelines, APIs, infrastructure, third-party services. List every testable component with owner, version, and deployment environment. Document how data moves through the system, from user input to model output to downstream consumers. Identify where data crosses trust boundaries.
Section 3: In-Scope Components
Be exhaustive. List models, APIs, data stores, integrations, infrastructure, applications. For each component, specify access credentials testers will use, environments that are in scope, testing windows if limited, and rate limits or usage restrictions.
Section 4: Attack Vectors and Test Cases
Map every OWASP LLM Top 10 item to specific test cases. For LLM01 prompt injection: test direct instruction override, indirect injection via RAG documents, multi-turn conditioning, system prompt extraction, jailbreak techniques, cross-turn memory poisoning.
Include specific threat scenarios: Can an attacker leak other users' conversation history? Can an attacker extract training data containing PII? Can an attacker bypass content filters to generate harmful instructions?
Section 5: Out-of-Scope Components
Explicitly list what's NOT being tested: production environments if testing only staging, physical security, social engineering of employees, third-party SaaS providers you don't control, specific attack types if any are prohibited.
Section 6: Testing Methodology
List specific tools: Promptfoo for LLM fuzzing, Garak for red teaming, PyRIT for adversarial prompting, ART for ML attacks, custom scripts for specific attack vectors, traditional tools like Burp Suite for infrastructure.
Testing techniques: prompt injection testing, jailbreak attempts, data extraction attacks, model inversion, membership inference, evasion attacks, RAG poisoning, plugin abuse, agent privilege escalation, infrastructure scanning.
Section 7: Rules of Engagement
All testing must be explicitly authorized in writing with names, signatures, dates. No attempts at physical harm, financial fraud, or illegal content generation unless explicitly scoped for red teaming. Critical findings must be disclosed immediately via designated channel. Standard findings can wait for formal report.
Testers must not exfiltrate user data, training data, or model weights except as explicitly authorized for demonstration purposes. All test data must be destroyed post-engagement. Testing must comply with all applicable laws and regulations.
Section 8: Deliverables
Technical report with detailed findings, severity ratings, reproduction steps, evidence, and remediation guidance. Executive summary for business leadership. Updated threat model. Retest availability confirmation.
Section 9: Timeline and Contacts
Start date, end date, report delivery date, retest window. Key contacts: engagement lead, technical point of contact, escalation contact for critical findings, legal contact for scope questions.
That's your scope document. It should be 10-20 pages. If it's shorter, you're missing things.
Common Scoping Mistakes
You test the web app that wraps the LLM, but you don't test the LLM itself. You find XSS and broken authz, but you miss prompt injection, jailbreaks, and data extraction. Scope the full stack: app, model, data pipelines, infrastructure.
If you fine-tuned the model, you have access to training data and weights. Test for data poisoning, backdoors, and alignment failures. Don't just test the API. If you control any part of the model lifecycle, include that in scope.
You test the LLM, but you don't test the document store. Adversaries inject malicious documents, manipulate retrieval, and poison embeddings. If you're using RAG, the vector database and document ingestion pipeline are in scope.
You test single-shot prompts, but adversaries condition the model over 10 turns to bypass refusal mechanisms. Test multi-turn dialogues explicitly. Test conversation history isolation. Test memory poisoning.
You're using OpenAI's API, so you assume it's secure. But you're passing user PII in prompts, you're not validating outputs before execution. Even with third-party models, test your integration. Test input/output handling. Test failure modes.
You test for technical vulnerabilities but ignore alignment failures, bias amplification, and harmful content generation. AI safety is part of AI security. Include alignment testing, bias audits, and harm reduction validation.
You test the LLM, but your agent can execute code, call APIs, and access databases. An adversary hijacks the agent, and it deletes production data. Autonomous agents are their own attack surface. Test tool permissions, privilege escalation, and agent behavior boundaries.
You do one pentest before launch, then never test again. But you're fine-tuning weekly, adding new plugins monthly, and updating RAG documents daily. Scope for continuous red teaming, not one-time assessment.
Organizations hire expensive consultants to run a few prompt injection tests, declare the system secure, and ship to production. Then they get breached six months later when someone figures out a multi-turn jailbreak or poisons the RAG document store.
The problem isn't that the testers are bad. The problem is that the scopes are inadequate. You can't find what you're not looking for. If your scope doesn't include RAG poisoning, testers won't test for it. If your scope doesn't include membership inference, testers won't test for it. If your scope doesn't include agent privilege escalation, testers won't test for it.
Attackers will.
The asymmetry is brutal, you have to defend every attack vector. Attackers only need to find one that works.
So when you scope your next AI security engagement, ask yourself, if I were attacking this system, what would I target? Then make sure every single one of those things is in your scope document.
Because if it's not in scope, it's not getting tested. And if it's not getting tested, it's going to get exploited.
Building Continuous AI Security Testing
Traditional pentests are point-in-time assessments. You test, you report, you fix, you're done. That doesn't work for AI systems.
AI systems evolve constantly: models get fine-tuned, RAG document stores get updated, new plugins get added, agents gain new capabilities, infrastructure changes.
Every change introduces new attack surface. If you're only testing once a year, you're accumulating risk for 364 days. You need continuous red teaming.
1. Automate What Can Be Automated
Use tools like Promptfoo, Garak, and PyRIT to run automated adversarial testing on every model update. Integrate tests into CI/CD pipelines so every deployment is validated before production.
Set up continuous monitoring for prompt injection attempts, jailbreak successes, data extraction queries, unusual tool usage patterns, and agent behavior anomalies.
2. Run Periodic Deep Assessments
Quarterly or bi-annually, bring in expert red teams for comprehensive testing beyond what automation can catch. Focus deep assessments on novel attack vectors that tools don't cover, complex multi-step exploitation chains, social engineering combined with technical attacks, and agent hijacking.
3. Build Internal Red Team Capability
Train your own security team on AI-specific attack techniques. Develop internal playbooks for prompt injection testing, jailbreak methodology, RAG poisoning, and agent security testing.
4. Update Threat Models Continuously
Every quarter, revisit your threat model. What new attacks have been published? What new capabilities have you added? What new integrations are in place? What new risks does the threat landscape present? Update your testing roadmap based on evolving threats.
Conclusion
Scoping AI security engagements is harder than traditional pentests because the attack surface is larger, the risks are novel, and the methodologies are still maturing.
But it's not impossible.
You need to understand the full stack: model, data pipelines, application, infrastructure, agents, everything. Map every attack vector. OWASP LLM Top 10 is your baseline, not your ceiling.
Answer the scoping questions. If you can't answer them, you don't understand your system. Write detailed scope documents. 10-20 pages, not 2 pages.
Use the right tools: Promptfoo, Garak, ART, LIME, SHAP, not just Burp Suite. Test continuously, not once. Avoid common mistakes: don't ignore RAG, don't underestimate agents, don't skip AI safety.
If you do this right, you'll find vulnerabilities before attackers do. If you do it wrong, you'll end up in the news explaining why your AI leaked training data, generated harmful content, or got hijacked by adversaries.
Your call.