Cross icon
Test your LLM for oversharing!  Test for real-world oversharing risks with role-specific prompts that mimic  real workplace questions. FREE - Start Now
Skip to main content
Skip to main content

Key Findings on Prompt Injection

  • Prompt injection is a method used to trick AI assistants into bypassing rules or leaking data, often through hidden or malicious input.

  • Key attack types include direct overrides, hidden instructions in content, and tool or retrieval abuse, making enterprise systems especially vulnerable.

  • Unlike jailbreaking, prompt injection spans both direct and indirect threats, requiring broader, layered defenses.

  • Real-world cases demonstrate the effectiveness of prompt injections in emails, SharePoint files, and web metadata, with success rates reaching up to 86%.

  • Prevention strategies include input hardening, retrieval controls, tool restrictions, and active monitoring, supported by tools like Knostic for real-time safeguards and auditability.

What Is a Prompt Injection

Prompt injection targets LLMs by embedding adversarial instructions that hijack model behavior and bypass safety protocols. A research article, Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models, discusses how the model treats the malicious text as legitimate directions. The result can lead to data leakage, unsafe actions, or incorrect outputs. 

Prompt injection appears in chats, agents, RAG pipelines, and plugins. It also appears in files, tickets, pages, and metadata that models read. It matters because modern assistants have access to powerful tools and vast amounts of data. According to a computer science article, Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection, single injected instruction can cause silent exfiltration or policy bypass. Standards bodies now flag this as a main AI security risk. 

OWASP, which aims to improve software security through open source initiatives, lists “LLM01: Prompt Injection” as the first risk for LLM applications. In the U.S., NIST and CISA, as well as National Cyber Security Centres (NSCS) in other countries, advise developers of AI systems to design around this class of failure. 

Prompt Injection Types

LLM jailbreak and abuse techniques take many forms, and understanding these attack patterns is critical to building resilient defenses.

Direct Override

This is the classic “ignore previous instructions” jailbreak. The attacker issues a command that redefines rules. The model follows the latest directive and breaks policy. A U.S. study, Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures, shows high success in prompt extraction and policy overrides. Another academic U.S. study, Assessing Prompt Injection Risks in 200+ Custom GPTs, evaluates the GPTs both with and without code interpreters. The authors found a 97.2% success rate in extracting system prompts and 100% success in file leakage.

Indirect Prompt Injection

This prompt injection attack hides instructions in the content the model consumes. The content can be a webpage, a PDF, an email, or a ticket. The assistant reads it and executes the hidden steps. An EU research paper considers efficient shielding of LLMs against prompt injection and jailbreaking. It states that up-to-date benchmarks and surveys now track this vector across LLM agents, urging layered defenses because simple filters miss many cases.

Data-Exfiltration Prompts

These prompts aim to reveal secrets or personal data. The instructions can ask the model to quote credentials or chats. Another computer science study examines cryptography and security. Imprompter: Tricking LLM Agents into Improper Tool Use shows that obfuscated prompts can exfiltrate PII by coercing an agent to emit a Markdown image URL that transmits data to an attacker's domain. The attack achieved nearly 80% end-to-end success against Mistral’s LeChat and ChatGLM. Independent reporting published in WIRED confirms the technique and notes that Mistral implemented mitigations after disclosure.

Tool/Function Abuse

Here, the prompt forces an action, not just a reply. It can trigger API calls, file writes, or network requests. The model becomes a control plane for the attacker. CISA’s Joint Guidance on Deploying AI System Securely highlights this risk during deployment. Recommended mitigations include least privilege and strict tool scopes. Prompt-driven API abuse coerces the model to invoke a legitimate endpoint with attacker-chosen arguments, whereas endpoint misuse attacks the API directly without the model in the loop.

Retrieval Manipulation 

This targets RAG by poisoning or manipulating sources. Crafted documents can change what gets retrieved. The model then cites attacker content as truth. Research shows small poisoning rates can drive high extraction success. The authors of Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors, show that with only 3% poisoned data, their backdoor-based method is highly effective. It achieves an average verbatim extraction success rate of 79.7% on Llama 2‑7B, with a ROUGE‑L score of 64.21. They also report 68.6% success in paraphrased extraction across four datasets.

Prompt Injection Examples

Prompt injection is a common occurrence in day-to-day enterprise workflows. It shows up in tickets, shared drives, intranet pages, plugins, and RAG answers. The following examples illustrate some of the most common situations in which prompt injection can occur.

Support Ticket Poisoning 

A customer email can hide instructions in the body, footer, or HTML comments. An agent that summarizes or classifies the ticket may execute the hidden steps. Benchmarks show that tool-using agents are vulnerable to indirect prompt injection from external content like emails and webpages. Ticket systems should treat untrusted HTML as hostile and sanitize before LLM parsing, because the measured attack rates demonstrate material risk.

SharePoint Doc Trap 

A policy or HR file can include a hidden footer that instructs disclosure of salary tables. In a RAG workflow, the assistant retrieves the file and follows the injected footer. Backdoor research indicates that data exfiltration can be successful even when only a small portion of the training data is poisoned. Organizations should map retrieval paths and measure exposure against these published success rates. They should also verify that internal documents cannot inject tool actions or override policies.

Website Metadata Attack 

Agents browse internal and public sites to collect facts. Non-body elements outside the main content can carry hidden prompts. URL anchors can hold injected text that the server ignores, but the agent reads. A paper on Benchmarking Web Agent Security reports partial attack success up to 86% and end-to-end success up to 16% of the time in realistic browsing flows. These tests show that small cues in page context can derail agent goals. Another research paper, A Whole New World, shows how agents can be served different, cloaked pages that target only AI traffic. They create a “parallel-poisoned web only AI-Agents can see,” making metadata, accessibility trees, and URLs high-risk surfaces, not just the visible body. Sanitization and normalization must run before the model sees any markup. Provenance and field-level logging help reconstruct which elements influenced the answer.

Plugin Misuse 

Plugins and tools turn text into actions. Attackers try to coerce those actions. The Philosopher’s Stone, a peer-reviewed study, demonstrates that a Trojan adapter can reliably drive tool use. In a malware download case study, the attack achieved an executable success rate of up to 86%. With 5% poisoned data, the method raised target-keyword generation from about 50% to nearly 100% on one setup. At a 1.0 poisoning ratio, the keyword-matching rate reached 0.99 while still appearing benign on unrelated inputs. The authors showed covert email sending for spear-phishing via an agent’s mail tool. These findings prove that plugin scope and egress are critical control points. Tie every tool call to its originating prompt and user for audit.

How To Prevent Prompt Injection

Preventing prompt injection requires a layered defense approach that combines input controls, context filtering, AI monitoring, and strict policy enforcement.

Input Hardening 

Write strict system prompts that fix role, scope, and output shape. Require JSON-only responses for machine-to-machine steps to shrink parsing ambiguity. Disallow free-form tool calls and enumerate allowed verbs. Formal benchmark work shows defenses can reduce attack success, but efficacy is task-dependent, so layering matters. Black-box and white-box defenses have both shown measurable reductions against indirect prompt injection in controlled studies.

PromptArmor: Simple yet Effective Prompt Injection Defenses takes a guardrail approach (see below). It reports reducing attack success to below 1% on one agent benchmark after removal of injected prompts, illustrating a potential upside when hardening is placed before tool execution. Teams should combine hardening with evaluation on public benchmarks to validate the impact on their stack.

Retrieval Hygiene 

Control what the model can read. Use domain allow-lists and deny risky Multipurpose Internet Mail Extensions (MIME) types — also known as media types — before retrieval, and then chunk and tag content so every answer exposes provenance. Backdoor research indicates that even a modest poisoning rate can pose a significant risk of massive leaks in RAG systems. Enterprises must verify that content comes from trusted sources, reject unsigned or unverified documents, and normalize markup to eliminate hidden instructions. Retrieval policies should be clearly documented and tested using the same benchmarks that researchers use to quantify the risk associated with agents.

Policy Guardrails

Policy guardrails define what the model may write, while least-privilege enforces what it may do. Combining both enables the transformation of policy into enforceable controls at both answer time and action time. Here, pre-generation and post-generation checks reduce real risk in production. The U.S. Department of Homeland Security (DHS) has published Safety and Security Guidelines for Critical Infrastructure Owners and Operators. It recommends preventing the exposure of confidential information in AI tools and monitoring inputs and outputs for unusual or malicious behavior, which supports the use of PII redaction and oversharing blocks. DHS also calls for red-teaming and incident disclosure so that policy guardrails are exercised and measurable. 

A research paper, Attention Tracker: Detecting Prompt Injection Attacks in LLMs, shows that model-aware monitors can improve prompt-injection detection quality. An attention-based method reported AUROC gains of up to 10% over baselines across models and attack types.

Least-Privilege Actions

Tool and plugin permissions must be tightly scoped. DHS highlights autonomy failures caused by excessive permissions or poorly defined operational parameters, which is a direct argument for least privilege on AI-invoked actions. The same guidance urges operators to monitor model behavior and inputs, and to validate systems before and during use, which supports rate limits and egress controls on sensitive tools.

CISA’s Joint Cyber Defense Collaborative Playbook emphasizes coordinated practices for AI security across providers and adopters, including controlling operational reach and sharing patterns for exploitation.

Content Provenance

Provenance reduces confusion about what the model ingests and repeats.NIST’s Reducing Risks Posed by Synthetic Content explains how detection can rely on recorded provenance information such as metadata and digital watermarks. It recommends provenance tracking to establish authenticity and integrity. NIST’s Generative AI Profile adds concrete actions to document the origin and history of training and generated data as part of governance. In practice, provenance and signing help isolate which specific chunk or artifact injected instructions during an incident.

Human-in-the-Loop (HITL)

Human review remains a required safeguard in high-risk flows. The U.S. President’s Office of Management and Budget (OMB) M-24-10 sets minimum risk-management practices for federal AI uses, requiring that each use be formally approved and subject to ongoing oversight to protect safety and rights.. 

The EU Artificial Intelligence Act requires human oversight for high-risk systems, emphasizing measures such as AI prompt injection prevention and broader safeguards to minimize risks from intended use or foreseeable misuse. 

DHS guidance recommends real-world testing, red-teaming, and incident response, which depend on people reviewing traces and making decisions on escalations. Human escalation complements technical controls when prompts are needed to seek sensitive information or prevent unauthorized actions.

Monitoring And Detection

Effective defenses against prompt injection depend on rich telemetry, reliable detection signals, and actionable alerting, which together enable a fast and auditable response.

Telemetry: Log prompt, retrieval, tools, output; mask sensitive data

Log the raw user prompt, the expanded system prompt, and all retrieved artifacts. Log every tool call with arguments, destinations, and outcomes. Log the model output and any guardrail decisions that modified it. Mask secrets at collection to reduce exposure and legal risk. Multi-task evaluations show defenses behave differently across tasks, so telemetry must cover multiple workflows to be useful. Store hashes of retrieved content so you can detect differences and prove what changed over time. Retain traces long enough to compare against new attack families as they appear in USENIX literature.

Signals: Sudden instruction changes, unknown domains, off-policy tool calls

Watch for abrupt instruction shifts inside a conversation and flag retrieval from unknown or newly registered domains. Detect off-policy tool verbs or destinations and unexpected egress attempts. Seed pattern rules from agent benchmarks that enumerate exfiltration and harm intents, and map to fundamental tools. Combine rules with learned detectors to raise precision and recall on prompt-injection text. 

Alerting: Oversharing hits, jailbreak detections, anomaly spikes

Alert when outputs match secret or PII patterns, or when knowledge policies block an answer. Prioritize alerts that involve external network egress, bulk sends, or cross-tenant shares. Then, group related events by user, session, and data source to reduce noise and expedite the review process. 

In addition, government guidance recommends implementing layered mitigations, along with timely detection and response, for externally developed AI systems. Joint playbooks emphasize operational collaboration and information sharing to improve response quality. Tie alerts to playbooks that require human approval for high-risk actions and record the disposition. Track mean time to acknowledge (MTTA) and mean time to resolve (MTTR) as key program metrics and feed lessons back into prompts and retrieval policies.

How Knostic Helps Prevent Prompt Injection

Knostic enforces knowledge boundaries at answer time, redacting or blocking sensitive information that exceeds a user’s permissions. Unlike file-level controls, it focuses on what an assistant can infer across multiple sources, ensuring “need-to-know” is applied in practice. It prevents oversharing before answers reach the user, while feeding enforcement evidence back into governance systems to refine labels and policies over time.

Knostic also runs automated prompt simulations to uncover oversharing risks before deployment. By testing Copilot, Glean, and other enterprise assistants with both benign and adversarial queries, it identifies where policies may fail and produces reproducible audit trails of prompts, sources, and decisions. This systematic approach replaces ad hoc red-teaming and helps security teams prioritize remediation where exposure is highest.

Additionally, Knostic delivers explainability with complete inference lineage. Each assistant response can be traced from the prompt to the retrieved sources and policy rationale, providing searchable audit trails for incident response and regulatory reviews. This transparency turns opaque AI behavior into evidence that auditors and investigators can trust, enabling enterprises to prove who saw what, and why.

What’s Next?

Now that this article has shown why safe prompting matters, you can explore risky prompting patterns and edge cases in more detail at https://prompts.knostic.ai/

FAQ

  • What is a prompt injection?

It is text crafted to make an AI assistant ignore rules or reveal data it should not, and can be typed directly in chat or hidden in documents and pages an assistant reads. It aims to change instructions, exfiltrate data, or trigger tools, and is a leading risk for enterprise assistants because they connect to real data and systems.

  • Which is one of the ways to avoid prompt injections?

Limit what the assistant can read and act on, then enforce “need-to-know” at answer time. Combine repository allowlists with knowledge boundaries that block oversharing before response. Audit and simulate risky prompts to find gaps before attackers do. Maintain an evidence trail to ensure fixes improve labels and policies.

  • What is the difference between prompt injections and jailbreaking?

Jailbreaking tries to bypass a model’s safety rules directly. Prompt injection is broader and includes hidden instructions in sources or tools. It often exploits retrieval and integrations rather than model alignment alone. Enterprise defenses must therefore cover data access, inference, and policy enforcement, not just prompt wording.

bg-shape-download

See How to Secure and Enable AI in Your Enterprise

Knostic provides AI-native security and governance across copilots, agents, and enterprise data. Discover risks, enforce guardrails, and enable innovation without compromise.

195 1-min
background for career

What’s next?

Want to solve oversharing in your enterprise AI search? Let's talk.

Knostic offers the most comprehensively holistic and impartial solution for enterprise AI search.

protect icon

Knostic leads the unbiased need-to-know based access controls space, enabling enterprises to safely adopt AI.