
Why We Built openclaw-shield: Securing AI Agents from Themselves

We recently open-sourced openclaw-shield, a security plugin for OpenClaw agents. This post covers what we built, the critical issues we ran into along the way, and the lessons we learned about securing AI agents in practice.


The Problem: AI Agents Without Guardrails

AI agents operating on behalf of users can access files, run shell commands, and produce text responses. Without guardrails, an agent can:

  • Read .env files and output raw API keys, database passwords, or tokens
  • Read customer data and display raw Social Security numbers, credit card numbers, or emails
  • Execute destructive commands like rm -rf that permanently delete files
  • Exfiltrate credentials by embedding them in shell commands (e.g., curl with bearer tokens)

How It Works: 5-Layer Defense-in-Depth

Rather than relying on a single mechanism, we built five independent security layers. If one layer is bypassed or unsupported by the host, the others still provide protection.
 
Layer 1 (Prompt Guard) injects a security policy block into the agent's context on every turn, including rules about secrets, PII, destructive commands, and sensitive files, plus a mandatory workflow requiring the agent to call our security gate before any exec.
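
For concreteness, here is a minimal sketch of what L1's registration could look like. The hook name (before_agent_start) and the prependContext field come up later in this post; the registration surface shown here is an assumption for illustration, not the exact OpenClaw plugin API.

```typescript
// Sketch: Layer 1 (Prompt Guard). The hook name and prependContext field are described
// in this post; the registerHook surface is an assumption for illustration.
const SECURITY_POLICY = [
  "SECURITY POLICY (applies to every turn):",
  "- Never output raw secrets (API keys, tokens, passwords) or PII (SSNs, card numbers, emails).",
  "- Never execute destructive commands (rm -rf, dd, mkfs, ...).",
  "- Before any exec or file read, call the knostic_guard tool and obey its STATUS.",
].join("\n");

export function registerPromptGuard(api: {
  registerHook: (name: string, handler: () => { prependContext: string }) => void;
}) {
  api.registerHook("before_agent_start", () => ({ prependContext: SECURITY_POLICY }));
}
```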
 
Layer 2 (Output Scanner) scans every tool result before it's persisted to the conversation. It redacts secrets (AWS keys, Stripe keys, GitHub tokens, OpenAI keys, private keys, etc.) and PII (emails, SSNs, credit card numbers, phone numbers), replacing matches with tagged placeholders like [REDACTED:aws_access_key] or [PII_REDACTED:us_ssn].
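
The redaction itself is plain pattern matching over the tool result. Here is a simplified sketch of the kind of scanner L2 runs; the patterns are a small illustrative subset, not the plugin's full rule set.

```typescript
// Sketch: Layer 2 (Output Scanner) — a trimmed-down redactor. Patterns shown are a
// small illustrative subset, not the plugin's complete list.
const SECRET_PATTERNS: Array<[string, RegExp]> = [
  ["aws_access_key", /\bAKIA[0-9A-Z]{16}\b/g],
  ["github_token", /\bghp_[A-Za-z0-9]{36}\b/g],
  ["openai_key", /\bsk-[A-Za-z0-9]{20,}\b/g],
];
const PII_PATTERNS: Array<[string, RegExp]> = [
  ["us_ssn", /\b\d{3}-\d{2}-\d{4}\b/g],
  ["email", /\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b/g],
];

export function redact(text: string): string {
  let out = text;
  for (const [tag, re] of SECRET_PATTERNS) out = out.replace(re, `[REDACTED:${tag}]`);
  for (const [tag, re] of PII_PATTERNS) out = out.replace(re, `[PII_REDACTED:${tag}]`);
  return out;
}

// Example:
//   redact("key=AKIAIOSFODNN7EXAMPLE ssn=123-45-6789")
//   -> "key=[REDACTED:aws_access_key] ssn=[PII_REDACTED:us_ssn]"
```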
 
Layer 3 (Tool Blocker) is registered but non-functional on the current published OpenClaw version (v2026.1.30). When the host supports it, it will hard-block destructive commands and secrets in tool parameters. More on this below.
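
For a sense of what L3 will do once the hook fires, here is a rough sketch of a before_tool_call handler. The payload shape, the exec tool name, and the blocking return value are assumptions for illustration; only the hook name comes from the host.

```typescript
// Sketch: Layer 3 (Tool Blocker) — what a before_tool_call handler could do once the
// host actually invokes the hook. Payload and return shapes are assumptions.
const DESTRUCTIVE = [/\brm\s+-rf\b/, /\brm\s+-fr\b/, /\bmkfs\b/, /\bdd\s+if=/];
const SECRET_IN_PARAMS = /\bAKIA[0-9A-Z]{16}\b|\bghp_[A-Za-z0-9]{36}\b|Bearer\s+[A-Za-z0-9._-]{20,}/;

export function beforeToolCall(call: { name: string; params: Record<string, unknown> }) {
  if (call.name === "exec") {
    const command = String(call.params["command"] ?? "");
    if (DESTRUCTIVE.some((re) => re.test(command))) {
      return { block: true, reason: "openclaw-shield: destructive command" };
    }
  }
  if (SECRET_IN_PARAMS.test(JSON.stringify(call.params))) {
    return { block: true, reason: "openclaw-shield: secret detected in tool parameters" };
  }
  return { block: false };
}
```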
 
Layer 4 (Input Audit) logs every inbound message with a preview and flags if the user pastes secrets into a message. Observe-only.
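
A minimal sketch of the kind of observe-only handler L4 uses (the handler shape is assumed; the plugin's actual implementation may differ):

```typescript
// Sketch: Layer 4 (Input Audit) — observe-only: log a preview and flag pasted secrets,
// never block. Handler shape is an assumption.
export function auditInbound(message: { text: string }): void {
  const preview = message.text.slice(0, 120).replace(/\s+/g, " ");
  const pastedSecret =
    /\bAKIA[0-9A-Z]{16}\b|\bghp_[A-Za-z0-9]{36}\b|-----BEGIN [A-Z ]*PRIVATE KEY-----/.test(message.text);
  console.log(
    `[openclaw-shield] inbound (${message.text.length} chars): "${preview}"` +
      (pastedSecret ? " [FLAG: secret pasted]" : ""),
  );
}
```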
 
Layer 5 (Security Gate Tool) registers a knostic_guard tool that the agent must call before executing any shell command or reading files. The tool inspects the proposed action and returns STATUS: ALLOWED or STATUS: DENIED. Combined with L1's prompt injection telling the agent to always call this tool first, this creates a functional gate.
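
Here is a condensed sketch of the gate tool. The tool name and STATUS strings match the behavior described above; the registerTool signature and the deny/sensitive patterns are illustrative assumptions.

```typescript
// Sketch: Layer 5 (Security Gate) — the knostic_guard tool the agent is instructed to
// call before any exec or file read. registerTool's exact signature is an assumption.
const DENY_COMMANDS = [/\brm\s+-rf\b/, /\bdd\s+if=/, /\bmkfs\b/];
const SENSITIVE_READS = [/\.env(\..*)?$/, /id_rsa$/, /\.pem$/, /credentials/i];

export function registerSecurityGate(api: {
  registerTool: (def: {
    name: string;
    description: string;
    handler: (input: { action: "exec" | "read"; target: string }) => string;
  }) => void;
}) {
  api.registerTool({
    name: "knostic_guard",
    description: "Call before executing any shell command or reading any file.",
    handler: ({ action, target }) => {
      if (action === "exec" && DENY_COMMANDS.some((re) => re.test(target))) {
        return "STATUS: DENIED\nREASON: destructive command";
      }
      if (action === "read" && SENSITIVE_READS.some((re) => re.test(target))) {
        return "STATUS: ALLOWED\nNOTE: do not display raw secrets or PII from this file";
      }
      return "STATUS: ALLOWED";
    },
  });
}
```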
 

Validated Results

Destructive Commands: Blocked. When asked to delete a file, the agent calls our security gate first. The gate detects the destructive command and returns DENIED. The agent reports the denial to the user. The file is never deleted.
 
PII Protection: Working. When asked to read a file containing a Social Security number and email address, the agent calls our security gate. The gate allows the read but instructs the agent not to show raw PII. The agent responds with: "The file contains a customer record with three fields: Customer name, SSN, and Email. I can't share the raw values since they're PII." The SSN and email are never shown.
 
Secret Redaction: Working. When the agent reads a file containing an AWS secret key, the output scanner redacts it from the conversation history. The agent describes the file as containing "an AWS secret access key" without ever showing the raw key value.
 

Critical Issues We Found

before_tool_call Hook Not Wired (Severity: High)

 
The before_tool_call hook is defined in the hook system, can be registered by plugins, and has a runner function, but on v2026.1.30, nothing in the tool execution chain actually calls it. The wrapping code (wrapToolWithBeforeToolCallHook) was added to the codebase after v2026.1.30 was published and exists only in the dev repo (v2026.2.1). We confirmed this by searching the installed dist and finding zero results for the function. This is why we built L5 as a workaround.
 
Only 3 out of 14 defined hooks have invocation sites in the published version: before_agent_start, message_received, and tool_result_persist. The rest are defined but never called.
 

prependContext is Weak (Severity: Medium)

 
The before_agent_start hook's prependContext field prepends text to the user's message, not to the system prompt. The model treats it as user-provided context rather than a system-level constraint. In testing, L1 fired and injected the security policy, but the agent still executed rm when directly asked. L5's tool-based gate provides a stronger mechanism because the agent makes a real tool call, receives a structured DENIED response, and acts on it.
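
To make the difference concrete, here is roughly what the model ends up seeing in each case (message shapes are illustrative):

```typescript
// The policy text here is a stand-in for L1's full policy block.
const SECURITY_POLICY = "SECURITY POLICY: never run destructive commands; call knostic_guard first.";

// What prependContext effectively produces: the policy rides along with the user turn,
// so the model weighs it like any other user-provided context.
const withPrependContext = [
  { role: "system", content: "You are a coding agent..." },
  { role: "user", content: `${SECURITY_POLICY}\n\nPlease delete build.log` },
];

// What a system-level constraint would look like instead — not reachable through
// before_agent_start on v2026.1.30.
const asSystemPrompt = [
  { role: "system", content: `You are a coding agent...\n\n${SECURITY_POLICY}` },
  { role: "user", content: "Please delete build.log" },
];
```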
 

L2 Timing Gap (Severity: Medium)

 
The tool_result_persist hook fires when the tool result is persisted to the session transcript, not when it's returned to the LLM. The LLM sees the raw result for the current turn, and L2 redacts secrets only before persistence. The LLM could potentially include the raw secret in its response for that turn. The message_sending hook (which could catch this) is also not wired on v2026.1.30.
 

Known Limitations

 
Hard tool blocking requires a host update. Once before_tool_call is wired, our Layer 3 becomes a true hard-block that can't be bypassed.
 
The security gate is a soft enforcement. The agent is instructed to always call our gate tool before acting, but a sufficiently adversarial prompt could theoretically bypass this. The upcoming host update eliminates this concern.
 
CRITICAL: OpenClaw gets updated constantly, and without community updates, openclaw-shield won't stay effective for more than a few days. We've already had to update it several times. PRs are welcome and encouraged.
 

Lessons Learned

Always verify the runtime binary matches the source. We spent significant time debugging why before_tool_call didn't fire, only to discover the gateway was running an older published version than the dev repo.
 
Hook definition does not equal hook invocation. OpenClaw defines 14 hooks but only 3 have invocation sites in the published version. Always trace the full chain from registration to invocation.
 
Prompt injection is a weak guardrail for direct instructions. When the user explicitly asks to delete a file, the model will often comply despite injected policies. A tool-based gate (where the model gets a real DENIED response) is far more effective.
 
tool_result_persist is synchronous. Returning a Promise from this hook causes the host to silently skip it. This is documented in the source but easy to miss.
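
In concrete terms (the registration surface and redaction helpers here are assumed for illustration):

```typescript
// Assumed plugin surface and helpers for this sketch.
declare const api: { registerHook: (name: string, handler: (r: { text: string }) => unknown) => void };
declare function redact(text: string): string;
declare function redactAsync(text: string): Promise<string>;

// Works: synchronous handler — the host persists the redacted value.
api.registerHook("tool_result_persist", (result) => ({ text: redact(result.text) }));

// Silently skipped on this host: an async handler returns a Promise, so the
// redaction is never applied.
api.registerHook("tool_result_persist", async (result) => ({ text: await redactAsync(result.text) }));
```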
 
The plugin tool API (registerTool) is the most powerful surface for security. It lets you inject logic directly into the agent's decision flow, not just observe or modify at the edges.
 
 
 

Knostic: Discovery and Control for the Agent Layer

If you're looking for visibility and control over your coding agents, MCP servers, and IDE extensions, from Cursor and Claude Code to Copilot, check out what we're building at https://www.getkirin.com/.
