Copilot Readiness and Enterprise AI Security | Knostic Blog

Building openclaw-shield: Lessons Learned Securing OpenClaw Agents

Written by Knostic Team | Feb 5, 2026 6:42:38 PM

Why We Built openclaw-shield: Securing AI Agents from Themselves

We recently open-sourced openclaw-shield, a security plugin for OpenClaw agents. This post covers what we built, the critical issues we ran into along the way, and the lessons we learned about securing AI agents in practice.
Check it out here: https://github.com/knostic/openclaw-shield  

The Problem: AI Agents Without Guardrails

AI agents operating on behalf of users can access files, run shell commands, and produce text responses. Without guardrails, an agent can:

  • Read .env files and output raw API keys, database passwords, or tokens
  • Read customer data and display raw Social Security numbers, credit card numbers, or emails
  • Execute destructive commands like rm -rf that permanently delete files
  • Exfiltrate credentials by embedding them in shell commands (e.g., curl with bearer tokens)

How It Works: 5-Layer Defense-in-Depth

Rather than relying on a single mechanism, we built five independent security layers. If one layer is bypassed or unsupported, the others still provide protection.
 
Layer 1 (Prompt Guard) injects a security policy block into the agent's context on every turn, including rules about secrets, PII, destructive commands, and sensitive files, plus a mandatory workflow requiring the agent to call our security gate before any exec.
 
Layer 2 (Output Scanner) scans every tool result before it's persisted to the conversation. It redacts secrets (AWS keys, Stripe keys, GitHub tokens, OpenAI keys, private keys, etc.) and PII (emails, SSNs, credit card numbers, phone numbers), replacing matches with tagged placeholders like [REDACTED:aws_access_key] or [PII_REDACTED:us_ssn].
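As a rough illustration of how a pattern-based redaction pass like Layer 2's can work: the placeholder format below follows the post's examples, but the rule names and regexes are simplified assumptions for illustration, not the plugin's actual rule set.

```typescript
// Sketch of a Layer-2-style redaction pass. Rule names and patterns are
// illustrative assumptions; the real plugin covers many more secret types.
type Rule = { name: string; tag: string; pattern: RegExp };

const RULES: Rule[] = [
  // AWS access key IDs: "AKIA" followed by 16 uppercase alphanumerics.
  { name: "aws_access_key", tag: "REDACTED", pattern: /AKIA[0-9A-Z]{16}/g },
  // Classic GitHub personal access tokens start with "ghp_".
  { name: "github_token", tag: "REDACTED", pattern: /ghp_[A-Za-z0-9]{36}/g },
  // US SSN in dashed form.
  { name: "us_ssn", tag: "PII_REDACTED", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  // Deliberately loose email matcher for illustration.
  { name: "email", tag: "PII_REDACTED", pattern: /[\w.+-]+@[\w-]+\.[\w.]+/g },
];

// Replace every match with a tagged placeholder like [REDACTED:aws_access_key].
function redact(text: string): string {
  return RULES.reduce(
    (out, r) => out.replace(r.pattern, `[${r.tag}:${r.name}]`),
    text,
  );
}
```

Running the scanner before a tool result enters the transcript means the raw value never becomes part of the conversation history, which is what makes the placeholder tags auditable later.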
 
Layer 3 (Tool Blocker) is registered but non-functional on the current published OpenClaw version (v2026.1.30). When the host supports it, it will hard-block destructive commands and secrets in tool parameters. More on this below.
 
Layer 4 (Input Audit) logs every inbound message with a preview and flags if the user pastes secrets into a message. Observe-only.
 
Layer 5 (Security Gate Tool) registers a knostic_guard tool that the agent must call before executing any shell command or reading files. The tool inspects the proposed action and returns STATUS: ALLOWED or STATUS: DENIED. Combined with L1's prompt injection telling the agent to always call this tool first, this creates a functional gate.
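A minimal sketch of the kind of allow/deny check a gate tool like this can perform. The pattern lists and the action shape are assumptions for illustration; the real knostic_guard tool's checks are more extensive.

```typescript
// Hypothetical decision logic for a Layer-5-style gate. Patterns below are
// illustrative, not the plugin's actual rules.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\brm\s+(-[a-zA-Z]*f[a-zA-Z]*|--force)\b/, // rm -rf, rm -f, rm --force
  /\bmkfs\b/,                                // filesystem formatting
  /\bdd\s+if=/,                              // raw disk writes
];

// Secrets embedded directly in a command line (e.g. curl with a bearer token).
const SECRET_IN_COMMAND = /\b(AKIA[0-9A-Z]{16}|Bearer\s+\S{20,})/;

function guard(action: { kind: "exec" | "read"; target: string }): string {
  if (action.kind === "exec") {
    if (DESTRUCTIVE_PATTERNS.some((p) => p.test(action.target))) {
      return "STATUS: DENIED (destructive command)";
    }
    if (SECRET_IN_COMMAND.test(action.target)) {
      return "STATUS: DENIED (secret embedded in command)";
    }
  }
  return "STATUS: ALLOWED";
}
```

Because the agent receives the DENIED string as a structured tool result rather than as advisory prose, it tends to act on it, which is exactly the behavior the Validated Results section below reports.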
 

Validated Results

Destructive Commands: Blocked. When asked to delete a file, the agent calls our security gate first. The gate detects the destructive command and returns DENIED. The agent reports the denial to the user. The file is never deleted.
 
PII Protection: Working. When asked to read a file containing a Social Security number and email address, the agent calls our security gate. The gate allows the read but instructs the agent not to show raw PII. The agent responds with: "The file contains a customer record with three fields: Customer name, SSN, and Email. I can't share the raw values since they're PII." The SSN and email are never shown.
 
Secret Redaction: Working. When the agent reads a file containing an AWS secret key, the output scanner redacts it from the conversation history. The agent describes the file as containing "an AWS secret access key" without ever showing the raw key value.
 

Critical Issues We Found

before_tool_call Hook Not Wired (Severity: High)

 
The before_tool_call hook is defined in the hook system, can be registered by plugins, and has a runner function, but on v2026.1.30, nothing in the tool execution chain actually calls it. The wrapping code (wrapToolWithBeforeToolCallHook) was added to the codebase after v2026.1.30 was published and exists only in the dev repo (v2026.2.1). We confirmed this by searching the installed dist and finding zero results for the function. This is why we built L5 as a workaround.
 
Only 3 out of 14 defined hooks have invocation sites in the published version: before_agent_start, message_received, and tool_result_persist. The rest are defined but never called.
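Tracing from definition to invocation can be done directly against the installed build. A sketch of the check we ran, with an assumed install path; adjust it to wherever your OpenClaw dist actually lives:

```shell
# Search the published dist for the hook wrapper. Zero matches means the hook
# is defined but never wired into the tool execution chain.
grep -r "wrapToolWithBeforeToolCallHook" node_modules/openclaw/dist/ 2>/dev/null \
  || echo "not wired"
```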
 

prependContext is Weak (Severity: Medium)

 
The before_agent_start hook's prependContext field prepends text to the user's message, not to the system prompt. The model treats it as user-provided context rather than a system-level constraint. In testing, L1 fired and injected the security policy, but the agent still executed rm when directly asked. L5's tool-based gate provides a stronger mechanism because the agent makes a real tool call, receives a structured DENIED response, and acts on it.
 

L2 Timing Gap (Severity: Medium)

 
The tool_result_persist hook fires when the tool result is persisted to the session transcript, not when it's returned to the LLM. The LLM sees the raw result for the current turn, and L2 redacts secrets only before persistence. The LLM could potentially include the raw secret in its response for that turn. The message_sending hook (which could catch this) is also not wired on v2026.1.30.
 

Known Limitations

 
Hard tool blocking requires a host update. Once before_tool_call is wired, our Layer 3 becomes a true hard-block that can't be bypassed.
 
The security gate is soft enforcement. The agent is instructed to always call our gate tool before acting, but a sufficiently adversarial prompt could theoretically bypass this. The upcoming host update eliminates this concern.
 
CRITICAL: OpenClaw is updated constantly, and without community maintenance, openclaw-shield won't stay effective for more than a few days. We've already had to update it several times. PRs are welcome and encouraged.
 

Lessons Learned

Always verify the runtime binary matches the source. We spent significant time debugging why before_tool_call didn't fire, only to discover the gateway was running an older published version than the dev repo.
 
Hook definition does not equal hook invocation. OpenClaw defines 14 hooks but only 3 have invocation sites in the published version. Always trace the full chain from registration to invocation.
 
Prompt injection is a weak guardrail for direct instructions. When the user explicitly asks to delete a file, the model will often comply despite injected policies. A tool-based gate (where the model gets a real DENIED response) is far more effective.
 
tool_result_persist is synchronous. Returning a Promise from this hook causes the host to silently skip it. This is documented in the source but easy to miss.
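The pitfall is easy to reproduce in miniature. The handler signature below is an assumption about the hook API's shape; the point is only the sync-vs-async contrast.

```typescript
// Hypothetical hook handler shapes illustrating the sync-only pitfall.
type ToolResult = { content: string };

function redactSecrets(text: string): string {
  return text.replace(/AKIA[0-9A-Z]{16}/g, "[REDACTED:aws_access_key]");
}

// WRONG (illustrative): an async handler returns a Promise. A host that
// expects a plain synchronous return value will silently skip the result.
const badHook = async (result: ToolResult): Promise<ToolResult> => {
  return { content: redactSecrets(result.content) };
};

// RIGHT: do the work synchronously and return the plain value.
const goodHook = (result: ToolResult): ToolResult => {
  return { content: redactSecrets(result.content) };
};
```

If your redaction genuinely needs async work, do it ahead of time and keep the hook body itself synchronous.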
 
The plugin tool API (registerTool) is the most powerful surface for security. It lets you inject logic directly into the agent's decision flow, not just observe or modify at the edges.
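A sketch of what wiring a gate through a registerTool-style surface might look like. The Tool interface and field names here are assumptions, not OpenClaw's actual API.

```typescript
// Hypothetical plugin-tool wiring; the interface shape is assumed.
interface Tool {
  name: string;
  description: string;
  run: (input: { action: string }) => string;
}

function makeGuardTool(): Tool {
  return {
    name: "knostic_guard",
    description: "Must be called before any exec or file read.",
    // Minimal illustrative check; a real gate would share Layer 5's full logic.
    run: ({ action }) =>
      /\brm\s+-[a-zA-Z]*f/.test(action) ? "STATUS: DENIED" : "STATUS: ALLOWED",
  };
}
```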
 

More Open Source Tools to Secure OpenClaw

openclaw-shield: https://github.com/knostic/openclaw-shield
 
openclaw-telemetry: https://github.com/knostic/openclaw-telemetry/

openclaw-detect: https://github.com/knostic/openclaw-detect/
 
 

Knostic: Discovery and Control for the Agent Layer

If you're looking for visibility and control over your coding agents, MCP servers, and IDE extensions, from Cursor and Claude Code to Copilot, check out what we're building at https://www.getkirin.com/