
Agent Tool Permissions and Confirmation Flows

Design the tool-permission model for an LLM agent so that a compromise has a bounded blast radius.


The agent's tools are where prompt injection becomes consequential. The model can be jailbroken, the response can be manipulated, but a properly designed tool-permission model limits what damage results.

This skill covers the design of tool permissions: what tools the agent has, when it can use them, who has to approve, and how to bound the blast radius of any single compromise.

Risk Classification

Classify each tool by risk before granting it to the agent:

  • Read-only, low value. Reading public data, querying open APIs without auth. Risk of misuse is low.
  • Read-only, high value. Reading user data, accessing private files, querying internal APIs. Risk is leak via output channel.
  • Write, low impact. Creating drafts, scheduling reminders for self, generating content. Reversible.
  • Write, medium impact. Sending messages to known contacts, creating database entries, posting to public channels.
  • Write, high impact. Sending money, deleting data, executing code, calling external APIs that have material effects.
  • Privileged. Modifying system configuration, accessing other users' data, escalating privileges.

The permission model treats each class differently.
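The classification above can be sketched as a simple registry. This is a minimal illustration, not a prescribed API; the tool names are hypothetical placeholders.

```python
from enum import Enum

class ToolRisk(Enum):
    READ_LOW = "read-only, low value"
    READ_HIGH = "read-only, high value"
    WRITE_LOW = "write, low impact"
    WRITE_MED = "write, medium impact"
    WRITE_HIGH = "write, high impact"
    PRIVILEGED = "privileged"

# Hypothetical registry: classify each tool BEFORE granting it to the agent.
TOOL_RISK = {
    "search_public_web": ToolRisk.READ_LOW,
    "read_user_files": ToolRisk.READ_HIGH,
    "create_draft": ToolRisk.WRITE_LOW,
    "send_message": ToolRisk.WRITE_MED,
    "transfer_funds": ToolRisk.WRITE_HIGH,
    "modify_config": ToolRisk.PRIVILEGED,
}
```

Keeping the classification in one table makes it reviewable: a new tool cannot be wired into the agent without someone assigning it a risk class.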

The Permission Model

Standard model:

  • Always allowed: read-only low-value tools. The agent uses them freely.
  • Allowed within scope: read-only high-value tools. The agent can use them on the current user's data only; cross-user access requires elevation.
  • Confirm on use: write low-impact and medium-impact tools. The agent proposes; the user confirms; the action executes.
  • Confirm with detail: write high-impact tools. Confirmation includes the specific action ("send $200 to alice@example.com"); user approves with full context.
  • Out of band: privileged tools. Not given to the agent at all; require a separate auth flow (the user explicitly elevates).

Each agent has a tool set determined by its task. A drafting agent doesn't need write tools at all. A scheduling agent has limited write tools. A coding agent has more, but with execution sandboxes.
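One way to encode the standard model is a lookup from risk class to policy, with unknown classes falling through to the strictest treatment. A sketch, assuming the class names used above:

```python
from enum import Enum

class Policy(Enum):
    ALWAYS = "always allowed"
    SCOPED = "allowed within scope"
    CONFIRM = "confirm on use"
    CONFIRM_DETAIL = "confirm with detail"
    OUT_OF_BAND = "not granted to the agent"

# Policy per risk class, mirroring the standard model above.
RISK_POLICY = {
    "read_low": Policy.ALWAYS,
    "read_high": Policy.SCOPED,
    "write_low": Policy.CONFIRM,
    "write_medium": Policy.CONFIRM,
    "write_high": Policy.CONFIRM_DETAIL,
    "privileged": Policy.OUT_OF_BAND,
}

def policy_for(risk_class: str) -> Policy:
    # An unclassified tool gets the strictest policy, consistent with default deny.
    return RISK_POLICY.get(risk_class, Policy.OUT_OF_BAND)
```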

Confirmation Flow Design

The confirmation flow is the user's last line of defense. Design it carefully.

Standard pattern:

  1. Agent proposes the action. "I'd like to send the following email: [draft]."
  2. User reviews the proposal. They see what will happen, in detail.
  3. User confirms or modifies. They click confirm, edit and confirm, or reject.
  4. Action executes. Confirmation produces the action; rejection cancels.

The proposal must contain enough information for the user to evaluate. For an email: recipient, subject, body. For a payment: amount, recipient, source account. For a delete: which files, how many.

Bad proposal: "I'm about to send an email." The user can't tell what will happen.

Good proposal: "I'm about to send this email to alice@example.com:" [full subject and body shown] "Confirm?"
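A proposal renderer that always shows the full parameters makes the good pattern the default. A minimal sketch; the prompt format is illustrative, not a fixed spec:

```python
def render_proposal(tool: str, params: dict) -> str:
    """Render a confirmation prompt showing the full action, never a vague summary."""
    lines = [f"The agent wants to call '{tool}' with:"]
    for key, value in sorted(params.items()):
        lines.append(f"  {key}: {value}")
    lines.append("Confirm? [yes / edit / no]")
    return "\n".join(lines)

prompt = render_proposal(
    "send_email",
    {"to": "alice@example.com", "subject": "Q3 report", "body": "Draft attached."},
)
```

Because the renderer iterates over every parameter, there is no code path that produces a "Confirm?" without the detail the user needs to evaluate it.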

Confirmation Fatigue

If every action requires confirmation, users develop click-through habits. The confirmation becomes meaningless.

Mitigations:

  • Batch confirmations. "I'm going to (1) send this email, (2) update this calendar event, (3) move these 3 files. Confirm all?" One confirmation, multiple actions.
  • Tier by risk. Low-impact actions go through without confirmation; high-impact actions confirm individually.
  • Smart defaults. "Trusted" recipients/contexts get less confirmation; new ones get more.
  • Time windows. "Confirm to enable email sending for the next 5 minutes" — the user's intent is captured once, propagates to subsequent actions in the same task.

The goal is a confirmation pattern that captures user intent without exhausting attention.
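Batching and tiering can be combined: high-impact actions always confirm individually, everything else shares one batch prompt. A sketch under those assumptions:

```python
from dataclasses import dataclass

@dataclass
class Action:
    tool: str
    summary: str
    high_impact: bool

def split_for_confirmation(actions):
    """High-impact actions confirm one by one; the rest share a single batch prompt."""
    individual = [a for a in actions if a.high_impact]
    batch = [a for a in actions if not a.high_impact]
    return individual, batch

individual, batch = split_for_confirmation([
    Action("send_email", "email Alice the draft", high_impact=False),
    Action("update_event", "move standup to 10:00", high_impact=False),
    Action("transfer_funds", "send $200 to alice@example.com", high_impact=True),
])
```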

Scoping

The agent operates within a scope. Scope limits what tools it can call and what data they can access.

Examples:

  • Per-conversation scope. The agent's tools only operate on data within this conversation.
  • Per-task scope. Within this task, the agent can do X, Y, Z; for new tasks, scope resets.
  • Per-user scope. The agent can only act on the current user's data; cross-user access is forbidden.
  • Per-session scope. Some tools are enabled only during the session; persistent capabilities require re-auth.

Scoping limits the blast radius of injection. A jailbroken agent in a per-task scope can only damage the task; the user's other data is safe.
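A per-user scope check is small enough to sit in front of every tool call. A minimal sketch; real systems would check a richer scope object:

```python
def in_scope(requested_user: str, current_user: str, scope: str) -> bool:
    """Per-user scoping: tools may only touch the current user's data."""
    if scope == "per-user":
        return requested_user == current_user
    # Unrecognized scope types are denied, consistent with default deny.
    return False

allowed = in_scope("alice", "alice", "per-user")
```

Putting the check at the tool-call boundary, rather than trusting the model to respect scope, means a jailbroken agent still cannot reach another user's data.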

Default Deny

Tools are denied by default; the agent acquires them through explicit grants. A new tool requires explicit permission; nothing is enabled for the agent automatically.

This is the opposite of common development practice ("turn on all features for ease"). For agents, the bias goes the other way: every capability is a vector; minimize what's enabled.

When the agent needs a tool it doesn't have, the user is asked: "The agent wants to enable [tool]. Allow?" The user decides, and the grant is recorded.
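Default deny fits in a few lines: a tool is callable only after an explicit, recorded grant. A sketch, with hypothetical tool names:

```python
class ToolGrants:
    """Default deny: a tool is callable only after an explicit, recorded grant."""

    def __init__(self):
        self._grants = {}  # tool name -> who granted it

    def grant(self, tool: str, granted_by: str):
        self._grants[tool] = granted_by

    def is_allowed(self, tool: str) -> bool:
        return tool in self._grants

grants = ToolGrants()
before = grants.is_allowed("send_email")   # denied until granted
grants.grant("send_email", granted_by="alice")
after = grants.is_allowed("send_email")
```

Recording who granted each tool also feeds the audit log: every capability traces back to a human decision.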

Audit Logging

Every tool call is logged with:

  • Timestamp.
  • Tool name.
  • Input parameters.
  • Output / result.
  • User confirmation (was it confirmed; when; with what evidence of intent).
  • Agent state (model version, system prompt hash, conversation ID).

The audit log is the post-hoc investigation surface. When something goes wrong (an unexpected email was sent, an unexpected file was deleted), the audit log tells you what the agent did, with what input, with what user signal of consent.

Audit logs should be tamper-resistant. Append-only storage; cryptographic integrity; not writable by the agent itself. The agent must not be able to modify its own audit trail.
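A common way to get tamper evidence without special storage is a hash chain: each entry commits to the previous one, so any edit breaks verification from that point on. A minimal in-memory sketch (a real deployment would persist to append-only storage the agent cannot write):

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only log; each entry hashes the previous one so tampering is detectable."""

    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64

    def record(self, tool, params, result, confirmed_by=None):
        entry = {
            "ts": time.time(),
            "tool": tool,
            "params": params,
            "result": result,
            "confirmed_by": confirmed_by,
            "prev": self._prev_hash,
        }
        digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self._prev_hash = digest
        self.entries.append(entry)

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev"] != prev:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```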

Time-Limited Capabilities

Some tools are dangerous if held permanently. Grant them time-limited.

Examples:

  • "Enable code execution for the next 30 minutes."
  • "Allow sending emails to new recipients for this conversation only."
  • "Allow database writes for the duration of this task."

After the time window, the capability is revoked. Re-acquiring it requires explicit user action.

This bounds the persistence of any single permission. An attacker who jailbreaks the agent has only the window of the granted capability, not unbounded access.
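A time-limited grant is just a capability with an expiry that the tool-call boundary checks on every use. A sketch using a monotonic clock:

```python
import time

class TimedGrant:
    """A capability that expires; re-acquiring it requires a new explicit grant."""

    def __init__(self, tool: str, ttl_seconds: float):
        self.tool = tool
        self.expires_at = time.monotonic() + ttl_seconds

    def is_active(self) -> bool:
        return time.monotonic() < self.expires_at

grant = TimedGrant("execute_code", ttl_seconds=30 * 60)  # 30-minute window
```

Checking expiry at call time (rather than revoking on a timer) means there is no window where a crashed revocation job leaves the capability live.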

Capability Approval Hierarchy

For organizational use, capabilities may require approval beyond the user:

  • User-approvable: most tool grants. The user decides for their own data.
  • Manager-approvable: tools with cross-team data access; tools with significant external impact.
  • Admin-approvable: tools with privileged access; tools that affect multiple users.

The agent never grants itself a capability; it asks the user, who may need to ask their manager, who may need to ask admin. The hierarchy ensures appropriate review.
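The hierarchy can be encoded as a lookup from capability category to required approver, again failing closed. The category names here are hypothetical:

```python
# Hypothetical mapping from capability category to the approver it requires.
APPROVER = {
    "own_data": "user",
    "cross_team_data": "manager",
    "privileged": "admin",
}

def required_approver(category: str) -> str:
    # Unlisted categories escalate to the highest level rather than defaulting open.
    return APPROVER.get(category, "admin")
```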

Tool Sandbox

For high-risk tools, run them in a sandbox:

  • Code execution in a container with no network, no filesystem, time limits.
  • HTTP requests through a proxy that enforces allowlists and rate limits.
  • Database queries through a connection that has read-only access by default; writes require elevation.

Sandboxing converts catastrophic capabilities (run any code) into bounded ones (run code in a sealed environment). The agent's compromised actions are contained.
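As one layer of such a sandbox, untrusted code can run in a separate process with a hard time limit. This sketch shows only the process-isolation and timeout layer; a real deployment would add a container, a no-network namespace, and filesystem isolation:

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: float = 5.0) -> str:
    """Run untrusted Python in a separate process with a hard time limit.

    Only one layer of a real sandbox: containers, network, and filesystem
    isolation must be added around it.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env and site dirs
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout
    finally:
        os.unlink(path)

out = run_sandboxed("print(2 + 2)")
```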

When Permissions Are Wrong

Sometimes the permission system gets in the way of legitimate work. The user wants to do something the agent can't (yet), and the friction is annoying.

The right response: track the friction; consider whether the capability should be expanded; expand or don't based on the risk-reward.

The wrong response: lower the bar generally. Each capability expansion is its own decision; don't trade away the model for one user's convenience.

Anti-Patterns

Allow-by-default tool access. Every tool the agent has is a vector. Default deny; explicit grant.

Single confirmation for unrelated actions. "Confirm everything" is meaningless. Tier by risk; batch only related actions.

Vague confirmation prompts. "Confirm?" without showing what. The user can't evaluate.

Click-through fatigue. Every action confirmed; users click through reflexively. Batch and tier.

No audit logging. Investigation requires reconstructing from chat. Log every tool call.

Permanent capability grants. "Enable code execution forever." A jailbreak six months later has the capability. Time-limit.

Agent grants itself capabilities. The agent can ask but not grant. Granting is the user's authority.
