Tools and Actions: How AI Agents Act on the World

This is the third deep-dive in the system of intelligence series. The first two covered the memory substrate and compositional retrieval. Memory holds what the agent knows. Retrieval gets the right slice into working memory. This post is about the third pillar, how the agent actually does anything with what it knows.

This post covers:

The goal → plan → task → action hierarchy
What a tool actually is: schema, description, permissions, idempotency
Two patterns for tool selection, description-based atomic tools vs. workflow-as-a-tool and when to use each
Tools vs. skills vs. MCP, three terms for overlapping things
Where tools and workflows live, tool registry, workflow engine, and workflow-as-a-tool
Maria’s case end to end — from plan to action

The running example is still Maria, the Platinum DeltaBank customer disputing a $39 late fee. We left her with the agent’s working memory fully populated like customer facts, applicable policy, payment history, comparable precedent. The agent now has to act on that information.

1. Goal, plan, task, action

Most production agents conflate these terms. Doing so is the most common source of unpredictable behavior. Keep them separate.

Goal (what the user wants):The desired end state. “Resolve Maria’s $39 late fee dispute.” The goal is input to planning, not chosen by the agent.
Plan (the ordered set of tasks the agent commits to in order to achieve the goal): A plan is a tree or DAG, tasks can run sequentially, in parallel, or conditionally. Produced by the agent’s planning step; written somewhere durable (a :Plan node in the graph for long-running cases, working memory for short ones).
Task (a single bounded unit of work in the plan): Verify identity. Pull payment history. Check waiver eligibility. Apply the waiver. Log the case. Send confirmation. Each task has a precondition (what must be true before it runs), an outcome (what must be true after), and a typical duration.
Action (the concrete tool call that executes a task): “Verify identity” (the task) becomes call verify_identity(customer_id="CUST_4421") (the action). One task usually maps to one action, but complex tasks decompose into several.

The hierarchy is goal → plan → task → action, top to bottom. Each layer is a refinement of the one above. Keeping them separate is what makes the system auditable — six months from now, you can ask “why did the agent take this action?” and trace it back through the task it executed, the plan that contained the task, and the goal that produced the plan.

Two phases govern how an agent moves through this hierarchy.

Plan phase: The agent reads the goal (from the user) and its working memory (from retrieval), and produces a plan. “To resolve a late-fee dispute, I will: verify identity, check waiver eligibility, apply waiver if eligible, log the case, notify customer.” The plan is a structured object, not free text — each task in the plan names a known workflow step and references the tools that will execute it.

Execute phase: The agent walks the plan, picking the next task whose preconditions are satisfied, executing it via tool calls, checking the outcome, moving to the next task. Each action goes through the harness before reaching any real system.

The two phases interleave when reality intervenes. If a task fails, the agent re-plans from the current state. If new information arrives mid-execution (the customer asks an unrelated question), the agent re-plans with the updated goal set. This pattern — plan, execute, observe, replan when needed — is what the field calls plan-and-execute, and it’s the dominant production architecture for agents that need to be reviewable.

2. What a tool actually is

A tool is not just a function. It is a first-class declarative object with seven components:

tool:
  name: waive_fee
  description: >
    Reverse a fee that was charged to a customer's account, recording
    the waiver reason. Use this only when the customer is eligible per
    policy and the amount is within the agent's allowed range.
  input_schema:
    fee_event_id: string (required)
    amount: number (required, max: 50)
    reason: enum["goodwill", "system_error", "policy_override"]
    requested_by: string (required, employee_id)
  output_schema:
    status: enum["success", "rejected", "requires_approval"]
    transaction_id: string
    reversed_at: timestamp
  side_effects: write
  idempotency_key: required
  permissions: [fee_waiver_apply]
  timeout_ms: 2000
  governed_by: :Policy{name: "Late Fee Waiver"}

Each component does specific work:

Name and description: The description is what the planner reads when deciding whether to use this tool. It is also what the harness compares against the runtime arguments to detect misuse. Bad descriptions are the single most common cause of agent confusion — “call this tool when you want to reverse a charge” is too loose; the example above is specific enough to make the right call.
Input schema: Typed parameters with required/optional flags, value ranges, and enums. The schema is enforced before the tool runs. An attempt to call waive_fee with amount: 500 against a schema declaring max: 50 is rejected at the schema layer, not by the underlying system.
Output schema: What the agent should expect to receive back. Critical for two reasons: the agent can validate the response before reasoning over it, and the harness can detect when a tool returned something outside spec.
Side effects: read, write, or write_external. get_payment_history is a read — calling it twice changes nothing. waive_fee is a write — calling it twice without an idempotency key would reverse the fee twice. send_confirmation_email is a write-external — once the email is sent, no key in the world can unsend it. This single property determines whether the tool is reversible, whether it requires idempotency, and what level of harness scrutiny it needs.
Idempotency key: For write tools, required. The agent passes a unique key with each call; the tool’s underlying implementation guarantees that re-calling with the same key produces the same effect (or no additional effect). This is what makes safe retry possible.
Permissions: Which roles or scopes are allowed to invoke this tool. Maria’s senior support rep Janet has fee_waiver_apply; an intern doing read-only research does not. The harness checks this before every call.
Timeout: How long the agent should wait before considering the call failed. Production tools fail in a controlled way far more often than they succeed instantly.
Governed by: A link to the policy or workflow that authorizes this tool’s existence. The audit trail asks “under what authority does the system allow this action?” and the answer is a graph edge.

3. How does the agent decide which tool to call?

Two patterns are used in production. Neither is wrong; they fit different situations.

Pattern A: Description-based selection

The agent has a catalog of tools, each with a name and description. At runtime it reads the descriptions, reasons over what the task needs, and picks a tool whose description matches. The tool is not bound to a goal type — it floats in the catalog, available whenever its description fits. OpenAI function calling, Anthropic Skills, and most MCP-exposed tools work this way.

Strength: flexible and new tools get picked up automatically. Weakness: unpredictable, different picks for similar cases.

Pattern B: Workflow-as-a-tool

A workflow is exposed to the agent as a single tool. Calling the tool invokes a predefined sequence like verify identity, check eligibility, apply waiver, log case, notify customer — each step orchestrated by a workflow engine outside the agent. The agent sees one tool call; the workflow engine handles the rest. The workflow tool exists in the tool registry alongside atomic tools like waive_fee — it just happens to be a tool whose implementation is a multi-step process.

Strength: predictable like for same case, same sequence, same audit trail. Weakness: rigid, only works for cases a workflow was authored for.

When to use which

The right answer in production is both, deliberately, with three factors deciding:

Stakes: Money-moving, state-changing, externally-effecting tools → encapsulate in a workflow tool. Read-only and internal-compute tools → catalog as atomic tools.
Cross-cutting use: Tools that apply broadly (translation, formatting, math) → atomic catalog tools. Specific high-stakes sequences (late_fee_dispute_workflow, card_replacement_workflow) → workflow tools.
Variability of use: Same-way-every-time → workflow tool. Case-by-case discretion → atomic tools the agent composes itself.

In practice DeltaBank runs both. For Maria’s case the agent invokes late_fee_dispute_workflow (workflow tool, encapsulates the five-step sequence) plus a few atomic catalog tools like format_currency (used to format the confirmation message). Workflow tools are the rails for high-stakes execution; the atomic catalog is the discretion for everything else.

4. Where tools and workflows live

The context graph from the previous posts is the retrieval substrate — what the agent reads from. It is not where executable processes live. Tools and workflows live elsewhere.

Tools live in a tool registry: OpenAI/Anthropic function definitions, MCP servers, or an internal service catalog. The tool’s name, schema, permissions, and harness hooks live there, not in the context graph.
Workflows live in a workflow engine: Temporal, Camunda, Step Functions, or a custom orchestrator. The engine handles state, retries, branching, and partial completion. A knowledge graph is not the right shape for that.
A workflow is itself a tool: Each workflow is registered in the tool registry alongside atomic tools. From the agent’s perspective there is no difference between calling waive_fee (atomic) and calling late_fee_dispute_workflow (workflow) — both are tool calls with names, schemas, permissions, and a harness wrapper. The workflow tool’s implementation happens to be a multi-step process; the agent does not need to know that.

DeltaBank’s registry for Maria’s case contains both kinds:

atomic:    waive_fee, create_ticket, send_confirmation_email,
           verify_identity, check_waiver_eligibility, format_currency

workflow:  late_fee_dispute_workflow, card_replacement_workflow,
           fraud_investigation_workflow

Audit trail still works. When any tool is invoked — atomic or workflow — the resulting :DecisionTrace in episodic memory records the tool name, arguments, and outcome. The tool definition lives in the registry; the invocation record lives in the context graph. Separation of concerns clean.

5. Maria’s case — two scenarios

Maria’s case can play out in two very different ways depending on what tools the system already has registered. The dichotomy matches the Pattern A / Pattern B split from Section 3, but now we walk through both ends of it on the same customer.

The key architectural point: when a workflow tool’s description matches the user’s request, the model picks it and retrieval is skipped entirely. The workflow tool encapsulates the policy, the sequence, and the atomic calls. There is nothing for the agent to retrieve and reason over — the workflow already knows what to do. Retrieval only fires when no tool in the catalog is a good match.

Scenario A — Workflow tool exists (the production happy path)

Maria’s case is well-trodden. DeltaBank has handled thousands of these. A workflow tool late_fee_dispute_workflow is registered in the catalog with a clear description: “Reverse a disputed credit card late fee for an eligible customer, applying tier rules, annual limits, and ledger reconciliation. Use when a customer is contesting a late fee charge.”

Phase 1 (Tool selection): The agent reads the available tool descriptions and reasons over what Maria asked for. The late_fee_dispute_workflow description is a strong match — same domain, same operation, applicable customer profile. The agent picks it. No retrieval is invoked. The workflow tool already encapsulates everything the agent needs — applicable policy, eligibility logic, the five-step sequence.

Phase 2 (Invoke the workflow tool): The plan reduces to a single tool call:

plan:
  goal: "Resolve Maria's $39 late fee dispute"
  tasks:
    - tool: late_fee_dispute_workflow
      args:
        customer_id: "CUST_4421"
        fee_event_id: "EVT_88291"
        idempotency_key: "LFD-CUST_4421-EVT_88291"

The agent does not orchestrate the sequence itself. The workflow engine does.

Phase 3 (Execute): The workflow tool call goes through the harness. Inside the workflow engine, the five internal steps run.

(The idempotency check itself fires at the workflow tool level in Phase 2 above — not duplicated per internal step.)

Post-execution (2 checks):

Output schema validation: The response matches the declared output schema. Pass.
Post-hook ledger_verification: A separate read against the ledger confirms the $39 reversal actually landed. Pass.

The workflow engine proceeds to the remaining steps (log case, send email). When done, the agent’s :DecisionTrace records the workflow invocation and outcome.

End-to-end wall-clock: ~900ms from utterance to ledger reversal. No retrieval. One workflow tool call. Five internal atomic tool calls.

This is what every business-sensitive case should look like in production. Predictable, fast, fully audited.

Scenario B — No matching workflow tool (the discovery path)

Now imagine Maria’s case is something less common. Same customer, but this time she says “I want to dispute a $39 late fee on my Platinum card AND an annual fee charge on my business card — and the timing makes me think they’re related.” The agent scans the available tool descriptions. late_fee_dispute_workflow matches part of the request but not the cross-product dimension. annual_fee_waiver_workflow matches another part but does not handle the combined logic. No single tool description is a confident match for the whole request. The system has never explicitly handled a cross-product dispute as a single workflow.

Phase 1 (Tool selection, no confident match): The agent scans tool descriptions but no single workflow tool covers the full request. The agent falls back to the retrieval-and-compose path.

Phase 2 (Retrieval):. The agent invokes retrieve_context. The compositional pipeline (see the previous post for how) returns:

Facts: both accounts, both fee events, payment history across both products, tier.
Policy chunks: late fee waiver policy v3, annual fee waiver policy v2, cross-product handling guidance.
Procedural guidance: the retrieved policy passages say “for multi-product fee disputes, verify identity once, then evaluate each fee independently against its applicable policy, then apply waivers selectively.”
Precedent: two past traces where agents handled similar cases by combining the single-product workflows.

The procedural guidance is the critical retrieved artifact. It tells the agent what sequence to compose.

Phase 3 (Plan from atomic tools): Using the retrieved guidance, the agent composes a plan from the atomic tool catalog:

plan:
  tasks:
    - T1: verify_identity
    - T2: get_fee_event(EVT_88291)             # late fee
    - T3: get_fee_event(EVT_88450)             # annual fee
    - T4: check_waiver_eligibility(EVT_88291)
    - T5: check_waiver_eligibility(EVT_88450)
    - T6: waive_fee(EVT_88291) if T4.eligible
    - T7: waive_fee(EVT_88450) if T5.eligible
    - T8: create_ticket
    - T9: send_confirmation_email

Eight atomic tool calls instead of one workflow call. The agent reasoned its way to this sequence from the retrieved policy text and precedent.

Phase 4 (Execute): The agent walks the plan task by task. Each atomic call goes through the harness , but now applied across each tool call in the plan rather than encapsulated inside one workflow tool.

End-to-end wall-clock: ~1.9 seconds. Slower than Scenario A, broader harness surface, more places for things to go wrong — but the case got resolved using only retrieval and atomic composition.

Closing

Three takeaways for the Tools and Actions pillar:

Tools are declarative objects, not function calls: Name, description, schemas, side-effects, permissions, idempotency, governance — all of it lives in the tool definition. “What can the agent do?” is a graph query, not a code search. (And harness scrutiny scales with stakes: light on read tools, heavy on write tools, heaviest on write-external tools.)
Two paths, one direction of travel: Workflow tools = direct invocation, no retrieval, fast, audited (Scenario A). Atomic tools composed via retrieval = flexible, novel-case capable, slower (Scenario B). Production systems run both, but the goal is to graduate recurring patterns from B to A over time. Today’s discovery path is tomorrow’s production rail.
Goal, plan, task, action are four different things: The agent’s behavior is reviewable precisely because each layer is a refinement of the one above. Confusing them is the most common architectural mistake in production agents.

The next pillar is Harness Engineering, the deterministic layer that wraps every tool call, every memory write, every output. Tools turn decisions into changes in the world. The harness is what makes sure those changes are the right ones.

This post covers:

The goal → plan → task → action hierarchy
What a tool actually is: schema, description, permissions, idempotency
Two patterns for tool selection, description-based atomic tools vs. workflow-as-a-tool and when to use each
Tools vs. skills vs. MCP, three terms for overlapping things
Where tools and workflows live, tool registry, workflow engine, and workflow-as-a-tool
Maria’s case end to end — from plan to action

1. Goal, plan, task, action

Most production agents conflate these terms. Doing so is the most common source of unpredictable behavior. Keep them separate.

Goal (what the user wants):The desired end state. “Resolve Maria’s $39 late fee dispute.” The goal is input to planning, not chosen by the agent.
Plan (the ordered set of tasks the agent commits to in order to achieve the goal): A plan is a tree or DAG, tasks can run sequentially, in parallel, or conditionally. Produced by the agent’s planning step; written somewhere durable (a :Plan node in the graph for long-running cases, working memory for short ones).
Task (a single bounded unit of work in the plan): Verify identity. Pull payment history. Check waiver eligibility. Apply the waiver. Log the case. Send confirmation. Each task has a precondition (what must be true before it runs), an outcome (what must be true after), and a typical duration.
Action (the concrete tool call that executes a task): “Verify identity” (the task) becomes call verify_identity(customer_id="CUST_4421") (the action). One task usually maps to one action, but complex tasks decompose into several.

Two phases govern how an agent moves through this hierarchy.

2. What a tool actually is

A tool is not just a function. It is a first-class declarative object with seven components:

tool:
  name: waive_fee
  description: >
    Reverse a fee that was charged to a customer's account, recording
    the waiver reason. Use this only when the customer is eligible per
    policy and the amount is within the agent's allowed range.
  input_schema:
    fee_event_id: string (required)
    amount: number (required, max: 50)
    reason: enum["goodwill", "system_error", "policy_override"]
    requested_by: string (required, employee_id)
  output_schema:
    status: enum["success", "rejected", "requires_approval"]
    transaction_id: string
    reversed_at: timestamp
  side_effects: write
  idempotency_key: required
  permissions: [fee_waiver_apply]
  timeout_ms: 2000
  governed_by: :Policy{name: "Late Fee Waiver"}

Each component does specific work:

Name and description: The description is what the planner reads when deciding whether to use this tool. It is also what the harness compares against the runtime arguments to detect misuse. Bad descriptions are the single most common cause of agent confusion — “call this tool when you want to reverse a charge” is too loose; the example above is specific enough to make the right call.
Input schema: Typed parameters with required/optional flags, value ranges, and enums. The schema is enforced before the tool runs. An attempt to call waive_fee with amount: 500 against a schema declaring max: 50 is rejected at the schema layer, not by the underlying system.
Output schema: What the agent should expect to receive back. Critical for two reasons: the agent can validate the response before reasoning over it, and the harness can detect when a tool returned something outside spec.
Side effects: read, write, or write_external. get_payment_history is a read — calling it twice changes nothing. waive_fee is a write — calling it twice without an idempotency key would reverse the fee twice. send_confirmation_email is a write-external — once the email is sent, no key in the world can unsend it. This single property determines whether the tool is reversible, whether it requires idempotency, and what level of harness scrutiny it needs.
Idempotency key: For write tools, required. The agent passes a unique key with each call; the tool’s underlying implementation guarantees that re-calling with the same key produces the same effect (or no additional effect). This is what makes safe retry possible.
Permissions: Which roles or scopes are allowed to invoke this tool. Maria’s senior support rep Janet has fee_waiver_apply; an intern doing read-only research does not. The harness checks this before every call.
Timeout: How long the agent should wait before considering the call failed. Production tools fail in a controlled way far more often than they succeed instantly.
Governed by: A link to the policy or workflow that authorizes this tool’s existence. The audit trail asks “under what authority does the system allow this action?” and the answer is a graph edge.

3. How does the agent decide which tool to call?

Two patterns are used in production. Neither is wrong; they fit different situations.

Pattern A: Description-based selection

Strength: flexible and new tools get picked up automatically. Weakness: unpredictable, different picks for similar cases.

Pattern B: Workflow-as-a-tool

Strength: predictable like for same case, same sequence, same audit trail. Weakness: rigid, only works for cases a workflow was authored for.

When to use which

The right answer in production is both, deliberately, with three factors deciding:

Stakes: Money-moving, state-changing, externally-effecting tools → encapsulate in a workflow tool. Read-only and internal-compute tools → catalog as atomic tools.
Cross-cutting use: Tools that apply broadly (translation, formatting, math) → atomic catalog tools. Specific high-stakes sequences (late_fee_dispute_workflow, card_replacement_workflow) → workflow tools.
Variability of use: Same-way-every-time → workflow tool. Case-by-case discretion → atomic tools the agent composes itself.

4. Where tools and workflows live

The context graph from the previous posts is the retrieval substrate — what the agent reads from. It is not where executable processes live. Tools and workflows live elsewhere.

Tools live in a tool registry: OpenAI/Anthropic function definitions, MCP servers, or an internal service catalog. The tool’s name, schema, permissions, and harness hooks live there, not in the context graph.
Workflows live in a workflow engine: Temporal, Camunda, Step Functions, or a custom orchestrator. The engine handles state, retries, branching, and partial completion. A knowledge graph is not the right shape for that.
A workflow is itself a tool: Each workflow is registered in the tool registry alongside atomic tools. From the agent’s perspective there is no difference between calling waive_fee (atomic) and calling late_fee_dispute_workflow (workflow) — both are tool calls with names, schemas, permissions, and a harness wrapper. The workflow tool’s implementation happens to be a multi-step process; the agent does not need to know that.

DeltaBank’s registry for Maria’s case contains both kinds:

atomic:    waive_fee, create_ticket, send_confirmation_email,
           verify_identity, check_waiver_eligibility, format_currency

workflow:  late_fee_dispute_workflow, card_replacement_workflow,
           fraud_investigation_workflow

5. Maria’s case — two scenarios

Scenario A — Workflow tool exists (the production happy path)

Phase 2 (Invoke the workflow tool): The plan reduces to a single tool call:

plan:
  goal: "Resolve Maria's $39 late fee dispute"
  tasks:
    - tool: late_fee_dispute_workflow
      args:
        customer_id: "CUST_4421"
        fee_event_id: "EVT_88291"
        idempotency_key: "LFD-CUST_4421-EVT_88291"

The agent does not orchestrate the sequence itself. The workflow engine does.

Phase 3 (Execute): The workflow tool call goes through the harness. Inside the workflow engine, the five internal steps run.

(The idempotency check itself fires at the workflow tool level in Phase 2 above — not duplicated per internal step.)

Post-execution (2 checks):

Output schema validation: The response matches the declared output schema. Pass.
Post-hook ledger_verification: A separate read against the ledger confirms the $39 reversal actually landed. Pass.

The workflow engine proceeds to the remaining steps (log case, send email). When done, the agent’s :DecisionTrace records the workflow invocation and outcome.

End-to-end wall-clock: ~900ms from utterance to ledger reversal. No retrieval. One workflow tool call. Five internal atomic tool calls.

This is what every business-sensitive case should look like in production. Predictable, fast, fully audited.

Scenario B — No matching workflow tool (the discovery path)

Phase 1 (Tool selection, no confident match): The agent scans tool descriptions but no single workflow tool covers the full request. The agent falls back to the retrieval-and-compose path.

Phase 2 (Retrieval):. The agent invokes retrieve_context. The compositional pipeline (see the previous post for how) returns:

Facts: both accounts, both fee events, payment history across both products, tier.
Policy chunks: late fee waiver policy v3, annual fee waiver policy v2, cross-product handling guidance.
Procedural guidance: the retrieved policy passages say “for multi-product fee disputes, verify identity once, then evaluate each fee independently against its applicable policy, then apply waivers selectively.”
Precedent: two past traces where agents handled similar cases by combining the single-product workflows.

The procedural guidance is the critical retrieved artifact. It tells the agent what sequence to compose.

Phase 3 (Plan from atomic tools): Using the retrieved guidance, the agent composes a plan from the atomic tool catalog:

plan:
  tasks:
    - T1: verify_identity
    - T2: get_fee_event(EVT_88291)             # late fee
    - T3: get_fee_event(EVT_88450)             # annual fee
    - T4: check_waiver_eligibility(EVT_88291)
    - T5: check_waiver_eligibility(EVT_88450)
    - T6: waive_fee(EVT_88291) if T4.eligible
    - T7: waive_fee(EVT_88450) if T5.eligible
    - T8: create_ticket
    - T9: send_confirmation_email

Eight atomic tool calls instead of one workflow call. The agent reasoned its way to this sequence from the retrieved policy text and precedent.

End-to-end wall-clock: ~1.9 seconds. Slower than Scenario A, broader harness surface, more places for things to go wrong — but the case got resolved using only retrieval and atomic composition.

Closing

Three takeaways for the Tools and Actions pillar:

Tools are declarative objects, not function calls: Name, description, schemas, side-effects, permissions, idempotency, governance — all of it lives in the tool definition. “What can the agent do?” is a graph query, not a code search. (And harness scrutiny scales with stakes: light on read tools, heavy on write tools, heaviest on write-external tools.)
Two paths, one direction of travel: Workflow tools = direct invocation, no retrieval, fast, audited (Scenario A). Atomic tools composed via retrieval = flexible, novel-case capable, slower (Scenario B). Production systems run both, but the goal is to graduate recurring patterns from B to A over time. Today’s discovery path is tomorrow’s production rail.
Goal, plan, task, action are four different things: The agent’s behavior is reviewable precisely because each layer is a refinement of the one above. Confusing them is the most common architectural mistake in production agents.

Gen AI Insights

Tools and Actions: How AI Agents Act on the World

1. Goal, plan, task, action

2. What a tool actually is

3. How does the agent decide which tool to call?

Pattern A: Description-based selection

Pattern B: Workflow-as-a-tool

When to use which

4. Where tools and workflows live

5. Maria’s case — two scenarios

Scenario A — Workflow tool exists (the production happy path)

Scenario B — No matching workflow tool (the discovery path)

Closing

Comments

Leave a Comment

Written by Prasanth Sai

Enjoyed this article?

Tools and Actions: How AI Agents Act on the World

1. Goal, plan, task, action

2. What a tool actually is

3. How does the agent decide which tool to call?

Pattern A: Description-based selection

Pattern B: Workflow-as-a-tool

When to use which

4. Where tools and workflows live

5. Maria’s case — two scenarios

Scenario A — Workflow tool exists (the production happy path)

Scenario B — No matching workflow tool (the discovery path)

Closing

Comments

Leave a Comment

Written by Prasanth Sai

Enjoyed this article?