Thoughts on Generative AI, product leadership, and enterprise AI transformation

This is the fourth deep-dive in the system of intelligence series. The first three covered memory, retrieval, and tools and actions. This post is about the layer that wraps all of them.
A model writes confident answers and calls tools with plausible arguments. It never says “I don’t know.” None of that becomes a reliable production system on its own. The discipline that turns a model into a working agent is harness engineering: everything wrapped around the model that makes its behavior predictable, auditable, and safe.
A few months back, I sat in on a triage call for a customer-support agent we’d built on a small model: cheap, fast, and good enough for the routine queries that make up most of the volume. A customer had asked for a refund on a duplicate charge. The agent confirmed the policy applied, called the refund API, got a 200 OK back, and told the customer: “Your refund of $147 has been processed. You should see it in your account within 3-5 business days.”
Except it hadn’t been processed. The 200 response meant only that the API had received the request. The response body contained status: "pending_review" because the amount exceeded a threshold that required a human approver. The agent saw 200 OK, assumed success, and confidently promised the customer something that wasn’t true.
The customer waited five business days. Nothing happened. They called back angry. Now we had a refund problem and a trust problem.
The temptation was to fix the prompt. Always check the response body. Use exact wording from the API. Don’t say “processed” unless status is “completed.” The prompt got longer. The agent got more cautious. And then it still got things wrong, just in different ways.
The actual fix had nothing to do with the prompt. We layered four harness pieces around the same model:
status had to be one of ["completed", "pending_review", "denied"], and the harness forced the agent to acknowledge which value it received.
"completed" allowed “your refund has been processed.” "pending_review" mapped to specific language about review being underway. The agent could not improvise customer-facing phrasing on this path.
After the harness landed, the same small model handled refunds reliably. No prompt changes. The model’s “intelligence” hadn’t improved. What improved was everything around it: bounded responses, schema-enforced status mapping, post-action verification, and an escalation path when uncertainty showed up.
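As a rough illustration of the schema-enforced status mapping described above, here is a minimal Python sketch. The three status values come from the story; the function name, the exception, and the exact customer-facing wording (especially for "denied") are hypothetical placeholders, not the phrasing the original system used.

```python
from enum import Enum


class RefundStatus(str, Enum):
    COMPLETED = "completed"
    PENDING_REVIEW = "pending_review"
    DENIED = "denied"


# Hypothetical approved wording; real copy would come from the policy owner.
APPROVED_PHRASES = {
    RefundStatus.COMPLETED: "Your refund of ${amount} has been processed.",
    RefundStatus.PENDING_REVIEW: "Your refund of ${amount} is under review by our team.",
    RefundStatus.DENIED: "Your refund request of ${amount} was not approved.",
}


class EscalateToHuman(Exception):
    """Raised when the response does not match the expected schema."""


def customer_message(http_status: int, body: dict, amount: str) -> str:
    # A 200 only means the request was received; the body decides the wording.
    if http_status != 200:
        raise EscalateToHuman(f"unexpected HTTP status {http_status}")
    try:
        status = RefundStatus(body.get("status"))
    except ValueError:
        # Unknown or missing status: do not let the model improvise.
        raise EscalateToHuman(f"unknown refund status: {body.get('status')!r}") from None
    return APPROVED_PHRASES[status].format(amount=amount)
```

With this shape, a 200 response carrying status "pending_review" can only ever produce the review wording, and anything outside the schema goes to a human instead of to the customer.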
That experience crystallized the discipline for me: the model is rarely the bottleneck; the systems around it are. A cheap small model with a good harness reliably beats an expensive large model with a bad one.
The formula Mitchell Hashimoto coined captures it:
Agent = Model + Harness
The model is the brain. The harness is everything else: the body that gives the brain hands, eyes, reflexes, and a sense of when to stop.
A useful operational definition: if you’re not the model, you’re the harness. Anything in the running agent system that isn’t the model itself (code, config, prompts, hooks, tool descriptions, infrastructure, storage) is part of the harness.
That includes the persistent storage substrates the agent reads from and writes to. The context graph from the memory deep-dive is a harness primitive. So are the document store and the vector index sitting next to it: every retrieval flows through them, every memory write lands in them, every audit traces back to them. In a coding-agent harness, this role is played by the filesystem and git; in an enterprise CX harness, it’s the context graph, the document corpus, the rules on tools, and the structured knowledge base.
With that framing, every modern harness has four runtime components. Different teams use different vocabulary, but the architecture is consistent across the literature.
Guides: CLAUDE.md, AGENTS.md, or SKILL.md. Guides direct the model toward the right behavior in the first place.
The refund example from the opening used all four: the schema validator and ledger verifier are sensors, the status-to-language map and incident hook are guides and guardrails, the case logging is observability. No prompt changes. All four pieces sit outside the model, where they can be tested, versioned, and audited as software.
Production harnesses are mostly composed of:
Guide files, .md files that the agent reads.
AGENTS.md / CLAUDE.md at the repo root describe the project’s conventions, dangerous commands, file-modification rules, and the team’s authoring norms. The agent loads these into context at session start.
Domain .md files like DEPLOYMENT.md or BANKING_POLICY.md get loaded conditionally when the agent’s task touches that area.
PreToolUse hooks fire before any tool executes. They run schema validation, permission checks, idempotency checks, range checks, and rate limits.
PostToolUse hooks fire after a tool returns. They run output validation, ledger checks, and side-effect verification.
Hooks live in a hooks.py or hooks.ts file and are registered with the runtime.
Skill definitions, SKILL.md files (the Anthropic Skills pattern) that bundle tool definitions with usage instructions. These are guides + tool definitions in one artifact, loaded on demand.
MCP servers, the programs that expose tools and resources to the agent through a typed protocol, enforcing access controls and logging at the protocol layer.
Evaluation suites, test runners that replay decision traces against the current harness rules to catch regressions before they ship.
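To make the PreToolUse and PostToolUse hooks listed above concrete, here is a minimal Python sketch of a tool runner that fires checks before and after each call. Everything in it is hypothetical: the ToolRunner class, the hook signatures, the refund tool, and the 100.0 limit are illustrative assumptions, not the API of any particular agent runtime.

```python
from typing import Any, Callable

# Hypothetical hook signatures; not the API of any specific agent runtime.
PreHook = Callable[[str, dict], None]        # raises to block the call
PostHook = Callable[[str, dict, Any], None]  # raises to flag the result


class HookBlocked(Exception):
    """Raised by a hook to stop a tool call or reject its output."""


class ToolRunner:
    def __init__(self) -> None:
        self.pre_hooks: list[PreHook] = []
        self.post_hooks: list[PostHook] = []
        self.tools: dict[str, Callable[..., Any]] = {}

    def run(self, name: str, args: dict) -> Any:
        for hook in self.pre_hooks:   # fire before the tool executes
            hook(name, args)
        result = self.tools[name](**args)
        for hook in self.post_hooks:  # fire after the tool returns
            hook(name, args, result)
        return result


def range_check(name: str, args: dict) -> None:
    # Example range check; the 100.0 limit is an assumed threshold.
    if name == "refund" and args.get("amount", 0) > 100.0:
        raise HookBlocked("amount exceeds auto-approval limit")


def output_check(name: str, args: dict, result: Any) -> None:
    # Example output validation: every tool result must carry a status.
    if not isinstance(result, dict) or "status" not in result:
        raise HookBlocked("tool output missing 'status'")


runner = ToolRunner()
runner.tools["refund"] = lambda amount: {"status": "completed", "amount": amount}
runner.pre_hooks.append(range_check)
runner.post_hooks.append(output_check)
```

Because the hooks are plain functions outside the model, each one can be unit-tested and versioned like any other code.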

Teams that succeed with an enterprise CX agent harness don’t build everything at once. A practical sequence:
AGENTS.md, domain policy .md files, tool descriptions. Once you have a concrete use case.
Each layer builds on the previous one. Teams that try to start at layer 4 or 5 without the lower layers find that their workflow tools fire against an unreliable substrate and their MCP integrations expose tools the harness can’t gate properly.
The amux harness guide puts the rule for what belongs in these files cleanly: “Every line in your AGENTS.md should trace to a real agent failure. If you can’t point to the specific mistake that prompted the line, delete it.” This applies to every piece of the harness. Each rule, hook, and constraint exists because the agent did the wrong thing at least once.

So far we’ve talked about the harness at runtime: what fires during a live conversation while the agent is acting. But there is an entire harness discipline that runs before any agent ever sees the substrate: the gates around how content enters storage and how it becomes part of the context graph in the first place.
Two layers, in the order they fire:
Before any document, policy, transcript, or organizational fact becomes part of the agent’s substrate, something has to decide whether it belongs there at all. This is the layer most teams skip entirely. They dump every PDF, every wiki page, every meeting transcript into a vector index and call it knowledge. The result is a substrate full of duplicates, outdated drafts, marketing language, and content nobody ever queries.
Pre-ingestion produces records that have passed the entry gate but have not yet been transformed into graph nodes. Content can also enter directly into trusted knowledge when a known authority is the source (a compliance officer publishing a policy, a PM authoring a workflow definition). Both entry paths run through the pre-ingestion gates; trusted entry skips only the discovery gates, not the validation gates.
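The two entry paths can be sketched in a few lines of Python. This is an illustration under stated assumptions: the Candidate fields and gate functions are hypothetical, and which gates count as "discovery" versus "validation" is a guess made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    doc_id: str
    metadata: dict = field(default_factory=dict)
    is_canonical: bool = True
    query_hits: int = 0           # how often people ask about this topic
    trusted_author: bool = False  # e.g. a compliance officer publishing a policy


# Assumed split between discovery and validation gates; hypothetical names.
def canonical_gate(c: Candidate) -> bool:
    return c.is_canonical


def query_volume_gate(c: Candidate) -> bool:
    return c.query_hits > 0


def metadata_gate(c: Candidate) -> bool:
    required = ("author", "last_reviewed", "owning_team")
    return all(c.metadata.get(key) for key in required)


DISCOVERY_GATES = [canonical_gate, query_volume_gate]
VALIDATION_GATES = [metadata_gate]


def admit(c: Candidate) -> bool:
    # Trusted entry skips discovery gates only; validation always runs.
    gates = list(VALIDATION_GATES)
    if not c.trusted_author:
        gates = DISCOVERY_GATES + gates
    return all(gate(c) for gate in gates)
```

A policy from a known authority is admitted even if nobody has queried the topic yet, but it is still rejected if its metadata is incomplete.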
A pre-ingestion harness asks four things of every candidate source:
:SUPERSEDED_BY edge keeps the lineage queryable.
null everywhere.
Sources break into three categories. Each contributes to different parts of the eventual graph and runs through slightly different gates:
The context graph cannot be built entirely from structured sources. Some of the most important content (what the organization actually does, what its products and services are, who its customers are, what its tone is, what its non-negotiables are) exists nowhere as a structured record. It lives in people’s heads, in scattered docs, in onboarding decks. If the agent can’t see it, the agent can’t use it.
A serious construction harness creates a dedicated authored surface for organizational context — a structured folder inside trusted storage where humans write about the organization itself. Not a wiki dump. A structured authoring surface with a fixed template and a review path. Something like:
/trusted_knowledge
  /organization
    overview.md              # mission, business model, scale
    products.md              # product catalog with structured fields
    services.md              # service offerings with SLAs
    customer_segments.md     # who we serve, tier definitions
    tone_and_voice.md        # communication norms, what NOT to say
  /policies
    escalation.md
    authorization_limits.md
    regulatory_constraints.md
  /people
    org_chart.md             # human-authored, syncs with HRIS
    decision_authorities.md
Every file in this surface is authored to a schema and should carry the same mandatory metadata as any other trusted source — author, last-reviewed, owning team. Every change goes through review. The construction pipeline parses these files into the object graph (products, services, segments) and the organizational graph (people, roles, decision authority), enriching what comes in from the structured systems.
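A check for the mandatory metadata, combined with the 90-day review rule this section applies to the authored surface, might look like the following Python sketch. The function name and field keys are hypothetical, and it assumes last_reviewed has already been parsed into a date.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("author", "last_reviewed", "owning_team")
REVIEW_WINDOW = timedelta(days=90)


def review_findings(path: str, meta: dict, today: date) -> list[str]:
    """Return findings for one authored file; an empty list means clean."""
    findings = [
        f"{path}: missing {name}" for name in REQUIRED_FIELDS if not meta.get(name)
    ]
    reviewed = meta.get("last_reviewed")
    # Assumes last_reviewed is a datetime.date, not a raw string.
    if isinstance(reviewed, date) and today - reviewed > REVIEW_WINDOW:
        age = (today - reviewed).days
        findings.append(f"{path}: last_reviewed is {age} days old")
    return findings
```

Running this over every file in the surface on a schedule turns "should carry metadata" into a rule the harness can enforce.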
The harness rules around this surface are unusual because the surface is itself part of the harness:
A file with last_reviewed more than 90 days ago triggers an alert.
Once a candidate is in trusted knowledge, the construction harness governs the transformation into context graph nodes and edges. Five gates fire on every construction pass:
Provenance: source, confidence, extracted_at, and (for inferred relationships) the extractor version. No provenance, no entry.
This is the layer that determines whether your graph is a useful retrieval substrate or a quietly poisoned one. The same five gates fire whether the input is a CRM sync, an HRIS feed, or an authored organizational-context file: different sources, one validation discipline.
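The "no provenance, no entry" rule lends itself to a small deterministic check. The sketch below is a hypothetical shape for it: the GraphWrite record and the assumption that confidence lies between 0 and 1 are mine, while the four field names come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraphWrite:
    node_id: str
    source: Optional[str] = None
    confidence: Optional[float] = None
    extracted_at: Optional[str] = None  # timestamp kept as a string here
    inferred: bool = False
    extractor_version: Optional[str] = None


def provenance_errors(w: GraphWrite) -> list[str]:
    """List the provenance fields that are missing or invalid."""
    errors = []
    if not w.source:
        errors.append("source")
    # Assumed convention: confidence is a probability in [0, 1].
    if w.confidence is None or not 0.0 <= w.confidence <= 1.0:
        errors.append("confidence")
    if not w.extracted_at:
        errors.append("extracted_at")
    # Extractor version is only required for inferred relationships.
    if w.inferred and not w.extractor_version:
        errors.append("extractor_version")
    return errors
```

A construction pipeline would reject any write for which this list is non-empty, regardless of which source produced it.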
The metagraph’s typed-edge traversal patterns play the same harness role for retrieval — they replace fuzzy text search with symbol-precise reads (“walk HAS_TIER then HAS_EVENT to :WaiverEvent nodes”) and dramatically tighten the retrieval surface against irrelevant matches. The discipline differs by domain; the architectural function is the same.
/ingestion
  /pre_gates                    # Layer 1
    canonical_selector.py
    query_volume_signal.py      # what people ask about
    metadata_required_check.py
    structure_quality_check.py
  /raw_store                    # staging area between layers
  /trusted_knowledge            # authored content lives here too
    /organization               # the authored surface
  /pipelines                    # Layer 2
    crm_sync.py
    document_extractor.py
    org_context_parser.py
    hris_feed.py
  /gates                        # Layer 2: same five gates for all inputs
    schema_validator.py
    entity_resolver.py
    provenance_check.py
    ontology_drift_detector.py
  /validation
    golden_set/
    batch_validator.py
  /review_queue                 # human approval, used by both layers
The same principle that governs runtime harness governs construction: every gate is deterministic, every rule traces to a real past failure (a bad merge, a stale doc that misled the agent, an organizational fact nobody had written down). The construction harness is upstream of every retrieval and every decision; getting it wrong propagates everywhere.
One easy diagnostic: if your team can’t answer “where did this graph node come from and what’s its provenance?” in under thirty seconds, your construction harness has holes. The authored-organizational-context surface is the cheapest insurance against the “the agent doesn’t know what we know” failure mode.
The substrate doesn’t stop changing once construction has run. Live conversations produce signals worth folding back in. Decision traces surface patterns that suggest new workflow tools. Slack and Teams threads contain conventions the agent doesn’t know yet. Conflicts between trusted facts surface over time. All of that, the continuous-improvement discipline that turns a static substrate into a learning one, is the subject of the fifth pillar, The Learning Loop. The construction harness is what makes today’s substrate trustworthy. The learning loop is what makes it better next week.

Theoretical harness rules are easier to follow when you watch them fire in sequence against a real case. Let’s walk Maria’s $39 late fee dispute from utterance to closed ticket and call out exactly which harness rules trigger at each beat. By the end we will have catalogued every harness check the system actually ran during her case.
Maria says: “I want to dispute this $39 late fee on my Platinum card.”
Before the agent ever sees that utterance, two harness layers fire:
The agent reads its available tool descriptions and reasons over Maria’s request. late_fee_dispute_workflow is a strong match.
The agent invokes late_fee_dispute_workflow with a single tool call.
Idempotency: the key LFD-CUST_4421-EVT_88291 has not been seen before. If it had been, the harness would short-circuit the call and return the prior result, preventing double execution. Pass.
Permission: the fee_waiver_initiate permission is present. Pass.
As the workflow runs, each :TraceStep write passes through the five memory-write gates from the memory post: schema validation, provenance tagging, conflict detection, deduplication, and PII classification. Maria’s $39 transaction passes cleanly; had her SSN appeared in the trace, the PII gate would have routed it to a secured store with only a reference token in the graph.
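The idempotency check in the pre-tool-use step (return the prior result when the key has been seen) can be sketched in Python as follows. The IdempotentExecutor class is hypothetical, the key format is inferred from the single example key in the walkthrough, and the in-memory dictionary stands in for whatever durable store a production harness would use.

```python
from typing import Any, Callable


class IdempotentExecutor:
    def __init__(self) -> None:
        # In-memory only; a real harness would need durable, shared storage.
        self._results: dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            # Short-circuit: return the prior result, do not execute again.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


def make_key(customer_id: str, event_id: str) -> str:
    # Format inferred from the example key LFD-CUST_4421-EVT_88291; an assumption.
    return f"LFD-{customer_id}-{event_id}"
```

If the same dispute is submitted twice, the second call returns the stored outcome and the fee is waived only once.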
When the case closes, a single :DecisionTrace is written linking all the trace steps to the outcome. The same five gates fire on that write.
The workflow completed cleanly. The agent drafts: “Maria, I’ve waived the $39 late fee on your Platinum card. The reversal should appear on your account within 24 hours.” Three sensors fire before this goes to her:
waive_fee returned status: "completed", so the phrasing “I’ve waived” is permitted. Had the status been pending_review, this phrasing would have been blocked.
The response ships. The case closes.
The model made one judgment call: which workflow tool to pick. The harness made twenty.
This is what Agent = Model + Harness looks like in operation.
One harness discipline worth calling out separately: managing the agent’s working context as a case grows. Models degrade as their context window fills — slower to reason, more likely to lose earlier instructions, more likely to confabulate. Production teams call this context rot, and a serious harness has to defend against it.
Three patterns matter in practice:
Load only SKILL.md front-matter on session start; full instructions and bundled tools load on demand when a task matches. The same idea applies to policy documents, traversal patterns, and rule sets: load what’s relevant for the current case, and leave the rest available but unloaded.
For most enterprise CX cases that resolve in seconds, context rot is a minor concern. For longer-running cases (multi-day disputes, batch reconciliation, multi-product workflows) it becomes a primary harness concern.
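The load-on-demand pattern for skills can be illustrated with a small Python sketch. The Skill class, the loader callback, and the naive name-in-task matching are all hypothetical simplifications; this is not how any particular skills runtime decides what to load.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    name: str
    description: str              # the short front-matter summary
    load_body: Callable[[], str]  # fetches the full instructions lazily
    _body: Optional[str] = None

    def body(self) -> str:
        # Load the full instructions once, on first use.
        if self._body is None:
            self._body = self.load_body()
        return self._body


def session_context(skills: list[Skill]) -> str:
    # At session start only names and descriptions enter the context.
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def expand(skills: list[Skill], task: str) -> list[str]:
    # Naive matching on the skill name; a real matcher would be richer.
    return [s.body() for s in skills if s.name in task]
```

The context cost at session start is one line per skill, and the full instructions are paid for only by the cases that need them.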
When the agent does the wrong thing, the question is: which harness component failed?
AGENTS.md or a domain .md file (guide gap)
This diagnostic framing turns harness engineering into normal software engineering. Every production incident becomes a debuggable failure with a clear remediation path.
Three takeaways for the Harness Engineering pillar:
Each line in AGENTS.md, each pre-tool-use hook, each post-write check exists because the agent did the wrong thing at least once. The harness grows by accretion, never by speculation.
The fifth and final pillar is The Learning Loop: how the system uses the decision traces, the QA corrections, and the harness incident log to improve over time. The harness catches today’s mistakes. The learning loop turns yesterday’s mistakes into tomorrow’s training data, guide entries, and workflow tools.

Gen AI Product Leader · Leads AI Applications and Search at eGain
I partner with PMs and engineers to drive production adoption of AI across Fortune 500 enterprises in the US and Europe. IIT Bombay alumnus; previously co-founded Selekt.in and built ChatGen.ai. The thesis I evangelize: knowledge is the harness for AI applications.