Thoughts on Generative AI, product leadership, and enterprise AI transformation

This is the first deep-dive in the system of intelligence series. The overview post laid out five pillars: memory, context engineering, tools, harness, and the learning loop. This post is about pillar one, the substrate where everything the agent knows about its world is stored.
Memory is the foundation. The other four pillars do nothing useful if memory is incomplete, stale, contradictory, or wrong. Context engineering can only surface what memory contains. Tools can only act on what memory describes. The harness can only validate against what memory says is policy. The learning loop is only as useful as what gets written back into memory.
We will cover five things in turn:
The DeltaBank example from the overview runs throughout. Maria is the Platinum cardholder calling to dispute a $39 late fee. We’ll look at the slice of memory that her call touches.
The single most common confusion in agent design is treating the model’s context window as “memory.” It isn’t. The context window is working memory: the bounded set of tokens the model attends to during a single turn.
Memory in the sense that matters for a system of intelligence is the substrate, the persistent store of everything the agent might need to know about its world. The CRM record of every customer. The text of every policy document. The history of every conversation. The decision the agent made on call #4,212 last Thursday and why.
A useful taxonomy comes from the CoALA framework, which adapted cognitive psychology for LLM agents. It distinguishes three roles long-term memory has to play:
waive_fee tool requires the fee_waiver_apply permission.
In most enterprise stacks today, the three roles are scattered. Semantic memory lives in the CRM. Episodic memory lives in call logs or a conversation table. Procedural memory lives in workflow tools or knowledge base platforms. This separation is the single largest source of system-of-intelligence dysfunction.
The premise of a context graph is to collapse that federation into the substrate itself.
A distinction worth naming before we get into graph mechanics. The substrate splits into two layers:
Trusted knowledge is constructed from raw knowledge through the harness gates we cover in Section 4. Both layers persist; neither is throwaway. Raw is the provenance anchor that lets you ask “where did this trusted fact actually come from?” six months later.
Content enters the system at one of two layers:
The architectural rule: trusted knowledge has two construction paths but one validation discipline. Manual curation does not get to skip the write-gates; it gets to skip the discovery-gates.
This vocabulary — raw vs trusted — runs through the rest of the post and the rest of the series.

A context graph is a single connected graph that holds all three memory roles together, in one database, with one query language, where shared entities are the same node across roles.
Why a graph and not just a bigger vector store? Because semantic and structural questions are different.
Vector search answers semantic questions: what does the policy say about waivers above $50? It does this by similarity, finding passages whose embeddings are nearest to the query’s embedding.
Graph traversal answers structural questions: every Platinum waiver policy authored by Carlos and updated in the last year, alongside the prior decision traces that cited each one. No embedding model in the world ranks that correctly from a similarity query. It is a structural query. It needs nodes, edges, and traversal.
Real systems use both. The graph is the catalog and the relationship map; the vector store holds the prose. We will get to the retrieval side in the next post. This post is about what the graph holds.
Neo4j’s recent work on context graphs popularized an entity taxonomy worth borrowing wholesale. They call it POLE+O, five top-level entity categories that cover most enterprise domains:
| | Category | DeltaBank examples |
| --- | --- | --- |
| P | Person | Maria, Carlos (policy owner), Janet (senior rep) |
| O | Organization | DeltaBank, Card Services dept, Visa |
| L | Location | San Francisco, Chicago HQ, Manila call center |
| E | Event | Payment missed, late fee charged, support call |
| +O | Object (everything else) | Platinum card, account, waiver policy, policy PDF |
POLE+O is a starting point, not a finished taxonomy. Every domain has entities that resist neat categorization: a ‘contract’ is an object, an event, and a relationship all at once. Some domains will need additional or different top-level categories entirely. Healthcare might add :Condition and :Procedure as first-class categories. Manufacturing might add :Part and :Process. Logistics might split :Location into :Origin and :Destination. The point of starting from POLE+O is not that these five categories cover everything; it is that the categories that do fit transfer cleanly across domains. A :Person in your banking model is a :Person in your fraud-detection model. You compose ontologies on top of POLE+O rather than reinventing the basics for each use case.
For Maria’s late-fee dispute, the slice of the graph that the system needs spans all three memory roles and all five POLE+O categories.

The arrows are the point. Maria the customer (semantic) is the same node as Maria the participant in CONV_7799 (episodic) is the same node as Maria the subject of TRACE_8821 (reasoning). The waiver policy doc cited inside the decision trace is the same node that defines the policy that applies to her tier. Carlos who authored the doc is the same node as the employee in Card Services who reviews escalations.
This cross-stitching is why “show me every Platinum waiver decision in the last 90 days, the policy citation the agent used in each one, and which doc author wrote that policy” becomes a single graph query, instead of three queries against three systems joined by string-matching on names.
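To make the single-query claim concrete, here is a minimal sketch in plain Python over an in-memory edge list. The node ids other than TRACE_8821, the edge names (ABOUT, CITED, AUTHORED), and the dates are illustrative assumptions, not DeltaBank’s actual schema; a production system would express the same traversal in its graph database’s query language.

```python
from datetime import date, timedelta

# Hypothetical slice of the graph. Ids, dates, and edge names are assumptions.
nodes = {
    "maria": {"label": "Customer", "tier": "Platinum"},
    "carlos": {"label": "Person"},
    "waiver_doc": {"label": "Document"},
    "TRACE_8821": {"label": "DecisionTrace", "kind": "waiver", "decided_on": date(2026, 5, 12)},
    "TRACE_0001": {"label": "DecisionTrace", "kind": "waiver", "decided_on": date(2025, 1, 3)},
}
edges = [
    ("TRACE_8821", "ABOUT", "maria"),
    ("TRACE_8821", "CITED", "waiver_doc"),
    ("TRACE_0001", "ABOUT", "maria"),
    ("TRACE_0001", "CITED", "waiver_doc"),
    ("carlos", "AUTHORED", "waiver_doc"),
]

def targets(src, rel):
    """Follow outgoing edges of one type from a node."""
    return [d for s, r, d in edges if s == src and r == rel]

def sources(dst, rel):
    """Follow incoming edges of one type into a node."""
    return [s for s, r, d in edges if d == dst and r == rel]

def platinum_waiver_decisions(today, window_days=90):
    """Return (trace, cited doc, doc author) rows for recent Platinum waivers."""
    cutoff = today - timedelta(days=window_days)
    rows = []
    for trace_id, props in nodes.items():
        if props["label"] != "DecisionTrace" or props.get("kind") != "waiver":
            continue
        if props["decided_on"] < cutoff:
            continue
        customers = targets(trace_id, "ABOUT")
        if not any(nodes[c].get("tier") == "Platinum" for c in customers):
            continue
        for doc in targets(trace_id, "CITED"):
            for author in sources(doc, "AUTHORED"):
                rows.append((trace_id, doc, author))
    return rows

print(platinum_waiver_decisions(today=date(2026, 6, 1)))
```

The traversal never compares names as strings. The customer, the cited document, and the author are reached by following edges from the trace, which only works because each of them is one shared node.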
The graph doesn’t spring into existence. It’s constructed and continuously updated from five sources.
:Customer nodes. FIS Card Management holds card and transaction data; those become :Product, :Account, and :Event nodes (with :LateFeeCharged and :PaymentMissed as event subtypes). ServiceNow tickets become :Ticket nodes. Each source system maps into POLE+O with explicit transformations that get reviewed like schema migrations, because that’s what they are.
:Document node, its metadata, and pointers.
:Conversation, :DecisionTrace, and :TraceStep nodes. Every tool call generates an :Action node linked to the :Tool it invoked and the resulting :Outcome. This is the agent writing to its own memory in real time, and it is what makes the graph a living artifact rather than a one-time snapshot.
:Rule nodes, named workflows. The hand-crafted backbone the automatic sources extend.
The context graph cannot be built entirely from structured sources. Some of the most important content (what this organization actually does, who its customers are, what its tone is, what its non-negotiables are) exists nowhere as a structured record. It lives in people’s heads, in scattered docs, in onboarding decks. If the agent can’t see it, the agent can’t use it.
A serious context-graph construction process creates a dedicated authored surface for organizational context, a structured folder inside trusted storage where humans write about the organization itself. Not a wiki dump. A templated authoring surface with a fixed schema and a review path:
/trusted_knowledge/organization/
    overview.md             # mission, business model, scale
    products.md             # product catalog
    services.md             # service offerings with SLAs
    customer_segments.md    # tier definitions
    tone_and_voice.md       # communication norms, what NOT to say
    policies/               # escalation, authorization, regulatory
    people/                 # org chart, decision authorities
Every file in this surface is authored to a schema, carries mandatory metadata (author, last-reviewed date, owning team), and goes through review. The construction pipeline parses these files into the object graph (products, services, segments) and the organizational graph (people, roles, decision authority), enriching what comes in from the structured systems.
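A minimal sketch of the mandatory-metadata check, assuming each authored file opens with simple `key: value` header lines followed by a blank line. That header format and the field names author, last_reviewed, and owning_team are assumptions for illustration; the post only says the metadata is mandatory, not how it is encoded.

```python
# Assumed field names for the three mandatory metadata items.
REQUIRED_METADATA = ("author", "last_reviewed", "owning_team")

def parse_header(text):
    """Read 'key: value' lines up to the first blank line (assumed format)."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

def missing_metadata(text):
    """Return the mandatory fields that are absent or empty."""
    meta = parse_header(text)
    return [field for field in REQUIRED_METADATA if not meta.get(field)]

# Hypothetical file content: last_reviewed has been left out.
draft = "author: Carlos\nowning_team: Card Services\n\nTone and voice guidance goes here."
print(missing_metadata(draft))
```

A file that returns a non-empty list would be rejected before the construction pipeline parses it into the graph.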
The reason building a context graph is harder than building a knowledge graph is the same reason it’s more valuable: the entities have to match across sources.
Salesforce’s “Maria Lopez” with email mlopez@gmail.com has to be the same node as FIS’s customer ID 4421 has to be the same node as the speaker named “Maria” in the call transcript from May 12. If they end up as three different nodes, the cross-stitching collapses and you’re back to federation.
Entity resolution (matching, deduplication, conflict handling) is the most under-discussed and most consequential piece of context graph construction. It’s also where the harness does its hardest work, which is the next section.
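As a sketch of the matching step only, the snippet below links records that share a deterministic identifier, using a small union-find. It assumes the Salesforce record carries the FIS customer ID and that the transcript was tagged with it when the caller authenticated; both are assumptions made for illustration. Real entity resolution also needs fuzzy matching and conflict handling, which this does not attempt.

```python
from collections import defaultdict

# Hypothetical records; record ids and key formats are assumptions.
records = [
    {"id": "sf-1", "source": "salesforce", "keys": {"email:mlopez@gmail.com", "fis:4421"}},
    {"id": "fis-4421", "source": "fis", "keys": {"fis:4421"}},
    {"id": "call-0512", "source": "transcript", "keys": {"fis:4421"}},
    {"id": "sf-2", "source": "salesforce", "keys": {"email:other@example.com"}},
]

def resolve(records):
    """Group records that share at least one identifier, transitively."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # identifier -> first record id that carried it
    for r in records:
        for key in r["keys"]:
            if key in first_seen:
                parent[find(r["id"])] = find(first_seen[key])
            else:
                first_seen[key] = r["id"]

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r["id"])].add(r["id"])
    return list(clusters.values())

for cluster in resolve(records):
    print(sorted(cluster))
```

Each cluster becomes one node in the graph. The three Maria records collapse into a single cluster, which is the precondition for the cross-stitching described earlier.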
Memory has a failure mode that doesn’t exist for stateless systems: bad writes are permanent. A stateless model can produce a wrong answer in one turn and a right answer in the next. A bad fact written to memory poisons every retrieval that touches it, until somebody finds and removes it. Six months in production with no harness on memory, and the graph isn’t a context graph anymore, it’s a slow-motion data quality incident.
The harness on memory is the deterministic layer that gates every write. (Reads are governed by context engineering, covered in the next post.)
:Customer must have an id, a tier, and a created_at. A :LateFeeCharged must have an amount and a charged_to edge to an :Account. Schema enforcement is deterministic and fast, it catches the most common failure: an extraction pipeline that started emitting nodes in a slightly different shape after a model upgrade.
source, confidence, and timestamp properties. If a fact was extracted from a PDF, the node points back at the PDF. If it was learned from a conversation, it points back at the conversation. If a human curated it, it points back at the curator. Provenance is what makes audit possible six months later and what powers the next check.
:Property node on the :Customer, it gets routed to a secured store, with the graph holding a reference token instead. Sensitive data that does belong in the graph gets scoped: only certain queries, from certain roles, can traverse to it. The harness enforces this at write time and again at read time.
The pattern is intercept-validate-or-reject, the same shape as the harness around tool calls in the overview. The difference is that the consequences of memory writes are slower-burn. A bad tool call hurts one customer in one turn. A bad memory write hurts every customer whose retrieval touches it for the next six months. Memory deserves more harness scrutiny than tool execution, not less. Most teams have it the other way around.
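The intercept-validate-or-reject shape can be sketched as a pure function that every write passes through. The required properties and the charged_to edge come from the examples above; the function name, the data layout, and the example values are assumptions, and only the schema and provenance checks are shown.

```python
# Required properties per label, taken from the examples in the text.
REQUIRED_PROPS = {
    "Customer": {"id", "tier", "created_at"},
    "LateFeeCharged": {"amount"},
}
# Required outgoing edge per label: (relationship, target label).
REQUIRED_EDGES = {"LateFeeCharged": ("charged_to", "Account")}
# Provenance properties every write must carry.
PROVENANCE_PROPS = {"source", "confidence", "timestamp"}

def validate_write(label, props, edges=()):
    """Return a list of violations; an empty list means the write is accepted."""
    errors = []
    required = REQUIRED_PROPS.get(label, set()) | PROVENANCE_PROPS
    for name in sorted(required - set(props)):
        errors.append(f"missing property: {name}")
    rule = REQUIRED_EDGES.get(label)
    if rule is not None and rule not in set(edges):
        errors.append(f"missing edge: {rule[0]} -> :{rule[1]}")
    return errors

# Hypothetical write: a late fee with provenance but no account edge.
fee = {"amount": 39, "source": "fis", "confidence": 1.0, "timestamp": "2026-05-12"}
print(validate_write("LateFeeCharged", fee))
print(validate_write("LateFeeCharged", fee, edges=[("charged_to", "Account")]))
```

Because the gate is deterministic, a change in extraction output after a model upgrade shows up as a spike in rejections rather than as silently malformed nodes.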
Two broad paths exist, and most production systems use both:
For high-stakes domains, where a bad fact in trusted memory has real consequences (customer-facing agents in support, banking, healthcare, legal, compliance), the default is that every piece of content lands in raw and waits for a human to approve promotion. No auto-promotion, regardless of confidence score or source pedigree.
The pattern in these systems: extraction pipelines push everything into raw with full provenance, the review queue surfaces candidates organized by source and confidence band, and a domain expert (often the policy or knowledge owner) approves, edits, or rejects each batch. Promotion happens in deliberate cycles, not in a stream.
For lower-stakes domains (internal knowledge bases, engineering documentation, exploratory analytics, internal-only agents), full human review on every promotion is overkill. Teams instead author deterministic auto-promotion rules that codify their trust model.
The principle is unchanged: humans define how trust gets earned. What differs is that the team writes those rules once and the harness enforces them automatically thereafter. Concrete examples of auto-rules teams write:
/engineering/standards/ space with a non-null owner field and a last_reviewed date inside 90 days auto-promote on every CDC sync.”
record_type = Product and lifecycle_stage = GA auto-promote; anything in Beta stays in raw.”
confidential = true which always require human review.”
A useful default: start with everything human-reviewed, then graduate specific patterns to auto-rules once the team has watched them work for a quarter. Teams that flip this, starting with auto-rules and trying to add review later, discover that bad content has already spread through the substrate. Promotion rules are easy to relax and nearly impossible to tighten without expensive cleanup.
The architectural rule: the harness on promotion is itself a use-case-specific design decision. A banking CX agent and an internal engineering-docs agent should not have the same promotion policy. The PM and SME for each agent decide what trust looks like in that domain, write it down as deterministic rules (or as a “human review only” default), and the harness enforces what they wrote.
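A sketch of what “write it down as deterministic rules” might look like for a lower-stakes agent, using the three example rules above. The item fields (path, owner, last_reviewed, record_type, lifecycle_stage, confidential) follow those examples, but the function, the return labels, and the rule ordering are assumptions.

```python
from datetime import date, timedelta

def promotion_decision(item, today):
    """Return 'human_review', 'auto_promote', or 'stay_raw' for a raw item."""
    # Confidential content is checked first so no later rule can override it.
    if item.get("confidential"):
        return "human_review"
    last_reviewed = item.get("last_reviewed")
    if (
        item.get("path", "").startswith("/engineering/standards/")
        and item.get("owner")
        and last_reviewed is not None
        and today - last_reviewed <= timedelta(days=90)
    ):
        return "auto_promote"
    if item.get("record_type") == "Product" and item.get("lifecycle_stage") == "GA":
        return "auto_promote"
    # Default: nothing is promoted unless a rule explicitly earns it.
    return "stay_raw"

today = date(2026, 6, 1)
fresh_doc = {"path": "/engineering/standards/api.md", "owner": "priya",
             "last_reviewed": date(2026, 5, 1)}
beta_product = {"record_type": "Product", "lifecycle_stage": "Beta"}
print(promotion_decision(fresh_doc, today), promotion_decision(beta_product, today))
```

For a high-stakes agent, the same function would reduce to a single line that always returns human review, which is the sense in which promotion policy is a per-use-case decision.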

A context graph is a living artifact. The world changes. Policies get updated. Customers change tier. Transactions happen continuously. The agent’s own usage produces new conversations and decision traces every minute. Keeping memory fresh is its own ongoing discipline.
:Conversation, :DecisionTrace, :TraceStep, and :Outcome nodes. These are not optional; they’re how the agent learns from its own work. Maria’s call generates a trace showing why her fee was waived; six months from now, another agent handling a comparable case can retrieve that trace as precedent.
waiver-policy-v3.pdf gets replaced by waiver-policy-v4.pdf, the graph creates a new :Document node, links it to the :Policy it defines, and marks the v3 node as superseded. Old decision traces that cited v3 still point at v3; the version is preserved for audit, but new retrieval surfaces v4. Nothing is deleted; old facts are marked superseded but retained.
:DecisionTrace node with :Outcome {reviewed_by: human, correct: false, correction: ...}. Tomorrow’s retrieval picks these up as cautionary precedent. The loop is closed: memory feeds context, context feeds the model, the model proposes actions, the harness gates them, the actions hit systems of record, outcomes get written back into memory.
Every update to trusted knowledge cascades through the rest of the system, but the cascade is not uniform. Different artifacts have different review bars:
The architectural rule: the more an artifact governs the agent’s behavior, the higher the review bar for changes. A fact correction propagates automatically. A traversal-pattern change requires sign-off.
Beyond writes and propagation, four maintenance disciplines keep the graph from accumulating noise.
Decay: Some facts are inherently time-bound. Maria’s account balance from last Tuesday is not relevant by Friday. The graph either expires these facts (delete the node) or moves them to a time-series store with a pointer from the graph (preserve audit, free the working substrate). The choice depends on the data type and the audit requirement.
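A minimal sketch of a decay sweep, assuming each fact carries a type and an observed_at timestamp and that time-to-live values are configured per type. The two-day TTL for balances and the dates are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed TTLs per fact type; a type with no entry never decays.
TTL_BY_TYPE = {"account_balance": timedelta(days=2)}

def sweep(facts, now):
    """Split facts into those still live and those past their TTL."""
    live, expired = [], []
    for fact in facts:
        ttl = TTL_BY_TYPE.get(fact["type"])
        if ttl is not None and now - fact["observed_at"] > ttl:
            expired.append(fact)
        else:
            live.append(fact)
    return live, expired

facts = [
    {"type": "account_balance", "customer": "maria", "observed_at": datetime(2026, 5, 12)},
    {"type": "tier", "customer": "maria", "observed_at": datetime(2024, 1, 1)},
]
# Three days after the balance was observed, it is past its TTL.
live, expired = sweep(facts, now=datetime(2026, 5, 15))
print([f["type"] for f in live], [f["type"] for f in expired])
```

What happens to the expired list is the policy decision described above: delete the nodes, or move them to a time-series store and leave a pointer in the graph.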
Supersession: Versioned facts — policies, product definitions, org assignments — never get deleted. They get marked superseded. Carlos used to own the waiver policy; he handed it to Priya in March. The graph holds (Carlos)-[:OWNED_UNTIL:2026-03-01]->(Policy) and (Priya)-[:OWNS_SINCE:2026-03-01]->(Policy). Both edges exist. A query for “current owner” filters by status; an audit query for “who owned this in February” returns Carlos.
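One way to model this, sketched under the assumption that the handover date is stored as validity properties on the ownership edge rather than encoded in the relationship name. The valid_from and valid_to fields and the half-open interval convention are choices made for this illustration; Carlos’s start date is not given, so it is left open.

```python
from datetime import date

# Both edges are retained; None means the bound is open or unknown.
ownership_edges = [
    {"owner": "Carlos", "policy": "waiver_policy", "valid_from": None, "valid_to": date(2026, 3, 1)},
    {"owner": "Priya", "policy": "waiver_policy", "valid_from": date(2026, 3, 1), "valid_to": None},
]

def owner_on(policy, day):
    """Return who owned the policy on a given day (valid_from <= day < valid_to)."""
    for edge in ownership_edges:
        if edge["policy"] != policy:
            continue
        started = edge["valid_from"] is None or edge["valid_from"] <= day
        not_ended = edge["valid_to"] is None or day < edge["valid_to"]
        if started and not_ended:
            return edge["owner"]
    return None

print(owner_on("waiver_policy", date(2026, 2, 15)))  # audit query for February
print(owner_on("waiver_policy", date(2026, 4, 1)))   # current-owner style query
```

The current-owner query and the audit query are the same function with a different date, which is the practical payoff of never deleting the superseded edge.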
Consolidation: Patterns observed across many decision traces get promoted to procedural memory. Across 200 Platinum waiver decisions, the agent has been consistently approving cases where payment-on-time ratio is above 90%. That’s not just a pile of episodic data anymore — it’s a heuristic. A consolidation pass extracts the pattern and writes it back as a :Heuristic node connected to the underlying traces as evidence. The agent’s future retrieval picks up the heuristic directly, saving the cost of re-deriving it from raw episodes every time. Consolidation is how procedural memory grows organically, not just through human authoring.
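A consolidation pass can be sketched as a check that a candidate rule is both well supported and consistently followed before it is written back. The trace fields, the support and agreement thresholds, and the shape of the resulting heuristic record are all assumptions; how candidate rules get proposed in the first place is outside this sketch.

```python
def consolidate(traces, threshold=0.90, min_support=3, min_agreement=0.95):
    """Propose a :Heuristic record if approvals above the threshold are consistent."""
    matching = [t for t in traces if t["on_time_ratio"] > threshold]
    if len(matching) < min_support:
        return None  # not enough episodes to call it a pattern
    approved = [t for t in matching if t["approved"]]
    agreement = len(approved) / len(matching)
    if agreement < min_agreement:
        return None  # the agent has not been consistent enough
    return {
        "label": "Heuristic",
        "rule": f"approve waiver when on_time_ratio > {threshold}",
        "agreement": agreement,
        "evidence": [t["id"] for t in approved],  # links back to the traces
    }

# Hypothetical traces standing in for the 200 Platinum waiver decisions.
traces = [
    {"id": "T1", "on_time_ratio": 0.97, "approved": True},
    {"id": "T2", "on_time_ratio": 0.93, "approved": True},
    {"id": "T3", "on_time_ratio": 0.95, "approved": True},
    {"id": "T4", "on_time_ratio": 0.60, "approved": False},
]
print(consolidate(traces))
```

The returned record is still a write to memory, so it would pass through the same gates and promotion policy as any other fact before the agent can retrieve it.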
Garbage collection: A continuous-scan discipline that surfaces content gone stale or low-value. The scanner flags:
Flagged content does not auto-delete. It surfaces to a queue where a human decides: refresh, archive, or remove. Without this discipline, the substrate accumulates rot — and rot in the substrate compounds, because the agent replicates patterns it sees, even bad ones.
Memory is the substrate. It is what the agent knows. Everything else in a system of intelligence either reads from it (context engineering), acts on what it describes (tools), validates against what it says is true (harness), or writes back into it (learning loop).
The four design commitments that separate a memory substrate from a database are:
Get those four right and you have a context graph the rest of the system can stand on. Get any of them wrong and you are watching data quality erode in slow motion.
In the next post, we cover Retrieval with the Context Graph — how the agent actually traverses and queries the substrate we have built here, and how the graph pre-filters what vector search has to look at. For the broader context-engineering discipline including window compaction, see the ultimate guide.

Gen AI Product Leader · Leads AI Applications and Search at eGain
I partner with PMs and engineers to drive production adoption of AI across Fortune 500 enterprises in the US and Europe. IIT Bombay alumnus; previously co-founded Selekt.in and built ChatGen.ai. The thesis I evangelize: knowledge is the harness for AI applications.