我为什么给 agent 加了 memory，最后又把它删了

一句话结论

我给自己的 OpenCode agent 做过一整套长期记忆：先是文件式 memory，后来切到 mem0 + Qdrant，再后来把抽取逻辑整个收回自己控制，最后还是把它默认关掉了。

原因并不复杂：自动 memory 最大的问题不是“能不能记住”，而是“记住的东西值不值得信”。当一个系统会稳定地记住一些低价值、半正确、难审计的东西时，它带来的不是 continuity，而是新的维护负担。最后我决定回到更朴素的方案：重要信息显式写进 notes/，让 notes 做 truth，而不是让一个不透明的自动记忆层做 truth。

时间线

日期	提交	变化
2026-06-06	`63cb854`	加入第一版文件式 memory 插件和 `PROTOCOL.md`
2026-06-16	`63697a4`	长期记忆切到 mem0 + Qdrant
2026-06-22	`6ca24a0`	停用 mem0 自带抽取，改为自己控制 `infer:false` 写入
2026-06-24	`b903f25`	加入基于角色的 gate，以及 `core`/`text` 去重
2026-06-26	`da36bf4`, `a5ef9e9`	默认关闭抽取和召回

一开始我想做的，其实很简单

最开始那版 memory 很朴素。我想要的东西也很朴素：agent 不要每次都从零认识我，不要每次都重新学一遍怎么跟我协作。

所以第一版直接走文件方案。每条 memory 一个 markdown 文件，notes/memory/MEMORY.md 做索引，索引在每次会话启动时注入上下文。写入协议放在 .opencode/memory/PROTOCOL.md，自动抽取逻辑放在 .opencode/plugin/memory.ts。

这个方案的优点非常明显：可见、可改、可 diff、可 grep。对单用户系统来说，它几乎是最容易信任的一种形态。

后来我为什么切到 mem0

问题在召回成本。文件式方案本质上是 eager recall：索引会一直跟着上下文走。memory 越多，上下文越大，成本就是 O(n)。

到了 6 月中旬，我开始想把召回改成 pull-based：不要把整份记忆塞进每轮对话，而是在需要时按 query 取 top-k。于是我在 63697a4 这个提交里把长期记忆切到 mem0 + Qdrant，把索引常驻注入换成 search_memories 工具。

这一步从架构上看是对的。它解决了文件索引会无限膨胀的问题，也让 memory 更像一个检索层，而不是一块不断增长的 prompt。

真正难的不是召回，而是抽取

切到向量检索之后，我很快发现：召回不是最难的部分，抽取才是。

mem0 默认的抽取器是高召回取向的。对于“不要漏掉任何可能有用的东西”这类目标，它很合理；但对于“只记录少量高价值、长期成立、关于用户本人的事实”这种目标，它会记太多。再加上我当时用的是更轻量的模型，它能看到的上下文有限，于是就开始出现各种典型脏写：

assistant 把 “User wants X” 重新表述一遍，系统就把它当成用户事实存起来
研究过程中的发现、一次性的交付总结、临时 debug 状态，被错误地写进长期记忆
“以后想做什么”“之后再跟进什么” 这种 open loop，被当成 stable truth
同一个事实因为表述略有不同，被存成多条近重复 memory

我后来有个很明确的感受：我并不是在设计一个 memory system，我是在设计一个垃圾过滤器。

于是我把控制权收了回来

接下来这轮改动，是整条线里技术上最认真的一段。我先尝试过在 mem0 的 prompt 外面继续加 gate、加 regex 清理，但那只是补丁，不是控制权。

所以到 6ca24a0 这次提交，我直接停掉了 mem0 自带的 infer:true 抽取器，改成自己写 judge：先检索邻近 memory 做去重上下文，再让 .opencode/lib/mem0-judge.ts 按 EXTRACTION_GATE.md 输出明确的 ADD/UPDATE 决策，最后用 infer:false 落库。换句话说，mem0 从“帮我判断该记什么”降级成“只负责存和搜”。

两天后，在 b903f25 里我又把 gate 收紧了一轮，核心规则是下面这几条：

WHO SAID IT：关于用户的 claim 必须有用户自己的发言作证据
TRUTH, NOT OPEN LOOPS：“想做 / 待跟进 / 之后再看” 是任务，不是真相
RESEARCH -> wiki：外部研究结论进 wiki，不进 user memory
core / text：用原子事实做去重 key，用自包含措辞做实际存储

这些改动确实让质量变好了，但它没有改变一件更底层的事：这套系统仍然在用 LLM 自动决定 truth，而 truth layer 又是一个向量库。它变得更可控了，但没有变得真正让我放心。

为什么最后还是把它关了

最终让我放弃的原因有三个。

它仍然会记很多不值得记的东西。 哪怕规则已经很重，只要抽取是自动的、模型上下文又不完整，低价值内容就总会漏进来。更强的模型不一定解决这个问题，很多时候只会把垃圾写得更像正确答案。
它很难 audit。 后来我最在意的不是“能不能搜出来”，而是“这条事实为什么在这里”。在文件式方案里，文件本身就是 truth；在 mem0/Qdrant 方案里，snapshot 只是派生审计面，不是 source of truth。这个差别会直接影响信任感。
它开始反过来消耗我的注意力。 我本来是想让 memory 帮我减少重复劳动，结果越来越多时间花在给 memory 补规则、清垃圾、解释边界、修复误记。到这个阶段，它已经没有在服务产品目标，而是在制造一个新的维护面。

所以 6 月 26 日我把自动抽取和召回都改成 opt-in：MEMORY_EXTRACT_ENABLED 默认关闭，MEMORY_RECALL_ENABLED 默认关闭，agent-facing 的 protocol 和提示也从常驻上下文里撤掉。

我最后保留了什么

我并不是放弃了“记忆”这件事，我只是放弃了“隐式、自动、不可见的记忆层”。

现在我更信任的是几类显式文件：

notes/user.md：关于我的稳定画像和长期约束
notes/todos.md：开环事项和当前推进中的线程
notes/knowledge/：通过 llm-wiki 沉淀的外部知识和研究结论

这套东西没有那么“自动”，但它有一个非常重要的优点：我知道它为什么存在，也知道要去哪里改它。

如果以后再做，我会怎么做

让可审计的形式做 truth。 如果最后还是要靠 snapshot 来解释系统在干什么，那说明 truth layer 放错地方了。
让 LLM 提建议，不让它拥有真相。 模型可以负责提取候选项、打标签、做排序，但最终的 durable record 应该落在更容易审计的介质上。
先从显式 memory 出发，再决定哪些值得自动化。 自动化应该建立在一个已经可信的手工流程之上，而不是反过来要求规则去拯救一个不可信的自动流程。

结尾

这段尝试对我来说并不算失败。它让我更清楚一件事：agent memory 真正难的不是“存”和“搜”，而是“什么才配被长期记住”。如果这个问题没有被解决，memory 只会把噪音持久化。

所以至少在现在，我更愿意让 agent 依赖显式 notes，而不是依赖一个自动、隐式、向量库驱动的记忆层。

TL;DR

I built a full long-term memory system for my OpenCode agent: first a file-based version, then a mem0 + Qdrant version, then a stricter version where I owned extraction myself. In the end, I turned the whole thing off by default.

The reason was simple: the hardest part of automatic memory is not recall, but truth quality. Once a system reliably stores low-value, half-correct, hard-to-audit facts, it stops being continuity and starts being maintenance overhead. I eventually went back to a simpler rule: if something matters, write it explicitly into notes/. Notes should be the truth layer, not an opaque auto-memory pipeline.

Timeline

Date	Commit	Shift
2026-06-06	`63cb854`	Added the first file-based memory plugin and `PROTOCOL.md`
2026-06-16	`63697a4`	Switched long-term memory to mem0 + Qdrant
2026-06-22	`6ca24a0`	Stopped using mem0 extraction and owned writes with `infer:false`
2026-06-24	`b903f25`	Added role-grounded gating and `core`/`text` dedup
2026-06-26	`da36bf4`, `a5ef9e9`	Disabled extraction and recall by default

I Started With a Much Simpler Goal

The first version was intentionally simple. I wanted the agent to stop relearning who I was and how I liked to work every time a new session started.

So the first implementation was file-based. Each memory lived in its own markdown file, notes/memory/MEMORY.md was the index, and that index was injected into every new session. The write protocol lived in .opencode/memory/PROTOCOL.md, and automatic extraction lived in .opencode/plugin/memory.ts.

The strengths were obvious: everything was visible, editable, diffable, and grep-able. For a single-user system, that is a very trustworthy shape.

Why I Switched to mem0

The problem was recall cost. The file-based version was eager recall: the index always traveled with the prompt. As memory grew, context grew with it. The cost was fundamentally O(n).

By mid-June, I wanted pull-based recall instead: do not inject the whole memory store into every conversation, only fetch top-k items when the agent actually needs them. That led to commit 63697a4, where I switched long-term memory to mem0 + Qdrant and replaced eager injection with a search_memories tool.

Architecturally, this move was correct. It removed the unbounded prompt growth problem and turned memory into a retrieval layer instead of an ever-growing block of context.

The Real Problem Was Extraction

After the switch to vector retrieval, I realized recall was not the hard part. Extraction was.

mem0's default extractor is tuned for high recall. That makes sense if your goal is “do not miss potentially useful facts.” It is much less useful when your target is “store only a small number of durable, high-value facts about the user.” Combined with a lighter model and limited context, the system started producing very recognizable kinds of junk:

The assistant restated “User wants X,” and the system stored that restatement as a user fact
Research findings, one-off delivery summaries, and temporary debugging state leaked into long-term memory
Open loops such as “I want to do this later” were stored as if they were stable truth
The same fact accumulated as multiple near-duplicates because the wording changed slightly

At some point the feeling became clear: I was no longer designing a memory system. I was designing a garbage filter.

So I Took Control Back

The next round was the most serious engineering work in the whole experiment. I first tried adding more gate text and regex cleanup around mem0, but that was patching behavior, not owning it.

So in commit 6ca24a0, I stopped using mem0's built-in infer:true extractor entirely and wrote my own judge. The flow became: retrieve nearby memories for dedup context, run .opencode/lib/mem0-judge.ts with EXTRACTION_GATE.md, get explicit ADD/UPDATE decisions, then write them with infer:false. In other words, mem0 stopped deciding what mattered and became a storage-and-retrieval layer only.

Two days later, in b903f25, I tightened the gate again. The core rules were:

WHO SAID IT: a claim about the user must be grounded in the user's own words
TRUTH, NOT OPEN LOOPS: “want to do / follow up / revisit later” is a task, not a truth
RESEARCH -> wiki: external research findings belong in the wiki, not user memory
core / text: use an atomic fact as the dedup key, and a self-contained phrasing as the stored form

These changes did improve quality, but they did not change the more fundamental fact: the system was still using an LLM to decide truth, and the truth layer was still a vector store. It became more controlled, but not truly trustworthy.

Why I Removed It Anyway

Three reasons made me shut it down.

It still stored too many things that were not worth storing. Even with heavy rules, fully automatic extraction plus incomplete model context will keep leaking low-value facts. A stronger model does not necessarily solve this; often it just makes junk sound more convincing.
It was hard to audit. What I cared about most was no longer “can the agent recall this,” but “why is this fact here.” In the file-based version, the file itself was the truth. In the mem0/Qdrant version, the snapshot was only a derived audit surface, not the source of truth. That difference matters a lot for trust.
It started consuming my attention instead of saving it. The whole point of memory was to reduce repetitive work. Instead, I kept spending time tightening rules, cleaning junk, explaining boundaries, and fixing false memories. At that point, the system was no longer serving the product. It had become its own maintenance surface.

So on June 26, I made both extraction and recall opt-in: MEMORY_EXTRACT_ENABLED defaults to off, MEMORY_RECALL_ENABLED defaults to off, and the agent-facing protocol was removed from always-loaded context.

What I Kept

I did not give up on memory itself. I gave up on an implicit, automatic, mostly invisible memory layer.

What I trust now is a set of explicit files:

notes/user.md for durable facts and long-term constraints about me
notes/todos.md for open loops and active workstreams
notes/knowledge/ for external knowledge and research captured through the llm-wiki flow

It is less magical, but it has one major advantage: I know why it exists, and I know exactly where to edit it.

What I’d Do Differently Next Time

Make the auditable form the source of truth. If you need a derived snapshot to explain what the system is doing, your truth layer is probably in the wrong place.
Let the LLM propose, not own reality. The model can extract candidates, label them, or rank them. The durable record should still live in something easier to inspect.
Start from explicit memory, then decide what is worth automating. Automation should sit on top of an already trustworthy manual workflow, not the other way around.

Closing

I do not consider this experiment a failure. It clarified the real problem: the hard part of agent memory is not storing or retrieving, but deciding what deserves to persist. If that part is weak, memory just turns noise into durable state.

So for now, I would rather have my agent rely on explicit notes than on an automatic, implicit, vector-database-backed memory layer.