LLM

How Agents Remember You: Human Memory Science and a Code Audit of Six Open-Source Systems

2026-07-30·5564 words· 27 min

Deep Dives AI Agent LLM Memory Memory Systems Cognitive Science Open-Source Architecture

Almost every agent project now claims to provide “long-term memory.” For one project, that means embedding chat history. For another, it means maintaining a user profile. A third lets the model edit Markdown files. A fourth builds a bitemporal knowledge graph. All four use the word memory, but they are not the same system and should not be placed on one undifferentiated leaderboard. To decide whether a system genuinely remembers, I would rather ask three questions:

How Agents Remember You: A History of Human Memory and a Code Audit of Six Open-Source Systems

2026-07-30·1755 characters· 9 min

Deep Dives AI Agent LLM Memory Memory Systems Cognitive Science Open-Source Architecture

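Almost every agent project says it has “long-term memory.” For some that means embedding the chat history, for others maintaining a user profile, for others letting the model edit Markdown itself, and some have gone as far as a bitemporal knowledge graph. They are all called memory, yet they are not the same thing and should not be compared directly on one benchmark leaderboard. To judge whether a system really “remembers,” I would rather ask three questions: After an experience, what state in the system has changed? Where is that state stored, who can modify it, and when does it expire? Before the next action, how is it brought back accurately and compliantly? This article starts from these three questions. The first half puts human memory science and agent memory technology on one historical timeline; the second half reads the code directly, comparing the advertised selling points, actual data flow, system boundaries, and corresponding memory paradigm of Mem0, Letta, Graphiti, LangMem, Cognee, and MemoryOS. The conclusion up front: today's mainstream agents have not acquired a unified “memory organ” like the human brain's. What actually works in engineering is a closed loop: experience → write gating → representation → storage → retrieval → context assembly → action feedback → consolidation / revision / forgetting. Different open-source projects simply choose to take over different parts of this loop. The figure below is not the component architecture of any one product but a frame of reference shared across the whole article. The question it answers is not “where is the data stored” but “how does a past experience actually influence the next action.” When reading it, first follow the seven-step main loop in the middle to see how information goes from experience to action; then look at the three carriers on the left, which distinguish the current task, cross-session memory, and real-world state; the right side explains the transformation each step performs, and the bottom shows the consolidation, revision, and forgetting that must happen over long-term operation. This avoids lumping databases, context, caches, and real state together as “memory.”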
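To make the closed loop above concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any of the six audited projects work: the class names `MemoryItem` and `MemoryLoop`, the keyword-overlap scoring (a stand-in for embeddings), and the "drop anything never retrieved" forgetting rule are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    keywords: set[str]  # toy representation; a real system might store an embedding
    hits: int = 0       # how often this item was retrieved (action feedback)


@dataclass
class MemoryLoop:
    store: list[MemoryItem] = field(default_factory=list)

    def write(self, experience: str) -> bool:
        """Write gating + representation + storage."""
        text = experience.strip()
        # Gate: skip very short inputs and exact duplicates (hypothetical rule).
        if len(text.split()) < 3 or any(m.text == text for m in self.store):
            return False
        self.store.append(MemoryItem(text, set(text.lower().split())))
        return True

    def retrieve(self, query: str, k: int = 2) -> list[MemoryItem]:
        """Retrieval by keyword overlap; records a hit on each returned item."""
        q = set(query.lower().split())
        scored = [(len(q & m.keywords), m) for m in self.store]
        ranked = sorted((s for s in scored if s[0] > 0),
                        key=lambda s: s[0], reverse=True)
        top = [m for _, m in ranked[:k]]
        for m in top:
            m.hits += 1
        return top

    def assemble_context(self, query: str) -> str:
        """Context assembly: retrieved memories placed in front of the task."""
        lines = [f"- {m.text}" for m in self.retrieve(query)]
        return "\n".join(["Known about the user:", *lines, f"Task: {query}"])

    def consolidate(self, min_hits: int = 1) -> int:
        """Forgetting: drop items retrieved fewer than min_hits times."""
        before = len(self.store)
        self.store = [m for m in self.store if m.hits >= min_hits]
        return before - len(self.store)


loop = MemoryLoop()
loop.write("user prefers dark mode")
loop.write("the user lives in Berlin")
print(loop.assemble_context("dark mode settings"))
print("forgotten:", loop.consolidate())
```

Even in this toy, the point of the loop shows up: what decides the next action is the assembled context, not the store, and the store only stays useful if something prunes or revises it over time.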

How to Choose an LLM Inference Engine: A 2026 Map from Local Single-Machine to PD Disaggregation

2026-07-19·Updated: 2026-07-30·1306 characters· 7 min

Deep Dives LLM Inference Engine VLLM SGLang Model Serving Inference Optimization Selection Guide Research Report

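Aliyun's CAP has an article on choosing an inference engine that narrows the candidates to four: Ollama, vLLM, SGLang, and Hugging Face Pipeline. That split was good enough in 2024. By 2026, though, it misses at least half the map: NVIDIA's TensorRT-LLM has completed its “PyTorch-ification,” SGLang earned its reputation as the first open-source project to reproduce DeepSeek's large-scale deployment, Hugging Face itself put a “maintenance mode” banner on TGI and advised switching to vLLM, and the real throughline of the inference-engine field across 2025 comes down to one word: disaggregate. This article updates the map to mid-2026. It does not pick a product for you; it gives you a layered framework, a decision matrix, and a decision tree so that you can narrow the candidates to one or two yourself. The first figure addresses the most common comparison error: Ollama, KTransformers, and vLLM are not solving problems at the same layer. Comparing throughput, formats, and hardware support only makes sense after first splitting the field into local running, heterogeneous offloading, and high-performance serving, and then comparing within a layer. Three-tier inference map: L1 aims for “it runs” and easy installation, L1.5 trades system memory for GPU memory, and L2 aims for concurrency, throughput, and multi-GPU serving. It is a classification framework, not an overall ranking. Why “choosing an inference engine” only became a real question in 2026: three years ago there was nothing to agonize over. Back then, getting a 7B model running on a GPU and returning a reasonably smooth token stream was already good enough.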

How to Choose an LLM Inference Engine — A 2026 Map from Local Single-GPU to PD Disaggregation

2026-07-19·Updated: 2026-07-30·3716 words· 18 min

Deep Dives LLM Inference Engine VLLM SGLang Model Serving Inference Optimization Selection Guide Research

Aliyun’s CAP has a piece on picking an inference engine that narrows the field to four: Ollama, vLLM, SGLang, and Hugging Face Pipeline. In 2024, that framing was fine. By 2026, it’s missing half the map. NVIDIA’s TensorRT-LLM has completed its “PyTorch-ification,” SGLang became famous as the first open-source project to reproduce DeepSeek’s large-scale deployment, Hugging Face slapped a “maintenance mode” banner on TGI and told you to switch to vLLM — and the real throughline of the entire 2025 inference landscape can be summed up in one word: disaggregate.
