[{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/ai-agent/","section":"Tags","summary":"","title":"AI Agent","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/alibaba-cloud/","section":"Tags","summary":"","title":"Alibaba Cloud","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/animation/","section":"Tags","summary":"","title":"Animation","type":"tags"},{"content":" Why Hugo # When picking a framework for a personal blog, my top criterion was low maintenance cost — I didn\u0026rsquo;t want to abandon writing three months later because of npm dependency hell.\nHugo is a single binary, requires no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme comes with dark mode, full-text search, RSS, Open Graph, and reading time estimates out of the box. Day-to-day writing only requires touching Markdown files.\nArchitecture # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ┌─────────────────────────────┐ │ DNS Geo-Based Routing │ │ (Alibaba Cloud DNS GeoDB) │ └──────┬──────────────┬────────┘ │ │ CN visitors ▼ Intl. visitors ▼ ┌──────────────────┐ ┌─────────────────┐ │ Alibaba CDN │ │ Cloudflare Pages │ │ ↓ │ │ (Free, Global) │ │ Alibaba OSS │ └─────────────────┘ │ (Static Hosting)│ └──────────────────┘ ↑ GitHub Actions auto-build \u0026amp; dual-stack push Annual cost: approximately ¥206 (~$29 USD):\nDomain zhuoqidev.com: ¥85/yr (bought 3 years) Function Compute resource pack (ICP filing): ¥101/yr Alibaba CDN 100GB traffic pack: ¥14/yr OSS storage: ~¥6/yr Cloudflare Pages: ¥0 ICP Filing Without a Server # Websites served to mainland China visitors need an ICP filing, which requires a \u0026ldquo;filing carrier\u0026rdquo; (a server IP). Instead of buying a full server, Alibaba Cloud\u0026rsquo;s Function Compute resource pack (¥101/yr) works as a filing carrier and provides a filing service code.\nTimeline: Alibaba Cloud initial review ~1 day + MIIT review 5-20 business days. Plenty of time to finish the site while waiting.\nGeo-DNS Routing # Alibaba Cloud DNS free tier supports \u0026ldquo;domestic / international\u0026rdquo; split routing:\nDomestic → Alibaba CDN CNAME International (default) → Cloudflare Pages CNAME Domestic visitors get the ICP-compliant Alibaba CDN; international visitors get Cloudflare\u0026rsquo;s free global CDN — one domain, two acceleration paths.\nDeployment # Push to GitHub → Actions runs hugo build → uploads in parallel to OSS and Cloudflare Pages. The whole process takes 2-3 minutes. Publishing a post is nearly instant.\nMore posts on AI Agent development coming soon. If you need an AI Agent integration developer, reach out: hello@zhuoqidev.com\n","date":"2026-05-04","externalUrl":null,"permalink":"/en/posts/hello-world/","section":"Ens","summary":"Why Hugo # When picking a framework for a personal blog, my top criterion was low maintenance cost — I didn’t want to abandon writing three months later because of npm dependency hell.\nHugo is a single binary, requires no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme comes with dark mode, full-text search, RSS, Open Graph, and reading time estimates out of the box. 
\nMore posts on AI Agent development coming soon. If you need an AI Agent integration developer, reach out: hello@zhuoqidev.com\n","date":"2026-05-04","externalUrl":null,"permalink":"/en/posts/hello-world/","section":"Ens","summary":"Why Hugo # When picking a framework for a personal blog, my top criterion was low maintenance cost — I didn’t want to abandon writing three months later because of npm dependency hell.\nHugo is a single binary, requires no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme comes with dark mode, full-text search, RSS, Open Graph, and reading time estimates out of the box. Day-to-day writing only requires touching Markdown files.\n","title":"Building a Personal Site with Hugo and Dual-Stack CDN","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/cdn/","section":"Tags","summary":"","title":"CDN","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/cloudflare/","section":"Tags","summary":"","title":"Cloudflare","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/css/","section":"Tags","summary":"","title":"CSS","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/categories/dev-log/","section":"Categories","summary":"","title":"Dev Log","type":"categories"},{"content":"Hugo shortcodes make it easy to embed live code demos. Here are three ways:\n1. Inline CSS Demo (No External Service) # A spinning loader animation, right in the article:\nPure CSS Spinner\nA gradient text animation:\nCSS Gradient Text\nZhuoQi Dev 2. Embed CodePen # If you already have CodePen creations, embed them with a single shortcode:\n{{\u0026lt; codepen id=\u0026#34;yourPenID\u0026#34; height=\u0026#34;400\u0026#34; tab=\u0026#34;result\u0026#34; \u0026gt;}} 3. Embed CodeSandbox # For React / Vue components, use CodeSandbox:\n{{\u0026lt; codesandbox id=\u0026#34;yourSandboxID\u0026#34; height=\u0026#34;450\u0026#34; view=\u0026#34;preview\u0026#34; \u0026gt;}} These three shortcodes cover most code demo scenarios — no extra tools needed.\n","date":"2026-05-04","externalUrl":null,"permalink":"/en/posts/css-animation-demo/","section":"Ens","summary":"Hugo shortcodes make it easy to embed live code demos. Here are three ways:\n1. Inline CSS Demo (No External Service) # A spinning loader animation, right in the article:\nPure CSS Spinner\nA gradient text animation:\n","title":"Embedding CSS Animation Demos in Hugo Articles","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/en/","section":"Ens","summary":"","title":"Ens","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/hugo/","section":"Tags","summary":"","title":"Hugo","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/icp-filing/","section":"Tags","summary":"","title":"ICP Filing","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/icp%E5%A4%87%E6%A1%88/","section":"Tags","summary":"","title":"ICP Filing","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"LLM","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/memory/","section":"Tags","summary":"","title":"Memory","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/en/posts/","section":"Ens","summary":"","title":"Posts","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/en/projects/","section":"Ens","summary":"","title":"Projects","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/research/","section":"Tags","summary":"","title":"Research","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/shortcode/","section":"Tags","summary":"","title":"Shortcode","type":"tags"},{"content":" TL;DR # \u0026ldquo;LLMs have no memory\u0026rdquo; isn\u0026rsquo;t an oversight — it\u0026rsquo;s the equilibrium of four compounding constraints: O(n²) attention + KV cache VRAM + catastrophic forgetting + GDPR compliance. Every \u0026ldquo;Memory\u0026rdquo; feature from ChatGPT / Claude / Cursor works the same way: inject structured text back into the system prompt. Weights never change. Prompt Caching is performance optimization, not memory. The mainstream for the next 1–3 years is \u0026ldquo;stateless LLM core + stateful Agent memory layer\u0026rdquo;.\nComplexity: O(n²) · 100M-ctx cost: 638×H100 · Cache price: 0.1× · Common TTL: 5min–24h
\n1. Why LLMs Are Stateless # Four independent constraints — individually manageable, together they leave \u0026ldquo;stateless\u0026rdquo; as the only viable engineering solution. This conclusion is cross-validated across 67 primary sources.\nArchitecture: O(n²) Attention # Self-attention scales at O(n²). A single 4096-token sequence needs 2 GB VRAM for KV cache; 32 concurrent sessions hit 64 GB — more than the model weights themselves. Llama 3.1 at 100M context requires 638 H100 GPUs ($5,400/hour) for KV cache alone.\n→ Liu et al. \u0026ldquo;Lost in the Middle\u0026rdquo; (TACL 2024): long contexts aren\u0026rsquo;t just slower — middle-section recall follows a U-shaped curve, worse than closed-book.
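\nTo make the VRAM constraint concrete, a back-of-the-envelope sketch — it assumes a Llama-2-7B-style layout (32 layers, 32 KV heads, head dim 128, fp16), which reproduces the 2 GB / 4096-token figure above; GQA models and other sizes just change the constants:\ndef kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):\n    # K and V each store n_layers * n_kv_heads * head_dim values per token,\n    # hence the factor of 2 in front\n    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len\ngb = 1024 ** 3\nprint(kv_cache_bytes(4096) / gb)       # 2.0 GB for one 4096-token sequence\nprint(32 * kv_cache_bytes(4096) / gb)  # 64 GB for 32 concurrent sessions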
\nTraining: Catastrophic Forgetting # LLM knowledge is entangled across billions of weights. No isolated \u0026ldquo;French module\u0026rdquo; or \u0026ldquo;user preference register\u0026rdquo; exists. Every fine-tune reshapes the entire parameter landscape. Even LoRA suffers from catastrophic forgetting in continual learning scenarios (arXiv 2404.16789).\n→ Industry standard: offline retraining at weekly/daily cadence. No one does per-request weight updates.\nCompliance: Right to Be Forgotten # GDPR Article 17 and PDPA require data controllers to delete personal data \u0026ldquo;without undue delay.\u0026rdquo; Once baked into billions of weights, the right to be forgotten becomes nearly impossible to execute — you can\u0026rsquo;t \u0026ldquo;subtract\u0026rdquo; a user from the model. Both Anthropic and OpenAI explicitly state Memory data lives externally, not in weights. This is a legal constraint, not a technical preference.\n→ RAG / Memory Layer beats fine-tuning because of compliance, not technical superiority.\nSecurity: Persistent Memory = Persistent Attack Surface # ChatGPT Memory has been breached via prompt injection through Google Docs, images, and web pages — attackers invoke to=bio to write malicious persistent instructions affecting all future conversations (Embrace The Red, 2024). This is precisely why Cursor 1.0→1.2 added mandatory user approval, and why Anthropic tested sycophancy/harmful conversation before releasing Memory.\nKarpathy\u0026rsquo;s canonical analogy: Weights = ROM (static, burned in at training); context window = RAM (directly addressable during inference); KV cache = working memory (formed at test-time); external vector / KG store = disk (persistent, requires retrieval). \u0026ldquo;Knowledge in the weights is a hazy recollection of training-time internet documents; content in the context window is directly accessible\u0026rdquo; — Andrej Karpathy, Dwarkesh Patel Interview (2025-10).\n2. Product Landscape: Cache vs Memory vs True Memory # 14 products, zero weight modifications. This section also disentangles three commonly conflated concepts:\nCache (KV/Prompt Caching): Caches K,V projection tensors; prefix byte-level match → skip prefill. 5min–24h lifetime. Compute optimization, not \u0026ldquo;remembering.\u0026rdquo; Memory (Product Layer): Text in external databases/vector stores/markdown, injected into system prompt on each call. User-controlled. True Model Memory (In-Weights): Changing weights themselves. Hit by catastrophic forgetting + GDPR + interpretability.\nComparison Table # Product · Strategy · Type · Weight Δ?\nChatGPT Memory · 4-layer: metadata + bio + ~40 summaries + window · Memory · No\nOpenAI Prompt Caching · ≥1024 tokens auto KV cache, 5min–24h TTL · Cache · No\nAnthropic Prompt Caching · Explicit cache_control ≤4 breakpoints, byte-level match · Cache · No\nGemini Context Caching · Implicit 90% discount + Explicit 60min TTL · Cache · No\nClaude.ai Projects · Instructions + files + history, full prompt injection · Memory · No\nClaude Memory (2025-10) · Project-isolated, 24h synthesis, editable · Memory · No\nClaude Code · CLAUDE.md + model-written MEMORY.md (200 lines) · Memory · No\nCursor Rules / AGENTS.md · Static markdown, 4 trigger modes, Team \u0026gt; Project \u0026gt; User · Memory · No\nCursor Memories (1.0+) · AI generates candidates → user approves → writes · Memory · No\nCursor Codebase Index · Merkle tree + encryption + Turbopuffer vector DB · RAG · No\nWindsurf Cascade · global + workspace rules + auto Memories + RAG · Memory · No\nDevin Knowledge · Human-written + AI suggestions + DeepWiki + VM Snapshots · Memory+RAG · No\nReplit Checkpoints · VM snapshot = files + DB + chat + Agent memory · Snapshot · No\n(The Type column distinguishes Cache/RAG/Snapshot rows from Memory rows.) No product modifies weights.\nKey reverse-engineering evidence: Manthan Gupta confirmed through three experiments: ask ChatGPT about a specific topic discussed a year ago, and it has absolutely no idea. ChatGPT Memory does not use RAG. It stores only: session metadata + dozens of bio entries + user message summaries of the last ~40 chats (not ChatGPT\u0026rsquo;s own replies) + the current sliding window. Cursor\u0026rsquo;s official docs put it even more bluntly: \u0026ldquo;Large language models don\u0026rsquo;t retain memory between completions. Rules provide persistent, reusable context at the prompt level.\u0026rdquo;\n3. The Four-Layer Future Stack # Bottom-up: base layer forever stateless. The three above are different abstractions for \u0026ldquo;giving it memory.\u0026rdquo; L4 is the short-term mainstream; L2 is the highest-value research leap.\nL4 · Agent Memory Layer # Most Mature Treats the LLM as a stateless CPU; memory lives in external databases + Agent runtime. Representatives: Letta (MemGPT) · Mem0 · Zep + Graphiti · LangGraph Store · AutoGen Memory.\n✅ Auditable · Deletable · Model-agnostic ⚠️ Retrieval quality ceiling · Write contamination accumulates Mem0 scores 26% above OpenAI Memory on LoCoMo; 91% lower p95 latency; 90% fewer tokens
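\nThe entire L4 pattern fits in a few lines — a stateless model call wrapped in retrieve-then-inject, with all persistence in an external store. A deliberately naive sketch (Letta / Mem0 / Zep wrap this same loop with real embedding search and write curation; call_llm stands in for any stateless chat API):\nmemory_store = []  # stand-in for a vector DB or knowledge graph\ndef remember(fact: str):\n    memory_store.append(fact)  # write path — production systems curate before writing\ndef recall(query: str, k: int = 3):\n    # stand-in for similarity search over embeddings\n    return [m for m in memory_store if any(w in m for w in query.split())][:k]\ndef chat(user_msg: str) -\u0026gt; str:\n    facts = \u0026#39;; \u0026#39;.join(recall(user_msg))\n    # the model stays stateless: memory only ever enters via the prompt\n    return call_llm(system=\u0026#39;Known facts about the user: \u0026#39; + facts, user=user_msg)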
\nL3 · Ultra-Long Context # Commercialized Stuffs memory into ultra-long context windows. Representatives: Gemini 2M (\u0026gt;99% needle recall) · Magic LTM-2-Mini 100M tokens.\n✅ Best in-session carrier ⚠️ Lost-in-the-middle unsolved · 100M ctx single user = 638×H100 L3 and L4 are complementary, not competitive: ultra-long context handles within-session associations; the Agent memory layer handles cross-session / cross-year persistence. Combining both is the current engineering optimum.\nL2 · In-Architecture Memory # Highest Research Value Embeds \u0026ldquo;persistent memory\u0026rdquo; as a differentiable module in the network — potentially the real paradigm shift. Representatives: Google Titans · Infini-attention · Mamba-2 · RWKV-7 Goose.\n✅ Constant VRAM · Linear time ⚠️ Not yet validated at scale (needs ≥70B params / ≥10T tokens)\nL1 · Bare LLM (frozen weights) # Forever Stateless GPT / Claude / Gemini / Llama core. Each inference is a fresh process. Continual learning won\u0026rsquo;t become a per-user memory path short-term. LoRA is for domain/role specialization, not per-user.\n4. Memory Economics: Why Cache TTL Is a Hidden Pricing Dial # This is the most underappreciated thread in the entire landscape.\nIn 2026-03, Anthropic silently dropped cache TTL from 1h to 5min, causing Claude Code users to pay 17–26% more. No announcement. No SLA commitment. This exposed a brutal truth: cache TTL directly impacts per-user cost but appears on zero SLAs.\nCost increase after Anthropic TTL change: 17–26% · Cache cost transparency: 0% (fully hidden) · 100M-ctx hardware cost (single user): ~$5.4k/hr · SLA commitments on cache TTL: 0\nExtrapolate this logic and the future of \u0026ldquo;memory economics\u0026rdquo; increasingly resembles cloud storage — tiered (5min/1h/24h/permanent), pricable (micro-adjusting TTL is reverse-pricing by traffic), and lock-in (migration cost once agent workflows depend on specific cache strategies).\n5. Three-Year Paradigm Roadmap # Based on Anthropic, Letta, Karpathy, LeCun sources. 2026 has high confidence; 2027–2028 are inferential with explicit uncertainty.\n2026 — Mainstream: bare LLM + Agent Memory (Mem0/Zep/Letta) + long-context caching. Dark horse: Titans-style architectures begin small-scale commercial use; Sleep-time Compute becomes an agent standard.\n2027 — Mainstream: Reflection / Sleep-time / TTT enter mainstream Agent framework primitives. Dark horse: a 7B SSM/Hybrid surpasses Transformer on long-context benchmarks.\n2028 — Mainstream: top models may integrate in-arch memory (high-risk prediction); otherwise the Memory Layer remains standard. Dark horse: LeCun H-JEPA + LLM hybrid prototype (early signal for the 5–10 year bet).\n2028 caveat: In-architecture memory requires ≥70B params and ≥10T token training for validation — currently arXiv-only. The more likely 2028 scenario is coexistence, not replacement.\n6. Nine Practical Takeaways # Never conflate Cache and Memory: Cache skips prefill; Memory decides what goes into the prompt. Orthogonal.\nWriting memory = writing system prompt: Any convention expressible in markdown (Cursor Rules / CLAUDE.md / AGENTS.md) beats \u0026ldquo;letting the AI remember\u0026rdquo; — diffable, version-controlled, deterministic.\nPrefix order: static → dynamic: Tool definitions, system prompt, project rules first; user input last. Top-level advice from OpenAI, Anthropic, and Google docs (see the caching sketch after this list).\nCompaction must be cache-safe: Don\u0026rsquo;t open a new system prompt for summarization — it forces full uncached recomputation. Claude Code calls this \u0026ldquo;cache-safe forking.\u0026rdquo;\nTTL is a product decision: The Anthropic 1h→5min incident proves it. Expose TTL as user-configurable, or users will find your hidden pricing in their bills.\nAI writes, human approves = steadiest auto-Memory: Cursor 1.2\u0026rsquo;s user approval + Devin\u0026rsquo;s suggestion-only flow are the post-prompt-injection consensus.\nVisible, editable, exportable = trust: Anthropic\u0026rsquo;s natural language synthesis vs ChatGPT\u0026rsquo;s opaque synthesis — two sides of the same coin.\nPrivacy mode conflicts with Cache: OpenAI Extended cache loses ZDR; Cursor privacy mode stores no plaintext. Offer \u0026ldquo;performance vs. privacy\u0026rdquo; as two modes.\nThe real moat is \u0026ldquo;context engineering,\u0026rdquo; not \u0026ldquo;memory models\u0026rdquo;: Deterministic, version-controlled, human-readable state. Curation cost is one-time; benefit compounds.
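\nA concrete instance of takeaways 1 and 3, sketched with the Anthropic SDK — the cache_control field is the real API surface described in section 2, while the model id and file name are placeholders. The static preamble is marked as a cache breakpoint, and dynamic input stays last so it never invalidates the cached prefix:\nimport anthropic\nclient = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY\nstatic_rules = open(\u0026#39;CLAUDE.md\u0026#39;).read()  # any large, stable preamble\nuser_question = \u0026#39;Summarize the project conventions.\u0026#39;\nresp = client.messages.create(\n    model=\u0026#39;claude-sonnet-4-5\u0026#39;,  # placeholder model id\n    max_tokens=1024,\n    system=[\n        # static prefix first: rules, tool definitions, project conventions\n        {\u0026#39;type\u0026#39;: \u0026#39;text\u0026#39;, \u0026#39;text\u0026#39;: static_rules,\n         \u0026#39;cache_control\u0026#39;: {\u0026#39;type\u0026#39;: \u0026#39;ephemeral\u0026#39;}},  # at most 4 breakpoints per request\n    ],\n    # dynamic content last — a changed prefix byte would invalidate the cache\n    messages=[{\u0026#39;role\u0026#39;: \u0026#39;user\u0026#39;, \u0026#39;content\u0026#39;: user_question}],\n)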
\n7. Key References # All primary sources from 2024–2026. 30+ curated entries covering vendor docs, arXiv papers, and researcher essays.\nA. Vendor Sources # OpenAI: Prompt Caching guide · Caching 201 cookbook · Manthan Gupta · Reverse Engineered ChatGPT Memory · Embrace The Red · Hacking Memories\nAnthropic: Prompt Caching docs · Lessons from Claude Code · Claude Code Memory · How Claude\u0026rsquo;s memory works\nGoogle: Gemini Context Caching · Vertex AI caching overview\nCursor / Windsurf / Devin / Replit: Cursor Rules · Codebase Indexing · Cursor 1.0 + 1.2 changelogs · Windsurf Memories · Devin Knowledge · Replit Checkpoints\nB. Key Papers # Architecture: Lost in the Middle · Gemini 1.5 · Magic LTM-2-Mini · Titans · Infini-attention · Mamba-2 · RWKV-7 · KV-Direct\nMemory Layer: MemGPT · Mem0 · Zep + Graphiti · A-Mem · Generative Agents · Sleep-time Compute\nContinual Learning: CL Survey · TTT (ICML 2025) · Memory Taxonomy\nC. Researchers (Karpathy / LeCun / Raschka) # Karpathy · Dwarkesh Patel Interview (2025-10) · Intro to LLMs LeCun · Path Towards AMI · NVIDIA GTC 2025 Raschka · Coding the KV Cache\nD. Frameworks # LangGraph Persistence · AutoGen Memory · Letta Research · Don\u0026rsquo;t Break the Cache · ctx.ist\nResearch method: Three parallel sub-agents (technical principles + product API design + future paradigms), cross-validated across four sources (Exa, Tavily, Context7, WebSearch). 67 primary URLs, 2024-Q1 to 2026-Q2.\n","date":"2026-05-04","externalUrl":null,"permalink":"/en/projects/llm-memory-research/","section":"Ens","summary":"TL;DR # “LLMs have no memory” isn’t an oversight — it’s the equilibrium of four compounding constraints: O(n²) attention + KV cache VRAM + catastrophic forgetting + GDPR compliance. Every “Memory” feature from ChatGPT / Claude / Cursor works the same way: inject structured text back into the system prompt. Weights never change. Prompt Caching is performance optimization, not memory. The mainstream for the next 1–3 years is “stateless LLM core + stateful Agent memory layer”.\n","title":"Why LLMs Have No Memory — A Cross-Validated Research Report with 67 Primary Sources","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/projects/","section":"Works","summary":"","title":"Works","type":"projects"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/%E5%8A%A8%E7%94%BB/","section":"Tags","summary":"","title":"Animation","type":"tags"},{"content":"Hugo shortcodes make it easy to embed code demos elegantly. Three approaches:\n1. Inline CSS demo (no external service) # Run a spinning loader animation right in the article:\nPure CSS Spinner\nA gradient text animation:\nCSS Gradient Text\nZhuoQi Dev 2. Embed CodePen # If you already have CodePen creations, embed them with one line of shortcode:\n{{\u0026lt; codepen id=\u0026#34;yourPenID\u0026#34; height=\u0026#34;400\u0026#34; tab=\u0026#34;result\u0026#34; \u0026gt;}} 3. Embed CodeSandbox # Use CodeSandbox for React / Vue components:\n{{\u0026lt; codesandbox id=\u0026#34;yourSandboxID\u0026#34; height=\u0026#34;450\u0026#34; view=\u0026#34;preview\u0026#34; \u0026gt;}} These three shortcodes cover most code demo scenarios — blogging basically needs no other tools.\n","date":"2026-05-04","externalUrl":null,"permalink":"/posts/css-animation-demo/","section":"Posts","summary":"Hugo shortcodes make it easy to embed code demos elegantly. Three approaches:\n1. Inline CSS demo (no external service) # Run a spinning loader animation right in the article:\nPure CSS Spinner\nA gradient text animation:\n","title":"Embedding CSS Animation Demos in Hugo Articles","type":"posts"},
{"content":" One-Sentence Conclusion # The claim that \u0026ldquo;large models have no memory\u0026rdquo; is not an oversight but the equilibrium of four constraints: Transformer O(n²) attention + KV cache VRAM + weight entanglement (catastrophic forgetting) + GDPR compliance. The \u0026ldquo;Memory\u0026rdquo; features of ChatGPT / Claude / Cursor all boil down to stuffing structured text back into the system prompt; model weights never change. Prompt Caching is only a performance optimization, not memory. The mainstream for the next 1–3 years is the hybrid architecture of \u0026ldquo;stateless LLM core + stateful Agent memory layer\u0026rdquo;.\nCompute complexity: O(n²) · 100M-ctx cost: 638×H100 · Cache price: 0.1× · Mainstream TTL: 5min–24h\n1. Why LLMs Are Designed Stateless # Four independent constraints stack up; none is fatal alone, but together they leave \u0026ldquo;stateless\u0026rdquo; as the only engineering solution — a conclusion cross-validated against 67 primary sources.\nArchitecture Constraint · O(n²) Attention # Self-attention scales as O(n²) in sequence length n, and KV cache VRAM grows linearly in n with a huge constant — a single 4096-token sequence already takes about 2 GB of VRAM, and 32 concurrent sessions take 64 GB, more than the model weights themselves. For Llama 3.1 at 100M-token context, the KV cache alone needs 638 H100s (~¥40,000/hour).\n→ Liu et al. \u0026ldquo;Lost in the Middle\u0026rdquo; (TACL 2024) shows empirically: long context is not just slow — utilization of mid-context information follows a U-shaped curve, worse than closed-book.\nTraining Constraint · Catastrophic Forgetting # LLM knowledge is deeply entangled across billions of weights; there is no \u0026ldquo;French module\u0026rdquo; or \u0026ldquo;user preference register\u0026rdquo; that can be written in isolation. Every fine-tune reshapes the whole parameter landscape and overwrites old capabilities. Even LoRA still suffers catastrophic forgetting in continual learning scenarios (arXiv 2404.16789).\n→ Industry practice is offline retraining at weekly/daily cadence; nobody does per-request online weight updates.\nCompliance Constraint · Right to Be Forgotten # GDPR Article 17 and PDPA require data controllers to delete personal data \u0026ldquo;without undue delay.\u0026rdquo; Once personal data is baked into billions of weights, the right to be forgotten becomes nearly impossible to execute precisely — you cannot \u0026ldquo;subtract\u0026rdquo; one user\u0026rsquo;s influence from the model. Anthropic and OpenAI both state explicitly that Memory data is stored externally, not in the weights. This is a hard legal constraint, not a technical choice.\n→ The fundamental reason RAG / Memory Layer beats fine-tuning is compliance, not technical superiority.\nSecurity Constraint · Persistent Memory = Persistent Attack Surface # ChatGPT Memory has been broken repeatedly via prompt injection: a Google Doc / image / web page makes the model invoke to=bio and write malicious persistent instructions that affect all future conversations (Embrace The Red blog, 2024). This is exactly why Cursor 1.0→1.2 added mandatory user approval for Memories, and why Anthropic specifically tested sycophancy / harmful conversation before shipping Memory.\nKarpathy\u0026rsquo;s canonical analogy: weights = ROM (burned in at training, static); context window = RAM (live during inference, directly addressable); KV cache = working memory (formed at test time); external vector / KG store = disk (persistent but needs retrieval). In his words: \u0026ldquo;knowledge in the weights is a hazy recollection of training-time internet documents; content in the context window is directly accessible\u0026rdquo; — Andrej Karpathy, Dwarkesh Patel interview (2025-10).\n2. How Mainstream Products Do \u0026ldquo;Memory\u0026rdquo; (with a Cache vs Memory Disambiguation) # 14 mainstream products, and not one of them actually modifies model weights. This section also disambiguates three commonly conflated concepts:\nCache (KV / Prompt Caching): caches the K and V projection tensors of the attention layers; a byte-level prefix match skips prefill. Lifetime 5min–24h. In essence a compute optimization — it does not \u0026ldquo;remember\u0026rdquo; anything. Memory (product layer): text stored in external databases / vector stores / markdown files, prepended to the system prompt on every call. User-controllable. True model memory (in-weights): changing the weights themselves. Hit by the triple blow of catastrophic forgetting, the GDPR right to be forgotten, and interpretability — the industry avoids it.
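\nThe byte-level prefix rule in the Cache definition above is worth internalizing: one changed character invalidates everything after it. An illustrative sketch only — real providers match cached prefixes by their own tokenized rules; this just shows the shape of the behavior:\ndef common_prefix_bytes(prev_prompt: bytes, new_prompt: bytes) -\u0026gt; int:\n    # everything past the first differing byte must be prefilled again\n    n = 0\n    for a, b in zip(prev_prompt, new_prompt):\n        if a != b:\n            break\n        n += 1\n    return n\n# Putting dynamic content first ruins reuse: the cacheable prefix shrinks to almost zero.\nThis is why the practical conclusions in section 6 insist on static-first, dynamic-last prompt assembly.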
\n14-Product Comparison # Product · Strategy · Type · Weights changed?\nChatGPT Memory · 4 layers: metadata + bio + ~40 summaries + sliding window · Memory · No\nOpenAI Prompt Caching · ≥1024 tokens auto KV cache, 5min–24h TTL · Cache · No\nAnthropic Prompt Caching · explicit cache_control, ≤4 breakpoints, byte-level match · Cache · No\nGemini Context Caching · implicit 90% discount + explicit 60min TTL · Cache · No\nClaude.ai Projects · project instructions + files + history, stuffed wholesale into the prompt · Memory · No\nClaude Memory (2025-10) · project-isolated, 24h synthesis, viewable/editable/exportable · Memory · No\nClaude Code · CLAUDE.md + model-written MEMORY.md (200 lines) · Memory · No\nCursor Rules / AGENTS.md · static markdown, 4 trigger modes, Team \u0026gt; Project \u0026gt; User · Memory · No\nCursor Memories (1.0+) · AI generates candidates → user approves → write · Memory · No\nCursor Codebase Index · Merkle tree + encryption + Turbopuffer vector DB · RAG · No\nWindsurf Cascade · global + workspace rules + auto Memories + RAG · Memory · No\nDevin Knowledge · human-written + AI suggestions + DeepWiki + VM Snapshots · Memory+RAG · No\nReplit Checkpoints · VM snapshot = files + DB + chat + Agent memory · Snapshot · No\n(The Type column distinguishes Cache/RAG/Snapshot rows from Memory rows.) Not a single product changes weights.\nKey reverse-engineering evidence: Manthan Gupta confirmed across three experiments that if you ask ChatGPT about a specific topic discussed a year ago, it simply does not know. ChatGPT Memory does not use RAG; it stores only session metadata + a few dozen bio entries + user-message summaries of the last ~40 chats (not ChatGPT\u0026rsquo;s own replies) + the current sliding window. The first line of Cursor\u0026rsquo;s official docs is even blunter: \u0026ldquo;Large language models don\u0026rsquo;t retain memory between completions. Rules provide persistent, reusable context at the prompt level.\u0026rdquo;\n3. Future Paradigm: A Four-Layer Hybrid Stack # Bottom-up: the base layer stays stateless forever; the three layers above are different abstractions for \u0026ldquo;bolting memory on.\u0026rdquo; L4 (the Agent memory layer) is the short-term mainstream; L2 (in-architecture memory) is the research leap most worth betting on.\nL4 · Agent Memory Layer # Commercially most mature. Treats the LLM as a stateless CPU; \u0026ldquo;memory\u0026rdquo; lives in external databases + the Agent runtime, and each inference stitches retrieval results back into the prompt. Representatives: Letta (MemGPT) · Mem0 · Zep + Graphiti · LangGraph Store · AutoGen Memory.\n✅ Auditable · Deletable · Model-agnostic ⚠️ Retrieval quality sets the ceiling · Write contamination accumulates Mem0 beats OpenAI Memory by 26% on the LoCoMo benchmark, with 91% lower p95 latency and 90% fewer tokens\nL3 · Ultra-Long Context # Already commercialized. Stuffs memory into an ultra-long context window. Representatives: Gemini 2M (needle recall \u0026gt;99%) · Magic LTM-2-Mini 100M tokens.\n✅ Best in-session carrier ⚠️ Lost-in-the-middle still unsolved · 100M ctx for a single user = 638×H100 L3 and L4 are complementary, not substitutes: ultra-long context handles immediate in-session associations, while the Agent memory layer handles cross-session / cross-year persistence. Combining the two is the current engineering optimum.\nL2 · In-Architecture Memory # Highest research value. Builds \u0026ldquo;persistent memory\u0026rdquo; as a differentiable module inside the network — possibly the direction that truly rewrites the landscape. Representatives: Google Titans (short-term attention + long-term neural memory) · Infini-attention · Mamba-2 · RWKV-7 Goose.\n✅ Constant VRAM · Linear time ⚠️ Not yet validated at scale (needs ≥70B params / ≥10T tokens of training to prove viability)\nL1 · Bare LLM (frozen weights) # Forever stateless. The GPT / Claude / Gemini / Llama core. Every inference is a new process; the weights never change. Continual learning will not become the per-user memory path short-term. LoRA is for domain/role specialization, not per-user memory.\n4. Memory Economics: Why Cache TTL Is a Hidden Pricing Dial # This undercurrent is the most underrated thread of the whole report.\nIn 2026-03 Anthropic silently dropped the default cache TTL from 1h to 5min, and Claude Code users measured 17–26% higher spend. No announcement, no SLA commitment. The change exposed a brutal fact: cache TTL is a hidden dial that directly affects per-user cost yet appears in no SLA.\nCost increase after the Anthropic TTL change: 17–26% · Cache cost transparency: 0% (fully hidden) · 100M-ctx hardware cost (single user): ~¥40k/hour · Cache TTL commitments in SLAs: 0\nExtrapolate and the coming \u0026ldquo;memory economics\u0026rdquo; looks ever more like cloud storage — tiered (5min/1h/24h/permanent), pricable (micro-adjusting TTL is reverse pricing by traffic), and lock-in (once agent workflows depend on a specific cache strategy, migration cost is enormous).
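\nThe TTL effect is easy to reproduce on paper. A toy model with hypothetical token counts and turn gaps — real sessions mostly stay under the TTL, which is why the observed increase was 17–26% rather than the multiples this worst-ish case gives:\ndef session_input_cost(gaps_min, prefix_tokens, ttl_min, cached_rate=0.1):\n    # first turn is always a cold cache; later turns pay the cached rate\n    # only if the gap since the previous turn stayed within the TTL\n    cost = prefix_tokens\n    for gap in gaps_min:\n        cost += prefix_tokens * (cached_rate if gap \u0026lt;= ttl_min else 1.0)\n    return cost\ngaps = [2, 7, 12, 3, 30]  # hypothetical minutes between agent turns\nprint(session_input_cost(gaps, 20_000, ttl_min=60))  # 30,000 token-units at a 1h TTL\nprint(session_input_cost(gaps, 20_000, ttl_min=5))   # 84,000 token-units at a 5min TTL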
\n5. Three-Year Paradigm Map # Judgments based on Anthropic, Letta, Karpathy, LeCun and other sources. The 2026 mainstream configuration has high confidence; 2027–2028 are inference, with uncertainty.\n2026 — Industrial mainstream: bare LLM + Agent memory layer (Mem0/Zep/Letta) + long-context caching. Possible dark horse: Titans-family architectures enter small-scale commercial use; Sleep-time Compute becomes standard agent equipment.\n2027 — Industrial mainstream: Reflection / Sleep-time / TTT become mainstream Agent framework primitives. Possible dark horse: some 7B SSM/Hybrid comprehensively beats Transformer on long-context benchmarks.\n2028 — Industrial mainstream: top models may integrate an in-arch memory module (high-risk prediction); otherwise the Memory Layer remains standard. Possible dark horse: a LeCun H-JEPA + LLM hybrid prototype appears (an early signal on the 5–10 year horizon).\nThe 2028 prediction needs caution: in-architecture memory schemes like Titans need ≥70B params and ≥10T tokens of training for validation at scale, and are currently arXiv-only. The more likely 2028 scenario is the Agent memory layer and in-architecture memory coexisting, not the latter replacing the former.\n6. Nine Practical Conclusions for Engineers # Never conflate Cache and Memory: Cache is a compute optimization (skipping prefill); Memory is the product layer deciding what goes into the prompt. Fully orthogonal.\nWriting memory is writing the system prompt: any project convention you can write in markdown (Cursor Rules / CLAUDE.md / AGENTS.md) is always more controllable, diffable, and version-manageable than \u0026ldquo;letting the AI remember by itself.\u0026rdquo;\nPrefix order: static → dynamic: tool definitions, system prompt, and project rules first; the current user input last. The consistent top-level advice across the OpenAI / Anthropic / Google docs.\nCompaction must be cache-safe: do not open a new system prompt just for summarization — it forces the full-length conversation to be recomputed at uncached price. Claude Code calls this \u0026ldquo;cache-safe forking.\u0026rdquo;\nTTL is a product decision, not just an engineering parameter: the lesson of the Anthropic 1h→5min TTL incident. Expose TTL as a configurable option, or users will discover your hidden pricing in their bills.\nAI writes, human approves = the steadiest form of auto-Memory today: Cursor 1.2 adding user approval and Devin defaulting to a suggestion flow are design consensus earned from repeated prompt-injection lessons.\nVisible, editable, exportable = trust: the contrast between Anthropic\u0026rsquo;s natural-language synthesis and ChatGPT\u0026rsquo;s opaque synthesis proves the point from both sides.\nPrivacy mode conflicts with Cache: OpenAI Extended cache loses ZDR eligibility; Cursor privacy mode stores no plaintext — offer \u0026ldquo;performance vs privacy\u0026rdquo; as two modes for users to choose.\nThe real moat is \u0026ldquo;context engineering,\u0026rdquo; not \u0026ldquo;memory models\u0026rdquo;: write memory as deterministic, version-controlled, human-readable state; the curation cost is one-time, the benefit compounds.\n7. Key Reference Sources # All primary sources from 2024–2026. 30+ curated entries covering vendor docs, arXiv papers, and researchers\u0026rsquo; own writing.\nA. Vendor Primary Sources # OpenAI\nOpenAI Prompt Caching guide — how the KV cache works + TTL + retention policy OpenAI Prompt Caching 201 cookbook — Extended cache and its relation to ZDR Manthan Gupta · I Reverse Engineered ChatGPT\u0026rsquo;s Memory — reverse engineering of the 4-layer structure Embrace The Red · ChatGPT Hacking Memories — the bio tool and the prompt injection attack surface\nAnthropic\nAnthropic Prompt Caching docs — cache_control / 5min vs 1h / 4 breakpoints Lessons from building Claude Code — cache-safe forking in practice Claude Code Memory docs — CLAUDE.md vs auto memory How does Claude\u0026rsquo;s memory work — RAG tool calls + 24h synthesis + project isolation\nGoogle\nGemini API Context Caching — implicit vs explicit, TTL, storage billing Vertex AI Context caching overview — 90% discount + cross-tenant isolation\nCursor / Windsurf / Devin / Replit\nCursor Rules + Codebase Indexing + 1.0 changelog + 1.2 changelog Windsurf Cascade Memories — 5-layer context assembly Devin Knowledge — human-written + AI + DeepWiki + VM Snapshots Replit Checkpoints — VM + DB + AI conversation snapshots\nB. Key Papers # Architecture / long context\nLost in the Middle (TACL 2024) — the U-curve evidence Gemini 1.5 Technical Report — the 1M-10M token benchmark Magic LTM-2-Mini — 100M tokens, 1000× fewer FLOPs than attention Titans: Learning to Memorize at Test Time — Google\u0026rsquo;s neural memory module Infini-attention — compressive memory, 1B model 5K → 1M passkey Mamba-2 / SSD (ICML 2024) + RWKV-7 Goose + KV-Direct\nMemory Layer / Agent memory\nMemGPT · Mem0 · Zep + Graphiti · A-Mem · Generative Agents · Sleep-time Compute\nContinual Learning\nContinual Learning of LLMs Survey · TTT (ICML 2025) · Memory Survey\nC. Paradigm Judgments (Karpathy / LeCun / Raschka) # Andrej Karpathy · Dwarkesh Patel interview (2025-10) Karpathy · Intro to LLMs Yann LeCun · A Path Towards AMI LeCun at NVIDIA GTC 2025 Sebastian Raschka · Coding the KV Cache\nD. Industrial Frameworks # LangGraph Persistence \u0026amp; Memory AutoGen Memory \u0026amp; RAG Letta Research Don\u0026rsquo;t Break the Cache (arXiv 2601.06007) ctx.ist · Context Determinism Thesis\nResearch method: three parallel sub-agents (technical principles + product API design + future paradigms), cross-validated across four information sources (Exa, Tavily, Context7, WebSearch). 67 primary URLs in total, spanning 2024-Q1 to 2026-Q2.\n","date":"2026-05-04","externalUrl":null,"permalink":"/projects/llm-memory-research/","section":"Works","summary":"One-Sentence Conclusion # The claim that “large models have no memory” is not an oversight but the equilibrium of four constraints: Transformer O(n²) attention + KV cache VRAM + weight entanglement (catastrophic forgetting) + GDPR compliance. The “Memory” features of ChatGPT / Claude / Cursor all boil down to stuffing structured text back into the system prompt; model weights never change. Prompt Caching is only a performance optimization, not memory. The mainstream for the next 1–3 years is the hybrid architecture of “stateless LLM core + stateful Agent memory layer”.\nCompute complexity: O(n²) · 100M-ctx cost: 638×H100 · Cache price: 0.1× · Mainstream TTL: 5min–24h\n1. Why LLMs Are Designed Stateless # Four independent constraints stack up; none is fatal alone, but together they leave “stateless” as the only engineering solution — a conclusion cross-validated against 67 primary sources.\n","title":"Why Large Models Have No Memory — a Cross-Validated Study of 67 Primary Sources","type":"projects"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/categories/%E6%8A%98%E8%85%BE%E8%AE%B0%E5%BD%95/","section":"Categories","summary":"","title":"Tinkering Log","type":"categories"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":" Why Hugo # When picking a framework for a personal blog, my first criterion was low maintenance cost — I did not want to give up writing three months later because of npm dependency hell.\nHugo is a single binary, needs no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme ships dark mode, full-text search, RSS, Open Graph, and reading-time estimates out of the box. Day-to-day writing only touches Markdown.\nOverall Architecture # ┌─────────────────────────────┐\n│  DNS Geo-Split Resolution   │\n│ (Alibaba Cloud DNS GeoDB)   │\n└──────┬──────────────┬───────┘\n       │              │\n CN visitors ▼   Intl. visitors ▼\n┌──────────────────┐ ┌──────────────────┐\n│   Alibaba CDN    │ │ Cloudflare Pages │\n│        ↓         │ │ (free,global CDN)│\n│   Alibaba OSS    │ └──────────────────┘\n│ (static hosting) │\n└──────────────────┘\n        ↑\nGitHub Actions auto-build \u0026amp; dual-stack push\nThis setup costs about ¥206 per year:\nDomain zhuoqidev.com: ¥85/yr (3 years purchased) Function Compute resource pack (ICP filing carrier): ¥101/yr Alibaba CDN 100GB traffic pack: ¥14/yr OSS storage: ~¥6/yr Cloudflare Pages: ¥0 About ICP Filing # A domain served domestically needs a \u0026ldquo;filing carrier\u0026rdquo; (a server IP). Without buying a server, an Alibaba Cloud Function Compute resource pack (¥101/yr) can act as the filing carrier and yields a filing service code.\nFiling timeline: Alibaba Cloud initial review 1 day + MIIT review 5-20 business days. The wait is exactly enough time to finish building the site.\nGeo-Split DNS Resolution # The free tier of Alibaba Cloud DNS supports \u0026ldquo;domestic / international\u0026rdquo; split lines:\nDomestic → Alibaba CDN CNAME International (default) → Cloudflare Pages CNAME This way domestic users hit the ICP-filed Alibaba nodes while overseas users hit Cloudflare\u0026rsquo;s free global CDN — one domain, two acceleration paths.
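\nTo verify the split really resolves differently, a quick check with dnspython — illustrative only: the resolver IPs are examples, geo-DNS answers depend on where the resolver sits, and this assumes the blog host is a CNAME record:\nimport dns.resolver  # pip install dnspython\ndef cname_via(nameserver: str, host: str = \u0026#39;www.zhuoqidev.com\u0026#39;) -\u0026gt; str:\n    r = dns.resolver.Resolver(configure=False)\n    r.nameservers = [nameserver]\n    return str(r.resolve(host, \u0026#39;CNAME\u0026#39;)[0].target)\nprint(cname_via(\u0026#39;223.5.5.5\u0026#39;))  # AliDNS view — expect the Alibaba CDN CNAME\nprint(cname_via(\u0026#39;8.8.8.8\u0026#39;))    # Google DNS view — expect the Cloudflare Pages CNAME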
\nDeployment Flow # Push to GitHub → Actions runs hugo build automatically → parallel upload to OSS and Cloudflare Pages.\nThe whole flow takes about 2-3 minutes; publishing a post is essentially frictionless.\nI will write more about AI Agent development next. If you need Agent integration development, feel free to contact me: hello@zhuoqidev.com\n","date":"2026-05-04","externalUrl":null,"permalink":"/posts/hello-world/","section":"Posts","summary":"Why Hugo # When picking a framework for a personal blog, my first criterion was low maintenance cost — I did not want to give up writing three months later because of npm dependency hell.\nHugo is a single binary, needs no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme ships dark mode, full-text search, RSS, Open Graph, and reading-time estimates out of the box. Day-to-day writing only touches Markdown.\nOverall Architecture # (architecture diagram: geo-split DNS → Alibaba CDN + OSS for CN, Cloudflare Pages for international, dual-stack push from GitHub Actions) This setup costs about ¥206 per year:\n","title":"Building a Personal Site with Hugo and Dual-Stack CDN","type":"posts"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/%E8%B0%83%E7%A0%94%E6%8A%A5%E5%91%8A/","section":"Tags","summary":"","title":"Research Report","type":"tags"},{"content":"","date":"2026-05-04","externalUrl":null,"permalink":"/tags/%E9%98%BF%E9%87%8C%E4%BA%91/","section":"Tags","summary":"","title":"Alibaba Cloud","type":"tags"},{"content":"I\u0026rsquo;m Liu ZhuoQi, an AI Agent developer.\nI integrate AI into real products — agent systems, data visualization, automated workflows. Good tools shouldn\u0026rsquo;t need to explain themselves.\nCursor is my daily driver.
This site is where I document and share technical explorations, covering AI Agent development, engineering practices, and creative coding.\nIf you need an AI Agent integration developer, reach out.\nTech Stack # AI \u0026amp; Agents LangChain · OpenAI API · Claude API · Cursor Canvas · Prompt Engineering · RAG\nFrontend React · TypeScript · Next.js · Astro · Tailwind CSS\nBackend \u0026amp; Infra Node.js · Python · PostgreSQL · Docker · Cloudflare Workers\nContact # GitHub: github.com/zhuoqidev X / Twitter: x.com/zhuoqidev Email: hello@zhuoqidev.com ","externalUrl":null,"permalink":"/en/about/","section":"Ens","summary":"AI Agent developer Liu ZhuoQi’s personal introduction","title":"About","type":"en"},{"content":"","externalUrl":null,"permalink":"/en/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/en/search/","section":"Ens","summary":"Search posts","title":"Search","type":"en"},{"content":"","externalUrl":null,"permalink":"/en/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","externalUrl":null,"permalink":"/en/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","externalUrl":null,"permalink":"/en/","section":"ZhuoQi Dev","summary":"","title":"ZhuoQi Dev","type":"page"},{"content":"I am Liu ZhuoQi, an AI Agent developer.\nWhat I do is land AI capabilities in real products — intelligent Agent systems, data visualization, automated workflows. I believe good tools do not need to explain themselves.\nCursor is my daily driver. This site is where I document and share technical explorations, covering AI Agent development, engineering practice, and creative coding.\nIf you need AI Agent integration development, contact me directly.\nTech Stack # AI \u0026amp; Agents LangChain · OpenAI API · Claude API · Cursor Canvas · Prompt Engineering · RAG\nFrontend React · TypeScript · Next.js · Astro · Tailwind CSS\nBackend \u0026amp; Infrastructure Node.js · Python · PostgreSQL · Docker · Cloudflare Workers\nContact # GitHub: github.com/zhuoqidev X / Twitter: x.com/zhuoqidev Email: hello@zhuoqidev.com ","externalUrl":null,"permalink":"/about/","section":"ZhuoQi’s Dev Notes","summary":"Personal introduction of AI Agent developer Liu ZhuoQi","title":"About Me","type":"page"},{"content":"","externalUrl":null,"permalink":"/search/","section":"ZhuoQi’s Dev Notes","summary":"Search posts","title":"Search","type":"page"}]