The Context Wall

Your AI context file works perfectly — until it doesn't. Here's when to upgrade from a single markdown file to QMD, and exactly how to do it.

There is a pattern that every AI-native engineer discovers on their own. You create a file — ctx.md, db.md, knowledge.md, whatever you call it — and you start filling it with everything your AI agent needs to know about your project. Architecture decisions. Business rules. Domain knowledge. API contracts. Edge cases. The things that make your project yours.

It works brilliantly. Your AI agent goes from generic to contextual. From guessing to knowing. From "technically correct" to "actually solves the problem."

And then one day, it stops working. Not with a bang — with a slow fade. Your context file grew too large, and nobody told you.

This article is about recognizing that moment, understanding why it happens, and adopting the tool that fixes it without abandoning the approach that got you here.

The Single File That Runs Everything

The pattern is simple. In The Architect's Protocol, I described how AI-native development workflows rely on a growing knowledge base — a living document that feeds your AI agent the context it needs to produce quality output. In practice, this usually looks like a single markdown file in your project root:

project/
  CLAUDE.md          # Rules, conventions, workflow instructions
  docs/
    ctx.md           # Everything the AI needs to know about the project

CLAUDE.md holds the rules — how to format code, which conventions to follow, what tools to use. It gets loaded automatically at every session start.

ctx.md holds the knowledge — what the system does, how it's built, why certain decisions were made, what the business rules are. Your AI agent reads it when it needs context.

This is the KISS approach, and it's genuinely brilliant for several reasons:

You control what goes in. You decide what knowledge matters. You're the filter. No automated noise.
It's transparent. You can read the file. You can see exactly what your AI knows and doesn't know.
It's version-controlled. Git tracks every change. You can see when knowledge was added, who added it, why.
Zero dependencies. No databases, no servers, no API keys. Just a text file.
It works everywhere. Claude Code, Cursor, Windsurf — every AI coding tool reads markdown.

As I argued in The Knowledge Equation, the quality of your input dictates the quality of your output. A well-curated context file transforms your AI agent from a generic code generator into a domain-aware engineering partner.

So what's the problem?

The Wall

The problem is growth.

When your context file is 200 lines, everything fits. Your AI reads the whole thing, has full context, produces great output. When it's 500 lines, it's still manageable. Maybe a bit long, but it works.

Then the project evolves. You document more architecture decisions. More edge cases. More business rules. More API contracts. More incident learnings. The file grows to 1,000 lines. 2,000. 4,000. And somewhere in that growth, you hit the wall.

The wall isn't dramatic. It's subtle. Here's how to know you've hit it:

1. You start editing your context file down. You remove older information to make room for new information. Weeks later, your AI doesn't know something it used to know — because you trimmed it.

2. Your AI asks you things it should already know. Something you documented months ago. Something that's "obvious" to anyone who's worked on the project. But the file is so long that the AI either skipped it or it was pushed out of the context window.

3. You hesitate to add new knowledge. You learn something important — from a meeting, a Slack thread, a production incident — and think: "I should add this... but the file is already so long." The moment you start self-censoring your knowledge base, the system is broken.

4. You spend more time maintaining the file than writing code. Finding the right section, reorganizing content, deduplicating — the curation tax exceeds the benefit.

5. Your AI makes wrong assumptions. It confidently states something about your architecture or business rules that used to be true, because the update lives in a section it didn't read or in one you removed to keep the file manageable.

The Desk Analogy

Think of it like a desk. Your AI agent has a desk where it puts everything it needs to work with — your instructions, your knowledge file, the code it's reading, the conversation with you, and tool results. Everything has to fit on the desk at the same time.

The desk has a fixed size. That's the context window — the total amount of information an LLM can hold in working memory during a single session. Claude's context window is large (up to 1 million tokens on Opus 4.6), but it's still finite. And your knowledge file isn't the only thing competing for that space.

Small context file = a notebook on a big desk. Plenty of room for code, conversation, and tool results.

4,000-line context file = a massive binder covering half the desk. Code files are hanging off the edge. Conversation history is on the floor. Your AI literally can't see parts of the information — not because it's being lazy, but because it physically doesn't fit.

QMD = instead of putting the whole binder on the desk, you keep it in a filing cabinet next to the desk. When your AI needs something specific, it pulls out just that one page. The desk stays clear for actual work.

The Math

Here's what's actually happening. Context windows keep getting larger, but your context file still competes with everything else for that space: the code files being edited, the conversation history, the tool results, the system instructions. Larger windows don't eliminate the problem; they just move the wall further out. Projects grow too.

A rough conversion: 1 line of markdown ≈ 25 tokens. Against a 1 million token context window, the math looks more forgiving than you'd expect:

500 lines ≈ 12K tokens — trivial, about 1% of context
2,000 lines ≈ 50K tokens — 5% of context, comfortable
4,000 lines ≈ 100K tokens — 10% of context, noticeable
6,000+ lines ≈ 150K+ tokens — 15%+ of context, significant when combined with code and conversation
10,000+ lines ≈ 250K+ tokens — 25% of context, now you’re definitely hitting the wall

These percentages look manageable in isolation. But remember: the context window isn't all yours. The system prompt, CLAUDE.md, conversation history, code files Claude is reading, and tool results all compete for the same space. A 100K-token context file plus a 200K conversation plus 300K of code files is already 600K tokens, more than half the window. And there's another problem: LLM attention degrades over very long contexts. Information buried in the middle of a 100K-token file tends to get less attention than information at the beginning and end. So even if it technically fits, your AI might still miss things.

That's the wall. Not a hard crash — a slow degradation in quality that you might not even notice until it's severe.
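The arithmetic above is easy to rerun for your own project. Here is a minimal sketch, assuming the 25 tokens-per-line rule of thumb and a 1 million token window; the function name and the 200K/300K figures are illustrative, not measurements.

```python
TOKENS_PER_LINE = 25          # rough conversion used above (an estimate)
CONTEXT_WINDOW = 1_000_000    # assumed window size, in tokens


def context_share(lines: int, other_tokens: int = 0) -> float:
    """Fraction of the window used by a context file of `lines` lines
    plus `other_tokens` of code, conversation, and tool results."""
    used = lines * TOKENS_PER_LINE + other_tokens
    return used / CONTEXT_WINDOW


if __name__ == "__main__":
    for lines in (500, 2_000, 4_000, 10_000):
        print(f"{lines:>6} lines -> {context_share(lines):.1%} of the window")
    # A 4,000-line file next to 200K of conversation and 300K of code:
    print(f"combined -> {context_share(4_000, 200_000 + 300_000):.0%}")
```

Swap in your own line count and a realistic estimate of everything else in the session to see how close you are.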

Enter QMD

QMD is a local, on-device hybrid search engine for markdown knowledge bases. Created by Tobias Lutke, co-founder and CEO of Shopify. Open-source, MIT-licensed, and actively developed — v2.1.0 was released in April 2026 with over 20,000 GitHub stars.

Lutke's own description: "A local search engine that lives and executes entirely on your computer. Both for you and agents." He uses it daily. Shopify integrated a version into their code monorepo for internal doc search.

The core idea is simple: instead of loading your entire knowledge base into the context window, search it and retrieve only the relevant parts. Your AI gets the 50 lines it needs instead of eating 4,000 lines it mostly doesn't.

Everything runs locally. No API keys. No cloud. No subscription. No data leaving your machine.

How QMD Works: The Technical Deep Dive

QMD combines three search techniques into a single pipeline, each progressively more intelligent:

Layer 1: BM25 Full-Text Search

The foundation. BM25 is a classical keyword-matching algorithm implemented via SQLite FTS5. When you search for "payment reconciliation," it finds documents containing those exact terms, ranked by relevance.

Speed: Sub-10ms
Strength: Perfect for exact terms — function names, API endpoints, specific concepts
Weakness: Misses semantic matches. Searching "payment reconciliation" won't find a document about "settlement processing" even if it's the same concept
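To make the keyword layer concrete, here is a toy BM25 scorer in plain Python. It is a sketch of the textbook formula with commonly used defaults (k1 = 1.2, b = 0.75) and whitespace tokenization, not QMD's actual implementation; the three sample documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the standard BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue  # term appears nowhere, so it cannot contribute
            idf = math.log(1 + (n - doc_freq + 0.5) / (doc_freq + 0.5))
            freq = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / (freq + length_norm)
        scores.append(score)
    return scores


DOCS = [
    "payment reconciliation runs nightly",
    "settlement processing matches ledger entries",
    "retry logic for the scheduler",
]

if __name__ == "__main__":
    for doc, score in zip(DOCS, bm25_scores("payment reconciliation", DOCS)):
        print(f"{score:.3f}  {doc}")
```

The settlement document scores zero even though it covers the same concept. That is the weakness the next layer exists to cover.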

Layer 2: Vector Semantic Search

QMD embeds every chunk of your knowledge base into vector space using a local EmbeddingGemma 300M model (~300MB). Queries are embedded too, and results are ranked by cosine similarity.

Speed: 50-200ms
Strength: Finds conceptual matches. "Payment reconciliation" will find documents about "settlement processing" because they're semantically close
Weakness: Can surface tangentially related results. Less precise than keyword matching for exact terms
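Ranking by cosine similarity is simple once the vectors exist. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings (which have hundreds of dimensions and come from the embedding model); the chunk names and numbers are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Return chunk names ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )


# Toy vectors: pretend the first axis means "payments", the last "infrastructure".
QUERY = [0.9, 0.1, 0.0]  # "payment reconciliation"
CHUNKS = {
    "kafka-setup": [0.0, 0.1, 0.9],
    "settlement-processing": [0.8, 0.2, 0.1],
}

if __name__ == "__main__":
    print(rank_by_similarity(QUERY, CHUNKS))
```

No keyword overlap is needed: the settlement chunk ranks first because its vector points in nearly the same direction as the query.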

Layer 3: LLM Reranking

The results from both BM25 and vector search are merged via Reciprocal Rank Fusion, then a local Qwen3 0.6B model (~600MB) re-ranks them with actual language understanding. It reads the query and each candidate result, and judges which results truly answer the question.

Speed: 200-500ms for the full hybrid pipeline
Strength: Context-aware ordering. Understands nuance that keyword and vector matching can't
Weakness: Adds latency. Small model means it's good but not perfect
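Reciprocal Rank Fusion, the merge step mentioned above, fits in a few lines. This sketch uses the usual formulation, where each list contributes 1 / (k + rank) per document, with k = 60 as a commonly used constant. Whether QMD uses the same constant or any extra weighting is an assumption left open here, and the document names are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one, best first.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by multiple searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


BM25_HITS = ["api-contracts", "edge-cases", "stripe"]
VECTOR_HITS = ["edge-cases", "settlement", "api-contracts"]

if __name__ == "__main__":
    print(reciprocal_rank_fusion([BM25_HITS, VECTOR_HITS]))
```

Documents that both searches agree on float to the top, and documents found by only one search still make the candidate list for the reranker.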

Query Expansion

Before any of this runs, QMD uses a custom fine-tuned 1.7B model to expand your query into multiple sub-queries: lexical variants, semantic alternatives, and HyDE (Hypothetical Document Embeddings) passages that imagine what the ideal answer document would look like. The extra sub-queries are there to improve recall, so a relevant chunk is less likely to be missed just because it uses different wording than your query.

Smart Chunking

QMD doesn't naively split your files by character count. It chunks at semantic boundaries — paragraph breaks, section headings, code block boundaries — at approximately 900 tokens with 15% overlap. For code files, it uses tree-sitter AST parsing to chunk at function and class boundaries (TypeScript, Python, Go, Rust).
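Here is a simplified illustration of boundary-aware chunking with overlap. It splits only on blank lines, counts whitespace-separated words as a stand-in for tokens, and carries trailing paragraphs into the next chunk as overlap. All of that is an assumption for the sake of the sketch; QMD's real chunker also considers headings, code fences, and AST boundaries.

```python
def chunk_paragraphs(text: str, max_tokens: int = 900, overlap: float = 0.15) -> list[str]:
    """Group paragraphs into chunks of roughly max_tokens words each,
    repeating trailing paragraphs (up to the overlap budget) in the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    overlap_budget = int(max_tokens * overlap)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(para.split())
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            # Walk backwards, keeping whole paragraphs that fit the overlap budget.
            carried: list[str] = []
            carried_tokens = 0
            for prev in reversed(current):
                prev_tokens = len(prev.split())
                if carried_tokens + prev_tokens > overlap_budget:
                    break
                carried.insert(0, prev)
                carried_tokens += prev_tokens
            current, current_tokens = carried, carried_tokens
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    paras = [" ".join([f"p{i}"] * 10) for i in range(4)]  # four 10-word paragraphs
    for chunk in chunk_paragraphs("\n\n".join(paras), max_tokens=25, overlap=0.5):
        print(repr(chunk[:20]), "...")
```

Because splits only happen between paragraphs, a chunk can run slightly over the budget, which is why the size is described as approximate.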

Three Search Modes

You can choose which layers to use:

qmd search — BM25 only. Fast, exact. Best for known terms.
qmd vsearch — Vector semantic only. Best for conceptual questions.
qmd query — Full hybrid pipeline. Best for complex questions. This is what your AI agent uses.

Local Models (~2GB Total)

Downloaded once on first run, then cached:

EmbeddingGemma 300M (~300MB) — embedding model for vector search
Qwen3 0.6B (~600MB) — reranking model
Custom 1.7B (~1GB) — query expansion model (GRPO fine-tuned)

All inference runs on your machine via node-llama-cpp. GPU acceleration supported (Metal on Mac, CUDA on Linux) but CPU works fine for the model sizes involved.

The MCP Bridge: How QMD Talks to Claude

This is where it becomes practical. QMD exposes a Model Context Protocol (MCP) server — the standard interface that Claude Code uses to communicate with external tools. The same way Claude Code can use tools like Bash, Read, or Grep, it can use QMD's search tools.

Configuration

Add QMD to your Claude Code settings once:

// ~/.claude/settings.json (global) or .mcp.json (per-project)
{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}

That's it. Claude Code now has access to QMD's tools in every session.

Available Tools

QMD exposes four MCP tools to Claude:

query — Full hybrid search. "Find me everything about payment reconciliation." Returns the most relevant chunks.
get — Retrieve a specific document by path. When Claude already knows which file it needs.
multi_get — Retrieve multiple documents at once. Efficient batch retrieval.
status — Check index health, collection stats, document counts.

HTTP Daemon Mode

QMD has a cold start of ~17 seconds — it needs to load the GGUF models into memory. For frequent use, run it as a persistent HTTP server:

qmd mcp --http

This keeps models loaded in memory on port 8181. Queries respond in milliseconds instead of waiting for model loading. Update your MCP config to point to the HTTP endpoint.
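As a hedged sketch of what "point to the HTTP endpoint" could look like, the snippet below prints an MCP config entry for the daemon. The port and the /mcp path are the ones used elsewhere in this article; the "type": "http" and "url" keys are my assumption about how Claude Code describes an HTTP MCP server, so check them against your client's documentation before relying on this.

```python
import json


def http_mcp_config(port: int = 8181) -> dict:
    """Build an MCP config entry that targets a locally running QMD daemon.

    Assumption: the client accepts {"type": "http", "url": ...} entries.
    """
    return {
        "mcpServers": {
            "qmd": {
                "type": "http",
                "url": f"http://localhost:{port}/mcp",
            }
        }
    }


if __name__ == "__main__":
    # Paste the output into .mcp.json in place of the command-based entry.
    print(json.dumps(http_mcp_config(), indent=2))
```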

What Happens During a Session

Here's the flow when Claude needs context:

1. Claude decides it needs domain knowledge to complete a task
2. Claude calls qmd query "scheduling retry logic edge cases"
3. QMD searches across all indexed knowledge files
4. QMD returns the top relevant chunks (typically 5-10 paragraphs from across multiple files)
5. Claude incorporates this context and proceeds with the task

The key insight: Claude decides when to search. You tell it in CLAUDE.md to use QMD for domain knowledge, and it makes the judgment call — just like it currently decides when to read a file or run a command.

The Migration: From One File to Many

Here's what the transition looks like in practice.

Before: Single File

project/
  CLAUDE.md              # Rules and conventions
  docs/
    ctx.md               # 4,000+ lines — everything crammed in

After: Searchable Knowledge Base

project/
  CLAUDE.md              # Rules and conventions (unchanged)
  docs/
    overview.md           # Short project summary (~200 lines, always loaded)
    knowledge/            # QMD indexes this directory
      architecture.md     # System design, service map
      api-contracts.md    # Endpoints, payloads, auth
      business-rules.md   # Core domain logic
      edge-cases.md       # The things that break
      integrations/
        stripe.md         # Payment provider details
        kafka.md          # Event streaming setup
      decisions/
        why-kafka.md      # Architecture decision records
        why-not-graphql.md
      incidents/
        2026-02-outage.md # Post-mortems and learnings

The .qmdrc Configuration

A .qmdrc file in your project root tells QMD what to index:

collections:
  - name: project-knowledge
    paths:
      - docs/knowledge/**/*.md
      - docs/overview.md

Every .md file under docs/knowledge/ is now indexed and searchable. When you add a new file, it already matches the glob, so it's picked up the next time the index updates. No config change needed.
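If you want to preview which files a pattern like docs/knowledge/**/*.md would cover, a few lines of Python can approximate it. This sketch uses pathlib's glob rules on a throwaway directory tree; QMD's own glob matching may differ in edge cases, and the file names are examples only.

```python
import tempfile
from pathlib import Path


def matching_files(root: Path, patterns: list[str]) -> list[str]:
    """Return the sorted relative paths of files under root matching any pattern."""
    found: set[str] = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                found.add(path.relative_to(root).as_posix())
    return sorted(found)


def demo() -> list[str]:
    """Build a small sample tree and report what the two patterns pick up."""
    sample = (
        "docs/overview.md",
        "docs/knowledge/architecture.md",
        "docs/knowledge/integrations/stripe.md",
        "docs/knowledge/notes.txt",   # not markdown, should be skipped
        "README.md",                  # outside the indexed paths
    )
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in sample:
            file = root / rel
            file.parent.mkdir(parents=True, exist_ok=True)
            file.write_text("placeholder\n")
        return matching_files(root, ["docs/knowledge/**/*.md", "docs/overview.md"])


if __name__ == "__main__":
    print("\n".join(demo()))
```

Pointing matching_files at your real project root is a quick way to confirm that nested folders such as integrations/ and decisions/ are covered.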

Updating CLAUDE.md

The only change to your workflow instructions:

## Knowledge Base
- `docs/overview.md` — Project overview (read at session start)
- Deep knowledge is indexed via QMD in `docs/knowledge/`
- Use `qmd query` to search for domain context before making
  architectural or business logic decisions
- When learning new information, store it in `docs/knowledge/`
  as focused files (one topic per file, kebab-case naming)
- If a relevant file exists, update it; if new topic, create a new file

Claude reads this instruction and knows: read overview.md for the big picture, search QMD for deep context.
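The "one topic per file, kebab-case naming" rule is easy to pin down. Here is a small sketch of how a topic could map to a file path; the helper names are hypothetical, and nothing in QMD requires this exact scheme.

```python
import re


def kebab_case(topic: str) -> str:
    """Lowercase the topic and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")


def knowledge_path(topic: str, area: str = "") -> str:
    """Return the docs/knowledge/ path for a topic, optionally inside a sub-area."""
    parts = ["docs", "knowledge"]
    if area:
        parts.append(kebab_case(area))
    parts.append(kebab_case(topic) + ".md")
    return "/".join(parts)


if __name__ == "__main__":
    print(knowledge_path("Why not GraphQL?", "Decisions"))
    print(knowledge_path("Business rules"))
```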

The Hybrid Approach

Notice the overview.md file. This is important. You don't go fully search-based — you keep a short overview file (~200 lines) that Claude always reads. This contains the essential context that every session needs: what the project is, the high-level architecture, the current state of work.

The deep details — the edge cases, the incident reports, the API contracts, the decision records — live in the QMD-indexed files. Claude searches for them only when it needs them.

Always loaded: CLAUDE.md (rules) + overview.md (big picture)

Searched on demand: Everything in docs/knowledge/ (deep domain knowledge)

The Day-to-Day Workflow

Here's what changes (and what doesn't) in your daily work.

Regular Coding Session

You open Claude Code on your project. You say: "Fix the bug in the scheduling retry logic."

1. Claude loads CLAUDE.md and overview.md automatically
2. Claude calls qmd query "scheduling retry logic"
3. QMD returns relevant paragraphs from scheduling-logic.md and edge-cases.md
4. Claude reads the actual code files
5. Claude fixes the bug with full domain context

You did nothing different. You asked Claude to fix a bug. Claude decided it needed context, searched QMD, got it.

You Learn Something New

You're in a meeting. A stakeholder says: "Settlement timelines are changing from T+2 to T+1 next quarter."

1. You tell Claude: "Save this: settlement timelines are changing from T+2 to T+1 starting Q3, per today's stakeholder meeting."
2. Claude opens docs/knowledge/settlement-rules.md, adds the information, and saves.
3. QMD re-indexes the changed file the next time the index updates (for example, from a post-commit hook).
4. Three weeks later, Claude is working on the payment module, queries QMD, finds the T+1 change, and accounts for it.

Your side of it is the same as before: you tell Claude to save what you learned. The difference is that Claude writes to a focused file instead of hunting for the right spot in a 4,000-line file.

Encountering Unfamiliar Code

Claude is working on a feature and hits a pattern it doesn't understand — some unusual retry mechanism in the payment module.

Before (single file): Claude reads ctx.md. The explanation is on line 3,800. Maybe Claude reads that far, maybe it doesn't.
After (QMD): Claude calls qmd query "payment module retry pattern". Gets the exact explanation from decisions/payment-architecture.md. Every time.

Adding a New Knowledge Area

You integrate a new third-party service. You tell Claude: "Create a knowledge file for the Stripe integration. Here are the key details..."

Claude creates docs/knowledge/integrations/stripe.md. It's in the indexed path. Next query finds it. No editing a giant file. No worrying about where to put it.

Installation and Setup

Step 1: Install QMD

npm install -g @tobilu/qmd

Step 2: Initialize in Your Project

cd your-project
qmd init

This creates the .qmdrc configuration file.

Step 3: Configure Collections

Edit .qmdrc to point to your knowledge files:

collections:
  - name: project-knowledge
    paths:
      - docs/knowledge/**/*.md

Step 4: Index

qmd index

First run downloads the models (~2GB) and indexes your files. Subsequent runs are incremental — only changed files are re-indexed.

Step 5: Configure MCP for Claude Code

// .mcp.json in your project root
{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}

Step 6: Update CLAUDE.md

Add the knowledge base instructions so Claude knows to use QMD:

## Knowledge Base
Deep project knowledge is stored in `docs/knowledge/` and indexed via QMD.
Use `qmd query` to search for context before making domain-specific decisions.
When new knowledge is learned, save it to the appropriate file in `docs/knowledge/`.

Optional: HTTP Daemon for Speed

qmd mcp --http  # Runs persistent server on port 8181

Eliminates the ~17-second cold start on each query. Keeps models loaded in memory. Recommended for active development sessions.

Optional: Bulletproof It (Health Doctor + Failure Notification)

Here's the failure mode nobody warns you about. QMD relies on better-sqlite3, a native Node addon. The compiled binary is locked to a specific Node.js ABI version. When you upgrade Node — via brew upgrade, nvm, fnm, whatever — the binary becomes incompatible until you rebuild it. You'll see this error:

Error: The module 'better-sqlite3/build/Release/better_sqlite3.node'
was compiled against a different Node.js version using
NODE_MODULE_VERSION 141. This version of Node.js requires
NODE_MODULE_VERSION 147. Please try re-compiling or re-installing
the module (for instance, using `npm rebuild` or `npm install`).
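The two ABI numbers in that message are what matter, and they are easy to pull out programmatically if you want to detect this failure in your own tooling. This is a minimal sketch: the function name is hypothetical, and the sample text is the error shown above.

```python
import re
from typing import Optional, Tuple

ABI_PATTERN = re.compile(r"NODE_MODULE_VERSION\s+(\d+)")


def abi_mismatch(error_text: str) -> Optional[Tuple[int, int]]:
    """Return (compiled_abi, runtime_abi) if the text reports a mismatch, else None."""
    versions = [int(v) for v in ABI_PATTERN.findall(error_text)]
    if len(versions) >= 2 and versions[0] != versions[1]:
        return versions[0], versions[1]
    return None


SAMPLE = (
    "Error: The module 'better-sqlite3/build/Release/better_sqlite3.node' "
    "was compiled against a different Node.js version using "
    "NODE_MODULE_VERSION 141. This version of Node.js requires "
    "NODE_MODULE_VERSION 147."
)

if __name__ == "__main__":
    print(abi_mismatch(SAMPLE))
```

The first number is the ABI the binary was compiled for; the second is what the currently installed Node expects.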

The fix is one command: cd $(npm root -g)/@tobilu/qmd && npm rebuild better-sqlite3. That's not the interesting part.

The interesting part is how you find out. If your post-commit hook backgrounds the reindex and redirects output to a log file — which is the recommended pattern so commits stay fast — you never see the error. Every commit looks successful. The MCP daemon keeps answering queries because it loaded the old binary into memory before the Node upgrade. Everything works. Until one reboot — and then the daemon dies on startup and your knowledge base goes dark.

I had this exact failure for several weeks before I noticed. Three layers of prevention fix it permanently.

Layer 1 — a health doctor script.

Save this as tools/qmd-doctor.sh. Run it any time something feels off. It auto-detects ABI mismatch and rebuilds, kickstarts a dead daemon, and forces a fresh-binary restart if the daemon is running on an in-memory binary older than what's on disk:

#!/bin/bash
# qmd-doctor — diagnose and auto-fix common QMD failures.
set -uo pipefail

FLAG_FILE="$HOME/.qmd-broken"
log() { echo "[qmd-doctor] $*"; }

# Resolve the @tobilu/qmd package root from the qmd binary symlink
QMD_LINK=$(command -v qmd) || { echo "qmd not in PATH"; exit 2; }
QMD_REAL=$(python3 -c "import os,sys;print(os.path.realpath(sys.argv[1]))" "$QMD_LINK")
QMD_PKG_ROOT="$(cd "$(dirname "$QMD_REAL")/.." && pwd)"

# 1. Probe CLI — detects NODE_MODULE_VERSION mismatch
CLI_OUT=$(qmd status 2>&1 | head -30)
if echo "$CLI_OUT" | grep -q "NODE_MODULE_VERSION"; then
  log "ABI mismatch — rebuilding better-sqlite3 against current Node..."
  (cd "$QMD_PKG_ROOT" && npm rebuild better-sqlite3)
fi

# 2. Probe MCP daemon — initialize handshake (only thing that returns 200)
PORT=8181
INIT='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"doctor","version":"1"}}}'
RESP=$(curl -fsS --max-time 3 -X POST "http://localhost:${PORT}/mcp" \
       -H "Content-Type: application/json" \
       -H "Accept: application/json, text/event-stream" \
       -d "$INIT" 2>/dev/null || true)

if ! echo "$RESP" | grep -q '"serverInfo"'; then
  log "Daemon not responding — kickstarting..."
  launchctl kickstart -k "gui/$(id -u)/io.qmd.daemon"
fi

# 3. Detect stale in-memory daemon (binary on disk newer than daemon start time)
DAEMON_PID=$(launchctl list io.qmd.daemon 2>/dev/null | awk '/"PID"/{gsub(/[;]/,"");print $3;exit}')
if [[ -n "$DAEMON_PID" && "$DAEMON_PID" != "0" ]]; then
  BIN="$QMD_PKG_ROOT/node_modules/better-sqlite3/build/Release/better_sqlite3.node"
  BIN_MTIME=$(stat -f %m "$BIN" 2>/dev/null)
  DAEMON_START=$(ps -p "$DAEMON_PID" -o lstart= | xargs -I{} date -j -f "%a %b %d %T %Y" "{}" "+%s" 2>/dev/null)
  if [[ "$BIN_MTIME" -gt "${DAEMON_START:-0}" ]]; then
    log "Daemon on stale in-memory binary — kickstarting to load fresh..."
    launchctl kickstart -k "gui/$(id -u)/io.qmd.daemon"
  fi
fi

# Final verify + flag management
if qmd status >/dev/null 2>&1; then
  rm -f "$FLAG_FILE"
  log "ALL GREEN."
else
  # Explicit format string: older BSD/macOS `date` builds may not accept -Iseconds
  echo "QMD BROKEN at $(date -u +%Y-%m-%dT%H:%M:%SZ); run qmd-doctor.sh" > "$FLAG_FILE"
  command -v osascript >/dev/null && osascript -e \
    'display notification "QMD broken — run tools/qmd-doctor.sh" with title "QMD Health Alert"'
  exit 1
fi

Layer 2 — a post-commit hook that surfaces failures loudly.

The naïve hook backgrounds everything to a log file and disowns the process. Fast commits, silent failures. Replace it with this version, which writes a flag file (~/.qmd-broken) and fires a macOS notification on any non-zero exit. On the next successful reindex, the flag is cleared automatically:

#!/bin/bash
# .git/hooks/post-commit (make it executable: chmod +x .git/hooks/post-commit)
{
  if qmd update 2>&1 && qmd embed 2>&1; then
    rm -f "$HOME/.qmd-broken"
  else
    {
      # Explicit format string: older BSD/macOS `date` builds may not accept -Iseconds
      echo "QMD update failed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
      echo "Recovery: $(git rev-parse --show-toplevel)/tools/qmd-doctor.sh"
      echo "See /tmp/qmd-update.log for the stack trace"
    } > "$HOME/.qmd-broken"
    command -v osascript >/dev/null && osascript -e \
      'display notification "QMD reindex failed — run tools/qmd-doctor.sh" with title "QMD Broken" sound name "Basso"'
  fi
} >> /tmp/qmd-update.log 2>&1 &
disown || true

Layer 3 — tell your AI agent to look for the flag on session start.

Add this to your CLAUDE.md (or equivalent agent rules file):

## QMD health — auto-detect + auto-fix

On session start, check for ~/.qmd-broken:

  [ -f ~/.qmd-broken ] && cat ~/.qmd-broken

If the file exists, the post-commit hook detected a reindex failure (silently,
in the background). Surface it and offer to run the doctor:

  ./tools/qmd-doctor.sh

The doctor auto-fixes: better-sqlite3 ABI mismatch after Node upgrade
(npm rebuild), daemon stopped or unresponsive (launchctl kickstart), and
daemon running on a stale in-memory binary that would die on next reboot.

Together these three layers make a silent failure very hard to miss. The hook catches the break on the first commit after it happens. The flag file persists the signal across sessions. The notification is audible and visible within moments of running git commit. The agent rules in CLAUDE.md make sure even a fresh agent session picks up the signal and knows the recovery command. The doctor itself is designed to be idempotent: safe to run anytime, and it exits zero when there's nothing to fix.

If you haven't installed these layers and the daemon stops one morning with no warning, here's the manual recovery:

# Rebuild the native module against the current Node ABI
# (quoted so a global npm root containing spaces still works)
cd "$(npm root -g)/@tobilu/qmd" && npm rebuild better-sqlite3

# Restart the daemon so it loads the fresh binary
launchctl kickstart -k "gui/$(id -u)/io.qmd.daemon"

# Verify
qmd status