The Context Wall
Your AI context file works perfectly — until it doesn't. Here's when to upgrade from a single markdown file to QMD, and exactly how to do it.
There is a pattern that every AI-native engineer discovers on their own. You create a file — ctx.md, db.md, knowledge.md, whatever you call it — and you start filling it with everything your AI agent needs to know about your project. Architecture decisions. Business rules. Domain knowledge. API contracts. Edge cases. The things that make your project yours.
It works brilliantly. Your AI agent goes from generic to contextual. From guessing to knowing. From "technically correct" to "actually solves the problem."
And then one day, it stops working. Not with a bang — with a slow fade. Your context file grew too large, and nobody told you.
This article is about recognizing that moment, understanding why it happens, and the tool that fixes it without abandoning the approach that got you here.
The Single File That Runs Everything
The pattern is simple. In The Architect's Protocol, I described how AI-native development workflows rely on a growing knowledge base — a living document that feeds your AI agent the context it needs to produce quality output. In practice, this usually looks like a single markdown file in your project root:
project/
  CLAUDE.md            # Rules, conventions, workflow instructions
  docs/
    ctx.md             # Everything the AI needs to know about the project

CLAUDE.md holds the rules — how to format code, which conventions to follow, what tools to use. It gets loaded automatically at every session start.
ctx.md holds the knowledge — what the system does, how it's built, why certain decisions were made, what the business rules are. Your AI agent reads it when it needs context.
This is the KISS approach, and it's genuinely brilliant for several reasons:
- You control what goes in. You decide what knowledge matters. You're the filter. No automated noise.
- It's transparent. You can read the file. You can see exactly what your AI knows and doesn't know.
- It's version-controlled. Git tracks every change. You can see when knowledge was added, who added it, why.
- Zero dependencies. No databases, no servers, no API keys. Just a text file.
- It works everywhere. Claude Code, Cursor, Windsurf — every AI coding tool reads markdown.
As I argued in The Knowledge Equation, the quality of your input dictates the quality of your output. A well-curated context file transforms your AI agent from a generic code generator into a domain-aware engineering partner.
So what's the problem?
The Wall
The problem is growth.
When your context file is 200 lines, everything fits. Your AI reads the whole thing, has full context, produces great output. When it's 500 lines, it's still manageable. Maybe a bit long, but it works.
Then the project evolves. You document more architecture decisions. More edge cases. More business rules. More API contracts. More incident learnings. The file grows to 1,000 lines. 2,000. 4,000. And somewhere in that growth, you hit the wall.
The wall isn't dramatic. It's subtle. Here's how to know you've hit it:
1. You start editing your context file down. You remove older information to make room for new information. Weeks later, your AI doesn't know something it used to know — because you trimmed it.
2. Your AI asks you things it should already know. Something you documented months ago. Something that's "obvious" to anyone who's worked on the project. But the file is so long that the AI either skipped it or it was pushed out of the context window.
3. You hesitate to add new knowledge. You learn something important — from a meeting, a Slack thread, a production incident — and think: "I should add this... but the file is already so long." The moment you start self-censoring your knowledge base, the system is broken.
4. You spend more time maintaining the file than writing code. Finding the right section, reorganizing content, deduplicating — the curation tax exceeds the benefit.
5. Your AI makes wrong assumptions. It confidently states something about your architecture or business rules that used to be true but was in a section it didn't read, or that you removed to keep the file manageable.
The Desk Analogy
Think of it like a desk. Your AI agent has a desk where it puts everything it needs to work with — your instructions, your knowledge file, the code it's reading, the conversation with you, and tool results. Everything has to fit on the desk at the same time.
The desk has a fixed size. That's the context window — the total amount of information an LLM can hold in working memory during a single session. Claude's context window is large (up to 1 million tokens on Opus 4.6), but it's still finite. And your knowledge file isn't the only thing competing for that space.
Small context file = a notebook on a big desk. Plenty of room for code, conversation, and tool results.
4,000-line context file = a massive binder covering half the desk. Code files are hanging off the edge. Conversation history is on the floor. Your AI literally can't see parts of the information — not because it's being lazy, but because it physically doesn't fit.
QMD = instead of putting the whole binder on the desk, you keep it in a filing cabinet next to the desk. When your AI needs something specific, it pulls out just that one page. The desk stays clear for actual work.
The Math
Here's what's actually happening. Context windows are getting larger — Claude Opus 4.6 supports up to 1 million tokens — but your context file still competes with everything else for that space: the code files being edited, the conversation history, the tool results, the system instructions. And larger context windows don't eliminate the problem — they just move the wall further out. Projects grow too.
A rough conversion: 1 line of markdown ≈ 25 tokens. With Claude Opus 4.6’s 1 million token context window, the math looks more forgiving than you’d expect:
- 500 lines ≈ 12K tokens — trivial, about 1% of context
- 2,000 lines ≈ 50K tokens — 5% of context, comfortable
- 4,000 lines ≈ 100K tokens — 10% of context, noticeable
- 6,000+ lines ≈ 150K+ tokens — 15%+ of context, significant when combined with code and conversation
- 10,000+ lines ≈ 250K+ tokens — 25% of context, now you’re definitely hitting the wall
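The arithmetic behind these bullets is easy to sanity-check yourself. A minimal sketch, using the article's 25-tokens-per-line rule of thumb and a 1M-token window — both approximations, not real tokenizer output:

```python
# Rough context-budget estimator. The 25-tokens-per-line figure and
# the 1M-token window are the article's rule of thumb, not measurements.
WINDOW = 1_000_000
TOKENS_PER_LINE = 25

def context_share(lines: int) -> float:
    """Fraction of the context window a file of `lines` lines consumes."""
    return lines * TOKENS_PER_LINE / WINDOW

for n in (500, 2_000, 4_000, 10_000):
    print(f"{n:>6} lines ≈ {n * TOKENS_PER_LINE // 1000}K tokens "
          f"({context_share(n):.0%} of context)")
```

Running this reproduces the table above: 500 lines is about 1% of the window, 10,000 lines about 25%.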
These percentages look manageable in isolation. But remember — the context window isn’t all yours. The system prompt, CLAUDE.md, conversation history, code files Claude is reading, and tool results all compete for the same space. A 100K-token context file plus a 200K conversation plus 300K of code files adds up fast. And there’s another problem: LLM attention degrades over very long contexts. Information buried in the middle of a 100K-token file gets less attention than information at the beginning and end. So even if it technically fits, your AI might still miss things.
That's the wall. Not a hard crash — a slow degradation in quality that you might not even notice until it's severe.
Enter QMD
QMD is a local, on-device hybrid search engine for markdown knowledge bases. Created by Tobias Lütke, co-founder and CEO of Shopify. Open-source, MIT-licensed, and actively developed — v2.1.0 was released in April 2026 with over 20,000 GitHub stars.
Lütke's own description: "A local search engine that lives and executes entirely on your computer. Both for you and agents." He uses it daily. Shopify integrated a version into their code monorepo for internal doc search.
The core idea is simple: instead of loading your entire knowledge base into the context window, search it and retrieve only the relevant parts. Your AI gets the 50 lines it needs instead of eating 4,000 lines it mostly doesn't.
Everything runs locally. No API keys. No cloud. No subscription. No data leaving your machine.
How QMD Works: The Technical Deep Dive
QMD combines three search techniques into a single pipeline, each progressively more intelligent:
Layer 1: BM25 Full-Text Search
The foundation. BM25 is a classical keyword-matching algorithm implemented via SQLite FTS5. When you search for "payment reconciliation," it finds documents containing those exact terms, ranked by relevance.
- Speed: Sub-10ms
- Strength: Perfect for exact terms — function names, API endpoints, specific concepts
- Weakness: Misses semantic matches. Searching "payment reconciliation" won't find a document about "settlement processing" even if it's the same concept
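To make the keyword layer concrete, here is a toy version of BM25 ranking over SQLite FTS5 — the same primitives described above, though QMD's actual schema and tokenization are certainly more involved, and the file names and contents below are invented for illustration:

```python
import sqlite3

# Toy sketch of the FTS5/BM25 layer (not QMD's real schema):
# index a few markdown chunks, then rank matches for a keyword query.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE chunks USING fts5(path, body)")
db.executemany("INSERT INTO chunks VALUES (?, ?)", [
    ("business-rules.md", "payment reconciliation runs nightly after settlement"),
    ("edge-cases.md", "retry logic backs off exponentially on timeouts"),
    ("architecture.md", "settlement processing is handled by a separate worker"),
])

# bm25(chunks) is SQLite's built-in ranking function; lower rank = more relevant.
rows = db.execute(
    "SELECT path FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("payment reconciliation",),
).fetchall()
print(rows)  # only the exact-term match; 'settlement processing' is never found
```

Note how the weakness shows up directly: the semantically related architecture.md chunk is invisible to this layer.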
Layer 2: Vector Semantic Search
QMD embeds every chunk of your knowledge base into vector space using a local Gemma 300M model (~300MB). Queries are embedded too, and results are ranked by cosine similarity.
- Speed: 50-200ms
- Strength: Finds conceptual matches. "Payment reconciliation" will find documents about "settlement processing" because they're semantically close
- Weakness: Can surface tangentially related results. Less precise than keyword matching for exact terms
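The ranking step itself is just cosine similarity. A toy sketch with hand-written 3-dimensional vectors standing in for real embeddings (QMD's actual model produces far higher-dimensional ones, and these numbers are invented):

```python
import math

# Fake "embeddings": in reality a local model produces these vectors.
docs = {
    "settlement-processing.md": (0.9, 0.1, 0.2),
    "retry-logic.md":           (0.1, 0.8, 0.3),
}
query = (0.85, 0.15, 0.25)  # pretend embedding of "payment reconciliation"

def cosine(a, b):
    """Cosine similarity: dot product normalized by vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the semantically closest doc wins, with zero shared keywords
```

This is exactly the match BM25 missed: no term overlap required, only proximity in vector space.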
Layer 3: LLM Reranking
The results from both BM25 and vector search are merged via Reciprocal Rank Fusion, then a local Qwen3 0.6B model (~600MB) re-ranks them with actual language understanding. It reads the query and each candidate result, and judges which results truly answer the question.
- Speed: 200-500ms for the full hybrid pipeline
- Strength: Context-aware ordering. Understands nuance that keyword and vector matching can't
- Weakness: Adds latency. Small model means it's good but not perfect
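The fusion step that feeds the reranker is worth seeing — Reciprocal Rank Fusion is nearly a one-liner. A sketch using the conventional k=60 constant (QMD's exact constant and the candidate lists here are assumptions):

```python
# Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
# Documents ranked high in *both* lists accumulate the largest scores.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["api-contracts.md", "edge-cases.md", "architecture.md"]
vector_hits = ["edge-cases.md", "settlement-rules.md", "api-contracts.md"]

fused = rrf([bm25_hits, vector_hits])
print(fused)  # docs appearing high in both lists float to the top
```

edge-cases.md wins despite topping neither list alone, because it ranks well in both — which is the whole point of fusing the two layers before reranking.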
Query Expansion
Before any of this runs, QMD uses a custom fine-tuned 1.7B model to expand your query into multiple sub-queries — lexical variants, semantic alternatives, and HyDE (Hypothetical Document Embeddings) that imagine what the ideal answer document would look like. This dramatically improves recall.
Smart Chunking
QMD doesn't naively split your files by character count. It chunks at semantic boundaries — paragraph breaks, section headings, code block boundaries — at approximately 900 tokens with 15% overlap. For code files, it uses tree-sitter AST parsing to chunk at function and class boundaries (TypeScript, Python, Go, Rust).
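The boundary idea is easy to illustrate. A deliberately minimal sketch that splits markdown at heading lines only — real QMD also enforces the ~900-token target, the 15% overlap, and code-block and AST boundaries, none of which this toy attempts:

```python
import re

def chunk_by_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at heading boundaries (#, ##, ... ######)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current))  # flush the previous section
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Payments\nRules here.\n## Retries\nBackoff details.\n"
print(len(chunk_by_headings(doc)))  # → 2
```

The payoff over naive character-count splitting: a retrieved chunk is always a coherent section, never a fragment that starts mid-sentence.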
Three Search Modes
You can choose which layers to use:
- `qmd search` — BM25 only. Fast, exact. Best for known terms.
- `qmd vsearch` — Vector semantic only. Best for conceptual questions.
- `qmd query` — Full hybrid pipeline. Best for complex questions. This is what your AI agent uses.
Local Models (~2GB Total)
Downloaded once on first run, then cached:
- Gemma 300M (~300MB) — embedding model for vector search
- Qwen3 0.6B (~600MB) — reranking model
- Custom 1.7B (~1GB) — query expansion model (GRPO fine-tuned)
All inference runs on your machine via node-llama-cpp. GPU acceleration supported (Metal on Mac, CUDA on Linux) but CPU works fine for the model sizes involved.
The MCP Bridge: How QMD Talks to Claude
This is where it becomes practical. QMD exposes a Model Context Protocol (MCP) server — the standard interface that Claude Code uses to communicate with external tools. The same way Claude Code can use tools like Bash, Read, or Grep, it can use QMD's search tools.
Configuration
Add QMD to your Claude Code settings once:
// ~/.claude/settings.json (global) or .mcp.json (per-project)
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}

That's it. Claude Code now has access to QMD's tools in every session.
Available Tools
QMD exposes four MCP tools to Claude:
- `query` — Full hybrid search. "Find me everything about payment reconciliation." Returns the most relevant chunks.
- `get` — Retrieve a specific document by path. When Claude already knows which file it needs.
- `multi_get` — Retrieve multiple documents at once. Efficient batch retrieval.
- `status` — Check index health, collection stats, document counts.
HTTP Daemon Mode
QMD has a cold start of ~17 seconds — it needs to load the GGUF models into memory. For frequent use, run it as a persistent HTTP server:
qmd mcp --http

This keeps models loaded in memory on port 8181. Queries respond in milliseconds instead of waiting for model loading. Update your MCP config to point to the HTTP endpoint.
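For illustration, the HTTP variant of the MCP config might look like this — the URL path and exact keys are assumptions on my part; check your QMD and Claude Code versions for the precise schema:

```json
// .mcp.json — hypothetical shape for daemon mode; verify against your version's docs
{
  "mcpServers": {
    "qmd": {
      "type": "http",
      "url": "http://127.0.0.1:8181/mcp"
    }
  }
}
```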
What Happens During a Session
Here's the flow when Claude needs context:
1. Claude decides it needs domain knowledge to complete a task
2. Claude calls `qmd query "scheduling retry logic edge cases"`
3. QMD searches across all indexed knowledge files
4. Returns the top relevant chunks (typically 5-10 paragraphs from across multiple files)
5. Claude incorporates this context and proceeds with the task
The key insight: Claude decides when to search. You tell it in CLAUDE.md to use QMD for domain knowledge, and it makes the judgment call — just like it currently decides when to read a file or run a command.
The Migration: From One File to Many
Here's what the transition looks like in practice.
Before: Single File
project/
  CLAUDE.md            # Rules and conventions
  docs/
    ctx.md             # 4,000+ lines — everything crammed in

After: Searchable Knowledge Base
project/
  CLAUDE.md                  # Rules and conventions (unchanged)
  docs/
    overview.md              # Short project summary (~200 lines, always loaded)
    knowledge/               # QMD indexes this directory
      architecture.md        # System design, service map
      api-contracts.md       # Endpoints, payloads, auth
      business-rules.md      # Core domain logic
      edge-cases.md          # The things that break
      integrations/
        stripe.md            # Payment provider details
        kafka.md             # Event streaming setup
      decisions/
        why-kafka.md         # Architecture decision records
        why-not-graphql.md
      incidents/
        2026-02-outage.md    # Post-mortems and learnings

The .qmdrc Configuration
A .qmdrc file in your project root tells QMD what to index:
collections:
  - name: project-knowledge
    paths:
      - docs/knowledge/**/*.md
      - docs/overview.md

Every .md file under docs/knowledge/ is now indexed and searchable. When you add a new file, it's automatically included — no extra step.
Updating CLAUDE.md
The only change to your workflow instructions:
## Knowledge Base
- `docs/overview.md` — Project overview (read at session start)
- Deep knowledge is indexed via QMD in `docs/knowledge/`
- Use `qmd query` to search for domain context before making
architectural or business logic decisions
- When learning new information, store it in `docs/knowledge/`
as focused files (one topic per file, kebab-case naming)
- If a relevant file exists, update it; if new topic, create a new file

Claude reads this instruction and knows: read overview.md for the big picture, search QMD for deep context.
The Hybrid Approach
Notice the overview.md file. This is important. You don't go fully search-based — you keep a short overview file (~200 lines) that Claude always reads. This contains the essential context that every session needs: what the project is, the high-level architecture, the current state of work.
The deep details — the edge cases, the incident reports, the API contracts, the decision records — live in the QMD-indexed files. Claude searches for them only when it needs them.
Always loaded: CLAUDE.md (rules) + overview.md (big picture)
Searched on demand: Everything in docs/knowledge/ (deep domain knowledge)
The Day-to-Day Workflow
Here's what changes (and what doesn't) in your daily work.
Regular Coding Session
You open Claude Code on your project. You say: "Fix the bug in the scheduling retry logic."
1. Claude loads CLAUDE.md and overview.md automatically
2. Claude calls `qmd query "scheduling retry logic"`
3. QMD returns relevant paragraphs from scheduling-logic.md and edge-cases.md
4. Claude reads the actual code files
5. Claude fixes the bug with full domain context
You did nothing different. You asked Claude to fix a bug. Claude decided it needed context, searched QMD, got it.
You Learn Something New
You're in a meeting. A stakeholder says: "Settlement timelines are changing from T+2 to T+1 next quarter."
1. You tell Claude: "Save this — settlement timelines changing from T+2 to T+1 starting Q3, per stakeholder meeting today"
2. Claude opens docs/knowledge/settlement-rules.md, adds the information, saves
3. QMD re-indexes automatically
4. Three weeks later, Claude is working on the payment module, queries QMD, finds the T+1 change, accounts for it
Same command from you. "Update docs." The difference is Claude writes to a focused file instead of finding the right spot in a 4,000-line file.
Encountering Unfamiliar Code
Claude is working on a feature and hits a pattern it doesn't understand — some unusual retry mechanism in the payment module.
- Before (single file): Claude reads ctx.md. The explanation is on line 3,800. Maybe Claude reads that far, maybe it doesn't.
- After (QMD): Claude calls `qmd query "payment module retry pattern"`. Gets the exact explanation from decisions/payment-architecture.md. Every time.
Adding a New Knowledge Area
You integrate a new third-party service. You tell Claude: "Create a knowledge file for the Stripe integration — here's the key details..."
Claude creates docs/knowledge/integrations/stripe.md. It's in the indexed path. Next query finds it. No editing a giant file. No worrying about where to put it.
Installation and Setup
Step 1: Install QMD
pip install mempalace
# or
npm install -g qmd

Step 2: Initialize in Your Project
cd your-project
qmd init

This creates the .qmdrc configuration file.
Step 3: Configure Collections
Edit .qmdrc to point to your knowledge files:
collections:
  - name: project-knowledge
    paths:
      - docs/knowledge/**/*.md

Step 4: Index

qmd index

First run downloads the models (~2GB) and indexes your files. Subsequent runs are incremental — only changed files are re-indexed.
Step 5: Configure MCP for Claude Code
// .mcp.json in your project root
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}

Step 6: Update CLAUDE.md
Add the knowledge base instructions so Claude knows to use QMD:
## Knowledge Base
Deep project knowledge is stored in `docs/knowledge/` and indexed via QMD.
Use `qmd query` to search for context before making domain-specific decisions.
When new knowledge is learned, save it to the appropriate file in `docs/knowledge/`.

Optional: HTTP Daemon for Speed
qmd mcp --http    # Runs persistent server on port 8181

Eliminates the ~17-second cold start on each query. Keeps models loaded in memory. Recommended for active development sessions.
The Quick Test: Are You at the Wall?
You don’t need to guess. One command tells you where you stand:
wc -l docs/ctx.md

Take the line count. Multiply by 25. That's your approximate token count.
Now here’s the key insight: your context file is never alone. In a typical working session, your AI agent’s context window is already occupied by:
- System prompt + CLAUDE.md: ~10–15K tokens
- Code files being read/edited: ~50–200K tokens (depends on task complexity)
- Conversation history: ~20–100K tokens (grows throughout the session)
- Tool results: ~10–50K tokens (search results, file contents, command output)
On a busy session, 300–400K tokens are already spoken for before your knowledge file even enters the picture. That leaves roughly 600K of a 1M window for your context file. Sounds like plenty — but attention degradation means your AI’s effective focus drops well before the window is full.
The Simple Formula
Here it is:
(Lines in your context file × 25) + 300,000 = approximate total tokens per session
If that number is under 500K: you’re fine. Everything fits with room to breathe.
If it’s between 500K and 700K: you’re in the zone where attention starts degrading on the context file. Watch for the five warning signs.
If it’s over 700K: you’re crowding the window. QMD will improve output quality.
In practice — allowing for heavier sessions and the fact that attention degrades well before the window fills — the useful line-count thresholds are more conservative than the raw formula suggests:
- Under 3,000 lines (≈75K tokens): Stay with a single file. No question.
- 3,000–6,000 lines (75–150K tokens): The gray zone. Combined with a heavy coding session, you might hit the wall. Watch for the warning signs.
- Over 6,000 lines (150K+ tokens): QMD territory. Your AI is almost certainly losing context in the middle of the file.
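The formula and verdicts above collapse into a few lines of code. A sketch using the article's own numbers — the 300K default overhead is the "already spoken for" estimate from earlier, not a measurement:

```python
# The article's back-of-envelope wall check:
# (lines × 25) + session overhead, judged against the 500K/700K zones.
def wall_check(context_file_lines: int, overhead_tokens: int = 300_000) -> str:
    total = context_file_lines * 25 + overhead_tokens
    if total < 500_000:
        return "fine"
    if total <= 700_000:
        return "gray zone — watch for the warning signs"
    return "at the wall — consider QMD"

print(wall_check(2_000))   # 50K + 300K = 350K → fine
print(wall_check(10_000))  # 250K + 300K = 550K → gray zone
print(wall_check(20_000))  # 500K + 300K = 800K → at the wall
```

Bump `overhead_tokens` toward 400K for heavy sessions and the zones shift earlier — which is why the line-count thresholds above are deliberately conservative.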
The Practical Test
If you want to confirm rather than estimate, do this: start a fresh Claude Code session on your project. Ask a question about something you documented in the middle of your context file — not the beginning, not the end, the middle. Something specific and domain-heavy.
If Claude nails it: your file is fine.
If Claude guesses, hallucinates, or asks you to explain: the information is there but Claude isn’t seeing it. That’s the wall.
When NOT to Use QMD
QMD is not always the answer. KISS Your AI Workflow still applies. Here's the honest breakdown:
- Under 1,000 lines: Don’t bother. Your file fits comfortably in context. QMD adds complexity for zero benefit.
- 1,000–3,000 lines: Comfortable zone. Keep your single file. Monitor it.
- 3,000–6,000 lines: Getting close to the wall. You’re eating 8–15% of context, and combined with code and conversation, things start to get crowded. Consider QMD if you’re noticing any of the five warning signs.
- 6,000–10,000 lines: You’re at the wall. QMD will make a meaningful difference. Your AI is likely missing context buried in the middle of the file.
- 10,000+ lines: You needed QMD yesterday. Your context file is consuming a quarter of the window before any work begins.
The right mental model: QMD is an upgrade path, not a starting point. Start with a single context file. When it outgrows itself — when you start noticing the warning signs — split it and add QMD. Not before.
The One Risk: Retrieval Isn't Perfect
There's a tradeoff worth being honest about.
With a single file, your AI sees everything. Nothing is missed — because everything is loaded. The cost is context window space, but the recall is 100%.
With QMD, your AI sees only what the search returns. Retrieval is good (three layers of intelligence) but not perfect. It's possible to search for "payment flow" and miss a relevant detail that's documented under "settlement processing" using terminology the search didn't connect.
This is why the hybrid approach matters. The overview.md file — always loaded, always visible — ensures Claude has the essential context regardless of search quality. The QMD-indexed files add depth. If retrieval misses something, Claude still has the overview. If you notice a gap, you can always tell Claude to read a specific file directly.
The tradeoff: 100% recall of a subset (single file, limited by what fits) vs. high recall of everything (QMD, limited by search quality). For large projects, the second option wins — because "everything" is far more valuable than "a subset."
What If Context Windows Become Unlimited?
Context windows are growing fast. GPT-4 launched in 2023 with 8K tokens. Within two years we had 128K, then 200K, then 1 million. The trajectory is clear.
So here’s the natural question: if context windows keep growing — 10 million tokens, 100 million, effectively unlimited — does the wall disappear? Does a single file win again?
Yes, in theory. If you could load your entire knowledge base, all your code, and full conversation history into a single context with perfect attention across every token — a single file would be strictly better than search. Why? Because:
- 100% recall. The AI sees everything. Nothing is missed by imperfect retrieval. No risk of searching for "payment flow" and missing relevant context filed under "settlement processing."
- Zero configuration. No MCP servers, no indexing, no .qmdrc files. Just a file.
- Full cross-referencing. The AI can connect information across sections naturally, without needing to make multiple search queries to find related context.
- Simpler workflow. One file to update. One file to version control. One file for the AI to read. KISS at its purest.
Search-based retrieval is a workaround for a limitation, not a superior architecture. Given infinite perfect context, you’d always prefer having everything visible over hoping your search query returns the right results.
But we’re not there yet, and three hard problems remain:
1. Attention degradation. Current transformer architectures don’t attend equally to all tokens. Information in the middle of a very long context gets less attention than information at the beginning and end — the "lost in the middle" problem. Even with a 1M-token window, a fact on page 200 of a 400-page document may be effectively invisible. Making attention hold up uniformly across very long contexts is an active research problem with no general solution yet.
2. Cost and latency. Processing millions of tokens is expensive and slow. Every token in the context window costs compute on every generation step. Loading 500K tokens of context to answer a question that needs 2K tokens of knowledge is wasteful — like shipping the entire library to your desk when you need one book.
3. The knowledge keeps growing too. A project that has 6,000 lines of context today will have 20,000 in a year. Context windows grow, but so does the knowledge you want to put in them. It’s an arms race, and knowledge accumulation may outpace context expansion — especially for long-running complex projects.
When Could Unlimited Context Arrive?
Optimistic estimate: 3–5 years for 10M+ token context windows with reasonably flat attention. This would push the wall far enough that most projects — even large ones — could use a single file again.
Realistic estimate: 5–10 years for truly unlimited, cost-effective context with solved attention degradation. The "lost in the middle" problem is fundamental to current architectures and may require a paradigm shift beyond transformers.
The honest answer: nobody knows. The field moves in unpredictable jumps. But even optimistically, we’re years away from context windows making retrieval tools obsolete. And even then, retrieval may remain more efficient — why load 500K tokens when you can search and load 2K?
For now — and realistically for the next 3–5 years at minimum — QMD fills the gap. Even as context windows grow, attention degradation and cost efficiency will keep retrieval-based approaches relevant. And if unlimited perfect context eventually arrives and makes QMD unnecessary? Great. You migrate back to a single file in an afternoon. The knowledge is still markdown, still portable, still yours. Nothing is lost. QMD is infrastructure you can adopt today and discard tomorrow without any lock-in.
The Bigger Picture
QMD solves a specific, mechanical problem: your knowledge base outgrew your context window. But the deeper lesson connects to something I keep coming back to in this series.
In The Knowledge Equation, I argued that domain knowledge is the real differentiator in AI-native engineering — not orchestration skill, which becomes table stakes. In The Architect's Protocol, I described how the knowledge base grows as part of the development workflow. In Things AI Is Surprisingly Bad At, I explored the limits of AI without proper context.
QMD is the infrastructure that makes all of this scale. It removes the ceiling on your knowledge base. You can write everything down — every architecture decision, every edge case, every stakeholder requirement, every incident learning — without worrying about context limits. Your AI gets the right knowledge at the right moment.
The result: you stop self-censoring your knowledge base. You stop making tradeoffs about what to include. You write freely, search intelligently, and your AI agent has access to the full depth of your domain expertise.
That's the upgrade. Not a new philosophy — just better infrastructure for the one you already have.
Related Reading
- The Knowledge Equation — Why domain knowledge is the real AI differentiator
- The Architect's Protocol — The complete AI-native development workflow
- KISS Your AI Workflow — Keep it simple until you can't
- Things AI Is Surprisingly Bad At — Why AI needs your domain context to be useful
- The Knowledge Base That Builds Itself — Let your AI maintain your knowledge base