The Context Wall
Your AI context file works perfectly — until it doesn't. Here's when to upgrade from a single markdown file to QMD, and exactly how to do it.
There is a pattern that every AI-native engineer discovers on their own. You create a file — ctx.md, db.md, knowledge.md, whatever you call it — and you start filling it with everything your AI agent needs to know about your project. Architecture decisions. Business rules. Domain knowledge. API contracts. Edge cases. The things that make your project yours.
It works brilliantly. Your AI agent goes from generic to contextual. From guessing to knowing. From "technically correct" to "actually solves the problem."
And then one day, it stops working. Not with a bang — with a slow fade. Your context file grew too large, and nobody told you.
This article is about recognizing that moment, understanding why it happens, and the tool that fixes it without abandoning the approach that got you here.
The Single File That Runs Everything
The pattern is simple. In The Architect's Protocol, I described how AI-native development workflows rely on a growing knowledge base — a living document that feeds your AI agent the context it needs to produce quality output. In practice, this usually looks like a single markdown file in your project root:
project/
  CLAUDE.md            # Rules, conventions, workflow instructions
  docs/
    ctx.md             # Everything the AI needs to know about the project

CLAUDE.md holds the rules — how to format code, which conventions to follow, what tools to use. It gets loaded automatically at every session start.
ctx.md holds the knowledge — what the system does, how it's built, why certain decisions were made, what the business rules are. Your AI agent reads it when it needs context.
This is the KISS approach, and it's genuinely brilliant for several reasons:
- You control what goes in. You decide what knowledge matters. You're the filter. No automated noise.
- It's transparent. You can read the file. You can see exactly what your AI knows and doesn't know.
- It's version-controlled. Git tracks every change. You can see when knowledge was added, who added it, why.
- Zero dependencies. No databases, no servers, no API keys. Just a text file.
- It works everywhere. Claude Code, Cursor, Windsurf — every AI coding tool reads markdown.
As I argued in The Knowledge Equation, the quality of your input dictates the quality of your output. A well-curated context file transforms your AI agent from a generic code generator into a domain-aware engineering partner.
So what's the problem?
The Wall
The problem is growth.
When your context file is 200 lines, everything fits. Your AI reads the whole thing, has full context, produces great output. When it's 500 lines, it's still manageable. Maybe a bit long, but it works.
Then the project evolves. You document more architecture decisions. More edge cases. More business rules. More API contracts. More incident learnings. The file grows to 1,000 lines. 2,000. 4,000. And somewhere in that growth, you hit the wall.
The wall isn't dramatic. It's subtle. Here's how to know you've hit it:
1. You start editing your context file down. You remove older information to make room for new information. Weeks later, your AI doesn't know something it used to know — because you trimmed it.
2. Your AI asks you things it should already know. Something you documented months ago. Something that's "obvious" to anyone who's worked on the project. But the file is so long that the AI either skipped it or it was pushed out of the context window.
3. You hesitate to add new knowledge. You learn something important — from a meeting, a Slack thread, a production incident — and think: "I should add this... but the file is already so long." The moment you start self-censoring your knowledge base, the system is broken.
4. You spend more time maintaining the file than writing code. Finding the right section, reorganizing content, deduplicating — the curation tax exceeds the benefit.
5. Your AI makes wrong assumptions. It confidently states something about your architecture or business rules that used to be true but was in a section it didn't read, or that you removed to keep the file manageable.
The Desk Analogy
Think of it like a desk. Your AI agent has a desk where it puts everything it needs to work with — your instructions, your knowledge file, the code it's reading, the conversation with you, and tool results. Everything has to fit on the desk at the same time.
The desk has a fixed size. That's the context window — the total amount of information an LLM can hold in working memory during a single session. Claude's context window is large (up to 1 million tokens on Opus 4.6), but it's still finite. And your knowledge file isn't the only thing competing for that space.
Small context file = a notebook on a big desk. Plenty of room for code, conversation, and tool results.
4,000-line context file = a massive binder covering half the desk. Code files are hanging off the edge. Conversation history is on the floor. Your AI literally can't see parts of the information — not because it's being lazy, but because it physically doesn't fit.
QMD = instead of putting the whole binder on the desk, you keep it in a filing cabinet next to the desk. When your AI needs something specific, it pulls out just that one page. The desk stays clear for actual work.
The Math
Here's what's actually happening. Context windows are getting larger — Claude Opus 4.6 supports up to 1 million tokens — but your context file still competes with everything else for that space: the code files being edited, the conversation history, the tool results, the system instructions. And larger context windows don't eliminate the problem — they just move the wall further out. Projects grow too.
A rough conversion: 1 line of markdown ≈ 25 tokens. With Claude Opus 4.6’s 1 million token context window, the math looks more forgiving than you’d expect:
- 500 lines ≈ 12K tokens — trivial, about 1% of context
- 2,000 lines ≈ 50K tokens — 5% of context, comfortable
- 4,000 lines ≈ 100K tokens — 10% of context, noticeable
- 6,000+ lines ≈ 150K+ tokens — 15%+ of context, significant when combined with code and conversation
- 10,000+ lines ≈ 250K+ tokens — 25% of context, now you’re definitely hitting the wall
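The arithmetic behind these bullets is easy to sanity-check yourself. A minimal sketch, using the article's 25-tokens-per-line rule of thumb and a 1M-token window — both approximations, not real tokenizer output:

```python
# Rough context-budget estimator. The 25-tokens-per-line figure and
# the 1M-token window are the article's rule of thumb, not measurements.
WINDOW = 1_000_000
TOKENS_PER_LINE = 25

def context_share(lines: int) -> float:
    """Fraction of the context window a file of `lines` lines consumes."""
    return lines * TOKENS_PER_LINE / WINDOW

for n in (500, 2_000, 4_000, 10_000):
    print(f"{n:>6} lines ≈ {n * TOKENS_PER_LINE // 1000}K tokens "
          f"({context_share(n):.0%} of context)")
```

Running this reproduces the table above: 500 lines is about 1% of the window, 10,000 lines about 25%.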
These percentages look manageable in isolation. But remember — the context window isn’t all yours. The system prompt, CLAUDE.md, conversation history, code files Claude is reading, and tool results all compete for the same space. A 100K-token context file plus a 200K conversation plus 300K of code files adds up fast. And there’s another problem: LLM attention degrades over very long contexts. Information buried in the middle of a 100K-token file gets less attention than information at the beginning and end. So even if it technically fits, your AI might still miss things.
That's the wall. Not a hard crash — a slow degradation in quality that you might not even notice until it's severe.
Enter QMD
QMD is a local, on-device hybrid search engine for markdown knowledge bases. Created by Tobias Lütke, co-founder and CEO of Shopify. Open-source, MIT-licensed, and actively developed — v2.1.0 was released in April 2026 with over 20,000 GitHub stars.
Lütke's own description: "A local search engine that lives and executes entirely on your computer. Both for you and agents." He uses it daily. Shopify integrated a version into their code monorepo for internal doc search.
The core idea is simple: instead of loading your entire knowledge base into the context window, search it and retrieve only the relevant parts. Your AI gets the 50 lines it needs instead of eating 4,000 lines it mostly doesn't.
Everything runs locally. No API keys. No cloud. No subscription. No data leaving your machine.
How QMD Works: The Technical Deep Dive
QMD combines three search techniques into a single pipeline, each progressively more intelligent:
Layer 1: BM25 Full-Text Search
The foundation. BM25 is a classical keyword-matching algorithm implemented via SQLite FTS5. When you search for "payment reconciliation," it finds documents containing those exact terms, ranked by relevance.
- Speed: Sub-10ms
- Strength: Perfect for exact terms — function names, API endpoints, specific concepts
- Weakness: Misses semantic matches. Searching "payment reconciliation" won't find a document about "settlement processing" even if it's the same concept
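To make the keyword layer concrete, here is a toy version of BM25 ranking over SQLite FTS5 — the same primitives described above, though QMD's actual schema and tokenization are certainly more involved, and the file names and contents below are invented for illustration:

```python
import sqlite3

# Toy sketch of the FTS5/BM25 layer (not QMD's real schema):
# index a few markdown chunks, then rank matches for a keyword query.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE chunks USING fts5(path, body)")
db.executemany("INSERT INTO chunks VALUES (?, ?)", [
    ("business-rules.md", "payment reconciliation runs nightly after settlement"),
    ("edge-cases.md", "retry logic backs off exponentially on timeouts"),
    ("architecture.md", "settlement processing is handled by a separate worker"),
])

# bm25(chunks) is SQLite's built-in ranking function; lower rank = more relevant.
rows = db.execute(
    "SELECT path FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("payment reconciliation",),
).fetchall()
print(rows)  # only the exact-term match; 'settlement processing' is never found
```

Note how the weakness shows up directly: the semantically related architecture.md chunk is invisible to this layer.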
Layer 2: Vector Semantic Search
QMD embeds every chunk of your knowledge base into vector space using a local Gemma 300M model (~300MB). Queries are embedded too, and results are ranked by cosine similarity.
- Speed: 50-200ms
- Strength: Finds conceptual matches. "Payment reconciliation" will find documents about "settlement processing" because they're semantically close
- Weakness: Can surface tangentially related results. Less precise than keyword matching for exact terms
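The ranking step itself is just cosine similarity. A toy sketch with hand-written 3-dimensional vectors standing in for real embeddings (QMD's actual model produces far higher-dimensional ones, and these numbers are invented):

```python
import math

# Fake "embeddings": in reality a local model produces these vectors.
docs = {
    "settlement-processing.md": (0.9, 0.1, 0.2),
    "retry-logic.md":           (0.1, 0.8, 0.3),
}
query = (0.85, 0.15, 0.25)  # pretend embedding of "payment reconciliation"

def cosine(a, b):
    """Cosine similarity: dot product normalized by vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the semantically closest doc wins, with zero shared keywords
```

This is exactly the match BM25 missed: no term overlap required, only proximity in vector space.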
Layer 3: LLM Reranking
The results from both BM25 and vector search are merged via Reciprocal Rank Fusion, then a local Qwen3 0.6B model (~600MB) re-ranks them with actual language understanding. It reads the query and each candidate result, and judges which results truly answer the question.
- Speed: 200-500ms for the full hybrid pipeline
- Strength: Context-aware ordering. Understands nuance that keyword and vector matching can't
- Weakness: Adds latency. Small model means it's good but not perfect
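The fusion step that feeds the reranker is worth seeing — Reciprocal Rank Fusion is nearly a one-liner. A sketch using the conventional k=60 constant (QMD's exact constant and the candidate lists here are assumptions):

```python
# Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
# Documents ranked high in *both* lists accumulate the largest scores.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["api-contracts.md", "edge-cases.md", "architecture.md"]
vector_hits = ["edge-cases.md", "settlement-rules.md", "api-contracts.md"]

fused = rrf([bm25_hits, vector_hits])
print(fused)  # docs appearing high in both lists float to the top
```

edge-cases.md wins despite topping neither list alone, because it ranks well in both — which is the whole point of fusing the two layers before reranking.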
Query Expansion
Before any of this runs, QMD uses a custom fine-tuned 1.7B model to expand your query into multiple sub-queries — lexical variants, semantic alternatives, and HyDE (Hypothetical Document Embeddings) that imagine what the ideal answer document would look like. This dramatically improves recall.
Smart Chunking
QMD doesn't naively split your files by character count. It chunks at semantic boundaries — paragraph breaks, section headings, code block boundaries — at approximately 900 tokens with 15% overlap. For code files, it uses tree-sitter AST parsing to chunk at function and class boundaries (TypeScript, Python, Go, Rust).
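The boundary idea is easy to illustrate. A deliberately minimal sketch that splits markdown at heading lines only — real QMD also enforces the ~900-token target, the 15% overlap, and code-block and AST boundaries, none of which this toy attempts:

```python
import re

def chunk_by_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at heading boundaries (#, ##, ... ######)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current))  # flush the previous section
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Payments\nRules here.\n## Retries\nBackoff details.\n"
print(len(chunk_by_headings(doc)))  # → 2
```

The payoff over naive character-count splitting: a retrieved chunk is always a coherent section, never a fragment that starts mid-sentence.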
Three Search Modes
You can choose which layers to use:
- `qmd search` — BM25 only. Fast, exact. Best for known terms.
- `qmd vsearch` — Vector semantic only. Best for conceptual questions.
- `qmd query` — Full hybrid pipeline. Best for complex questions. This is what your AI agent uses.
Local Models (~2GB Total)
Downloaded once on first run, then cached:
- Gemma 300M (~300MB) — embedding model for vector search
- Qwen3 0.6B (~600MB) — reranking model
- Custom 1.7B (~1GB) — query expansion model (GRPO fine-tuned)
All inference runs on your machine via node-llama-cpp. GPU acceleration supported (Metal on Mac, CUDA on Linux) but CPU works fine for the model sizes involved.
The MCP Bridge: How QMD Talks to Claude
This is where it becomes practical. QMD exposes a Model Context Protocol (MCP) server — the standard interface that Claude Code uses to communicate with external tools. The same way Claude Code can use tools like Bash, Read, or Grep, it can use QMD's search tools.
Configuration
Add QMD to your Claude Code settings once:
// ~/.claude/settings.json (global) or .mcp.json (per-project)
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}

That's it. Claude Code now has access to QMD's tools in every session.
Available Tools
QMD exposes four MCP tools to Claude:
- `query` — Full hybrid search. "Find me everything about payment reconciliation." Returns the most relevant chunks.
- `get` — Retrieve a specific document by path. When Claude already knows which file it needs.
- `multi_get` — Retrieve multiple documents at once. Efficient batch retrieval.
- `status` — Check index health, collection stats, document counts.
HTTP Daemon Mode
QMD has a cold start of ~17 seconds — it needs to load the GGUF models into memory. For frequent use, run it as a persistent HTTP server:
qmd mcp --http

This keeps models loaded in memory on port 8181. Queries respond in milliseconds instead of waiting for model loading. Update your MCP config to point to the HTTP endpoint.
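For illustration, the HTTP variant of the MCP config might look like this — the URL path and exact keys are assumptions on my part; check your QMD and Claude Code versions for the precise schema:

```json
// .mcp.json — hypothetical shape for daemon mode; verify against your version's docs
{
  "mcpServers": {
    "qmd": {
      "type": "http",
      "url": "http://127.0.0.1:8181/mcp"
    }
  }
}
```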
What Happens During a Session
Here's the flow when Claude needs context:
1. Claude decides it needs domain knowledge to complete a task
2. Claude calls `qmd query "scheduling retry logic edge cases"`
3. QMD searches across all indexed knowledge files
4. Returns the top relevant chunks (typically 5-10 paragraphs from across multiple files)
5. Claude incorporates this context and proceeds with the task
The key insight: Claude decides when to search. You tell it in CLAUDE.md to use QMD for domain knowledge, and it makes the judgment call — just like it currently decides when to read a file or run a command.
The Migration: From One File to Many
Here's what the transition looks like in practice.
Before: Single File
project/
  CLAUDE.md            # Rules and conventions
  docs/
    ctx.md             # 4,000+ lines — everything crammed in

After: Searchable Knowledge Base
project/
  CLAUDE.md                  # Rules and conventions (unchanged)
  docs/
    overview.md              # Short project summary (~200 lines, always loaded)
    knowledge/               # QMD indexes this directory
      architecture.md        # System design, service map
      api-contracts.md       # Endpoints, payloads, auth
      business-rules.md      # Core domain logic
      edge-cases.md          # The things that break
      integrations/
        stripe.md            # Payment provider details
        kafka.md             # Event streaming setup
      decisions/
        why-kafka.md         # Architecture decision records
        why-not-graphql.md
      incidents/
        2026-02-outage.md    # Post-mortems and learnings

The .qmdrc Configuration
A .qmdrc file in your project root tells QMD what to index:
collections:
  - name: project-knowledge
    paths:
      - docs/knowledge/**/*.md
      - docs/overview.md

Every .md file under docs/knowledge/ is now indexed and searchable. When you add a new file, it's automatically included — no extra step.
Updating CLAUDE.md
The only change to your workflow instructions:
## Knowledge Base
- `docs/overview.md` — Project overview (read at session start)
- Deep knowledge is indexed via QMD in `docs/knowledge/`
- Use `qmd query` to search for domain context before making
architectural or business logic decisions
- When learning new information, store it in `docs/knowledge/`
as focused files (one topic per file, kebab-case naming)
- If a relevant file exists, update it; if new topic, create a new file

Claude reads this instruction and knows: read overview.md for the big picture, search QMD for deep context.
The Hybrid Approach
Notice the overview.md file. This is important. You don't go fully search-based — you keep a short overview file (~200 lines) that Claude always reads. This contains the essential context that every session needs: what the project is, the high-level architecture, the current state of work.
The deep details — the edge cases, the incident reports, the API contracts, the decision records — live in the QMD-indexed files. Claude searches for them only when it needs them.
Always loaded: CLAUDE.md (rules) + overview.md (big picture)
Searched on demand: Everything in docs/knowledge/ (deep domain knowledge)
The Day-to-Day Workflow
Here's what changes (and what doesn't) in your daily work.
Regular Coding Session
You open Claude Code on your project. You say: "Fix the bug in the scheduling retry logic."
1. Claude loads CLAUDE.md and overview.md automatically
2. Claude calls `qmd query "scheduling retry logic"`
3. QMD returns relevant paragraphs from scheduling-logic.md and edge-cases.md
4. Claude reads the actual code files
5. Claude fixes the bug with full domain context
You did nothing different. You asked Claude to fix a bug. Claude decided it needed context, searched QMD, got it.
You Learn Something New
You're in a meeting. A stakeholder says: "Settlement timelines are changing from T+2 to T+1 next quarter."
1. You tell Claude: "Save this — settlement timelines changing from T+2 to T+1 starting Q3, per stakeholder meeting today"
2. Claude opens docs/knowledge/settlement-rules.md, adds the information, saves
3. QMD re-indexes automatically
4. Three weeks later, Claude is working on the payment module, queries QMD, finds the T+1 change, accounts for it
Same command from you. "Update docs." The difference is Claude writes to a focused file instead of finding the right spot in a 4,000-line file.
Encountering Unfamiliar Code
Claude is working on a feature and hits a pattern it doesn't understand — some unusual retry mechanism in the payment module.
- Before (single file): Claude reads ctx.md. The explanation is on line 3,800. Maybe Claude reads that far, maybe it doesn't.
- After (QMD): Claude calls `qmd query "payment module retry pattern"`. Gets the exact explanation from decisions/payment-architecture.md. Every time.
Adding a New Knowledge Area
You integrate a new third-party service. You tell Claude: "Create a knowledge file for the Stripe integration — here's the key details..."
Claude creates docs/knowledge/integrations/stripe.md. It's in the indexed path. Next query finds it. No editing a giant file. No worrying about where to put it.
Installation and Setup
Step 1: Install QMD
pip install mempalace
# or
npm install -g qmd

Step 2: Initialize in Your Project
cd your-project
qmd init

This creates the .qmdrc configuration file.
Step 3: Configure Collections
Edit .qmdrc to point to your knowledge files:
collections:
  - name: project-knowledge
    paths:
      - docs/knowledge/**/*.md

Step 4: Index

qmd index

First run downloads the models (~2GB) and indexes your files. Subsequent runs are incremental — only changed files are re-indexed.
Step 5: Configure MCP for Claude Code
// .mcp.json in your project root
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}

Step 6: Update CLAUDE.md
Add the knowledge base instructions so Claude knows to use QMD:
## Knowledge Base
Deep project knowledge is stored in `docs/knowledge/` and indexed via QMD.
Use `qmd query` to search for context before making domain-specific decisions.
When new knowledge is learned, save it to the appropriate file in `docs/knowledge/`.

Optional: HTTP Daemon for Speed
qmd mcp --http    # Runs persistent server on port 8181

Eliminates the ~17-second cold start on each query. Keeps models loaded in memory. Recommended for active development sessions.
The Quick Test: Are You at the Wall?
You don’t need to guess. One command tells you where you stand:
wc -l docs/ctx.md

Take the line count. Multiply by 25. That's your approximate token count.
Now here’s the key insight: your context file is never alone. In a typical working session, your AI agent’s context window is already occupied by:
- System prompt + CLAUDE.md: ~10–15K tokens
- Code files being read/edited: ~50–200K tokens (depends on task complexity)
- Conversation history: ~20–100K tokens (grows throughout the session)
- Tool results: ~10–50K tokens (search results, file contents, command output)
On a busy session, 300–400K tokens are already spoken for before your knowledge file even enters the picture. That leaves roughly 600K of a 1M window for your context file. Sounds like plenty — but attention degradation means your AI’s effective focus drops well before the window is full.
The Simple Formula
Here it is:
(Lines in your context file × 25) + 300,000 = approximate total tokens per session
If that number is under 500K: you’re fine. Everything fits with room to breathe.
If it’s between 500K and 700K: you’re in the zone where attention starts degrading on the context file. Watch for the five warning signs.
If it’s over 700K: you’re crowding the window. QMD will improve output quality.
In practice — allowing for heavier sessions and the fact that attention degrades well before the window fills — the useful line-count thresholds are more conservative than the raw formula suggests:
- Under 3,000 lines (≈75K tokens): Stay with a single file. No question.
- 3,000–6,000 lines (75–150K tokens): The gray zone. Combined with a heavy coding session, you might hit the wall. Watch for the warning signs.
- Over 6,000 lines (150K+ tokens): QMD territory. Your AI is almost certainly losing context in the middle of the file.
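The formula and verdicts above collapse into a few lines of code. A sketch using the article's own numbers — the 300K default overhead is the "already spoken for" estimate from earlier, not a measurement:

```python
# The article's back-of-envelope wall check:
# (lines × 25) + session overhead, judged against the 500K/700K zones.
def wall_check(context_file_lines: int, overhead_tokens: int = 300_000) -> str:
    total = context_file_lines * 25 + overhead_tokens
    if total < 500_000:
        return "fine"
    if total <= 700_000:
        return "gray zone — watch for the warning signs"
    return "at the wall — consider QMD"

print(wall_check(2_000))   # 50K + 300K = 350K → fine
print(wall_check(10_000))  # 250K + 300K = 550K → gray zone
print(wall_check(20_000))  # 500K + 300K = 800K → at the wall
```

Bump `overhead_tokens` toward 400K for heavy sessions and the zones shift earlier — which is why the line-count thresholds above are deliberately conservative.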
The Practical Test
If you want to confirm rather than estimate, do this: start a fresh Claude Code session on your project. Ask a question about something you documented in the middle of your context file — not the beginning, not the end, the middle. Something specific and domain-heavy.
If Claude nails it: your file is fine.
If Claude guesses, hallucinates, or asks you to explain: the information is there but Claude isn’t seeing it. That’s the wall.
When NOT to Use QMD
QMD is not always the answer. KISS Your AI Workflow still applies. Here's the honest breakdown:
- Under 1,000 lines: Don’t bother. Your file fits comfortably in context. QMD adds complexity for zero benefit.
- 1,000–3,000 lines: Comfortable zone. Keep your single file. Monitor it.
- 3,000–6,000 lines: Getting close to the wall. You’re eating 8–15% of context, and combined with code and conversation, things start to get crowded. Consider QMD if you’re noticing any of the five warning signs.
- 6,000–10,000 lines: You’re at the wall. QMD will make a meaningful difference. Your AI is likely missing context buried in the middle of the file.
- 10,000+ lines: You needed QMD yesterday. Your context file is consuming a quarter of the window before any work begins.
The right mental model: QMD is an upgrade path, not a starting point. Start with a single context file. When it outgrows itself — when you start noticing the warning signs — split it and add QMD. Not before.
The One Risk: Retrieval Isn't Perfect
There's a tradeoff worth being honest about.
With a single file, your AI sees everything. Nothing is missed — because everything is loaded. The cost is context window space, but the recall is 100%.
With QMD, your AI sees only what the search returns. Retrieval is good (three layers of intelligence) but not perfect. It's possible to search for "payment flow" and miss a relevant detail that's documented under "settlement processing" using terminology the search didn't connect.
This is why the hybrid approach matters. The overview.md file — always loaded, always visible — ensures Claude has the essential context regardless of search quality. The QMD-indexed files add depth. If retrieval misses something, Claude still has the overview. If you notice a gap, you can always tell Claude to read a specific file directly.
The tradeoff: 100% recall of a subset (single file, limited by what fits) vs. high recall of everything (QMD, limited by search quality). For large projects, the second option wins — because "everything" is far more valuable than "a subset."
What If Context Windows Become Unlimited?
Context windows are growing fast. GPT-4 launched in 2023 with 8K tokens. Within two years we had 128K, then 200K, then 1 million. The trajectory is clear.
So here’s the natural question: if context windows keep growing — 10 million tokens, 100 million, effectively unlimited — does the wall disappear? Does a single file win again?
Yes, in theory. If you could load your entire knowledge base, all your code, and full conversation history into a single context with perfect attention across every token — a single file would be strictly better than search. Why? Because:
- 100% recall. The AI sees everything. Nothing is missed by imperfect retrieval. No risk of searching for "payment flow" and missing relevant context filed under "settlement processing."
- Zero configuration. No MCP servers, no indexing, no .qmdrc files. Just a file.
- Full cross-referencing. The AI can connect information across sections naturally, without needing to make multiple search queries to find related context.
- Simpler workflow. One file to update. One file to version control. One file for the AI to read. KISS at its purest.
Search-based retrieval is a workaround for a limitation, not a superior architecture. Given infinite perfect context, you’d always prefer having everything visible over hoping your search query returns the right results.
But we’re not there yet, and three hard problems remain:
1. Attention degradation. Current transformer architectures don’t attend equally to all tokens. Information in the middle of a very long context gets less attention than information at the beginning and end — the "lost in the middle" problem. Even with a 1M-token window, a fact on page 200 of a 400-page document may be effectively invisible. Making attention hold up uniformly across very long contexts is an active research problem with no general solution yet.
2. Cost and latency. Processing millions of tokens is expensive and slow. Every token in the context window costs compute on every generation step. Loading 500K tokens of context to answer a question that needs 2K tokens of knowledge is wasteful — like shipping the entire library to your desk when you need one book.
3. The knowledge keeps growing too. A project that has 6,000 lines of context today will have 20,000 in a year. Context windows grow, but so does the knowledge you want to put in them. It’s an arms race, and knowledge accumulation may outpace context expansion — especially for long-running complex projects.
When Could Unlimited Context Arrive?
Optimistic estimate: 3–5 years for 10M+ token context windows with reasonably flat attention. This would push the wall far enough that most projects — even large ones — could use a single file again.
Realistic estimate: 5–10 years for truly unlimited, cost-effective context with solved attention degradation. The "lost in the middle" problem is fundamental to current architectures and may require a paradigm shift beyond transformers.
The honest answer: nobody knows. The field moves in unpredictable jumps. But even optimistically, we’re years away from context windows making retrieval tools obsolete. And even then, retrieval may remain more efficient — why load 500K tokens when you can search and load 2K?
For now — and realistically for the next 3–5 years at minimum — QMD fills the gap. Even as context windows grow, attention degradation and cost efficiency will keep retrieval-based approaches relevant. And if unlimited perfect context eventually arrives and makes QMD unnecessary? Great. You migrate back to a single file in an afternoon. The knowledge is still markdown, still portable, still yours. Nothing is lost. QMD is infrastructure you can adopt today and discard tomorrow without any lock-in.
The Bigger Picture
QMD solves a specific, mechanical problem: your knowledge base outgrew your context window. But the deeper lesson connects to something I keep coming back to in this series.
In The Knowledge Equation, I argued that domain knowledge is the real differentiator in AI-native engineering — not orchestration skill, which becomes table stakes. In The Architect's Protocol, I described how the knowledge base grows as part of the development workflow. In Things AI Is Surprisingly Bad At, I explored the limits of AI without proper context.
QMD is the infrastructure that makes all of this scale. It removes the ceiling on your knowledge base. You can write everything down — every architecture decision, every edge case, every stakeholder requirement, every incident learning — without worrying about context limits. Your AI gets the right knowledge at the right moment.
The result: you stop self-censoring your knowledge base. You stop making tradeoffs about what to include. You write freely, search intelligently, and your AI agent has access to the full depth of your domain expertise.
That's the upgrade. Not a new philosophy — just better infrastructure for the one you already have.
Related Reading
- The Knowledge Equation — Why domain knowledge is the real AI differentiator
- The Architect's Protocol — The complete AI-native development workflow
- KISS Your AI Workflow — Keep it simple until you can't
- Things AI Is Surprisingly Bad At — Why AI needs your domain context to be useful
- The Knowledge Base That Builds Itself — Let your AI maintain your knowledge base