RAG, From Crayons to PhD
Five floors of understanding: a kid version, an engineer version, a tools-and-cases version, an advanced version, a PhD version. By the end you'll know what RAG is, when to use it, when to run, every tool worth caring about, and the hot takes most posts won't tell you.
Almost every "AI assistant" you've used in production in the last eighteen months is doing the same thing under the hood: looking up the right page in your data before answering. That lookup is RAG — Retrieval-Augmented Generation. The model is frozen. The knowledge is fresh. The bridge between them is a search engine you didn't realize you were running.
It's also the part most teams get wrong. Industry surveys in 2026 put the production failure rate of RAG implementations at 40 to 60 percent, and when those systems do fail, the failure is in retrieval roughly seventy-three percent of the time — not in the model. Which means the bottleneck of modern AI isn't the LLM. It's the half of the stack nobody puts on a pitch deck.
This article walks the same idea up five floors. Floor one is for a five-year-old. Floor five is the cutting edge. By the time you reach the top you'll know what RAG is, when to use it, when to run, every tool worth caring about, the alternatives worth considering, and the hot takes most posts won't tell you.
Floor 1 — Crayons (ELI5)
Imagine asking a really smart friend a question. Instead of trying to remember the answer from memory, they walk to your bookshelf, find the right book, open it to the right page, and read it before answering. That's RAG. We taught computers the same trick: look it up first, then talk.
TL;DR (engineer level)
• What. Run a search engine over your corpus → grab the most relevant chunks → stuff them into the LLM's prompt → it answers grounded in those chunks.
• When. Your knowledge changes faster than you can fine-tune. Your corpus is bigger than the context window. You need citations. You need per-tenant data isolation.
• How. Chunk → embed → index → retrieve → re-rank → augment prompt → generate → evaluate. Each stage is a failure point.
• Pitfalls. Bad chunking destroys recall. Cosine similarity is not relevance. Re-rankers matter more than your vector DB choice. Eighty percent of "RAG problems" are retrieval problems wearing a hallucination costume.
• 2026 baseline. Vanilla cosine-over-naive-chunks is dead. Hybrid retrieval (dense + keyword) plus a cross-encoder re-ranker plus query rewriting is the floor, not the ceiling.
• Don't use it for. Math, real-time state, anything that fits in 200K tokens you control, or behavior tasks where fine-tuning would be cleaner.
Floor 2 — What problem RAG actually solves
Large language models have three brutal constraints. RAG addresses all three at once.
First, knowledge cutoffs. Every model — GPT-5, Claude Opus 4.7, Gemini 3 — has a training date past which it knows nothing. It doesn't know what your company shipped yesterday, what's in last week's support tickets, what's in the PDF on your desktop.
Second, context windows are finite and expensive. Even with Gemini's two-million-token window or Claude's one-million, dumping your whole corpus into every prompt is wasteful: you pay per token, latency scales with input, and the well-documented "lost in the middle" attention degradation means the model often ignores the part you most need.
Third, hallucination without grounding. Ask a model something specific it doesn't actually know and it will confabulate plausibly-shaped nonsense. Forcing it to cite retrieved passages mostly collapses that failure mode.
RAG's answer: keep the model frozen, keep your knowledge external in a searchable form, and at query time retrieve only the relevant slices. The model becomes a reasoning engine over fresh context, not a stale knowledge store.
The trade-off is the part most tutorials skip. You're now running a search engine and a generation engine in series. Search engines are the harder half of that equation. Most teams underestimate this and ship cosine similarity over a default chunker, then wonder why their bot hallucinates inside a system that was supposed to ground it.
Floor 3 — How it actually works
The pipeline has five real stages. Each one has non-trivial choices. Each one is a failure point most teams don't monitor.
1. Ingestion and chunking. You parse documents — PDFs, HTML, code, transcripts — into text and split into chunks. Naive: every 512 tokens, regardless of meaning. Better: respect section and paragraph boundaries. Even better: contextual retrieval, which prepends an LLM-generated 50-100 token summary to each chunk before embedding (Anthropic published this technique in late 2024 and reported a 49 percent reduction in retrieval failures — it's still criminally underused). Chunking is where most RAG systems quietly die. Too small loses context. Too large dilutes the signal in the embedding.
2. Embedding. Each chunk goes through an embedding model — Voyage-3, Cohere Embed v4, OpenAI text-embedding-3-large, Nomic, Jina. Output: a dense vector, typically 768 to 3072 dimensions. The embedding model determines your semantic ceiling. A weak embedder can't be saved by a fancy re-ranker downstream.
3. Indexing. Vectors go into a vector database which does approximate nearest neighbor search (HNSW, IVF, ScaNN) for sub-second retrieval over millions of chunks. You also store a parallel keyword index — pure semantic retrieval misses exact matches like product SKUs, code symbols, and proper names. Real production systems run hybrid from day one.
4. Retrieval. At query time: optionally rewrite the query (HyDE, multi-query, decomposition), embed it, fetch the top results from both the vector index and the keyword index, fuse the two rankings with Reciprocal Rank Fusion, then re-rank the survivors with a cross-encoder. Re-ranking the top 100 down to top 10 is the single highest-leverage move in the entire pipeline.
5. Generation. Stuff the top chunks into a structured prompt — system instruction, context window, question, citation format — call the LLM, parse the citations, return the answer.
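To make the stages concrete, here's a deliberately minimal sketch of the whole loop in Python. The bag-of-words embed() and the in-memory index are toy stand-ins for a real embedding model and a real vector database, and the prompt wording and sample corpus are illustrative only, not any specific library's API.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=120, overlap=20):
    """Naive chunker: flatten paragraphs, then emit fixed-size windows with overlap."""
    words = [w for para in re.split(r"\n\s*\n", text) for w in para.split()]
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

def embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(w.lower() for w in re.findall(r"\w+", text))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def retrieve(query, index, k=3):
    qv = embed(query)
    ranked = sorted(index, key=lambda c: cosine(qv, c["vec"]), reverse=True)
    return ranked[:k]  # a production system re-ranks these with a cross-encoder here

def build_prompt(query, hits):
    context = "\n\n".join(f"[{i + 1}] {h['text']}" for i, h in enumerate(hits))
    return ("Answer using ONLY the context below and cite sources as [n].\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# Ingestion: chunk, embed, index (in memory here; a vector DB plus a keyword index in production).
corpus = "Refunds are processed within 30 days.\n\nEU shipping takes 5 business days."
index = [{"text": c, "vec": embed(c)} for c in chunk(corpus)]

# Query time: retrieve, augment the prompt, then hand it to your LLM of choice.
question = "How long do refunds take?"
print(build_prompt(question, retrieve(question, index)))
```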
Two principles most tutorials skip. Retrieval recall is not end-to-end accuracy: you can have ninety-five percent recall and still answer wrong because the right chunk was buried at position seventeen and the model attended to position three. And the cheapest correctness win is almost always re-ranking, not switching vector databases. Teams migrate Pinecone to Qdrant for a five-percent latency improvement while ignoring that adding a re-ranker would lift accuracy twenty points.
The tool landscape (opinionated, 2026)
Five layers, picks that are actually shipping in production today, brief honest take on each. If you want a comprehensive list go elsewhere — this is what I'd reach for.
Vector databases
Embedding models
Frameworks and re-rankers
Real production use cases
RAG isn't theoretical. The following systems run on it, in production, today. The pattern is almost always: hybrid retrieval, aggressive re-ranking, citations enforced, per-tenant isolation when there are multiple tenants.
Should I use RAG? A decision tree
Ask these in order. Don't skip ahead. Most projects that fail do so because they started at the bottom of the list and decided to build RAG before asking anything else.
- Is your corpus under 500K tokens AND queried fewer than 100 times a day? → Just use long context. Don't build infrastructure.
- Is the answer a deterministic lookup in a structured database? → SQL or an MCP server, not RAG.
- Is it a style, format, or persona task — not a knowledge task? → Fine-tune.
- Do you need real-time web data? → Agentic search, or RAG over a recently crawled snapshot.
- Is the question relational or multi-hop? ("Show me everything connected to X via Y") → Knowledge graph or GraphRAG.
- Is it compliance-critical exact-text retrieval? → Deterministic keyword + structured retrieval. Don't trust embeddings near regulatory text.
- Otherwise: RAG. Start with hybrid retrieval (keyword + dense) plus a re-ranker plus enforced citations. Don't ship without those three.
Anti-pattern check before building: if you can't articulate your evaluation metric in one sentence, you're not ready to build the system. Walk back to a notebook and figure out what "good" looks like.
Alternatives — when each one wins
RAG isn't the only tool. The dirty secret of 2026 is that most production AI assistants are RAG plus tool-use plus a couple of MCP servers, not pure RAG. The categories blur. Here's how to think about which weapon is right for which problem.
The two things people get wrong here: they treat long context as a RAG-killer (it's not — it's a prototyping tool) and they treat MCP as a RAG-replacer (it's not — it's the other half of the stack). The right system usually picks two or three of these and routes between them.
QMD: RAG without the infrastructure tax
If RAG is the production architecture for serving end users via an LLM, QMD is what the same idea looks like when the user is you and the corpus is your own knowledge.
QMD — the markdown search engine I run over my entire personal knowledge base — is hybrid retrieval (BM25 + dense vector + HyDE) embedded as a local daemon. No vector database to provision. No embedding pipeline to maintain. No multi-tenant permissions to debug. Point it at a folder of markdown files, it indexes them, and from that moment on Claude Code, your scripts, MCP clients — anything with shell access — can run semantic plus keyword search in a single command. The retrieval primitives that take six engineer-weeks to assemble in production RAG are a five-minute install at personal scale.
The principle most teams miss: most "we need RAG" projects are actually QMD-shaped. Someone wants to query their own docs. The right answer is rarely "spin up Pinecone, write an ingestion pipeline, hire an MLE." It is usually "embed the search, query from your tools." Graduate to full RAG only when you're shipping a customer-facing product, or when your corpus genuinely outgrows what a local hybrid index can serve in 50 milliseconds.
This same retrieval layer — running locally for a person, or running across a company for an agent fleet — is exactly what Sinapt is built on. Same primitives. Different scale.
Floor 4 — Advanced (the production-grade techniques)
If you're past the demo phase, this is the level you actually have to learn. None of these are exotic; all of them are deployed in production today. Together they're the difference between a system that wins demos and a system that survives in the wild.
Hybrid retrieval (keyword + dense + RRF)
Dense embeddings nail semantic similarity. BM25 keyword search nails exact tokens — names, IDs, code, dates. Run both, fuse the rankings with Reciprocal Rank Fusion. Single biggest correctness lift after re-ranking. Vespa, Weaviate, Qdrant, and pgvector all support this natively now. If you're running pure dense retrieval in 2026 you're leaving twenty points of accuracy on the table.
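RRF itself is about ten lines. Here's a sketch assuming each retriever hands you a ranked list of document ids; k=60 is the conventional constant from the original RRF paper.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """rankings: ranked lists of doc ids (best first), one list per retriever."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)          # reward high ranks, dampen long tails
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc7", "doc2", "doc9"]    # keyword ranking
dense_hits = ["doc2", "doc4", "doc7"]   # embedding ranking
print(rrf([bm25_hits, dense_hits]))     # doc2 and doc7 float to the top
```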
Re-ranking with cross-encoders
Bi-encoders (your embedding model) compute query and document embeddings independently — fast but lossy. Cross-encoders take query and document as a pair and compute a true relevance score. Ten to fifty times slower per pair, but you only re-rank the top 100, which is tractable. Lifts NDCG@10 by 15 to 30 points routinely. Cohere Rerank 3.5, Voyage rerank-2, and BGE-reranker-v2-m3 are the production-grade options.
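Here's what the top-100-to-top-10 step looks like as a sketch, assuming the sentence-transformers library can load the open BGE checkpoint named above; the hosted Cohere and Voyage rerankers have the same shape (score query-document pairs, sort, truncate).

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")   # scores query and doc together, as a pair

def rerank(query, candidates, top_n=10):
    """candidates: the ~100 chunk texts that survived hybrid retrieval."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```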
Late interaction (ColBERT, ColPali)
Instead of one vector per document, store one vector per token. At query time, compute MaxSim between query tokens and document tokens. Massively better retrieval, much higher storage cost. ColPali extends this to documents-as-images — embed the page screenshot, retrieve at the page level. Game-changer for PDFs, slides, and financial filings where layout matters. The whole pipeline of "parse PDF → extract text → chunk → embed text" is being replaced by "screenshot each page → embed image with VLM → late-interaction retrieve." Most RAG stacks haven't caught up. Production-ready in 2026 via Vespa, LanceDB, and Qdrant multi-vector.
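The scoring function at the heart of late interaction is small enough to show whole. This assumes you already have L2-normalized per-token embeddings from a ColBERT-style model as NumPy arrays; the encoding, storage, and pruning around it are where the real engineering lives.

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (q, d), doc_tokens: (t, d), rows L2-normalized."""
    sims = query_tokens @ doc_tokens.T        # (q, t) token-to-token similarities
    return float(sims.max(axis=1).sum())      # best doc token per query token, summed

q = np.random.randn(8, 128);   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = np.random.randn(300, 128); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim(q, d)   # rank documents by this instead of one cosine per document
```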
Query rewriting and HyDE
User queries are usually terrible retrieval queries. "How do I fix it." Rewrite with an LLM into a search-shaped query, or generate a hypothetical answer (HyDE) and embed THAT for retrieval — the hypothetical answer sits closer in embedding space to the real answer than the question does. Cheap recall boost. Almost free with prompt caching.
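The whole trick fits in a few lines. llm and embed below are placeholders for whatever client and embedding model you already call, and the prompt wording is illustrative.

```python
def hyde_query_vector(question, llm, embed):
    """Embed a hypothetical answer instead of the raw question."""
    hypothetical = llm(
        f"Write a short passage that plausibly answers this question:\n{question}"
    )
    return embed(hypothetical)   # search the index with this vector, not embed(question)
```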
Parent-document and small-to-big retrieval
Embed small chunks for precise retrieval, but return the parent document for rich context. Decouples retrieval granularity from generation context. LlamaIndex calls this auto-merging. Sentence-window retrieval is the same idea at finer granularity: embed individual sentences, expand to surrounding window after retrieval.
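A sketch of the mapping that makes small-to-big work. The index layout and the split, embed, and similarity callables are assumptions about your own stack, not a specific framework's API.

```python
def build_small_to_big(parents, split, embed):
    """parents: {parent_id: full_section_text}. Returns a flat chunk index."""
    index = []
    for pid, text in parents.items():
        for piece in split(text):                      # small units for precise matching
            index.append({"vec": embed(piece), "parent": pid})
    return index

def retrieve_parents(query_vec, index, parents, similarity, k=3):
    hits = sorted(index, key=lambda c: similarity(query_vec, c["vec"]), reverse=True)
    seen, out = set(), []
    for h in hits:
        if h["parent"] not in seen:                    # de-dupe chunks from the same parent
            seen.add(h["parent"])
            out.append(parents[h["parent"]])           # return the big context, not the chunk
        if len(out) == k:
            break
    return out
```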
Contextual retrieval (Anthropic, late 2024)
Prepend an LLM-generated 50-100 token context summary to each chunk before embedding. Cuts retrieval failures by 49 percent in Anthropic's benchmark. With prompt caching it costs almost nothing per chunk. Should be a default by now. It isn't. If you do nothing else from this article, do this one.
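In code the technique is just a loop over chunks. The prompt below paraphrases the published idea rather than quoting Anthropic's exact prompt, and llm stands in for a cached call to a cheap model.

```python
CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write a 50-100 token context situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only the context."""

def contextualize(doc_text, chunks, llm):
    out = []
    for c in chunks:
        ctx = llm(CONTEXT_PROMPT.format(doc=doc_text, chunk=c))
        out.append(f"{ctx}\n\n{c}")   # embed (and BM25-index) this combined text, not the bare chunk
    return out
```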
Floor 5 — PhD level (the cutting edge)
Vanilla RAG is a 2023 architecture. The frontier in 2026 is a small zoo of variants, each solving a different production failure mode.
GraphRAG (Microsoft, 2024)
Build a knowledge graph from your corpus by extracting entities and relationships with an LLM. Cluster the graph into communities. Generate community summaries. At query time, answer "global" questions like "what are the main themes in this corpus" by aggregating community summaries — something vanilla RAG genuinely cannot do, because there's no single chunk that contains the answer. Expensive to build (LLM-heavy ingestion: 3-5x more LLM calls) and accuracy of entity recognition is 60-85 percent depending on domain. The 2026 follow-ups (LazyGraphRAG, FastGraphRAG) reduced indexing cost by roughly 700x. Worth the budget for analytical workloads over closed corpora.
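To ground the shape of the indexing pass, here's a toy sketch using networkx for the clustering step. extract_triples and llm are placeholders for the LLM-driven parts, and real GraphRAG uses hierarchical Leiden clustering rather than the greedy modularity call here.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_graph(docs, extract_triples):
    g = nx.Graph()
    for doc in docs:
        for head, relation, tail in extract_triples(doc):   # e.g. ("Acme", "acquired", "Beta")
            g.add_edge(head, tail, relation=relation)
    return g

def community_summaries(g, llm):
    summaries = []
    for community in greedy_modularity_communities(g):
        edges = g.subgraph(community).edges(data=True)
        facts = "\n".join(f"{u} -[{d['relation']}]-> {v}" for u, v, d in edges)
        summaries.append(llm(f"Summarize the main themes in these facts:\n{facts}"))
    return summaries   # "global" questions are answered over these, not over raw chunks
```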
Self-RAG (Asai et al., 2023)
The model is fine-tuned to emit reflection tokens that decide whether to retrieve, whether the retrieved passages are relevant, and whether its own generation is supported by them. Retrieval becomes adaptive instead of always-on. Used in Cohere's Coral and similar adaptive systems.
Corrective RAG (CRAG)
A lightweight retrieval evaluator scores each retrieval as Correct, Ambiguous, or Incorrect. Correct uses the chunks. Incorrect triggers a web-search fallback. Ambiguous does both and lets the generator pick. Production pattern at Perplexity-style systems. Most "agentic RAG" deployments are really CRAG with extra steps.
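The control flow is simple enough to write down. retrieve, grade, web_search, and generate are all placeholders for your own components; the paper's grader is a small fine-tuned evaluator, but a prompted LLM works as a first pass.

```python
def corrective_rag(query, retrieve, grade, web_search, generate):
    chunks = retrieve(query)
    verdict = grade(query, chunks)            # "correct" | "ambiguous" | "incorrect"
    if verdict == "correct":
        context = chunks
    elif verdict == "incorrect":
        context = web_search(query)           # retrieval failed, fall back entirely
    else:
        context = chunks + web_search(query)  # hedge: give the generator both
    return generate(query, context)
```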
Agentic RAG
Multi-step retrieval with planning and self-critique. The agent decomposes the query, retrieves, evaluates, reformulates, retrieves again, and critiques the answer before returning. Slower but handles multi-hop questions vanilla RAG can't. Production deployments report 25 to 40 percent reduction in irrelevant retrievals — but also reveal new failure modes including retrieval loops, incorrect decisions to skip retrieval, and over-retrieval when confidence calibration breaks down. LangGraph and LlamaIndex Workflows are the production frameworks. The honest trade-off: latency and cost balloon, so most production "agentic RAG" is actually CRAG-style retry, not true multi-hop.
Multi-modal RAG
Embed text and images and tables and video frames into a shared space (Cohere Embed v4, ColPali, Voyage Multimodal). Critical for financial filings (charts), medical imaging, technical manuals, and slide decks. Increasingly the dominant approach for PDFs in 2026 — page-image embedding with late interaction beats text extraction in almost every scenario where layout carries information.
Evaluation
Production-grade RAG without evaluation is vibes-driven AI. The setup that actually catches regressions is a golden dataset of queries paired with their known-relevant chunks, retrieval metrics like recall@k and NDCG computed on every change, and an answer-quality framework such as RAGAS, wired into CI rather than run as an occasional spot check.
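The cheapest version of that is the retrieval half: a golden set of queries with the chunk ids that should come back, scored with recall@k on every change. The sketch below assumes your own retrieve function and an id field on each chunk; RAGAS or a similar framework covers the answer-quality half.

```python
def recall_at_k(golden, retrieve, k=10):
    """golden: list of (query, set_of_relevant_chunk_ids) pairs."""
    hits = 0
    for query, relevant in golden:
        retrieved = {c["id"] for c in retrieve(query)[:k]}
        if retrieved & relevant:              # at least one relevant chunk surfaced
            hits += 1
    return hits / len(golden)

# Run in CI: fail the build if recall@10 drops below the current baseline.
```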
Production failure modes (the ones you'll actually hit)
Hallucination despite retrieval. Right context retrieved, model still invents an answer. Cause: weak instruction adherence, citation format not enforced, top-K too high causing dilution.
Lost in the middle. Even with the right chunk in context, attention degrades for middle positions. Mitigation: re-rank so the best chunk is first or last; keep top-K small (3-5) for the final prompt.
Retrieval blindness. No retrieval failure mode is logged because no eval exists. Half the production RAG systems out there have zero retrieval-quality monitoring. They're flying on user complaints.
Distribution shift. Embeddings trained on Wikipedia don't generalize to your legal/medical/code corpus. Domain-tune your embedder or use a domain-trained one (Voyage-law, Voyage-code).
Chunk boundary loss. The answer spans the boundary between chunks N and N+1 and neither contains it whole. Mitigation: 10-20 percent overlap, parent-document retrieval, or contextual retrieval.
Permission leaks. A chunk from a doc the user shouldn't see ends up in their answer. Filter at retrieval time, never at generation time (a sketch follows this list). This is the actual hardest part of enterprise RAG.
Stale index. Docs change, the index doesn't update, the model cites a deleted policy. Versioning + TTL + re-embed pipelines.
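Here's the shape of retrieval-time permission filtering as a sketch. The allowed_groups field, user_groups, and similarity are assumptions about your data model; in a real vector DB the same check is a metadata filter pushed down into the ANN query and kept in sync with the source system's ACLs on every re-index.

```python
def retrieve_for_user(query_vec, index, user_groups, similarity, k=10):
    """Filter by ACL before ranking; never scrub the generated answer afterwards."""
    visible = [c for c in index if c["allowed_groups"] & user_groups]   # sets of group ids
    ranked = sorted(visible, key=lambda c: similarity(query_vec, c["vec"]), reverse=True)
    return ranked[:k]
```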
When NOT to use RAG
- Math, logic, code execution. RAG retrieves text. It doesn't compute. Use tool-use.
- Real-time state. "What's my current balance?" is an MCP or SQL query, not RAG.
- Tiny stable corpora. Just stuff it in the system prompt. Prompt caching makes this nearly free.
- Global reasoning over the corpus. "Summarize all customer complaints from Q3" can't be answered by any single chunk. Use GraphRAG or map-reduce summarization.
- When you can't evaluate. If you can't measure quality, you can't improve it. You'll ship something that demos well and silently fails in production.
- Style, tone, structured-output formatting. These are training-time problems, not retrieval-time. Fine-tune.
- Single-user, single-document workflows. Just upload the doc to Claude Projects.
- Compliance-critical exact-text retrieval. Use deterministic keyword + structured retrieval. Don't trust embeddings near regulatory text.
Hot takes nobody else will tell you
These are the things that take a few production deployments before they become obvious. Skip the learning curve.
1. Your vector database is the least important choice in your stack
Re-ranking, hybrid retrieval, and chunking strategy each matter five to ten times more than Pinecone vs Qdrant vs pgvector. Most "we need to migrate our vector database" conversations are misdiagnosed retrieval problems wearing infrastructure clothes.
2. Cosine similarity has no semantics
A 0.87 score doesn't mean "relevant." It means "similar in this embedding space, which was trained on a particular distribution." Two completely irrelevant chunks can score 0.85. A perfect match can score 0.62. Stop treating similarity scores as confidence. Calibrate, or re-rank.
3. Long context did not kill RAG
Despite every think-piece written in 2024 about context-window-as-RAG-killer. Gemini 2M and Claude 1M shifted the threshold, sure, but cost-per-query, latency, lost-in-the-middle, and per-tenant isolation kept RAG dominant for production. The right framing: long context is a prototyping tool — does this even work? — and RAG is the productionization tool. Use long context to validate the use case in a weekend, then build RAG when you ship.
4. Most "agentic RAG" is just RAG with retries
True multi-step agentic retrieval — plan → retrieve → critique → reformulate → retrieve — is rare in production because latency and cost balloon. What ships is usually CRAG-style: retrieve once, score it, fall back to web search if confidence is low. That's plenty for ninety-five percent of use cases. Don't pay for an agent loop you don't need.
5. ColPali is eating PDF RAG
The whole "parse PDF → extract text → chunk → embed text" pipeline is being replaced by "screenshot each page → embed the image with a vision-language model → late-interaction retrieve." Tables, charts, layout, multi-column docs — all of it just works. If you have a PDF-heavy corpus and you're still running text-extraction RAG in 2026, you're solving a problem that no longer exists.
6. The hardest part of enterprise RAG is permissions, not retrieval
Glean's actual moat isn't search quality. It's per-document access controls that match the source system, applied at retrieval time, never bypassed. Most internal RAG projects ignore this until a salary spreadsheet leaks to an intern via the chatbot. If you're building enterprise RAG, design the auth model on day one, not after the demo.
7. Evaluation is harder than the system
Building a RAG demo is one weekend. Building a RAG eval that catches regressions before users do is six months. Teams that skip eval ship vibes-driven AI and never know why retention is bad. Invest in RAGAS plus a golden dataset plus CI eval before you scale, not after.
The bigger picture
RAG isn't dying. It's stratifying. Vanilla 2023-era RAG is archaeology — naive chunking, pure cosine similarity, no re-ranking, no eval. The 2026 production stack is layered: hybrid retrieval, cross-encoder re-ranking, contextual chunks, late interaction for documents-as-images, agentic retries for hard queries, knowledge graphs for global questions, MCP servers for structured data. Each layer solves a specific failure mode the layer below couldn't.
The teams that win the next two years aren't the ones with the biggest model. They're the ones whose retrieval is good enough that the model never has to guess. Start there. The rest follows.
This is the bet behind Sinapt — the agent-first knowledge layer I'm building. RAG is the per-application retrieval pipeline. Sinapt is the company-wide queryable layer underneath it: one source of truth that every agent in your stack can query, instead of stitching ten separate RAG pipelines that each rebuild context from scratch. The deep dive is in Sinapt and the Queryable Company. The product itself is at sinapt.ai.
Sinapt and the Queryable Company — the layer above RAG: making your whole company queryable by agents instead of stitching ten RAG pipelines together.
Claude Killed the API Key — why MCP is reshaping the half of the AI stack that RAG never owned.
AI — Haters vs Believers — context for why the people building serious RAG today are the ones who quietly stopped having the "is AI real" debate.