Above the Model

The model is not the bottleneck anymore. The plumbing is well-known. What separates teams shipping transformative AI from teams shipping generic slop is a stack of less-discussed components: context, dreaming, sycophancy mitigation, verification, memory. The operator's manual.

The components above the model that decide AI-native output quality.

The model is no longer the altar; it is the engine sitting inside a larger machine that either sharpens it into a weapon or buries it under garbage context.

ELI5

An AI agent is like a brilliant engineer who wakes up with no memory except what you put on the desk. The model is the brain, but the desk matters: the documents, rules, tools, memory, checklist, reviewer, and test suite decide whether the brain ships excellent work or confident trash. Quality is not one thing. It is the stack around the model.

TL;DR for engineers

📋

The quality stack in six bullets:

• Model gap compressed. GPT-5.5, Claude 4.x, Gemini, and specialized coding models are all strong enough that workflow quality now dominates raw model choice.

• Context is the centerpiece. Long windows help, but the operator still decides what gets loaded, what gets evicted, what gets cached, and when a session is poisoned enough to kill.

• Sycophancy is not a personality quirk. It is an output-quality defect that makes agents validate wrong assumptions, accept bad architecture, and call broken tests "reasonable."

• Components compound. Instructions, memory, tool curation, evals, verification, skills, and workflow design — improve one and you get a bump. Improve five and the agent starts feeling like a staff engineer with a bench.

• Claude Dreaming is real, not mystical. A Managed Agents memory-consolidation job. Not weight self-training. Harvey reported 6× task completion rate after adopting it.

• The best operators don't prompt better. They run a disciplined context and verification system.

Chapter 1 — Context, the black box on the desk

In 2026, context is not "the stuff in the chat." That definition is too small. Context is the full runtime state the model can attend to while making the next decision: system prompt, developer instructions, project rules, loaded files, tool schemas, retrieved documents, prior turns, compacted summaries, memory files, scratchpads, images, diffs, test output, and sometimes hidden reasoning budget. The model does not "know" your repo. It knows the slice of the repo that made it into the window.

The windows are now absurd by 2023 standards. OpenAI's current model table lists GPT-5.5 at a 1,050,000 token context window and 128,000 max output tokens. Anthropic's context docs list Claude Opus 4.7, Opus 4.6, Sonnet 4.6, and Mythos Preview at 1M tokens, while older Sonnet 4.5/Sonnet 4-era models sit at 200K. Google's long-context docs made Gemini famous for 1M standard context and a 2M high-water mark, though not every current Gemini SKU exposes 2M by default. The practical conclusion is simple: "too much to fit" is less often the bottleneck. "Too much junk to reason through" is the bottleneck.

Long context is not infinite attention. The "lost in the middle" problem did not vanish because the number has six or seven digits. Models still overweight recent turns, system-level constraints, and salient retrieved chunks. They still miss a tiny invariant buried between 300K tokens of logs and an irrelevant architecture debate. A 1M window turns truncation failure into selection failure. It lets you put the whole cathedral in the room; it does not guarantee the model looks at the right gargoyle.

Agentic tools hide this machinery differently.

Surface	Context inputs	Compaction / Memory	Operator move
Claude Code	System prompt, recursive `CLAUDE.md`, loaded files, MCP tool names, skills, terminal output, session history.	Auto-compact near context pressure; default auto-compact when context exceeds 95%. `/compact`, `/clear`, and custom compact instructions matter.	Keep `CLAUDE.md` under 200 lines. Move procedures into skills. Clear between unrelated tasks.
Codex	Developer instructions, global and project `AGENTS.md`, tool definitions, file reads, diffs, shell output, subagent returns.	Compaction is a first-class API/docs concept; carries summaries across long sessions and supports subagents with isolated threads.	Write precise `AGENTS.md`. Delegate exploration to subagents only when their output can be summarized back cleanly.
Cursor	Open editor state, codebase index, chat history, attached files, `.cursor/rules`, memories, manual `@` context.	Rules inserted at the start of model context. Memories are project-scoped rules generated from conversations with approval.	Use scoped rules, not one giant always-on manifesto. Attach exact files for hard tasks.

Context has a blood price. Every repeated token costs money, latency, and attention. Prompt caching softens the bill but changes the game. OpenAI's prompt caching starts at 1024 tokens, depends on exact prefix matches, can reduce input costs by up to 90 percent and latency by up to 80 percent, and for GPT-5.5 defaults to extended 24-hour retention. Anthropic caching works through explicit breakpoints, with 5-minute default TTL and optional 1-hour TTL. Google Gemini has implicit caching plus explicit cached content, with 1-hour default TTL when not set. Translation: put stable instructions, tool schemas, reference docs, and reusable examples first. Put volatile user data, diffs, logs, and one-off task details last. Cache prefixes are architecture.

The operator's job tomorrow morning is not "use long context." It is:

Start every task with a context budget: what must be loaded, what can be searched on demand, what should stay out.
Use layered context: system policy → project rules → task brief → selected files → current question.
Evict deliberately. Old test output, obsolete plans, discarded branches, and dead-end investigations are poison after they stop being useful.
Start fresh when the task boundary changes. If you moved from debugging auth to rewriting pricing copy, kill the old session.
Compact before the model is drowning, not after. A good compact summary preserves decisions, changed files, failing tests, open risks, and rejected paths.
Keep an external state artifact: plan.md, issue comment, PR description, runbook, or memory note. The session is volatile. The artifact is the anchor.

Context engineering is now a discipline because context is where product quality enters the model. RAG and MCP feed the room. Context engineering decides what is on the desk.

Chapter 2 — Claude Dreaming, without the incense

First, the platform context. Claude Managed Agents is Anthropic's production runtime for agents — a hosted layer above the raw API where you define an agent (system prompt, tools, instructions), spin up sessions inside isolated environments, attach persistent memory stores, run outcomes (structured task batches), and observe everything. It is the production-grade equivalent of running Claude in a loop yourself, with the loop, the memory, the sandbox, and the orchestration handled by Anthropic's infrastructure. As of May 2026 it is in beta — every API call carries the managed-agents-2026-04-01 beta header. Dreams live inside this product. Outside it, you implement the moral equivalent yourself with the Files API, Memory tool, and your own session loop.

Second, the feature itself.

Anthropic's "Dreams" are real. They are also less mystical than the name wants them to be.

The official Claude Managed Agents docs describe dreaming as a Research Preview feature that lets Claude reflect on past sessions to curate an agent's memory and surface new insights. A dream takes an existing memory store and optionally up to 100 past sessions. It runs asynchronously, usually for minutes to tens of minutes. It produces a new output memory store. It does not mutate the input. During preview, Anthropic supports claude-opus-4-7 and claude-sonnet-4-6 as dream models. It requires the Managed Agents beta header and the dreaming-2026-04-21 beta header. It is billed at standard API token rates.

That means dreaming is not:

the model updating its weights;
a magical unconscious process inside every Claude chat;
a guarantee that future sessions become correct;
a replacement for instructions, tests, or human review.

It is closer to an automated staff-engineer retro for agent memory. The agent worked across many sessions. The memory store accumulated duplicates, stale facts, contradictory notes, and one-off scars from debugging. Dreaming reads the memory plus transcripts, extracts patterns, merges duplicates, discards stale entries, and writes a cleaner memory store that future sessions can attach.

This matters because memory rots. A project memory that says "use Jest" after the repo moved to Vitest is worse than no memory. A note that says "the billing service is unstable" after the incident was fixed turns into superstition. A hundred tiny "remember this" fragments eventually become a junk drawer. Dreaming is a consolidation pass over that drawer.

The early production data is concrete: Harvey, the legal AI company, reported task completion rates roughly 6× higher after adopting dreaming. That is not a marginal lift — that is the difference between a tool people use under duress and infrastructure people rely on.

The operator pattern is obvious:

Let agents write small memories during work: build commands, gotchas, architectural invariants, review preferences, flaky test notes.
Run dream jobs on a schedule or after major project phases.
Review the output memory store before attaching it to production agents.
Promote stable procedures into skills or project rules.
Delete stale memory aggressively.

If you are not using Claude Managed Agents, implement the poor man's version. At the end of a week, ask a fresh model to read your session summaries, PRs, commits, and memory files. Have it produce three artifacts: "facts to keep," "rules to update," and "obsolete beliefs to delete." Then edit CLAUDE.md, AGENTS.md, .cursor/rules, or your knowledge base. That is dreaming without the product wrapper.

The sharp edge: automated memory consolidation can launder mistakes into policy. If the agent incorrectly concludes "we never run integration tests locally," and you attach that memory broadly, you have taught future agents to be lazy. Dream outputs need review. Treat them like generated migrations, not scripture.

Chapter 3 — Sycophancy, the velvet failure mode

Sycophancy is the model's tendency to agree with the user, validate the user's framing, praise weak ideas, or reshape the answer around what the user seems to want rather than what is true. In coding agents it often looks polite and lethal:

"Yes, this architecture makes sense."
"Your suspicion is correct."
"This failing test is probably stale."
"Skipping the migration is acceptable for now."

The problem is not tone. The problem is that the agent stops being an instrument and becomes a mirror. It converts your uncertainty into fake confidence. It validates the bug you introduced. It calls your shortcut pragmatic because you sounded senior while asking.

OpenAI gave the industry a public scar in April 2025. A GPT-4o ChatGPT update became noticeably sycophantic and was rolled back. OpenAI's own postmortem said the update over-weighted short-term user feedback, skewing the model toward overly supportive but disingenuous behavior. The follow-up explained that user thumbs-up/down signals, memory, and other changes may have weakened the reward signal holding sycophancy in check. The important line for operators: their offline evals and A/B tests looked good, but did not explicitly track sycophancy deeply enough.

Anthropic and OpenAI's 2025 cross-evaluation found that, with the exception of OpenAI's o3 in their tested setup, all studied models from both developers struggled to some degree with sycophancy. Specialized benchmarks kept finding variants of the same disease. BrokenMath reported GPT-5 still producing sycophantic theorem-proving answers 29 percent of the time in its setting. EchoBench, a medical vision-language benchmark, reported substantial sycophancy across models, with Claude 3.7 Sonnet at 45.98 percent and GPT-4.1 at 59.15 percent in that specific medical LVLM setup. Don't compare those numbers across benchmarks as a model leaderboard. Use them as evidence that the failure mode survives scale, reasoning, and brand.

Signal	What it shows	Operator takeaway
GPT-4o rollback, Apr 2025	A mainstream model update shipped with too much agreeableness and had to be rolled back.	Personality changes can be production regressions.
Anthropic/OpenAI cross-eval, 2025	Most tested models showed some sycophancy in simulated settings.	Don't assume vendor alignment solved this globally.
BrokenMath benchmark, 2025	Even strong reasoning models can validate flawed user-provided mathematical claims.	Hard domains still need adversarial checks.
EchoBench benchmark, 2025	Medical LVLMs were vulnerable to user pressure despite high stakes.	High-stakes workflows need anti-agreement evals, not just accuracy evals.

Mitigation is not "be less nice." It is architecture:

Add explicit anti-sycophancy instructions: "Disagree when the premise is wrong. State uncertainty. Prefer correctness over validating the user."
Ask for counterarguments before implementation: "Argue against this plan as if you are reviewing the PR."
Use fresh-context review. A model that wrote the code is compromised by its own trail. A new model reading only the diff, tests, and spec catches different failures.
Run Two Models One Branch: one agent implements, another reviews on the same branch without sharing the implementation thread.
Put sycophancy in evals: prompts with wrong user assumptions, leading questions, false claims, and emotional pressure.
Measure "unjustified agreement" as a defect class.

The production cost is not theoretical. Sycophancy turns AI from leverage into confirmation bias at machine speed.

Chapter 4 — The other components that actually move quality

Instructions: the constitution layer

CLAUDE.md, AGENTS.md, .cursor/rules, system prompts, and skill headers are not documentation for humans. They are executable culture for agents. Bad instruction files fail in three ways: too long, too vague, or too aspirational. "Write clean code" is vapor. "Run pnpm test --filter api after touching services/api and do not edit generated Prisma files by hand" is instruction.

Claude Code docs recommend targeting under 200 lines per CLAUDE.md. Cursor recommends focused, scoped rules and keeping them under 500 lines. Codex reads global and project AGENTS.md, merging from root to current directory, with later files overriding earlier guidance. The pattern is the same everywhere: small, concrete, scoped, testable.

Memory: the continuity layer

Memory is not chat history. Memory is curated state that should survive sessions: project conventions, recurring mistakes, user preferences, architecture decisions, environment gotchas. Raw history is too noisy. Good memory is compressed judgment.

The anti-pattern is making memory a landfill. Every "remember this" becomes a tax on future reasoning. Memory should have owners, expiry, and promotion paths. Temporary debug fact → session summary. Stable workflow → skill. Project invariant → instruction file. Business knowledge → knowledge base. Obsolete fact → deleted.

The MCP article's "87 poorly named tools" problem is the whole game. Tool calling fails less because the model cannot call tools and more because the tool menu is a swamp. Names overlap. Descriptions lie. Schemas are too wide. Side effects are hidden. Auth errors are vague. Five tools can outperform fifty if the five are named like operations, not API endpoints.

Production rule: expose workflows, not database tables. <code>triage_incident</code> beats <code>get_logs</code>, <code>get_alerts</code>, <code>get_services</code>, <code>get_owner</code>, <code>create_ticket</code>, <code>post_slack_message</code> when the workflow is standard and safety matters.

Evals: the blood-test layer

If you are not running evals, you are editing vibes. Start small. Ten golden tasks. Ten adversarial tasks. Ten regression tasks. Score them weekly. For RAG, use retrieval precision/recall, answer faithfulness, citation quality, and permission correctness. RAGAS helps, but custom evals beat generic metrics for your business. For agents, score completion, diff correctness, tool misuse, unnecessary file churn, sycophancy, and reviewer-found defects.

The best eval is not an academic benchmark. It is yesterday's bug turned into tomorrow's tripwire.

Verification: the knife layer

Verification is where most AI-native workflows still lie to themselves. The agent says "implemented and tested." The operator skims. The diff hides a subtle regression. The test suite passed because the agent changed the test expectation.

Use the Third Pass pattern: first pass implements, second pass runs tests and fixes, third pass reviews the final diff from a cold start. The third pass should be hostile, specific, and empowered to reject. It asks: does this meet the spec, did the agent touch unrelated code, did tests prove the behavior, did generated files change correctly, did docs drift?

Model choice and subagents: the allocation layer

Don't use the most expensive model as a lifestyle choice. Use it where marginal reasoning matters: architecture, risk analysis, ambiguous debugging, final review. Use cheaper/faster models for mechanical edits, extraction, formatting, and narrow subagent exploration. Anthropic's docs say Sonnet handles most coding tasks and Opus should be reserved for harder reasoning. OpenAI lists GPT-5.5 for complex coding/professional work and GPT-5.4/mini variants for lower latency/cost. Codex subagents and Claude agent teams each burn their own context windows. Parallelism is not free. It is worth it when exploration branches can be discarded and summarized.

Skills and workflow design: the ritual layer

Skills are compressed procedures. They beat giant instruction files because they load only when needed. A good skill contains trigger conditions, exact commands, expected outputs, failure handling, and examples. Workflow design sits above skills: PRD-first, spec-first, issue-first, or exploratory. The better the workflow, the less the model improvises. Improvisation is where quality dies unless the task is genuinely creative.

Component	Impact	Most common failure	Fix tomorrow
Context	10/10	Long polluted sessions	Start fresh by task boundary
Instructions	9/10	Vague manifestos	Replace slogans with commands
Tools	9/10	Too many weak tools	Rename and collapse workflows
Verification	10/10	Trusting the agent summary	Fresh-context diff review
Memory	8/10	Junk-drawer accumulation	Prune + promote weekly
Evals	8/10	No regression corpus	Turn 20 real tasks into evals
Sycophancy mitigation	8/10	Single-thread review	Two Models One Branch

Chapter 5 — Component interaction, where quality compounds

The trap is optimizing one layer and calling the system mature. Teams buy a stronger model and leave context filthy. They install MCP servers and expose 120 vague tools. They write a giant AGENTS.md and never run evals. They use Cursor memories but never prune stale ones. They ask for a review from the same thread that authored the diff.

Quality compounds when the layers agree.

Good context makes instructions reachable. Good instructions make tools safer. Good tools reduce hallucinated actions. Good memory prevents repeated mistakes. Good evals expose regressions. Good verification blocks the failures evals miss. Good workflow design decides when to use the whole system and when to use a plain script.

Bad components also compound. A stale memory says "test suite is flaky." A sycophantic model accepts that premise. A bloated tool menu offers <code>skip_tests_for_now</code>. A weak workflow has no third-pass reviewer. The PR merges. Three days later production explains the lesson with a knife.

The operator decision framework:

If the agent is wrong because it lacked facts, fix retrieval / context.
If it had facts but ignored norms, fix instructions / rules.
If it took the wrong action, fix tool names, schemas, permissions, and workflow tools.
If it repeated an old mistake, fix memory and dreaming / consolidation.
If nobody noticed, fix verification.
If the same class of failure returns, build an eval.
If the task still needs human judgment, stop pretending the agent should own it end to end.

This is why the model is no longer the bottleneck. The bottleneck is operational maturity around the model.

Chapter 6 — Hot takes

A 1M context window mostly makes bad operators more expensive.
CLAUDE.md and AGENTS.md are not docs. They are production configuration written in prose.
The best MCP server is usually not an API wrapper. It is an opinionated workflow boundary.
Sycophancy is more dangerous in senior hands because senior engineers ask leading questions with confidence.
If your agent never disagrees with you, you are not collaborating. You are running a mirror with shell access.
The diff is the source of truth. The agent summary is marketing copy until proven otherwise.
Memory without deletion is just a slower hallucination.

Chapter 7 — The 90-day quality stack plan

Weeks 1–2: Baseline the room

Inventory your current stack. Which agents do you use daily? Claude Code, Codex, Cursor, ChatGPT, Gemini, internal harnesses. For each, list instruction files, memory locations, enabled tools, MCP servers, default model, approval mode, and current evals. Then run five representative tasks and record: tokens if available, elapsed time, files touched, tests run, reviewer defects, and whether the agent summary matched reality.

Deliverable: one ai-quality-baseline.md per repo.

Weeks 3–4: Clean instructions

Rewrite the root instruction file. Cut anything aspirational. Keep build commands, test commands, coding standards, file ownership boundaries, security rules, and review expectations. Move long procedures into skills or separate scoped rules. Add one explicit anti-sycophancy block:

Prefer correctness over agreement. If the user's premise is wrong, say so directly.
When reviewing plans or diffs, identify the strongest objection before endorsing.
Do not call work complete until tests or verification steps have actually run.

Deliverable: short root instructions plus at least two scoped rule/skill files.

Weeks 5–6: Context discipline

Define session hygiene rules. Fresh session for new task class. Compact before large implementation phases. External state artifact for work longer than one hour. "Context receipts" at handoff: files changed, tests run, decisions made, open questions, rejected approaches. Put stable prompt prefixes first for caching. Put volatile diffs last.

Deliverable: context-policy.md and a handoff template.

Weeks 7–8: Tool diet

List every enabled MCP server / tool. Disable what you do not use weekly. Rename internal tools from API nouns to operator verbs. Collapse multi-step safe workflows. Add explicit side-effect language to descriptions: "creates ticket," "writes file," "sends message," "queries read-only." Remove duplicate tools. Add permission tests.

Deliverable: tool catalog with owner, purpose, side effects, auth scope, and kill switch.

Weeks 9–10: Evals

Build a tiny eval suite. Ten common tasks. Ten failure regressions. Ten adversarial prompts, including sycophancy traps. Score manually at first. You don't need an eval platform to begin. A spreadsheet beats vibes. If RAG is involved, add retrieval / citation checks. If code is involved, add diff correctness, test behavior, and unrelated churn checks.

Deliverable: 30-task eval corpus and weekly scorecard.

Weeks 11–12: Verification loop

Institutionalize the Third Pass. The implementing agent cannot be the only reviewer. Use a fresh model / context to review final diffs. For serious work, run Two Models One Branch: executor plus reviewer. Require the reviewer to cite files and line-level concerns. Track reviewer hit rate. If the reviewer finds nothing for a month, it is too weak.

Deliverable: review prompt, review checklist, and defect log.

Week 13: Memory and dreaming

Prune memory. Delete stale facts. Promote stable procedures to skills. Run a manual or product-native dreaming pass over recent sessions. Produce three lists: keep, promote, delete. Attach only reviewed memory to future sessions. Schedule the next consolidation.

Deliverable: cleaned memory store, promoted skills, and a memory expiry rule.

After 90 days, you should have fewer instructions, fewer tools, more evals, better memory, cleaner sessions, and colder reviews. That is the smell of maturity.

Closing — Sinapt and the knowledge layer

RAG was the read layer. MCP was the read/write protocol layer. This article is the quality layer above both: the discipline that decides what context arrives, how the agent behaves, how memory evolves, how tools are exposed, how work is verified, and how the system learns without becoming a swamp.

Sinapt sits underneath that quality stack as the knowledge layer. Not another chat box. Not another vector toy. A company-wide, agent-first knowledge base that can feed the right context into Claude Code, Codex, Cursor, custom agents, and the next interface after those. The point is not to make the model "smarter." The point is to stop making every agent rediscover the same facts in a different coffin. The deep dive is in Sinapt and the Queryable Company. The product itself is at sinapt.ai.

The teams that win the next phase will not be the teams chanting the newest model name. They will be the teams with the cleanest context, the sharpest instructions, the fewest useless tools, the strongest evals, and the coldest verification loop.

The model is the engine. The quality stack is the machine.

💬

Working with a team that wants to adopt AI-native workflows at scale? I help engineering teams build this capability — workflow design, knowledge architecture, team training, and embedded engineering. → AI-Native Engineering Consulting

📖

Related reading

Opus 4.8 Would Rather Tell You It Failed — Anthropic's honesty-and-orchestration release, read as an honest scorecard.

RAG, From Crayons to PhD — the read layer. What lives in your context, where it comes from, and how to retrieve it without polluting the room.

MCP, From Pidgin to Protocol — the read+write protocol layer. How agents reach the world. The 87-tools accuracy problem. The hot take this article expands on.

Sinapt and the Queryable Company — the knowledge layer that feeds context across this whole stack at company scale.

Claude for Legal Is Claude Code Wearing a Suit. Anthropic's May 12 vertical launch is the Claude Code stack with a legal skin. claude-for-legal

Grok Build. xAI shipped Grok Build with full Claude Code convention compatibility. The interface just became a standard.

Codex Mobile. OpenAI shipped Codex inside the ChatGPT mobile app. The surface caught up. The execution boundary is the real argument.

My Agent Filed Its Own Ticket. Prompt injection lite, told as the actual incident where my coding agent silently filed a follow-up ticket because a reviewer said “or.” Same mechanism that, with a hostile reviewer, ends in the Reprompt + GitHub Copilot RCE patterns. Plan Mode + permission gates + Rule of Two as defenses.