Grok Build

xAI's terminal coding agent. Plan Mode by default, eight parallel subagents, Grok 4.3 Heavy under the hood, AGENTS.md / MCP / Skills compatible. The bets it makes, where to test them, and how to evaluate without religion.


xAI shipped Grok Build into the only AI coding market that matters: terminal-native agents that can plan, read a repo, edit files, run tools, and hand back a diff.

That is the right arena. The browser chat box is not where serious software work settles. The IDE sidebar is useful, but it is still framed as assistance. The terminal agent is different. It sits where the repo, tests, shell, hooks, package manager, deployment scripts, and local conventions already live.

I have 1,282 logged hours in Claude Code. That is useful calibration, not the point of the article. Claude Code is the reference instrument because I know what daily production use feels like at high volume. Grok Build deserves to be evaluated on its own terms: model choice, context strategy, planning UX, parallel-agent architecture, pricing, and how cleanly it plugs into the agent-native stack that is now forming around AGENTS.md, MCP, Skills, plugins, hooks, and repo-local instructions.

📋
In six bullets

Grok Build is xAI's terminal coding agent. Beta opened May 14, 2026 for SuperGrok Heavy users, with Grok 4.3 beta underneath and a Heavy-style multi-agent architecture behind it.

It does not invent a new agent dialect. It reads the conventions serious agentic repos already use: AGENTS.md, MCP servers, Skills, plugins, hooks, and project instructions.

Plan Mode is the strongest product decision. Grok Build foregrounds a gated plan before code moves: approve it, comment on steps, or rewrite it. That is not beginner UX. That is how you keep a powerful agent from spraying half-shaped changes across a repo.

The launch bet is model plus context plus parallelism. Grok 4.3 Heavy, long-context claims, and up to eight parallel subagents are the things to test. The CLI shape is now mostly table stakes.

Pricing is filter-shaped. Early access is tied to SuperGrok Heavy: $299/month standard, with a $99/month intro for six months. High-intent operator product, not a casual assistant.

The practical move is not religious switching. If you already run an agent-native repo, Grok Build is a serious second model to test against the same codebase and the same instructions.

What Grok Build actually is

Grok Build is a CLI coding agent for professional software engineering. You install it, run it inside a repo, and use it to plan and execute software tasks from the terminal.

That sentence sounds simple because the category has finally hardened. A serious coding agent now needs a few things on day one:

  • It must understand repo-level instructions like AGENTS.md.
  • It must call tools and external systems through MCP.
  • It must preserve reusable procedures through Skills or equivalent mechanisms.
  • It must respect hooks, permissions, and local execution boundaries.
  • It must produce reviewable diffs instead of vague advice.
  • It must have a planning mode that can be inspected before execution.
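Most of those requirements reduce to conventions the agent can read from the repo itself. A minimal AGENTS.md might look like this (contents are illustrative, not a spec any tool mandates):

```markdown
# AGENTS.md — repo-local agent instructions (illustrative)

## Build and test
- Install dependencies: `npm ci`
- Run the relevant tests before proposing a diff: `npm test`

## Conventions
- TypeScript strict mode; no `any` in new code.
- Files under `src/generated/` are generated; never hand-edit them.

## Boundaries
- Hand back a reviewable diff; never commit or push directly.
- Ask before touching CI config or deployment scripts.
```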

Grok Build enters with that shape. The interesting part is not that it has a terminal UI. The interesting part is that xAI is betting Grok 4.3 Heavy, large context, and parallel subagents can make that terminal UI worth paying for.

Install is the expected one-liner:

```shell
# install Grok Build
curl -fsSL https://x.ai/cli/install.sh | bash

# run it inside a real project
cd ~/Projects/your-actual-work
grok
```

The launch package is explicitly aimed at heavy users. Early beta access sits behind SuperGrok Heavy: $299/month at standard price, with a $99/month introductory price for the first six months.

That price tells you how xAI sees the product. This is not being packaged as "AI autocomplete, but stronger." It is being packaged as a premium engineering operator surface for people who already believe an agent can do real work.

The product bet

Grok Build has four bets worth taking seriously.

1. The model bet

The first bet is that Grok 4.3 beta is good enough at software reasoning to justify putting it directly in the engineering loop.

That is the only bet that really matters over time. Once a CLI can read files, edit files, run commands, use MCP, respect repo instructions, and hand back diffs, the quality ceiling comes from the model and the operating discipline around it.

For code, the model has to do more than write plausible functions. It has to hold architectural intent, notice stale assumptions, understand tests, avoid local maxima, and stop when it does not know enough. It has to be good at boring repo navigation, not just benchmark puzzles.

Grok Build's launch makes that test available in a real workflow. Not a prompt arena. Not a leaderboard. A repo, a task, a plan, a diff, and a human review.

2. The context bet

The second bet is context.

Launch coverage has talked about a 2 million token context window for Grok Build. xAI's current public API page lists grok-4.3 at 1 million tokens. Until xAI's product docs and API docs line up cleanly, treat 2 million as a beta/product-path claim and 1 million as the more conservative API-facing number.

Either way, the window is large enough to matter. The trap is pretending raw context automatically solves software engineering.

It does not.

Long context helps when the task genuinely spans a lot of source, history, logs, specs, and design notes. But the hard part is still selecting the right material, ranking it, maintaining task state, and verifying the result. A huge context window filled with the wrong files is just a more expensive way to be confused.

The practical question for Grok Build is not "how large is the number?" The practical question is: does the product retrieve and use the right slices of a repo better because that window exists?
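The selection problem can be sketched in a few lines. This is a hypothetical greedy packer, not anything from Grok Build: `score` and `size` stand in for a relevance model and a tokenizer, and the point is that the scorer decides context quality while the window only decides capacity.

```python
def pack_context(files, score, budget_tokens, size):
    """Greedy context packing: rank files by relevance, then pack
    until the token budget is spent. `score` and `size` are
    hypothetical callbacks (a relevance model and a tokenizer)."""
    picked, used = [], 0
    for f in sorted(files, key=score, reverse=True):
        cost = size(f)
        if used + cost <= budget_tokens:
            picked.append(f)
            used += cost
    return picked

# Toy repo: two relevant modules and one large stale doc.
relevance = {"core/api.py": 0.9, "docs/old_notes.md": 0.1, "core/db.py": 0.8}
sizes = {"core/api.py": 400, "docs/old_notes.md": 900, "core/db.py": 700}
picked = pack_context(list(relevance), relevance.get, 1200, sizes.get)
# picks core/api.py then core/db.py (1100 tokens); the stale doc never fits
```

Scale the budget from 1,200 to 2 million and the logic is unchanged: a bad scorer still fills the window with the wrong files.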

3. The parallel-agent bet

The third bet is parallelism.

Grok Build ships with up to eight parallel subagents across a plan, search, build cycle. Grok 4.3 Heavy itself is described as a multi-agent reasoning path. The product is leaning into agent orchestration at two levels: inside the model path and inside the CLI workflow.

This can be powerful. It can also be expensive theater.

Parallel subagents help when the work decomposes cleanly: one agent reads tests, one traces API boundaries, one searches docs, one edits a narrow module, one validates behavior. That is real leverage.

They hurt when the task needs one coherent mental model and the product sprays the problem into fragments too early. Eight agents producing eight partial interpretations is not intelligence. It is concurrency without taste.

So the thing to watch in Grok Build is decomposition quality. Does it split work the way a senior engineer would? Does it keep ownership boundaries clean? Does it merge findings into one plan instead of a pile of summaries? Does it avoid rewriting the same file from multiple angles?

That is where the architecture either earns its keep or becomes a demo feature.
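The ownership-boundary concern can be made concrete. This is a sketch of the shape a disciplined scheduler would have, not Grok Build's actual implementation: each subagent owns a disjoint slice of files, work runs in parallel, and the integration step checks that no file was claimed twice.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagents(task_files, max_agents=8):
    """Ownership-partitioned parallelism (hypothetical shape): each
    subagent owns a disjoint slice of files, so no two agents ever
    edit the same module."""
    slices = [task_files[i::max_agents] for i in range(max_agents)]
    slices = [s for s in slices if s]

    def work(owned):
        # A real subagent would plan, search, and edit here;
        # this stub just records what it owns.
        return {"owned": owned, "edits": [f + " (edited)" for f in owned]}

    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        results = list(pool.map(work, slices))

    # Integration step: one merged report, plus a check for the
    # failure mode to watch — two agents claiming the same file.
    claimed = [f for r in results for f in r["owned"]]
    assert len(claimed) == len(set(claimed)), "overlapping ownership"
    return results

reports = run_subagents(["api.py", "db.py", "ui.tsx", "tests.py"], max_agents=2)
```

The interesting engineering is everything this stub elides: how the product decides the slices, and how it merges eight partial views back into one plan.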

4. The Plan Mode bet

The fourth bet is Plan Mode. This is the feature I like most in the launch.

Grok Build foregrounds a gated plan before execution: approve the plan, comment on individual steps, or rewrite it before code moves. That is exactly the right default for a powerful coding agent.

Plan Mode is not bureaucracy. It is a control surface.

Without it, agents drift into implementation too early. They overfit the first plausible reading of the task. They start editing before the repo has taught them its shape. Then the human reviews a diff that answers the wrong question beautifully.

A visible planning gate changes the rhythm. First, inspect. Then scope. Then execute. Then verify.

That is how experienced engineers work. Grok Build making that rhythm explicit is not a small UX detail. It is a product philosophy.
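The rhythm is easy to express as a data shape. This is a hypothetical plan object, not Grok Build's actual API: steps are visible, individual steps can be commented on, and execution is hard-gated on approval.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Hypothetical Plan Mode shape: inspect, then scope, then
    (approve | comment | rewrite), and only then execute."""
    steps: list
    comments: dict = field(default_factory=dict)
    approved: bool = False

    def comment(self, step_index, note):
        # Push back on one step without rewriting the whole instruction.
        self.comments.setdefault(step_index, []).append(note)

    def approve(self):
        self.approved = True

    def execute(self):
        # The gate: no code moves until a human signs off on the plan.
        if not self.approved:
            raise RuntimeError("plan not approved: no code moves yet")
        return [f"ran: {s}" for s in self.steps]

plan = Plan(steps=["read failing test", "trace ownership boundary",
                   "edit module", "run tests"])
plan.comment(2, "limit the edit to parser.py")
plan.approve()
log = plan.execute()
```

The gate is the whole point: an unapproved plan refuses to execute, which is exactly the control surface that keeps a powerful agent from spraying half-shaped changes.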

The comparison that actually matters

Most comparisons between coding agents get stupid fast. They turn into interface fan fiction.

The useful comparison is narrower: where does Grok Build differ in ways that affect operator decisions?

| Axis | Grok Build | Practical read |
| --- | --- | --- |
| Model | Grok 4.3 beta, with Heavy-style multi-agent reasoning | This is the core test. Use it on real repo tasks, not vibes. |
| Context | Launch claims around 2M tokens; public API page shows 1M for grok-4.3 | Large either way. Value depends on retrieval quality, not just window size. |
| Parallelism | Up to eight subagents across plan, search, build | Great if decomposition is clean. Dangerous if agents overlap ownership. |
| Planning UX | Plan approval, step comments, rewrite-before-execute as default | Strong default. This is how production agent work should feel. |
| Repo conventions | Reads AGENTS.md, MCP, Skills, plugins, hooks, project instructions | Lowers trial cost. You can test Grok Build without redesigning the repo. |
| Pricing | SuperGrok Heavy: $299/mo standard, $99/mo intro for 6 months | High-friction entry. For operators who will actually put the agent to work. |
| Claude Code reference | Claude remains the obvious calibration target for many AI-native engineers | Compare model behavior, context handling, cost. Do not turn evaluation into brand loyalty. |

That is the comparison. Not "who copied whom." Not "which logo owns the terminal." The relevant questions are operational: which model solves your tasks better, which tool keeps better discipline, which pricing model survives your usage, and which one fails more safely.

Standards are now infrastructure

The most important background fact is that Grok Build did not have to invent a new agent interface.

The ecosystem has converged around a practical stack:

  • AGENTS.md for repo-local agent instructions.
  • MCP for connecting agents to tools and systems.
  • Skills for reusable procedures.
  • Hooks for policy, formatting, tests, and local workflow control.
  • Plugins for packaging capabilities.
  • Project memory and instructions for long-lived operating context.
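In practice the stack shows up as a recognizable repo shape. The layout below is illustrative; exact filenames and locations vary by tool and client:

```
your-repo/
├── AGENTS.md        # repo-local agent instructions
├── .mcp/            # MCP server configuration (location varies by client)
├── skills/          # reusable procedures the agent can load
├── hooks/           # pre/post-edit policy: formatters, tests, guards
└── src/             # the code the agent is actually allowed to touch
```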

This is good.

A world where every coding agent requires its own GROK.md, CLAUDE.md, MCP.md, plugin shape, hook syntax, and tool registry is a tax on users. It fragments the thing that should be portable: the repo's operating instructions.

Grok Build benefits from arriving after the conventions hardened. It can read the same project shape a modern agentic repo already exposes. That lowers switching cost, but more importantly it lowers testing cost.

Testing another coding agent should not require rebuilding your workflow. It should be closer to swapping the model behind the same project discipline.

That is what Grok Build makes possible if the compatibility holds in practice.

What I would test first

I would not start by asking Grok Build to build a toy app.

Toy apps are misleading. Every serious model can make a clean React dashboard against a fake API. That tells you almost nothing about production fit.

I would test Grok Build on four classes of work.

1. Large-context repo orientation

Give it a real repo with real history and ask for an architectural explanation before any edits.

Good signs:

  • It reads AGENTS.md before freelancing.
  • It identifies the main subsystems accurately.
  • It notices tests, build tooling, and package boundaries.
  • It distinguishes source of truth from generated files.
  • It asks for missing constraints instead of inventing them.

Bad signs:

  • Summarizes filenames instead of architecture.
  • Ignores project instructions.
  • Treats generated files as hand-authored source.
  • Starts editing before planning.

This is where the context claim should show up first. Not in how much it can ingest, but in whether it can orient without drowning.

2. Narrow bug fix with tests

Give it a bug with a failing test or a reproducible command.

The test is simple: does Grok Build resist the urge to patch around the symptom?

A competent agent should trace the failure, find the ownership boundary, make the smallest coherent change, run the relevant test, and explain the diff. It should not rewrite adjacent modules because it saw an opportunity to "improve" them.

This is also where Plan Mode matters. The plan should name the likely failure path, the files it expects to inspect, and the verification command before code changes.

3. Multi-file feature with clear ownership

This is the parallel-subagent test.

Ask for a feature that crosses two or three layers: API, domain logic, UI, tests. Not huge. Just enough to require coordination.

Watch whether the subagents divide the work cleanly. One should not casually edit the same file another one is reasoning about unless the product has a sane merge discipline.

The best version of this architecture feels like a senior engineer delegating narrow slices and then integrating the result. The worst version feels like multiple interns pushing into the same branch.

4. Second-opinion review

This may be the most valuable first use.

Let your primary agent write a change. Then ask Grok Build to review the branch from a clean Plan Mode pass. Or reverse it: let Grok Build write, then have another model review.

This is the Two Models, One Branch pattern. The point is not that one model is morally better. The point is that different models miss different things. A second model with the same repo instructions is cheap insurance when the task matters.
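The pattern itself is tool-agnostic and fits in a few lines. This is a sketch with stub agents; `writer` and `reviewer` are hypothetical callables standing in for two different coding agents pointed at the same repo and the same instructions.

```python
def two_models_one_branch(task, writer, reviewer):
    """Two Models, One Branch: model A proposes a diff, model B
    reviews the same branch, and shipping is gated on the review."""
    diff = writer(task)            # model A writes the change
    review = reviewer(task, diff)  # model B reviews it cold
    return {"diff": diff, "review": review, "ship": not review["blocking"]}

# Stub agents standing in for real CLIs (hypothetical):
writer = lambda task: f"diff for: {task}"
reviewer = lambda task, diff: {"blocking": False,
                               "notes": ["edge case in retry logic"]}

result = two_models_one_branch("fix pagination bug", writer, reviewer)
```

Swap either stub for Grok Build or your current agent and the pattern is unchanged; the shared repo instructions are what make the swap cheap.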

Decision framework

Use Grok Build as primary if you are already paying for SuperGrok Heavy, you like Grok's reasoning profile, and your work benefits from long context plus agent parallelism. You have access, you have real tasks, and you can judge it against your own repos.

Use Grok Build as a hedge if you already operate agent-native repos with AGENTS.md, MCP, Skills, and hooks. This is where the product is easiest to evaluate. Install it, point it at a known project, compare output quality on the same tasks.

Wait if you are price-sensitive, happy with your current coding agent, or not yet using repo-local instructions seriously. A $299/month entry point is hard to justify if your workflow is still "ask the model in chat, paste code manually."

Do not switch because a launch thread looked good. Do not refuse to test because you already have a favorite tool. Both are low-grade engineering decisions.

Run the product against your repo. Compare plans. Compare diffs. Compare test behavior. Compare failure modes. That is the only evaluation that matters.

Eight takes on Grok Build

1. Grok Build's best product decision is Plan Mode. The product is telling users that planning is part of execution, not a detour before execution. That is correct.

2. The $299/month tier is a filter. It will keep casual users away, but it also means early feedback should come from people with serious workloads. That can sharpen the product faster than a huge free-tier crowd asking it to clone landing pages.

3. The context number is less important than context use. A 1M or 2M window is only valuable if the agent can retrieve, rank, and reason over the right material. Otherwise it is just a bigger room to lose your keys in.

4. Eight subagents is either the feature or the liability. If Grok Build decomposes well, parallelism becomes leverage. If it decomposes badly, it becomes non-deterministic churn with better branding.

5. Compatibility with AGENTS.md, MCP, Skills, plugins, and hooks is not a side feature. It is the reason Grok Build can be evaluated quickly by serious operators. Standards turn adoption from a migration into a trial.

6. The model bet is honest. xAI is not pretending the terminal chrome is the magic. The implied claim is: Grok 4.3 Heavy plus long context plus orchestration produces better engineering outcomes. That is testable.

7. Plan comments are underrated. Being able to push back on one step instead of rewriting the whole instruction is exactly the kind of human-agent control surface that compounds in daily work.

8. The market does not need another AI coding assistant. It needs agents that can take scoped responsibility inside real engineering systems. Grok Build is interesting because it is aimed at that category from day one.

Closing

Grok Build is a serious entrant because it starts in the right place: the terminal, the repo, the plan, the diff, the local instruction stack.

The product should not be judged by whether it resembles another coding CLI. In 2026, a capable agent CLI is supposed to understand AGENTS.md, MCP, Skills, hooks, plugins, .env boundaries, project instructions, and local verification. That is the baseline now. The interesting question is what Grok Build does above that baseline.

The answer is clear enough to test: Grok 4.3 Heavy, long context, parallel subagents, and a planning-first workflow.

For operators, the move is pragmatic. If your repos are already agent-native, Grok Build is easy to trial. If your work depends on large codebases and second-model review, it is especially worth testing. If you are not already using repo instructions and verification discipline, fix that first. A stronger agent will not save a sloppy workflow.


📚
Related Reading

Codex Mobile. Same-week companion piece on OpenAI's mobile coding-agent control surface. Different bets, different architecture choices.

Two Models, One Branch. The simplest multi-model orchestration that works. Grok Build slots into this pattern as the second reviewer when the open standards make hedging cheap.

MCP, From Pidgin to Protocol. Why MCP standardization changed how agents speak to systems — and why every new CLI in 2026 ships with it.

Above the Model. The components above the model that decide AI-native output quality. Plan Mode is one slice of this stack.
💬
Working with a team that wants to adopt AI-native workflows at scale? I help engineering teams build this capability: workflow design, knowledge architecture, team training, and embedded engineering. → AI-Native Engineering Consulting