Things AI Is Surprisingly Bad At

It can reverse-engineer binaries and find 27-year-old bugs. It cannot reliably tell you what day of the week it is. A field guide to AI's blind spots — and why each one exists.


It can reverse-engineer stripped binaries. It can find vulnerabilities that survived 27 years of expert review. It can write, test, and ship a full-stack application in a single conversation. It can reason about complex distributed systems, generate working code in a dozen languages, and explain quantum mechanics in terms a five-year-old would understand.

It cannot reliably tell you what day of the week it is.

The gap between what AI is extraordinarily good at and what it is hilariously bad at is one of the most fascinating — and practically important — things about working with these models daily. Here's a field guide to the blind spots, and why each one exists.

What Day Is It?

Can: Parse timestamps across timezones, calculate epoch deltas, generate cron expressions, build complex scheduling logic.

Cannot: Tell you it's Wednesday.

LLMs don't have a clock. They get a static date string injected into the system prompt at the start of a conversation — "Today's date is April 7, 2026" — and that's it. No running clock, no calendar library, no way to verify. When the model needs to figure out what day of the week it is, or how many days until Friday, it's guessing based on patterns rather than computing. It's token prediction, not calendar math.

The result: it will confidently tell you Thursday comes after Friday. It will skip entire days of the week. It will calculate "3 days from now" and land on a date that exists in no known calendar system.

The fix: Tell it to run the date command in the terminal before any time-sensitive operation. That forces it onto the system clock instead of its imagination.
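The same principle applies inside code: never ask the model to guess the weekday when a one-liner computes it. A minimal sketch using Python's standard library:

```python
from datetime import date

# Deterministic calendar math beats token prediction.
# date.weekday() returns Monday == 0 ... Sunday == 6.
today = date.today()
print(today.strftime("%A"))  # e.g. "Wednesday", straight from the system clock

# "How many days until Friday?" as arithmetic, not vibes:
FRIDAY = 4
days_until_friday = (FRIDAY - today.weekday()) % 7
print(f"{days_until_friday} day(s) until Friday")
```

The modulo keeps the answer in range 0 through 6 regardless of what day it currently is, which is exactly the kind of edge case the model fumbles when it "reasons" through a calendar in prose.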

How Many R's in Strawberry?

Can: Analyze the frequency distribution of characters across a 10,000-line codebase. Tokenize, parse, and generate syntactically correct code in languages it was barely trained on.

Cannot: Count the letter 'r' in the word "strawberry."

This is the internet's favorite AI gotcha, and it reveals something fundamental: LLMs don't see characters — they see tokens. The word "strawberry" gets tokenized into chunks that don't align with individual letters. Asking the model to count characters is like asking someone to count the bricks in a wall while looking at it through frosted glass. The abstraction layer is wrong.

This extends to anything that requires precise character-level operations: counting vowels, checking palindromes, verifying string lengths. The model will attempt it, get it wrong, and present the wrong answer with absolute confidence.
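Every one of these character-level tasks is a one-liner in actual code, which is the cleanest way to route around the tokenization blind spot:

```python
# Character-level operations are trivial for code, hard for token prediction.
word = "strawberry"
print(word.count("r"))  # 3

# The same goes for the other classic stumbles:
def is_palindrome(s: str) -> bool:
    # Compare the string to its reverse, character by character.
    return s == s[::-1]

print(is_palindrome("racecar"))  # True
print(len("strawberry"))         # 10
```

When precision at the character level matters, have the model write and run this kind of code rather than answer directly.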

The Confidence Problem

Can: Produce nuanced, well-reasoned analysis of complex problems with appropriate caveats and trade-offs.

Cannot: Say "I don't know" when it doesn't know.

This is the big one. AI models are trained to be helpful, which means they are constitutionally incapable of comfortable silence. Ask a question that has no good answer and the model will generate one anyway — fluently, confidently, and completely fabricated. It won't pause. It won't say "I'm not sure." It will deliver nonsense with the same tone it uses for things it actually knows.

The practical consequence: the model's confidence is not correlated with its accuracy. A correct answer and a hallucinated answer sound identical. This is why deterministic evals matter so much — you can't judge correctness by how sure the model sounds.
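A deterministic eval can be as simple as comparing the model's claim against ground truth computed in code. This is a hypothetical grader, with illustrative function names, but it shows the shape:

```python
def eval_letter_count(model_answer: str, word: str, letter: str) -> bool:
    """Deterministic grader: compare the model's claim to computed truth."""
    try:
        claimed = int(model_answer.strip())
    except ValueError:
        return False  # a hedge-filled paragraph is not a number
    return claimed == word.count(letter)

print(eval_letter_count("3", "strawberry", "r"))  # True: correct answer passes
print(eval_letter_count("2", "strawberry", "r"))  # False: confident nonsense fails
```

The grader doesn't care how sure the model sounded. It cares whether the number matches.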

You Said Fix the Bug, Not Rewrite the File

Can: Follow complex, multi-step technical specifications with dozens of constraints.

Cannot: Resist the urge to "improve" everything it touches.

Ask the model to fix a one-line bug and there's a real chance it will also refactor the surrounding function, add type annotations you didn't ask for, rename variables to "better" names, and insert helpful comments throughout the file. It's like asking someone to change a lightbulb and coming back to find they've repainted the room.

This is training incentive at work. The model is rewarded for being thorough and helpful, so it maximizes helpfulness even when the most helpful thing would be to change exactly one line and stop. Constraining scope is a skill you develop when working with AI — your prompts get more specific, your instructions more explicit, and you learn to say "change only this line and nothing else" with military precision.
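You can also enforce scope mechanically rather than trusting the prompt. One sketch, assuming you keep the file contents from before and after the model's edit: diff them and count changed lines.

```python
import difflib

# Hypothetical before/after snapshots of a file the model was asked
# to fix with a one-line change.
before = "def add(a, b):\n    return a - b\n"  # the bug
after = "def add(a, b):\n    return a + b\n"   # the fix we asked for

changed = [
    line
    for line in difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""
    )
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

# One removed line plus one added line: exactly a one-line fix.
print(len(changed))  # 2
```

If the count comes back as forty, the model repainted the room, and you can reject the edit before it ever lands.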

Which One Is Left?

Can: Generate valid CSS grid layouts, reason about database schemas with complex relationships, design system architectures with dozens of interacting components.

Cannot: Consistently tell left from right in an image.

Spatial reasoning in the physical sense — rotating objects mentally, understanding relative positions in images, navigating directional instructions — is genuinely weak. The model processes images as token sequences, not as spatial representations. It's like reading a description of a room versus actually standing in it. The information is there but the intuition isn't.

If you've ever had an AI agent click the wrong button in a UI because it confused left and right, you know this pain intimately.

Quick, What's 4,847 × 7,291?

Can: Derive mathematical proofs, explain complex statistical concepts, implement numerical algorithms correctly.

Cannot: Reliably multiply two large numbers.

The model can solve differential equations but will fumble basic multiplication with large numbers. This makes perfect sense when you remember that it's predicting tokens, not running a calculator. Small numbers it's seen often in training data? Fine. Large, uncommon numbers? It's interpolating, and the interpolation is lossy.

The absurdity: it can write code that computes the answer perfectly. It just can't compute it directly. The tool is smarter than the mind using the tool.
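That code is genuinely trivial, which is what makes the gap so strange:

```python
# One line of code does deterministically what the model
# can only approximate by pattern-matching.
print(4847 * 7291)  # 35339477, exact, every time
```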

What Did I Just Say?

Can: Maintain complex project context across hundreds of interactions, remember architectural decisions, track dozens of file changes.

Cannot: Avoid contradicting itself after enough turns in a conversation.

Long context windows are impressive but not infinite. As conversations grow, earlier context gets compressed or drops out. The model might recommend an approach in message 5 and recommend the opposite in message 50 — not because it changed its mind, but because it forgot what it said. It's not lying. It's just stateless in ways that feel deeply weird when you've been treating it as a collaborator with memory.

Why This Matters

These aren't just fun party tricks to stump an AI. They're practical failure modes that affect real work. If you use AI agents daily, you will encounter every single one of these. The engineer who knows the blind spots routes around them. The engineer who doesn't gets bitten.

The pattern across all of these limitations is the same: LLMs are pattern matchers, not reasoning engines. When the pattern is well-represented in training data, the output is spectacular. When the task requires genuine computation — counting, calendar math, arithmetic, spatial rotation — the model fakes it. And it fakes it convincingly enough that you won't notice unless you're checking.

This is exactly why the eval-driven approach works: you let the model do what it's extraordinary at (reasoning, code generation, synthesis) and you use deterministic code to catch the things it's bad at (precision, accuracy, consistency). Play to the strengths. Verify the weaknesses. That's the workflow that scales.

The Punchline

The most powerful AI models in the world can find zero-day vulnerabilities in hardened operating systems, write production-grade software across every major language, and reason about problems that stump most humans.

They cannot count to three.

And somehow, that's fine. Because once you know where the blind spots are, you stop expecting a calculator and start working with what's actually there: the most powerful pattern-matching engine ever built, with a few charmingly human-sized holes in its abilities.

Just don't ask it what day it is.
