2026-05-25

Phronesis and AI Coding Agents: The Skill the Model Cannot Give You

Agents made code-writing essentially free. The harder skill, judgment about when and how much to use them, is still entirely yours. A frame that unifies Zechner, Osmani, Beck, Willison, METR, and Yegge into one argument.

Abstract

AI coding agents have collapsed the cost of writing code to roughly zero. The cost of knowing when and how much code to write has not moved. Most of the current “AI productivity” debate is people talking past each other because they are mixing those two skills together. This post offers one frame, borrowed from Aristotle, that separates them cleanly. It then walks through what Mario Zechner, Addy Osmani, Kent Beck, Simon Willison, METR, and Steve Yegge are each actually saying once you read them through that frame. The argument lands on a position: Zechner is closer to right than Yegge is, but Yegge’s foil sharpens the point rather than blunting it.

The Two Skills the Debate Keeps Confusing

A car that can suddenly do 300 km/h does not oblige you to drive it there. At 300 km/h you cannot make fine maneuvers. You cannot react to a child in the road. Knowing when to be fast and when to be slow is itself a skill, and it is not in the car.

AI coding agents are a car that lost its speed limit. Producing the artifact, the function, the working implementation, is now near-free. Knowing whether to produce it at all, in this codebase, with these on-call rotations, for this customer profile, is a different and unchanged skill. The interesting argument is not about the speedometer. It is about the second skill.

Aristotle had a clean name for it. In Book VI of the Nicomachean Ethics he distinguishes five intellectual virtues, three of which matter here: episteme (unchanging knowledge, like mathematics), techne (the craft of making things, what a potter or a React developer has), and phronesis (practical wisdom; deliberation about what to do in this situation, with these constraints, for this end). Phronesis is the one that requires lived experience of particular cases. Aristotle states this plainly: a young person can be a brilliant geometer or mathematician, but is not thought to have phronesis, because phronesis is built from accumulated judgment about cases that actually went somewhere.

This is the frame. Agents have collapsed the marginal cost of techne. Phronesis has not moved a millimeter. None of it lives in the model weights. None of it can be fully encoded into a CLAUDE.md file or a “skill.” This post is for the engineer who is currently using Claude Code or Cursor every day, has shipped real features with them, and has been bitten at least once by something the agent confidently got wrong. The argument is that “use AI more” and “slow down with AI” are not contradictory. They are answers to two different questions, and the frame above is how you tell which question you are looking at.

Why the Experts Look Like They Disagree

Read the loudest voices in this space side by side and they sound contradictory. They are not. They are all talking about the same single split, just from different sides of it.

  • Addy Osmani’s 70/30 split. Osmani’s claim is that AI agents will do the first 70 percent of a task brilliantly and the last 30 percent (security, edge cases, maintainability, fit) badly. Read through the frame: the first 70 percent is techne, which is now cheap. The last 30 percent is phronesis, which is still expensive. Same split, different vocabulary.
  • Kent Beck’s TDD-with-agents. Beck calls TDD a “superpower” when paired with agents and warns that agents will happily delete a test to make it pass. Read through the frame: the test is phronesis externalized. The agent grinds techne against a phronetic spec. The “agents delete tests” failure mode is what happens when techne is given authority over its own success criteria.
  • Simon Willison’s “ships from his phone.” Willison merges work from his phone because his repos have fast tests, strict lints, type checks, preview environments, and protected branches. Read through the frame: he is not running faster. His scaffolding holds phronesis for him, so he can hand techne to the agent. The speed is bought, not free.
  • METR’s 19 percent slowdown. METR’s 2025 randomized trial measured experienced developers on familiar repos using AI agents. Participants predicted they would be ~24 percent faster. They were ~19 percent slower and, even after the trial, still believed they had been ~20 percent faster. METR’s February 2026 follow-up acknowledges the sampling caveats (experienced devs, familiar code, the worst case for AI assistance). Read through the frame: the perception gap is what happens when you mistake the feeling of fluent techne for the result of phronesis. For experienced developers on familiar code, agents threaten to displace the phronesis they already have rather than augment it.
  • Mario Zechner’s “slow down.” Zechner’s argument is that agents compound mistakes faster than humans can catch them, and the discipline is to scope tasks tightly and read the diff every day. Read through the frame: phronesis defending itself against techne’s abundance.
  • Steve Yegge’s “code is a liquid.” Yegge’s position, in shorthand, is that the orchestration layer will absorb the judgment, the bottleneck is organizational ability to absorb agent output, and you should stop looking at the code. Read through the frame: Yegge is betting that phronesis can be moved into the orchestrator. Everyone else above is betting it cannot. This is the real disagreement.

So the six positions reduce to one axis with two ends. On one end: phronesis stays human. On the other end: phronesis migrates into the tooling layer. Zechner, Osmani, Beck, Willison, and METR cluster on the first end (with different operationalizations). Yegge sits on the second end. The post takes a side.

Where This Post Stands

Closer to Zechner than to Yegge, with respect for what Yegge is actually arguing. Here is why.

Phronesis is consequence-shaped. The judgment about whether to ship this code, in this state, into this system, is judgment about what happens next, in places the orchestrator cannot see. A model can see the diff. A model cannot see the on-call rotation, the customer who has just been migrated, the regulator who is asking new questions this quarter, the half-finished refactor in the next sprint, the junior teammate who is going to maintain this in six months without context. The consequences of a code change live downstream of the orchestrator’s visibility, in places that change faster than the model’s training data.

Yegge’s argument has a real version and a weak version. The real version is that the spec-to-execution loop is now extraordinarily cheap, and treating “looking at code” as the central act of engineering will become as anachronistic as treating “feeding the punch-card reader” as central. He is right about that loop. The weak version is that this generalizes to “you do not need to look at the code.” It does not, because the spec-to-execution loop is not the only loop. There is also the consequence-to-spec loop, and that loop runs through human judgment about a future the orchestrator has no information about.

So the post’s position is: Yegge is right about the speed of one loop and wrong about which loop is load-bearing. Zechner is right about the loop that is load-bearing and slightly understating how cheap the other loop has become. The synthesis is Osmani’s 70/30, Beck’s TDD rhythm, and Willison’s scaffolding, all of which institutionalize phronesis in different places so techne can run free where it is safe.

The Perception Gap, Briefly

A short note from working with agents on this codebase. There was a refactor that felt fast. The agent produced a working diff in maybe twenty minutes, the tests passed, the preview deploy looked correct, and the satisfying click of “merge” arrived shortly after. Two days later, a small piece of phronesis that was not in the spec (a build-time behavior on a specific Astro content collection edge case) surfaced as a broken sitemap entry that no test was watching. The fix took longer than the original “fast” change. The original change felt fast because the techne part was fast. The full operation, including the part where someone notices the gap, was not fast. The METR finding generalizes: the feeling of producing fluent code is not the same signal as the result of producing the right code. Track outcomes, not flow.

The incident is not unusual. It is just the texture of what the METR data describes when it happens to one person. The lesson is small: the speedometer is in the wrong place. The dashboard reads “fast” because techne is fast. The road, which is the production system, is on a different time scale.

A Decision Framework Rooted in Phronesis, Not Speed

The questions worth asking before any agent task are phronetic, not technical. Speed is a consequence of the answers, not a starting point.

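[Figure: decision tree. Start: a new task in front of you. First question: “Can you describe 'done' in 3 sentences?” If no: too big to delegate, decompose first. If yes, second question: “Who pays if it breaks?” with branches “Only me,” “My team,” and “Customers or other maintainers.” A further yes/no question, “Reversible in under 1 hour?”, separates the higher-stakes outcomes. Outcomes: full agent autonomy OK; agent-drafted, you review the diff before merge; phronesis-first (hand-write or heavily supervise). End-of-day diff review is still mandatory.]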

The root of the tree is two phronetic questions. Can you scope the task? Who pays if it breaks? Speed only enters after both are answered. That ordering is the structural argument: you do not pick a speed and then check if the task is safe at that speed. You read the situation, and the situation determines the speed.

The branches map roughly to a consequence ladder:

| Context | Suggested speed | Why |
| --- | --- | --- |
| Throwaway prototype, no users | Full agent autonomy | Crash radius is you |
| Internal tool, small blast radius | Agent-drafted, human-reviewed | Failure is annoyance, not loss |
| Customer-facing, reversible | Agent-assisted, paired review | Failure is rollback |
| Customer-facing, irreversible (payments, data migration, auth) | Hand-written or heavily supervised | Failure is the kind that makes the news |
| Public API, library, infrastructure | Phronesis-first; agent for boilerplate only | Other people pay the cost of mistakes |

The asymmetry to notice: the cost of being too cautious in row one is small (you spent more attention than necessary). The cost of being too fast in row five is borne by people who never consented to your speed. That asymmetry is why the safe default is to slow down at the bottom of the table, not to speed up at the top.
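As a rough sketch, the ladder can be written down as a lookup, so that the speed is read off the classification instead of being picked by feel. The names here (`Rung`, `Policy`, `policy_for`) are illustrative, and the rule that an unclassified task falls through to the most cautious row is an assumption added for this example, in the spirit of the asymmetry just described.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Rung(IntEnum):
    """Rows of the consequence ladder, ordered from lowest to highest stakes."""
    THROWAWAY_PROTOTYPE = 1
    INTERNAL_TOOL = 2
    CUSTOMER_REVERSIBLE = 3
    CUSTOMER_IRREVERSIBLE = 4
    PUBLIC_API_OR_INFRA = 5


@dataclass(frozen=True)
class Policy:
    speed: str
    why: str


LADDER = {
    Rung.THROWAWAY_PROTOTYPE: Policy("Full agent autonomy", "Crash radius is you"),
    Rung.INTERNAL_TOOL: Policy("Agent-drafted, human-reviewed", "Failure is annoyance, not loss"),
    Rung.CUSTOMER_REVERSIBLE: Policy("Agent-assisted, paired review", "Failure is rollback"),
    Rung.CUSTOMER_IRREVERSIBLE: Policy("Hand-written or heavily supervised",
                                       "Failure is the kind that makes the news"),
    Rung.PUBLIC_API_OR_INFRA: Policy("Phronesis-first; agent for boilerplate only",
                                     "Other people pay the cost of mistakes"),
}


def policy_for(rung: Optional[Rung]) -> Policy:
    """Return the suggested speed for a task.

    Assumption for this sketch: a task nobody has classified yet is treated
    as the most cautious row, because over-caution is the cheaper mistake.
    """
    if rung is None:
        return LADDER[Rung.PUBLIC_API_OR_INFRA]
    return LADDER[rung]


print(policy_for(Rung.INTERNAL_TOOL).speed)
```

The code does nothing clever; what matters is that classifying the task is a separate, explicit step that happens before anyone chooses a speed.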

Phronesis Encoded Into Scaffolding

Willison’s “ships from his phone” is the most useful operational claim in this space, because it shows that the choice is not actually “human or agent.” The choice is “where does the phronesis live.” If it lives only in your head, you must be present for every merge. If it lives in the scaffolding (tests, types, lints, CI gates, build-time checks, protected branches, preview deploys), the agent can move quickly inside the rails the scaffolding draws.

A concrete example from this site’s build pipeline. Mermaid diagrams in blog posts are rendered to SVG at build time, cached by content hash, and shipped as static markup. The build fails loudly if a diagram cannot parse. This is not a sophisticated piece of engineering; it is one rehype plugin and a cache directory. What it does is move a piece of phronesis (the judgment “this diagram should actually render in production, on the production runtime, without surprises”) out of the reviewer’s head and into the build. An agent can now propose a Mermaid diagram, the build either accepts it or rejects it, and the human time saved is real, not theatrical. The scaffolding earned the right to ship faster on Mermaid.
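The shape of that check is small enough to sketch. The following is not the site's rehype plugin; it is a Python illustration of the same two ideas (cache by content hash, fail loudly on a parse error), with `render_cached` and the stand-in `fake_render` both being hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def render_cached(source: str, render: Callable[[str], str], cache_dir: Path) -> str:
    """Return rendered markup for `source`, reusing a cached copy keyed by content hash."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cached = cache_dir / f"{key}.svg"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    # No try/except here on purpose: a render error propagates and fails the build.
    svg = render(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(svg, encoding="utf-8")
    return svg


def fake_render(source: str) -> str:
    """Hypothetical stand-in for a real diagram renderer."""
    if not source.lstrip().startswith(("graph", "flowchart")):
        raise ValueError("diagram does not parse")
    return f"<svg><!-- {len(source)} chars of diagram source --></svg>"


with tempfile.TemporaryDirectory() as tmp:
    first = render_cached("graph TD; A-->B", fake_render, Path(tmp))
    second = render_cached("graph TD; A-->B", fake_render, Path(tmp))  # served from cache
    print(first == second)
```

Because the cache file is written only after a successful render, a diagram that fails to parse is never cached and fails again on the next build until someone fixes it.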

The same principle applies one tier up. Astro Content Collections enforce frontmatter schemas at build time, so an agent that hallucinates a category field or a malformed date cannot land it. The phronesis (the rule “posts must have valid frontmatter or the build fails”) is in the schema, not the reviewer. The pattern generalizes:

  • Schema validation at build, not at code review. Move “the agent forgot a field” from a human catch to a machine catch.
  • Test count regression as a CI gate. Beck’s warning. If the test count drops between commits and the diff did not delete a feature, fail the build.
  • Type checks on the strictest setting. TypeScript’s strict is phronesis encoded. Use it.
  • Protected branches with required reviews. The agent cannot self-merge. The reviewer is the smallest possible piece of in-the-loop phronesis, and it should not be optional.
  • Preview environments per PR. Lets the reviewer apply phronesis to a running artifact, not a diff.
  • A pre-commit hook on token-budget creep. Deciding which CLAUDE.md and skill set an agent loads also takes phronesis: choose a tight set, not an “awesome list.” (More on this in Why Copying Others’ Claude Code Skills Doesn’t Work.)

Each of these is one piece of phronesis lifted out of the reviewer’s head and frozen into a check. The compound effect is that the human can review what the machine cannot check, instead of re-checking what the machine should have caught. That is what Willison’s setup actually buys him.
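To make the first item in that list concrete, here is a minimal, stdlib-only sketch of a build-time frontmatter check. It is not Astro's Content Collections API; the field names, the allowed categories, and the `validate_frontmatter` helper are assumptions made up for illustration.

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical schema for this sketch: three required fields, a closed category set.
KNOWN_FIELDS = {"title", "date", "category"}
ALLOWED_CATEGORIES = {"engineering", "ai-tools"}


def validate_frontmatter(fm: Dict[str, Any]) -> List[str]:
    """Return a list of schema violations; an empty list means the post may build."""
    errors: List[str] = []

    title = fm.get("title")
    if not isinstance(title, str) or not title.strip():
        errors.append("title: must be a non-empty string")

    try:
        date.fromisoformat(str(fm.get("date")))
    except ValueError:
        errors.append("date: must be an ISO date such as 2026-05-25")

    if fm.get("category") not in ALLOWED_CATEGORIES:
        errors.append(f"category: must be one of {sorted(ALLOWED_CATEGORIES)}")

    # Reject fields the schema does not know, so a misspelled key cannot slip through.
    for name in sorted(set(fm) - KNOWN_FIELDS):
        errors.append(f"{name}: unknown field")

    return errors


problems = validate_frontmatter({"title": "Draft", "date": "25/05/2026", "categroy": "ai"})
for problem in problems:
    print(problem)
```

A build script would exit non-zero when the list is non-empty, and that exit code is what turns the rule from a review comment into a constraint.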

Practices That Operationalize the Frame

Five concrete habits, most of them tied to one of the experts above. Pick the ones that fit your team; the point is to make phronesis explicit somewhere in your loop, not to adopt all five.

Scope first, then unleash (Zechner). Before running the agent, write down the bounded task, the acceptance test, and the rollback plan. If you cannot fit the task in three sentences, the task is too big to delegate.

```markdown
# Agent task scope
- What: add idempotency key handling to POST /orders
- Done when: 100 duplicate requests in the test produce 1 order, each returning the same response
- Rollback: revert single commit; no schema migration in this scope
```

This is twenty seconds of phronesis spent before twenty minutes of techne. The ratio is the entire trick.
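The “done when” line in a scope note like this translates almost directly into an acceptance test. The sketch below assumes a hypothetical in-memory `OrderService` as a stand-in for the real `POST /orders` handler, purely so the test has something to run against; a real implementation would also have to handle concurrent requests and persistence, which this ignores.

```python
from typing import Any, Dict, List


class OrderService:
    """Hypothetical in-memory stand-in for the POST /orders handler."""

    def __init__(self) -> None:
        self.orders: List[Dict[str, Any]] = []
        self._responses: Dict[str, Dict[str, Any]] = {}  # idempotency key -> first response

    def post_order(self, idempotency_key: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        # A repeated key replays the stored response instead of creating a new order.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        order_id = len(self.orders) + 1
        self.orders.append({"id": order_id, **payload})
        response = {"status": 201, "order_id": order_id}
        self._responses[idempotency_key] = response
        return response


def test_duplicate_requests_create_one_order() -> None:
    service = OrderService()
    responses = [service.post_order("key-123", {"sku": "A1", "qty": 2}) for _ in range(100)]
    assert len(service.orders) == 1                     # 100 duplicates, 1 order
    assert all(r == responses[0] for r in responses)    # every caller sees the same response


test_duplicate_requests_create_one_order()
```

If the scope note cannot be turned into a test this short, that is usually a sign the task was not yet scoped.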

Externalize judgment into a failing test first (Beck). Write the failing test by hand. Then let the agent implement against it. Then add a CI gate that fails if the test count drops between commits. The agent’s powers are the agent’s powers; the test is the rail.

Invest in scaffolding before autonomy (Willison). Before you raise the agent’s permissions, raise the build’s strictness. Add the lint, the type check, the schema validation, the preview deploy. The agent’s safe speed is bounded by the scaffolding’s ability to catch its mistakes. If you cannot articulate what the build catches, you cannot articulate why an agent at that speed is safe.

End-of-day diff review (Zechner). Read the day’s diff yourself, not the agent’s summary. The summary is the techne report. The diff is the artifact the future maintainer will read. Thirty minutes a day. It catches the small confidently-wrong choices the tests did not cover, and it surfaces the perception gap before it accumulates.

Match speed to consequence. The table earlier in this post. Read the situation; choose the speed. The mistake is letting yesterday’s speed carry into today’s task without checking whether the consequence profile changed.

Common Pitfalls

A few failure modes that recur, ranked by how often they show up in conversations with other teams using these tools.

  • Mistaking the feeling of speed for actual speed. The METR finding is the canonical case, but it generalizes. If your only signal is “this felt fast,” you are measuring techne and calling it productivity. Track at least one outcome metric per workflow.
  • Encoding phronesis into CLAUDE.md and assuming it will be followed. The file is a hint to the model, not a constraint. The constraint is the test, the type check, the CI gate. The CLAUDE.md is helpful; it is not where the judgment actually lives.
  • Letting agents delete tests to make them pass. Beck flagged this directly. Add a pre-commit hook or a CI gate that fails on test count regression. The agent will, under pressure, take the path of least resistance, and removing the failing test is the path of least resistance.
  • Outsourcing the diff review to another agent. Agents reviewing agents compounds the same blind spot in both directions. At least one human eye per merged change, even if cursory, even if it only catches one mistake a month. The point of the human eye is not throughput; it is to be the place where consequence-shaped phronesis lives.
  • Conflating “I can do this with an agent” with “I should do this with an agent.” The first is techne. The second is phronesis. The conflation is the entire problem this post is about.
  • Building a 12-MCP-server, 50-skill setup because someone else does. This is cargo culting, and it is itself an absence of phronesis: choosing what an agent loads is a judgment about what your work actually needs. See Why Copying Others’ Claude Code Skills Doesn’t Work for the token-budget math.
  • Tracking only techne metrics. Lines of code, PR throughput, “AI acceptance rate” are all techne metrics. Add at least one phronesis-flavored signal: time-to-first-incident on agent-authored PRs, rework rate within thirty days, on-call pages per shipped feature. The numbers do not have to be precise; they have to be in the conversation.
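As an illustration of the last pitfall, a signal such as rework rate within thirty days needs very little machinery. The sketch below assumes you can export merged PRs with an “agent-authored” flag and the date of the first follow-up fix or revert; the `MergedPR` record and its fields are hypothetical, and how a PR gets labelled as reworked is left to your own tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class MergedPR:
    number: int
    merged_on: date
    agent_authored: bool
    reworked_on: Optional[date] = None  # first follow-up fix or revert, if any


def rework_rate(prs: Iterable[MergedPR], window_days: int = 30) -> Optional[float]:
    """Share of agent-authored PRs that needed rework within the window.

    Returns None when there are no agent-authored PRs, so that "no data"
    is not mistaken for "no rework".
    """
    agent_prs = [pr for pr in prs if pr.agent_authored]
    if not agent_prs:
        return None
    window = timedelta(days=window_days)
    reworked = sum(
        1
        for pr in agent_prs
        if pr.reworked_on is not None
        and pr.merged_on <= pr.reworked_on <= pr.merged_on + window
    )
    return reworked / len(agent_prs)


history = [
    MergedPR(1, date(2026, 1, 1), True, reworked_on=date(2026, 1, 10)),  # inside the window
    MergedPR(2, date(2026, 1, 2), True, reworked_on=date(2026, 3, 1)),   # outside the window
    MergedPR(3, date(2026, 1, 3), True),
    MergedPR(4, date(2026, 1, 4), True),
    MergedPR(5, date(2026, 1, 5), False, reworked_on=date(2026, 1, 6)),  # not agent-authored
]
print(rework_rate(history))
```

The number will be noisy on a small team. That is acceptable for this purpose, since the aim is to have an outcome signal next to the throughput numbers, not to publish a benchmark.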

Trade-offs Worth Naming

Three honest costs of the position above.

Slowing down is unevenly distributed. A solo founder shipping an MVP and an engineer at a regulated bank should not be operating at the same agent speed. The decision framework above is contextual by design. Anyone telling you there is one right speed for agents across all contexts is selling you something.

Phronesis costs time. End-of-day diff review is thirty minutes a day you were not spending before. The defense is that this is cheaper than the incident, the rework, and the maintainer’s resentment, but it is a real cost, not a free habit. If your team adopts the practice, name the cost out loud; do not pretend it is zero.

The advice is not cleanly measurable in the short term. You cannot A/B test “did slowing down prevent the incident that would have happened anyway.” The argument has to be made on first principles plus accumulated patterns, not on a single benchmark. That is uncomfortable in a culture that wants every recommendation backed by a chart, but it is the honest situation. The METR study is the closest thing the field has to a load-bearing data point, and even METR has updated its caveats since publication.

Where the Default Holds, and Where It Doesn’t

The default this post argues for: treat agents as having made techne free, treat phronesis as the scarce resource you allocate deliberately, encode as much of your phronesis as possible into the build and the review loop, and slow down at the bottom of the consequence ladder where other people pay for your mistakes. This holds for the working engineer on a team shipping software that real users depend on. It is the right default for most readers of this post.

Where the default does not hold: throwaway code, personal scripts, prototypes whose only user is you. In that context Yegge’s framing is closer to right, the orchestrator can absorb most of the judgment, and the “spray code through hoses” mode is fine because the consequences of mistakes are bounded to you. The mistake is letting the agent speed that works in that mode carry over into the contexts where it does not. Read the situation, then choose the speed. The frame is older than software engineering; agents have just made the question newly urgent.

If there is one next action: pick the single piece of phronesis that lives only in your head today and move it into the build. One CI gate, one schema, one pre-commit hook. The compound interest on that move is what Willison’s setup is actually made of.
