Judgment is the new unit of performance

Output used to be expensive enough to tell you something about the person who produced it. You could read the volume, the polish, and the pace of what they shipped and get a rough signal. A good analyst produced more memos. A good product manager produced more specs. A good manager landed more projects. The signal was always noisy, but it was usable, because the raw production of the artifact cost something, and the cost was roughly similar across people of similar experience. AI has made that cost drop unevenly. For the people using it well, a first draft, a summary, a query, a plan, a piece of code, a campaign brief all take a fraction of the time they used to. For the people who are not using it, these things still take something close to the old time. The artifact still arrives. The raw production of it no longer means what it used to mean.

Judgment always mattered. What changed is that output used to carry more evidence of it than it does now. Judgment is the part of the job the model cannot be relied on to do on your behalf, even when it is doing everything else. It is framing the request correctly, which requires already knowing what matters. It is reading the output and spotting what is wrong, what is missing, and what is plausible but false. It is deciding which parts of a model’s answer to keep and which to throw away. It is knowing when the machine is being confident about something it does not know. It is choosing, sometimes, to override the machine entirely and go the long way round. None of these steps produces a visible artifact on its own. All of them decide whether the artifact is any good.

This change in what counts shows up differently in different roles, but it is the same change underneath. An analyst can now produce five times as many charts. The quality of the analyst is therefore no longer visible in the chart alone. It is in which question was worth answering in the first place, which cut of the data was honest, which of the five findings matters enough to bring upstream, and whether they will tell the room the uncomfortable reading rather than the one that will land well. A product manager can generate three alternative specs in an afternoon. The question is whether the one they picked is the right one, whether the trade-offs they declined were the ones worth declining, and whether they can tell, before the team commits, which assumption in the chosen spec is most likely to be wrong. A manager can produce a coherent-looking performance review in a few minutes. Whether the review is true, and whether it helps the person it is about, depends on a kind of attention the model is not doing for them, and on the dozens of small observations across the quarter that no model can reconstruct from the artifacts alone.

Companies that still evaluate people primarily on visible output, or on visible effort, are about to start misreading who is strong. The person who produces a lot with AI will look busier than ever. The person who produces less but whose output is consistently right will look slower. The evaluation systems most organizations run were built when output was a reasonable proxy for judgment, because judgment was what made output expensive to produce. Now that output is cheap and judgment is not, the proxy has broken. Treating it as if it still works is how good operators get overlooked and confident producers get promoted into roles where their weak judgment suddenly costs the company a lot more.

Two shapes of employee get written about a lot in this period, and both readings are incomplete on their own. The first is the heavy AI user who has quadrupled their output and is visibly everywhere. The second is the refuser, who has continued working the way they always have and is suspicious of what the enthusiasts are shipping. Neither pattern tells you much about the person’s judgment. The heavy user might have sharp judgment and be using AI to do more of what they already do well, or they might be producing fast, confident, subtly wrong work at a rate nobody can audit. The refuser might be protecting a kind of rigour the machine has not earned their trust on, or they might be avoiding something they find threatening and falling behind in ways that will hurt them later. Both patterns are visible. Neither is a read on quality. Only the work itself, read closely, tells you.

The most dangerous profile inside a company right now is the person who ships a lot with weak judgment. They appear productive. Their output looks clean. They are exactly the kind of person an output-heavy system will over-reward. The errors show up slowly, in decisions that were worse than they should have been, in analyses that pointed at the wrong driver, in plans that did not survive contact with the real constraints, in code that worked until it met the edge case the author did not think to consider. A slower person whose judgment is sound does less and is worth more, and the gap between how those two people are currently rewarded is about to become a serious problem for the companies that are getting it wrong.

This is why the debates about hiring, performance review, the junior role, and managing a team where AI use is optional keep circling the same underlying question. Hiring is a judgment bet wrapped in a volume test that no longer filters well. Performance review is a judgment signal wrapped in output metrics that no longer explain as much. The junior role is an apprenticeship in judgment that used to be served by the volume work a machine is now doing instead. Managing a team with uneven AI use is a problem of reading people doing different kinds of work at different speeds and with different error profiles. These four are usually discussed as separate problems. They are not. The common thread is that the thing being bought, sold, evaluated, and developed is judgment, and the surface metrics most companies use to read people were built in an era when output was a good enough proxy for it.

If a manager wants an honest read on the people on their team now, output counts for less than it used to. What counts is how well the person frames the problem before they start, what they notice in a draft that the rest of the team walked past, how often their first instinct on an ambiguous call turns out to be the right one, whether they catch the model when it is wrong, and what they choose to push back on. Those are harder to see than ticket counts and harder to write up at the end of a quarter. They are also the actual performance now. A manager who will not look at them is running the team blind.

Effort still matters. Output still matters. Both of them now sit beneath something else. Judgment used to be the quiet variable behind performance. It has become the performance.
