Microsoft Built an AI Coder as Good as Claude. The Benchmark Might Be Broken.

Bill Cava, June 3, 2026

At its Build conference this month, Microsoft put a number on a slide. Its new coding model, called MAI-Code-1-Flash, scores about as high as Claude, one of the best coding models money can buy, on the standard test the whole industry uses to rank them. And it does it while being much smaller and cheaper to run. The number is the pitch. A cheaper model that scores as well must be about as good.

Here is the problem. The company that did the most to make that test the industry standard stopped trusting it four months ago.

This is not a knock on Microsoft. Shipping a small, fast, capable coding model is a real achievement, and there is no reason to doubt the numbers are accurate. The problem is the test. A high score on it, from Microsoft or anyone, has quietly stopped telling buyers the thing they think it tells them: whether the model writes software that survives in the real world.

How good is Microsoft's new AI coder?

By the headline number, very good. At Build this month Microsoft said its new coding model matches Claude on the most-cited industry test and holds its own on a harder version of it, while costing a fraction as much to run. The figure itself is not the issue. The use it is being put to is.

A benchmark result used to be a bragging point in a launch post. Now it is becoming a buying decision for core infrastructure. Microsoft is preparing to swap MAI-Code in underneath millions of developers inside GitHub Copilot, partly on the strength of numbers like these. When a score moves from marketing to procurement, the question of what it actually measures stops being academic.

Even the developers closest to these launches hedge. Simon Willison, who writes carefully about new models for a living, published a take on Microsoft's announcement and walked part of it back a day later, when a different claim did not survive a close read of the paper. That instinct, applied to one detail, is worth applying to the headline number too.

Why did OpenAI stop using its own favorite test?

It said the test had stopped measuring skill. In February 2026, OpenAI quit reporting its score on SWE-bench Verified, the most-cited coding test. Its explanation was blunt: a rising score no longer meant the model was getting better at real software. It increasingly just meant the model had seen the test before.

"Improvements... increasingly reflect how much the model was exposed to the benchmark."

OpenAI, "Why we no longer evaluate SWE-bench Verified," February 2026

The audit behind that decision is the part that should change how you read any coding score. When OpenAI checked the test's hardest problems, it found more than half of them (59.4 percent) were broken: the automated grader demanded exact function names the problem never mentioned, or checked for behavior from unrelated code. And every top model, Claude and Gemini included, could reproduce the official answer when handed nothing but the problem's ID number. They had, in effect, memorized the answer key. You can read OpenAI's full explanation for the details.
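A rough version of that probe can be run against any candidate model: prompt it with nothing but a task identifier, then compare what comes back to the reference fix. The sketch below is not OpenAI's audit procedure. The function names, the 0.9 threshold, and the choice of plain string similarity from Python's `difflib` are all assumptions made for illustration.

```python
import difflib


def similarity(candidate: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between two patches.

    Blank lines and leading/trailing whitespace are ignored so that
    formatting differences do not hide a near-verbatim match.
    """
    def normalize(text: str) -> str:
        return "\n".join(line.strip() for line in text.splitlines() if line.strip())

    matcher = difflib.SequenceMatcher(
        None, normalize(candidate), normalize(reference), autojunk=False
    )
    return matcher.ratio()


def looks_memorized(id_only_output: str, reference_patch: str,
                    threshold: float = 0.9) -> bool:
    """Flag output that nearly reproduces the reference from an ID-only prompt.

    The threshold is a hypothetical cutoff, not a published standard.
    """
    return similarity(id_only_output, reference_patch) >= threshold
```

A flag here is a reason to look closer, not proof. Short or formulaic fixes can match by coincidence, so a sketch like this is better at surfacing suspects than at clearing them.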

The takeaway is structural, not about any one vendor. This is not a flaw in Microsoft's model or a knock on its honesty. It is a property of the test that every model's score inherits. The test contaminated itself by becoming famous. Once a test shows up all over the training data, a high score measures memory as much as skill, and there is no clean way to tell the two apart from the number alone.

How do AI coders do on real-world work?

Far worse than the leaderboard says. One company dropping one test could be brushed off as a single opinion. So look at the other end, where someone measured against actual work instead of tidy, pre-solved problems. The gap is not subtle.

Meta built a test out of real coding sessions, the messy back-and-forth of someone actually trying to ship, rather than the clean, solvable tickets the standard test is made of. Across five top models, the real-world solve rate ran from 42.9 to 58.2 percent. The same models score around 80 percent on the standard test. That is a 25-to-35-point gap between the leaderboard and the actual workday.
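The gap is plain subtraction, sketched here from the figures quoted above. Treating every model as scoring a flat 80 percent on the standard test is an assumption, since per-model scores are not given here. Under it, the endpoints come out at roughly 22 and 37 points, which bracket the rounded range in the text.

```python
# Figures quoted in the article. The flat 80 percent baseline is an
# assumption: per-model standard-test scores are not given here.
STANDARD_SCORE = 80.0
REAL_WORLD_LOW, REAL_WORLD_HIGH = 42.9, 58.2


def gap(standard: float, real_world: float) -> float:
    """Percentage-point gap between a benchmark score and a real-world solve rate."""
    return round(standard - real_world, 1)


smallest_gap = gap(STANDARD_SCORE, REAL_WORLD_HIGH)  # best real-world performer
largest_gap = gap(STANDARD_SCORE, REAL_WORLD_LOW)    # worst real-world performer
print(f"gap ranges from {smallest_gap} to {largest_gap} points")
```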

The reason is what the standard test is made of. Independent reviews have noted that most of its problems are small, single-file bug fixes pulled from a handful of mature open-source projects, the kind of thing an experienced developer closes in under an hour. The work that actually breaks in production is barely represented: changes that span many files, judgment calls about architecture, vague requirements where being confidently wrong costs a week. A test built from the easy half of the job will always flatter a model on the hard half.

This is the same pattern we keep running into. Teams with the most mature AI guardrails roll back more often, not less, because the gap between a clean demo and production-grade work does not close just because the first draft got faster. The test is that demo, turned into a leaderboard.

Are AI coding benchmarks useless, then?

No, and assuming so is its own mistake. A benchmark is a floor. It is a quick way to rule out a model that plainly cannot code. By that standard, SWE-bench is fine, and Microsoft clearing the bar at a fraction of the cost is real, useful news about efficiency.

The failure is treating the floor as the finish line. "Passed the test" is not "ready for production," any more than a polished demo is a shipped product. This is the leaderboard version of the line between looks-done and actually-done that vibe coding keeps crashing into: the screens render, the happy path works, the score is green, and none of it tells you whether the thing holds when a real user does something the test never imagined.

It is also why the Big Four didn't standardize on a model; they standardized on how the work gets governed. When scores converge across vendors and then turn out to be partly illusory, the thing that sets one choice apart is no longer the number on the slide. It is everything around it: how the work gets checked, who is accountable when it ships, whether the output survives review.

What should you measure instead?

The question is never "what did it score." It is "does it hold up in the work." That shift, from a public leaderboard to your own codebase, is the whole move, and it is not a vibe. It is what experienced engineering judgment is for, and it is the part no number can hand you.

Concretely: run the model on your code. Your multi-file changes, your half-written tickets, your review standards, the architectural calls where a confident wrong answer costs you a week. A cheaper model that clears the bar on your work is worth more than one that tops a leaderboard built from someone else's solved problems. A benchmark can tell you who to rule out. Only your production can tell you who to trust.
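As a minimal sketch of what "run the model on your code" can look like, the harness below scores candidate models against your own tasks instead of a public leaderboard. Everything in it is hypothetical: a model is treated as any callable that maps a task prompt to proposed output, and each task's `accepts` check stands in for your real test suite and review standards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" is any callable from prompt to output.
Model = Callable[[str], str]


@dataclass
class Task:
    name: str
    prompt: str
    accepts: Callable[[str], bool]  # stand-in for your tests and review bar


def solve_rate(model: Model, tasks: List[Task]) -> float:
    """Percentage of tasks whose output passes that task's acceptance check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.accepts(model(task.prompt)))
    return 100.0 * passed / len(tasks)


def rank_models(models: Dict[str, Model], tasks: List[Task]) -> List[Tuple[str, float]]:
    """Return (name, solve rate) pairs, best first, measured on your own tasks."""
    rates = [(name, solve_rate(model, tasks)) for name, model in models.items()]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)
```

In practice the acceptance check is where the judgment lives. A task drawn from a real multi-file change should pass only if the result would survive your own review, not merely if it compiles.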

Because here is what is coming, and it is worth saying plainly. The leaderboard will keep climbing. The models will keep getting cheaper and posting bigger numbers, and every launch will put one on a slide. None of it tells you whether the thing you ship on Monday survives the week. The teams that win this era are the ones who learn to stop reading the score and start reading the work.

Frequently asked

What did Microsoft's MAI-Code score on SWE-bench?

At its Build 2026 conference, Microsoft said its new coding model, MAI-Code-1-Flash, scores about as high as Claude on SWE-bench, the standard coding test, while costing far less to run. The number itself is not in dispute. What it actually measures is.

Why did OpenAI stop reporting SWE-bench Verified?

In February 2026 OpenAI stopped using SWE-bench Verified, saying a rising score "increasingly reflect[s] how much the model was exposed to the benchmark" rather than real skill. Its review found most of the hardest problems had broken tests, and top models could reproduce the official answer from the problem ID alone.

How well do AI models perform on real production coding tasks?

Meta built a test from real coding sessions instead of tidy, pre-solved problems. The best models solved 42.9 to 58.2 percent of those tasks, far below the roughly 80 percent they score on the standard benchmark. That is a 25-to-35-point gap between the leaderboard and real work.

Are AI coding benchmarks useless?

No. A benchmark is a useful floor: a quick way to rule out a model that plainly cannot code. The mistake is treating the floor as the finish line, reading "passed the test" as "ready for production." Use the score to disqualify, not to anoint.

How should I evaluate an AI coding model for my team?

Run it on your own work: your repositories, your multi-file changes, your vague tickets, your review standards. The question is never what it scored on a public test. It is whether the output holds up in your production, which only your codebase can tell you.
