Your AI Coding Agent Can Run Malicious Code. Vendors Say That's by Design.
You clone a repo to take a look, open it in your AI coding agent, and click the button you have clicked a hundred times: "Yes, I trust this folder." On a booby-trapped repository, that single click can now run a stranger's code on your machine. The prompt you treat as a safety check just authorized an attack.
Two disclosures in the last two weeks showed this working across the major tools. The exploits are clever, but they are not the point. The point is what the vendors said when researchers reported them. The answer, more or less, was: working as designed.
This is not a vendor hit piece, and Anthropic in particular comes out of it looking more thoughtful than most, precisely because it explains its reasoning in public. The pattern is industry-wide, and as you will see, the line the vendors drew is defensible. But it has a consequence most teams have not absorbed yet: the security model for your AI coding tools is your job, not the vendor's. They have said so, repeatedly, on the record.
Can an AI coding agent really run malicious code?
Yes, and researchers demonstrated it across six of the most popular agents. A repository can be rigged so that opening it, and approving the routine "trust this folder" prompt, hands the agent instructions that run an attacker's code on your machine. The catch is that you have to open a hostile project. Nothing more exotic than that is required.
The first disclosure, from the security firm Adversa AI on May 26, is called SymJack. A booby-trapped repo uses a disguised file shortcut to quietly overwrite the agent's own configuration, and the attacker's code runs the next time the agent starts. Adversa confirmed it across six agents: Claude Code, Cursor, GitHub Copilot, Gemini CLI, Grok Build, and Codex CLI. The second, called TrustFall and disclosed June 1, is simpler. The same tools will run helper programs a project defines for itself (these are MCP servers, small plug-ins declared in the project so the agent can use extra tools) the moment you accept the folder-trust prompt. That is what security people call remote code execution: a stranger's code running on your machine with your permissions, no further clicks needed.
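To make the SymJack half of that concrete, here is a minimal pre-flight sketch in Python. It is not Adversa's tooling and not a complete defense. It assumes the "disguised file shortcut" is an ordinary symbolic link, and it lists every link in a freshly cloned repo whose target resolves outside the repo. The function name `find_escaping_symlinks` is ours, for illustration only.

```python
import os
from pathlib import Path


def find_escaping_symlinks(repo_root: str) -> list[tuple[str, str]]:
    """List (link, target) pairs for symlinks that resolve outside repo_root."""
    root = Path(repo_root).resolve()
    escaping = []
    # os.walk does not follow symlinked directories by default, so the scan
    # itself never leaves the repository.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            link = Path(dirpath) / name
            if not link.is_symlink():
                continue
            try:
                target = link.resolve()  # non-strict: dangling links still resolve
            except (OSError, RuntimeError):
                # Symlink loops and similar oddities are suspicious in themselves.
                escaping.append((str(link.relative_to(root)), "<unresolvable>"))
                continue
            if not target.is_relative_to(root):
                escaping.append((str(link.relative_to(root)), str(target)))
    return sorted(escaping)
```

Run it on a clone before you open the folder in an agent. An empty list does not prove the repo is safe, but a link that points into your home directory or at an agent's configuration is a strong reason to stop.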
Worth being precise about the risk, because overstating it helps no one. These attacks need a malicious repository or plug-in config that you choose to open. This is not a drive-by that hits you while you browse. But "don't open untrusted repos" is not advice that survives contact with how developers actually work. Cloning a stranger's code to look at it is the job.
Why won't the vendors just fix it?
Because they do not consider it a bug. When researchers disclosed TrustFall, Anthropic's security team reviewed it and declined: clicking "I trust this folder" is consent to everything that project defines, so code running afterward is the boundary working as intended, not a breach of it. That is a coherent position. It is also the same position the whole industry keeps taking.
Pull the thread and the same answer shows up again and again:
- LayerX reported a zero-click flaw in Claude's desktop extensions, affecting more than 10,000 users. Declined as outside the threat model.
- OX Security reported a weakness in how agent plug-ins connect, affecting an estimated 200,000 of them. Called expected behavior.
- Mitiga reported a way to steal the access tokens those plug-ins use. Ruled out of scope.
- The TrustFall researchers reported the one-click code execution above. Declined as working as designed.
Four reports, four research teams, one answer. This is the same disclosure-and-decline rhythm we traced in the Lovable security crisis, now generalized into a stance: the vendors have drawn the threat-model line so that the trust dialog is consent, not a control. Stitch the four together and you stop seeing a string of unpatched bugs and start seeing a deliberate boundary, defended on purpose.
Is the "trust this folder" prompt actually protecting you?
Not in the way you assume. A prompt appearing before an action is not the same as informed consent, which requires an accurate picture of what the action will do. The dialog says "trust this folder." It does not say "and run any program this folder defines, with my permissions." You are agreeing to far more than it shows you.
Showing a prompt is not the same as obtaining informed consent.
This is why the approval surface keeps turning out to be the attack surface. We saw the same shape when a shared AI assistant could act before anyone approved it: the dialog looks like a control, so people trust it like one, and that trust is exactly what the exploit spends. A checkpoint that waves everything through is not a checkpoint. It is a signature line, and you are signing it blind.
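One way to see the gap between what the dialog says and what it authorizes is to print what the project would launch before you click anything. The sketch below is an illustration under assumptions, not any vendor's API. It assumes project-scoped MCP servers are declared in JSON files such as `.mcp.json`, `.cursor/mcp.json`, or `.vscode/mcp.json`, under a top-level `mcpServers` or `servers` key, with `command` and `args` fields. Check your agent's documentation for the real locations and schema. `commands_declared_by_project` is a hypothetical helper name.

```python
import json
import shlex
from pathlib import Path

# Assumed config locations; verify against the agent you actually run.
CANDIDATE_CONFIGS = (".mcp.json", ".cursor/mcp.json", ".vscode/mcp.json")


def commands_declared_by_project(repo_root: str) -> list[str]:
    """Describe each helper program the project's MCP config would start."""
    findings = []
    for rel in CANDIDATE_CONFIGS:
        path = Path(repo_root) / rel
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            findings.append(f"{rel}: unreadable or invalid JSON, inspect by hand")
            continue
        if not isinstance(data, dict):
            continue
        # Assumed schema: server entries live under one of these two keys.
        servers = data.get("mcpServers") or data.get("servers") or {}
        if not isinstance(servers, dict):
            continue
        for name, spec in servers.items():
            if not isinstance(spec, dict):
                continue
            if "command" in spec:
                raw_args = spec.get("args")
                args = raw_args if isinstance(raw_args, list) else []
                argv = [str(spec["command"]), *map(str, args)]
                findings.append(f"{rel}: {name} -> {shlex.join(argv)}")
            elif "url" in spec:
                findings.append(f"{rel}: {name} -> remote server at {spec['url']}")
    return findings
```

Each returned line is a command the folder is asking to run with your permissions. That is the sentence the trust dialog leaves out.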
Isn't this just careless security?
No, and that is the uncomfortable part. The vendors' position is defensible. You cannot give an agent the autonomy to actually work in your project and also wall it off from that project. At some point the developer has to be allowed to say "run my code," and drawing that consent line at folder-trust is a reasonable engineering call, not negligence.
The sharpest tension is inside Anthropic's own writing. The company published a clear principle about how it contains its models: the weakest layer in any system is the one you built yourself. Then it declined to harden the very layer (the trust dialog) that researchers keep breaking. That is not hypocrisy. It is a signal worth reading: even the most safety-forward vendor treats this particular boundary as the user's responsibility, not its own.
And notice the timing. The same month these exploits landed, Microsoft used its Build conference to ship an enormous security apparatus for AI code: a fleet of security agents, an "Agent 365" management layer, isolated execution containers. The defender industry is booming while the basic trust boundary on the coding agents themselves stays open by policy. This is the same disagreement we found when two major vendors couldn't agree on what the AI security problem even was. Buying more security tooling is not the same as owning your trust boundary, and it will not cover for a boundary you left to the vendor.
There is a clean way to see why this is structural, not incidental. Simon Willison describes what he calls the "lethal trifecta": an agent that can read your private data, process untrusted content, and communicate with the outside world is exploitable almost by definition. A coding agent sitting in your repository with shell access has all three at once. The surprise is not that exploits exist. It is that anyone expected a prompt to hold them back.
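The trifecta is simple enough to write down as a checklist. The sketch below is our own illustrative encoding of the three conditions, not code from Willison or from any vendor, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """The three conditions of the 'lethal trifecta', as plain flags."""
    reads_private_data: bool
    processes_untrusted_content: bool
    communicates_externally: bool


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three conditions hold at once."""
    return (
        caps.reads_private_data
        and caps.processes_untrusted_content
        and caps.communicates_externally
    )


# A coding agent on your laptop, opened on a stranger's repo, with shell access.
default_agent = AgentCapabilities(True, True, True)

# The same agent confined so that it cannot reach your private data.
confined_agent = AgentCapabilities(False, True, True)
```

The value of the framing is that removing any one leg breaks the set. Containment aims to make one of those flags false for real, instead of relying on a prompt to stand in the way.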
How do you run an AI coding agent safely?
Treat it as untrusted infrastructure, not a trusted teammate. Run it inside a sandbox (a container, a virtual machine, or a restricted runner) so a hostile repo cannot reach your real machine and credentials. Scope its plug-ins and permissions to the least it needs. And stop treating the approval prompt as the thing keeping you safe, because the vendor already told you it isn't.
Concretely, the hardening that actually matches the threat:
- Sandbox the agent. A container or VM so the blast radius of a bad repo is the sandbox, not your laptop or your network.
- Keep plug-ins out of the repo's reach. Don't let a cloned project define the tools your agent runs. Configure those yourself, deliberately.
- Isolate it from CI. The automated pipelines that build and ship your code should not be running agents that a pull request can hijack.
- Least privilege on credentials. The tokens and keys the agent can touch should be the minimum for the task, not your full keychain.
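The first and last items in that list can be sketched together. What follows is a minimal example, assuming Docker is your sandbox and that the agent is already packaged in a container image. The image name, the agent command, and the `ALLOWED_ENV` allowlist are placeholders you would supply, not real product names. The code only builds the `docker run` argument list. It does not start anything.

```python
from pathlib import Path
from typing import Mapping, Sequence

# Hypothetical allowlist: everything not named here stays out of the sandbox.
ALLOWED_ENV = ("LANG", "TERM")


def scrubbed_env(env: Mapping[str, str],
                 allowed: Sequence[str] = ALLOWED_ENV) -> dict[str, str]:
    """Keep only explicitly allowed variables; tokens and keys are dropped by default."""
    return {key: value for key, value in env.items() if key in allowed}


def sandbox_command(repo_dir: str, image: str, agent_cmd: Sequence[str],
                    env: Mapping[str, str]) -> list[str]:
    """Build a `docker run` argv that confines the agent to one mounted repo."""
    repo = Path(repo_dir).resolve()
    argv = [
        "docker", "run", "--rm", "--interactive", "--tty",
        "--network", "none",                    # no outbound traffic at all
        "--read-only",                          # immutable root filesystem
        "--tmpfs", "/tmp",                      # scratch space that vanishes on exit
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",  # block setuid-style escalation
        "--mount", f"type=bind,source={repo},target=/work",
        "--workdir", "/work",
    ]
    for key, value in scrubbed_env(env).items():
        argv += ["--env", f"{key}={value}"]
    return argv + [image, *agent_cmd]
```

You would pass the result to `subprocess.run` with your current environment as `env`. Only the disposable clone is mounted, so your SSH keys, cloud credentials, and the agent's user-level configuration are not reachable from inside. One caveat: `--network none` also cuts the agent off from its model API, so a working setup usually needs a narrow egress allowlist for that endpoint instead, which is beyond this sketch.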
None of this is exotic. It is the same posture you would take toward any powerful tool that runs other people's code, which is exactly what an AI coding agent is. The reason it feels like overkill is that the smooth onboarding and the friendly approval prompt are designed to make the agent feel like a colleague. It is not a colleague. It is infrastructure with your permissions.
The agents are genuinely useful, and that is exactly why this matters. You are going to run them, and you should. The only real question is whether you run them like a teammate you trust completely, or like powerful infrastructure you have to contain. The vendors have already answered that question for you, in writing, four times. Build accordingly.
Frequently asked
Can an AI coding agent really run malicious code?
Yes. Researchers showed that a booby-trapped repository can trick six major agents (including Claude Code and Cursor) into running an attacker's code on your machine, triggered by the routine "trust this folder" prompt.
Is Claude Code safe to use?
It is as safe as the workspace you run it in. The known exploits need a malicious repository or plug-in config you choose to open, and the vendors treat the resulting code execution as expected behavior.
Why won't the vendors patch these exploits?
They do not consider them bugs. The LayerX, OX Security, Mitiga, and TrustFall disclosures were each declined as outside the threat model or working as designed.
What is MCP and why is it a security risk?
MCP (Model Context Protocol) is the standard that lets an AI agent plug into outside tools and data. The risk is that a project can declare its own MCP servers, and the affected agents start those helper programs the moment you accept the folder-trust prompt, so a hostile repo gets to run its code with your permissions.
How do you run an AI coding agent safely?
Treat it as untrusted infrastructure. Run it inside a sandbox (a container, VM, or restricted runner) so a hostile repo cannot reach your real machine, scope its plug-ins and credentials to the least it needs, isolate it from your CI, and stop treating the approval prompt as a security control.