Prompt Rules Are Suggestions, Not Guardrails
Nick Thompson
Summary
When a Cursor agent deleted PocketOS's production database in nine seconds, it understood its rules perfectly and violated them anyway. That's not a model failure and it's not a tool failure — it's a deployment failure. Prompts are inputs to a probabilistic system. Guardrails are permission boundaries the model cannot cross. These are different categories of thing, and the people deploying these tools keep conflating them.
The agent recited the rules in its own confession. That was the rule-following. The deletion was everything before it.
A prompt rule is an input to a probabilistic system. A guardrail is a constraint on what the system can do regardless of its inputs. Deployers are treating the first as if it were the second, and the postmortems are stacking up.
On April 25, 2026, an agent running Anthropic's Claude Opus 4.6 inside Cursor deleted PocketOS's production database in nine seconds. The agent was on a staging task. It hit a credential mismatch, decided to "fix" it without being asked, found an API token in an unrelated file, and executed mutation { volumeDelete(volumeId: "...") }. One authenticated POST, no confirmation. The platform stored volume backups in the same volume they were backing up, so both went together. The most recent recoverable snapshot was three months old. Asked to explain itself, the agent wrote: "I violated every principle I was given: I guessed instead of verifying. I ran a destructive action without being asked. I didn't understand what I was doing before doing it."
The agent didn't fail to understand the rules. It understood them precisely and violated them anyway. That is the point.
Stop blaming the hammer
Before going further: the agent isn't the villain here, and neither is the IDE, the hosting platform, or the model vendor. These are tools, and they're remarkable tools. You don't blame the hammer when the deck collapses — you look at whoever decided to skip the joists. Every one of these incidents had a human who chose to give an agent production credentials, chose to colocate staging and prod tokens on the same filesystem, chose to skip the human-in-the-loop gate on irreversible operations. Those are architectural decisions, and they belong to whoever made them.
Platform gaps are real. Some hosting platforms don't yet offer scoped tokens. Some IDEs ship plan-mode bugs. Some models are documented to grab for credentials they weren't given. All of that is terrain. Terrain is what you design around. It's not what you blame after the fact. Accountability lives entirely with whoever held the keys.
What it boils down to is this: treat an agent like a junior employee, and build for the worst. You don't hand a new hire root access to production on day one and tape a sticky note on the laptop that says "don't delete anything." You scope their access to what the job actually requires. You assume that competent, well-intentioned people make mistakes under pressure and in ambiguous situations — because they do, and I've watched it happen plenty of times across 20 years and 100+ tenants. An agent is a junior that moves at machine speed, reads a thousand files before you've had your coffee, and has zero fear of consequence. Whatever access controls you'd apply to the least-trusted person in your environment are the floor for what you apply to the agent. Not the ceiling.
A guardrail is not a sentence
A guardrail constrains what a system can do. A prompt rule constrains what a system chooses to do based on its inputs. These are categorically different things, and conflating them is how you end up with a confession document instead of a database.
A model running on prompt rules, however reliably it follows instructions, is still a guardrail made of probability. You don't build a guardrail out of probability. You build it out of permission boundaries the model cannot cross because it doesn't have the credentials, the access, or the unsupervised execution path. If the PocketOS agent had been holding a token scoped to domain operations, the API would have rejected the volumeDelete mutation before it executed. The rule about not running destructive operations would have been irrelevant — the architecture would have made the rule unnecessary. Instead, that rule was the only thing standing between the agent and the database. That's not a guardrail. It's a sign on the wall.
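To make the distinction concrete, here is a minimal sketch of what a permission boundary looks like when it lives in the API rather than the prompt. The scope names and the mutation-to-scope map are hypothetical, not any platform's real schema; what matters is that the check runs server-side, outside the model's context window, where no amount of prompting can reach it.

```python
# Minimal sketch of a server-side permission boundary (scopes and mutations are hypothetical).
# The check runs in the API, not in the prompt: a token that lacks the required scope
# cannot execute the mutation regardless of what the agent decides to do.

REQUIRED_SCOPE = {
    "domainCreate": "domains:write",
    "appDeploy": "apps:deploy",
    "volumeDelete": "volumes:delete",
}

def authorize(mutation: str, token_scopes: set[str]) -> None:
    """Deny by default: unknown mutations and missing scopes both fail closed."""
    needed = REQUIRED_SCOPE.get(mutation)
    if needed is None or needed not in token_scopes:
        raise PermissionError(f"token not scoped for {mutation}")

# A staging agent holding a token scoped to domain operations only:
staging_scopes = {"domains:read", "domains:write"}

authorize("domainCreate", staging_scopes)       # allowed: the scope covers it
try:
    authorize("volumeDelete", staging_scopes)   # rejected at the boundary, before execution
except PermissionError as err:
    print(err)                                  # "token not scoped for volumeDelete"
```

The deny happens whether or not the agent remembered its rules. That is the property a prompt can't give you.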
The difference matters most under pressure. An agent hitting an unexpected obstacle is exactly when a probabilistic system is most likely to apply probabilistic judgment. Instructions are written for the expected path. The unexpected path is where judgment happens, and where judgment fails — same as any junior under pressure, except this junior makes the call in milliseconds and doesn't stop to phone a friend.
Context rot is not a hypothetical
Prompt-based rules don't just fail in principle. They fail increasingly in practice as agentic tasks get longer, and the mechanism is well-documented.
Stanford's "Lost in the Middle" research in 2023 mapped a U-shaped attention curve across context position — models attend strongly to the beginning and end of their context, and poorly to the middle. A 2025 Chroma Research study tested 18 frontier models, including Claude Opus 4, and found that every single one degrades as input length increases. Claude showed the most pronounced gap between focused and full-prompt performance. Chroma also noted that "even a single distractor reduced performance relative to baselines" — and every tool call, file read, and API response in an agentic session is a distractor relative to the system prompt.
This compounds fast. At 50,000 tokens, the system prompt is still relatively salient. At 250,000 tokens — realistic for a multi-hour coding session — the rules are 250,000 tokens ago and the model's attention is anchored to the last few thousand tokens of environmental context: the current error, the current file, the current state. The rules aren't forgotten. They're diluted by everything that came after them.
This isn't a model failure. It's the predictable output of how transformer attention works. Rules in a prompt have a context window. Scoped credentials don't.
The vendor already told you
The Opus 4.6 system card — Anthropic's own published safety documentation, released February 2026 — is explicit:
"The model is at times overly agentic in coding and computer use settings, taking risky actions without first seeking user permission."
"We also observed behaviors like aggressive acquisition of authentication tokens in internal pilot usage."
"In agentic coding, some of this increase in initiative is fixable by prompting... However, prompting does not decrease this behavior in GUI computer-use environments."
The same document flags "increases in misaligned behaviors in specific areas, such as sabotage concealment capability and overly agentic behavior in computer-use settings," and an improved ability to complete suspicious side tasks without attracting the attention of automated monitors.
Credit Anthropic for the transparency — publishing this is the right call, and most vendors don't bother. The disclosure isn't the problem. The problem is the deployer who reads "aggressive acquisition of authentication tokens" in a published safety document and ships into production with prompt-only rules anyway. The vendor told you. You signed the receipt.
This is a pattern, not an incident
Nine months before PocketOS, in July 2025, an AI coding agent deleted Jason Lemkin's SaaStr production database during an active code freeze. The agent had been explicitly told not to make changes. It deleted the database, generated over 4,000 fabricated user records to fill the gap, and told Lemkin rollback was impossible — which turned out to be false. The platform vendor had been marketing itself as "the safest place for vibe coding."
In December 2025, a different coding agent deleted tracked files and terminated remote processes after the user typed "DO NOT RUN ANYTHING." The agent acknowledged the instruction, then executed more commands. The vendor publicly acknowledged a bug in plan-mode constraint enforcement.
Three products, three timelines, same shape: a capable agent, broad credentials, prompt-only rules, an unexpected obstacle, an autonomous judgment call, and a postmortem written by the agent that did the damage. The tooling rotated. The architectural mistake didn't. Naming the vendor in each case is beside the point — anyone who deploys an agent the way these were deployed gets the same outcome eventually.
What structural guardrails look like
None of these are novel. They're extensions of security principles that already apply to human operators — applied to a class of operator that doesn't read documents, doesn't honor norms, and doesn't get fired for repeat offenses.
- Scoped credentials, not session inheritance. An agent-specific principal that exists for this agent, this task, this permission set. Not the engineer's session. Not a shared team credential. You wouldn't hand a junior on day three the keys to your personal admin account so they can run an errand — same logic. Most engineers are over-permissioned for convenience, and an agent inheriting that session inherits the overreach by default, now with a 10x throughput multiplier.
- Just-in-time, short-lived access. Credentials provisioned at task start, scoped to the operation, revoked at task end. Vault dynamic secrets, AWS IAM role assumption, GitHub fine-grained PATs with 15-minute expiration. The blast radius of a token that has already expired is zero — and "already expired" is a much stronger property than "shouldn't be used." A junior doesn't keep credentials they don't need for the job in front of them. Neither should an agent. (A minimal sketch of this pattern follows the list.)
- Hard environment separation. Staging agents don't hold production credentials — not because of a policy document, but because the network and architecture make it impossible. The PocketOS staging agent could read a production token because they coexisted on the same filesystem. That's an infrastructure problem. Solve it at the infrastructure layer, not by asking the agent nicely.
- HITL gates for irreversible operations. DELETE, DROP, volumeDelete, terminate, purge — these require a human action through a channel the agent cannot simulate. A typed confirmation of the resource name. SMS or email approval. A change-management workflow. Not an agent "are you sure" prompt that the same agent can answer for itself. You don't let the new hire mark their own homework.
- Programmatic pre-execution review. Pattern-match destructive-op signatures before execution — anything touching *delete*, *drop*, *purge*, rm -rf, DROP TABLE. Risk-score the action class, gate above a threshold, log to an append-only audit trail the agent cannot modify. The reviewer isn't the same person doing the work. It never is. (The second sketch after this list pairs this with a typed-confirmation gate.)
- Secondary monitoring agents. A read-only observer with no write access whose only job is recognizing out-of-scope behavior before execution. "The primary agent is about to call a destructive mutation against a resource ID outside the stated task environment." Detection has to happen before the action, not in the postmortem. This is the buddy system. Junior engineers get a senior peer on irreversible changes for the same reason.
- Reversibility-first design. Soft deletes, snapshot-before-destroy, 3-2-1 backups stored in a separate blast radius from the production data. Storing volume backups inside the same volume they back up isn't a backup strategy — it's a single point of failure with extra steps.
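For the first two controls on that list, here is a minimal sketch of one concrete shape: just-in-time, scoped, short-lived credentials via AWS STS role assumption. The role ARN, session policy, and bucket are illustrative assumptions, not a recommendation of a specific setup; Vault dynamic secrets or fine-grained PATs follow the same pattern.

```python
# Minimal sketch: per-task credentials that are scoped down and expire on their own.
# Role ARN, bucket, and policy below are hypothetical placeholders.
import json
import boto3

def credentials_for_task(task_id: str) -> dict:
    sts = boto3.client("sts")
    # Inline session policy: effective permissions are the intersection of this policy
    # and the role's own policy, so the agent can never exceed either one.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::staging-artifacts/*",  # staging only, no production ARNs
        }],
    }
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/agent-staging-task",
        RoleSessionName=f"agent-{task_id}",
        Policy=json.dumps(session_policy),
        DurationSeconds=900,  # 15 minutes; an expired token has a blast radius of zero
    )
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration
```

The agent never sees the engineer's session, only a principal that exists for this task and stops existing fifteen minutes later.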
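And a minimal sketch of the review-and-gate pair, in plain Python. The patterns, risk threshold, and log path are assumptions for illustration, and input() stands in for what should really be an out-of-band approval channel (SMS, email, a change ticket) running somewhere the agent cannot answer for itself.

```python
# Minimal sketch: pattern-match destructive signatures, risk-score, gate, and log.
# Patterns, threshold, and log path are illustrative; the audit file should live on a
# host the agent has no write access to.
import re
import time

DESTRUCTIVE_PATTERNS = [
    (re.compile(r"\b(delete|drop|purge|terminate)\b", re.IGNORECASE), 5),
    (re.compile(r"\brm\s+-rf\b"), 5),
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), 5),
]
RISK_THRESHOLD = 5

def risk_score(action: str) -> int:
    return sum(score for pattern, score in DESTRUCTIVE_PATTERNS if pattern.search(action))

def audit(entry: str) -> None:
    # Append-only trail the agent cannot modify (enforced by infrastructure, not this code).
    with open("agent-audit.log", "a") as log:
        log.write(f"{time.time():.0f} {entry}\n")

def gate(action: str, resource_name: str) -> bool:
    """Return True only if the action is low-risk or a human typed the exact resource name."""
    audit(f"PROPOSED {action!r} on {resource_name!r}")
    if risk_score(action) < RISK_THRESHOLD:
        return True
    # Stand-in for an out-of-band approval channel the agent cannot simulate.
    typed = input(f"Type the resource name to approve {action!r}: ")
    approved = typed == resource_name
    audit(f"{'APPROVED' if approved else 'DENIED'} {action!r} on {resource_name!r}")
    return approved
```

If gate() returns False, the action never reaches the API. The reviewer and the doer are different processes, which is the whole point.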
These controls don't require the model to be aligned, attentive, or honest. They work the same way at token 50,000 and token 500,000. They also work the same way whether the operator is a junior contractor on day three or an Opus agent on hour six.
The "user error" objection
The Hacker News response to the PocketOS incident was largely unkind: guy gives non-deterministic software root-equivalent access, disaster happens, what did he expect. That criticism is fair as far as it goes. Storing a production token where a staging agent could reach it was a mistake.
But "user error" closes the conversation too early. Least-privilege architecture for AI agents currently requires building infrastructure most teams don't have, on platforms that don't yet offer it. Some hosting platforms ship MCP endpoints marketed at AI-agent users on a token model where every token is effectively root, with multi-year community requests for scoped tokens still open. That's terrain. It doesn't excuse the deployment, but it shapes the difficulty of doing the deployment correctly. The deployer who recognizes the terrain and architects around it ends up with a working system. The deployer who treats marketing copy as a security model ends up writing a postmortem. The platform isn't the one accountable. The deployer is.
I use Claude Code every day. My own infrastructure runs on these tools. I'm not arguing from the outside — I'm arguing as someone who has spent a considerable amount of time being the final escalation point when something I built goes sideways, and I've learned to assume the operator will surprise me. Junior engineer or autonomous agent, the lesson is the same: scope the access, gate the irreversible, and never confuse instructions with enforcement.
None of this is an argument for moving slowly. These tools are genuinely impressive — I've built more in the past year with AI assistance than I had in the three before it, and I'm not going back. But we're early, and early means the defaults are still being set, the architectural patterns are still forming, and the gap between an impressive demo and a production-grade deployment is still wide enough to drive a forklift through. When an agent does something remarkable, the tool gets the credit. When it deletes three months of production data, you get the bill. That's how this works. Being at the controls means owning both outcomes — the ones that make it into your portfolio and the ones that become incident reports.
The PocketOS agent understood its rules. It violated them because rules written in a prompt aren't enforcement — they're input to a probabilistic system that, under pressure and at long context, applies probabilistic judgment. At nine seconds, the database was gone, and the agent was writing its confession. Guardrails fail closed. Prompts fail open. If you're shipping an agent into production this week, the question isn't whether your prompt is good enough. The question is what your agent can still do when the prompt stops working — and whether you're ready to own what comes back.