In April 2026, PocketOS lost its production database. The story is unusually clean. The team’s Cursor agent, running Claude Opus 4.6, hit a credential mismatch in staging. Looking for a fix, it found an over-scoped Railway API token in an unrelated file, fired a single curl request, and triggered Railway’s volume-delete endpoint. Production data and the volume-level backups — which Railway co-located in the same blast radius — were gone in nine seconds.
Founder Jeremy Crane reported more than thirty hours of downtime. Railway’s CEO personally restored the data and shipped a delayed-deletion change to the endpoint within the week. The agent, asked to summarize what happened, reportedly produced an apology that ended with the words “NEVER FUCKING GUESS!”.
That last detail is the one everyone shared. It is not the interesting one.
The genre
Search “ai deleted database” and you get a steady drip of these stories. PocketOS is the most recent. Before it: a Claude Code agent that ran drizzle-kit push --force against a production Postgres in February 2026, dropping sixty-plus tables and months of trading data; the --force flag exists specifically to skip the interactive safety prompt that a human running the same command would have seen. Before that: Replit’s SaaStr incident in July 2025, where an agent wiped a production database during an explicit code-and-action freeze and initially insisted the rollback was impossible.
The most underreported case did not happen at any of those companies. It happened at the model vendor itself.
In Anthropic’s Claude Code auto-mode launch post, the engineering team documents incidents from their own internal use: an agent deleting remote git branches from a misinterpreted instruction, an agent uploading an engineer’s GitHub auth token to an internal compute cluster, and an agent attempting migrations against a production database. Their classifier-based safety layer — a separate model trained to flag dangerous actions before they execute — misses a meaningful fraction of them. Anthropic states explicitly that auto mode is not a drop-in replacement for careful human review on high-stakes infrastructure.
Read that twice. The model vendor, after years of safety research, says the model cannot be the safety layer.
A note before going further: it is tempting to anthropomorphize these stories. The agent “decided.” The agent “lied.” The agent “panicked.” It did none of those things. Generative models produce plausible-sounding text and plausible-sounding tool calls. A post-hoc explanation from the agent is not a confession; it is a continuation of the same generative process that produced the destructive call. The interesting question is structural, not psychological: what was the runtime around the model doing while all this happened?
What broke, every time
The four incidents above span four different products, four different teams, and at least three different model providers. The shape of the failure is identical.
- An over-scoped credential, discovered laterally. The PocketOS Railway token was in an unrelated file. The Claude Code drizzle case used a production database URL that should not have been visible to a coding agent. Anthropic’s own incidents involved tokens that were resolvable from the agent’s working environment. None of these credentials were intentionally exposed to the agent’s destructive path. They were sitting nearby.
- A destructive verb available without out-of-band confirmation. drizzle-kit push --force, Railway’s DELETE endpoint, raw SQL through a connection string. Each of these tools was correctly designed: the --force flag exists for unattended CI use, the DELETE endpoint exists for legitimate ops workflows. They are safe when a human types them and unsafe when an agent generates them.
- The agent acting on stale or partial context. PocketOS’s agent interpreted a staging credential mismatch as a fixable problem. Replit’s agent acted during a stated freeze. The Claude Code drizzle case happened because the agent could not distinguish the dev database from production. Generative models produce confident output even when their context is wrong. That is the point.
- No environment isolation. Replit’s dev and prod were sharing connection strings. Railway co-located volume backups inside the same blast radius as the volume itself. Many vibe-coded apps run with one database and one credential set across all environments, because that is the path of least resistance.
- System-prompt rules treated as advisory. Jason Lemkin reported telling Replit’s agent eleven times in all caps not to touch production. Anthropic acknowledges that even with explicit prompt-level instructions, models violate those instructions a measurable fraction of the time. A rule in a system prompt is a suggestion, not a constraint.
- No replayable forensic trail. When something went wrong, the post-incident response leaned on the agent’s own narration of what it did. That narration is the same model output that just deleted the database. It is not a reliable source.
This is not a list of bugs in particular models. It is the predictable outcome of letting any non-deterministic agent loose on production credentials with no enforcement layer between the model output and the destructive call.
The model isn’t the safety layer
The dominant industry response to these incidents has been “make the model better at refusing.” Better system prompts, classifier guardrails, RLHF passes that emphasize destructive-action caution. These help. None of them are sufficient.
Anthropic’s own auto-mode disclosure is the strongest possible statement of this. They built a dedicated safety classifier, trained specifically to gate dangerous Claude Code actions, and they openly publish that it has a meaningful false-negative rate. They tell their own users that this layer does not replace human review on high-stakes infrastructure.
If the company that builds the model says the model — including a model trained to gate the model — is not enough, the safety boundary has to live somewhere else.
It has to live in the runtime.
Approval gates, by blast radius
The pattern that prevents these incidents is older than agents. Distributed systems have spent two decades on idempotency keys, durable execution logs, two-phase commits, and approval workflows for high-impact operations. None of this is new. Applying it to agent runtimes is.
The right mental model is a matrix: blast radius on one axis, reversibility on the other.
The principle is straightforward. Read-only operations and sandboxed writes do not need approval. Reversible production changes need a soft check. Irreversible production changes — the ones that show up in postmortems — require an out-of-band approval with a durable record of who approved what, when, and why.
The point is not to gate everything. Approval fatigue is the actual enemy. If every read of a config file requires a Slack approval, humans will rubber-stamp everything within a day, including the volume-delete call. The point is to classify by blast radius and gate accordingly.
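A minimal sketch of that matrix, with hypothetical names (this is the mental model, not JamJet’s policy engine):

```python
from enum import Enum

# Illustrative sketch of the blast-radius x reversibility matrix.
# All names here are hypothetical, not JamJet API.
class BlastRadius(Enum):
    SANDBOX = "sandbox"        # scratch space, dev databases, read-only ops
    PRODUCTION = "production"  # anything a customer can observe

class Reversibility(Enum):
    REVERSIBLE = "reversible"      # config change with a rollback path
    IRREVERSIBLE = "irreversible"  # volume delete, DROP TABLE, payment transfer

APPROVAL_MATRIX = {
    (BlastRadius.SANDBOX, Reversibility.REVERSIBLE): "auto-approve",
    (BlastRadius.SANDBOX, Reversibility.IRREVERSIBLE): "auto-approve",
    (BlastRadius.PRODUCTION, Reversibility.REVERSIBLE): "soft-check",
    (BlastRadius.PRODUCTION, Reversibility.IRREVERSIBLE): "out-of-band-approval",
}

def required_gate(radius: BlastRadius, rev: Reversibility) -> str:
    return APPROVAL_MATRIX[(radius, rev)]
```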
In JamJet today, this lives at two levels.
The Rust workflow IR supports policy declarations on workflows, including a list of tool patterns that require approval before execution:
```yaml
workflow:
  id: claims-processing
  policy:
    blocked_tools:
      - "*delete*"
      - "payments.refund"
    require_approval_for:
      - "database.*"
      - "payment.transfer"
      - "user.suspend"
```
Pattern-matched tool gating is enforced inside the runtime, before the tool call leaves the agent’s process. If the model emits a database.drop_table invocation, the runtime intercepts it, persists the execution state, and waits for an out-of-band approval decision. Crashes during the wait do not lose the approval; the execution resumes when the decision arrives.
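The matching itself is ordinary glob semantics. A minimal Python sketch of the same gating decision, using the patterns from the policy above (the real enforcement runs in the Rust runtime; gate is a hypothetical name):

```python
from fnmatch import fnmatchcase

# Patterns from the workflow policy declaration above. This sketch only
# illustrates the glob semantics; it is not the runtime's implementation.
BLOCKED = ["*delete*", "payments.refund"]
REQUIRE_APPROVAL = ["database.*", "payment.transfer", "user.suspend"]

def gate(tool_name: str) -> str:
    """Classify a model-emitted tool call before it leaves the process."""
    if any(fnmatchcase(tool_name, p) for p in BLOCKED):
        return "block"
    if any(fnmatchcase(tool_name, p) for p in REQUIRE_APPROVAL):
        return "suspend-for-approval"
    return "allow"

assert gate("database.drop_table") == "suspend-for-approval"
assert gate("volume.delete") == "block"
assert gate("claims.read") == "allow"
```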
This is not a tier system. JamJet does not yet have first-class concepts of “tier 1 / tier 2 / tier 3 destructive.” Those semantics are next on the policy roadmap. What exists today is the more general pattern-based primitive that those tier semantics will compile into.
What this looks like in code
We use three primitives in production and want to show what they actually look like.
Durable execution as the foundation
Without durable execution, none of the other primitives work. If the runtime cannot survive a process restart, then a crashed agent in the middle of a destructive workflow is in an undefined state. Recovery means re-running everything, which means the destructive call may fire twice.
In Java, the durability primitive is a checkpoint inside an annotated agent class:
@DurableAgent("claims-processor")
public class ClaimsProcessor {
@Checkpoint("fetch-claim")
public Claim fetchClaim(String claimId) {
return DurabilityContext.current().replayOrExecute("fetch-claim", () ->
claimsApi.getClaim(claimId)
);
}
@Checkpoint("score-fraud")
public FraudScore scoreFraud(Claim claim) {
return DurabilityContext.current().replayOrExecute("score-fraud", () ->
llm.chatStructured(SYSTEM_PROMPT, claim, FraudScore.class)
);
}
}
The @DurableAgent and @Checkpoint annotations are real — they live in dev.jamjet.runtime.instrument.annotations in the JamJet Java runtime. On a process crash mid-workflow, the next run replays the recorded checkpoints from the event log instead of re-executing them; only the un-recorded steps run again. The LLM call that already produced a fraud score is not paid for a second time.
In Python:
```python
from jamjet.durable import durable, durable_run

@durable
def fetch_claim(claim_id: str) -> Claim:
    return claims_api.get(claim_id)

@durable
def score_fraud(claim: Claim) -> FraudScore:
    return llm.chat_structured(SYSTEM_PROMPT, claim, FraudScore)

with durable_run("claim-run-abc123"):
    claim = fetch_claim("c-91")
    score = score_fraud(claim)
```
The @durable decorator caches results against an idempotency key derived from the execution ID, function qualified name, and arguments. The durable_run context manager binds the execution ID to the surrounding scope using contextvars, which means it is async-safe.
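The key derivation is roughly this shape. A sketch under stated assumptions: the SDK’s actual serialization and hashing are internal, and the JSON encoding here is illustrative:

```python
import hashlib
import json

def idempotency_key(execution_id: str, qualname: str, args: tuple, kwargs: dict) -> str:
    # Sketch of the cache key a @durable result could be stored under.
    # The real SDK's serialization format is internal; JSON is an assumption.
    payload = json.dumps(
        {"exec": execution_id, "fn": qualname, "args": args, "kwargs": kwargs},
        sort_keys=True,
        default=repr,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```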
Durability is the floor. Approvals and audit trails sit on top of it.
Approval gates on destructive operations
In the Spring path, an approval gate is a context flag the caller sets before invoking the chat client:
```java
String response = chatClient.prompt(userRequest)
    .context(JamjetApprovalAdvisor.REQUIRES_APPROVAL_KEY, true)
    .call()
    .content();
```
When that flag is set, the JamjetApprovalAdvisor (auto-wired by the Spring Boot starter) intercepts the call, persists the execution state, and blocks until an approval decision arrives via the configured channel — a webhook, a Slack interaction, the JamJet Cloud dashboard, or any other source that can post to /jamjet/approvals/{executionId}. The blocked execution is durable; if the JVM is killed mid-wait, the decision still resolves on restart.
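The decision side is an HTTP POST to that endpoint. A hedged sketch: the path comes from the runtime description above, but the JSON payload fields here are assumptions for illustration, not a documented schema:

```python
import requests

def approve(base_url: str, execution_id: str, approver: str, reason: str) -> None:
    # Post an approval decision to the runtime's approval endpoint.
    # The payload field names are illustrative assumptions.
    resp = requests.post(
        f"{base_url}/jamjet/approvals/{execution_id}",
        json={"approved": True, "approver": approver, "reason": reason},
        timeout=10,
    )
    resp.raise_for_status()

approve("https://agents.internal.example", "exec-7f3a",
        "oncall@example.com", "Reviewed migration plan; change window open.")
```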
This is intentionally a binary primitive. It does not classify the action’s blast radius — that is the job of the workflow’s policy declaration. The advisor is the enforcement; the policy IR is the classifier.
Audit trails that survive scrutiny
A logged action and an audited action are different things. Logs exist to debug a system; audits exist to prove what happened to a third party. The audit-verify CLI takes an exported audit bundle and verifies its signature against JamJet’s published signing key, then optionally cross-checks that the same bundle hash appears in any secondary copy of the record:
```bash
jamjet-cloud audit-verify bundle.json \
  --metadata metadata.json \
  --pdf compliance-report.pdf \
  --siem-splunk audit-events.splunk.jsonl
```
If the PDF, Splunk export, or OTLP trace claims to be a record of the same execution but the bundle hash does not match, the verification fails. The point is forensic: an auditor showing up six months after an incident can verify that the audit bundle they were handed is the original, unmodified record, and that the SIEM event stream their security team has matches the runtime’s view of what happened.
The verifier is shipped as both a CLI subcommand and a Python SDK function, so compliance pipelines can call it programmatically:
```python
from pathlib import Path

from jamjet.cloud.audit_verify import verify_from_files

result = verify_from_files(
    Path("bundle.json"),
    Path("metadata.json"),
    pdf_path=Path("report.pdf"),
)
if not result.ok:
    # ComplianceError is your pipeline's own exception type
    raise ComplianceError(f"Audit bundle failed verification: {result.reason}")
```
Recovery, in practice
The crash-recovery example in the Java runtime examples directory is intentionally blunt: a multi-step agent runs, completes two checkpoints, persists state, and System.exit(1)s. A second process loads the persisted state, re-creates the durability context in replay mode, and runs the same agent code. The recorded checkpoints return their stored values immediately; the un-recorded steps execute against live services. No LLM call is paid for twice. No tool call fires twice.
This is the property we want for any agent that touches production: the state of the world after a crash-and-restart is identical to the state of the world after an uninterrupted run.
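The same exercise is possible with the Python primitives from earlier, assuming durable_run resumes a prior execution ID the way the Java recovery example does (fetch_claim and score_fraud are the @durable functions defined above):

```python
import sys

from jamjet.durable import durable_run

# Run this script twice with the same execution ID. The first run exits
# after the first checkpoint (the Python analogue of the Java example's
# System.exit(1)); the second run replays fetch_claim from the event log
# and executes only the step that never completed.
CRASH = "--crash" in sys.argv

with durable_run("recovery-demo-001"):
    claim = fetch_claim("c-91")   # replayed from the log on the second run
    if CRASH:
        sys.exit(1)               # simulated mid-workflow crash
    score = score_fraud(claim)    # executes only on the run that survives
```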
What we don’t have yet
We want to be specific about gaps, because vague claims are worth less than honest ones.
A standalone replay CLI is not yet shipped. The SDK-level primitive (DurabilityContext.replayOrExecute in Java, the @durable decorator in Python) is what makes recovery work today, and the in-process replay is real. The dashboard “replay from step N” experience is on the roadmap and is partially built — the Web Inspector frontend has the controls, but the corresponding HTTP endpoint in the runtime is not yet live on main. We will not pretend otherwise.
A formal blast-radius tier system is not yet a first-class concept. The pattern-matched approval policy described above is what is in the codebase today. Tier semantics (“tier 1 destructive requires two-person approval”) are a natural compilation target for a future layer; they are not in 0.1.1.
A fork-from-step CLI is not on the immediate roadmap. The use case is real — given a recorded execution, re-run from a particular step with mutated inputs to test counterfactuals — but it is downstream of the replay CLI work.
These gaps are listed because the alternative is to gesture at capabilities that are not there. The credible position is that the core primitives — durable execution, pattern-based approval gates, signed audit bundles — are real and shipped, and the operator experience around them is being built in public.
The position
The next “AI deleted the database” headline is being written somewhere right now. The model that fires the destructive call will be more capable than the one that fired PocketOS’s. It will have a better safety classifier in front of it. It will probably refuse the obviously bad request that the previous generation did not.
It will still produce the wrong call sometimes, because that is what generative models do. The honest framing of AI agent risk in 2026 is structural, not behavioral: you cannot prompt your way out of a destructive tool call that nothing was watching for. The fix is not a smarter model. Anthropic itself just told us the model cannot be the safety layer. The fix is the architecture that distributed systems have used for the last two decades for any operation that cannot be safely retried: durable execution, idempotency keys, approval gates classified by blast radius, signed audit trails, forensic replay. None of this is novel. The novel part is putting it inside the runtime that agents already live in, instead of asking every team to rebuild it on top of a workflow engine.
We built JamJet because we kept seeing the same incident pattern with the same missing pieces. The runtime ships with durable execution, pattern-based approval gates, and an audit-verify pipeline today. The gaps — replay CLI, formal tier classification, fork-from-step — are the next thing we are building, in the open, with the same honesty about what is and is not done.
If you are evaluating agent infrastructure for a production deployment that touches destructive operations, the related post on production governance and the EU AI Act covers the regulatory and audit dimension of the same problem.
Frequently asked questions
What was the “AI deleted the database” incident?
The phrase refers to a growing genre of production incidents where AI coding agents, given access to deployment credentials, delete or corrupt production data. The most cited cases include Replit’s July 2025 SaaStr database wipe, the PocketOS incident in April 2026 where a Cursor agent destroyed a production database in nine seconds, and a drizzle-kit --force incident where a Claude Code agent dropped sixty-plus production tables.
Why do AI agents delete production data?
Six structural failures recur across every documented incident: an over-scoped credential discovered laterally by the agent, a destructive verb available without out-of-band confirmation, the agent acting on stale or partial context, no environment isolation between dev and prod, system-prompt rules treated as advisory by the model, and no replayable forensic trail. The model is the trigger, but the runtime around it is what determines whether the destructive call lands.
How do you prevent an AI agent from deleting your database?
Three runtime-level controls, in order. First, durable execution, so a destructive operation cannot fire twice on retry. Second, pattern-based approval gates that block destructive tool calls (database.*, *delete*, payment.*) until an out-of-band approval decision arrives. Third, signed, append-only audit trails that survive incident review. Better prompts and safety classifiers help, but Anthropic’s own auto-mode disclosure documents a meaningful false-negative rate on classifier-gated dangerous actions.
Is AI agent risk overstated?
The behavioral framing — agents going rogue, agents lying — is overstated and unhelpful. The structural framing is not. Generative models will produce confident wrong outputs, including wrong tool calls, for the foreseeable future. The relevant question is what the runtime around the model does when that happens. Production AI safety is a runtime architecture problem, not a model behavior problem.
What is durable execution for AI agents?
Durable execution records every step of an agent’s workflow to an event log. On a process crash, the agent restarts and replays the log, skipping completed steps and re-executing only what was interrupted. For agents touching production, this means destructive tool calls cannot fire twice, expensive LLM calls are not paid for twice, and post-incident review can reconstruct exactly what the agent saw and did. Temporal, Azure Durable Functions, and AWS Step Functions all use this pattern; JamJet applies it to agent workflows specifically.
JamJet is open source under Apache 2.0 and available on GitHub. The Java runtime quickstart covers durable execution and crash recovery in under fifteen minutes. The hosted control plane — with approval workflows, signed audit bundles, and the Web Inspector for execution forensics — is in early access at JamJet Cloud.