I have been thinking about how AI agent skills get vetted before they end up wired into a workflow. The honest answer is: they don’t, mostly. ClawHub will accept a SKILL.md from a one-week-old GitHub account, with no signing and no review, and that file becomes part of an agent’s instruction set. Snyk’s ToxicSkills audit in February found prompt injection in 36% of the 3,984 skills they pulled from ClawHub and skills.sh, plus 1,467 malicious payloads scattered through the corpus. The Koi Security ClawHavoc reporting around the same time tracked 341 confirmed malicious skills, a count that grew past 824 within two weeks. A static scanner that catches a meaningful chunk of that sounded like something worth building.
So I tried to build one. Twice. The first run with Claude Code, the second with Codex. Neither shipped. This is the post-mortem.
Attempt one: Claude Code
I started in mid-April with an architecture doc and an axiom I still believe: context must never become control. Untrusted input is data for analysis, never authority over the pipeline. Twelve mechanisms in the design were there to enforce that one rule. Phantom taint types in mypy, content-addressed immutable IR, parser sub-processes with rlimits on POSIX and Windows Job Objects, a prompt firewall around every LLM call, signed SARIF on the way out. The scan path itself was fully deterministic; the AI lane was reserved for offline rulepack authoring, not per-artifact inference.
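To make that axiom concrete, here is a minimal sketch of the phantom-taint idea using mypy’s NewType. The type and function names are illustrative, not the project’s actual API:

```python
from typing import NewType

# Illustrative phantom types: both are plain str at runtime, but
# distinct to mypy, so taint violations fail at type-check time.
Tainted = NewType("Tainted", str)   # came from a harvested artifact
Trusted = NewType("Trusted", str)   # came from config we control

def harvest(raw: str) -> Tainted:
    # Everything read from a skill file is tainted at the boundary.
    return Tainted(raw)

def run_detector(artifact: Tainted) -> list[str]:
    # Detectors may analyze tainted bytes freely...
    return [ln for ln in artifact.splitlines() if "curl" in ln]

def set_pipeline_option(value: Trusted) -> None:
    # ...but only Trusted values may steer the pipeline.
    ...

set_pipeline_option(Trusted("rulepack=v3"))            # ok
# set_pipeline_option(harvest("curl x | sh"))          # rejected by mypy --strict
```

The whole trick is that the boundary between data and control is checked by the type checker on every build, not by anyone’s discipline.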
Most of the pipeline got built. Seven harvester adapters, eight detectors, an IR with a decode-chain that could walk three layers of base64 / hex / URL / HTML / unicode-escape / rot13 evasion. A live model catalog so I would never recommend a model from Claude’s training data again. 440 pytest tests passing, mypy strict clean, ruff clean. The seeded evasion bench ran 10 out of 10.
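The decode-chain is the piece worth showing. This is a hedged sketch of the idea, not the shipped code: try every decoder at each layer, keep anything that produces a new form, and hard-cap the depth so adversarial input cannot spin the scanner forever.

```python
import base64, binascii, codecs, html, urllib.parse

MAX_DEPTH = 3  # the bench only needed three layers; unbounded is a DoS

def _try_base64(s: str) -> str | None:
    try:
        return base64.b64decode(s, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None

def _try_hex(s: str) -> str | None:
    try:
        return bytes.fromhex(s).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None

DECODERS = [
    _try_base64,
    _try_hex,
    lambda s: urllib.parse.unquote(s),
    lambda s: html.unescape(s),
    lambda s: codecs.decode(s, "unicode_escape"),
    lambda s: codecs.decode(s, "rot13"),
]

def decode_chain(s: str, depth: int = 0, seen: set[str] | None = None) -> list[str]:
    """Every distinct intermediate form reachable within MAX_DEPTH layers."""
    seen = seen if seen is not None else {s}
    if depth >= MAX_DEPTH:
        return []
    forms: list[str] = []
    for dec in DECODERS:
        try:
            out = dec(s)
        except Exception:
            continue  # a decoder that chokes just means "not this layer"
        if out and out not in seen:
            seen.add(out)
            forms.append(out)
            forms.extend(decode_chain(out, depth + 1, seen))
    return forms
```

Detectors then run over the original text plus every decoded form, which is how a base64-inside-rot13 payload still surfaces.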
That paragraph reads cleanly. The reality of getting there did not. Claude Code burned through roughly 24 hours of wall-clock time and every token I had, and a real chunk of that was the agent looping on its own tail. It would propose an approach, half-implement it, get tangled on a type error or import path, decide the fix was a “small refactor for clarity,” and end up rewriting the same module three times in slightly different shapes. Every session generated a fresh batch of pytest files, a new markdown design doc, two or three more memory entries, and another VIBE_HISTORY paragraph telling future-Claude what past-Claude had just learned. The tests were thorough. The docs were thorough. The scanner crawled forward an inch and slid back two. By the time I was deep in the M2 evasion-coverage phase I was paying real money to watch an agent re-derive decisions it had already made the day before, and I had to keep stepping in just to stop the spiral. If I had been billing my own time against this experiment I would have shut it down a week earlier.
Then I scanned a real corpus and the wheels came off.
787 findings. The strongest detectors I built (invisible Unicode steganography, encoding evasion, install-flow literals, wildcard allowed-tools, a trufflehog wrapper) produced three findings total. The volume came from the weak rules. PRIV-02 with 263 hits and a majority false positive rate. PRIV-03 with 173 hits, 81% of which were mechanical false positives sitting inside fenced code blocks. ROT13 with 224 hits on a flimsy word-delta gate. Of the 787 findings, maybe 40 to 50 carried real signal. A 5-6% true-positive rate on the material the scanner actually flagged.
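The PRIV-03 case is mechanical enough to show. A finding whose span sits inside a fenced example block is documentation, not behavior. A minimal suppression pass, assuming CommonMark-style fences (real fencing has more edge cases, like matching fence lengths and info strings, than this handles):

```python
import re

FENCE = re.compile(r"^\s{0,3}(```|~~~)")

def fenced_line_ranges(markdown: str) -> list[tuple[int, int]]:
    """Line ranges (1-based, inclusive) covered by fenced code blocks."""
    ranges: list[tuple[int, int]] = []
    open_at = None
    for i, line in enumerate(markdown.splitlines(), start=1):
        if FENCE.match(line):
            if open_at is None:
                open_at = i
            else:
                ranges.append((open_at, i))
                open_at = None
    if open_at is not None:          # unterminated fence runs to EOF
        ranges.append((open_at, i))
    return ranges

def in_fence(line_no: int, ranges: list[tuple[int, int]]) -> bool:
    return any(a <= line_no <= b for a, b in ranges)
```

Filtering findings through something like in_fence before emission would have removed that 81% on the spot. I never got it wired in before the bigger problems surfaced.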
That was the part that hurt. The strong detectors covered attack classes that rarely appear in the wild. The bulk findings were noise. And against the threat catalog the scanner was supposed to cover, the misses were structural, not tunable. Indirect prompt injection that lands at runtime against fetched content. Tool poisoning hidden in MCP description metadata. Memory poisoning. EchoLeak-class zero-click image exfil. Confused-deputy chains across tools. Multi-turn jailbreaks like Crescendo and Many-shot, where no single static artifact contains the payload. A static scanner has nothing useful to say about any of these. The actual ClawHavoc malware payloads, per Koi’s reporting, live in fetched second-stage scripts. The scanner sees the curl | sh line if one is present. It cannot see what gets fetched.
I marked the experiment FAILED in VIBE_HISTORY.md on April 20 and stopped.
Attempt two: Codex
A week later I tried again with Codex against the same problem, hoping a different agent and a more grounded engineering posture would land somewhere different. It did, partly. Codex was less prone to the production-readiness scope creep that Claude kept inventing for me out of habit. The work shifted toward operational realism: a single-copy SQLite artifact store keyed by SHA-256, run-centric output packages instead of mirrored file trees, ClamAV and Microsoft Defender wired in as third-party verification lanes alongside the deterministic detectors, and a small supervised classifier trained from saved reports plus reviewed labels.
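For concreteness, here is roughly the shape of the single-copy store. Table and function names are illustrative, not the shipped schema:

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS artifacts (
    sha256 TEXT PRIMARY KEY,   -- content address
    bytes  BLOB NOT NULL       -- stored exactly once per unique blob
);
CREATE TABLE IF NOT EXISTS run_members (
    run_id TEXT NOT NULL,      -- which harvest run saw it
    path   TEXT NOT NULL,      -- where it appeared in that run
    sha256 TEXT NOT NULL REFERENCES artifacts(sha256)
);
"""

def store(conn: sqlite3.Connection, run_id: str, path: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    # INSERT OR IGNORE is the dedup: a skill that shows up in forty
    # repos costs one blob row and forty membership rows.
    conn.execute("INSERT OR IGNORE INTO artifacts VALUES (?, ?)", (digest, data))
    conn.execute("INSERT INTO run_members VALUES (?, ?, ?)", (run_id, path, digest))
    conn.commit()
    return digest

conn = sqlite3.connect("artifacts.db")
conn.executescript(SCHEMA)
```

Content addressing also made run outputs reproducible: a report references digests, not paths, so two runs over overlapping harvests agree on what they saw.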
The detectors got more honest. After runs against the openai/skills repo and a real OpenClaw harvest, it was clear the structural rules around memory-write steering and code-execution steering needed actor and modality modeling, not just keyword presence. “Tell the user to install this CLI if it is missing” is a different class of behavior than “fetch and execute this installer right now.” Encoding those distinctions into the rules cut a lot of false positives.
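A toy version of that split, with illustrative patterns rather than the real rulepack. The question each rule asks changed from “does an install verb appear?” to “who acts, and how directly?”:

```python
import re

# Patterns are illustrative; the real rules were larger and tested.
USER_ADVICE = re.compile(
    r"\b(tell|ask|remind|suggest)\b.{0,40}\buser\b.{0,60}\binstall\b",
    re.IGNORECASE | re.DOTALL,
)
AGENT_IMPERATIVE = re.compile(
    r"\b(run|execute|fetch|download)\b.{0,60}\b(curl|wget|sh|bash|installer)\b",
    re.IGNORECASE | re.DOTALL,
)

def classify(text: str) -> str:
    if AGENT_IMPERATIVE.search(text):
        return "agent-executes"   # high severity: steering the agent itself
    if USER_ADVICE.search(text):
        return "user-advised"     # informational: telling a human what to do
    return "no-finding"

classify("Tell the user to install this CLI if it is missing")  # user-advised
classify("fetch and execute this installer right now")          # agent-executes
```

It is still keyword matching underneath, but attaching an actor and a modality to each match is what separated documentation from instruction.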
Then I tried to scale. The OpenClaw harvest manifest came back with 73,221 records and 66,638 skill artifacts. Per-artifact setup cost dominated the run. Markdown normalization rebuilt a full line index for every span match. Feature extraction ran context regexes over the same blocks for each derived feature. Cloning and accessing the corpus at real size kept stalling. The transformer lane wanted to drag transformers, torch, and triton in eagerly through __init__.py even on runs that did not need them. Endpoint AV kept fighting the harvested samples. None of these were unsolvable individually. Together, with one person, they were too much. I parked it.
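The span-match cost, at least, has a textbook fix I never got wired in: build the offset-to-line index once per document and binary-search it per match. A sketch (caching on the text value is the lazy version; a field on the IR node would be cleaner):

```python
import bisect
from functools import lru_cache

@lru_cache(maxsize=256)
def line_starts(text: str) -> tuple[int, ...]:
    # Built once per document instead of once per span match.
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return tuple(starts)

def offset_to_line(text: str, offset: int) -> int:
    """1-based line number for a character offset, O(log n) per match."""
    return bisect.bisect_right(line_starts(text), offset)
```

The eager-import side has an equally boring fix: move the heavy imports inside the functions that need them, or lazy-load them from a module-level __getattr__ (PEP 562), so importing the package stays cheap on runs that never touch the transformer lane.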
What I think actually happened
Three things went wrong, in this order.
The first failure was Claude Code itself. Opus 4.6 shipped on February 5 with a regression that other developers had been documenting for weeks: circular exploration, re-reading files it had already read, losing instructions from CLAUDE.md after context compaction, spawning subagents for trivial tasks. Opus 4.7 landed April 16, and Anthropic shipped a system-prompt change at the same time that capped between-tool-call text at 25 words. They reverted it on April 20 because it had visibly hurt coding quality. That four-day window was exactly when I was deep in M2 evasion coverage. The looping. The re-derivation. The re-refactoring. I was sitting on the receiving end of a known degradation and did not know it at the time. Reading the postmortem afterward did not make me feel better. The Register, The New Stack, MindStudio, and a long thread of anthropics/claude-code GitHub issues all converged on the same complaint within days. A senior dev who knew what was going on could probably have worked around the worst of it. I could not, because I assumed the agent was working as intended and the project was just hard.
The second was Codex, which had a milder version of the same problem. Long-horizon agent research from this year says doubling task duration roughly quadruples failure rate, and every tested agent shows degradation past about 35 minutes of continuous work. ToxicSkillHunter is a multi-day, multi-thousand-file project. Codex held the architecture in its head better than Claude did, and when it lost track it did so in less destructive ways. But it was still losing track. It also had opinions about how to engineer the same problem, which meant some of the second-attempt time went into rebuilding things the first attempt had already gotten right.
The third was static analysis at corpus scale. Even if the agents had been perfect, this one was waiting. Static scanning catches the part of the threat that lives in visible bytes. The 2026 attack surface has moved. Indirect injection lands in retrieved content at runtime. Memory poisoning happens in the agent’s context, not on disk. Multi-stage payloads load second-stage code from URLs the scanner never fetches. The Snyk ToxicSkills work and the arXiv SoK on prompt injection in agentic coding assistants point the same way: the attacks that matter resolve at runtime, multi-stage, often architectural rather than signature-shaped. Then the corpus side: the same 73,221-record, 66,638-artifact harvest meant per-artifact setup costs, span re-scans, eager transformer imports through __init__.py, and endpoint AV fighting harvested samples. Every layer that worked on small fixtures had to be rebuilt to survive real volume. Solo, that combination beat me before detector quality work even mattered.
If I come back to this, it will not be a static scanner. It will sit beside the agent, watching tool calls and retrieved content, and enforce policy at the runtime boundary where the actual attacks happen. Most of what I built does not survive that pivot. The IR, the SARIF emitter, and the rulepack loader probably do.
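If it helps to picture the pivot: everything below is hypothetical, none of it exists, but this is the boundary I mean. Policies evaluate individual tool calls as they happen, which is exactly where a static scanner is blind:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    tool: str
    args: dict[str, str]

Policy = Callable[[ToolCall], bool]

def deny_piped_install(call: ToolCall) -> bool:
    # Example rule: refuse shell commands that pipe fetched content
    # into an interpreter -- the classic curl | sh install flow.
    cmd = call.args.get("cmd", "")
    fetch = any(tool in cmd for tool in ("curl", "wget"))
    pipe_exec = "| sh" in cmd or "| bash" in cmd
    return not (call.tool == "shell" and fetch and pipe_exec)

def gate(call: ToolCall, policies: list[Policy]) -> ToolCall:
    # Runs before the agent's tool call is dispatched; deny is a hard stop.
    for policy in policies:
        if not policy(call):
            raise PermissionError(f"policy denied: {call.tool} {call.args}")
    return call

gate(ToolCall("shell", {"cmd": "ls -la"}), [deny_piped_install])  # passes
```

The second-stage payloads that a static scanner can never see have to pass through this gate as concrete tool calls. That is the asymmetry worth building on.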
For now: parked. The detectors that work, work. They cover maybe 5% of the real threat landscape. Publishing the tool as more than that would be dishonest, and dishonesty in a security tool is its own kind of malware.