
From Robot Arenas to Cyber Defense: Building Trustworthy AI Through Games

Most people learn about “AI in security” the same way they learn about “cloud security”: by being voluntold in a meeting and handed a vendor deck full of hockey sticks.

I learn it the opposite way.

I throw the AI into a sealed room with a handful of rules, a clock, and something that can hurt it. Then I watch what it does when it’s under pressure and it can’t talk its way out.

That “sealed room” can be an arena of pixelated robots, a 2D car skidding around a procedurally generated track, a tank lobbing artillery into destructible terrain, or two ships circling a star while gravity tries to kill them both. It doesn’t matter that the worlds are games. What matters is they’re closed systems. Every action has a measurable consequence. Every claim can be verified. And every shortcut gets exposed the moment the agent meets an opponent that doesn’t play nice.

This post is about why I keep building these little battlegrounds, and what they’ve taught me about deploying AI where it actually matters: the SOC, the network, and the parts of your business that bleed money when something goes wrong.

Why games are the best lie detector

If you’ve spent any time around LLM demos, you’ve seen the trick: ask a model a question that sounds like it has one right answer, then admire how confidently it delivers three wrong ones and a motivational quote.

That’s not a moral failing. It’s a measurement problem.

In text-only worlds, we tend to grade models on how good they sound. In closed environments, you grade them on what they do. You don’t get to “almost” dodge a torpedo. You don’t get a participation trophy for “intending” to stop lateral movement. You either kept the network clean, or you didn’t.

This is the same reason capture-the-flag is useful, and “cyber ranges” are useful, and why even the best incident responders love a good tabletop: constraints create signal.

The research community has been circling the same point from different angles for years. One example I keep coming back to is Eureka, which uses an LLM to generate reward functions (actual code) and iteratively improves them based on training feedback. Across 29 RL environments, those LLM-generated rewards outperformed human-designed ones on most tasks and produced large average improvements (83% of tasks; 52% average normalized improvement) (paper).
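
The mechanics are worth seeing in miniature. Below is a rough sketch of that loop in Python, assuming hypothetical llm_propose_reward, train_policy, and evaluate helpers; the real Eureka pipeline does evolutionary search over many candidates per round, but the shape is the same: the LLM writes reward code, training produces numbers, and the numbers go back into the next prompt.

  # Sketch of an Eureka-style loop: an LLM proposes reward functions as code,
  # we train against each candidate, and feed evaluation stats back for the
  # next round. llm_propose_reward(), train_policy(), and evaluate() are
  # hypothetical stand-ins, not the actual Eureka implementation.

  def evolve_reward(env_description, rounds=5, candidates_per_round=4):
      best_fn, best_score, feedback = None, float("-inf"), ""
      for _ in range(rounds):
          for _ in range(candidates_per_round):
              src = llm_propose_reward(env_description, feedback)  # returns Python source
              namespace = {}
              exec(src, namespace)                # the reward is code, not a vibe
              reward_fn = namespace["reward"]     # assumed signature: reward(state, action) -> float
              policy = train_policy(reward_fn, steps=50_000)
              score, stats = evaluate(policy)     # task metric the LLM never sees directly
              feedback = f"score={score:.2f}, stats={stats}"
              if score > best_score:
                  best_fn, best_score = reward_fn, score
      return best_fn, best_score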

That’s interesting for robots… but it’s foundational for security. Because if you can’t define what “good” looks like, you can’t train an agent to behave. And if you define it poorly, you’ll train an agent that optimizes the wrong thing at machine speed.

So I built four games, not as toys, but as instruments.

The four arenas

Here’s the short version. I’ll expand on each below, because the devil is always in the details.

Arena | What the AI must learn | Why it maps to security
AI Robots (CROBOTS-style battle arena) | Strategy, deception, constraint satisfaction, failure recovery | Alert triage under uncertainty; adversarial interaction; “good enough” decisions fast
Learn2Drive (2D self-driving RL sim) | Perception from noisy sensors, continuous control, generalization across tracks | Detection from imperfect telemetry; resilience to environment drift; avoiding brittle rules
Scorched Earth (MOAG artillery) | Planning under randomness (wind), resource management, non-stationary terrain | IR under partial observability; cost-aware response; dealing with shifting attacker behavior
SpaceWar! (gravity + torpedoes) | Timing, momentum, self-play, policy diversity | Adversary emulation; adaptive defense; learning without overfitting to one “red team”

And yes, I know what you’re thinking: “Ron, you built a SpaceWar clone to explain incident response.”

Correct. And Marcus Ranum challenged me to do it. And it works.

Arena #1: AI Robots (CROBOTS, but with modern AI)

I grew up on games where you didn’t play the tank—you wrote the tank. CROBOTS was that kind of game: you wrote a little program, compiled it, and then watched it get humiliated in an arena by someone who had been thinking about angles and timing longer than you’d been alive.

That’s the point. In CROBOTS, the arena is honest.

In AI Robots, I rebuilt that feel, then did the thing you’re not “supposed” to do: I let modern language models write the robot programs. Claude, GPT, Gemini, Grok—take your pick. They all get the same rules. They all get the same constraints. Then we see who can produce a strategy that survives contact with another machine.

What surprised me wasn’t that the models could write code. They can. What surprised me was how often the winning strategy wasn’t “smart”; it was consistent. The model that wins more often isn’t necessarily the one with the best idea. It’s the one that can:

  • keep state without lying to itself,
  • follow its own rules under stress,
  • and recover when its plan fails.

That’s the SOC in a nutshell.

Analysts don’t lose because they don’t know MITRE ATT&CK. They lose because they’re drowning in partial truths, working under time pressure, and making decisions based on telemetry that is noisy on a good day and actively adversarial on a bad one.

In this arena, you can watch models fall into the same traps:

  • Overconfidence: “I’m safe behind cover” (while drifting directly into a minefield).
  • Inconsistent reasoning: it “remembers” its strategy until the first anomaly shows up.
  • Constraint failure: it keeps issuing an action that the rules forbid, because it forgot the rules existed.

In security terms: false positives, stale playbooks, broken automation, and response actions that look correct in a post-mortem but were impossible at runtime.
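
To make “consistent” concrete, here’s a hedged sketch of the kind of controller the better-performing models converge on: explicit state, a legality check before every command, and a fallback when the plan stops working. The scan/drive/fire interface echoes the CROBOTS tradition but is illustrative, not the exact airobots API.

  # Illustrative controller loop in the CROBOTS spirit: explicit state,
  # a constraint check before every command, and recovery when the plan fails.
  # scan()/drive()/fire() mirror the classic interface but are stand-ins,
  # not the exact airobots API.

  MAX_FIRE_RANGE = 700          # a rule the robot must not forget
  state = {"mode": "patrol", "last_contact": None, "stuck_ticks": 0}

  def step(robot):
      bearing, distance = robot.scan(sweep=10)
      if distance:                              # contact: record it instead of guessing later
          state["mode"], state["last_contact"] = "engage", (bearing, distance)

      if state["mode"] == "engage":
          bearing, distance = state["last_contact"]
          if distance <= MAX_FIRE_RANGE:        # legality check before acting
              robot.fire(bearing, distance)
          else:
              robot.drive(bearing, speed=60)    # close distance rather than issue an illegal shot
      else:
          robot.drive(robot.heading(), speed=40)   # patrol

      if robot.speed() == 0:
          state["stuck_ticks"] += 1
      else:
          state["stuck_ticks"] = 0
      if state["stuck_ticks"] > 5:              # recovery: the plan failed, change the plan
          state["mode"] = "patrol"
          robot.drive((robot.heading() + 135) % 360, speed=80)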

Code: rondilley/airobots (BSD-3-Clause)

Arena #2: Learn2Drive (sensors, drift, and the curse of “almost”)

Driving is boring until it’s not. Most of the time it’s lane-keeping, following rules, and making small continuous adjustments. Then one moment of uncertainty shows up—a weird reflection, bad paint, a parked truck at a bad angle—and you find out what your system really learned.

Learn2Drive is a 2D RL environment with a top-down car, procedurally generated tracks, and lidar-like sensors. The agent has to learn control, but more importantly it has to learn to generalize. The track changes. The curves change. The mistakes are punished quickly.

In the repo I compare PPO, DQN, and a GRPO implementation. The algorithm details matter, but the lesson is broader:

If your perception is wrong, your control is irrelevant.

That maps directly to detection engineering. You can argue about model architecture all day, but if your signals are garbage—missing fields, inconsistent parsing, broken clocks, recycled hostnames—you’re just doing math on noise.

This is why feature engineering and data quality keep beating “better models.” It’s also why graph-based approaches keep showing up in intrusion detection: graphs are a way to make topology and relationships explicit instead of hoping a model infers them from flattened logs (see, e.g., this GNN IDS survey: https://doi.org/10.1016/j.cose.2024.103821).

In Learn2Drive, you can watch the agent “cheat” your reward function. If the reward only cares about speed, it learns to fling itself into the wall at high velocity. If you reward “progress,” it learns to oscillate across a segment and farm reward. If you don’t penalize being stuck, it learns to park.
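
Closing those exploits is mostly reward plumbing. Here’s a minimal sketch of the fix, assuming a hypothetical track_progress() that measures net distance along the centerline; it isn’t Learn2Drive’s actual reward code, just the shape of the answer: forward progress only, a tax on standing still, and a real penalty for crashing.

  # Sketch of a reward that closes the three exploits above. track_progress()
  # is assumed to return cumulative distance along the centerline, so shuttling
  # back and forth over one segment earns nothing. Names are illustrative,
  # not the exact Learn2Drive reward.

  def reward(prev_state, state, crashed):
      r = 0.0
      delta = track_progress(state) - track_progress(prev_state)
      r += max(delta, 0.0)            # only net forward progress counts
      if abs(state.speed) < 0.1:
          r -= 0.05                   # parking is not a strategy
      if crashed:
          r -= 10.0                   # flinging yourself into the wall no longer pays
      return r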

That’s the exact same story as an alerting rule that optimizes for “number of alerts closed” instead of “incidents prevented.”

Code: rondilley/learn2drive (GPL-3.0)

Arena #3: Scorched Earth (MOAG) and the physics of regret

Scorched Earth is one of those perfect games: simple surface, brutal depth. You aim, you fire, and the world reminds you that wind exists and terrain doesn’t care about your feelings.

In MOAG, I recreated that in Python/Pygame, then wired in RL training so the AI can learn to be a competent (and occasionally infuriating) opponent. Wind changes. Terrain gets carved into craters. The “right” shot on turn one might be suicidal on turn five.

Security has the same physics:

  • the environment changes mid-incident,
  • attacker tooling adapts,
  • and every action has cost (time, noise, business impact).

If you’ve ever watched an IR team argue over whether to isolate a host, you’ve seen this dynamic. Isolation is “safer,” but it’s also disruptive. Let it run, and you might get better telemetry… until the attacker detonates your domain.

MOAG forces you to think like that. You don’t just want a hit. You want a hit that sets up the next three turns, while not bankrupting yourself on expensive weapons you can’t afford.
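
One way to encode that kind of thinking is to score candidate shots by expected value under wind uncertainty, net of what the weapon costs you. A rough sketch follows, with simulate_shot() and the WEAPONS prices as illustrative stand-ins rather than the MOAG engine’s API.

  # Sketch of cost-aware shot selection under wind uncertainty: score each
  # candidate (angle, power, weapon) by expected damage over sampled wind,
  # minus a weight on what the weapon costs. simulate_shot() and WEAPONS are
  # illustrative stand-ins, not the MOAG engine's API.

  import random

  WEAPONS = {"baby_missile": 0, "nuke": 500}   # made-up in-game prices

  def pick_shot(candidates, wind_mean, wind_std, budget, samples=50):
      best, best_value = None, float("-inf")
      for angle, power, weapon in candidates:
          if WEAPONS[weapon] > budget:
              continue                          # can't afford it, don't plan around it
          dmg = 0.0
          for _ in range(samples):              # wind is a distribution, not a number
              wind = random.gauss(wind_mean, wind_std)
              dmg += simulate_shot(angle, power, weapon, wind)
          expected = dmg / samples - 0.1 * WEAPONS[weapon]   # value net of cost
          if expected > best_value:
              best, best_value = (angle, power, weapon), expected
      return best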

Code: rondilley/Scorched_Earth_MOAG (GPL-3.0)

Arena #4: SpaceWar! (self-play, diversity, and the adversary problem)

SpaceWar! is ancient. It’s also perfect. Two ships, a star, gravity, fuel, torpedoes, and one rule: if you drift into the star you deserved it.

Why build it for RL?

Because it makes the “adversary problem” unavoidable.

In single-agent RL, you can train an agent to look brilliant in its own little bubble. In competitive environments, you learn the truth: your agent didn’t learn “the game,” it learned one opponent. Then the opponent changes and your agent collapses like a cheap lawn chair.

That’s cyber defense. Most “autonomous defense” demos train against a scripted attacker, then declare victory. In reality, attackers mutate. Tools change. TTPs evolve. If your defender overfits, you lose.

The research on cyber defense MARL environments (like CybORG/CAGE) is starting to tackle this, including hierarchical approaches designed to make policies transferable when the adversary shifts (paper).

In SpaceWar_AI I included self-play and “league training” ideas explicitly because policy diversity is not a nice-to-have; it’s how you avoid learning a fragile glass strategy that only wins in yesterday’s fight. (And yes, I also wired in optional LLM guidance for exploration and iterative refinement, because sometimes a model is useful as a generator of hypotheses, even if you don’t let it drive the ship in real time.)
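
The league idea itself is simple enough to sketch: keep a pool of frozen past policies, make the learner fight opponents sampled from that pool instead of only its latest self, and periodically freeze the learner back into the pool. clone_frozen() and play_match() below are hypothetical stand-ins, not the SpaceWar_AI training code.

  # Sketch of league training: the learner trains against a pool of frozen
  # past policies, not just its latest self, so it can't overfit to one
  # opponent. clone_frozen() and play_match() are stand-ins, not the
  # SpaceWar_AI training code.

  import random

  def league_train(learner, generations=100, matches_per_gen=200):
      league = [learner.clone_frozen()]                 # seed the opponent pool
      for _ in range(generations):
          for _ in range(matches_per_gen):
              opponent = random.choice(league)          # yesterday's red team, not just today's
              trajectory = play_match(learner, opponent)
              learner.update(trajectory)
          league.append(learner.clone_frozen())         # today's learner becomes tomorrow's opponent
          if len(league) > 20:
              league.pop(random.randrange(len(league) - 1))   # cap the pool, always keep the newest
      return learner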

Code: rondilley/SpaceWar_AI (GPL-3.0)

What this has to do with the real world (and why I keep doing it)

If you’re building AI for security, you’re not building a chatbot. You’re building an operator.

That means you need three things that most “AI in the SOC” conversations conveniently skip:

1) A reward function that doesn’t create monsters

If you reward the wrong behavior, the model will optimize it flawlessly.

This is why I keep coming back to reward design and evaluation research like Eureka. Not because I want a robot to spin a pen, but because I want to avoid building a defender that learns to silence alerts instead of stopping incidents.

2) An adversary that doesn’t stand still

Training against one attacker profile gives you a brittle defender.

This is why competitive arenas matter, and why policy diversity matters, and why multi-agent environments like CAGE exist.

3) A trust model for what the AI is allowed to do

I’m pro-AI. I’m also pro-not-dying.

LLMs are useful in security workflows (summarization, enrichment, mapping, drafting, even code assistance), but hallucinations and overconfidence are not edge cases; they’re normal failure modes (see, e.g., this hallucination survey: https://arxiv.org/abs/2311.05232). SOC-focused surveys treat this as an operational risk you have to engineer around, not a philosophical debate (SOC survey: https://arxiv.org/abs/2509.10858).

Which is why, in my worlds, the LLM can suggest, critique, or generate candidates, but the loop closes with measurable outcomes. The arena is the judge. Physics is the judge. The scoreboard is the judge.

The practical pattern I use (steal this)

If you want a recipe you can reuse in actual security automation, it’s this:

  1. Constrain the world. Define the inputs, allowed actions, and termination conditions.
  2. Instrument everything. If you can’t measure it, you can’t improve it.
  3. Make failure cheap. Let the agent fail in simulation a million times so it fails less in production.
  4. Use LLMs as idea factories, not oracles. Generate candidates, then test them (a minimal sketch of this loop follows the list).
  5. Keep experts in the loop where consequence exists. Automation should earn authority.
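
Here’s what that recipe looks like as a minimal loop, with every function a hypothetical stand-in for your own environment and candidate generator rather than any particular library: the LLM proposes, the constrained and instrumented world keeps score, and the short list still lands on a human’s desk.

  # Minimal shape of the pattern: a constrained world, instrumented runs,
  # LLM-generated candidates, and a scoreboard that makes the call. Every
  # function here is a hypothetical stand-in for your own environment and
  # generator, not a library API.

  def candidate_loop(env, llm_generate, n_candidates=20, episodes=100):
      results = []
      for _ in range(n_candidates):
          candidate = llm_generate(env.spec())          # the LLM proposes; it doesn't decide
          score, logs = 0.0, []
          for _ in range(episodes):
              outcome = env.run(candidate)              # constrained world, cheap failure
              score += outcome.score
              logs.append(outcome.metrics)              # instrument everything
          results.append((score / episodes, candidate, logs))
      results.sort(key=lambda r: r[0], reverse=True)
      return results[:3]                                # top candidates go to a human for review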

This is also why I’m increasingly interested in domain-tuned, deployable security models: open-weight systems you can run on-prem where the telemetry lives, with controls you can verify. Cisco’s Foundation-sec-8B is one example of that direction (Cisco announcement, model card).

Code and data

If you want to play with the arenas, break them, or adapt them to your own research, here you go:

  • AI Robots: rondilley/airobots
  • Learn2Drive: rondilley/learn2drive
  • Scorched Earth (MOAG): rondilley/Scorched_Earth_MOAG
  • SpaceWar!: rondilley/SpaceWar_AI

Licensing is per-repo (BSD-3-Clause for airobots; GPL-3.0 for the others as of this writing).

Five things worth reading if you want to go deeper

If you only read five things on this topic, make them these:

  1. Eureka: Human-Level Reward Design via Coding Large Language Models – reward functions as code, iterated with feedback. https://arxiv.org/abs/2310.12931
  2. Hierarchical Multi-agent Reinforcement Learning for Cyber Network Defense – transferable MARL defense policies in CybORG/CAGE. https://arxiv.org/abs/2410.17351
  3. NIST IR 8269: A Taxonomy and Terminology of Adversarial Machine Learning – common language before you start arguing about solutions. https://csrc.nist.gov/pubs/ir/8269/ipd
  4. Large Language Models for Security Operations Centers: A Comprehensive Survey – where LLMs help, and where they make things worse. https://arxiv.org/abs/2509.10858
  5. Foundation-sec-8B (model + references) – an example of security-domain pretraining you can deploy and test. https://huggingface.co/fdtn-ai/Foundation-Sec-8B

Closing thought

The future SOC isn’t going to be “analysts replaced by AI.” It’s going to be analysts surrounded by automation that is either:

  • a force multiplier, or
  • a high-speed way to be wrong.

The difference is whether you can test the system in a world that doesn’t care about your marketing.

That’s why I keep building arenas.
