I Built a Tool to Make AIs Judge Each Other. Here’s What Happened.

It started the way a lot of my bad ideas start: too many terminals open, too many API keys configured, and one simple question that should not have been this hard.

I was sitting in front of two Claude sessions, side by side. Same provider. Same model. Same version. In one window, Claude was being principled and thoughtful and very sure that it could not possibly help me with a reverse engineering problem. In the other window, Claude Code was happily generating Python scripts to tackle the exact same task. No hesitation. No debate. No ethical speeches. Just code.

If you have read my other pieces, you already know where this goes. I did what I always do when the tools behave in a way that bothers me. I stopped arguing with them and built something that would force the issue into the open.

I decided to make the models judge each other.

Not in a vague “rate this answer from zero to ten” sort of way, but in the way I wish security tools worked by default: show me the raw output, show me the arguments for and against, show me who disagrees with whom, and then give me a ranked list with receipts. If the AIs are going to be inconsistent, I want that inconsistency documented and interrogated, not smoothed over into a single pretty score.

That is the origin story of LLM Compare, a command line tool that takes one prompt, fans it out to every AI you have wired up, and then orchestrates a small, structured fight.

Benchmarks, Vibes, and the Questions We Actually Ask

We all pretend benchmarks are the answer. They are not. They are the slideware layer we cling to because it is more comforting to look at a bar chart than to admit that a large part of “is this any good” still comes down to taste, context, and risk tolerance.

The industry loves to talk about MMLU scores, coding benchmarks, preference models, and leaderboard positions. Those things are not useless, but they all dodge the question I care about when I am on the hook for real work: given this exact problem, in this moment, with my constraints and my appetite for risk, which model is least likely to quietly ruin my day.

Benchmarks are training targets. The models learn to please the rubric, not you. That is Goodhart’s Law, with a research budget that looks like the GDP of a small country. By the time the scores are stable and polished enough to show up in marketing material, they mostly tell you how well the model understands the exam, not how it will behave when you throw something weird and sharp and slightly cursed at it.

When I am working on a security problem, I do not care that a model is theoretically “stronger” on an academic benchmark. I care whether it will hallucinate a mitigation, fabricate a reference, or hand me an elegant but completely impractical architecture diagram that would catch fire in the first week of production. That is why LLM Compare exists. It does not chase an abstract notion of “best.” It asks a much more narrow, much more practical question: for this prompt, right now, which answer would you bet your reputation on, and why.

What I Actually Built

On the surface, LLM Compare looks simple. You run a Python CLI, you paste or type a prompt, and the tool quietly discovers which providers you have configured. OpenAI, Anthropic, Google, xAI, and, if you are feeling adventurous, local models through llama.cpp. It sends the same prompt to each one, waits for all the answers to come back, and then begins the more interesting part.
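
Under the hood, discovery is nothing exotic. Here is a minimal sketch of the idea, assuming conventional environment variable names such as OPENAI_API_KEY rather than whatever the tool actually uses:

```python
import os

# Hypothetical mapping of environment variables to providers; the real tool's
# names and discovery logic may differ.
PROVIDER_ENV_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
}

def discover_providers() -> list[str]:
    """Return the providers that have an API key configured in the environment."""
    return [name for name, env_var in PROVIDER_ENV_KEYS.items() if os.environ.get(env_var)]
```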

The first phase is pointwise scoring. Every model gets to look at every answer, including its own, and assign scores for things like accuracy, completeness, clarity, relevance, and reasoning. I do not accept “seven out of ten” with no explanation. Each score is attached to a written justification. If a model calls a response vague or incomplete, it must explain where and why.
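
In practice, each evaluation is a small structured record rather than a bare number. A sketch of what that could look like, with illustrative field names:

```python
from dataclasses import dataclass, field

# Rubric dimensions taken from the prose above; the exact names are illustrative.
DIMENSIONS = ("accuracy", "completeness", "clarity", "relevance", "reasoning")

@dataclass
class PointwiseScore:
    judge_model: str                      # which model produced this evaluation
    target_model: str                     # whose answer is being scored
    scores: dict[str, int] = field(default_factory=dict)          # dimension -> 1..10
    justifications: dict[str, str] = field(default_factory=dict)  # dimension -> written reason

    def is_complete(self) -> bool:
        """A score without a written justification for every dimension is rejected."""
        return all(d in self.scores and self.justifications.get(d) for d in DIMENSIONS)
```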

The second phase is where it starts to feel like a tournament. Instead of staring at individual scores, the tool pairs the answers up and asks the models to pick a winner for each matchup. Here are Answer A and Answer B. For this specific question, which one is better? You must choose. The order is flipped and repeated so that we do not reward whoever shows up in the first slot. It is surprising how often a model will rate two answers as “both good” when judged in isolation but will clearly prefer one once it is forced to compare them directly.
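
The mechanics of the slot flipping are simple enough to sketch: every unordered pair of answers becomes two matchups, one per slot order, so position bias averages out.

```python
from itertools import combinations

def pairwise_matchups(answers: dict[str, str]) -> list[tuple[str, str]]:
    """Yield every pair of answers twice, once in each slot order, so that a
    judge's habit of favoring 'Answer A' cancels across the two judgments."""
    matchups = []
    for a, b in combinations(sorted(answers), 2):
        matchups.append((a, b))
        matchups.append((b, a))  # same pair, slots flipped
    return matchups
```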

Then comes the adversarial debate. This is the part that will feel familiar to anyone who has ever sat through a standards meeting, a contentious code review, or a security architecture board that ran long. One model is assigned to defend a particular answer. Another model defends a different one. A third model acts as challenger, attacking both. A fourth plays judge. They get a couple of rounds to argue. Advocates point out strengths, challengers poke holes in assumptions, the judge calls out where the arguments do or do not land. You end up with lines like “Answer B hand waves past input validation” or “Answer A never connects its recommendations to the stated threat model.”
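
The role assignment itself does not need to be clever. A minimal sketch, with the caveat that the real policy for who plays which part may differ:

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class DebateRoles:
    advocate_a: str   # defends one answer
    advocate_b: str   # defends a rival answer
    challenger: str   # attacks both
    judge: str        # weighs the arguments and calls the result

def assign_roles(models: list[str]) -> DebateRoles:
    """Rotate the four debate roles across however many models are available.
    A plain round-robin; the tool's actual assignment policy may differ."""
    pool = cycle(models)
    return DebateRoles(next(pool), next(pool), next(pool), next(pool))
```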

After they have taken their shots at each other, I switch the tone. All the models see all the debate transcripts and all the earlier scores, and they are asked to look for areas of agreement and disagreement. Where do they converge on what “good” looks like for this prompt? Where do they consistently flag the same holes in a particular answer? Where do they split along stylistic lines, such as one model favoring detailed exposition while another favors brevity and concrete examples?

Finally, all the above gets distilled into a ranking using a Bradley–Terry style model. The pointwise scores, the pairwise wins and losses, the debate outcomes, and the consensus signals all feed into a single picture of relative strength, with confidence intervals attached. The tool produces a JSON trace and a PDF report that looks, frankly, a lot like the incident timelines I have spent my career writing. You get the original prompt, the raw answers, every evaluation, every debate, and the final ranking.
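
If you have never met Bradley–Terry, the core of it fits in a short function: each answer gets a latent strength, and the probability that A beats B is A's strength divided by the sum of the two. A bare-bones fit using the classic MM (Zermelo) iteration, with the confidence interval machinery left out, looks roughly like this:

```python
def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from pairwise win counts.
    wins[(a, b)] is how many times a's answer beat b's."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    total_wins = {m: 0.0 for m in models}
    games = {}  # comparisons per unordered pair
    for (a, b), w in wins.items():
        total_wins[a] += w
        key = tuple(sorted((a, b)))
        games[key] = games.get(key, 0) + w

    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for (a, b), n in games.items():
                if i in (a, b):
                    j = b if i == a else a
                    denom += n / (strength[i] + strength[j])
            updated[i] = total_wins[i] / denom if denom else strength[i]
        norm = sum(updated.values()) or 1.0
        strength = {m: s / norm for m, s in updated.items()}
    return strength
```

Feed it the pairwise win counts and you get back normalized relative strengths you can sort into a ranking; the real pipeline also folds in the pointwise and debate signals and attaches confidence intervals, but the relative-strength core is essentially this.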

It is not magic. It will not reveal eternal truths about which AI is better in some metaphysical sense. What it does give you is a very detailed map of how these systems behave when they are forced to back up their opinions with specific criticisms of each other.

A Short Tour Through the Build Diary

If you are the kind of person who opens the VIBE_HISTORY.md file in my repos, you already know how this got built. I started the way I usually start these things, with a high-level prompt that sounded innocuous on paper: let us build a multi-model comparison tool that discovers available providers from API keys, sends one prompt to all of them, and then uses collaborative and adversarial strategies to evaluate the answers.

The early sessions were architecture and research. I dug into the LLM-as-a-judge papers, the G-Eval rubric ideas, tournament-style comparison schemes, and the more recent work on structured debate. On paper it all looked neat and clean and well-behaved. In my head, it looked like I was inviting four very opinionated junior engineers into the same room and handing them a whiteboard.

Once the design felt solid enough, I let the AI help write the scaffolding. That part still amazes me. In a couple of intense sessions, I had a working project skeleton with providers for each major API, an evaluation package broken out into pointwise, pairwise, adversarial, and consensus modules, a session manager that orchestrated the whole pipeline, and a reporting module that could turn everything into a PDF. Everything imported. Everything ran. It looked like a tool that had been through a few iterations, not something that had just been bootstrapped from scratch.

Of course, that is the trap. The code compiled, the basic happy path worked, and the temptation was to declare victory and write this blog post then and there. Instead, I did what I always tell other people to do. I kept pushing until the sharp edges showed up.

Local models came next. If the tool is going to compare opinions, I want the option of including a model that lives on my own hardware, even if it is not in the same league as the hosted giants. That meant wiring up llama.cpp, handling model configuration and path resolution, and dealing with the fact that some of the smaller models are surprisingly confident about things they are surprisingly wrong about.

Dynamic model discovery followed close behind. Hard coding model names felt fine for about five minutes, right up until an API returned a 404 for a model that existed the week before. At that point I accepted reality. Each provider now exposes a list of available models, the tool ranks them according to a preference list, and it picks something that exists and can handle chat style requests. When the provider shuffles its lineup, the tool adjusts.
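
The selection logic is boring on purpose, which is the point. Something along these lines, with the actual preference lists living alongside each provider:

```python
def pick_model(available: list[str], preferences: list[str]) -> str:
    """Choose the first preferred model the provider actually offers today,
    falling back to whatever exists instead of a hard-coded name."""
    for preferred in preferences:
        if preferred in available:
            return preferred
    if not available:
        raise RuntimeError("provider returned no usable chat models")
    return available[0]
```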

There was a long session where I wrestled with report generation. On my first pass I let the models write long, detailed markdown critiques, handed the whole thing to a layout engine, and watched it explode when a single table cell tried to be taller than the page. The fix was to treat markdown as a first-class citizen, flow paragraphs across pages, and paginate debates without throwing away detail. Somewhere in there I also learned, yet again, that cute Unicode progress spinners and Windows terminals do not always get along.

By the time I declared “first version” in the history file, I had a working tool and a pile of lessons that felt very familiar from other projects like Log Carver and logpi. AI will happily help you get to “it runs” much faster. It will not do the aftermath for you.

The Security Review I Should Have Started With

Once things settled, I took off the excited builder hat and put on the grumpy security architect hat. If I had been handed this tool by someone else, what would I worry about.

The first issue jumped out immediately. The session manager took a string session identifier, treated it as part of a file path, and did not really ask any questions. Anyone who has spent time reading incident reports already knows where that goes. A malicious or simply unfortunate session-id with a few extra directory separators and dot dots in it could convince the loader to wander outside its intended directory. The fix was straightforward: stop pretending session ids are arbitrary strings and treat them as UUIDs, validating them before they ever touch the filesystem.
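
The shape of that fix, roughly, with an illustrative sessions directory:

```python
import uuid
from pathlib import Path

SESSIONS_DIR = Path("sessions")  # illustrative location

def session_path(session_id: str) -> Path:
    """Refuse anything that is not a well-formed UUID before it touches the
    filesystem, which closes the '../../' style of session id."""
    try:
        parsed = uuid.UUID(session_id)
    except ValueError:
        raise ValueError(f"invalid session id: {session_id!r}")
    return SESSIONS_DIR / f"{parsed}.json"
```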

The second concern was local model configuration. With llama.cpp, you have a lot of freedom about where you put model files. That freedom is lovely for a single user tinkering on their own machine, and slightly less lovely if you imagine this tool running on a shared box. The configuration now resolves model paths canonically and checks that they live under the expected directory before trying to load them.
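
The check is the standard resolve-then-compare dance, sketched here with an illustrative models directory:

```python
from pathlib import Path

MODELS_DIR = Path.home() / "models"  # illustrative; the real directory comes from config

def resolve_model_path(configured: str) -> Path:
    """Resolve symlinks and dot-dots, then insist the result still lives under
    the expected models directory before llama.cpp is allowed to load it."""
    candidate = (MODELS_DIR / configured).resolve()
    if not candidate.is_relative_to(MODELS_DIR.resolve()):
        raise ValueError(f"model path escapes {MODELS_DIR}: {configured!r}")
    return candidate
```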

File permissions were next. On Unix systems the default umask can be a little too trusting for something that is happily writing prompts, responses, and evaluation logs to disk. Session files now get created with restrictive permissions. If anyone else wants to read them, they can ask you.
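
Concretely, that means creating the files with an explicit mode instead of trusting whatever umask the shell inherited. Something like this, on Unix:

```python
import os

def open_session_file(path: str):
    """Create session files readable only by the owner, regardless of the
    process umask, since they hold prompts, responses, and evaluation logs."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.fchmod(fd, 0o600)  # undo anything the umask masked off at creation
    return os.fdopen(fd, "w", encoding="utf-8")
```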

I also added basic schema validation around the config file, constrained prompt size so that you cannot accidentally paste a ten-gigabyte log file into a single session and call it a day, and dialed back error messages that revealed too much internal structure. At the end of that pass, there was a list of seven findings in the history file. Six were fixed. The last one, around how and where you store API keys, is one I am leaving in your hands. On my personal systems I accept the risk, and in a corporate environment I would expect you to route it through whatever secrets management the rest of your stack already uses.

Cleaning Up the AI Slop

The security issues were the obvious part. The more insidious problem was stylistic. As I dug into the code, I could see where the AI had been helpful, and where it had quietly ballooned the complexity.

Each provider module had its own private copy of a model ranking helper, twenty or thirty lines of nearly identical logic with only the model names changed. Data classes all grew handcrafted to_dict methods that tried to reimplement what the standard library already provides, with a few date handling tweaks sprinkled in. There was a logger wrapper class whose whole job was to forward method calls to the underlying logger while adding a tiny bit of extra metadata. The session manager had become a tangle of tiny helper methods, each called exactly once, chained together in a way that made the control flow less clear, not more.

None of this was malicious. This is what you get when you ask an AI to “make it clean” too many times in a row. You end up with something that looks well-structured in the moment but leaves you holding a system that is much harder to reason about as a whole.

So, I did what I would do with any human junior engineer’s code. I refactored.

The model ranking logic moved into a shared base implementation, with provider specific data describing preferred models and fallback rules. The custom serialization methods collapsed into a single helper that knows how to deal with nested data classes and timestamps. The logger wrapper disappeared in favor of a simple factory function that returns a normal logger already pointed at the right session log file. Several single use helper methods in the session manager were inlined so you can read the entire end to end pipeline in one pass.
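
The serialization piece is a good example of how small the replacement ended up being. A sketch of the shared helper, give or take the details of the real one:

```python
from dataclasses import asdict, is_dataclass
from datetime import datetime

def to_serializable(obj):
    """One shared serializer instead of a hand-written to_dict on every data
    class: asdict() handles the nesting, the dict_factory handles timestamps."""
    def factory(pairs):
        return {k: v.isoformat() if isinstance(v, datetime) else v for k, v in pairs}
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj, dict_factory=factory)
    return obj
```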

When the dust settled, close to a thousand lines of code were gone. Provider modules shrank dramatically. The session manager was almost cut in half. The behavior did not change. The tool became easier to hold in your head.

That, to me, is the real cost of leaning on AI for scaffolding. You get speed, and then you pay for that speed with a second phase of work where you make the thing behave like a tool you want to maintain. If you skip that phase, you end up shipping AI slop in production and wondering why everything feels heavier than it should.

Does It Actually Help?

The question I keep coming back to is simple. Does orchestrating a fight like this produce better insight than just asking one model and trusting your gut.

Here is a concrete example. I asked all the models to explain quantum entanglement to a bright high school student. You could probably guess the rough shape of the answers without even seeing them. One model veered into math notation and jargon as if it were talking to a physics major. One stayed so high level that you could feel the student’s eyes glazing over. One wrote a small essay that was technically correct and structurally sound but felt like a textbook chapter. Another leaned into analogies and conversational tone, got the science right, and respected the attention span of someone who has not yet decided whether physics is going to be their thing.

If you handed those four answers to me in isolation, I would likely have gravitated toward the one that matched my teaching instincts. What the tool gave me was the ability to see that the other models, the ones that I did not naturally favor stylistically, were flagging some of the same weaknesses I saw. It also showed me where they disagreed with each other for reasons that were purely stylistic. That distinction is important. I want to know when a model is pointing at a genuine flaw, and when it is just expressing a preference.

Over many runs, I have noticed patterns. Certain models are consistently good at calling out missing threat models in security prompts. Others are generous with detail but happily invent references that do not exist. Some are much better judges than they are writers. In other words, the persona that shows up when you ask a model to produce an answer is not always the persona that shows up when you ask it to critique one.

Having all of that laid out in a single report, with the arguments attached, does not replace my judgment. It does, however, give me something much more solid to work with than “I asked GPT because it felt right.”

Why This Matters If You Work in Security

If you never ask AI for help with anything important, you can treat all of this as a mildly entertaining science project. If you work in security, or any field where being wrong has a half-life measured in months on the front page of someone else’s incident report, you no longer have that luxury.

We are already leaning on these tools to draft policies, review designs, comb through logs, and propose mitigations. Some of that is harmless acceleration. Some of it is quietly dangerous if you do not have a handle on how the tool behaves under pressure.

What LLM Compare gives me is not certainty. It gives me options.

I can run the same prompt through multiple models and see where they converge and diverge. I can use one model as a primary and another as a hostile reviewer. I can capture a detailed trace of how a recommendation emerged, which means that six months later, when someone asks, “why did we do it that way,” I have more than a shrug and a fuzzy memory of a chat window.

More importantly, it reminds me that these systems are not oracles. They are tools with quirks and blind spots and personalities. The right way to work with them looks a lot more like collaborating with a team of very fast, somewhat inconsistent junior engineers. You listen, you compare, you ask them to argue, and then you make a call. You do not hand over the keys to production and walk away.

The code for LLM Compare is on GitHub in the usual place. If you are curious, wire it up to whatever models you already use, aim it at a prompt that matters to you, and see how your favorites behave when they have to defend themselves in front of their peers. You may come away a little more skeptical of the leaderboards. You may also find that some of the models you had quietly written off are more useful as judges than as writers.

Either way, you will be making decisions based on something richer than a single score and a feeling. For the work I do, that feels like progress.
