From Log Carver to logpi: making massive logs feel small
Two weeks ago, I told a story about building a bare-metal parser with an unreasonable deadline and an AI copilot that alternated between brilliant and chaotic. That sprint proved a point: defenders can wring 50M+ lines per minute out of commodity hardware when they design for speed and verify like their job depends on it. It also left a bigger question hanging in the air: what happens after the sprint? How do you turn a knife fight into a logistics plan your SOC can run every night without drama? That’s what logpi is for, and it’s why this sequel exists.
Get the code & artifacts
Source: logpi — https://github.com/rondilley/logpi.
Enhancements in this article (single hash-writer, distributed reads, growable arrays, atomic+SIGALRM progress) are landing here, with tags and example runs.
Build diary: VIBE_HISTORY.md captures the Claude-assisted dev timeline (hash bucket jump, serial stabilization to ~60M LPM, then parallel foundation → 125M+ LPM; plus the progress-meter fix). See the 132 GB example output and minute-by-minute prints there.
Part I was the knife fight. Part II is logistics. I didn’t rebuild a SIEM; I built a field kit. logpi chews raw text, extracts the identities we actually pivot on (IPv4/6, MAC), and leaves a tiny index you can interrogate in seconds—on the same box that holds the logs. The point is to keep your mental model intact during incident response: ask a sharp question, get a precise and timely answer, ask the next one before the context evaporates.
The first lesson is architectural humility. Claude’s early plan was to parallelize everything: let every thread read and write, because what could go wrong? Plenty. The scheduler doesn’t reward wishful thinking. Under real logs, that “everyone does everything” pattern was the fine sand in the gearing that crippled throughput with lock contention and queue thrash. The turning point came from asking a better question: do we really need to serialize reads? No. We only need to serialize structural writes. Once we split the world accordingly, with workers doing local hash lookups in parallel while a single hash thread handles the inserts and updates, throughput stopped wobbling and started marching. Queue traffic plummeted because only meaningful work crossed the bridge: new addresses and nothing else, not updates, not lookups. The architecture got simpler, measurements got cleaner, and the code stopped fighting itself.
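For the curious, here’s a minimal sketch of that split, using hypothetical names and a toy bounded queue rather than logpi’s actual structures: workers treat the table as read-only and only hand never-before-seen keys across the bridge; one hash thread owns every structural write.

```c
/* Sketch: distributed reads, single hash writer. Hypothetical names and a
 * toy bounded queue; the real structures live in the logpi repo. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define BUCKETS 4096
#define QCAP    1024

typedef struct node { uint32_t key; uint64_t count; struct node *next; } node_t;

static node_t *table[BUCKETS];                 /* shared hash table              */
static uint32_t queue[QCAP];                   /* new-key queue: workers->writer */
static size_t qhead, qtail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Read-only lookup, callable from any worker: nobody but the hash thread
 * ever links new nodes into the buckets. */
static node_t *lookup(uint32_t key) {
  for (node_t *n = table[key % BUCKETS]; n; n = n->next)
    if (n->key == key) return n;
  return NULL;
}

/* Worker side: look up locally and only cross the bridge when the key has
 * never been seen. Known keys generate zero queue traffic. */
static void worker_handle(uint32_t key) {
  if (lookup(key)) return;
  pthread_mutex_lock(&qlock);
  if ((qtail + 1) % QCAP != qhead) {           /* full-queue handling elided */
    queue[qtail] = key;
    qtail = (qtail + 1) % QCAP;
  }
  pthread_mutex_unlock(&qlock);
}

/* Hash-writer side: the only code path that mutates the table. */
static void writer_drain(void) {
  static node_t pool[QCAP];                    /* toy allocator for the sketch */
  static size_t used;
  pthread_mutex_lock(&qlock);
  while (qhead != qtail) {
    uint32_t key = queue[qhead];
    qhead = (qhead + 1) % QCAP;
    if (!lookup(key) && used < QCAP) {         /* two workers may queue the same key */
      node_t *n = &pool[used++];
      n->key = key;
      n->count = 0;
      n->next = table[key % BUCKETS];
      table[key % BUCKETS] = n;                /* single-writer insert */
    }
  }
  pthread_mutex_unlock(&qlock);
}

int main(void) {
  worker_handle(0xC0A80A15u);                  /* 192.168.10.21 as a raw key */
  writer_drain();
  printf("indexed: %s\n", lookup(0xC0A80A15u) ? "yes" : "no");
  return 0;
}
```

In the real code, publishing a newly linked node needs a proper release/acquire pairing so a concurrent reader never sees a half-initialized bucket; the sketch leaves that (and the thread spawning) out to keep the shape visible.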
| Change | Why it helped | Δ LPM | New LPM |
|---|---|---|---|
| Hash bucket + growth off hot path | Fewer collisions; predictable inserts | +~53M | ~60M |
| Producer/consumer I/O | Cache-friendly parallel reads | +20M | ~80M |
| Serial/parallel identity pass | Confidence → more aggressive tuning | +15M | ~95M |
| Distributed reads, single hash write | Eliminates write contention | +15M | ~110M |
| Atomic+SIGALRM progress | Visibility without tax | +15M | 125M+ |
That change wasn’t a hunch; it was the spine of a very deliberate performance arc. The v0.7 baseline plodded along at ~7M lines per minute. Optimizing the hash (bucket count, distribution, growth checks off the hot path) cracked 60M. Producer–consumer scaffolding got us to 80M. Output identity between serial and parallel unlocked 95M. Queue and cache tuning pushed past 110M. The last nudge, fixing progress reporting so it told the truth without dragging the brakes, pushed the aggregate rate into the 125M+ LPM band and made it feel honest, not theatrical. That’s a 20× swing measured in days, not quarters.
Progress shouldn’t change performance. A timing thread that calls time() like it’s free candy is a tax, not a meter. The fix was boring and right: an atomic counter in the hot path, SIGALRM once a minute to snapshot and reset. Same visibility; essentially zero overhead. The minute-by-minute line counts now roll by like a ship’s log, and they correlate with reality instead of altering it. There’s a deeper moral here: the AI’s brightest ideas are, more often than not, bad ones. There is a reason the best orchestras have a talented, experienced conductor: they can play without one, but the performance suffers.
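Here’s a minimal sketch of that pattern, assuming C11 atomics and POSIX alarm(); the names are illustrative rather than logpi’s, and the printing is deferred out of the handler because stdio isn’t async-signal-safe.

```c
/* Sketch: hot-path counting with a C11 atomic, reported once a minute via
 * SIGALRM. Illustrative names; printing is deferred to normal code because
 * stdio is not async-signal-safe. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static atomic_ulong line_count;            /* bumped on the hot path        */
static atomic_ulong last_minute;           /* snapshot taken by the handler */
static volatile sig_atomic_t report_due;

static void on_alarm(int sig) {
  (void)sig;
  unsigned long n = atomic_exchange(&line_count, 0);   /* snapshot and reset */
  atomic_store(&last_minute, n);
  report_due = 1;
  alarm(60);                               /* re-arm for the next minute     */
}

int main(void) {
  signal(SIGALRM, on_alarm);
  alarm(60);

  char *line = NULL;
  size_t cap = 0;
  while (getline(&line, &cap, stdin) != -1) {
    /* ... extract addresses from `line` ... */
    atomic_fetch_add_explicit(&line_count, 1, memory_order_relaxed);

    if (report_due) {                      /* cheap flag check per iteration */
      fprintf(stderr, "%lu lines in the last minute\n",
              (unsigned long)atomic_load(&last_minute));
      report_due = 0;
    }
  }
  free(line);
  return 0;
}
```

The hot path pays for one relaxed atomic increment per line; the snapshot-and-reset happens once a minute in the handler, so the meter reports the work without becoming part of it.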
The second lesson was allocator economics. After the parallel plan took hold, the new villain wasn’t the hash or the queues; it was death by a million allocations. Several hundred million of them, literally: we discovered 772+ million malloc calls being made to store per-occurrence line/offset pairs. Cute in a tutorial; catastrophic against a 132 GB log file. The fix was boring in the best way: throw out the linked-list carnival and move to growable, contiguous arrays with conservative expansion and thread-aware append semantics. Workers append locally; the hash thread owns the rare capacity increases. It’s not an exotic data structure. It’s a grown-up one. And it changes the memory math from “pray to the allocator” to “predict the next 10 minutes with a whiteboard.”
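To make the before-and-after concrete, here’s a hedged sketch of the “grown-up” structure, with field names invented for illustration: one contiguous, growable allocation per address instead of one malloc per line/offset pair.

```c
/* Sketch: contiguous, growable occurrence storage (illustrative layout;
 * logpi's actual structs differ). One heap block per address instead of
 * one malloc per line/offset pair. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint64_t line;       /* line number within the file            */
  uint64_t offset;     /* byte offset of that line               */
} occurrence_t;        /* 16 bytes per hit, as in the math below  */

typedef struct {
  occurrence_t *hits;  /* one contiguous, growable array          */
  size_t count;
  size_t capacity;
} addr_index_t;

/* Append one occurrence, growing the backing array only when full.
 * In the parallel design, workers append to entries they own and the
 * hash thread handles the rare reallocation. */
static int addr_index_append(addr_index_t *a, uint64_t line, uint64_t offset) {
  if (a->count == a->capacity) {
    /* Simple doubling here; the next sketch caps growth for hot keys. */
    size_t newcap = a->capacity ? a->capacity * 2 : 16;
    occurrence_t *p = realloc(a->hits, newcap * sizeof *p);
    if (!p) return -1;
    a->hits = p;
    a->capacity = newcap;
  }
  a->hits[a->count].line = line;
  a->hits[a->count].offset = offset;
  a->count++;
  return 0;
}

int main(void) {
  addr_index_t idx = {0};
  for (uint64_t i = 0; i < 1000; i++)    /* a thousand hits, a handful of reallocs */
    addr_index_append(&idx, i, i * 80);
  free(idx.hits);
  return 0;
}
```

Nothing clever, which is the point: appends are a pointer bump in the common case, and the allocator only gets involved on the rare capacity increases.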
That change also forced us to confront a pathological real-world case most people don’t plan for: one hot IP address appearing on almost every line. With adds batched through the hash thread, workers raced to get that IP registered while each of them piled up huge lists of pending requests for the same address, which forced a rethink of how additions are batched and updates re-queued. One IP on most lines turns “double forever” into “allocate 8.5 billion entries,” and at 16 bytes per occurrence (offset plus line id) that’s ~136 GB for a single key. The answer was a conservative growth strategy: double while a location array is small, then switch to ~25% increments once it passes roughly a million entries. That caps runaway allocations while staying amortized O(1), and the result is graceful under both normal and pathological conditions: no memory cliffs, no overnight OOMs that masquerade as “random.” This is what it looks like to turn a benchmark into a tool.
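As a sketch of that policy, using the thresholds described above (the helper name is mine, not logpi’s):

```c
/* Sketch: conservative capacity policy for hot keys. The ~1M-entry switch
 * point and ~25% growth factor are the ones described above; the helper
 * name is illustrative. */
#include <stdio.h>

#define GROWTH_SWITCH 1000000UL   /* stop doubling around a million entries */

static unsigned long next_capacity(unsigned long cap) {
  if (cap == 0) return 16;                 /* first allocation                */
  if (cap < GROWTH_SWITCH) return cap * 2; /* small arrays: classic doubling  */
  return cap + cap / 4;                    /* hot keys: ~25% increments       */
}

int main(void) {
  /* Walk the capacity schedule a pathological key would follow. With pure
   * doubling, a key heading toward billions of entries keeps demanding
   * allocations twice the size of the last one; with +25% steps the
   * requests stay proportional and predictable. */
  unsigned long cap = 0;
  for (int step = 0; cap < 10000000UL; step++) {
    cap = next_capacity(cap);
    printf("step %2d: capacity %lu\n", step, cap);
  }
  return 0;
}
```

Doubling keeps the amortized cost constant while an array is small; the ~25% steps keep each reallocation for a hot key proportional to what it already holds instead of asking the allocator to double a multi-gigabyte block yet again.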
All that work sets up the part most teams feel immediately: the wall-clock difference. On a Framework-class laptop, indexing a massive syslog file shifted from a projected four to five hours of radio silence to ~5 minutes 56 seconds in serial mode, with minute-by-minute progress in the ~55–61M LPM band. The run finishes with user time around 5m23s and sys time around 30s, which is what you expect from a design that minimizes allocator storms and kernel back-and-forth. The box stays responsive. The analyst stays in the loop. Then you add parallelism, and when you look at CPU usage you see multi-core utilization where there used to be a single hot core: the work is spread, not smeared, and, boom, more speed.
I care a lot about the “feel” of a tool. Engineers don’t talk about that enough. logpi now feels steady. The per-minute counts are boring in the best way: small variance, no mystery cliffs, no first-minute theatrics followed by a slow collapse. That steadiness is what your brain latches onto during an investigation. You can form a hypothesis, run it, watch the counts tick, and commit to the next question, without the dread that the machine will grind or the numbers will lie to you. Just add pseudo-indexing to your log repository and you have ultra-high-speed searching on tap whenever you or your SOC analysts need it.
The corollary to “feel” is “trust,” and for me that means identity-preserving outputs. Parallel mode must match serial mode byte-for-byte or the speed is counterfeit. That was a hard gate for shipping, and we kept it throughout the performance arc. It’s not glamorous work; it is grown-up work. We leaned on repeatable timing, memory checks, and concurrency tools (Valgrind for leaks, ThreadSanitizer for data races), not because they’re fashionable but because they find the boring failures you don’t notice until 2:00 a.m. The short version: measure first, ship later, and never compromise correctness to buy headlines.
People have asked whether any of this diminishes the thesis of Part I. It doesn’t. It completes it. The first chapter was about showing how far you can push with an AI partner when you keep your hands on the wheel. This chapter is about what it takes to make that speed useful under pressure. I used Claude here as a compiler for intent: confined to small, explicit boxes with strong invariants, it’s a productive ally. Left to its own devices, it “fixed” races by inserting mutexes into hot paths and “stabilized” performance by deleting optimizations. The rule I follow now is simple enough to tape to the monitor: architecture first, AI second, micro-opts last.
The Claude Code Recursion Problem
Working with Claude Code on logpi’s performance enhancements felt like déjà vu with a twist. Where the first sprint taught me that Claude’s “helpful” suggestions often masked deeper architectural problems, this optimization phase proved that the same pathological patterns resurface at every level of complexity. That’s not a bug in the model; it’s a feature of how it approaches unfamiliar performance domains.
Take the allocator economics problem. When I described the 772+ million malloc calls to Claude Code and asked for optimization strategies, its first instinct was to suggest a custom memory pool. Sounds reasonable, right? Wrong. The suggestion was to create multiple pools, one per thread, with complex synchronization to handle cross-thread deallocations. That’s not optimization; that’s architecture by committee, where the committee is a single AI that doesn’t understand the difference between “technically correct” and “operationally sane.”
The pattern from Part I repeated with surgical precision: Claude Code would take a focused performance question and explode it into a distributed systems problem. Ask about reducing allocation overhead, get a lecture about lock-free data structures. Ask about memory layout, get a proposal for NUMA-aware custom allocators. Every answer was technically sophisticated and practically useless, because the AI consistently optimizes for complexity over constraint.
More frustrating was watching it reproduce the same “parallel everything” anti-pattern that nearly killed the first version. When I showed Claude Code the single-threaded hash solution that worked, it immediately suggested “improvements”: reader-writer locks, concurrent hash maps, thread pools for the lookup operations. All the code it generated compiled cleanly and ran correctly on small test cases. All of it collapsed into scheduler hell under real workloads, just like before.
Here’s what I learned: Claude Code has no institutional memory between conversations, even within the same project. Every session starts from zero context about what has already failed. You can’t say “remember when that parallel hash idea caused lock contention issues?” because it genuinely doesn’t remember. It will cheerfully suggest the same optimization that you spent two days proving was counterproductive, with the same confident explanations about why this time will be different.
The atomic counter progress reporting fix is a perfect example. In Part I, Claude’s timing thread consumed more cycles reporting progress than the actual work being timed. This time, when I asked Claude Code how to fix the new progress bottleneck, its first suggestion was… another timing thread. With better caching. And maybe some lock-free atomics to reduce overhead. It took three explicit rejections before it suggested the SIGALRM approach that worked, and even then it wanted to add a backup thread “just in case the signal doesn’t fire reliably.”
The tools compound the problems. Claude Code’s integration with the development environment means it can see your build outputs, test results, and profiler data in real-time. That sounds like an advantage until you realize it creates a feedback loop where the AI doubles down on complexity because the intermediate measurements look promising. When the NUMA-aware allocator showed 15% improvements on microbenchmarks, Claude Code interpreted that as validation and suggested extending the approach to the entire system. It couldn’t see that the 15% microbenchmark gain was costing 40% on the end-to-end workflow.
I developed a new rule specific to Claude Code: treat every suggestion as suspect until proven under realistic load. Not synthetic benchmarks. Not unit tests. Not “this should be faster in theory.” Real logs, real memory pressure, real scheduler contention. The AI is excellent at producing code that satisfies local constraints and terrible at understanding global system behavior.
The conservative array growth strategy that fixed the pathological load balancer case is instructive here. When I explained the problem to Claude Code, one IP address appearing on millions of lines, causing exponential memory growth, it suggested three different solutions in sequence: 1) A bloom filter to detect hot keys, 2) A two-tier storage system with overflow handling, 3) A machine learning model to predict growth patterns. All technically creative. All missing the point. The actual solution was boring: cap the growth rate and accept that some keys get bigger arrays than they need. Claude Code called this approach “suboptimal” because it “wastes memory.” It couldn’t understand that wasting a few megabytes predictably is infinitely better than requesting 136 GB unpredictably.
The most revealing moment came when I asked Claude Code to review the final performance numbers: 125M+ LPM, stable memory usage, clean shutdown under SIGINT. Its response was to suggest “areas for improvement”: GPU acceleration, distributed processing, memory-mapped I/O optimizations. Even faced with a 20× performance gain and production stability, the AI’s instinct was to complexify rather than celebrate constraint.
This isn’t AI-bashing; it’s pattern recognition. Claude Code is a powerful compiler for intent, exactly as described in Part I. But its intent-compilation process is biased toward technical sophistication over operational simplicity. It will reliably suggest the most intellectually interesting solution rather than the most pragmatically effective one. The human’s job is not to accept or reject those suggestions wholesale, but to extract the useful kernel and ignore the architectural flourishes.
The lesson that bridges both parts: AI partnership works best when the human maintains architectural authority. Let Claude Code generate the hash function optimizations, the SIMD intrinsics, the low-level bit manipulation. Don’t let it design your threading model, your memory management strategy, or your error handling philosophy. Those decisions require understanding trade-offs that span the entire system lifecycle, and no amount of training data can substitute for experiencing a 2 AM production outage caused by a “theoretically optimal” design choice.
Let me ground this in how you use the thing. You point logpi at the log you already have on disk and let it run. You see line counts advance each minute, so you don’t have to guess whether you’ll be waiting two minutes or twenty. When it’s done, the index is small enough to live with the log without fuss. Then, during the incident, you run a query for the handful of addresses you think matter (an attacker VPN, a suspicious CDN edge, a rogue NAT pool) and you get back precise line/offset hits across the time window that matters. The result is not “a vibe of activity”; it’s a crisp pivot. That’s what shaves hours off incident response.
There’s a quiet advantage to the design that isn’t obvious until you need it: the right serialization, in the right place, is not a sin. It’s a performance feature. Only the writes serialize. Reads do not. That means read-heavy workloads, the vast majority of indexing, scale with cores, while the writes happen on a single thread that’s easy to reason about and easy to measure. When something does get slow, you know where to look: queue depth and write ownership, not a swarm of thread stacks that all insist they might be guilty.
The project’s own history is a useful teacher. Day zero: ~7M LPM, single thread, no progress, a design that was “fine” for small files and plainly inadequate for 132 GB monsters. Day one: hash work, bucket count up, collisions down, growth checks off the hot path, bought a clean 10×, landing us at 60M. By day five, producer–consumer plumbing had a real foundation. By day eight, serial and parallel produced identical outputs. Day eleven, queue and cache work made the flow smoother. Day twelve, progress reporting stopped taxing the run and started telling the truth. Day fifteen, the test harness caught up to the ambition and we stamped it production ready (at least as “production ready” as tools that I build get). None of this was magic. It was a sequence of small, falsifiable bets. That’s how you make performance that lasts.
If you want a single benchmark image to keep in your head, use this one: the 132 GB file that used to feel like a shift change is now a coffee refill. The console prints five or six progress lines, ~55–61M LPM each in serial mode, ~120–125M LPM in parallel mode, and then you’re done. The difference is not that you “saved time.” The indexing is the pre-work; searching for IP addresses against the index is the reward. Consider: searching for 10 IP addresses across 30, 60, or 90 days of raw log data can take days to complete; against pseudo-indexed logs, it takes minutes. That time compression means that you didn’t lose the plot. You stayed in the same mental thread from hypothesis to answer. You didn’t have to switch tools, machines, or rooms. That’s the real speedup people miss when they stare at throughput numbers divorced from human workflow.
There’s still headroom, and I plan to use it, but without romance. The extractor stage now has enough slack to justify a second SIMD pass. I’m interested in distributed sharding for environments where you can partition the workload cleanly and merge catalogs at query time. I’m not allergic to GPUs; I’m allergic to unfalsifiable claims. If a GPU wins after you include transfer, batching, and back-pressure overhead, great, we’ll ship that. If a tuned CPU path outpaces it, also great, we’ll ship that instead. The goal is not trend compliance. The goal is time-to-answer.
One more story because it’s quietly important. While we were replacing the linked lists with growable arrays, we hit a terrifying error: a request to grow to billions of entries, as if the pointers themselves had been cursed. It wasn’t corruption. It was reality. A load balancer IP really did appear on most lines. The allocator obeyed exactly what we asked and tried to give that one key the entire house. That’s the difference between a demo and a durable tool: the demo blames the data; the durable tool changes its growth strategy and handles the worst day without wrecking the machine. We changed the strategy. The worst day became a shrug.
If you’re measuring this at home, do yourself a favor and keep the constraints sacred. Verify accuracy and completeness of the analysis, and byte-for-byte compatibility between serial and parallel modes. Time multiple runs and look for stability instead of one-off peaks. Turn on Valgrind and ThreadSanitizer before you brag. And when the numbers look too good to be true right after you added a “safety” patch, assume they are. We only shipped speed that survived those rules. That’s why the tool still feels calm at 2:13 a.m. when coffee tastes like firewall soot and the only thing you want is the truth, quickly. Here’s what that looks like in practice:
```
$ time spi 91.238.181.91,23.248.217.56 ~/data/*.log
Searching for 23.248.217.56 91.238.181.91
MATCH [91.238.181.91] with 6 lines
Opening [~/data/daemon.log] for read
Jun 9 23:59:50 192.168.10.21 spamd[27698]: 91.238.181.91: connected (1/0)
Jun 9 23:59:50 192.168.10.21 spamd[27698]: 91.238.181.91: connected (2/0)
Jun 9 23:59:52 192.168.10.21 spamd[27698]: 91.238.181.91: disconnected after 2 seconds.
Jun 9 23:59:53 192.168.10.21 spamd[27698]: 91.238.181.91: connected (2/0)
Jun 9 23:59:54 192.168.10.21 spamd[27698]: 91.238.181.91: disconnected after 4 seconds.
Jun 9 23:59:57 192.168.10.21 spamd[27698]: 91.238.181.91: disconnected after 4 seconds.
MATCH [23.248.217.56] with 6 lines
Opening [~/data/local7.log] for read
Jun 9 02:37:01 honeypi00 sensor: PacketTime:2025-06-09 09:37:01.205778 Len:62 IPv4/TCP 23.248.217.56:4214 -> 10.10.10.40:80 ID:55780 TOS:0x0 TTL:111 IpLen:20 DgLen:48 ****S* Seq:0xa9d343cf Ack:0x0 Win:0x2000 TcpLen:28 Resp:SA Packetdata:3KYyaJoq1Haglac3CABFAAAw2eRAAG8GLIEX+Nk4CgoKKBB2AFCp00PPAAAAAHACIABfVAAAAgQFtAEBBAIO
Jun 9 02:37:01 honeypi00 sensor: PacketTime:2025-06-09 09:37:01.424181 Len:60 IPv4/TCP 23.248.217.56:4214 -> 10.10.10.40:80 ID:22927 TOS:0x0 TTL:43 IpLen:20 DgLen:40 ***R** Seq:0xa9d343d0 Ack:0x0 Win:0x0 TcpLen:20 Resp: Packetdata:3KYyaJoq1Haglac3CABFAAAoWY9AACsG8N4X+Nk4CgoKKBB2AFCp00PQAAAAAFAEAACsFAAAAAAAAAAADg==
Jun 9 02:37:07 192.168.10.21 date=2025-06-09 time=02:37:07 devname="FortiWiFi-80F-2R" devid="FWF80FTK21002176" eventtime=1749461827438416270 tz="-0700" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=23.248.217.56 srcport=4214 srcintf="wan2" srcintfrole="wan" dstip=99.88.84.61 dstport=80 dstintf="BLACKHOLE" dstintfrole="dmz" srccountry="United States" dstcountry="United States" sessionid=4327215 proto=6 action="client-rst" policyid=46 policytype="policy" poluuid="1d2c8168-c62e-51ea-8f55-d563c17af0df" policyname="WAN->HOLE-HONEYPI_ALL" service="HTTP" trandisp="dnat" tranip=10.10.10.40 tranport=80 appcat="unscanned" duration=6 sentbyte=88 rcvdbyte=40 sentpkt=2 rcvdpkt=1 dsthwvendor="Raspberry Pi" masterdstmac="dc:a6:32:68:9a:2a" dstmac="dc:a6:32:68:9a:2a" dstserver=0
Jun 9 05:54:02 honeypi00 sensor: PacketTime:2025-06-09 12:54:02.574892 Len:62 IPv4/TCP 23.248.217.56:25224 -> 10.10.10.40:80 ID:45973 TOS:0x0 TTL:113 IpLen:20 DgLen:48 ****S* Seq:0xe907ac57 Ack:0x0 Win:0x2000 TcpLen:28 Resp:SA Packetdata:3KYyaJoq1Haglac3CABFAAAws5VAAHEGUNAX+Nk4CgoKKGKIAFDpB6xXAAAAAHACIABlhQAAAgQFtAEBBAIO
Jun 9 05:54:02 honeypi00 sensor: PacketTime:2025-06-09 12:54:02.749473 Len:60 IPv4/TCP 23.248.217.56:25224 -> 10.10.10.40:80 ID:14822 TOS:0x0 TTL:43 IpLen:20 DgLen:40 ***R** Seq:0xe907ac58 Ack:0x0 Win:0x0 TcpLen:20 Resp: Packetdata:3KYyaJoq1Haglac3CABFAAAoOeZAACsGEIgX+Nk4CgoKKGKIAFDpB6xYAAAAAFAEAACyRQAAAAAAAAAADg==
Jun 9 05:54:09 192.168.10.21 date=2025-06-09 time=05:54:09 devname="FortiWiFi-80F-2R" devid="FWF80FTK21002176" eventtime=1749473648788417590 tz="-0700" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="root" srcip=23.248.217.56 srcport=25224 srcintf="wan2" srcintfrole="wan" dstip=99.88.84.61 dstport=80 dstintf="BLACKHOLE" dstintfrole="dmz" srccountry="United States" dstcountry="United States" sessionid=4476397 proto=6 action="client-rst" policyid=46 policytype="policy" poluuid="1d2c8168-c62e-51ea-8f55-d563c17af0df" policyname="WAN->HOLE-HONEYPI_ALL" service="HTTP" trandisp="dnat" tranip=10.10.10.40 tranport=80 appcat="unscanned" duration=6 sentbyte=88 rcvdbyte=40 sentpkt=2 rcvdpkt=1 dsthwvendor="Raspberry Pi" masterdstmac="dc:a6:32:68:9a:2a" dstmac="dc:a6:32:68:9a:2a" dstserver=0
real 0m0.032s
user 0m0.014s
sys 0m0.018s
```
Part I planted the flag. It said velocity with verification is the new baseline. Part II cashes that check with a system that makes the biggest files feel small without compromising trust. That’s the whole point. And if you’re still wondering whether the human side matters in a story stuffed with hash buckets and atomics, let me answer plainly: it’s the only thing that matters. We’re not optimizing for the machine; we’re optimizing for the analyst who has to hold the investigation in their head. logpi just gets everything else out of the way.
If you want a quick taste of the flow I’m describing, the search above is what a good day looks like: two addresses, a dozen matching lines across two log files, and a real time of 0.032 seconds.
That’s steadiness you can feel. That’s answers arriving while your questions are still warm. And that’s the difference between parsing history and defending the future.