Untitled Excavation

120 Billion Parameters of Wishful Thinking

My AI stack couldn’t find its own bug patterns. Eight hundred lines of curated knowledge, four projects indexed, a ChromaDB instance humming along — and when I asked “how did we fix the auth bug in Valkyrie?” it handed me an Express route ordering pattern from a different project.

I’d been running a multi-model system for weeks. Claude for interactive work, Ollama for local inference, ChromaDB for semantic search, a hook pipeline that auto-extracts lessons from every session. The whole rig. Not bad. Not great, either. Retrieval felt like asking a librarian who’d read every book but filed them all under “miscellaneous.”

Then someone handed me a research report.

Twelve model additions. Prioritized, costed, benchmarked. Gemini 3.1 Pro for corpus digestion. gpt-oss-120b for local reasoning. Codestral Embed for code-aware retrieval. Safety classifiers. Vision-language models. The works. Twenty pages of “here’s what you’re missing” with Mermaid diagrams and pricing tables.

I almost started installing things.

The Audit

Instead, I did something uncharacteristic. I measured first.

Three parallel agents. One tore through every hook script — eighteen Python files, cataloging dependencies, timing latency, hunting race conditions. One mapped the entire RAG pipeline — embedding dimensions, chunk sizes, distance thresholds, query flow. The third ran nvidia-smi and Get-CimInstance Win32_PhysicalMemory.

Here’s what came back:

CPU: AMD Ryzen 9 7845HX (12c/24t)
RAM: 16 GB
GPU: RTX 4070 Laptop (8 GB VRAM)

Sixteen gigs of RAM. Eight gigs of VRAM.

The report’s Priority #2 — gpt-oss-120b, “ideal local precompact distiller” — needs roughly 80 GB of VRAM. Even the 20B variant would eat 13 GB at Q4 quantization, leaving nothing for the OS, ChromaDB, or the embedding model that has to run alongside it.
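The arithmetic is easy to sanity-check yourself. A rough rule of thumb: weights take (parameters × bits-per-weight ÷ 8) bytes, plus something like 20-30% overhead for the KV cache and runtime buffers. A back-of-the-envelope sketch — the 1.3× overhead factor is my assumption, not a benchmark:

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.3) -> float:
    """Rough VRAM needed to load a model: weight bytes plus runtime overhead.

    The overhead factor (~1.3x) is an assumed allowance for KV cache and
    framework buffers, not a measured value.
    """
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead

# 20B model at Q4 (4-bit): already well past 8 GB of VRAM
print(round(vram_estimate_gb(20, 4), 1))   # ~13 GB
# 120B model, even at 4-bit
print(round(vram_estimate_gb(120, 4), 1))  # ~78 GB
```

Run against an 8 GB card, neither number is close.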

Half the report’s priorities died on contact with a PowerShell one-liner.

The Actual Problem

But the audit found something the report didn’t even mention. My retrieval pipeline had no reranking. None.

ChromaDB was returning the top 8 results by cosine distance with a hard cutoff at 1.5. That’s it. No second pass. No relevance reordering. Position 1 and position 8 — treated identically by the synthesis model. Like handing a chef eight ingredients sorted by color and asking them to make dinner.

The hooks audit found worse. Non-atomic file writes in the pattern metadata system — if the process crashes mid-write, the file corrupts. Race conditions in the session tracker — multiple hooks writing to the same JSON file simultaneously. A pattern cache with a one-hour TTL that doesn’t invalidate when the source file changes.
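The atomic-write bug has a boring stdlib fix: write to a temp file in the same directory, then swap it in with os.replace, which is atomic on both POSIX and Windows. A sketch — the function name and JSON shape are illustrative, not the actual hook code:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON so a mid-write crash never leaves a corrupt file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

For the race between concurrent hooks, the minimal cross-platform gate is a lock file created with os.open(lock_path, os.O_CREAT | os.O_EXCL) and retried on failure — no extra dependencies required.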

The system scored 7/10. Architecturally sound. Operationally brittle. A motorcycle with a good engine and bald tires.

The Fix That Actually Matters

bge-reranker-v2-m3. Three hundred megabytes. Runs on CPU. Free.

You retrieve 20 candidates from ChromaDB, pass them through the reranker, take the top 5. Community RAG practitioners — the ones who’ve actually shipped this stuff — report 15-30% recall improvement from adding reranking alone. More than upgrading embeddings. More than switching vector databases. More than throwing a 120-billion-parameter model at the problem.
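The pattern is a two-stage funnel: cast a wide net with cheap vector search, then spend compute only on reordering that shortlist. In production the scorer would be a real cross-encoder — bge-reranker-v2-m3, loaded via something like sentence-transformers' CrossEncoder — which reads the query and document together rather than comparing two independent embeddings. In this sketch the scoring function is a trivial lexical stand-in so the shape of the pipeline stays visible:

```python
def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Second-stage rerank: score each (query, doc) pair, keep the best top_k.

    score_pair is a toy stand-in for a real cross-encoder such as
    bge-reranker-v2-m3; swap it for the model's predict() in practice.
    """
    def score_pair(q: str, doc: str) -> float:
        # Toy relevance: fraction of query terms present in the document.
        q_terms = set(q.lower().split())
        d_terms = set(doc.lower().split())
        return len(q_terms & d_terms) / max(len(q_terms), 1)

    ranked = sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)
    return ranked[:top_k]

# Stage 1 would be ChromaDB with n_results=20; stage 2 narrows to top_k.
docs = ["express route ordering fix",
        "valkyrie auth bug fixed by token refresh",
        "chromadb chunk size notes",
        "auth bug in valkyrie race in session check"]
print(rerank("auth bug valkyrie", docs, top_k=2))
```

The synthesis model now sees five documents ordered by actual relevance instead of eight ordered by embedding proximity.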

The phased roadmap wrote itself:

  1. Phase 0: Fix the bugs. Atomic writes. File locking. Cache invalidation. Cost: $0. Time: 4 hours.
  2. Phase 1: Add local reranking. Cost: $0. Time: 2 hours. Impact: +25% retrieval recall.
  3. Phase 2: Gemini API for long-context corpus synthesis. Cost: $15-40/month. The one thing no local model can do on this hardware.
  4. Phase 3: DeepSeek API for cheap extraction validation. Cost: $1-10/month. Quality gate for the auto-extraction pipeline.
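Phase 0's cache fix is the least glamorous item on the list, so it's worth spelling out: a TTL alone can serve stale patterns for up to an hour after the source file changes. Checking the file's mtime alongside the TTL closes that window. A minimal sketch — the class and its structure are illustrative, not the actual hook module:

```python
import os
import time

class PatternCache:
    """Cache keyed by file path; entries expire on TTL *or* source change."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        # path -> (cached_at, source_mtime, contents)
        self._entries: dict[str, tuple[float, float, str]] = {}

    def get(self, path: str) -> str:
        mtime = os.path.getmtime(path)
        entry = self._entries.get(path)
        if entry is not None:
            cached_at, cached_mtime, data = entry
            fresh = (time.time() - cached_at) < self.ttl
            unchanged = cached_mtime == mtime
            if fresh and unchanged:
                return data  # hit: within TTL and the file hasn't moved
        # Miss: (re)load and remember the source file's mtime.
        with open(path) as f:
            data = f.read()
        self._entries[path] = (time.time(), mtime, data)
        return data
```

One stat() call per lookup buys you a cache that can never be more stale than the filesystem.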

Total cost to go from 7/10 to ~9/10: less than fifty bucks a month and a weekend of work. No new hardware. No 120-billion-parameter fantasies.

The Principle

There’s a passage in Zen and the Art of Motorcycle Maintenance where Pirsig talks about how most mechanical problems aren’t solved by buying better parts. They’re solved by understanding the machine you already have. The stuck screw doesn’t need a bigger wrench. It needs penetrating oil and patience.

I had a twelve-model shopping list and a system that couldn’t atomically write to a file. The reranker wasn’t even in the report’s top 5 priorities — it showed up at #6, framed as a local alternative to Cohere’s API. Meanwhile, the report’s #2 pick was physically impossible to run on my hardware.

Research reports optimize for the general case. Your system is not the general case. Your system is 16 gigs of RAM and a pattern metadata module that rewrites the entire file to increment a hit counter.

Measure your machine before you modify it. The 300-megabyte fix will almost always beat the 120-billion-parameter dream.

What’s the reranker in your stack — the boring thing you haven’t tried because the shiny thing is more fun to think about?