Untitled Excavation
None Is Not a Number
The test harness returned TypeError: '<=' not supported between instances of 'NoneType' and 'float'. I’d spent two hours building a 25-query benchmark suite, wired up a cross-encoder reranker, ran the whole pipeline — and Python politely informed me that I was comparing nothing to something.
I was sitting in my home office in Cebu, 11 PM, second cup of coffee gone cold, chasing a 69% improvement in retrieval quality. The kind of night where you tell yourself “one more fix” until the sun comes up and you’ve rewritten your own test.
Which is exactly what happened.
The Baseline Nobody Wants to See
Before touching anything, I measured. Twenty positive queries and five negatives, scored against a ChromaDB index of cross-project knowledge — bug patterns, testing strategies, lessons learned across four codebases. The scoring was simple: did the system return the right pattern, and did it avoid hallucinating patterns that don’t exist?
Positive Avg: 0.605
Negative Avg: 0.000
Combined: 0.484
Point four eight four. Less than half. My curated knowledge base — eight hundred lines of documented fixes, four projects indexed, a hook pipeline that auto-extracts lessons — was performing worse than a coin flip.
The negative score of zero looked good until I realized it meant the system wasn’t returning anything for garbage queries. Which is correct behavior. But still.
Phase 0: The Bald Tires
Six bugs. All boring. All dangerous.
The pattern metadata system was writing files non-atomically. If the process died mid-write — and on Windows with OneDrive syncing, processes die mid-write — the file corrupts. Gone. The session tracker had the same problem, plus a race condition where multiple hooks wrote to the same JSON file simultaneously. The pattern cache ran on a one-hour TTL that didn’t check whether the source file had actually changed. Stale data served with confidence. Like a bartender pouring from yesterday’s open bottle.
os.replace() on a temp file. msvcrt file locking. mtime-based invalidation. The kind of fixes that would bore a conference audience to tears and save your data at 3 AM.
Phase 1: Three Hundred Megabytes of Relevance
bge-reranker-v2-m3. A cross-encoder that takes a query-document pair and scores how well they actually match, not just how close their embeddings are in vector space. Over-retrieve 20 candidates from ChromaDB, rerank, take the top 5.
The integration was clean. Lazy-loaded CrossEncoder, a rerank_results() function, a flag on each result saying whether it came through the reranker or the raw distance metric. Fifteen minutes of coding.
Then I ran the test.
TypeError: '<=' not supported between instances of 'NoneType' and 'float'
When the reranker scores results, there is no cosine distance. The distance field is None. My relevance checker was comparing None <= 1.5 and Python, unlike JavaScript, refuses to pretend that nothing is a number.
Fair enough.
The Test That Tested Itself
Fixed the type error. Added a doc_is_relevant() helper that checks the reranking flag before deciding how to evaluate relevance. Ran it again.
Positive Avg: 0.655
Negative Avg: 0.700
Combined: 0.664
Better. But the negative score dropped from perfect to 0.7. Five queries designed to return nothing were now triggering false positives. And something felt wrong about the positive scores — queries I knew were hitting the right documents were scoring 0.4.
I dug into the scoring logic. Five of my twenty positive queries targeted LESSONS_LEARNED content — entries without a specific pattern ID. The scorer had a branch for “expected pattern found” and a branch for “nothing relevant returned.” No branch for “relevant content found but it doesn’t have a pattern ID.” Those five queries could never score above 0.4. The ceiling was baked into the test.
How do you benchmark a system when the benchmark is broken?
You fix the benchmark. Obviously. But there’s a moment of vertigo when you realize the instrument you’re using to measure improvement is itself the thing that needs improving. It’s turtles all the way down. Or, if you prefer your metaphors from Maynard James Keenan: “I must persuade you another way” — sometimes the tool has to change before the work can begin.
Added keyword coverage scoring for pattern-less queries. Raised the negative threshold from 0.001 to 0.01 — the reranker was returning scores of 0.0012 for garbage queries, technically above the old threshold. Technically a false positive. Technically my fault for setting a threshold that fine-grained.
Positive Avg: 0.775
Negative Avg: 1.000
Combined: 0.820
The Scar
Zero point four eight four to zero point eight two zero. A 69.4% improvement. Twelve of twenty queries now return perfect results. The cross-project queries — “find me a pattern from Valkyrie that applies to DipRadar” — jumped from mediocre to 0.64. Needle-in-haystack lookups hit 0.78.
Total cost: zero dollars. A 300-megabyte model that runs on CPU. Six bug fixes that should have been written months ago. And a test harness that had to learn to measure itself before it could measure anything else.
The report I started with recommended twelve model additions. I’ve implemented one. The system went from “worse than a coin flip” to “finds the right answer four times out of five.” The remaining three weak queries need better index chunking, not bigger models.
I used to think measurement was the easy part. Point the tool at the thing, read the number. But the tool has assumptions baked in. Thresholds encode opinions. Scoring functions have blind spots. You can build a benchmark that tells you exactly what you want to hear, and it’ll feel like progress right up until production proves it wrong.
Measure the machine. Then measure the ruler.