A partner asked the chat, “Who is Zack Robinson?” The model responded with a three-sentence description of a Python package installation process. It cited eight sources. All eight sources were metadata files from inside a site-packages directory that should never have been indexed.

The model was not wrong. The model was faithfully summarizing the most relevant documents in its search results. The problem was that the most relevant documents were garbage. The Diviner had read every scroll in the archive. Nobody had checked whether the archive was worth reading.


I had built a RAG pipeline over five project directories — about 600 real documents across four active software projects. The indexer walked each directory, chunked the files, embedded them, and stored them in Postgres. It worked beautifully in development.
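The walk-chunk-embed loop can be sketched in a few lines. The chunk size, overlap, and function names here are my assumptions, not the actual pipeline — the one detail that matters is that nothing in this loop decides what *not* to index:

```python
from pathlib import Path


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunker with overlap (illustrative only)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def walk_and_chunk(root: Path) -> list[tuple[str, str]]:
    """Walk a project directory and chunk every readable text file.

    Note the missing piece: there is no exclusion check here. This is
    exactly the boundary the original crawler lacked.
    """
    results = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for chunk in chunk_text(text):
            results.append((str(path), chunk))
    return results
```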

In production, the index contained 118,067 chunks. I had never audited that number. If I had divided it by the roughly 600 source documents, I would have gotten 196 chunks per document, which should have been a red flag. The actual document count was much higher because the crawler had walked into directories it should have skipped.
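That red flag is one line of arithmetic. The numbers below are the ones from the incident:

```python
# Sanity check I never ran: chunks per source document.
total_chunks = 118_067
source_documents = 600  # rough count of real files across the projects

chunks_per_doc = total_chunks // source_documents
print(chunks_per_doc)  # 196

# A few hundred prose and code files should chunk into a handful of
# pieces each. A ratio near 200 does not mean the documents are long;
# it means the index contains far more files than you believe exist.
```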

The discord-playlist-bot project had a Python virtual environment at venv/, not .venv/. My exclude list blocked .venv/ but not venv/. One missing character in a config file, and the entire googleapiclient discovery cache — 54,602 JSON files describing every Google API endpoint ever published — was faithfully indexed, embedded, and searchable.
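The robust fix is to stop matching exclusions as literal path prefixes and instead exclude by directory name anywhere in the path. A minimal sketch — the exclusion set here is my guess at a sensible list, not the actual config:

```python
from pathlib import PurePosixPath

# Excluding by directory *name* catches venv/ and .venv/ alike,
# at any depth. A literal-prefix list that contains only ".venv"
# walks straight into venv/ -- the one-character bug in the story.
EXCLUDED_DIR_NAMES = {
    ".venv", "venv", "site-packages", "node_modules",
    "__pycache__", "discovery_cache",
}


def is_excluded(path: str) -> bool:
    """True if any component of the path is a known dependency dir."""
    return any(part in EXCLUDED_DIR_NAMES
               for part in PurePosixPath(path).parts)
```

With this check, `is_excluded("discord-playlist-bot/venv/lib/site.py")` is true regardless of whether the virtual environment has a leading dot.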

In total, 83,496 of the indexed chunks came from site-packages directories, discovery cache folders, and RAG test fixture files. Seventy percent of my production database was sediment. The actual project documentation was an archaeological layer buried underneath tens of thousands of JSON files describing alloydb.v1beta parameters.

Like an episode of Hoarders, except the hoarder is a for-loop with no boundary checking.


The wrong theory was that this was a model problem. The responses were bad, so maybe the system prompt needed to be stricter. Maybe the embeddings needed to be better. Maybe the temperature was too high. I spent the previous session hardening the system prompt with formatting rules, security constraints, and behavior guidelines. The model followed all of them perfectly. It formatted the garbage beautifully.

The actual problem was upstream of the model, upstream of the embeddings, upstream of the search algorithm. The problem was in the data. The crawler had ingested everything it could reach, and nobody had ever checked what “everything” meant.

This is the insidious thing about RAG pipelines. The search returns results. The model generates fluent responses. The UI renders them cleanly. At no point does anything throw an error. The system functions exactly as designed, and the output is useless, and the only way to know is to read the answer and notice that it is describing Python package metadata instead of the person who built the platform.


Three fixes, one per layer, deployed together:

  1. Database cleanup. Connect to Postgres. DELETE FROM documents where source path matches venv/, site-packages/, discovery_cache/. Result: 83,496 rows deleted. Index reduced from 118,067 to 34,571 chunks.

  2. Search scoring. Add document-type boosting to the search merge function. READMEs get a 1.5x score multiplier. Session summaries get 0.3x. Tests and scripts get 0.7x.

  3. API-layer filtering. For non-admin users: drop chunks with relevance score below 0.1. Block chunks from session summaries, legal documents, and personal content by source path. If all chunks are filtered out, fall back to baseline context in the system prompt.
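Fixes 2 and 3 can be sketched together. The multipliers, threshold, and blocked categories are the ones from the list above; the chunk shape, doc-type labels, and path conventions are my assumptions:

```python
# Fix 2: per-document-type score boosting (multipliers from the text).
TYPE_BOOSTS = {"readme": 1.5, "session_summary": 0.3,
               "test": 0.7, "script": 0.7}

# Fix 3: filtering rules for non-admin users. The path hints below are
# hypothetical -- stand-ins for however sensitive content is tagged.
BLOCKED_PATH_HINTS = ("session_summaries/", "legal/", "personal/")
MIN_SCORE = 0.1


def boost(chunk: dict) -> dict:
    """Multiply the raw relevance score by a per-doc-type boost."""
    multiplier = TYPE_BOOSTS.get(chunk.get("doc_type", ""), 1.0)
    return {**chunk, "score": chunk["score"] * multiplier}


def filter_for_user(chunks: list[dict], is_admin: bool) -> list[dict]:
    """Drop low-relevance and sensitive chunks for non-admin users.

    Callers should fall back to baseline context in the system prompt
    when this returns an empty list.
    """
    if is_admin:
        return chunks
    return [
        c for c in chunks
        if c["score"] >= MIN_SCORE
        and not any(hint in c["source"] for hint in BLOCKED_PATH_HINTS)
    ]
```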

After deploy, same question: “Who is Zack Robinson?” Response: systems-oriented technical leader, 8+ years experience, builds emergency dispatch platforms and market intelligence apps. Cited one source. The right source.


The lesson is older than language models. It is older than search engines. It is older than computers. Garbage in, garbage out. The phrase was coined in 1957, and it has not become less true in sixty-nine years.

When your RAG pipeline produces bad answers, the instinct is to fix the model layer: better prompts, better embeddings, more guardrails. Sometimes that is the right instinct. But I had already spent an entire session hardening the system prompt. I had added formatting rules, security constraints, citation guidelines, and a 300-word response cap. The model followed every single instruction. It formatted garbage immaculately.

The problem was never the model’s behavior. The problem was the data I was asking it to behave with. Before you touch the model, run one query:

SELECT COUNT(*) FROM documents;

If the number surprises you, the problem is not the model. The problem is that you never looked at what you fed it.
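And if the count does surprise you, the next question is where the chunks came from. A sketch of that audit over the stored source paths — the bucket names are mine, and the real version would run as a GROUP BY in Postgres rather than in application code:

```python
from collections import Counter
from pathlib import PurePosixPath

# Directories that signal dependency sediment rather than real docs.
DEPENDENCY_DIRS = {"venv", ".venv", "site-packages", "discovery_cache"}


def audit_sources(source_paths: list[str]) -> Counter:
    """Bucket indexed chunk paths by the kind of directory they
    came from, so garbage shows up as a number instead of as a bad
    answer in front of a user."""
    buckets: Counter = Counter()
    for path in source_paths:
        hits = set(PurePosixPath(path).parts) & DEPENDENCY_DIRS
        buckets[sorted(hits)[0] if hits else "project"] += 1
    return buckets
```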

I had 118,067 chunks. I needed 34,571. The other 83,496 were not just useless — they were actively harmful, because they displaced the real documents in every search result. Removing them did more for response quality than every prompt engineering change combined.


The model was never wrong. It summarized exactly what you gave it. Next time, look at what you are giving it.