
Long contexts are a trap (sometimes)

Long context is a tool, not a strategy. Here is when it earns its keep, when it quietly costs you, and why curated beats comprehensive almost every time.

I have a million-token context window and I'm still hand-picking which files to share. That sentence should feel like a contradiction, and for a while it did. The whole pitch of long context was that we would stop curating and just hand the model everything. Read the repo. Read the docs. Read the meeting transcripts. Read the customer's entire support history. Let the model sort it out.

I tried that. I tried it a lot. And then I quietly went back to hand-picking the four files that actually matter.

This post is the honest version of why: when long context earns its keep, when it quietly bills you for nothing, and the workflow comparison that finally made me stop dumping and start picking.

When long context is genuinely the right answer

There is a real list, and I want to start there so it does not sound like I am dismissing the capability. Long context is a serious tool. It just is not a strategy.

The first place it earns its keep is read-the-whole-monorepo refactors. When the change you are making touches twenty files across five packages and you genuinely cannot predict which ones until you have read all of them, dumping the lot is the right move. Same for cross-cutting concerns. Renaming a concept everywhere. Tightening a type that propagates through dozens of call sites. The model needs to see the shape of the whole thing, not a slice.

The second is multi-file debugging where the bug is in the interaction, not in one place. A race condition between two services. A state machine that drifts because three different files each assume slightly different invariants. You can chase this with grep, but you are essentially playing detective on the model's behalf, and the model is better at the detective work if you let it. Hand it everything that could be relevant and let it correlate.

The third is summarizing very long documents. A 200-page PDF, a six-hour transcript, a year of meeting notes. There is no clever retrieval strategy here. You need the whole document in the window, in order, with the model walking it end to end. Long context was made for this and it is genuinely great at it.

That is the honest list. Three shapes. If your task looks like one of these, dump everything and stop worrying about it.

When long context quietly hurts you

And here is the list I wish someone had hammered into me earlier, because it is the list most of my workdays fall into.

Latency. A million tokens is not free to process even when the provider tells you it is fast. The first token takes longer. The whole interaction feels heavier. You stop having a conversation and start filing requests, because each one costs ten seconds before the model even starts thinking. Iteration speed quietly dies and you do not notice until you compare it to a session where you only loaded what mattered.

Cost. Even with prompt caching, the first run is expensive, and if anything near the start of your prompt shifts between calls, the cached prefix no longer matches and you pay full price again. A focused four-file prompt costs cents. A 200-file dump costs dollars. Multiply that by every iteration of every task in a week and the difference is real.
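The cents-versus-dollars gap is easy to sanity-check with a back-of-the-envelope sketch. The price per million input tokens and the average tokens per file below are placeholders I made up for illustration, not any provider's real pricing, so plug in your own numbers.

```python
# Hypothetical numbers for illustration only; substitute your provider's pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # dollars, assumed
AVG_TOKENS_PER_FILE = 4_000            # assumed average file size in tokens


def input_cost(num_files: int, iterations: int = 1) -> float:
    """Uncached input cost in dollars for sending num_files on every iteration."""
    tokens = num_files * AVG_TOKENS_PER_FILE * iterations
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


focused = input_cost(num_files=4)    # 16k tokens per call
dump = input_cost(num_files=200)     # 800k tokens per call
print(f"focused: ${focused:.3f} per call, dump: ${dump:.2f} per call")
```

Under these assumed numbers the dump costs about fifty times more per uncached call, and that is before you count the extra iterations it tends to need.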

Recall accuracy. This is the one that surprised me most. The needle-in-a-haystack benchmarks have gotten very good, but real tasks are not benchmarks. When the model has 800k tokens of context and you ask it to use a specific helper function from one of those files, it sometimes just doesn't. It generates a plausible function that does the same thing, ignoring the one you wanted it to call. The information was technically in the window. The attention was not.

And then there is the irrelevant-pattern problem. The model is trained to find patterns. Give it 200 files and it will find patterns, including ones that have nothing to do with the task. It will mimic a naming convention from a file you didn't care about. It will inherit a code style from a vendored dependency you forgot was in there. It will reference a deprecated helper because it appeared three times in the dump. The signal is in there. So is a lot of noise, and the model cannot always tell which is which.

Two workflows, same task

Here is the comparison that finally made me change how I work. Same bug, two approaches, back to back on the same afternoon.

Workflow A. Dump 200 files into context. Ask for the bug fix. The model spent a long time thinking. It produced a fix that touched four files. Two of those changes were correct. One was unrelated cleanup it had decided to throw in because it noticed an inconsistency in a totally different module. One was wrong, because it had latched onto a deprecated pattern from a vendored library inside the dump and tried to apply it. Cost was real money. Time per iteration was about three minutes. Iterations needed, three.

Workflow B. Grep for the error message. Read the trace. Identify the four files that actually touch the bug. Paste those four into the prompt. Ask for the same fix. The model produced the correct change in one pass. It was focused, it referenced the exact functions in the files I gave it, and it did not invent or import anything. Cost was a small fraction. Time per iteration was about twenty seconds. Iterations needed, one.
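The grep-and-paste step in Workflow B can be sketched in a few lines. This is a minimal illustration, not the exact tooling I used: the function names, the `.py` suffix filter, and the `###` header format are all assumptions, and in practice you would still read the trace and prune the hits by hand.

```python
from pathlib import Path


def files_mentioning(root: str, needle: str, suffixes=(".py",)) -> list[Path]:
    """Return source files under root whose text contains needle, sorted by path."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file; skip it rather than abort the search
        if needle in text:
            hits.append(path)
    return sorted(hits)


def build_prompt(paths: list[Path], task: str) -> str:
    """Concatenate the chosen files under path headers, then append the task."""
    parts = [
        f"### {p}\n{p.read_text(encoding='utf-8', errors='ignore')}" for p in paths
    ]
    parts.append(f"### Task\n{task}")
    return "\n\n".join(parts)


# Example usage (hypothetical path and error message):
# hits = files_mentioning("src", "connection reset by peer")
# prompt = build_prompt(hits, "Fix the bug that raises this error.")
```

The point of the sketch is the shape of the workflow: search narrows the candidates, you confirm them, and only the confirmed files go into the prompt.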

The B workflow was not just faster and cheaper. It was better. The output was more correct on the first try, because the model's attention was not diluted across 196 files of irrelevant context.

Retrieval is not a legacy pattern

There is a story going around that retrieval-augmented generation is a stopgap, a workaround for short context windows, and the moment context got big enough we would all delete our vector databases and move on. I do not believe that anymore.

Retrieval is not a legacy pattern. It is a precision tool. The job of retrieval is to decide what the model should pay attention to, and that job did not become unimportant when the window got bigger. It became more important, because now the cost of getting it wrong is larger. An irrelevant chunk in a 4k window was a small distraction. An irrelevant chunk in a 1M window is still a small distraction, and you have a lot more of them.

Treat context as the model's attention budget. That is the frame that made everything click for me. The window is not a bag you fill. It is a budget you spend. Every token you put in there draws attention away from the tokens that actually matter.
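If you want to take the budget framing literally, here is one way it could look in code. This is a sketch under loud assumptions: the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer, and the relevance score is whatever signal you trust (grep hit counts, a retrieval score, your own judgment).

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def spend_budget(candidates: list[tuple[str, str, float]], budget: int) -> list[str]:
    """Pick the most relevant files that fit within a budget of tokens.

    candidates holds (name, text, relevance) tuples; higher relevance wins.
    """
    chosen, spent = [], 0
    for name, text, _score in sorted(candidates, key=lambda c: c[2], reverse=True):
        cost = estimate_tokens(text)
        if spent + cost > budget:
            continue  # does not fit; a smaller, less relevant file still might
        chosen.append(name)
        spent += cost
    return chosen
```

The design choice worth noticing is that the budget is set first and the files compete for it, rather than the window size deciding how much gets loaded.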

Prompt caching is the saving grace

One honest caveat. When the dump-everything pattern does fit, prompt caching makes it survivable. You pay the full price the first time, and every subsequent request that starts with the same prefix is a tiny fraction of that cost. For workflows that legitimately need a lot of static context, a codebase you query repeatedly during a long session, a long document you are extracting from in stages, caching turns the economics from painful to reasonable.

It does not solve the attention dilution problem. The model still has to read the whole window. But it does mean the cost argument against long context is weaker than it used to be, and that is worth being fair about.
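Since caching keys on the prompt starting with the same prefix, the practical discipline is to put the static context first, serialize it the same way every time, and keep the part that changes at the end. The sketch below shows only that ordering; it calls no provider API, the function names are made up, and whether and how a prefix actually gets cached depends on your provider.

```python
import hashlib


def split_prompt(static_files: dict[str, str], question: str) -> tuple[str, str]:
    """Return (prefix, suffix): stable context first, the changing question last."""
    # Sort by name so the same set of files always serializes identically.
    prefix = "\n\n".join(
        f"### {name}\n{static_files[name]}" for name in sorted(static_files)
    )
    suffix = f"### Question\n{question}"
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash for checking whether two calls share the same prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint next to each request is a cheap way to notice when something you thought was static has started changing between calls.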

Bigger is not smarter

Here is the line I want to leave you with. Bigger is not smarter. Curated beats comprehensive almost every time.

The capability is real. The discipline is what gets results. Use the long window when the task genuinely needs it. The other 80 percent of the time, grep, read, and pick the four files that actually matter. Your iteration speed, your bill, and your output quality will all thank you.
