Not all tokens age the same.

An instruction you typed an hour ago still matters. A file the model read twenty turns ago probably doesn't. Long sessions get expensive because most tools treat both of them as if they still do.

Relevance decay

Every component ages on its own curve.

How context ages square-root scale — early turns get more room

User text — anchors, doesn't fully decay

Tool use blocks — the action breadcrumb

Read results — fall off fast

Bash results — fall off faster

Grep / Glob — go stale immediately

Thinking — depends on the model

Most of what's in a long session is already dead weight. Read results and command output collapse within the first five to ten turns. Grep and Glob are stale almost as soon as they arrive. Tool-use calls (the names and arguments of what the agent did) hang around longer because they're the skeleton of the work, but even they fade past turn twenty or so. The green user-text curve is the outlier: original instructions still anchor the session at turn 100, because they're the only record of what the job actually is. Thinking is dashed because Opus 4.5 changed how it's handled — older models strip thinking between turns, newer ones keep it, so the line only applies if you're on 4.5 or up. The shapes are informed by the JetBrains masking study, Chroma's context rot work, and Anthropic's own docs. The exact heights are our best-guess estimates, not benchmarked numbers.

The evidence

Masking holds up against summarizing. For less money.

If the shapes above hold, the fix is straightforward. Keep the record of what the agent did and drop the outputs it got back. JetBrains tested exactly that in 2025 on 500 SWE-bench Verified tasks. They compared two ways of shrinking a transcript: run the history through an LLM and use the summary, or replace old tool outputs with a short placeholder while keeping the tool name and arguments intact.

Masking tied or beat summarization on 4 of the 5 models tested. On Qwen3-Coder 480B it landed at 54.8% solve rate against summarization's 53.8%, and did it at 52% of the cost. Summarization also made the agents run longer. Roughly 15% more turns, because the summaries blurred how badly earlier attempts had failed and the agents kept retrying bad paths.

"The simple approach of observation masking wasn't just cheaper; it often matched or even slightly beat LLM summarization in solving benchmark tasks." Lindenbauer & Fraser · JetBrains Research · 2025

Masking means stripping the tool_result bodies while leaving every tool_use call (name and arguments) alone. That's what safe mode does to older turns. slim goes further and drops the old tool_use metadata too; the paper didn't test that, so treat it as unvalidated rather than broken.

The hierarchy

What claudecompress keeps, what it drops.

Component	Turn 0-1	Turn 6-15	Turn 50+	Re-derivable
User text messages	95	55	20	None
Thinking (Opus 4.5+)	80	30	5	High
Tool use blocks	60	30	5	Medium
Read results	90	15	2	High
Bash results	85	20	2	High
Grep / Glob	75	5	0	High

Green = keep, amber = consider trimming, coral = safe to compress.

safe (last 5) keeps the last five user turns completely intact. Tool outputs, thinking, everything. Your recent working memory stays put.
Older turns turn into a dialog trail. User text, assistant text, the name and arguments of every tool call. Enough to know what was done, not enough to re-read the results.
The heavy bodies go. File contents, command output, search results. If the model needs them again it runs the tool again.

Context rot

More context isn't always better.

Chroma ran a 2025 study across 18 frontier models. They padded a focused prompt with unrelated conversation history and watched accuracy fall by 30-60% on LongMemEval. Claude models showed the biggest drop, which sounds damning until you look closer. They weren't hallucinating more. They were refusing to commit. Once the context got noisy enough, they'd stop answering rather than guess.

"Dumb zone kicks in around ~40% context. Shoot to keep it under 40%, and if you get up to 60%, think about wrapping it up." Boris Cherny · Claude Code · Anthropic

Trimming is usually framed as a cost question, but that's not the whole story. A shorter session often answers better than a bloated one, because there's less noise competing for the model's attention. A full context window isn't free just because you paid for the tokens. (And on the cost side, independent analysis suggests cache reads don't count against the rate-limit block on subscription plans — only cold reads do, which sharpens the case for keeping /resume small.)

Mode selection

Which mode to pick, and when.

Numbers below are from one real session: 761k tokens, 153 user turns, Opus 4.6.

Mode	% saved	Quality risk	Use when
safe (default)	32.8%	low	Continuing the same task — research-aligned
smart	45.3%	low-med	Middle ground — per-component truncation, skeleton survives
slim	71.5%	med	Big session, cost-sensitive, pivoting topics
archive	83.5%	high	Historical only — not for continuing work

"Quality risk" is a judgment call, not a measurement. safe follows the JetBrains masking pattern, which is the one that's actually been tested at scale. smart applies per-component rules by turn depth: Read results drop past 15 turns, Bash truncates, Agent results stay longer (they're already summarized), and tool_use metadata always hangs around as a skeleton. slim drops older tool_use too. No one has shown that hurts performance, but no one has shown it doesn't either.

A case for slim anyway. Chroma's Context Rot work shows that tokens which look topically relevant but aren't can actively mislead the model. If you've pivoted the conversation to a new file or a new task, stale tool_use calls pointing at the old paths are exactly that kind of distractor. JetBrains ran their tests on SWE-bench, where older tool calls tend to stay topic-relevant, so their data doesn't cover the pivot case. In a session where you've clearly changed direction, slim's more aggressive cut might actually help. No public benchmark has tested it directly, so we can't say for sure. safe is the default because it matches a pattern someone actually ran the numbers on. That doesn't make slim wrong, just less studied.

smart rule table

Per-component, per-depth.

The smart rules follow the curves directly. Parts that hold their value at depth (user text, tool_use metadata) survive all the way through. Parts that fall off a cliff (Read, Bash, Grep) get truncated in the middle band and dropped in the tail. Agent results live longer because they already arrived as summaries.

Component	0-5 turns	6-15 turns	16+ turns
User text	keep	keep	truncate 600
Assistant text	truncate 800	truncate 300	drop
Thinking (Opus 4.5+)	truncate 500	drop	drop
`tool_use` (name + args)	keep	keep	keep
Read results	truncate 1500	truncate 300	drop
Bash results	truncate 800	truncate 200	drop
Grep / Glob	truncate 400	drop	drop
Edit confirmations	truncate 150	truncate 80	truncate 80
Agent results	keep	truncate 600	truncate 200
MCP browser	truncate 200	drop	drop

The numbers are character limits for truncate N. tool_use is always kept, so you can always see which files the agent touched. That's the main thing separating smart from slim. Even inside the recent five turns, long prose blocks like thinking or large assistant replies still get truncated; the model rarely depends on reading back its own essay. Images are dropped regardless of depth.

On thinking blocks

A default that quietly flipped.

Through Claude 4.4, the API quietly threw away prior thinking blocks between turns. Dropping them cost nothing because they weren't going to the model anyway. Opus 4.5 flipped the default. Now thinking blocks carry forward, and there's an open Claude Code issue about the docs still being out of date.

claudecompress handles the flip this way: even when you ask it to drop thinking, the recent window stays intact. Within the last N user turns, nothing gets touched. Older thinking still goes, because once a few turns have passed, the model rarely needs to reread its earlier reasoning.

Caveats

What this theory doesn't prove.

SWE-bench measures whether an agent finished a task. What claudecompress cares about is whether the agent picks up coherently after you /resume. Those aren't the same metric, and the optima might not line up. JetBrains' result is a strong proxy, not a direct test of this use case.
There's no controlled A/B test on claudecompress itself. The token savings are measured and real; the quality story leans on adjacent research rather than an end-to-end experiment.
The original Lost in the Middle paper from 2023 tested GPT-3.5 and Claude 1.3. Modern Opus scores 78.3% on MRCR v2 at 1M tokens, which is a different world. The effect hasn't vanished, though, especially for Sonnet at very long context.
The relevance numbers on this page mix published findings with reasoned guesses. Treat them as a starting point for intuition, not a measurement. The whole approach also assumes you have tool access when you resume, so the agent can re-read anything it needs.

References

Liu et al. 2023. Lost in the Middle. arXiv:2307.03172
Xiao et al. 2023. Attention Sinks. arXiv:2309.17453
Lindenbauer & Fraser 2025. The Complexity Trap: Observation Masking. arXiv:2508.21433
Chroma 2025. Context Rot. chroma
Factory AI 2025. Evaluating compression. factory.ai
Anthropic. Effective context engineering for AI agents. anthropic.com
Anthropic. Extended thinking. platform.claude.com
Anthropic. Context editing. platform.claude.com
Boris Cherny. How Boris uses Claude Code. howborisusesclaudecode.com
Blake Crosley. Context window management. blakecrosley.com
MRCR v2 Leaderboard. llm-stats.com
SWE-bench Verified. swebench.com