Not all tokens age the same.
An instruction you typed an hour ago still matters. A file the model read twenty turns ago probably doesn't. Long sessions get expensive because most tools treat both of them as if they still do.
Every component ages on its own curve.
Most of what's in a long session is already dead weight. Read results and command output collapse within the first five to ten turns. Grep and Glob are stale almost as soon as they arrive. Tool-use calls (the names and arguments of what the agent did) hang around longer because they're the skeleton of the work, but even they fade past turn twenty or so. The green user-text curve is the outlier: original instructions still anchor the session at turn 100, because they're the only record of what the job actually is. Thinking is dashed because Opus 4.5 changed how it's handled — older models strip thinking between turns, newer ones keep it, so the line only applies if you're on 4.5 or up. The shapes are informed by the JetBrains masking study, Chroma's context rot work, and Anthropic's own docs. The exact heights are our best-guess estimates, not benchmarked numbers.
Masking holds up against summarizing. For less money.
If the shapes above hold, the fix is straightforward. Keep the record of what the agent did and drop the outputs it got back. JetBrains tested exactly that in 2025 on 500 SWE-bench Verified tasks. They compared two ways of shrinking a transcript: run the history through an LLM and use the summary, or replace old tool outputs with a short placeholder while keeping the tool name and arguments intact.
Masking tied or beat summarization on 4 of the 5 models tested. On Qwen3-Coder 480B it landed at 54.8% solve rate against summarization's 53.8%, and did it at 52% of the cost. Summarization also made the agents run longer. Roughly 15% more turns, because the summaries blurred how badly earlier attempts had failed and the agents kept retrying bad paths.
"The simple approach of observation masking wasn't just cheaper; it often matched or even slightly beat LLM summarization in solving benchmark tasks." Lindenbauer & Fraser · JetBrains Research · 2025
Masking means stripping the tool_result bodies while leaving every tool_use call (name and arguments) alone. That's what safe mode does to older turns. slim goes further and drops the old tool_use metadata too; the paper didn't test that, so treat it as unvalidated rather than broken.
What claudecompress keeps, what it drops.
| Component | Turn 0-1 | Turn 6-15 | Turn 50+ | Re-derivable |
|---|---|---|---|---|
| User text messages | 95 | 55 | 20 | None |
| Thinking (Opus 4.5+) | 80 | 30 | 5 | High |
| Tool use blocks | 60 | 30 | 5 | Medium |
| Read results | 90 | 15 | 2 | High |
| Bash results | 85 | 20 | 2 | High |
| Grep / Glob | 75 | 5 | 0 | High |
Green = keep, amber = consider trimming, coral = safe to compress.
- safe (last 5) keeps the last five user turns completely intact. Tool outputs, thinking, everything. Your recent working memory stays put.
- Older turns turn into a dialog trail. User text, assistant text, the name and arguments of every tool call. Enough to know what was done, not enough to re-read the results.
- The heavy bodies go. File contents, command output, search results. If the model needs them again it runs the tool again.
More context isn't always better.
Chroma ran a 2025 study across 18 frontier models. They padded a focused prompt with unrelated conversation history and watched accuracy fall by 30-60% on LongMemEval. Claude models showed the biggest drop, which sounds damning until you look closer. They weren't hallucinating more. They were refusing to commit. Once the context got noisy enough, they'd stop answering rather than guess.
"Dumb zone kicks in around ~40% context. Shoot to keep it under 40%, and if you get up to 60%, think about wrapping it up." Boris Cherny · Claude Code · Anthropic
Trimming is usually framed as a cost question, but that's not the whole story. A shorter session often answers better than a bloated one, because there's less noise competing for the model's attention. A full context window isn't free just because you paid for the tokens. (And on the cost side, independent analysis suggests cache reads don't count against the rate-limit block on subscription plans — only cold reads do, which sharpens the case for keeping /resume small.)
Which mode to pick, and when.
Numbers below are from one real session: 761k tokens, 153 user turns, Opus 4.6.
| Mode | % saved | Quality risk | Use when |
|---|---|---|---|
| safe (default) | 32.8% | low | Continuing the same task — research-aligned |
| smart | 45.3% | low-med | Middle ground — per-component truncation, skeleton survives |
| slim | 71.5% | med | Big session, cost-sensitive, pivoting topics |
| archive | 83.5% | high | Historical only — not for continuing work |
"Quality risk" is a judgment call, not a measurement. safe follows the JetBrains masking pattern, which is the one that's actually been tested at scale. smart applies per-component rules by turn depth: Read results drop past 15 turns, Bash truncates, Agent results stay longer (they're already summarized), and tool_use metadata always hangs around as a skeleton. slim drops older tool_use too. No one has shown that hurts performance, but no one has shown it doesn't either.
A case for slim anyway. Chroma's Context Rot work shows that tokens which look topically relevant but aren't can actively mislead the model. If you've pivoted the conversation to a new file or a new task, stale tool_use calls pointing at the old paths are exactly that kind of distractor. JetBrains ran their tests on SWE-bench, where older tool calls tend to stay topic-relevant, so their data doesn't cover the pivot case. In a session where you've clearly changed direction, slim's more aggressive cut might actually help. No public benchmark has tested it directly, so we can't say for sure. safe is the default because it matches a pattern someone actually ran the numbers on. That doesn't make slim wrong, just less studied.
Per-component, per-depth.
The smart rules follow the curves directly. Parts that hold their value at depth (user text, tool_use metadata) survive all the way through. Parts that fall off a cliff (Read, Bash, Grep) get truncated in the middle band and dropped in the tail. Agent results live longer because they already arrived as summaries.
| Component | 0-5 turns | 6-15 turns | 16+ turns |
|---|---|---|---|
| User text | keep | keep | truncate 600 |
| Assistant text | truncate 800 | truncate 300 | drop |
| Thinking (Opus 4.5+) | truncate 500 | drop | drop |
tool_use (name + args) | keep | keep | keep |
| Read results | truncate 1500 | truncate 300 | drop |
| Bash results | truncate 800 | truncate 200 | drop |
| Grep / Glob | truncate 400 | drop | drop |
| Edit confirmations | truncate 150 | truncate 80 | truncate 80 |
| Agent results | keep | truncate 600 | truncate 200 |
| MCP browser | truncate 200 | drop | drop |
The numbers are character limits for truncate N. tool_use is always kept, so you can always see which files the agent touched. That's the main thing separating smart from slim. Even inside the recent five turns, long prose blocks like thinking or large assistant replies still get truncated; the model rarely depends on reading back its own essay. Images are dropped regardless of depth.
A default that quietly flipped.
Through Claude 4.4, the API quietly threw away prior thinking blocks between turns. Dropping them cost nothing because they weren't going to the model anyway. Opus 4.5 flipped the default. Now thinking blocks carry forward, and there's an open Claude Code issue about the docs still being out of date.
claudecompress handles the flip this way: even when you ask it to drop thinking, the recent window stays intact. Within the last N user turns, nothing gets touched. Older thinking still goes, because once a few turns have passed, the model rarely needs to reread its earlier reasoning.
What this theory doesn't prove.
- SWE-bench measures whether an agent finished a task. What claudecompress cares about is whether the agent picks up coherently after you
/resume. Those aren't the same metric, and the optima might not line up. JetBrains' result is a strong proxy, not a direct test of this use case. - There's no controlled A/B test on claudecompress itself. The token savings are measured and real; the quality story leans on adjacent research rather than an end-to-end experiment.
- The original Lost in the Middle paper from 2023 tested GPT-3.5 and Claude 1.3. Modern Opus scores 78.3% on MRCR v2 at 1M tokens, which is a different world. The effect hasn't vanished, though, especially for Sonnet at very long context.
- The relevance numbers on this page mix published findings with reasoned guesses. Treat them as a starting point for intuition, not a measurement. The whole approach also assumes you have tool access when you resume, so the agent can re-read anything it needs.
- Liu et al. 2023. Lost in the Middle. arXiv:2307.03172
- Xiao et al. 2023. Attention Sinks. arXiv:2309.17453
- Lindenbauer & Fraser 2025. The Complexity Trap: Observation Masking. arXiv:2508.21433
- Chroma 2025. Context Rot. chroma
- Factory AI 2025. Evaluating compression. factory.ai
- Anthropic. Effective context engineering for AI agents. anthropic.com
- Anthropic. Extended thinking. platform.claude.com
- Anthropic. Context editing. platform.claude.com
- Boris Cherny. How Boris uses Claude Code. howborisusesclaudecode.com
- Blake Crosley. Context window management. blakecrosley.com
- MRCR v2 Leaderboard. llm-stats.com
- SWE-bench Verified. swebench.com