claudecompress v0.16.8 · theory · github · npm
← back
Theory

Not all tokens age the same.

An instruction you typed an hour ago still matters. A file the model read twenty turns ago probably doesn't. Long sessions get expensive because most tools treat both of them as if they still do.

Relevance decay

Every component ages on its own curve.

How context ages square-root scale — early turns get more room
Relevance 0 5 10 20 50 100 Turns ago →
User text — anchors, doesn't fully decay
Tool use blocks — the action breadcrumb
Read results — fall off fast
Bash results — fall off faster
Grep / Glob — go stale immediately
Thinking — depends on the model

Most of what's in a long session is already dead weight. Read results and command output collapse within the first five to ten turns. Grep and Glob are stale almost as soon as they arrive. Tool-use calls (the names and arguments of what the agent did) hang around longer because they're the skeleton of the work, but even they fade past turn twenty or so. The green user-text curve is the outlier: original instructions still anchor the session at turn 100, because they're the only record of what the job actually is. Thinking is dashed because Opus 4.5 changed how it's handled — older models strip thinking between turns, newer ones keep it, so the line only applies if you're on 4.5 or up. The shapes are informed by the JetBrains masking study, Chroma's context rot work, and Anthropic's own docs. The exact heights are our best-guess estimates, not benchmarked numbers.

The evidence

Masking holds up against summarizing. For less money.

If the shapes above hold, the fix is straightforward. Keep the record of what the agent did and drop the outputs it got back. JetBrains tested exactly that in 2025 on 500 SWE-bench Verified tasks. They compared two ways of shrinking a transcript: run the history through an LLM and use the summary, or replace old tool outputs with a short placeholder while keeping the tool name and arguments intact.

Masking tied or beat summarization on 4 of the 5 models tested. On Qwen3-Coder 480B it landed at 54.8% solve rate against summarization's 53.8%, and did it at 52% of the cost. Summarization also made the agents run longer. Roughly 15% more turns, because the summaries blurred how badly earlier attempts had failed and the agents kept retrying bad paths.

"The simple approach of observation masking wasn't just cheaper; it often matched or even slightly beat LLM summarization in solving benchmark tasks." Lindenbauer & Fraser · JetBrains Research · 2025

Masking means stripping the tool_result bodies while leaving every tool_use call (name and arguments) alone. That's what safe mode does to older turns. slim goes further and drops the old tool_use metadata too; the paper didn't test that, so treat it as unvalidated rather than broken.

The hierarchy

What claudecompress keeps, what it drops.

Component Turn 0-1 Turn 6-15 Turn 50+ Re-derivable
User text messages955520None
Thinking (Opus 4.5+)80305High
Tool use blocks60305Medium
Read results90152High
Bash results85202High
Grep / Glob7550High

Green = keep, amber = consider trimming, coral = safe to compress.

  • safe (last 5) keeps the last five user turns completely intact. Tool outputs, thinking, everything. Your recent working memory stays put.
  • Older turns turn into a dialog trail. User text, assistant text, the name and arguments of every tool call. Enough to know what was done, not enough to re-read the results.
  • The heavy bodies go. File contents, command output, search results. If the model needs them again it runs the tool again.
Context rot

More context isn't always better.

Chroma ran a 2025 study across 18 frontier models. They padded a focused prompt with unrelated conversation history and watched accuracy fall by 30-60% on LongMemEval. Claude models showed the biggest drop, which sounds damning until you look closer. They weren't hallucinating more. They were refusing to commit. Once the context got noisy enough, they'd stop answering rather than guess.

"Dumb zone kicks in around ~40% context. Shoot to keep it under 40%, and if you get up to 60%, think about wrapping it up." Boris Cherny · Claude Code · Anthropic

Trimming is usually framed as a cost question, but that's not the whole story. A shorter session often answers better than a bloated one, because there's less noise competing for the model's attention. A full context window isn't free just because you paid for the tokens. (And on the cost side, independent analysis suggests cache reads don't count against the rate-limit block on subscription plans — only cold reads do, which sharpens the case for keeping /resume small.)

Mode selection

Which mode to pick, and when.

Numbers below are from one real session: 761k tokens, 153 user turns, Opus 4.6.

Mode % saved Quality risk Use when
safe (default)32.8%lowContinuing the same task — research-aligned
smart45.3%low-medMiddle ground — per-component truncation, skeleton survives
slim71.5%medBig session, cost-sensitive, pivoting topics
archive83.5%highHistorical only — not for continuing work

"Quality risk" is a judgment call, not a measurement. safe follows the JetBrains masking pattern, which is the one that's actually been tested at scale. smart applies per-component rules by turn depth: Read results drop past 15 turns, Bash truncates, Agent results stay longer (they're already summarized), and tool_use metadata always hangs around as a skeleton. slim drops older tool_use too. No one has shown that hurts performance, but no one has shown it doesn't either.

A case for slim anyway. Chroma's Context Rot work shows that tokens which look topically relevant but aren't can actively mislead the model. If you've pivoted the conversation to a new file or a new task, stale tool_use calls pointing at the old paths are exactly that kind of distractor. JetBrains ran their tests on SWE-bench, where older tool calls tend to stay topic-relevant, so their data doesn't cover the pivot case. In a session where you've clearly changed direction, slim's more aggressive cut might actually help. No public benchmark has tested it directly, so we can't say for sure. safe is the default because it matches a pattern someone actually ran the numbers on. That doesn't make slim wrong, just less studied.

smart rule table

Per-component, per-depth.

The smart rules follow the curves directly. Parts that hold their value at depth (user text, tool_use metadata) survive all the way through. Parts that fall off a cliff (Read, Bash, Grep) get truncated in the middle band and dropped in the tail. Agent results live longer because they already arrived as summaries.

Component0-5 turns6-15 turns16+ turns
User textkeepkeeptruncate 600
Assistant texttruncate 800truncate 300drop
Thinking (Opus 4.5+)truncate 500dropdrop
tool_use (name + args)keepkeepkeep
Read resultstruncate 1500truncate 300drop
Bash resultstruncate 800truncate 200drop
Grep / Globtruncate 400dropdrop
Edit confirmationstruncate 150truncate 80truncate 80
Agent resultskeeptruncate 600truncate 200
MCP browsertruncate 200dropdrop

The numbers are character limits for truncate N. tool_use is always kept, so you can always see which files the agent touched. That's the main thing separating smart from slim. Even inside the recent five turns, long prose blocks like thinking or large assistant replies still get truncated; the model rarely depends on reading back its own essay. Images are dropped regardless of depth.

On thinking blocks

A default that quietly flipped.

Through Claude 4.4, the API quietly threw away prior thinking blocks between turns. Dropping them cost nothing because they weren't going to the model anyway. Opus 4.5 flipped the default. Now thinking blocks carry forward, and there's an open Claude Code issue about the docs still being out of date.

claudecompress handles the flip this way: even when you ask it to drop thinking, the recent window stays intact. Within the last N user turns, nothing gets touched. Older thinking still goes, because once a few turns have passed, the model rarely needs to reread its earlier reasoning.

Caveats

What this theory doesn't prove.

  • SWE-bench measures whether an agent finished a task. What claudecompress cares about is whether the agent picks up coherently after you /resume. Those aren't the same metric, and the optima might not line up. JetBrains' result is a strong proxy, not a direct test of this use case.
  • There's no controlled A/B test on claudecompress itself. The token savings are measured and real; the quality story leans on adjacent research rather than an end-to-end experiment.
  • The original Lost in the Middle paper from 2023 tested GPT-3.5 and Claude 1.3. Modern Opus scores 78.3% on MRCR v2 at 1M tokens, which is a different world. The effect hasn't vanished, though, especially for Sonnet at very long context.
  • The relevance numbers on this page mix published findings with reasoned guesses. Treat them as a starting point for intuition, not a measurement. The whole approach also assumes you have tool access when you resume, so the agent can re-read anything it needs.
References
  1. Liu et al. 2023. Lost in the Middle. arXiv:2307.03172
  2. Xiao et al. 2023. Attention Sinks. arXiv:2309.17453
  3. Lindenbauer & Fraser 2025. The Complexity Trap: Observation Masking. arXiv:2508.21433
  4. Chroma 2025. Context Rot. chroma
  5. Factory AI 2025. Evaluating compression. factory.ai
  6. Anthropic. Effective context engineering for AI agents. anthropic.com
  7. Anthropic. Extended thinking. platform.claude.com
  8. Anthropic. Context editing. platform.claude.com
  9. Boris Cherny. How Boris uses Claude Code. howborisusesclaudecode.com
  10. Blake Crosley. Context window management. blakecrosley.com
  11. MRCR v2 Leaderboard. llm-stats.com
  12. SWE-bench Verified. swebench.com