Cut Tokens by 75% Without Losing Technical Accuracy

If you're running Claude Code heavily — dozens of tool calls, long sessions, big codebases — your token bill climbs fast. This post covers a technique I've been using: a response compression mode I call caveman, which strips filler language from Claude's replies while keeping every technical detail intact. Four compression levels, real-time savings tracking, and it switches on with a single slash command.

The Problem: Claude Talks Too Much

Claude's default responses are polished, conversational, and thorough — which is great for one-off questions but brutal for extended sessions. When you're running a long coding session, every "Certainly! Here's what I think is happening..." preamble burns tokens you don't need.

Here's what a typical unhelpful preamble looks like in practice:

"Great question! I'd be happy to help you debug this. Let me walk you 
through what I think might be going on with your Dockerfile. First, 
I want to make sure I understand the problem correctly..."

None of that moves the task forward. The actual fix is two lines of shell. You're paying for fluff.

The first thing I tried was just telling Claude in the system prompt to "be concise." That works — for about three turns. Then it creeps back. The model has strong stylistic priors, and a single instruction doesn't override them consistently across a long session.

token waste before compression

The Fix: `/caveman` Skill

The caveman skill enforces a compression contract at the skill level, not just as a prompt hint. When you activate it, the model switches to a stripped response style where every token has to earn its place.

Activate it with:

/caveman

That's the default full mode. You can also pass a level explicitly:

/caveman lite    # Light trim — removes preamble, keeps full sentences
/caveman full    # Default — short sentences, no filler, no pleasantries
/caveman ultra   # Minimum viable output — almost telegraphic
/caveman wenyan  # Classical/formal compression — dense, no contractions

Here's the same build failure question under ultra mode:

build fail: line 12 COPY before RUN apt. swap order.

Same information. 8 tokens instead of ~80. If you're running 200+ exchanges per session, that difference is significant.

compression level comparison

Checking Your Savings: `/caveman-stats`

After you've run a session with caveman active, you can pull a real-time summary:

/caveman-stats

Output looks like this:

Session stats
─────────────────────────────
Mode active   : full
Turns tracked : 47
Tokens saved  : 3,841
Est. USD saved: $0.038
Compression   : 74.2%

The per-session tracking is genuinely useful. I had one refactoring session where I saved ~$0.11 — not life-changing on its own, but across a week of daily sessions it adds up, and more importantly it validates that the compression is actually working consistently.

What I found interesting: the wenyan mode (modeled on classical Chinese written form — terse, no redundancy) ends up being close to ultra in token count but feels more readable than telegraphic shorthand. If ultra feels too choppy for you, try wenyan as a middle ground.

stats command flow

Variations and Gotchas

The 100% technical accuracy claim holds — with one caveat. In ultra mode, Claude will sometimes drop explanation context that's actually useful. If you hit a non-obvious bug, the ultra output might give you the fix without the why, and you'll want to know the why for next time. I tend to use full as my daily driver and drop to ultra only for mechanical tasks (file renaming, boilerplate generation, format conversion).

It doesn't affect tool call payloads. The compression targets Claude's prose replies, not the structured JSON in tool calls. Your bash commands, file edits, and search results come back at full fidelity regardless of mode.

Mac vs. Linux vs. Docker: No environment differences here — this is purely a prompt-layer behavior change. The /caveman skill works the same everywhere Claude Code runs.

Switching modes mid-session works cleanly. You can start full, realize you need more context for a tricky debugging stretch, switch to lite, then go back. Stats accumulate across the whole session regardless of mode switches.

The wenyan mode is genuinely different. It's not just shorter — the sentence construction changes. Here's a side-by-side:

Mode	Example output
lite	"The issue is in the Dockerfile. Move your RUN apt-get line before the COPY."
full	"Dockerfile error: COPY before RUN apt-get. Swap lines 11–12."
ultra	"line 11-12: swap COPY/RUN order"
wenyan	"Error resides in ordering. Place RUN before COPY. Correct and rebuild."

All four convey the same fix. Token counts: ~24 / ~13 / ~7 / ~14. Your preference will depend on whether you find telegraphic output readable.

Closing

The core insight here isn't "write shorter prompts" — it's that response verbosity is a mode you can control persistently, not just something to hint at once and hope sticks. The caveman skill enforces that contract at the skill level, survives across turns, and gives you measurable feedback.

If you're doing any kind of long-session Claude Code work — big refactors, extended debugging, automated pipelines — turn this on at the start of the session and check your stats at the end. The 75% compression figure isn't a best case; in my sessions it's been consistently in the 65-78% range across modes.

Next up: I'm looking at whether the same approach applies to tool call inputs — not just replies. If you can compress the context you feed into each call, the savings stack further.

🐦 Faster updates on X: @baegseungh7061
📚 More in this series: All posts
💌 Subscribe: Follow on X or grab the RSS

Seunghyeon's Agentic Lab

이 블로그 검색