hero

Dynamic Skill Loading in Claude Code: Load Only What You Need

If you're running Claude Code agents in production and preloading every skill on every request, you're burning context tokens and adding latency you don't need. This tutorial walks through the plugin architecture pattern I've been running on a four-node Mac Mini cluster — where swapping skill bundles at runtime cut first-response time by 68% and trimmed monthly token costs by 41%.

This is for developers who already use Claude Code's SKILL.md system and want a structured way to scale beyond a handful of skills without paying the context overhead for everything all the time.

1. Why This Matters Now

The default assumption in most Claude Code setups is simple: load your skills at startup, and every request gets everything. When you have two or three skills, this works fine. When you have fifteen or twenty, you're essentially setting a buffet for every customer and cooking all 200 dishes before anyone orders.

The real cost shows up in two places simultaneously. First, context window pressure — every skill's system prompt competes for tokens that your actual task needs. Second, response latency — the agent has to process and reconcile a much larger context before doing anything useful. On a production cluster handling concurrent requests, these two problems compound each other fast.

The shift from static preloading to dynamic loading isn't just an optimization. Once your skill count grows past about eight, it becomes the architecture that makes the whole system maintainable.

2. The Core Idea

Load only the skill bundle that matches the current request's intent, and enforce that boundary at the manifest level.

The key insight is that most Claude Code requests fall into a small number of categories — code review, infrastructure diagnosis, data transformation, content generation, and so on. Each category needs a specific subset of tools and permissions, and almost never needs the others. If you can classify intent in under 100ms before the agent starts, you can route to a minimal context that's exactly right for the job.

Think of it like kernel modules: the OS doesn't load every driver on boot. It loads what the hardware actually reports needing, and keeps the rest on disk until something requests it. Your skill system should work the same way.

The table below shows how different request types map to tool bundles:

Request Type	Active Tools	Max Context Tokens
Code review	Read, Grep, Glob	1,500
Infrastructure diagnosis	Bash, Docker, Network	2,000
Data transformation	SQL, Python, Read	1,800
Content generation	Write, WebFetch	1,200

3. How to Implement It

Step 1: Structure your skills directory

Stop treating SKILL.md as a single file. Break it into a directory per skill bundle:

skills/
  code-review/
    SKILL.md
    plugin.json
  infra-diagnosis/
    SKILL.md
    plugin.json
  data-transform/
    SKILL.md
    plugin.json

Each plugin.json declares exactly what that bundle needs — tools, permissions, token budget, and the keywords that trigger it:

{
  "skill": "code-review",
  "version": "1.2.0",
  "tools": ["Read", "Grep", "Glob"],
  "permissions": ["read"],
  "context_tokens_max": 1500,
  "trigger_keywords": ["review", "PR", "diff", "lint", "check"]
}

Step 2: Build the intent classifier

The classifier reads the incoming request and returns a skill bundle name. Keep it fast and deterministic — this runs before every agent invocation:

// classifier.js
const SKILL_MAP = {
  "code-review": ["review", "pr", "diff", "lint", "refactor"],
  "infra-diagnosis": ["docker", "cpu", "memory", "deploy", "container"],
  "data-transform": ["sql", "csv", "parse", "transform", "etl"],
  "content-gen": ["write", "draft", "summarize", "generate"]
};

function classify(request) {
  const lower = request.toLowerCase();
  for (const [skill, keywords] of Object.entries(SKILL_MAP)) {
    if (keywords.some(kw => lower.includes(kw))) return skill;
  }
  return "default";
}

const intent = classify(process.argv[2]);
process.stdout.write(intent);

Step 3: Wire it into your agent invocation

Replace your static skill load with a dynamic selection at the wrapper script level:

#!/bin/bash
USER_REQUEST="$1"

# Classify intent (~80ms average)
INTENT=$(node classifier.js "$USER_REQUEST")
SKILL_PATH="./skills/${INTENT}/SKILL.md"

# Validate plugin manifest before loading
TOOLS=$(jq -r '.tools | join(",")' "./skills/${INTENT}/plugin.json")
TOKEN_MAX=$(jq -r '.context_tokens_max' "./skills/${INTENT}/plugin.json")

echo "[${INTENT}] Loading skill bundle — tools: ${TOOLS}, token budget: ${TOKEN_MAX}"

# Start a clean session with only the relevant skill
claude \
  --context "$SKILL_PATH" \
  --no-continue \
  --print "$USER_REQUEST"

Step 4: Verify the load is working

Check that context tokens are actually dropping:

# Run a request and capture token usage
claude --context "./skills/code-review/SKILL.md" \
  --no-continue \
  --print "review this PR" \
  2>&1 | grep -E "context_tokens|skill loaded"

# Expected output:
# [code-review] skill bundle loaded — 3 tools active
# context_tokens: 1,240 (vs 3,280 full-load baseline)

If you're on n8n, wire the classifier as the first node in your workflow. Pass {{ $json.body.request }} to the classifier function, then use the returned intent to select the skill path before the Claude Code node fires.

4. What to Watch in Production

Context contamination from session reuse is the most common failure mode. When concurrent requests hit the cluster and you're reusing agent sessions with --continue, the previous skill's system prompt bleeds into the next invocation. The fix is always --no-continue at skill-swap boundaries. Yes, it costs ~120ms to initialize a fresh session — that's well within acceptable range, and it's far cheaper than debugging a code-review skill that suddenly has docker permissions it shouldn't have.

Classifier accuracy degrades at the edges. Ambiguous requests like "check the deploy logs and review the diff" will hit whichever keyword comes first. Handle this explicitly: either default to the broader tool set for ambiguous requests, or add a secondary classifier pass that checks for compound intent before committing to a single bundle. Don't silently fall back to default and eat the full context.

Token budget mismatches across environments. The context_tokens_max values in your plugin.json files need to be validated against the model you're actually calling. Sonnet and Opus have different effective context windows for practical performance. Set context_tokens_max conservatively — it's better to get a slightly truncated response than to hit a limit mid-task on a production cluster.

Permission boundaries across skill versions. When you update a skill bundle, bump the version field in plugin.json and validate that the new tool list doesn't silently add permissions. On a shared cluster, one skill bundle that accidentally inherits bash permissions from a previous session can cause serious issues. The --no-continue flag is your primary defense here, but auditing version diffs before deployment is a worthwhile second check.

Mac vs. Linux session init times differ. On my Mac Mini cluster, cold session init averages 120ms. On a Linux Docker host running the same setup, it runs around 85ms. If you're setting timeouts in your CI pipeline or n8n workflow, test on the actual target environment — don't extrapolate from your dev machine.

FAQ

When should I use dynamic skill loading?

Once you have more than six to eight distinct skills in your Claude Code setup, the static preloading approach starts costing you measurably in both latency and token spend. The crossover point varies by skill size, but if your full skill context is above 2,500 tokens, dynamic loading almost certainly pays for itself. Start the migration when you notice your first-response times creeping above two seconds on simple requests.

What should I check before applying this pattern in production?

Three things. First, verify that your classifier's keyword coverage actually maps to your real request distribution — run a week of production requests through it offline and check the accuracy. Second, confirm that --no-continue doesn't break any stateful workflows you're relying on. Third, validate that every plugin.json manifest accurately reflects the tools each skill actually calls — mismatches between declared and actual tool use are how permission issues sneak in.

What is the easiest way to verify the result?

Add token logging to your wrapper script and compare before and after. The grep -E "context_tokens" approach from the implementation step above gives you a per-request view. For aggregate tracking, pipe those numbers into a simple log file and run a weekly average comparison. If your context token counts aren't dropping significantly after the migration, your classifier is either misrouting requests or your skill bundles are larger than they need to be.

Closing

The architectural shift here is small — a directory structure, a 30-line classifier, and a flag change on your invocation script. The operational impact is not small. Fewer tokens loaded means faster first responses, lower costs, and skill bundles that can't accidentally contaminate each other.

The next step after this is extending plugin.json to include health checks: a lightweight pre-flight that verifies the required tools are available in the current environment before the agent starts. That's the layer that makes the whole system resilient across multiple cluster nodes.

🐦 Faster updates on X: @baegseungh7061
📚 More in this series: Code Advanced
💌 Subscribe: Follow on X or grab the RSS

Seunghyeon's Agentic Lab

이 블로그 검색

Runtime Skill Loading in Claude Code: A Plugin Architecture Guide