Automated News Summary Quality Scoring with LLM-as-Judge

If you're running a news summarization pipeline at any real volume, you already know the problem: how do you know if your summaries are actually good? Manual review doesn't scale. A/B testing against user clicks is a lagging indicator. What you need is a quality signal that runs inline, per batch, without a human in the loop.

LLM-as-Judge is the pattern that solves this. A second LLM evaluates the output of the first, returning structured scores across dimensions you define. Wired into your pipeline, it gives you per-article quality metrics, a feedback loop for regeneration, and a time-series you can alert on when quality drifts.

1. Why this matters now

The default assumption in most summarization pipelines is that if the model is good, the summaries are good. That assumption breaks the moment you change a prompt, swap a model version, or shift your input data distribution — and you find out weeks later when someone manually audits a batch.

The real pain is invisible regression. A summarizer that worked fine on tech articles starts hallucinating financial figures when you add an earnings-report feed. Composite metrics from user engagement won't catch this for days. A Judge running inline catches it in the same batch.

There's also the qualitative-to-quantitative problem. "This summary is biased" is a valid complaint but an unusable one. When you decompose quality into accuracy, conciseness, and neutrality — each scored on a 1–5 rubric — you get an actionable number attached to a rationale string. That's debuggable. That's something you can write an alert on.

The timing matters too. Anthropic's Claude model lineup now makes it practical to run a stronger model as Judge against a faster, cheaper model as Summarizer. The cost delta between claude-haiku-4-5 and claude-sonnet-4-5 is large enough that using Sonnet only for evaluation — and caching its system prompt — keeps total pipeline cost in a reasonable range.

2. The core idea

One model generates; a different, stronger model judges. The Judge returns structured JSON with per-dimension scores and a one-sentence rationale. You aggregate those scores over time, trigger regeneration when a summary falls below a threshold, and validate the Judge itself against a known golden set.

The pipeline has three stages: generation, evaluation, and logging. The models involved play three roles:

Role	Model	Purpose
Summarizer	`claude-haiku-4-5`	Fast, low-cost generation
Judge	`claude-sonnet-4-5`	Structured rubric evaluation
Validator	Same as Judge (CI only)	Golden-set calibration check

The critical rule: never use the same model for both roles. LLM judges tend to favor text that resembles their own output, so a model grading its own summary is likely to be lenient in exactly the cases where it made the mistake. You need an independent evaluator.

The Judge prompt is where most implementations go wrong. Vague criteria like "is this summary good?" produce inconsistent scores. Rubric-based prompts with explicit dimension definitions and a forced rationale field produce scores you can actually trust. The rationale field is load-bearing: it prevents the model from assigning arbitrary numbers by forcing it to articulate why.

3. How to implement it

Start with the Judge prompt. Every quality dimension needs a precise definition, and the output schema must be rigid.

JUDGE_SYSTEM = """
You are a strict news summarization evaluator.
Score the summary on three dimensions (each 1–5):

1. accuracy    – factual fidelity to the source article
2. conciseness – no redundant phrases; dense information per token
3. neutrality  – absence of editorial bias or emotional language

Return ONLY valid JSON:
{"accuracy": <int>, "conciseness": <int>, "neutrality": <int>, "rationale": "<one sentence>"}
"""

JUDGE_USER = """
SOURCE:
{source}

SUMMARY:
{summary}
"""

Next, the evaluation function. The composite score uses a weighted average — accuracy carries 50% because factual fidelity is non-negotiable in news; conciseness and neutrality split the remaining 50%.

import anthropic
import json
from dataclasses import dataclass

client = anthropic.Anthropic()

@dataclass
class EvalResult:
    accuracy: int
    conciseness: int
    neutrality: int
    rationale: str
    composite: float

    def passed(self, threshold: float = 3.5) -> bool:
        return self.composite >= threshold

def judge_summary(source: str, summary: str) -> EvalResult:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": JUDGE_USER.format(
            source=source, summary=summary
        )}],
    )
    data = json.loads(resp.content[0].text)
    composite = (data["accuracy"] * 0.5 +
                 data["conciseness"] * 0.25 +
                 data["neutrality"] * 0.25)
    return EvalResult(**data, composite=round(composite, 2))
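The json.loads call above assumes the Judge returns bare JSON with exactly the four expected keys. Below is a minimal sketch of a stricter parser, assuming replies may occasionally arrive wrapped in extra text or carry unexpected keys. The name parse_judge_reply is a hypothetical helper, not part of the Anthropic SDK.

```python
import json
import re

SCORE_KEYS = ("accuracy", "conciseness", "neutrality")

def parse_judge_reply(text: str) -> dict:
    """Extract and validate the Judge's JSON object (hypothetical helper)."""
    # Take the outermost {...} span so surrounding prose or fences are ignored.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in Judge reply")
    data = json.loads(match.group(0))

    clean = {}
    for key in SCORE_KEYS:
        score = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {score!r}")
        clean[key] = score

    rationale = data.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    clean["rationale"] = rationale.strip()
    # Unknown keys are dropped, so EvalResult(**clean, composite=...) cannot fail on them.
    return clean
```

Inside judge_summary, this would replace the direct json.loads call. A ValueError can then be treated like a failed evaluation and retried.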

For a fact-checking-heavy pipeline, push the accuracy weight to 0.7 and split 0.15/0.15 across the other two. Adjust once, reweight the composite, and your alerts reflect the actual priority of your domain.
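To make the reweighting concrete, here is a small sketch that pulls the weights out of judge_summary into a table. The names DEFAULT_WEIGHTS, FACT_CHECK_WEIGHTS, and composite_score are illustrative assumptions; the weight values are the ones given above.

```python
DEFAULT_WEIGHTS = {"accuracy": 0.5, "conciseness": 0.25, "neutrality": 0.25}
FACT_CHECK_WEIGHTS = {"accuracy": 0.7, "conciseness": 0.15, "neutrality": 0.15}

def composite_score(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted average of the per-dimension scores, rounded to 2 decimals."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return round(sum(scores[dim] * w for dim, w in weights.items()), 2)

# A fluent, neutral summary with a weak accuracy score:
scores = {"accuracy": 2, "conciseness": 5, "neutrality": 5}
print(composite_score(scores))                      # 3.5
print(composite_score(scores, FACT_CHECK_WEIGHTS))  # 2.9
```

With the default weights this summary just clears the 3.5 threshold. With the fact-check weights it fails, which is the behavior you want when accuracy dominates.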

Now wire in the regeneration loop. The key difference from a naive retry is that you inject the Judge's rationale back into the Summarizer as feedback — so the second attempt actually knows what failed.

MAX_RETRIES = 2

def summarize_with_eval(article: str) -> dict:
    summary = call_summarizer(article)
    for attempt in range(MAX_RETRIES + 1):
        result = judge_summary(article, summary)
        if result.passed():
            break
        if attempt < MAX_RETRIES:
            summary = call_summarizer(
                article,
                feedback=result.rationale
            )
    return {
        "summary": summary,
        "scores": result,
        "retried": attempt > 0,
    }

Your call_summarizer function needs to accept the optional feedback parameter and inject it into the Summarizer's system or user prompt — something like "Previous attempt failed evaluation: {feedback}. Revise accordingly."
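One possible shape for the prompt-building half of call_summarizer is sketched below. The helper name build_summarizer_request, the SUMMARIZER_SYSTEM text, and the max_tokens value are all assumptions for illustration; only the feedback sentence comes from the description above.

```python
from typing import Optional

# Assumed system prompt; substitute your real Summarizer prompt.
SUMMARIZER_SYSTEM = (
    "You are a news summarizer. "
    "Summarize the article in two or three neutral, factual sentences."
)

def build_summarizer_request(article: str, feedback: Optional[str] = None) -> dict:
    """Build keyword arguments for client.messages.create (hypothetical helper)."""
    user_content = f"ARTICLE:\n{article}"
    if feedback:
        # Only retries carry the Judge's rationale; the first attempt has none.
        user_content += (
            f"\n\nPrevious attempt failed evaluation: {feedback}\n"
            "Revise accordingly."
        )
    return {
        "model": "claude-haiku-4-5",
        "max_tokens": 512,  # assumed limit for a short summary
        "system": SUMMARIZER_SYSTEM,
        "messages": [{"role": "user", "content": user_content}],
    }
```

call_summarizer(article, feedback) would then reduce to client.messages.create(**build_summarizer_request(article, feedback)).content[0].text. Keeping the prompt construction separate makes it testable without an API call.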

For persistence, the schema below gives you everything you need for drift detection and model comparison:

CREATE TABLE eval_log (
    id          BIGSERIAL PRIMARY KEY,
    article_id  TEXT NOT NULL,
    model_sum   TEXT,
    model_judge TEXT,
    accuracy    SMALLINT,
    conciseness SMALLINT,
    neutrality  SMALLINT,
    composite   NUMERIC(3,2),
    retried     BOOLEAN,
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON eval_log (created_at, composite);
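Writing one row per evaluation could look like the sketch below. It assumes a psycopg-style driver that accepts %(name)s placeholders. INSERT_EVAL_SQL and eval_log_params are hypothetical names, and the model defaults mirror the ones used earlier.

```python
INSERT_EVAL_SQL = """
INSERT INTO eval_log
    (article_id, model_sum, model_judge,
     accuracy, conciseness, neutrality, composite, retried)
VALUES
    (%(article_id)s, %(model_sum)s, %(model_judge)s,
     %(accuracy)s, %(conciseness)s, %(neutrality)s, %(composite)s, %(retried)s)
"""

def eval_log_params(
    article_id: str,
    scores: dict,
    composite: float,
    retried: bool,
    model_sum: str = "claude-haiku-4-5",
    model_judge: str = "claude-sonnet-4-5",
) -> dict:
    """Map one evaluation onto the eval_log columns (hypothetical helper)."""
    return {
        "article_id": article_id,
        "model_sum": model_sum,
        "model_judge": model_judge,
        "accuracy": scores["accuracy"],
        "conciseness": scores["conciseness"],
        "neutrality": scores["neutrality"],
        "composite": composite,
        "retried": retried,
    }
```

With an open cursor, the write is cursor.execute(INSERT_EVAL_SQL, eval_log_params(...)). The schema above has no rationale column, so this sketch does not store the rationale. Add one if you want to debug low scores from the log alone.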

Verify the full loop end to end:

# Quick smoke test — run this before deploying to production
test_source = "The Federal Reserve held interest rates steady Wednesday..."
test_summary = "The Fed kept rates unchanged."

result = judge_summary(test_source, test_summary)
print(f"Composite: {result.composite}, Passed: {result.passed()}")
print(f"Rationale: {result.rationale}")
# No fixed expected value: Judge scores vary between runs. Confirm the JSON
# parses, each score is in 1–5, and the rationale refers to this summary.

4. What to watch in production

Validate the Judge, not just the summaries. The Judge is a model too — its behavior changes when Anthropic updates a model version. Maintain a golden set of known-good and known-bad summary pairs with expected score bounds, and run this in CI.

GOLDEN_SET = [
    {"source": "...", "summary": "Perfect factual summary", "expected_min": 4.2},
    {"source": "...", "summary": "Contains factual error", "expected_max": 2.5},
]

def validate_judge():
    # Each case sets expected_min or expected_max; the missing bound defaults to 0 or 5.
    for case in GOLDEN_SET:
        r = judge_summary(case["source"], case["summary"])
        assert r.composite >= case.get("expected_min", 0), f"Judge under-scoring: {r.rationale}"
        assert r.composite <= case.get("expected_max", 5), f"Judge over-scoring: {r.rationale}"

If the golden-set validation fails after a model update, that's a signal to recalibrate your rubric — not to trust the new scores blindly.
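For CI, it can help to see every out-of-bounds case in one run rather than stopping at the first failed assert. The sketch below is a variant under that assumption. check_golden_set is a hypothetical helper, and the judge is passed in as a function so the check can be exercised with a stub.

```python
from typing import Callable, List

def check_golden_set(golden_set: list, judge: Callable[[str, str], float]) -> List[str]:
    """Return one message per out-of-bounds case; an empty list means all passed."""
    failures = []
    for i, case in enumerate(golden_set):
        score = judge(case["source"], case["summary"])
        low = case.get("expected_min", 1.0)   # lowest possible composite
        high = case.get("expected_max", 5.0)  # highest possible composite
        if not low <= score <= high:
            failures.append(f"case {i}: composite {score} outside [{low}, {high}]")
    return failures

# Stub judge for illustration: scores any summary mentioning "error" as 2.0.
def stub_judge(source: str, summary: str) -> float:
    return 2.0 if "error" in summary else 4.5

golden = [
    {"source": "...", "summary": "Perfect factual summary", "expected_min": 4.2},
    {"source": "...", "summary": "Contains factual error", "expected_max": 2.5},
]
print(check_golden_set(golden, stub_judge))  # []
```

In the real pipeline the judge argument would be something like lambda s, m: judge_summary(s, m).composite, and CI would fail the build when the returned list is non-empty.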

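Control Judge cost with prompt caching. The JUDGE_SYSTEM prompt is identical across every call in a batch, so it is a natural candidate for a cached block. Two caveats apply. First, the API only caches prefixes above a model-specific minimum length (on the order of a thousand tokens or more), so a rubric as short as the one above will not be cached until you extend it, for example with detailed scoring anchors or few-shot examples. Second, only the cached prefix is discounted; the source article and summary in the user message are billed at the normal input rate on every call. On large batches sent within the cache lifetime, nearly every call after the first should hit the cache. Measure the actual saving from the usage fields in the response rather than assuming a fixed ratio.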

client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    system=[{
        "type": "text",
        "text": JUDGE_SYSTEM,
        "cache_control": {"type": "ephemeral"}  # 5-minute TTL
    }],
    messages=[{"role": "user", "content": user_msg}],
)
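To see whether caching is actually taking effect, you can read the token counts on each response's usage object. The sketch below assumes the Messages API usage fields input_tokens, cache_creation_input_tokens, and cache_read_input_tokens. cache_read_fraction is a hypothetical helper, and the example object is a stand-in rather than a real response.

```python
from types import SimpleNamespace

def cache_read_fraction(usage) -> float:
    """Share of this request's input tokens that were served from cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0

# Stand-in for resp.usage on a call that hit the cache (made-up numbers).
example_usage = SimpleNamespace(
    input_tokens=500,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=1500,
)
print(cache_read_fraction(example_usage))  # 0.75
```

A fraction that stays at 0.0 across a batch usually means the cached prefix is too short or changes between calls.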

Sample for very large batches. If throughput makes evaluating every article impractical, run the full Judge on a random 20% of articles per batch and use the sample mean as the batch-level composite. This is acceptable for alerting on aggregate drift, but the unsampled articles have no individual scores, so they cannot trigger regeneration or per-article review.
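One way to turn a 20% sample into a batch-level number is the sample mean with its standard error. The sketch below assumes simple random sampling without replacement. sample_articles and batch_estimate are illustrative names.

```python
import math
import random
import statistics

def sample_articles(article_ids, fraction=0.2, seed=None):
    """Pick a simple random sample of article IDs for full Judge evaluation."""
    ids = list(article_ids)
    if not ids:
        return []
    k = min(len(ids), max(1, round(len(ids) * fraction)))
    return random.Random(seed).sample(ids, k)

def batch_estimate(sample_scores):
    """Return (mean composite, standard error of the mean) for the sample."""
    n = len(sample_scores)
    mean = statistics.fmean(sample_scores)
    # The standard error needs at least two scores; report NaN otherwise.
    se = statistics.stdev(sample_scores) / math.sqrt(n) if n > 1 else float("nan")
    return mean, se
```

The standard error gives a rough sense of how far the batch mean could move if a different 20% had been drawn. That matters when the alert threshold sits close to the usual average.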

Set a drift alert. A 7-day rolling average on composite that drops below 3.8 is worth a Slack notification. This catches prompt regressions, model version changes, and input distribution shifts before they become editorial problems. A simple cron querying your eval_log table is all you need.
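The check the cron job performs can be sketched in a few lines. It assumes the job has already fetched (created_at, composite) pairs from eval_log with timezone-aware timestamps. rolling_composite and drift_alert are hypothetical names, and posting to Slack is left out.

```python
from datetime import datetime, timedelta, timezone

def rolling_composite(rows, now, window_days=7):
    """Mean composite over rows created in the last window_days, or None if empty."""
    cutoff = now - timedelta(days=window_days)
    # float() also accepts the Decimal values a NUMERIC column typically yields.
    recent = [float(composite) for created_at, composite in rows
              if cutoff <= created_at <= now]
    return sum(recent) / len(recent) if recent else None

def drift_alert(rows, now, threshold=3.8, window_days=7):
    """True when the rolling average exists and has dropped below the threshold."""
    avg = rolling_composite(rows, now, window_days)
    return avg is not None and avg < threshold

now = datetime(2025, 1, 8, tzinfo=timezone.utc)
rows = [
    (now - timedelta(days=1), 3.0),
    (now - timedelta(days=2), 4.0),
    (now - timedelta(days=10), 5.0),  # outside the 7-day window, ignored
]
print(rolling_composite(rows, now))  # 3.5
print(drift_alert(rows, now))        # True
```

Returning None for an empty window keeps a quiet week from being mistaken for a quality drop.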

Platform notes: the pipeline itself is environment-agnostic, and the PostgreSQL types used in the schema (BIGSERIAL, TIMESTAMPTZ) behave the same on macOS and Linux. NUMERIC(3,2) stores values with an absolute value below 10 at two decimal places, which comfortably covers a weighted average of 1–5 scores.

Three pillars hold this pattern together: separate scoring dimensions with explicit rubrics, feedback-injected regeneration on failure, and periodic Judge validation against a golden set. Without the third pillar, you're trusting an unvalidated evaluator — which defeats the purpose of automated quality control.

Next up in the series: batching the eval loop into an n8n workflow that auto-routes low-scoring articles to a human review queue while high-confidence summaries publish immediately.


Seunghyeon's Agentic Lab
