Self-Validating Summaries: ROUGE + LLM-as-Judge in One Pipeline

If you're running an automated news summarization pipeline, you already know the uncomfortable truth: your summaries can silently degrade and you won't notice until a user complains. This post walks through a layered quality-gate system that combines ROUGE pre-filtering with LLM-as-Judge evaluation — and loops the failure reason back into the next generation attempt, closing the feedback cycle automatically.


1. Why This Matters Now

The standard approach to summarization quality is either manual spot-checks or no checks at all. Neither scales. Manual review breaks down around a few hundred articles per day; zero review means you're shipping hallucinated or off-topic summaries with no signal.

The real pain is this: LLM outputs are non-deterministic, model behavior drifts over time, and upstream data quality fluctuates. A pipeline that produced acceptable summaries last week may be producing garbage this week because the source articles changed format, or because the model's behavior shifted slightly after a provider update.

What's missing is a quality gate that runs automatically inside the pipeline itself — not as a post-hoc audit, but as a blocking step that either approves the output or triggers a retry with a correction hint. The cost constraint is real too: calling an LLM judge on every summary is expensive. You need a cheap first-pass filter to make the expensive judge call only when it counts.


2. The Core Idea

The architecture is a two-stage filter: ROUGE screens out obviously bad summaries cheaply, and LLM-as-Judge evaluates the ones that survive on semantic quality. If a summary fails, the failure reason gets injected into the next generation prompt.

Here's how the two methods compare:

| Method | Speed | Cost | What it measures | Blind spot |
|---|---|---|---|---|
| ROUGE | Fast, deterministic | Zero | n-gram overlap with source | Semantic quality, fluency |
| LLM-as-Judge | Slower | Per-call | Faithfulness, coverage, conciseness | Prompt sensitivity, cost |
| Combined (layered) | Fast overall | Reduced | Both surface and semantic quality | Threshold miscalibration |

ROUGE-2 is the key metric in the pre-filter. Bigram overlap captures phrase-level fidelity better than unigram overlap — a summary that reuses the same individual words in a completely different order will pass ROUGE-1 but fail ROUGE-2. That distinction matters for news summaries where key named entities and noun phrases need to appear in context.
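To make the ROUGE-1 versus ROUGE-2 difference concrete, here is a minimal sketch that computes clipped n-gram F1 by hand. It is a simplified stand-in for the rouge-score package (plain whitespace tokens, no stemming), and the two sentences are made-up examples.

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count n-grams over lowercase whitespace tokens (no stemming)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_f1(reference: str, candidate: str, n: int) -> float:
    """Clipped n-gram overlap F1, the quantity ROUGE-N reports as its F-measure."""
    ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
    overlap = sum((ref & cand).values())  # Counter & keeps the minimum count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: same eight words, scrambled order.
reference = "the central bank raised interest rates on tuesday"
shuffled = "tuesday rates the bank on raised interest central"

print(ngram_f1(reference, shuffled, 1))  # 1.0: identical bag of words
print(ngram_f1(reference, shuffled, 2))  # about 0.143: only "raised interest" survives
```

The scrambled sentence is perfect on unigrams and nearly worthless on bigrams, which is the gap the pre-filter relies on.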

The LLM judge adds three axes that ROUGE cannot measure: faithfulness (hallucination check), coverage (were the main points included?), and conciseness (is every sentence earning its place?). Crucially, the judge outputs a fail_reason string — and that string becomes the correction hint in the next retry.


3. How to Implement It

Install dependencies first:

pip install rouge-score anthropic prometheus-client numpy

Step 1 — ROUGE pre-filter

The pre-filter runs on every summary before any LLM call. A ROUGE-2 F-measure below 0.15 means the summary shares few two-word phrases with the source relative to their lengths, which usually signals hallucination or an off-topic summary.

from rouge_score import rouge_scorer

def rouge_prefilter(source: str, summary: str, threshold: float = 0.15) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    scores = scorer.score(source, summary)
    passed = scores['rouge2'].fmeasure >= threshold
    return {
        "passed": passed,
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure,
    }

The use_stemmer=True flag reduces false negatives from morphological variants — "running" and "run" count as the same token.
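One thing worth checking before trusting a fixed F-measure cutoff: the score is computed against the full source rather than a short reference summary, so recall is capped by the length ratio. The sketch below uses a hand-rolled bigram scorer (not the rouge-score package) and a synthetic 400-token source to show a fully extractive summary still landing under 0.15. If your articles are long, thresholding on precision may be the safer choice. Treat that as a suggestion, not something taken from the pipeline above.

```python
from collections import Counter

def bigram_scores(source: str, summary: str) -> dict:
    """Return clipped bigram precision, recall and F1 of summary vs. source."""
    def bigrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    src, summ = bigrams(source), bigrams(summary)
    overlap = sum((src & summ).values())
    precision = overlap / max(sum(summ.values()), 1)
    recall = overlap / max(sum(src.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Synthetic data: 400 distinct tokens; the summary copies the first 20 verbatim.
source = " ".join(f"w{i}" for i in range(400))
summary = " ".join(f"w{i}" for i in range(20))

scores = bigram_scores(source, summary)
print(scores)  # precision is 1.0, yet f1 is about 0.09
```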

Step 2 — LLM judge with structured output

The judge prompt enforces three principles: explicit rubric, fixed score range, and JSON-only output. Freeform judge outputs are nearly impossible to parse reliably at scale.

JUDGE_PROMPT = """You are an expert news summarization evaluator.

Given the SOURCE article and its SUMMARY, score on three axes (1–5 each):

1. **Faithfulness** — Does the summary contain only information from the source? (hallucination check)
2. **Coverage** — Does the summary capture the main points?
3. **Conciseness** — Is every sentence necessary?

Respond ONLY as JSON:
{{"faithfulness": <int>, "coverage": <int>, "conciseness": <int>, "fail_reason": "<string or null>"}}

SOURCE:
{source}

SUMMARY:
{summary}"""

The fail_reason field is what makes this more than just a quality checker — it becomes the input to the next generation attempt.
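Even with a JSON-only instruction, a judge reply can arrive wrapped in extra text or with out-of-range scores. Here is a minimal sketch of a defensive parser. `parse_judgment` and `REQUIRED_AXES` are hypothetical names I'm introducing for illustration; they are not part of any library.

```python
import json
import re

REQUIRED_AXES = ("faithfulness", "coverage", "conciseness")

def parse_judgment(raw: str) -> dict:
    """Parse and validate a judge reply; raise ValueError if it is unusable."""
    # Grab everything from the first "{" to the last "}" to skip stray wrapper text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    for axis in REQUIRED_AXES:
        score = data.get(axis)
        # bool is a subclass of int, so rule it out explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {score!r}")
    data.setdefault("fail_reason", None)
    return data
```

A parse failure can then be treated like any other failed attempt instead of crashing the pipeline.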

Step 3 — Self-evaluating retry loop

import anthropic
import json
from dataclasses import dataclass

@dataclass
class QualityResult:
    passed: bool
    scores: dict
    fail_reason: str | None
    attempts: int

client = anthropic.Anthropic()

def llm_judge(source: str, summary: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        system="You are a strict JSON-only evaluator.",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary)
        }]
    )
    return json.loads(response.content[0].text)

def evaluate_summary(source: str, summary: str) -> QualityResult:
    rouge = rouge_prefilter(source, summary)
    if not rouge["passed"]:
        return QualityResult(
            passed=False,
            scores=rouge,
            fail_reason="ROUGE-2 below threshold",
            attempts=0
        )
    judgment = llm_judge(source, summary)
    passed = (
        judgment["faithfulness"] >= 4 and
        judgment["coverage"] >= 3 and
        judgment["conciseness"] >= 3
    )
    return QualityResult(
        passed=passed,
        scores={**rouge, **judgment},
        fail_reason=judgment.get("fail_reason"),
        attempts=0
    )

def generate_summary(source: str, hint: str = "") -> str:
    correction = f"\nPrevious attempt failed because: {hint}. Fix this." if hint else ""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize in 3 sentences.{correction}\n\n{source}"
        }]
    )
    return response.content[0].text

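def summarize_with_retry(source: str, max_attempts: int = 3) -> QualityResult:
    """Generate, evaluate, and retry, passing each failure reason to the next attempt."""
    if max_attempts < 1:
        # Without at least one attempt, `result` below would never be assigned.
        raise ValueError("max_attempts must be at least 1")
    fail_context = ""
    for attempt in range(1, max_attempts + 1):
        summary = generate_summary(source, hint=fail_context)
        result = evaluate_summary(source, summary)
        result.attempts = attempt
        if result.passed:
            return result
        # Carry the failure reason into the next generation prompt.
        fail_context = result.fail_reason or "quality insufficient"
    return result  # Final failure: the caller should trigger an alert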

On the first attempt, fail_context is empty and the summary is generated normally. If the judge rejects it with fail_reason: "summary introduces facts not in source", that exact string gets appended to the next generation prompt. The loop is now closed — the generator knows what went wrong.
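If you want to see the hint propagation without spending API calls, the loop can be exercised with stubs. This is a sketch under my own assumptions: `retry_with_feedback`, `fake_generate`, and `fake_evaluate` are hypothetical stand-ins, not the functions defined above.

```python
from typing import Callable, List, Optional, Tuple

def retry_with_feedback(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[bool, Optional[str]]],
    max_attempts: int = 3,
) -> Tuple[str, int, List[str]]:
    """Run generate/evaluate up to max_attempts times, feeding each failure reason forward."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    hint, hints_seen, summary = "", [], ""
    for attempt in range(1, max_attempts + 1):
        hints_seen.append(hint)  # record what the generator was told this round
        summary = generate(hint)
        passed, reason = evaluate(summary)
        if passed:
            break
        hint = reason or "quality insufficient"
    return summary, attempt, hints_seen

# Stub generator: only produces a grounded summary once it has received a hint.
def fake_generate(hint: str) -> str:
    return "grounded summary" if hint else "summary with invented facts"

# Stub judge: rejects anything that mentions invented facts.
def fake_evaluate(summary: str) -> Tuple[bool, Optional[str]]:
    if "invented" in summary:
        return False, "summary introduces facts not in source"
    return True, None

summary, attempts, hints = retry_with_feedback(fake_generate, fake_evaluate)
print(attempts, hints)  # 2 ['', 'summary introduces facts not in source']
```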

Step 4 — Verify it's working

test_source = "Apple announced record quarterly revenue of $94.9B, driven by iPhone 15 Pro sales."
result = summarize_with_retry(test_source, max_attempts=3)
print(f"Passed: {result.passed}")
print(f"Attempts: {result.attempts}")
print(f"ROUGE-2: {result.scores.get('rouge2', 0):.3f}")
print(f"Faithfulness: {result.scores.get('faithfulness')}")

Example output on a clean run (exact scores will vary between runs because the model is non-deterministic):

Passed: True
Attempts: 1
ROUGE-2: 0.312
Faithfulness: 5

4. What to Watch in Production

Threshold calibration is not a one-time task. The ROUGE-2 threshold of 0.15 is a starting point for general news, not a universal truth. What worked for me: collect 200 examples, have a human label each one pass or fail, compute ROUGE-2 for the summaries labeled pass, then run np.percentile(human_scores, 5) on those scores to find your domain-specific floor.

import numpy as np

def calibrate_threshold(human_scores: list[float], percentile: int = 5) -> float:
    # human_scores: ROUGE-2 scores of the summaries that human reviewers approved
    return float(np.percentile(human_scores, percentile))

Use the 5th percentile, not the median. You want to catch the bottom tail, not enforce average quality.
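Here is a small worked example of that calibration, written without NumPy so it runs anywhere. The `percentile` helper uses the same linear interpolation rule as NumPy's default method; `calibrate_from_labels` and the labeled sample are hypothetical, made up purely to show the arithmetic.

```python
def percentile(values: list[float], q: float) -> float:
    """Linear-interpolation percentile over a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100  # fractional index into the sorted list
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def calibrate_from_labels(examples: list[tuple[float, bool]], q: float = 5) -> float:
    """Return the q-th percentile of ROUGE-2 among human-approved summaries."""
    approved = [rouge2 for rouge2, human_pass in examples if human_pass]
    return percentile(approved, q)

# Hypothetical labeled sample: (rouge2_score, human_pass)
labeled = [(0.05, False), (0.10, True), (0.20, True),
           (0.30, True), (0.40, True), (0.50, True)]

print(round(calibrate_from_labels(labeled), 3))  # 0.12
```

With only five approved examples the estimate is noisy, which is why a couple of hundred labels is a more sensible sample size.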

Monitor pass rates weekly. A sudden drop of more than 20 percentage points is a drift signal: either the source data format changed, or the generation model's behavior shifted. Wire this up with Prometheus:

from prometheus_client import Counter, Histogram

summary_quality_pass = Counter('summary_quality_pass_total', 'Summaries passing QA')
summary_quality_fail = Counter('summary_quality_fail_total', 'Summaries failing QA', ['reason'])
rouge2_histogram = Histogram('summary_rouge2_score', 'ROUGE-2 score distribution')

def record_metrics(result: QualityResult):
    rouge2_histogram.observe(result.scores.get("rouge2", 0))
    if result.passed:
        summary_quality_pass.inc()
    else:
        summary_quality_fail.labels(reason=result.fail_reason or "unknown").inc()
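The weekly pass-rate check itself is simple arithmetic. A minimal sketch, assuming you can pull (passed, total) counts per week from your metrics store; `pass_rate_dropped` is a hypothetical helper and the counts are invented.

```python
def pass_rate_dropped(previous: tuple[int, int], current: tuple[int, int],
                      max_drop_points: float = 20.0) -> bool:
    """Compare two (passed, total) windows; True if the pass rate fell by more than max_drop_points."""
    def rate(window: tuple[int, int]) -> float:
        passed, total = window
        if total <= 0:
            raise ValueError("window must contain at least one summary")
        return 100.0 * passed / total

    return rate(previous) - rate(current) > max_drop_points

# Hypothetical weekly counts.
print(pass_rate_dropped((900, 1000), (650, 1000)))  # True: 90% -> 65% is a 25-point drop
print(pass_rate_dropped((900, 1000), (800, 1000)))  # False: 90% -> 80% is a 10-point drop
```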

In Grafana, group summary_quality_fail_total by the reason label. Faithfulness failures trending up usually point to source document quality degradation: the raw articles are getting noisier. Coverage failures trending up often mean the source articles are getting longer and the context window is being exceeded.
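One caveat on that label: the judge's fail_reason is free text, and Prometheus creates a separate time series for every distinct label value, so raw strings can blow up cardinality and make grouping useless. A minimal sketch of mapping each result to a small fixed set of categories instead. `failure_category` is a hypothetical helper; its cutoffs mirror the pass conditions in evaluate_summary.

```python
def failure_category(scores: dict) -> str:
    """Map a failed result's scores to one of a few fixed label values."""
    if "faithfulness" not in scores:
        return "rouge"  # the judge never ran, so the ROUGE pre-filter rejected it
    if scores["faithfulness"] < 4:
        return "faithfulness"
    if scores["coverage"] < 3:
        return "coverage"
    if scores["conciseness"] < 3:
        return "conciseness"
    return "unknown"

print(failure_category({"rouge2": 0.04}))  # rouge
print(failure_category({"faithfulness": 2, "coverage": 4, "conciseness": 4}))  # faithfulness
```

Passing `failure_category(result.scores)` as the reason label keeps the series count bounded, and the full fail_reason string can go to logs.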

Cost math matters. At 10,000 articles per day, calling the LLM judge on every single one adds up fast. If the ROUGE pre-filter rejects 30% upfront, you're down to 7,000 judge calls per day — roughly a 30% monthly cost reduction. Raising the ROUGE threshold cuts judge costs further, but increases false negatives: genuinely good summaries that happen to use different phrasing get blocked. That tradeoff depends entirely on your SLA. If you're in a domain where a missed hallucination is a liability (medical, legal, financial), keep the threshold low and pay for more judge calls. If throughput and cost are the priority, tighten ROUGE and accept more aggressive filtering.
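The arithmetic above, as a quick sketch. `judge_calls_per_day` is a hypothetical helper that counts first-pass judge calls only. A rejected summary triggers a regeneration that may reach the judge on the retry, so the real saving is likely somewhat smaller than the headline number.

```python
def judge_calls_per_day(articles_per_day: int, prefilter_reject_rate: float) -> int:
    """First-pass judge calls after the ROUGE pre-filter (retries not counted)."""
    if not 0.0 <= prefilter_reject_rate <= 1.0:
        raise ValueError("prefilter_reject_rate must be between 0 and 1")
    return round(articles_per_day * (1.0 - prefilter_reject_rate))

baseline = judge_calls_per_day(10_000, 0.0)   # judge every summary
filtered = judge_calls_per_day(10_000, 0.30)  # ROUGE rejects 30% upfront
saving = 1 - filtered / baseline

print(baseline, filtered, f"{saving:.0%}")  # 10000 7000 30%
```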

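Environment note: On Apple Silicon Macs, rouge_scorer may fail at runtime if the nltk punkt data is missing. If you see an NLTK LookupError, run python -m nltk.downloader punkt once during setup. The Porter stemmer behind use_stemmer=True does not itself need downloaded data, so the error is not tied to that flag. In Docker, add the download to your Dockerfile's build step.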


A two-stage quality gate — cheap ROUGE filter first, expensive LLM judge only when needed — cuts your per-article evaluation cost while catching the failures that actually matter. The feedback loop from fail_reason back into the generator is what makes this a self-correcting system rather than just a monitoring layer.

Next in this series: storing the quality scores alongside the summaries for long-term trend analysis and automatic threshold re-calibration.

