If you're running an automated news summarization pipeline, you already know the uncomfortable truth: your summaries can silently degrade and you won't notice until a user complains. This post walks through a layered quality-gate system that combines ROUGE pre-filtering with LLM-as-Judge evaluation — and loops the failure reason back into the next generation attempt, closing the feedback cycle automatically.
1. Why This Matters Now
The standard approach to summarization quality is either manual spot-checks or no checks at all. Neither scales. Manual review breaks down around a few hundred articles per day; zero review means you're shipping hallucinated or off-topic summaries with no signal.
The real pain is this: LLM outputs are non-deterministic, model behavior drifts over time, and upstream data quality fluctuates. A pipeline that produced acceptable summaries last week may be producing garbage this week because the source articles changed format, or because the model's behavior shifted slightly after a provider update.
What's missing is a quality gate that runs automatically inside the pipeline itself — not as a post-hoc audit, but as a blocking step that either approves the output or triggers a retry with a correction hint. The cost constraint is real too: calling an LLM judge on every summary is expensive. You need a cheap first-pass filter to make the expensive judge call only when it counts.
2. The Core Idea
The architecture is a two-stage filter: ROUGE screens out obviously bad summaries cheaply, and LLM-as-Judge evaluates the ones that survive on semantic quality. If a summary fails, the failure reason gets injected into the next generation prompt.
Here's how the two methods compare:
| Method | Speed | Cost | What it measures | Blind spot |
|---|---|---|---|---|
| ROUGE | Fast, deterministic | Zero | n-gram overlap with source | Semantic quality, fluency |
| LLM-as-Judge | Slower | Per-call | Faithfulness, coverage, conciseness | Prompt sensitivity, cost |
| Combined (layered) | Fast overall | Reduced | Both surface and semantic quality | Threshold miscalibration |
ROUGE-2 is the key metric in the pre-filter. Bigram overlap captures phrase-level fidelity better than unigram overlap — a summary that reuses the same individual words in a completely different order will pass ROUGE-1 but fail ROUGE-2. That distinction matters for news summaries where key named entities and noun phrases need to appear in context.
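To make the word-order point concrete, here is a minimal dependency-free sketch. `ngram_f1` is a hypothetical helper written for this illustration (whitespace tokens, no stemming), not the rouge_score implementation, and the three sentences are made up.

```python
from collections import Counter

def ngram_f1(reference: str, candidate: str, n: int) -> float:
    """F-measure of clipped n-gram overlap (simplified ROUGE-N, no stemming)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # each n-gram counted at most as often as it appears in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

source = "the central bank raised interest rates by half a point"
faithful = "the central bank raised interest rates"
scrambled = "rates bank the interest raised central"  # same words, shuffled

for label, text in [("faithful", faithful), ("scrambled", scrambled)]:
    print(label, round(ngram_f1(source, text, 1), 3), round(ngram_f1(source, text, 2), 3))
# faithful 0.75 0.714
# scrambled 0.75 0.0
```

Both candidates get the same unigram score, but only the one that preserves phrases keeps a non-zero bigram score.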
The LLM judge adds three axes that ROUGE cannot measure: faithfulness (hallucination check), coverage (were the main points included?), and conciseness (is every sentence earning its place?). Crucially, the judge outputs a fail_reason string — and that string becomes the correction hint in the next retry.
3. How to Implement It
Install dependencies first:
pip install rouge-score anthropic prometheus-client numpy
Step 1 — ROUGE pre-filter
The pre-filter runs on every summary before any LLM call. A ROUGE-2 F-measure below 0.15 means the summary shares almost no two-word phrases with the source, which is a strong signal of hallucination or complete misdirection.
from rouge_score import rouge_scorer

def rouge_prefilter(source: str, summary: str, threshold: float = 0.15) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    scores = scorer.score(source, summary)
    passed = scores['rouge2'].fmeasure >= threshold
    return {
        "passed": passed,
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure,
    }
The use_stemmer=True flag reduces false negatives from morphological variants — "running" and "run" count as the same token.
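One thing worth checking before you trust a fixed F-measure cutoff: the pre-filter scores the summary against the full source, so recall shrinks as articles get longer, and it drags F-measure down with it. The sketch below uses a hypothetical `bigram_prf` helper (plain whitespace tokens, no stemming, not the rouge_score code) and a synthetic long article to show the effect. Treat it as an illustration of the arithmetic, not a measurement of real data.

```python
from collections import Counter

def bigram_prf(source: str, summary: str) -> tuple[float, float, float]:
    """Precision, recall and F-measure of bigram overlap (simplified ROUGE-2)."""
    def bigrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    src, summ = bigrams(source), bigrams(summary)
    overlap = sum((src & summ).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p = overlap / sum(summ.values())   # share of summary bigrams found in the source
    r = overlap / sum(src.values())    # share of source bigrams found in the summary
    return p, r, 2 * p * r / (p + r)

summary = "the council approved the new budget"
short_source = summary + " after a brief debate"
# Synthetic long article: the same lead followed by 200 filler tokens.
long_source = short_source + " " + " ".join(f"word{i}" for i in range(200))

for name, src in [("short", short_source), ("long", long_source)]:
    p, r, f = bigram_prf(src, summary)
    print(f"{name}: precision={p:.2f} recall={r:.3f} f={f:.3f}")
```

In this toy case the summary is copied verbatim from the source, yet against the long article its F-measure lands below 0.15 while precision stays at 1.0. If your articles are long, it may be worth gating on `scores['rouge2'].precision` instead, or calibrating the threshold per length bucket.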
Step 2 — LLM judge with structured output
The judge prompt enforces three principles: explicit rubric, fixed score range, and JSON-only output. Freeform judge outputs are nearly impossible to parse reliably at scale.
# Literal JSON braces are doubled ({{ }}) so that str.format() only fills {source} and {summary}.
JUDGE_PROMPT = """You are an expert news summarization evaluator.
Given the SOURCE article and its SUMMARY, score on three axes (1–5 each):

1. **Faithfulness**: Does the summary contain only information from the source? (hallucination check)
2. **Coverage**: Does the summary capture the main points?
3. **Conciseness**: Is every sentence necessary?

Respond ONLY as JSON:
{{"faithfulness": <int>, "coverage": <int>, "conciseness": <int>, "fail_reason": "<string or null>"}}

SOURCE:
{source}

SUMMARY:
{summary}"""
The fail_reason field is what makes this more than just a quality checker — it becomes the input to the next generation attempt.
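Because everything downstream depends on that JSON being well-formed, it can help to validate the judge's reply before using it. The following is a minimal sketch under the assumption that the reply follows the schema in the prompt. `parse_judgment` and `AXES` are hypothetical names introduced here, not part of any library.

```python
import json

AXES = ("faithfulness", "coverage", "conciseness")
FENCE = "`" * 3  # markdown code-fence marker

def parse_judgment(raw: str) -> dict:
    """Parse and validate the judge's reply; raise ValueError if it is unusable."""
    text = raw.strip()
    # Models sometimes wrap JSON in a markdown fence despite instructions; unwrap it.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned non-JSON output: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output is not a JSON object")
    for axis in AXES:
        score = data.get(axis)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {score!r}")
    reason = data.get("fail_reason")
    if reason is not None and not isinstance(reason, str):
        raise ValueError("fail_reason must be a string or null")
    data["fail_reason"] = reason or None  # normalise "" to None
    return data
```

A caller could treat a `ValueError` here as a failed evaluation and retry the judge call, rather than letting a malformed reply crash the pipeline.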
Step 3 — Self-evaluating retry loop
import anthropic
import json
from dataclasses import dataclass

@dataclass
class QualityResult:
    passed: bool
    scores: dict
    fail_reason: str | None  # the "X | None" syntax needs Python 3.10+
    attempts: int

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

def llm_judge(source: str, summary: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        system="You are a strict JSON-only evaluator.",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, summary=summary)
        }]
    )
    # json.loads raises if the reply is anything other than bare JSON.
    return json.loads(response.content[0].text)

def evaluate_summary(source: str, summary: str) -> QualityResult:
    # Stage 1: cheap ROUGE gate. On failure, skip the judge call entirely.
    rouge = rouge_prefilter(source, summary)
    if not rouge["passed"]:
        return QualityResult(
            passed=False,
            scores=rouge,
            fail_reason="ROUGE-2 below threshold",
            attempts=0
        )
    # Stage 2: LLM judge. Faithfulness has the strictest pass mark.
    judgment = llm_judge(source, summary)
    passed = (
        judgment["faithfulness"] >= 4 and
        judgment["coverage"] >= 3 and
        judgment["conciseness"] >= 3
    )
    return QualityResult(
        passed=passed,
        scores={**rouge, **judgment},
        fail_reason=judgment.get("fail_reason"),
        attempts=0
    )

def generate_summary(source: str, hint: str = "") -> str:
    # The previous failure reason, if any, is injected as a correction hint.
    correction = f"\nPrevious attempt failed because: {hint}. Fix this." if hint else ""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize in 3 sentences.{correction}\n\n{source}"
        }]
    )
    return response.content[0].text

def summarize_with_retry(source: str, max_attempts: int = 3) -> QualityResult:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    fail_context = ""
    for attempt in range(1, max_attempts + 1):
        summary = generate_summary(source, hint=fail_context)
        result = evaluate_summary(source, summary)
        result.attempts = attempt
        if result.passed:
            return result
        fail_context = result.fail_reason or "quality insufficient"
    return result  # Final failure: trigger an alert
On the first attempt, fail_context is empty and the summary is generated normally. If the judge rejects it with fail_reason: "summary introduces facts not in source", that exact string is inserted into the next generation prompt, right after the summarization instruction. The loop is now closed: the generator knows what went wrong.
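If you want to unit-test the loop without spending API calls, one option is to pass the generator and evaluator in as functions. The sketch below is a generic restatement of the same control flow. `retry_with_hint`, `fake_generate`, and `fake_evaluate` are hypothetical stand-ins, and the evaluator is simplified to return `None` on a pass or a reason string on a failure.

```python
from typing import Callable, Optional

def retry_with_hint(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[str, int, list[str]]:
    """Run generate/evaluate until evaluate() returns None or attempts run out."""
    hints_seen: list[str] = []  # the hint each attempt was generated with
    hint = ""
    summary = ""
    for attempt in range(1, max_attempts + 1):
        hints_seen.append(hint)
        summary = generate(hint)
        reason = evaluate(summary)
        if reason is None:
            return summary, attempt, hints_seen
        hint = reason  # feed the failure reason into the next attempt
    return summary, max_attempts, hints_seen

# Hypothetical stand-ins for the real generator and judge.
def fake_generate(hint: str) -> str:
    return "Revenue rose." if hint else "Revenue rose and the CEO resigned."

def fake_evaluate(summary: str) -> Optional[str]:
    return "summary introduces facts not in source" if "CEO" in summary else None

summary, attempts, hints = retry_with_hint(fake_generate, fake_evaluate)
print(attempts, hints)
# 2 ['', 'summary introduces facts not in source']
```

The second entry in `hints` is the judge's reason from attempt one, which is the behaviour the production loop relies on.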
Step 4 — Verify it's working
test_source = "Apple announced record quarterly revenue of $94.9B, driven by iPhone 15 Pro sales."
result = summarize_with_retry(test_source, max_attempts=3)
print(f"Passed: {result.passed}")
print(f"Attempts: {result.attempts}")
print(f"ROUGE-2: {result.scores.get('rouge2', 0):.3f}")
print(f"Faithfulness: {result.scores.get('faithfulness')}")
Expected output on a clean run (the exact ROUGE-2 value and attempt count vary from run to run):
Passed: True
Attempts: 1
ROUGE-2: 0.312
Faithfulness: 5
4. What to Watch in Production
Threshold calibration is not a one-time task. The ROUGE-2 threshold of 0.15 is a starting point for general news, not a universal truth. What worked for me: collect 200 examples, have a human label each one pass or fail, then run np.percentile(human_scores, 5) over the ROUGE-2 scores of the summaries the human passed to find your domain-specific floor.
import numpy as np

def calibrate_threshold(human_scores: list[float], percentile: int = 5) -> float:
    # human_scores: ROUGE-2 F-measures of the summaries a human reviewer approved
    return float(np.percentile(human_scores, percentile))
Use the 5th percentile, not the median. You want to catch the bottom tail, not enforce average quality.
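Here is a small worked example of that calibration. The labelled sample is made up, and `percentile` is a hand-rolled helper that mirrors the linear interpolation np.percentile uses by default, written out only so the arithmetic is visible without NumPy.

```python
def percentile(values: list[float], q: float) -> float:
    """Linear-interpolation percentile, q in the range 0..100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100          # fractional index into the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

# Hypothetical labelled sample: (ROUGE-2 F-measure, human verdict).
labelled = [
    (0.08, False), (0.12, False), (0.14, True), (0.18, True), (0.21, True),
    (0.25, True), (0.27, True), (0.31, True), (0.35, True), (0.40, True),
]

# Only the summaries a human approved define the floor.
approved = [score for score, ok in labelled if ok]
threshold = percentile(approved, 5)
print(f"threshold = {threshold:.3f}")
# threshold = 0.154
```

With eight approved scores, the 5th percentile sits 35% of the way between the two lowest values (0.14 and 0.18), which gives 0.154. A real calibration set should be much larger than this toy sample.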
Monitor pass rates weekly. A sudden drop of more than 20 percentage points is a model drift signal — either the source data format changed, or the generation model's behavior shifted. Wire this up with Prometheus:
from prometheus_client import Counter, Histogram

summary_quality_pass = Counter('summary_quality_pass_total', 'Summaries passing QA')
summary_quality_fail = Counter('summary_quality_fail_total', 'Summaries failing QA', ['reason'])
rouge2_histogram = Histogram('summary_rouge2_score', 'ROUGE-2 score distribution')

def record_metrics(result: QualityResult):
    rouge2_histogram.observe(result.scores.get("rouge2", 0))
    if result.passed:
        summary_quality_pass.inc()
    else:
        summary_quality_fail.labels(reason=result.fail_reason or "unknown").inc()
In Grafana, group summary_quality_fail_total by reason label. Faithfulness failures trending up usually points to source document quality degradation — the raw articles are getting noisier. Coverage failures trending up usually means the source articles are getting longer and the context window is being exceeded.
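One caveat with the `reason` label: the judge's fail_reason is free text, and Prometheus creates a separate time series for every distinct label value, so grouping by it only works if the values come from a small fixed set. A minimal sketch of one way to bucket failures follows. `reason_label` and `REASON_LABELS` are hypothetical names, and the pass marks are copied from evaluate_summary above.

```python
from typing import Optional

REASON_LABELS = ("rouge", "faithfulness", "coverage", "conciseness", "unknown")

def reason_label(scores: dict, fail_reason: Optional[str]) -> str:
    """Map a failed result onto a small fixed set of label values."""
    if fail_reason == "ROUGE-2 below threshold":
        return "rouge"
    # Same pass marks as evaluate_summary: faithfulness >= 4, the others >= 3.
    minimums = {"faithfulness": 4, "coverage": 3, "conciseness": 3}
    for axis, minimum in minimums.items():
        score = scores.get(axis)
        if isinstance(score, int) and score < minimum:
            return axis  # first failing axis wins, faithfulness checked first
    return "unknown"

print(reason_label({"faithfulness": 2, "coverage": 5, "conciseness": 5}, "invented a quote"))
# faithfulness
```

In record_metrics, the failing branch would then call `summary_quality_fail.labels(reason=reason_label(result.scores, result.fail_reason)).inc()`, and the free-text reason can go to your logs instead.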
Cost math matters. At 10,000 articles per day, calling the LLM judge on every single one adds up fast. If the ROUGE pre-filter rejects 30% upfront, you're down to 7,000 first-pass judge calls per day, roughly a 30% cut in judge spend (retries on rejected summaries claw some of that back). Raising the ROUGE threshold cuts judge costs further, but increases false rejections: genuinely good summaries that happen to use different phrasing get blocked. That tradeoff depends entirely on your SLA. If you're in a domain where a missed hallucination is a liability (medical, legal, financial), keep the threshold low and pay for more judge calls. If throughput and cost are the priority, tighten ROUGE and accept more aggressive filtering.
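The arithmetic is simple enough to keep in a script next to your pricing assumptions. In this sketch the per-call price is a placeholder, not a real quote, and the count covers first-attempt judge calls only. Regenerated summaries that pass ROUGE on a retry add to it.

```python
def judge_calls_per_day(articles: int, rouge_reject_rate: float) -> int:
    """First-attempt judge calls left after the ROUGE pre-filter."""
    return round(articles * (1 - rouge_reject_rate))

ARTICLES_PER_DAY = 10_000
COST_PER_JUDGE_CALL = 0.01  # hypothetical price in USD; substitute your own

for reject_rate in (0.0, 0.30, 0.50):
    calls = judge_calls_per_day(ARTICLES_PER_DAY, reject_rate)
    monthly = calls * COST_PER_JUDGE_CALL * 30
    print(f"reject {reject_rate:.0%}: {calls} calls/day, about ${monthly:,.0f}/month")
# reject 0%: 10000 calls/day, about $3,000/month
# reject 30%: 7000 calls/day, about $2,100/month
# reject 50%: 5000 calls/day, about $1,500/month
```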
Environment note: On Apple Silicon Macs, rouge_scorer with use_stemmer=True may raise an nltk LookupError asking for punkt data. If it does, run python -m nltk.downloader punkt once during setup. In Docker, add it to your Dockerfile's build step.
A two-stage quality gate — cheap ROUGE filter first, expensive LLM judge only when needed — cuts your per-article evaluation cost while catching the failures that actually matter. The feedback loop from fail_reason back into the generator is what makes this a self-correcting system rather than just a monitoring layer.
Next in this series: storing the quality scores alongside the summaries for long-term trend analysis and automatic threshold re-calibration.