Anchor catching the failure mode where RAG retrieves the refutation and still gets it wrong
Ask vanilla RAG over Duval, Goeckner, Klivans, and Martin's 2015 paper "A non-partitionable Cohen-Macaulay simplicial complex" this question:
What did Stanley conjecture in 1979 about partitionability of Cohen-Macaulay complexes?
Here's what you get back:
Stanley conjectured in 1979 that every Cohen–Macaulay simplicial
complex is partitionable, which is stated as Conjecture 1.1
(Partitionability Conjecture): "Every Cohen–Macaulay simplicial
complex is partitionable."
Confident. Cites the conjecture verbatim. The retrieved chunks include the paper's abstract — which contains the phrase "thus disproving the Partitionability Conjecture" — and another chunk explicitly stating the paper gives "a general method for constructing counterexamples and an explicit infinite family of non-partitionable Cohen-Macaulay complexes." Vanilla retrieved the refutation. It still relayed Stanley's 1979 conjecture as the document's position.
This isn't an edge case. It's the default shape of a careful argument. Authors steelman before they refute, because that's what makes the refutation convincing. Top-K vector retrieval finds the steelman; the LLM reads the steelman alongside the refutation and produces a paragraph that gives equal weight to both — or worse, lets the steelman dominate because that's what the question lexically matches.
I built Anchor to catch this specific failure mode. This post is what it does, why I think the approach generalises, and a measured comparison against vanilla RAG: 84% vs 48% trap-query rejection across 25 adversarial queries, same chat model, same embedding model, everything else identical.
How I ended up thinking about this
My wife works in chemistry, and she's been complaining about this category of failure for ages. The pattern is always the same: paper says X, the AI uses X to argue Y, and Y isn't actually what the paper claims. The first thing you reach for is "well, just load the whole document into the LLM" — and that runs straight into the wall every LLM system hits eventually: context windows are finite, expensive, and noisier the more you fill them. Adding more text to the context isn't a free move. The signal-to-noise ratio gets worse, the model's attention gets diluted, and hallucinations get more likely, not less, because attention is relational across tokens and you've just added a lot of weakly related ones.
So you want focused information. RAG is the standard answer: pull the chunks that match the query, inject them, let the LLM work on a tight context. But RAG is fundamentally a similarity search — it finds text that matches the query, not text that represents the document's position on the query. Those are very different things, and the gap between them is where the failure mode lives.
The implicit assumption in vanilla RAG is that any chunk of a document agrees with the rest of that document. For a software manual, that's roughly true — the manual is internally consistent by design. For almost everything else humans write — papers, essays, legal opinions, anything with an argument — it's straightforwardly false. The argument is the whole shape of the document. A chunk extracted from it can be playing any role: thesis, steelman, citation, qualification, refuted view. The chunk's text alone doesn't tell you which.
The conceptual move
Here's how I started thinking about it. The document's signal — what it actually claims, what stance it takes — exists in aggregate. It's a long-running average over the whole text. Any individual chunk is a local sample of that signal, and like any local sample, it's noisy. Sometimes the local sample agrees with the aggregate. Sometimes it's the opposite of the aggregate, because the author is steelmanning a position they're about to demolish.
Vector search is already a lossy process. LLM reading is a lossy process. Information loss is a given; you're not going to engineer it away. What you can do is make the loss work in your favour — collapse the document into its aggregate signal cheaply and ahead of time, and then use that aggregate to correct any local sample you retrieve.
That's the whole architectural bet. Don't try to give the LLM the full document. Give it the aggregate signal of the document, plus the local chunk, plus a judgment step that connects them.
Why retrieval-tuning doesn't fix this
Pre-empting the obvious objection: surely top-K=10 fixes this? Or MMR retrieval, or a reranker, or hybrid BM25+dense?
Those reduce the miss rate. They don't fix the structural problem. The Duval example above is the proof — vanilla retrieved the abstract, which contains the literal word "disproving," alongside a chunk that names the infinite family of counterexamples, and still produced an answer that asserted the conjecture. Better retrieval gives you better local samples. It can't give you the aggregate. You need a separate representation of the aggregate, and a step that compares the local against it.
That's the gap Anchor fills.
Three structural pieces
1. Hierarchical claim-bearing summaries — the aggregate signal
Documents get parsed into a hierarchy: document → chapter → section → paragraph → chunk. Each layer carries a claim-bearing summary — what the layer asserts or argues, not what it covers. The hard rule that makes this work: raw text never appears in summaries above paragraph level. Section, chapter, and document summaries see only the summaries below them.
This is the aggregate. Lossy by design. By the time you've collapsed a 30-page paper into a single document-level summary, you've thrown away most of the local detail — but you've kept the shape of the argument, which is the part that catches steelman-refutation. A document summary that says "this paper refutes Stanley's Partitionability Conjecture and the related Depth Conjecture" is exactly the signal you need to flag a chunk that asserts either as the document's position.
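To make the shape concrete, here's a minimal sketch of the hierarchy in Java (the project's implementation language). The layer names are from the post; the type and method names are mine, not Anchor's actual API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical types; Anchor's real ones may differ.
enum Layer { DOCUMENT, CHAPTER, SECTION, PARAGRAPH, CHUNK }

record SummaryNode(Layer layer, String claimSummary, List<SummaryNode> children) {

    // The hard rule, made mechanical: the input for a summary above
    // paragraph level is the children's claim summaries, never raw text.
    static String summaryInput(SummaryNode node) {
        return node.children().stream()
                .map(SummaryNode::claimSummary)
                .collect(Collectors.joining("\n"));
    }
}
```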
2. Three-agent deliberation with evidence asymmetry
The core logic — what /validate and /ask both run — is a three-agent deliberation:
- Proposer (full hierarchy + retrieved chunks): drafts an answer
- Critic (macro view only — chapter and document summaries, no sections, no chunks): challenges the draft from a restricted vantage point
- Synthesiser (full hierarchy + the proposer's draft + the critic's challenges): produces the final answer
The asymmetry is the design. A critic with the same evidence as the proposer is a paraphrase generator — same inputs, same conclusions, no new signal. A critic restricted to the macro view has to challenge from the document's overall argument: "you claim X, but the chapter summaries indicate ¬X." That's exactly the macro-vs-local mismatch that catches steelman-refutation.
It's not hypothetical. The critic in the eval below caught the proposer fabricating section references that didn't exist in the macro view, and pushed back on the proposer attributing a specific result to "intersecting families" when the macro indicated no such section. The synthesiser revised accordingly. That's the deliberation working as designed.
The deliberation returns structured output: argumentative_role (one of AUTHOR_POSITION, STEELMAN_REFUTED_LATER, CITED_EXTERNAL_VIEW, QUALIFIED_CLAIM, BACKGROUND_FACTUAL, UNCLEAR) and document_stance_on_query (one of SUPPORTS, REJECTS, NEUTRAL, MIXED, OFF_TOPIC). When STEELMAN_REFUTED_LATER fires, the response surfaces the chunks doing the refuting — found by vector search on "not " + query inside the same document.

The point is the enums, not the prose. They're machine-readable. A calling system can branch on document_stance_on_query == REJECTS and refuse to claim the chunk supports the query, regardless of how confident the chunk's text sounds. That branch decision doesn't exist in vanilla RAG — there's nothing structured to branch on.
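In Java terms, the contract is small enough to show in full. The enum values below are exactly the ones named above; the type names and the helper are my sketch, not Anchor's published API:

```java
enum ArgumentativeRole {
    AUTHOR_POSITION, STEELMAN_REFUTED_LATER, CITED_EXTERNAL_VIEW,
    QUALIFIED_CLAIM, BACKGROUND_FACTUAL, UNCLEAR
}

enum DocumentStance { SUPPORTS, REJECTS, NEUTRAL, MIXED, OFF_TOPIC }

record Verdict(ArgumentativeRole argumentativeRole,
               DocumentStance documentStanceOnQuery) {}

final class StanceGate {
    // The branch vanilla RAG can't take: decide on the enum,
    // not on how confident the chunk's prose sounds.
    static boolean mayCiteAsSupport(Verdict v) {
        return switch (v.documentStanceOnQuery()) {
            case SUPPORTS -> true;
            case REJECTS, NEUTRAL, MIXED, OFF_TOPIC -> false;
        };
    }
}
```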
3. Two endpoints, two audiences, plus a cheap pre-filter
The deliberation logic exposes through two endpoints:
POST /validate is the API surface — JSON in, structured judgment out, machine-readable. This is what other systems call when they want to know whether a chunk is doing what the query thinks it's doing.

POST /documents/{id}/ask is the human-facing endpoint — same deliberation, plus Server-Sent Events streaming so a person can watch the proposer draft, the critic challenge, and the synthesiser revise in real time. This is where the "document as a database" idea actually lives. You're not retrieving sections; you're interrogating the document with the reasoning visible.
For queries that don't need the full deliberation, POST /validate/quick is the cheap path: it's a cosine similarity between the query vector and the document-level summary vector, transformed to a [-1, 1] score for document stance against the query. No chat model, no agents, no critique. Pure vector math. Negative means the document's aggregate signal pushes against the query's framing, positive means it pulls toward, near-zero means neutral or off-topic. It's the pre-filter — fast and effectively free, run on every retrieved chunk before deciding whether the deliberation is worth the latency.
Tiered system: quick score gates the deliberation, deliberation gates the human-facing answer.
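A minimal sketch of what the quick path amounts to, assuming the embeddings are already computed; the gate threshold is illustrative, not Anchor's actual tuning:

```java
final class QuickStance {

    // Cosine similarity between the query embedding and the
    // document-level summary embedding. Already lands in [-1, 1].
    static double score(float[] query, float[] docSummary) {
        double dot = 0, nq = 0, nd = 0;
        for (int i = 0; i < query.length; i++) {
            dot += query[i] * docSummary[i];
            nq  += query[i] * query[i];
            nd  += docSummary[i] * docSummary[i];
        }
        return dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }

    // Near-zero means neutral or off-topic: skip the 30-60s deliberation.
    // A strongly signed score, in either direction, is worth the latency.
    static boolean worthDeliberating(double score) {
        return Math.abs(score) > 0.15;  // illustrative threshold
    }
}
```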
A few engineering decisions worth flagging
I've spent twenty years on distributed systems — Kafka, federated metadata platforms, HPC — and the architectural reflexes from that work show up everywhere in Anchor. Three decisions that probably look pedantic from the outside but are doing real load-bearing work:
Strict DBO/domain/DTO separation. JPA entities (*Dbo) never escape the persistence layer. Pure Java records cross thread boundaries into worker pools. DTOs live in their own module and never enter the service layer. This isn't fashion. Async deliberation across pools is LazyInitializationException city otherwise. The discipline pays for itself the first time two threads touch the same data.
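A compressed sketch of the layering, with hypothetical entity and field names; the point is which type is allowed to cross a thread boundary:

```java
import jakarta.persistence.*;

// Persistence layer only: mutable JPA entity with lazy associations.
// Hand this to a worker pool and LazyInitializationException follows.
@Entity
class ChunkDbo {
    @Id @GeneratedValue Long id;
    String text;
    @ManyToOne(fetch = FetchType.LAZY)
    ChunkDbo parent;  // lazy proxy, only valid inside the session
}

// Domain layer: immutable, fully materialised record.
// Safe to pass between orchestration and inference pools.
record Chunk(long id, String text, Long parentId) {}

// DTO module: what crosses the API boundary; never enters the service layer.
record ChunkDto(long id, String text) {}
```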
Worker pools segmented by inference resource, not by role. One chat pool (the single Gemma slot in LM Studio), one embedding pool (nomic, two slots), plus orchestration pools. Orchestration concurrency and inference concurrency are separately tunable. Thread names (chat-worker-0, deliberation-worker-2) are mandatory for log correlation. This is the difference between "we have a working system" and "we have a system whose latency we can reason about."
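Sketched below, with pool sizes mirroring the setup described above (one chat slot, two embedding slots); the orchestration pool size is illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

final class Pools {
    private static ThreadFactory named(String prefix) {
        AtomicInteger n = new AtomicInteger();
        return r -> new Thread(r, prefix + "-" + n.getAndIncrement());
    }

    // Inference pools: sized to the model server's slots, nothing else.
    static final ExecutorService CHAT  = Executors.newFixedThreadPool(1, named("chat-worker"));
    static final ExecutorService EMBED = Executors.newFixedThreadPool(2, named("embed-worker"));

    // Orchestration pool: tunable independently of inference concurrency.
    static final ExecutorService DELIBERATION =
            Executors.newFixedThreadPool(4, named("deliberation-worker"));
}
```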
A sealed-interface render boundary for parser-internal labels. Halfway through the eval work, the synthesiser's grounded_in_sections was leaking parser-internal strings: "Body", "X Y" from LaTeX-flattened math notation, "Minimize" from LP problem-statement keywords. The fix isn't a string filter. It's a Java sealed interface (StructuralRef.Named / StructuralRef.Synthetic) that the type system enforces. Every render site has to switch on it; the compiler complains if anyone adds a new variant and a render path forgets to handle it. Plus sentinel values in the database (__SYNTHETIC_HEAP__, __SYNTHETIC_SEGMENT__) so any leak past the helper is visibly a bug rather than blending in as a plausible section name.
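The shape of that boundary, as a sketch; the two variant names are from the post, the render method is mine:

```java
sealed interface StructuralRef permits StructuralRef.Named, StructuralRef.Synthetic {

    record Named(String sectionTitle) implements StructuralRef {}
    record Synthetic(String sentinel) implements StructuralRef {}  // e.g. __SYNTHETIC_HEAP__

    // Every render site must switch. Add a third variant and the compiler
    // flags every path that forgot to handle it.
    static String render(StructuralRef ref) {
        return switch (ref) {
            case Named n        -> n.sectionTitle();
            case Synthetic sent -> "";  // parser-internal, never user-visible
        };
    }
}
```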
The whole thing runs through OpenTelemetry. Every span carries an evidence_access tag (FULL_HIERARCHY / MACRO_ONLY / FULL_HIERARCHY_PLUS_DEBATE). Which evidence slice produced which latency is one Jaeger filter away.
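With the standard OpenTelemetry Java API that's roughly the following; the evidence_access values are from the post, the span name and wrapper are my sketch:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

final class CriticTracing {
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("anchor");

    static void runCritic(Runnable critic) {
        Span span = TRACER.spanBuilder("deliberation.critic").startSpan();
        span.setAttribute("evidence_access", "MACRO_ONLY");  // which evidence slice
        try (Scope ignored = span.makeCurrent()) {
            critic.run();
        } finally {
            span.end();
        }
    }
}
```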
The eval
Six math papers across four subdomains: extremal combinatorics (Wagner, Cohen-Addad et al, Norin et al), probabilistic combinatorics (Gladkov-Pak-Zimin), algebraic combinatorics (Duval et al), number theory (Bell-Shallit). 34 hand-authored queries — 25 trap queries that target conjectures the paper states then disproves, 9 control queries where the paper genuinely asserts the claim. Same chat model and same embedding model in both pipelines (gemma-4-e4b-it, nomic-embed-text-v1.5), via the same LM Studio. Only retrieval and grounding logic differ.
Two metric layers:

Substantive correctness — does the answer convey the document's actual stance?

- Trap-query rejection: Anchor 84% (21/25) vs vanilla 48% (12/25), +36 pts
- Control-query assertion: Anchor 78% (7/9) vs vanilla 33% (3/9), +45 pts

Per-chunk role-tag recovery — mechanical, no judge involved:

- Anchor 4% (1/25 traps)
The 4% per-chunk role-tag recovery is the most interesting number, and it deserves a paragraph.
The per-chunk validator is conservative by design. It labels chunks as CITED_EXTERNAL_VIEW (technically correct: the chunk does cite an external view) or BACKGROUND_FACTUAL (also technically correct) even when the deliberation as a whole correctly synthesises the document's refutation. The per-chunk label under-reports what the system substantively achieves. That split between "per-chunk label" and "deliberation answer" is itself the structural finding: the deliberation is doing the work even when the per-chunk validator stays cautious.
Per-paper trap rates: Bell-Shallit, Cranston-Rabern-Steiner-Woodall, and Duval et al all hit 100%. Cranston-Rabern Steinberg 75%, Gladkov-Pak-Zimin Bunkbed 75%, Wagner 60%. Wagner being lowest — the paper I tuned everything on during development — is its own data point. Per-paper cells are 4-5 queries each, so single-query swings move the rate by 20+ points; the aggregate is the defensible figure.
The full per-row data is at eval/results-full/.
A digression on judge calibration
The numbers above are from the third iteration of the LLM-as-judge prompt. The first two iterations produced wrong numbers, and the failure modes are worth recording — both because they're the kind of thing that bites anyone running an LLM-as-judge eval, and because one of them turned out to be a finding about Anchor itself.
Iteration 1: 68% / 56%, +12 pts. The judge was crediting vanilla "the context does not provide..." answers as REJECTS — but those aren't refutations, they're refusals. ~5-7 vanilla wins were this pattern. Separately, ~3 Anchor rows defaulted to "no" because the judge model embedded LaTeX in the JSON reason field and broke the parser.

Iteration 2: 72% / 40%, +32 pts. Tightened to require active rejection language ("refute", "disprove", "counterexample") and disqualified "I don't know" answers. Added a regex fallback for the broken-JSON case. But Wagner specifically dropped to 0% — the strict prompt over-rotated and stopped crediting answers like "the document reports on disproving these conjectures with counterexamples" because they didn't use the precise phrasing "the document refutes claim X".
Iteration 3 (calibrated): 84% / 48%, +36 pts. Softened to credit any answer containing rejection language connected to the document's treatment of the conjecture. Wagner returned to 60%.
The iteration-2 over-rotation surfaced a real characteristic of Anchor's deliberation at this model size: it cites the refutation rather than asserting it as a top-level rejection. Asked "what did Stanley conjecture in 1979," Anchor's answer doesn't lead with "no, the conjecture is false." It says something like "the conjecture was originally proposed by Stanley in 1979... my document proceeds directly to disprove this conjecture using a novel mathematical construction." That's correct behaviour — the document does both state and refute the conjecture, and Anchor is faithfully reporting both — but the answer's grammatical mood is descriptive rather than declarative. A strict judge looking for the word "refute" in the matrix position counts this as a near-miss.
This is probably a Gemma 4 E4B artefact. A stronger chat model would more likely produce "no, this conjecture is false; the paper constructs an explicit counterexample" as the lead sentence. But it's worth flagging because if you're building on top of Anchor (or any deliberation system at this model size), you should expect answers that contain the refutation without fronting it, and your downstream consumers need to handle that gracefully.
If you're doing LLM-as-judge work, plan for at least two iterations and probably three. Hand-categorise the "no" rows after each pass — the calibration errors are visible in the judge's own stance_judge_reason field if you bother to read it.
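One mechanical detail from iteration 2 worth stealing: the regex fallback for judge output whose JSON the embedded LaTeX has broken. A sketch, with a hypothetical verdict field name; the post doesn't specify the real one:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JudgeFallback {
    // Pull the verdict out of the raw judge output when the JSON parser
    // chokes, instead of defaulting the whole row to "no".
    private static final Pattern VERDICT =
            Pattern.compile("\"stance_correct\"\\s*:\\s*\"?(yes|no)\"?",
                            Pattern.CASE_INSENSITIVE);

    static String verdictOrNull(String rawJudgeOutput) {
        Matcher m = VERDICT.matcher(rawJudgeOutput);
        return m.find() ? m.group(1).toLowerCase() : null;
    }
}
```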
A failed prompt-tune (worth recording)
After landing 84%, I tried to push higher by adding a "stance preservation" rule to the synthesiser prompt with explicit DO/DON'T examples targeting the two remaining real Wagner losses. Result: 84% → 76% on traps, Wagner specifically dropped from 60% → 20%. One Wagner answer hallucinated "my investigation using an LP solver suggested the conjecture held true for specific parameters" — the opposite of the document's actual finding. The added DO/DON'T examples appeared to push the model into descriptive/methodological mode rather than the intended stance-explicit mode.
Reverted. The lesson: at this model size, prompt-tuning has a ceiling. Pushing past 84% likely needs a structurally different approach — chain-of-thought scaffolding, structured-output templates, or a stronger chat model — rather than additional prompt iteration.
I'm flagging this because most engineering posts hide the failures. The value of the eval methodology is partly the failures; without the negative result, you don't know whether the system's ceiling is "we ran out of ideas" or "the architecture has a known cap."
Caveats up front
- n=34 queries, 6 papers — worked-example suite, not a benchmark. The aggregate is robust to per-row noise; individual per-paper rates are not.
- Math papers are a friendly domain. Argumentative role is textually marked ("we disprove", "Theorem N states", "counterexample"). Generalisation to chemistry, ML papers, news, legal opinions, narrative is unvalidated.
- Trap queries were authored to expose the failure mode. They lexically match conjecture-statement chunks. An in-the-wild query mix would have a much lower trap density.
- The deliberation isn't free. It takes 30-60 seconds per query and runs 5-7 LLM calls. Vanilla takes 5 seconds and one call. The quick path is effectively free, but on its own it doesn't give you a reasoned answer — just a stance score. Whether you want the deliberation depends on your tolerance for being confidently wrong vs your latency budget.
- LLM-as-judge has measurement noise. Vanilla controls swung from 33% to 44% across two runs of the same data despite temperature=0; ~6% of judged "no" verdicts on this corpus were judge-calibration artefacts. Larger gaps survive this noise; small effects might not.
What's next
What's open, in honest order:

- Cross-model eval: does the gap hold on Claude / GPT-4o / a stronger local model? The cite-vs-reject framing artefact specifically should shrink with a stronger chat model.
- Cross-domain eval: does the math result generalise to chemistry — back to where this started — or to ML papers, or legal opinions?
- Retrieval-side fixes: the quick-path score is a useful pre-filter, but its calibration against the deliberation's verdict hasn't been measured directly; the per-chunk validator inside the deliberation is also conservative and might benefit from less-conservative prompting.
If you're building agentic LLM systems and the steelman-refuted-later failure mode is on your nightmare list, I think the structural pieces here are a real lever — even if the per-chunk validator at this model size still has a tuning curve. The whole thing is Apache 2.0 on GitHub: github.com/myrddian/anchor. Eval data, including the failed prompt-tune as evidence of where the ceiling is, lives in eval/results-full/.
I'm a senior engineer with twenty years in distributed systems, now focused on AI-native platforms — currently available for Staff/Principal AI platform work. If you've worked on similar failure modes in your own RAG systems, or you're building something where this gap matters, drop me an email at hello [at] reyes.id.au or open an issue at github.com/myrddian/anchor/issues — I'd want to compare notes.
— Enzo Reyes