Large language models deployed in policy analysis and intelligence work face a critical failure mode: epistemic collapse under adversarial narrative pressure. When prompted with leading questions, false premises, or requests to extrapolate beyond available evidence, even frontier models tend toward sycophantic agreement, fabrication of plausible-sounding sources, or confident assertions unsupported by evidence. These failure modes are particularly dangerous in high-stakes domains where analytical outputs inform real decisions.
This paper presents a stacked cognitive training methodology and a custom six-dimension evaluation harness designed to measure epistemic robustness in adversarial conditions. We report three generations of results: Nehanda v1 (7B), which baked domain knowledge into model weights via continued pre-training and beat both GPT-4o-mini and GPT-4o on sycophancy resistance despite being orders of magnitude smaller; Nehanda v2 (32B), which shifted domain knowledge to RAG and revealed a universal failure on multi-turn consistency; and Nehanda v2.1, which targeted the multi-turn gap with synthetic DPO training data and now outperforms GPT-5 Mini on multi-turn epistemic consistency.
Can stacked cognitive sequencing — where each training stage builds on the epistemic foundations laid by previous stages — produce stronger epistemic grounding than larger general-purpose LLMs trained with broader but shallower alignment approaches?
This research was motivated by a concrete deployment: Sustainable Energy Africa (SEA) manages apply.sseg.org.za, a Small-Scale Embedded Generation (SSEG) application platform serving 50–60 of South Africa's 167 electrical utilities. The platform facilitates rooftop solar and other embedded generation permit applications, each governed by municipality-specific requirements — capacity limits, permitted inverter types, grid connection rules, and tariff structures. Two user classes need AI-assisted guidance: applicants navigating the permit process, and municipal officials verifying requirements against national and local policy.
In this environment, epistemic robustness is operational, not academic. When an applicant asks about permitted system specifications and the answer depends on their specific municipality's rules, or when an official needs to verify whether a national regulatory change supersedes a local bylaw, a fabricated citation or sycophantic agreement with a false premise has direct consequences: non-compliant installations, rejected applications, incorrect approvals, and wasted capital expenditure. The model must know what it doesn't know, cite specific municipal and national documents, and correct false assumptions rather than validate them.
Standard benchmarks like MMLU and HellaSwag don't test whether a model will invent a NERSA regulation, agree with an incorrect claim about Eskom purchasing policy, or bury a clear tariff answer in defensive hedging. The custom evaluation harness described below was built to test exactly these failure modes, and the stacked training methodology was the response to what the harness revealed about frontier model behaviour.
What epistemic failure looks like in practice:
The sycophantic model validates the false premise and builds advice on top of it. The epistemically honest model corrects the error before proceeding. In a regulatory context, the first response leads to a fundamentally misstructured deal.
The v1 pipeline trained Mistral-7B-v0.3 through seven sequential stages using QLoRA (rank 32, all projection modules, 4-bit quantization) on a single A10G GPU. Total compute cost: approximately $180.
- Foundation — Generic instruction-following on Dolly-15k plus OpenMath reasoning data. No persona prompt.
- Energy CPT — Continued pre-training on an energy and regulatory corpus. Up to 25,000 CPT samples ingested into weights.
- Energy Analyst SFT — Supervised fine-tuning on energy domain Q&A with a regulatory analyst persona prompt.
- Intel CPT — Continued pre-training on an intelligence corpus.
- Intel Analyst SFT — Intel domain Q&A with a signal-detection persona prompt.
- Constitutional SFT — Alignment training with a replay buffer auto-calibrated by an eval gate.
- Constitutional DPO — Direct preference optimisation on paired responses demonstrating epistemic honesty.
We built a custom evaluation harness with two operating modes — Energy (regulatory, SSEG, policy) and Intelligence (signal detection, OPSEC, source validation). Six dimensions, each targeting a distinct failure mode:
- Fabrication resistance — Present a real document excerpt, then ask about information not contained in it.
- Factual grounding — Direct domain knowledge questions with known expected answers.
- Adversarial robustness — Present false claims and require the model to identify them as incorrect.
- Sycophancy resistance — Embed false premises in user questions; the model must correct rather than validate.
- Over-hedging control — Present clear, unambiguous answers and check the model states them directly.
- Structural quality — Complex analytical prompts requiring multi-part structured outputs.
Two-tier scoring: Layer 1 uses deterministic keyword matching (fast, reproducible). Layer 2 uses an LLM judge with structured rubrics (1.0 / 0.5 / 0.0) for nuanced evaluation. v2/v2.1 used GPT-4o as judge; v2.2 used Claude Opus 4.6. The effective score uses the judge when available.
Energy Mode
| Dimension | Nehanda v1-7B | GPT-4o-mini | GPT-4o |
|---|---|---|---|
| Overall | 84.58% | 80% | 92% |
| Fabrication | 100% | 100% | 100% |
| Factual | 87.5% | 88% | 100% |
| Adversarial | 87.5% | 69% | 88% |
| Sycophancy | 58.33% | 17% | 67% |
| Over-hedging | 87.5% | 100% | 100% |
| Structure | 70.83% | 88% | 92% |
Intelligence Mode
| Dimension | Nehanda v1-7B | GPT-4o-mini | GPT-4o |
|---|---|---|---|
| Overall | 81.67% | 90% | 93% |
| Fabrication | 100% | 100% | 100% |
| Factual | 100% | 100% | 100% |
| Adversarial | 91.67% | 92% | 100% |
| Sycophancy | 75% | 50% | 58% |
| Over-hedging | 41.67% | 100% | 100% |
| Structure | 83.33% | 92% | 92% |
The standout finding: a $180, 7B-parameter model beat both GPT-4o-mini and GPT-4o on sycophancy resistance. When an analyst asks a leading question based on a false assumption, Nehanda was more likely to correct them than models with 100× the parameters. For a detailed comparison against ChatGPT, see the full analysis →
The v1 pipeline spent two of its seven stages on continued pre-training (CPT). This worked — but it raised a question: was baking domain knowledge into weights the best use of a limited training budget?
The v2 architectural insight: domain knowledge doesn't need to live in weights when RAG can provide it at inference time. By moving factual grounding to a retrieval layer, the entire training budget could be redirected toward what RAG cannot provide — epistemic discipline, adversarial resilience, and calibrated confidence.
- CPT stages removed — Domain knowledge served via RAG at query time, always current, always citable.
- Mistral 7B → Qwen 2.5-32B — With domain knowledge moved to RAG, the training budget could invest in a larger base model for deeper reasoning capacity. The two viable open-weight options with Apache 2.0 licensing and available base (pretrained, not instruction-tuned) checkpoints were Mistral Small (24B) and Qwen 2.5 (32B). We chose Qwen for two reasons. First, parameter count matters for epistemic reasoning — the 33% capacity advantage (32B vs 24B) directly benefits the multi-step analytical judgements where v1 was weakest: structural quality (v1: 70.8%) and over-hedging control (v1: 41.7% intel), dimensions that require the model to hold multiple evidence threads in working memory while constructing a coherent position. Second, both models fit on a single 44GB L40S GPU in 4-bit quantization, so the larger model imposed no additional infrastructure cost — same hardware, same training budget, 33% more reasoning capacity. Starting from a base model rather than an instruct checkpoint is important for the stacked approach: building epistemic alignment from scratch across 5 sequential stages means the base weights should not carry competing alignment patterns from prior instruction tuning.
- Lower cost despite 4.5× larger model — v2 trained for ~$45 on a single g6e.4xlarge (NVIDIA L40S). Removing CPT eliminated the two most expensive stages, and the L40S delivers substantially faster throughput than the A10G used for v1. Total including eval: ~$95 — roughly half the v1 cost.
v1 Architecture
- 7 stages
- 2 CPT stages (domain knowledge in weights)
- 2 domain SFT stages
- 1 foundation stage
- 1 constitutional SFT
- 1 constitutional DPO
- Domain knowledge: baked into weights
v2 Architecture
- 5 stages
- 1 epistemic foundation
- 1 epistemic hardening SFT
- 1 RAG synthesis SFT
- 1 constitutional SFT + replay
- 1 constitutional DPO
- Domain knowledge: served via RAG
The v2 evaluation introduced a 3-phase design with progressively harder cognitive requirements:
Phase 1 (Table Stakes) — 24 recall-level tests. Any competent model should score near-perfect.
Phase 2 (Single Hard) — 48 higher-order tests: comparison across conflicting sources, analysis of embedded falsehoods, evaluation of scope limits, cross-source synthesis.
Phase 3 (Multi-Turn) — 4 sequences of 4 turns each. The model sees its own previous answers and must resist sycophancy follow-ups that push it to abandon correct positions.
Recall-level
reading comprehension
Higher-order reasoning
conflicting sources,
embedded falsehoods,
scope evaluation
Sustained pressure
sycophancy follow-ups
position maintenance
The v2 model performed strongly on Phases 1 and 2 but the 3-phase evaluation revealed a critical gap: universal failure on multi-turn consistency when conflicting sources meet sycophancy pressure.
Phase 3 — Multi-Turn Sequences (v2)
| Sequence | v2 Energy | v2 Intel | GPT-5 Mini Energy | GPT-5 Mini Intel |
|---|---|---|---|---|
| Overall | 56.2% | 59.4% | 65.6% | 75.0% |
| Consistency | 43.8% | 50.0% | 37.5% | 50.0% |
| Comparison (conflicting sources) | 0.0% | 0.0% | 0.0% | 12.5% |
| Analysis (embedded falsehoods) | 100% | 100% | 100% | 100% |
| Evaluation (scope limits) | 87.5% | 100% | 75.0% | 87.5% |
| Synthesis (cross-source calculation) | 37.5% | 37.5% | 87.5% | 100% |
The comparison sequence — where two credible sources disagree and a sycophancy follow-up pressures the model to pick one — produced 0% for both Nehanda v2 and GPT-5 Mini in energy mode. Every model capitulated.
The 3-phase eval identified three specific training gaps:
- No multi-turn training data — Every stage formatted examples as single instruction/response pairs. The model had never seen training data where it must maintain a position across follow-up turns.
- Energy-intel asymmetry — Training data heavily weighted toward energy-domain sources.
- Prompt-eval conflict — Stage 3's instruction "Lead with the answer" conflicted with the requirement to preserve source disagreement.
The v2.1 intervention:
- 152 multi-turn DPO pairs generated from the eval's own sequence templates, with chosen responses maintaining position and rejected responses capitulating.
- 56 conflicting-source DPO pairs from retired baseline tests, teaching the model to preserve disagreement.
- 205 intel-domain SFT examples covering geopolitical analysis, cyber threats, nuclear proliferation, economic espionage, and influence operations.
- 183 sycophancy resistance SFT records adapted from Anthropic's public sycophancy evaluation dataset.
- Prompt rebalancing — "Lead with the answer" changed to "Lead with what the evidence supports" with explicit conflict preservation.
Total v2.1 incremental cost: ~$15 on the same g6e.4xlarge instance.
Phase 3 — Multi-Turn Sequences (v2 → v2.1 vs GPT-5 Mini)
| Sequence | v2 | v2.1 | GPT-5 Mini |
|---|---|---|---|
| Energy | |||
| Overall | 56.2% | 71.9% | 65.6% |
| Consistency | 43.8% | 68.8% | 37.5% |
| Comparison | 0.0% | 75.0% | 0.0% |
| Analysis | 100% | 87.5% | 100% |
| Evaluation | 87.5% | 87.5% | 75.0% |
| Synthesis | 37.5% | 37.5% | 87.5% |
| Intelligence | |||
| Overall | 59.4% | 71.9% | 75.0% |
| Consistency | 50.0% | 68.8% | 50.0% |
| Comparison | 0.0% | 62.5% | 12.5% |
| Analysis | 100% | 100% | 100% |
| Evaluation | 100% | 100% | 87.5% |
| Synthesis | 37.5% | 25.0% | 100% |
The comparison sequence — which every model failed at 0% in v2 — moved to 75% (energy) and 62.5% (intel) in v2.1. GPT-5 Mini still fails at 0% (energy) and 12.5% (intel). Multi-turn consistency improved from 43.8%/50.0% to 68.8%/68.8%, substantially exceeding GPT-5 Mini's 37.5%/50.0%.
Comparison Sequence: Conflicting Sources Under Sycophancy Pressure (Energy)
The hardest test in the eval: two credible sources disagree, then a follow-up pressures the model to pick one. Every model collapsed at 0% before v2.1. 152 targeted DPO pairs moved Nehanda to 75%.
Phase 2 — Single Hard Questions (v2.1)
| Dimension | v2.1 Energy | v2.1 Intel | GPT-5 Mini Energy | GPT-5 Mini Intel |
|---|---|---|---|---|
| Overall | 78.9% | 78.5% | 84.5% | 84.0% |
| Fabrication | 40.0% | 30.0% | 40.0% | 50.0% |
| Factual | 100% | 100% | 100% | 100% |
| Adversarial | 100% | 100% | 100% | 100% |
| Sycophancy | 90.0% | 100% | 100% | 100% |
| Over-hedging | 75.0% | 75.0% | 87.5% | 75.0% |
| Structure | 72.2% | 66.7% | 88.9% | 83.3% |
Weighted Overall Scores
| v2 | v2.1 | GPT-5 Mini | |
|---|---|---|---|
| Energy | 70.2% | 78.0% | 78.5% |
| Intel | 73.0% | 77.6% | 81.6% |
Weights: Phase 1 (10%) + Phase 2 (50%) + Phase 3 (40%). Phase 3 is weighted heavily because multi-turn consistency is the most operationally relevant dimension.
The v2.1 intervention proved that targeted DPO pairs could move specific dimensions. v2.2 scaled this approach systematically: 3 new DPO modes (sycophancy resistance, cross-source conflict, reinforcement) and 6 new SFT modes (multi-turn synthesis, cross-source analysis, concise format, credibility weighting, sycophancy resistance, and direct synthesis scaling). Total training data approximately doubled.
The result: perfect multi-turn consistency — 100% in both energy and intelligence domains, up from 68.8%. The model now holds its epistemic position under sustained pressure without exception across all 4 multi-turn sequences.
Phase 3 — Multi-Turn Sequences (v2 → v2.1 → v2.2)
| Metric | v2 | v2.1 | v2.2 | Claude Opus 4.6 | GPT-5 Mini |
|---|---|---|---|---|---|
| Energy | |||||
| Overall | 56.2% | 71.9% | 81.2% | 93.8% | 65.6% |
| Consistency | 43.8% | 68.8% | 100% | 100% | 37.5% |
| Intelligence | |||||
| Overall | 59.4% | 71.9% | 81.2% | 93.8% | 75.0% |
| Consistency | 50.0% | 68.8% | 100% | 100% | 50.0% |
Phase 2 — Single Hard Questions (v2.2)
| Dimension | v2.2 Energy | v2.2 Intel | Opus Energy | Opus Intel | GPT-5 Energy | GPT-5 Intel |
|---|---|---|---|---|---|---|
| Overall | 74.8% | 79.2% | 92.4% | 95.6% | 84.5% | 84.0% |
| Fabrication | 60.0% | 90.0% | 80.0% | 90.0% | 40.0% | 50.0% |
| Factual | 33.3% | 33.3% | 100% | 100% | 100% | 100% |
| Adversarial | 100% | 100% | 100% | 100% | 100% | 100% |
| Sycophancy | 100% | 100% | 100% | 100% | 100% | 100% |
| Over-hedging | 62.5% | 62.5% | 87.5% | 100% | 87.5% | 75.0% |
| Structure | 72.2% | 50.0% | 83.3% | 72.2% | 88.9% | 83.3% |
Weighted Overall Scores (v2 → v2.1 → v2.2)
| v2 | v2.1 | v2.2 | Claude Opus 4.6 | GPT-5 Mini | |
|---|---|---|---|---|---|
| Energy | 70.2% | 78.0% | 82.3% | 92.9% | 78.5% |
| Intel | 73.0% | 77.6% | 82.2% | 95.4% | 81.6% |
Claude Opus 4.6 evaluated on the same test prompts under epistemic isolation (no access to Nehanda’s training data or eval history). v2.2 and Opus scored by Claude Opus 4.6 judge; v2/v2.1 and GPT-5 Mini scored by GPT-4o. Weights: Phase 1 (20%) + Phase 2 (35%) + Phase 3 (45%).
Claude Opus 4.6 dominates on single-turn questions (92–96% Phase 2), as expected from a frontier model with orders of magnitude more parameters. GPT-5 Mini scores 84% Phase 2. Nehanda v2.2 scores 75–79% — competitive for a $135 fine-tuned 32B model. On the dimensions that matter most for deployment — adversarial resistance, sycophancy resistance, and multi-turn consistency — both Nehanda and Claude Opus achieve 100%, while GPT-5 Mini collapses to 37.5–50% consistency under sustained pressure.
Incremental cost for v2.2 training data generation and retraining: ~$25 on the same g6e.4xlarge instance. Total project cost including all iterations: ~$135.
- Factual keyword matching — v2.2 scores 33.3% on Phase 2 factual in both domains. This reflects the keyword scorer’s sensitivity to exact phrasing rather than substantive accuracy — responses that are factually correct but use different wording score zero. A judge-based factual scorer would likely show higher true performance.
- Embedded falsehood detection — When a source contains a planted false figure, v2.2 sometimes accepts it at face value (hard_fab_002: 0.0). Detecting falsehoods embedded within otherwise credible sources remains the hardest fabrication subtype.
- Structure and formatting — Intel structure scores 50% on Phase 2, reflecting formatting conventions (step-by-step, checklists) that the keyword scorer penalises when the model uses a different valid structure.
Single-turn benchmarks systematically overstate model capability. Both Nehanda and GPT-5 Mini score 85–100% on Phase 1. The differentiating signal only appears under sustained conversational pressure.
The v2.1 multi-turn DPO intervention demonstrated that targeted training data — 152 preference pairs built from the eval’s own failure cases — could move a specific epistemic dimension from 0% to 75%. v2.2 scaled this approach systematically and achieved perfect multi-turn consistency — 100% in both domains, matching Claude Opus 4.6 and far exceeding GPT-5 Mini’s 37.5%–50%.
The 3-way comparison reveals a clear hierarchy: Claude Opus 4.6 leads on single-turn reasoning (92–96% Phase 2), as expected from a frontier model. But on the dimensions that matter most for high-stakes deployment — adversarial resistance, sycophancy resistance, and multi-turn epistemic consistency — the $135 fine-tuned 32B model matches the frontier model at 100%. The gap between Nehanda and Claude is on general capability; the gap between both and GPT-5 Mini is on epistemic discipline under pressure.
The broader lesson: epistemic alignment is not a property that emerges from scale alone — but targeted fine-tuning can close the gap with frontier models on the dimensions that matter most, while scale provides the headroom for general capability. The optimal deployment strategy may be to use the fine-tuned specialist where epistemic integrity is critical, and the frontier model where general reasoning depth is needed.
This is not an accident of undertrained models. As OpenAI’s own research on why language models hallucinate makes clear, hallucination, sycophancy, and epistemic failures in general-purpose models are products of training priorities — deliberate trade-offs that optimise for fluency, helpfulness, and broad capability at the expense of calibrated uncertainty. A model trained to always produce an answer will produce answers even when the evidence doesn’t support one. The Nehanda methodology inverts this priority: every training stage explicitly rewards refusal to fabricate, resistance to social pressure, and preservation of source disagreement. The result is a model that scores lower on general capability but higher on the epistemic dimensions that determine whether a deployment is trustworthy.