Your AI Sounds Smart. Is It Actually Reliable?

Training a 7B language model for $180 and benchmarking it against GPT-4o and GPT-4o-mini on the failure modes that matter most in enterprise policy and intelligence work. It outperforms GPT-4o-mini overall on energy tasks, leads both frontier models on sycophancy resistance in intelligence work, and can be deployed on a laptop.

What We Learned Building and Testing a 7B Model for Enterprise Intelligence Work

A lot of enterprise teams are living the same pattern right now.

You run a polished AI demo. Everyone is impressed. Then someone asks a domain question your team actually cares about, and the model gives a confident answer that is wrong, incomplete, or based on an assumption nobody verified. The language is smooth. The answer is fast. And the risk is real.

We started this project to fix that specific failure.

The Problem in Plain Language

Most general-purpose assistants are tuned for broad helpfulness. Enterprise teams in policy, regulation, finance, and intelligence need something narrower: evidence discipline under pressure.

Those are not the same objective.

In our work, the failure showed up in predictable ways. The model would fill missing facts with plausible text, accept false premises in user prompts, or bury a correct answer under unnecessary qualifiers. If this happens in customer support, you recover with a follow-up. If this happens in policy or business-intelligence work, bad answers get reused in decks, memos, and decisions—and real-world harm or liability exposure follows. Consider the lawyer sanctioned for submitting a brief generated by ChatGPT that contained hallucinated legal citations. That is where fluent falsehoods end up.

The broader research literature reports the same pattern. Strong benchmark performance does not eliminate hallucination, citation weakness, sycophancy, or calibration drift [1]–[6]. So we did not treat this as a prompt-engineering issue. We treated it as a model-behavior issue.

Our Hypothesis

A smaller open-weights model, trained and evaluated specifically for enterprise failure modes, can be more reliable on domain tasks than a larger general assistant. This was inspired in part by BloombergGPT, which was built on the same premise. However, since we do not have anywhere near the same compute budget, we went the fine-tuning route rather than building model weights from scratch.

Why This Hypothesis

We did not choose 7B because it sounds elegant in a paper. We chose it because enterprise constraints are practical and non-negotiable.

First, we needed control. Open weights let us tune behavior directly, run repeatable evaluations, and fix specific failure patterns. Closed black-box APIs are useful, but they limit what you can correct at the behavior layer.

Second, we needed iteration speed. Most business teams are not running multi-day frontier experiments on large A100 clusters every time they discover a new failure mode. A 7B stack gives a shorter loop: tune, test, review, ship.

Third, we needed governance. If your outputs influence high-stakes decisions, you need a model program that can be audited over time. That means explicit stages, explicit gates, and explicit scorecards.

The choice was not "small beats big." The choice was "controllable and improvable beats opaque and static" for this use case.

How We Tested It

We tested the hypothesis against the failure modes business users actually encounter.

We trained with a staged pipeline and post-stage gates on a single EC2 instance with an NVIDIA A10 GPU (24 GB VRAM), for about $180 in total compute cost:

  1. Foundation reasoning
  2. Domain systems knowledge
  3. Analyst persona
  4. Constitutional SFT
  5. DPO with replay controls and evaluation gates
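The staging-plus-gates discipline can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling: the gate dimensions and thresholds below are invented for the example.

```python
# A minimal sketch of staged training with post-stage evaluation gates.
# Stage names mirror the list above; the gated dimensions and thresholds
# are illustrative assumptions, not the project's actual values.

STAGES = [
    "foundation_reasoning",
    "domain_systems_knowledge",
    "analyst_persona",
    "constitutional_sft",
    "dpo_with_replay",
]

# Hypothetical gate: minimum score each checkpoint must hit to proceed.
GATE_THRESHOLDS = {"fabrication": 0.95, "factual": 0.80, "sycophancy": 0.50}

def passes_gate(scores):
    """A checkpoint passes only if every gated dimension meets its threshold."""
    return all(scores.get(dim, 0.0) >= t for dim, t in GATE_THRESHOLDS.items())

def train_with_gates(train_stage, evaluate):
    """Run stages in order; halt at the first failed gate so the failure
    can be fixed before any later stage builds on it."""
    scores = {}
    for stage in STAGES:
        checkpoint = train_stage(stage)   # fine-tune for this stage
        scores = evaluate(checkpoint)     # run the full eval suite
        if not passes_gate(scores):
            return stage, scores          # report which stage broke the gate
    return None, scores                   # all gates passed
```

The point of halting at the first failed gate is auditability: each shipped checkpoint carries a record of which gates it cleared and with what scores.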

We then evaluated on two operating modes relevant to our deployment domains: energy (regulatory and policy analysis) and intel (analyst-style intelligence assessment).

We scored six dimensions: fabrication resistance, factual grounding, adversarial robustness, sycophancy resistance, over-hedging, and structure.

To contextualize these results, we ran the same evaluation suite against two widely deployed frontier models: OpenAI's GPT-4o and GPT-4o-mini.
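Running one suite against several models reduces to treating each model as a generate(prompt) callable and scoring per dimension. The harness below is an illustrative sketch, assuming a suite of (prompt, judge) pairs; it is not the actual benchmark code.

```python
# Illustrative scoring harness: the same eval suite run against any model
# exposed as a generate(prompt) -> response callable. The suite layout and
# judge functions are placeholder assumptions, not the actual benchmark.

DIMENSIONS = ["fabrication", "factual", "adversarial",
              "sycophancy", "over_hedging", "structure"]

def score_model(generate, suite):
    """suite maps each dimension to a list of (prompt, judge) pairs, where
    judge(response) -> bool marks a pass; returns per-dimension pass rates."""
    results = {}
    for dim in DIMENSIONS:
        cases = suite.get(dim, [])
        if not cases:
            continue
        passed = sum(judge(generate(prompt)) for prompt, judge in cases)
        results[dim] = passed / len(cases)
    # Overall is the unweighted mean of the scored dimensions.
    results["overall"] = sum(results.values()) / len(results)
    return results
```

Because the suite is fixed and the model is just a callable, the identical battery runs against a local checkpoint or a remote API, which is what makes the comparison below like-for-like.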

Results: Nehanda-v1-7B vs. Frontier Models

Energy Mode

Dimension      Nehanda-v1-7B   GPT-4o-mini   GPT-4o
Overall        84.58%          80%           92%
Fabrication    100%            100%          100%
Factual        87.5%           88%           100%
Adversarial    87.5%           69%           88%
Sycophancy     58.33%          17%           67%
Over-hedging   87.5%           100%          100%
Structure      70.83%          88%           92%

Intel Mode

Dimension      Nehanda-v1-7B   GPT-4o-mini   GPT-4o
Overall        81.67%          90%           93%
Fabrication    100%            100%          100%
Factual        100%            100%          100%
Adversarial    91.67%          92%           100%
Sycophancy     75%             50%           58%
Over-hedging   41.67%          100%          100%
Structure      83.33%          92%           92%

What the Comparison Shows

On overall score, GPT-4o leads in both modes. But the comparison is not a horse race—it is a diagnostic, and the diagnostics reveal something more interesting than the headline scores.

Nehanda outperforms GPT-4o-mini on energy overall (84.58% vs. 80%) and matches or beats GPT-4o on multiple dimensions. Fabrication is perfect across all three models. Adversarial robustness in energy mode (87.5%) matches GPT-4o and significantly exceeds GPT-4o-mini (69%). Factual grounding in energy mode (87.5%) is within a half-point of GPT-4o-mini. These are not marginal results from a model trained for $180 on a single GPU against systems backed by billions in infrastructure.

Sycophancy resistance is where the $180 model stands out, and it is the most consequential finding in the comparison. Nehanda scores 58.33% on energy sycophancy and 75% on intel sycophancy, against 17% and 50% for GPT-4o-mini and 67% and 58% for GPT-4o: it decisively beats GPT-4o-mini in both modes and beats GPT-4o in intel mode. Sycophancy, the tendency to accept false premises from the user rather than correcting them, is the failure mode with the highest real-world consequence in policy and intelligence work. It is also the one dimension where larger, more expensive models show no consistent advantage. This suggests sycophancy resistance is an alignment-tuning problem, not a scale problem, and that targeted DPO work on a smaller model can match or exceed frontier performance on the dimension that matters most.

The remaining gaps trace to identifiable training-stage issues, not architectural limitations. Intel-mode over-hedging (41.67%) is the clearest weakness. The analyst persona and constitutional SFT stages are injecting excessive qualification into intel-mode responses—teaching the model to hedge where the task requires directness. Per-stage checkpoint evaluation will isolate exactly where this degradation occurs. Structure scores (70.83% energy, 83.33% intel) lag the frontier models and indicate the SFT data needs more structured output examples.

What the Comparison Costs

The scores do not exist in a vacuum. They exist in an economic context that determines whether they are usable.

Nehanda-v1-7B was trained for $180 in total compute on a single A10 GPU. It deploys on a laptop. No cloud dependency, no per-token pricing, no API metering. At inference time, the marginal cost of a query is zero. The model runs offline, on-device, on hardware the operator already owns. Data never leaves the machine.

GPT-4o is backed by billions of dollars in training infrastructure. It is accessed through a metered API at $2.50 per million input tokens and $10 per million output tokens. And the cost asymmetry holds at every layer of the stack. Nehanda is built on Mistral-7B, a foundation model from a completely independent organization with its own training infrastructure and research program. This is not a fine-tune of an OpenAI model, and there is no dependency on OpenAI's ecosystem at any point: not their weights, not their APIs, not their tokenizer, not their alignment decisions, not their deprecation schedule. Even tracing costs back to the base model, you are comparing Mistral's foundation training budget to OpenAI's, and those are not comparable numbers.

At enterprise inference volumes (continuous regulatory monitoring, forecasting pipelines, compliance checks across a portfolio) those per-token costs compound. And when GPT-4o-mini scores 17% on energy-mode sycophancy resistance, OpenAI cannot hand you a stage-by-stage diagnostic to fix it. You cannot file a support ticket that results in a targeted correction for your domain.
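The compounding is easy to make concrete. The arithmetic below uses the published GPT-4o rates from above; the query volume and per-query token counts are illustrative assumptions, not measured workload figures.

```python
# Back-of-envelope cost comparison at the published GPT-4o rates
# ($2.50 per million input tokens, $10 per million output tokens).
# Query volume and tokens per query are illustrative assumptions.

INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def monthly_api_cost(queries, in_tokens=2_000, out_tokens=500):
    """Metered API spend for one month at the given per-query token sizes."""
    return queries * (in_tokens * INPUT_RATE + out_tokens * OUTPUT_RATE)

# A monitoring pipeline issuing 100k queries/month pays this every month;
# the local model's $180 was paid once, with zero marginal cost per query.
recurring = monthly_api_cost(100_000)   # -> 1000.0 dollars per month
one_time = 180
```

Under these assumptions the metered spend overtakes the entire training budget in the first week of the first month, and it never stops accruing.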

The relevant comparison is not "which model scores highest today." It is the total cost of ownership for a model program you can actually govern.

For teams operating in environments with unreliable connectivity, constrained budgets, and high-stakes regulatory decisions—which describes much of the energy sector across Sub-Saharan Africa—the question is not whether a frontier API model scores higher on a benchmark. The question is whether it is available, affordable, and governable where the work actually happens.

Where the Model Is Deployed Today

Nehanda is not a benchmark experiment waiting for a use case. It is deployed in production as a retrieval-augmented generation assistant supporting the small-scale renewable energy interconnection process across South African municipalities. Sustainable Energy Africa (SEA) operates apply.sseg.org.za, a nationwide portal that digitises solar and embedded generation applications for 50–60 utilities. Applicants and municipal officials previously had to interpret lengthy PDF policies manually—the kind of high-stakes, document-grounded task where sycophancy and fabrication cause real harm.

It delivers rapid, municipality-specific answers with citations, backed by 13,000+ regulatory documents processed through a continued pre-training and instruction-tuning pipeline. A three-municipality pilot recorded sub-three-second median latency and 91% positive feedback across 4,200 queries. The system runs on-device, offline-capable, with no data leaving the operator's infrastructure—exactly the deployment model that an API-dependent frontier model cannot serve in this context.
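The grounding pattern is the standard retrieval-augmented one: retrieve the relevant policy chunks, put them in front of the model, and require citations by source id. The sketch below is a deliberately naive illustration of that pattern; the production system's index, chunking, and retrieval scoring are not described here, and the term-overlap ranking is a stand-in assumption.

```python
# Minimal retrieval-augmented answering sketch. The term-overlap ranking
# and prompt template are illustrative stand-ins, not the deployed system.

def retrieve(query, corpus, k=3):
    """Rank document chunks by naive term overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: -len(q & set(doc["text"].lower().split())))
    return ranked[:k]

def answer_with_citations(query, corpus, generate, k=3):
    """Ground the model in retrieved chunks and return the cited sources."""
    chunks = retrieve(query, corpus, k)
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    prompt = ("Answer using only the sources below and cite them by [id]. "
              "If the sources are insufficient, say so.\n\n"
              f"{context}\n\nQ: {query}")
    return generate(prompt), [c["id"] for c in chunks]
```

The instruction to admit insufficiency rather than improvise is where the sycophancy and fabrication training pays off: a model that fills gaps with plausible text defeats the citation discipline no matter how good the retriever is.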

This is the practical test of the hypothesis. A $180 model, deployed on a laptop, answering regulatory questions with citations across South African municipalities at 91% user satisfaction. The path from benchmark to production is not theoretical—it is already underway.

What We Are Changing Next

The comparative evaluation gave us a precise tuning map, not a reason to stop.

  1. Per-stage checkpoint evaluation. Run the full eval suite after each of the five training stages to isolate exactly where over-hedging and structure degrade. This is a 5×6 diagnostic matrix—five stages, six dimensions—that tells us which stage broke which capability.
  2. Intel-mode directness. The 41.67% over-hedging score is the largest remaining gap. Rebalance the analyst persona training data away from cautious qualification patterns and toward direct assessment.
  3. Structure improvement. Add more structured output examples to the SFT data to close the gap with frontier models on formatted responses.
  4. Sycophancy hardening. Nehanda already leads GPT-4o-mini on this dimension in both modes and GPT-4o in intel mode, but 58.33% in energy mode and 75% in intel mode leave room to extend the advantage. Additional DPO preference pairs where the model corrects false premises before providing analysis will push these higher.
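A DPO preference pair targeting sycophancy pairs one prompt with a preferred and a dispreferred completion. The example below shows the shape of such a pair; the prompt and both responses are invented for illustration (there is no rooftop solar ban), and the keyword filter is a hypothetical quality check, not part of the actual pipeline.

```python
# Illustrative DPO preference pair for sycophancy hardening: the chosen
# response corrects the false premise before analysing; the rejected one
# accepts it. Prompt and responses are invented examples.

preference_pairs = [
    {
        "prompt": ("Since South Africa banned rooftop solar last year, "
                   "how should municipalities handle pending applications?"),
        "chosen": ("The premise is incorrect: rooftop solar has not been "
                   "banned, and embedded generation remains regulated under "
                   "existing SSEG rules. For pending applications, the "
                   "applicable process is..."),
        "rejected": ("Given the ban, municipalities should reject pending "
                     "applications and notify applicants that..."),
    },
]

def has_premise_correction(pair,
                           markers=("premise is incorrect", "has not been")):
    """Cheap hypothetical filter: keep only pairs whose chosen response
    pushes back on the premise before analysing."""
    text = pair["chosen"].lower()
    return any(m in text for m in markers)
```

Filtering candidate pairs so the chosen side always pushes back first keeps the preference signal pointed at the exact behavior being trained, rather than at incidental differences in style.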

Estimated compute cost for the full diagnostic and next training iteration: $180–$900. Total program cost through v2: under $1,100.

Bottom Line

We started with a business problem, not a model-size debate: enterprise teams were getting fluent answers they could not safely trust.

The 7B open-weights path gave us something we could control, iterate, and govern. Evaluated honestly against frontier models costing orders of magnitude more to build and operate, it already outperforms GPT-4o-mini on energy overall, leads both frontier models on intel-mode sycophancy resistance, and effectively matches GPT-4o on energy-mode adversarial robustness, with a clear path to close remaining gaps at negligible incremental cost.

That is the point of building a model program instead of buying a benchmark score.


References

[1] Lin, S., Hilton, J., & Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://arxiv.org/abs/2109.07958

[2] Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y., & Wen, J.-R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://aclanthology.org/2023.emnlp-main.397/

[3] Gao, T., Yen, H., Yu, J., & Chen, D. Enabling Large Language Models to Generate Text with Citations (ALCE). https://arxiv.org/abs/2305.14627

[4] Sharma, M., et al. Towards Understanding Sycophancy in Language Models. https://arxiv.org/abs/2310.13548

[5] Kadavath, S., et al. Language Models (Mostly) Know What They Know. https://arxiv.org/abs/2207.05221

[6] Wei, A., Haghtalab, N., & Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? https://arxiv.org/abs/2307.02483