Eval sweeps that fit in a batch window.

Evals are embarrassingly parallel, yet teams still run them one model at a time on realtime endpoints. Bundle four models and ten thousand prompts into a single batch, get reproducible outputs tied to model versions, and hand the report to your compliance team.

4 models · 10K prompts · 24h window · pricing at launch
We'll tag your request as a model-evals workload and pair you with a design-partner slot.
The workload

You're comparing a new fine-tune to the last one, or to three candidate base models, across a benchmark your domain actually cares about. That's a product of {models} × {prompts} × {seeds} — a few tens of thousands of completions, all non-interactive, all needed before the next deploy decision.
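
As a rough sketch of that fanout (the request shape and file names here are illustrative, not the sference wire format), expanding a benchmark file into the full {models} × {prompts} × {seeds} product is a few lines of Python:

import json
from itertools import product

models = ["qwen3.6-plus", "gemma-4-31b", "ft_a", "ft_b"]
seeds = [0]  # add more seeds to sample each prompt repeatedly

with open("bench.jsonl") as f:
    prompts = [json.loads(line) for line in f]  # e.g. 10,000 rows

# Each (model, prompt, seed) tuple becomes one completion in the sweep.
requests = [
    {"model": m, "id": p["id"], "prompt": p["prompt"], "seed": s}
    for m, p, s in product(models, prompts, seeds)
]
print(len(requests))  # 4 models × 10,000 prompts × 1 seed = 40,000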

Why it's different on sference
Async fit

The whole sweep in one submission.

Submit every (model, prompt, seed) tuple as a single batch with a 24h window. Shards run in parallel on spot capacity. No orchestration code for you to babysit.
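
If the sweep runs as part of CI, the whole submission is one call to the CLI shown in the example further down; this wrapper is a sketch that assumes only the flags visible in that example:

import subprocess

# One submission covers the whole sweep; the CLI echoes the batch ID, ETA and SLA.
subprocess.run(
    [
        "sference", "eval", "./bench.jsonl",
        "--models", "qwen3.6-plus,gemma-4-31b,ft_a,ft_b",
        "--window", "24h",
    ],
    check=True,
)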

BYOM parity

Your fine-tune, same pipeline.

Evaluate your latest fine-tune alongside Qwen, Mistral, and Gemma on identical infrastructure — not across three different APIs with three different rate limits.

Reproducibility

Outputs are pinned, not drifting.

Catalog models are versioned; BYOM models are pinned to the weights you uploaded. Rerun the same batch a month later and the outputs will match. Regulators and reviewers accept this.
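
One way to verify that, assuming the report.jsonl layout shown in the example below (the file names here are placeholders), is to diff the completions of two reruns of the same batch:

import json

def completions(path):
    """Map (id, model) -> completion text from a report.jsonl file."""
    out = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            out[(row["id"], row["model"])] = row["completion"]
    return out

first, rerun = completions("report_april.jsonl"), completions("report_may.jsonl")
assert first == rerun, "pinned models should not drift between reruns"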

Example

Example — sweep 4 models across a 10K-prompt benchmark.

sference — evals
$ sference eval ./bench.jsonl --models qwen3.6-plus,gemma-4-31b,ft_a,ft_b --window 24h
→ fanout 40,000 completions (4 models × 10,000 prompts)
→ batch bch_e42f queued · eta 17h 20m · sla 24h
→ models pinned: qwen3.6-plus@2026-04, gemma-4-31b@2026-04, ft_a@v7, ft_b@v7
▸ 16 shards · 4 EU providers
✓ completed 40,000/40,000 · 16h 48m
✓ report.jsonl + report.html · version manifest exported
input · prompt
{ "id": "bench_0042", "prompt": "Summarise the following clinical note…", "reference": "…" }
output · completion
{
  "id": "bench_0042",
  "model": "ft_a@v7",
  "completion": "…",
  "metrics": { "rouge_l": 0.48, "latency_ms": 612 },
  "_sference": {
    "region": "eu-de-fra",
    "batch": "bch_e42f"
  }
}
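
To turn the raw report into a deploy decision, aggregate the per-completion metrics by model. A minimal sketch, assuming the report.jsonl rows follow the output shape above:

import json
from collections import defaultdict

scores = defaultdict(list)
with open("report.jsonl") as f:
    for line in f:
        row = json.loads(line)
        scores[row["model"]].append(row["metrics"]["rouge_l"])

# Mean ROUGE-L per model, best first.
for model, vals in sorted(scores.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{model:20s} rouge_l={sum(vals) / len(vals):.3f} n={len(vals)}")
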
SLA & cost

Pricing will be announced at launch. Eval sweeps are a natural 24h job; 48h is fine when the next deploy is Monday.

pricing at launch
Window · Typical use · Relative price
1h · Smoke sweep on a small benchmark · Baseline
6h · Same-day regression check · Cheaper
24h · Canonical eval window · Much cheaper
48h · Full matrix sweeps, weekend runs · Cheapest
Recommended models
Qwen3.6 Plus · Catalog

Flagship general catalog model; pinned version, reproducible across reruns.

Gemma 4 31B · Catalog

Dense multimodal alternative; useful as a judge or a side-by-side baseline.

Your fine-tunes · BYOM

Upload weights, pin a version, run side-by-side with catalog models.

Compliance

Export the report as an Annex IV artefact.

Every completion is tied to a pinned model version, region, and batch ID. Export the report as JSONL or HTML and attach it to your EU AI Act technical documentation. Same artefact works for customers' vendor-risk reviews.
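
As a sketch of what that audit trail looks like in practice (field names taken from the example output above, file name is a placeholder), collecting the pinned versions, regions, and batch IDs from a report takes a few lines:

import json

manifest = {"models": set(), "regions": set(), "batches": set()}
with open("report.jsonl") as f:
    for line in f:
        row = json.loads(line)
        manifest["models"].add(row["model"])                 # e.g. ft_a@v7
        manifest["regions"].add(row["_sference"]["region"])  # e.g. eu-de-fra
        manifest["batches"].add(row["_sference"]["batch"])   # e.g. bch_e42f

# Every completion in the sweep traces back to these pinned identifiers.
print({k: sorted(v) for k, v in manifest.items()})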

Early access

Stop paying realtime prices for work that can wait.

We're in early access. Drop your email — if your workload fits, we'll send you API credentials and you're good to go.