Eval sweeps that fit in a batch window.

Evals are embarrassingly parallel, yet teams still run them one model at a time on realtime endpoints. Bundle four models and ten thousand prompts into a single batch, get reproducible outputs tied to model versions, and hand the report to your compliance team.

4 models · 10K prompts · 24h window · pricing at launch
We'll tag your request as a model-evals workload and pair you with a design-partner slot.
The workload

You're comparing a new fine-tune to the last one, or to three candidate base models, across a benchmark your domain actually cares about. That's a product of {models} × {prompts} × {seeds} — a few tens of thousands of completions, all non-interactive, all needed before the next deploy decision.
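
As a rough sketch of that fanout (the request shape and file names here are illustrative, not the sference wire format), expanding a benchmark file into the full {models} × {prompts} × {seeds} product is a few lines of Python:

import json
from itertools import product

models = ["qwen3.6-plus", "gemma-4-31b", "ft_a", "ft_b"]
seeds = [0]  # add more seeds to sample each prompt repeatedly

with open("bench.jsonl") as f:
    prompts = [json.loads(line) for line in f]  # e.g. 10,000 rows

# Each (model, prompt, seed) tuple becomes one completion in the sweep.
requests = [
    {"model": m, "id": p["id"], "prompt": p["prompt"], "seed": s}
    for m, p, s in product(models, prompts, seeds)
]
print(len(requests))  # 4 models × 10,000 prompts × 1 seed = 40,000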

Why it's different on sference
Async fit

The whole sweep in one submission.

Submit every (model, prompt, seed) tuple as a single batch with a 24h window. Shards run in parallel on spot capacity. No orchestration code for you to babysit.
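
If the sweep runs as part of CI, the whole submission is one call to the CLI shown in the example further down; this wrapper is a sketch that assumes only the flags visible in that example:

import subprocess

# One submission covers the whole sweep; the CLI echoes the batch ID, ETA and SLA.
subprocess.run(
    [
        "sference", "eval", "./bench.jsonl",
        "--models", "qwen3.6-plus,gemma-4-31b,ft_a,ft_b",
        "--window", "24h",
    ],
    check=True,
)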

BYOM parity

Your fine-tune, same pipeline.

Evaluate your latest fine-tune alongside Qwen, Mistral, and Gemma on identical infrastructure — not across three different APIs with three different rate limits.

Reproducibility

Outputs are pinned, not drifting.

Catalog models are versioned; BYOM models are pinned to the weights you uploaded. Rerun the same batch a month later and the outputs will match. Regulators and reviewers accept this.
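
One way to verify that, assuming the report.jsonl layout shown in the example below (the file names here are placeholders), is to diff the completions of two reruns of the same batch:

import json

def completions(path):
    """Map (id, model) -> completion text from a report.jsonl file."""
    out = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            out[(row["id"], row["model"])] = row["completion"]
    return out

first, rerun = completions("report_april.jsonl"), completions("report_may.jsonl")
assert first == rerun, "pinned models should not drift between reruns"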

Example

Example — sweep 4 models across a 10K-prompt benchmark.

sference — evals
$ sference eval ./bench.jsonl --models qwen3.6-plus,gemma-4-31b,ft_a,ft_b --window 24h
→ fanout 40,000 completions (4 models × 10,000 prompts)
→ batch bch_e42f queued · eta 17h 20m · sla 24h
→ models pinned: qwen3.6-plus@2026-04, gemma-4-31b@2026-04, ft_a@v7, ft_b@v7
▸ 16 shards · 4 EU providers
✓ completed 40,000/40,000 · 16h 48m
✓ report.jsonl + report.html · version manifest exported
input · prompt
{ "id": "bench_0042", "prompt": "Summarise the following clinical note…", "reference": "…" }
output · completion
{
  "id": "bench_0042",
  "model": "ft_a@v7",
  "completion": "…",
  "metrics": { "rouge_l": 0.48, "latency_ms": 612 },
  "_sference": {
    "region": "eu-de-fra",
    "batch": "bch_e42f"
  }
}
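
To turn the raw report into a deploy decision, aggregate the per-completion metrics by model. A minimal sketch, assuming the report.jsonl rows follow the output shape above:

import json
from collections import defaultdict

scores = defaultdict(list)
with open("report.jsonl") as f:
    for line in f:
        row = json.loads(line)
        scores[row["model"]].append(row["metrics"]["rouge_l"])

# Mean ROUGE-L per model, best first.
for model, vals in sorted(scores.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{model:20s} rouge_l={sum(vals) / len(vals):.3f} n={len(vals)}")
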
SLA & cost

Pricing will be announced at launch. Eval sweeps are a natural 24h job; 48h is fine when the next deploy is Monday.

pricing at launch
Window · Typical use · Relative price
1h · Smoke sweep on a small benchmark · Baseline
6h · Same-day regression check · Cheaper
24h · Canonical eval window · Much cheaper
48h · Full matrix sweeps, weekend runs · Cheapest
Recommended models
Qwen3.6 Plus · Catalog

Flagship general catalog model; pinned version, reproducible across reruns.

Gemma 4 31B · Catalog

Dense multimodal alternative; useful as a judge or a side-by-side baseline.

Your fine-tunes · BYOM

Upload weights, pin a version, run side-by-side with catalog models.

Compliance

Export the report as an Annex IV artefact.

Every completion is tied to a pinned model version, region, and batch ID. Export the report as JSONL or HTML and attach it to your EU AI Act technical documentation. Same artefact works for customers' vendor-risk reviews.
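
As a sketch of what that audit trail looks like in practice (field names taken from the example output above, file name is a placeholder), collecting the pinned versions, regions, and batch IDs from a report takes a few lines:

import json

manifest = {"models": set(), "regions": set(), "batches": set()}
with open("report.jsonl") as f:
    for line in f:
        row = json.loads(line)
        manifest["models"].add(row["model"])                 # e.g. ft_a@v7
        manifest["regions"].add(row["_sference"]["region"])  # e.g. eu-de-fra
        manifest["batches"].add(row["_sference"]["batch"])   # e.g. bch_e42f

# Every completion in the sweep traces back to these pinned identifiers.
print({k: sorted(v) for k, v in manifest.items()})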

Early access

Stop paying realtime prices for work that can wait.

We're in early access. Drop your email — if your workload fits, we'll send you API credentials and you're good to go.