Local & on-prem LLMs

LLM Studio

For clients with strict data-residency requirements, we run and fine-tune models locally via LLM Studio — no data leaves the perimeter.

Why local models matter

Not every business can send sensitive data to a cloud model. Regulated industries, IP-heavy teams, and companies with strict data-residency contracts often need the model to run inside their perimeter. LLM Studio is the tool that makes that practical — and the open-weights ecosystem is strong enough today that most of the workflows we ship can run entirely on-prem without losing meaningful quality.

Local models we actually deploy

These are the families we reach for most often. Each is available in multiple sizes and in quantised GGUF formats for CPU-friendly inference.

General-purpose reasoning

  • Llama 3.x (8B, 70B) — our default workhorse. 70B runs well on a single GPU server; 8B fits on a laptop.
  • Qwen 2.5 (7B, 14B, 32B, 72B) — strongest open family for long-context reasoning and multilingual work.
  • Mistral & Mixtral (7B, 8x7B, 8x22B) — Mistral 7B is a strong dense model; the Mixtral MoE variants punch above their weight on latency-per-token.
  • Gemma 2 (9B, 27B) — tight Google-trained models that are reliable and easy to fine-tune.
  • DeepSeek-V2 / V3 — when the workload is math-heavy or needs strong instruction following on a budget.
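Most local runtimes (llama.cpp's server, vLLM, Ollama, and similar) expose an OpenAI-compatible HTTP API, so swapping between these families is often a one-line change. A minimal sketch — the endpoint URL, port, and model name below are placeholders for illustration, not any product's defaults:

```python
import json
import urllib.request

# Placeholder endpoint; point this at whatever your local runtime serves.
LOCAL_ENDPOINT = "http://localhost:1234/v1/chat/completions"

def build_chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Assemble an OpenAI-style chat-completions payload for a local runtime."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask_local(model: str, prompt: str) -> str:
    """POST to the local server and return the first completion.

    Requires a runtime already listening on LOCAL_ENDPOINT.
    """
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request shape is the same as the cloud APIs, an eval harness written against one backend can be re-pointed at another without rewriting prompts.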

Small and on-device

  • Phi-3 / Phi-4 mini — 3–4B parameters, competitive with much larger models on structured tasks.
  • Llama 3.2 1B / 3B — phone- and edge-device-friendly.
  • Qwen 2.5 1.5B — surprising quality for its size, good for retrieval-augmented tasks.

Coding

  • Qwen2.5-Coder (7B, 14B, 32B) — our default for anything that has to read or write code.
  • DeepSeek-Coder-V2 — strong alternative, especially for repo-scale context.
  • CodeLlama — older but battle-tested for Python and JS generation.

Embeddings (for retrieval pipelines)

  • bge-m3 / bge-large-en-v1.5 — multilingual, dense, proven.
  • nomic-embed-text — Apache-licensed, excellent quality-per-byte.
  • jina-embeddings-v3 — when long-chunk embeddings matter.
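Whichever embedding model produces the vectors, the retrieval step itself is just nearest-neighbour search over cosine similarity. A minimal sketch with toy vectors standing in for real model outputs (in production the vectors would come from bge-m3 or nomic-embed-text, and an index like FAISS would replace the linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda iv: cosine(query_vec, iv[1]),
        reverse=True,
    )
    return [i for i, _ in scored[:k]]

# Toy 2-d vectors for illustration; real embeddings are hundreds of dimensions.
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], docs, k=2))  # → [0, 2]
```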

How we pick a model for a given workflow

Three questions, in this order:

  1. Data sensitivity — does the data have to stay on-prem? If yes, we're in local-model territory from day one.
  2. Task shape — simple classification and extraction rarely need more than 7–14B. Multi-step reasoning benefits from 32B+.
  3. Hardware budget — a single RTX 4090 comfortably runs a quantised 14B model. A 2× L40S or H100 node runs 70B at good latency. We size once and let it run.
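The three questions above can be sketched as a routing function. The tier names and the 48 GB VRAM threshold are illustrative assumptions for this sketch, not a product API:

```python
def pick_model_tier(on_prem_required: bool,
                    multi_step_reasoning: bool,
                    vram_gb: int) -> str:
    """Illustrative routing of the three sizing questions, in order."""
    # 1. Data sensitivity: if data can leave the perimeter, everything is on the table.
    if not on_prem_required:
        return "cloud or local - choose on cost/latency"
    # 2. Task shape: multi-step reasoning benefits from 32B+ when the hardware allows.
    if multi_step_reasoning:
        # Assumed threshold: ~48 GB VRAM (e.g. 2x L40S or H100-class) for 32B+.
        if vram_gb >= 48:
            return "open-weights 32B+"
        # 3. Hardware budget forces a fallback to a quantised mid-size model.
        return "quantised 14B (best fit for the hardware)"
    # Simple classification/extraction rarely needs more than 7-14B.
    return "open-weights 7-14B"
```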

The trade-offs we are honest about

A well-chosen open-weights model on decent hardware gets to 85–95% of frontier performance for most workflows. It will rarely match a flagship model on the hardest multi-step reasoning. We make that trade-off explicit during Phase 4 feasibility, so the business owner decides — not the engineer.

Quantisation (Q4_K_M, Q5_K_M, Q8) can cut memory and latency dramatically at the cost of 1–3 percentage points of quality on our evals. We test each combination before committing to it.
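The memory arithmetic behind those quantisation levels is simple: weight memory is parameter count times bits per weight. The bits-per-weight figures below are approximate averages for llama.cpp's mixed-precision quants, and the estimate covers weights only — KV cache and activations come on top:

```python
# Approximate average bits per weight for common llama.cpp quantisation levels.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Rough GB needed for model weights alone at a given quantisation level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# A 70B model at Q4_K_M needs ~42 GB for weights; a 14B fits a 24 GB card.
print(round(weight_memory_gb(70, "Q4_K_M"), 1))
print(round(weight_memory_gb(14, "Q4_K_M"), 1))
```

This is why a quantised 14B sits comfortably on a single RTX 4090 while a 70B wants a 2× L40S or H100 node.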

Typical QwertyBit on-prem stack

  • A quantised open-weights model sized to the workload (commonly Llama 3.x 70B or Qwen 2.5 32B).
  • A retrieval layer over the client's own documents, with local embeddings (bge-m3 or nomic).
  • An eval harness run on the client's own hardware so we can compare model upgrades safely.
  • Full tracing and cost-per-run accounting — even on-prem, you should know what each request costs to execute.
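On-prem cost-per-run is amortised hardware plus energy rather than a per-token API price. A minimal sketch of that accounting — the formula and the default electricity price are illustrative assumptions, not our billing model:

```python
def cost_per_request(hardware_cost_eur: float,
                     lifetime_requests: int,
                     power_kw: float,
                     seconds_per_request: float,
                     eur_per_kwh: float = 0.25) -> float:
    """Amortised hardware cost plus energy cost for one request (illustrative)."""
    # Spread the capital cost over the requests served across the hardware's lifetime.
    amortised = hardware_cost_eur / lifetime_requests
    # Energy drawn while the request is being served.
    energy = power_kw * (seconds_per_request / 3600) * eur_per_kwh
    return amortised + energy

# e.g. a EUR 30k node over 1M requests, 1 kW draw, 3 s per request.
print(round(cost_per_request(30_000, 1_000_000, 1.0, 3.0), 4))
```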

Next steps

If you have data-residency constraints and want to know what is actually achievable on-prem today, book a business audit — we will give you a clear-eyed answer.

Ready to see where agents can cut your costs?

Tell us about the process you want to optimise. Vlad personally reviews every brief and replies within one business day.