Local & on-prem LLMs

LLM Studio

For clients with strict data-residency requirements, we run and fine-tune models locally via LLM Studio — no data leaves the perimeter.

Why local models matter

Not every business can send sensitive data to a cloud model. Regulated industries, IP-heavy teams, and companies with strict data-residency contracts often need the model to run inside their perimeter. LLM Studio is the tool that makes that practical — and the open-weights ecosystem is strong enough today that most of the workflows we ship can run entirely on-prem without losing meaningful quality.

Local models we actually deploy

These are the families we reach for most often. Each is available in multiple sizes and in quantised GGUF formats for CPU-friendly inference.

General-purpose reasoning

  • Llama 3.x (8B, 70B) — our default workhorse. 70B runs well on a single GPU server; 8B fits on a laptop.
  • Qwen 2.5 (7B, 14B, 32B, 72B) — strongest open family for long-context reasoning and multilingual work.
  • Mistral & Mixtral (7B, 8x7B, 8x22B) — Mistral 7B is a strong dense model; the Mixtral MoE variants punch above their weight on latency-per-token.
  • Gemma 2 (9B, 27B) — tight Google-trained models that are reliable and easy to fine-tune.
  • DeepSeek-V2 / V3 — when the workload is math-heavy or needs strong instruction following on a budget.
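Most local runtimes (llama.cpp's server, vLLM, Ollama, and similar) expose an OpenAI-compatible HTTP API, so swapping between these families is often a one-line change. A minimal sketch — the endpoint URL, port, and model name below are placeholders for illustration, not any product's defaults:

```python
import json
import urllib.request

# Placeholder endpoint; point this at whatever your local runtime serves.
LOCAL_ENDPOINT = "http://localhost:1234/v1/chat/completions"

def build_chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Assemble an OpenAI-style chat-completions payload for a local runtime."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask_local(model: str, prompt: str) -> str:
    """POST to the local server and return the first completion.

    Requires a runtime already listening on LOCAL_ENDPOINT.
    """
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request shape is the same as the cloud APIs, an eval harness written against one backend can be re-pointed at another without rewriting prompts.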

Small and on-device

  • Phi-3 / Phi-4 mini — 3–4B parameters, competitive with much larger models on structured tasks.
  • Llama 3.2 1B / 3B — phone- and edge-device-friendly.
  • Qwen 2.5 1.5B — surprising quality for its size, good for retrieval-augmented tasks.

Coding

  • Qwen2.5-Coder (7B, 14B, 32B) — our default for anything that has to read or write code.
  • DeepSeek-Coder-V2 — strong alternative, especially for repo-scale context.
  • CodeLlama — older but battle-tested for Python and JS generation.

Embeddings (for retrieval pipelines)

  • bge-m3 / bge-large-en-v1.5 — multilingual, dense, proven.
  • nomic-embed-text — Apache-licensed, excellent quality-per-byte.
  • jina-embeddings-v3 — when long-chunk embeddings matter.
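Whichever embedding model produces the vectors, the retrieval step itself is just nearest-neighbour search over cosine similarity. A minimal sketch with toy vectors standing in for real model outputs (in production the vectors would come from bge-m3 or nomic-embed-text, and an index like FAISS would replace the linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k documents most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda iv: cosine(query_vec, iv[1]),
        reverse=True,
    )
    return [i for i, _ in scored[:k]]

# Toy 2-d vectors for illustration; real embeddings are hundreds of dimensions.
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], docs, k=2))  # → [0, 2]
```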

How we pick a model for a given workflow

Three questions, in this order:

  1. Data sensitivity — does the data have to stay on-prem? If yes, we're in local-model territory from day one.
  2. Task shape — simple classification and extraction rarely need more than 7–14B. Multi-step reasoning benefits from 32B+.
  3. Hardware budget — a single RTX 4090 comfortably runs a quantised 14B model. A 2× L40S or H100 node runs 70B at good latency. We size once and let it run.
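The three questions above can be sketched as a routing function. The tier names and the 48 GB VRAM threshold are illustrative assumptions for this sketch, not a product API:

```python
def pick_model_tier(on_prem_required: bool,
                    multi_step_reasoning: bool,
                    vram_gb: int) -> str:
    """Illustrative routing of the three sizing questions, in order."""
    # 1. Data sensitivity: if data can leave the perimeter, everything is on the table.
    if not on_prem_required:
        return "cloud or local - choose on cost/latency"
    # 2. Task shape: multi-step reasoning benefits from 32B+ when the hardware allows.
    if multi_step_reasoning:
        # Assumed threshold: ~48 GB VRAM (e.g. 2x L40S or H100-class) for 32B+.
        if vram_gb >= 48:
            return "open-weights 32B+"
        # 3. Hardware budget forces a fallback to a quantised mid-size model.
        return "quantised 14B (best fit for the hardware)"
    # Simple classification/extraction rarely needs more than 7-14B.
    return "open-weights 7-14B"
```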

The trade-offs we are honest about

A well-chosen open-weights model on decent hardware gets to 85–95% of frontier performance for most workflows. It will rarely match a flagship model on the hardest multi-step reasoning. We make that trade-off explicit during Phase 4 feasibility, so the business owner decides — not the engineer.

Quantisation (Q4_K_M, Q5_K_M, Q8) can cut memory and latency dramatically at the cost of 1–3 percentage points of quality on our evals. We test each combination before committing to it.
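The memory arithmetic behind those quantisation levels is simple: weight memory is parameter count times bits per weight. The bits-per-weight figures below are approximate averages for llama.cpp's mixed-precision quants, and the estimate covers weights only — KV cache and activations come on top:

```python
# Approximate average bits per weight for common llama.cpp quantisation levels.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Rough GB needed for model weights alone at a given quantisation level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# A 70B model at Q4_K_M needs ~42 GB for weights; a 14B fits a 24 GB card.
print(round(weight_memory_gb(70, "Q4_K_M"), 1))
print(round(weight_memory_gb(14, "Q4_K_M"), 1))
```

This is why a quantised 14B sits comfortably on a single RTX 4090 while a 70B wants a 2× L40S or H100 node.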

Typical QwertyBit on-prem stack

  • A quantised open-weights model sized to the workload (commonly Llama 3.x 70B or Qwen 2.5 32B).
  • A retrieval layer over the client's own documents, with local embeddings (bge-m3 or nomic).
  • An eval harness run on the client's own hardware so we can compare model upgrades safely.
  • Full tracing and cost-per-run accounting — even on-prem, you should know what each request costs to execute.
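On-prem cost-per-run is amortised hardware plus energy rather than a per-token API price. A minimal sketch of that accounting — the formula and the default electricity price are illustrative assumptions, not our billing model:

```python
def cost_per_request(hardware_cost_eur: float,
                     lifetime_requests: int,
                     power_kw: float,
                     seconds_per_request: float,
                     eur_per_kwh: float = 0.25) -> float:
    """Amortised hardware cost plus energy cost for one request (illustrative)."""
    # Spread the capital cost over the requests served across the hardware's lifetime.
    amortised = hardware_cost_eur / lifetime_requests
    # Energy drawn while the request is being served.
    energy = power_kw * (seconds_per_request / 3600) * eur_per_kwh
    return amortised + energy

# e.g. a EUR 30k node over 1M requests, 1 kW draw, 3 s per request.
print(round(cost_per_request(30_000, 1_000_000, 1.0, 3.0), 4))
```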

Next steps

If you have data-residency constraints and want to know what is actually achievable on-prem today, book a business audit — we will give you a clear-eyed answer.

Ready to see where agents can cut your costs?

Tell us about the process you want to optimise. Vlad personally reviews every brief and replies within one business day.