Local & on-prem LLMs
LLM Studio
QwertyBit deploys production on-prem LLMs via LLM Studio for regulated clients — so sensitive data never leaves your infrastructure while you still get frontier-class reasoning from Llama, Qwen, Mistral, Gemma and DeepSeek.
Why local models matter
Not every business can send sensitive data to a cloud model. Regulated industries, IP-heavy teams, and companies with strict data-residency contracts often need the model to run inside their perimeter. LLM Studio is the tool that makes that practical — and the open-weights ecosystem is strong enough today that most of the workflows we ship can run entirely on-prem without losing meaningful quality.
Local models we actually deploy
These are the families we reach for most often. Each is available in multiple sizes and in quantised GGUF formats for CPU-friendly inference.
General-purpose reasoning
- Llama 3.x (8B, 70B) — our default workhorse. 70B runs well on a single GPU server; 8B fits on a laptop.
- Qwen 2.5 (7B, 14B, 32B, 72B) — strongest open family for long-context reasoning and multilingual work.
- Mistral & Mixtral (7B dense; 8x7B and 8x22B MoE) — the Mixtral mixture-of-experts variants punch above their weight on latency-per-token.
- Gemma 2 (9B, 27B) — tight Google-trained models that are reliable and easy to fine-tune.
- DeepSeek-V2 / V3 — when the workload is math-heavy or needs strong instruction following on a budget.
Small and on-device
- Phi-3 / Phi-4 mini — 3–4B parameters, competitive with much larger models on structured tasks.
- Llama 3.2 1B / 3B — phone- and edge-device-friendly.
- Qwen 2.5 1.5B — surprising quality for its size, good for retrieval-augmented tasks.
Coding
- Qwen2.5-Coder (7B, 14B, 32B) — our default for anything that has to read or write code.
- DeepSeek-Coder-V2 — strong alternative, especially for repo-scale context.
- CodeLlama — older but battle-tested for Python and JS generation.
Embeddings (for retrieval pipelines)
- bge-m3 (multilingual) / bge-large-en-v1.5 (English) — dense, proven.
- nomic-embed-text — Apache-licensed, excellent quality-per-byte.
- jina-embeddings-v3 — when long-chunk embeddings matter.
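At its core, the retrieval layer these models feed is nearest-neighbour search over embedding vectors. A minimal sketch with toy 3-dimensional vectors — in production the vectors would come from a local model such as bge-m3, and real embeddings are hundreds of dimensions wide:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings" standing in for real model output.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, docs))  # -> [0, 1]
```

Swapping in a real embedding model changes only where the vectors come from; the ranking step stays the same, or is handed to a vector store once the corpus outgrows brute-force search.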
How we pick a model for a given workflow
Three questions, in this order:
- Data sensitivity — does the data have to stay on-prem? If yes, we're in local-model territory from day one.
- Task shape — simple classification and extraction rarely need more than 7–14B. Multi-step reasoning benefits from 32B+.
- Hardware budget — a single RTX 4090 comfortably runs a quantised 14B model. A 2× L40S or H100 node runs 70B at good latency. We size once and let it run.
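The three questions above can be sketched as a simple decision function. The model names and the 24 GB VRAM threshold are illustrative defaults, not a fixed policy:

```python
def pick_model(on_prem_required: bool,
               multi_step_reasoning: bool,
               gpu_vram_gb: int) -> str:
    """Rough model-selection sketch following the three questions in order."""
    if not on_prem_required:
        return "frontier-api"  # a cloud model is still on the table
    if multi_step_reasoning:
        # 32B+ wants a serious card; otherwise fall back to a mid-size model.
        return "qwen2.5-32b" if gpu_vram_gb >= 24 else "qwen2.5-14b"
    # Classification and extraction rarely need more than 7-14B.
    return "qwen2.5-14b" if gpu_vram_gb >= 24 else "llama-3.1-8b"

print(pick_model(True, True, 48))   # -> qwen2.5-32b
print(pick_model(True, False, 16))  # -> llama-3.1-8b
```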
The trade-offs we are honest about
A well-chosen open-weights model on decent hardware gets to 85–95% of frontier performance for most workflows. It will rarely match a flagship model on the hardest multi-step reasoning. We make that trade-off explicit during Phase 4 feasibility, so the business owner decides — not the engineer.
Quantisation (Q4_K_M, Q5_K_M, Q8) can cut memory and latency dramatically at the cost of 1–3 percentage points of quality on our evals. We test each combination before committing to it.
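Weight memory scales roughly with bits per parameter, which is why quantisation changes what fits on a given card. A back-of-envelope estimator — the bits-per-weight figures are approximate community averages for llama.cpp quant types, not exact sizes, and the estimate excludes KV cache and runtime overhead:

```python
# Approximate average bits per weight for common llama.cpp quant types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Estimate GGUF weight size in decimal GB; excludes KV cache and overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"70B @ {quant}: ~{weight_memory_gb(70, quant):.0f} GB")
```

This is the arithmetic behind the hardware-budget question above: a 70B model drops from ~140 GB at F16 to roughly a third of that at Q4_K_M, which is the difference between a multi-GPU node and a single large card.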
Typical QwertyBit on-prem stack
- A quantised open-weights model sized to the workload (commonly Llama 3.x 70B or Qwen 2.5 32B).
- A retrieval layer over the client's own documents, with local embeddings (bge-m3 or nomic).
- An eval harness run on the client's own hardware so we can compare model upgrades safely.
- Full tracing and cost-per-run accounting — even on-prem, you should know what each request costs to execute.
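Cost-per-run on-prem reduces to amortising hourly hardware and power cost over generated tokens. A hypothetical accounting sketch — every number in the example call is an assumed input, not a benchmark:

```python
def cost_per_request(server_cost_per_hour: float,
                     tokens_per_second: float,
                     tokens_per_request: float) -> float:
    """Amortise hourly hardware + power cost over the tokens a request generates."""
    cost_per_token = server_cost_per_hour / (tokens_per_second * 3600)
    return cost_per_token * tokens_per_request

# Assumed inputs: $2.50/h node, 40 tok/s sustained throughput, 800 tokens/request.
print(f"${cost_per_request(2.50, 40, 800):.4f} per request")
```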
Further reading
- AI and LLMs: the future of smart business solutions — the on-prem vs. cloud decision, framed for business owners.
- The Hugging Face Open LLM Leaderboard — the quickest way to see which open-weights model is currently top of its size class.
- LM Studio (the desktop app the category is often named after).
Work with us on LLM Studio
If data residency, compliance or sovereignty means cloud LLMs are not an option for your workflow, LLM Studio is the production path. Book a scoping call with a specific use case, or read our LLM integration & deployment service for the full engagement shape.