Running a model locally usually means picking a GGUF, guessing a quantization, hunting for the right llama.cpp build, and praying it fits in VRAM. LumaBrowser does all of that for you. It reads your hardware, recommends a model that actually fits, and downloads it — resumably — in one click. Then it just runs.
Most people give up on local AI at the same point. You read that you can run models on your own machine. Then you meet the wall: which model? Which quant (Q4, Q5, Q8)? Do you need the CUDA build or the Vulkan one? How big a context window before it spills out of VRAM and crawls? You pick wrong, download nine gigabytes over a flaky connection, it dies at 80%, and you start over. Most people stop here and go back to paying per token.
Answer three plain-language questions — what you'll use it for, the slowest speed you'd tolerate, how much conversation memory you want. LumaBrowser turns those answers plus your hardware into a concrete, working setup.
Detects your GPU, usable VRAM, system RAM, and whether a CUDA driver is present — before anything downloads.
Chooses the most capable curated model, quantization, and context size your machine can actually run, and tells you why in plain words.
Fetches the matching llama.cpp build, but only if it's missing: CUDA 12 for NVIDIA, Vulkan for AMD/Intel, and CPU as the universal fallback.
Resumably downloads the model, saves the configuration, starts the local server, and drops you into a chat. Done.
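For the curious, the runtime choice in the steps above can be pictured as a small decision function. This is a minimal sketch, not LumaBrowser's actual code: the `Hardware` type, its field names, and the build labels are hypothetical, and sending an NVIDIA card without a CUDA driver to the CPU build is our assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hardware:
    """Hypothetical probe result; the field names are illustrative only."""
    gpu_vendor: Optional[str]  # "nvidia", "amd", "intel", or None when no usable GPU
    has_cuda_driver: bool


def pick_runtime_build(hw: Hardware) -> str:
    """Map detected hardware to a llama.cpp build flavor (labels are hypothetical)."""
    vendor = (hw.gpu_vendor or "").lower()
    if vendor == "nvidia" and hw.has_cuda_driver:
        return "cuda12"
    if vendor in ("amd", "intel"):
        return "vulkan"
    # Assumption: everything else, including NVIDIA without a CUDA driver,
    # uses the universal CPU fallback.
    return "cpu"
```

Because the probe runs before anything downloads, a function like this can be evaluated up front and the matching build fetched only when it is not already on disk.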
No terminal. No config files. No Hugging Face account. If you'd rather pick the exact file yourself, an Advanced field accepts any single-file GGUF URL or owner/repo/file.gguf shorthand.
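As a rough illustration of how an Advanced field like this could treat its input, the sketch below accepts either a full URL or the owner/repo/file.gguf shorthand. It assumes the shorthand refers to a Hugging Face repository and expands it with Hugging Face's public `resolve/main` download path; the function name and the choice of the `main` revision are hypothetical, not taken from LumaBrowser.

```python
from urllib.parse import quote, urlparse


def resolve_gguf_source(value: str) -> str:
    """Return a download URL for a full GGUF URL or an owner/repo/file.gguf shorthand."""
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        if not parsed.path.lower().endswith(".gguf"):
            raise ValueError("URL must point at a single .gguf file")
        return value
    parts = value.split("/")
    if len(parts) < 3 or not all(parts) or not parts[-1].lower().endswith(".gguf"):
        raise ValueError("expected owner/repo/file.gguf")
    owner, repo, *file_parts = parts
    file_path = "/".join(quote(p) for p in file_parts)
    # Assumption: the file lives on the repository's "main" revision.
    return f"https://huggingface.co/{quote(owner)}/{quote(repo)}/resolve/main/{file_path}"
```

The repository and file names you pass in are whatever you would otherwise copy from the model page; anything that does not end in `.gguf` is rejected early rather than after a multi-gigabyte download.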
The recommender doesn't just print a filename. It sizes the model honestly — real weight size plus the KV cache — so when a smaller model leaves VRAM headroom it stretches the context window up toward ~49K instead of quietly capping you at 4K, and when the weights nearly fill VRAM it tells you the smaller context it actually settled on. Then it explains the tradeoff in a sentence.
For development, Qwen 3.6 35B-A3B (Q4_K_M) is the most capable model that runs fully on your GPU. The weights nearly fill 24 GB of VRAM, so context is set to 12,288 tokens with a compressed (q8_0) KV cache — the honest figure that keeps it entirely on the GPU instead of spilling to system RAM.
Illustrative example. The actual pick depends on your machine and your three answers. When there's no usable GPU, the recommender chooses a smaller model and tells you it will run on CPU.
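To make the "weights plus KV cache" arithmetic concrete, here is a minimal sketch of the kind of sizing the recommender describes. It uses the textbook KV-cache estimate for a standard attention transformer (keys and values, per layer, per KV head). The layer counts, the 1 GiB overhead allowance, the 1,024-token rounding step, and the 49,152-token cap are assumptions for illustration; models with hybrid or sliding-window attention need a different formula, and LumaBrowser's real calculation may differ.

```python
# Bytes per cached element. q8_0 stores 32 int8 values plus one fp16 scale per block.
KV_BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """KV-cache bytes needed for one token of context."""
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * KV_BYTES_PER_ELEMENT[cache_type]


def fit_context(vram_bytes, weight_bytes, per_token_bytes,
                overhead_bytes=1024**3, step=1024, max_ctx=49_152):
    """Largest context (rounded down to `step`) that keeps weights and KV cache in VRAM."""
    free = vram_bytes - weight_bytes - overhead_bytes
    if free <= 0:
        return 0  # the weights alone do not fit; pick a smaller model or quant
    tokens = int(free // per_token_bytes)
    return min(max_ctx, (tokens // step) * step)


GIB = 1024**3
# Hypothetical model shape: 40 layers, 8 KV heads, head dimension 128.
f16_cost = kv_bytes_per_token(40, 8, 128, "f16")
q8_cost = kv_bytes_per_token(40, 8, 128, "q8_0")
print(fit_context(24 * GIB, 20 * GIB, f16_cost))  # large weights, f16 cache
print(fit_context(24 * GIB, 20 * GIB, q8_cost))   # same weights, compressed cache
print(fit_context(24 * GIB, 4 * GIB, f16_cost))   # small weights, hits the cap
```

The pattern matches the behavior described above: when the weights nearly fill VRAM, compressing the KV cache roughly doubles the context that still fits, and when a small model leaves plenty of headroom the context simply rises to the cap.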
Every model in the catalog is a verified single-file GGUF with a real, known download size — chosen so the picker can reason about it before anything downloads. The lineup tracks the Gemma 4 and Qwen 3.6 generation.
| Model | Tier | Type | Context | Best for |
|---|---|---|---|---|
| Gemma 4 E2B Instruct | Small | Dense | up to 128K | The CPU / edge floor — tiny and fast when nothing larger fits |
| Gemma 4 E4B Instruct | Mid | Dense | up to 128K | A capable all-rounder — a great low-end laptop default |
| Gemma 4 26B A4B Instruct | Large | MoE (~4B active) | up to 128K | Fast for its size, broadly capable |
| Qwen 3.6 27B | Large | Dense | up to 256K | Deep reasoning, strong general & code work |
| Gemma 4 31B Instruct | XL | Dense | up to 128K | Complex reasoning and long documents — high-VRAM GPUs |
| Qwen 3.6 35B A3B | XL | MoE (~3B active) | up to 256K | The most capable pick — closest to frontier, offline |
Each model is available in Q4_K_M, Q6_K, and Q8_0. Mixture-of-experts (MoE) models activate only a few billion parameters per token, so they punch well above their inference cost while still needing the full weights resident for a GPU offload. Context windows shown are what the weights support; the recommender sizes the actual runtime context to your VRAM.
The local model lands in a built-in chat with the features you'd expect from a hosted assistant, all rendered locally.
Responses stream in as they're generated, with the same scroll-pause and auto-scroll behavior you're used to. No spinner-then-wall-of-text.
Ask the model to build a page, chart, diagram, or document and it renders in a docked panel beside the conversation — building in real time as the model writes it, not pasted back as raw HTML. Pop it out to a real browser tab when you want.
Didn't love an answer? Regenerate it. Previous takes are kept as variants you can flip between — nothing is thrown away.
Flip on Tools and the same chat can drive the browser — navigate, click, fill forms, extract data — with a lenient parser that keeps small local models from tripping over their own tool-call formatting.
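What "lenient" can mean in practice: small models often wrap a tool call in chatter, leave a trailing comma, or use a slightly different key name. The sketch below shows one way to tolerate those slips. It is an illustration only; the key names (`name`/`tool`, `arguments`/`args`) and the returned shape are hypothetical and are not LumaBrowser's actual tool-call schema.

```python
import json
import re


def extract_json_object(text):
    """Return the first balanced {...} span in text, or None if there is none."""
    start = text.find("{")
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None


def parse_tool_call(text):
    """Best-effort parse of a tool call embedded in model output; None if unusable."""
    raw = extract_json_object(text)
    if raw is None:
        return None
    # Drop trailing commas before a closing brace or bracket. This is a blunt
    # repair: it would also alter a string value that contains ",}" or ",]".
    raw = re.sub(r",\s*([}\]])", r"\1", raw)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    name = data.get("name") or data.get("tool")
    args = data.get("arguments", data.get("args", {}))
    if isinstance(args, str):  # some models double-encode the arguments
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return None
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    return {"name": name, "arguments": args}
```

Returning `None` instead of raising lets the chat fall back to treating the output as an ordinary message when no usable call is found.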
The practical options for “run a model on my own machine” — and where LumaBrowser sits.
| | Raw llama.cpp | Ollama | LM Studio | LumaBrowser |
|---|---|---|---|---|
| Picks a model for your hardware | No | No | Shows fit hints | Yes — one click |
| Picks the runtime build | You choose & install | Bundled | Bundled | Auto: CUDA / Vulkan / CPU |
| Resumable model download | Manual | Yes | Yes | Yes |
| Sizes context honestly to VRAM | You do the math | Fixed default | Manual slider | Automatic, explained |
| Built-in streaming chat + artifacts | No | No (CLI/API) | Chat, no artifacts | Yes |
| Same model drives the browser | No | No | No | Yes — agentic tools |
This comparison reflects publicly available feature information gathered to the best of our knowledge from each project's public materials. These tools move fast and we're a small team — if anything here looks wrong, email [email protected] with the correction and a source, and we'll update the page.
No API key, ever. A GPU is optional. The wizard reads your hardware and recommends a model that fits what you have — with no usable GPU it picks a smaller model and runs it on the bundled CPU build of llama.cpp. NVIDIA uses the CUDA 12 runtime, AMD/Intel use Vulkan, and everything else falls back to CPU.
The curated catalog spans the Gemma 4 and Qwen 3.6 generation — from Gemma 4 E2B for CPU up to the Qwen 3.6 35B-A3B mixture-of-experts model — each in Q4_K_M, Q6_K, and Q8_0. You can also point the Advanced field at any single-file GGUF on Hugging Face.
Model files are several gigabytes, so downloads are resumable. Drop your connection or quit the app mid-download and the partial file is kept; the next attempt continues from where it stopped with an HTTP range request. Nothing restarts from zero.
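The resume mechanism described here is plain HTTP. As a minimal sketch using only Python's standard library (the function names are hypothetical, and this is not LumaBrowser's downloader), the client checks how many bytes are already on disk and asks the server for the rest with a `Range` header:

```python
import os
import urllib.request


def range_header(partial_path):
    """Build the Range header that requests only the bytes not yet on disk."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={offset}-"} if offset else {}


def resume_download(url, partial_path, chunk_size=1 << 20):
    """Continue a download into partial_path, appending when the server allows it."""
    request = urllib.request.Request(url, headers=range_header(partial_path))
    # Note: if the file is already complete, a server may answer 416,
    # which urlopen raises as urllib.error.HTTPError.
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honored; 200 means the server sent the whole file.
        mode = "ab" if response.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
```

Checking the status code matters: a server that ignores the range replies with the full file, so appending blindly would corrupt the partial download.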
No. The model runs entirely on your machine through a local server. Prompts and responses never leave the device, there's no inference provider in the loop, and there are no per-token costs.
Yes. Local AI is the default in setup, but you can choose Anthropic or any OpenAI-compatible endpoint instead — or switch between local and hosted models per conversation. LumaByte never marks up API costs.
Download LumaBrowser, answer three questions, click once. A private model that fits your machine is running a couple of minutes later — no API key, no terminal, nothing leaving your device.