Local · Private · Offline · No API key

The local AI setup step that sets itself up.

Running a model locally usually means picking a GGUF, guessing a quantization, hunting for the right llama.cpp build, and praying it fits in VRAM. LumaBrowser does all of that for you. It reads your hardware, recommends a model that actually fits, and downloads it in one click, resuming if the connection drops. Then it just runs.

Download LumaBrowser - free · See what one click does

Bundled llama.cpp runtimes (CPU / CUDA 12 / Vulkan) · MLX on Apple Silicon · nothing leaves your machine

Try a small model live in your browser

The reason most people give up on local AI

You read that you can run models on your own machine. Then you meet the wall: which model? Which quant: Q4, Q5, or Q8? Do you need the CUDA build or the Vulkan one? How big a context window before it spills out of VRAM and crawls? You pick wrong, download nine gigabytes over a flaky connection, it dies at 80%, and you start over. Most people stop here and go back to paying per token.

What one click actually does

Answer three plain-language questions: what you'll use it for, the slowest speed you'd tolerate, and how much conversation memory you want. LumaBrowser turns those answers plus your hardware into a concrete, working setup.

Step 1

Reads your hardware

Detects your GPU, usable VRAM, system RAM, and whether a CUDA driver is present, before anything downloads.

Step 2

Picks a model that fits

Chooses the most capable curated model, quantization, and context size your machine can actually run, and tells you why in plain words.

Step 3

Installs the runtime

Fetches the matching llama.cpp build, and only if it's missing: CUDA 12 for NVIDIA, Vulkan for AMD/Intel, CPU as the universal fallback. On Apple Silicon, models can also run on the MLX runtime.

Step 4

Downloads & starts

Resumably downloads the model, saves the configuration, starts the local server, and drops you into a chat. Done.
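
The runtime choice in Step 3 can be sketched as a small decision function. This is a minimal illustration, not LumaBrowser's actual code: the Hardware type, the function names, and the Vulkan and CPU labels are assumptions (only llama-cpp-cuda12 appears on this page), as is the choice to send an NVIDIA card without a CUDA driver to the CPU build. The MLX path on Apple Silicon is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Hardware:
    # Hypothetical summary of what Step 1 detected.
    gpu_vendor: Optional[str]      # "nvidia", "amd", "intel", or None
    has_cuda_driver: bool = False


def pick_runtime(hw: Hardware) -> str:
    """Map detected hardware to a llama.cpp build label (labels are illustrative)."""
    if hw.gpu_vendor == "nvidia" and hw.has_cuda_driver:
        return "llama-cpp-cuda12"
    if hw.gpu_vendor in ("amd", "intel"):
        return "llama-cpp-vulkan"
    # Assumption: anything else, including NVIDIA without a driver, uses CPU.
    return "llama-cpp-cpu"


def needs_install(runtime: str, installed: Set[str]) -> bool:
    """Fetch a runtime only if it is not already on disk."""
    return runtime not in installed
```

Keeping the decision in one pure function makes the fallback order easy to read and easy to test without real hardware.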

No terminal. No config files. No Hugging Face account. If you'd rather pick the exact file yourself, an Advanced field accepts any single-file GGUF URL or owner/repo/file.gguf shorthand.
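
For the curious, here is one way the Advanced field's two input forms could be normalized into a single download URL. It is a sketch under assumptions: resolve_gguf is a hypothetical name, and the shorthand is expanded with Hugging Face's public resolve URL pattern on the main branch, which may not be exactly what LumaBrowser does internally.

```python
from urllib.parse import quote, urlparse


def resolve_gguf(ref: str) -> str:
    """Turn a full URL or an owner/repo/file.gguf shorthand into a download URL."""
    ref = ref.strip()
    if urlparse(ref).scheme in ("http", "https"):
        url = ref
    else:
        parts = ref.split("/", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError("expected owner/repo/file.gguf")
        owner, repo, filename = parts
        # Assumption: the file lives on the repo's "main" branch.
        url = f"https://huggingface.co/{owner}/{repo}/resolve/main/{quote(filename)}"
    if not urlparse(url).path.lower().endswith(".gguf"):
        raise ValueError("not a .gguf file")
    return url
```

Rejecting anything that does not end in .gguf up front is cheaper than discovering the problem after a multi-gigabyte download.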

A recommendation you can read, not a spec sheet

The recommender doesn't just print a filename. It sizes the model honestly, counting the real weight size plus the KV cache. When a smaller model leaves VRAM headroom, it stretches the context window up toward ~49K instead of quietly capping you at 4K; when the weights nearly fill VRAM, it tells you the smaller context it actually settled on. Then it explains the tradeoff in a sentence.

Recommended for you

Qwen 3.6 35B-A3B Q4_K_M

~20 GB download · 12,288 ctx · llama-cpp-cuda12

For development, Qwen 3.6 35B-A3B (Q4_K_M) is the most capable model that runs fully on your GPU. The weights nearly fill 24 GB of VRAM, so context is set to 12,288 tokens with a compressed (q8_0) KV cache. That is the honest figure that keeps the model entirely on the GPU instead of spilling to system RAM.

NVIDIA GeForce RTX 4090 · 24 GB VRAM · 64 GB RAM · CUDA

Illustrative example. The actual pick depends on your machine and your three answers. When there's no usable GPU, the recommender chooses a smaller model and tells you it will run on CPU.
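
The arithmetic behind "weights plus KV cache" can be sketched in a few lines. This is a rough model under stated assumptions, not the recommender's real logic: the function names, the 1 GiB overhead reserve, the 1,024-token rounding step, and the 49,152-token cap (standing in for "~49K") are all illustrative. The per-token formula is the standard one for attention with a key and a value vector per layer; a q8_0 cache stores 32 values in 34 bytes, so roughly 1.0625 bytes per element versus 2 for f16.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """KV cache cost of one token: a K and a V vector in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def fit_context(vram_bytes: float, weight_bytes: float, per_token: float,
                overhead_bytes: float = 1.0 * 1024**3,
                max_ctx: int = 49_152, step: int = 1024) -> int:
    """Largest context (a multiple of `step`, capped at max_ctx) that fits in VRAM.

    Returns 0 when the weights alone leave no room, which is the signal
    to fall back to a smaller model or quantization.
    """
    free = vram_bytes - weight_bytes - overhead_bytes
    if free <= 0:
        return 0
    ctx = min(int(free // per_token), max_ctx)
    return (ctx // step) * step


GIB = 1024**3
# Hypothetical architecture numbers, not those of any catalog model.
per_token_q8 = kv_bytes_per_token(n_layers=40, n_kv_heads=4, head_dim=128,
                                  bytes_per_elem=34 / 32)
tight = fit_context(24 * GIB, 22 * GIB, per_token_q8)   # weights nearly fill VRAM
roomy = fit_context(24 * GIB, 5 * GIB, per_token_q8)    # small model, lots of headroom
```

With these placeholder numbers the tight case lands well below the cap and the roomy case hits it, which is the behavior described above: context grows with headroom and shrinks when the weights crowd the card.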

A curated lineup, not the whole of Hugging Face

Every model in the catalog is a verified single-file GGUF with a real, known download size, chosen so the picker can reason about it before anything downloads. The lineup tracks the Gemma 4 and Qwen 3.6 generation.

Model	Tier	Type	Context	Best for
Gemma 4 E2B Instruct	Small	Dense	up to 128K	The CPU / edge floor - tiny and fast when nothing larger fits
Gemma 4 E4B Instruct	Mid	Dense	up to 128K	A capable all-rounder - a great low-end laptop default
Gemma 4 26B A4B Instruct	Large	MoE (~4B active)	up to 128K	Fast for its size, broadly capable
Qwen 3.6 27B	Large	Dense	up to 256K	Deep reasoning, strong general & code work
Gemma 4 31B Instruct	XL	Dense	up to 128K	Complex reasoning and long documents - high-VRAM GPUs
Qwen 3.6 35B A3B	XL	MoE (~3B active)	up to 256K	The most capable pick - closest to frontier, offline

Each model is available in Q4_K_M, Q6_K, and Q8_0. Mixture-of-experts (MoE) models activate only a few billion parameters per token, so they punch well above their inference cost. They still need the full weights resident in VRAM to run fully on the GPU. Context windows shown are what the weights support; the recommender sizes the actual runtime context to your VRAM.

Once it's running, it's a real chat - not a toy

The local model lands in a built-in chat with the features you'd expect from a hosted assistant, all rendered locally.

Live

Token streaming

Responses stream in as they're generated, with the same scroll-pause and auto-scroll behavior you're used to. No spinner-then-wall-of-text.

Build

Live artifact panel

Ask the model to build a page, chart, diagram, or document and it renders in a docked panel beside the conversation, building in real time as the model writes it, not pasted back as raw HTML. Pop it out to a real browser tab when you want.

Iterate

Regenerate with variants

Didn't love an answer? Regenerate it. Previous takes are kept as variants you can flip between; nothing is thrown away.

Agentic

Browser tools, optional

Flip on Tools and the same chat can drive the browser: navigate, click, fill forms, and extract data. A lenient parser keeps small local models from tripping over their own tool-call formatting.
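
To give a feel for what "lenient" can mean here, the sketch below pulls a tool call out of messy model output. Everything in it is an assumption made for illustration: the {"name": ..., "arguments": ...} shape, the tool names, and the function name are hypothetical, and LumaBrowser's real parser may work quite differently. It tolerates surrounding prose and code fences, trailing commas, and arguments that arrive double-encoded as a JSON string.

```python
import json
import re
from typing import Optional


def parse_tool_call(text: str) -> Optional[dict]:
    """Best-effort extraction of one tool call from raw model output."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    # Drop trailing commas before a closing brace or bracket.
    # Limitation: this would also alter such a sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        call = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    args = call.get("arguments", {})
    if isinstance(args, str):  # arguments double-encoded as a JSON string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return None
    if not isinstance(args, dict):
        return None
    return {"name": call["name"], "arguments": args}
```

Returning None instead of raising lets the chat fall back to treating the reply as ordinary text when no usable call is found.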

How this compares to the usual local-AI path

The practical options for “run a model on my own machine”, and where LumaBrowser sits.

	Raw llama.cpp	Ollama	LM Studio	LumaBrowser
Picks a model for your hardware	No	No	Shows fit hints	Yes - one click
Picks the runtime build	You choose & install	Bundled	Bundled	Auto: CUDA / Vulkan / CPU
Resumable model download	Manual	Yes	Yes	Yes
Sizes context honestly to VRAM	You do the math	Fixed default	Manual slider	Automatic, explained
Built-in streaming chat + artifacts	No	No (CLI/API)	Chat, no artifacts	Yes
Same model drives the browser	No	No	No	Yes - agentic tools

This comparison reflects publicly available feature information gathered to the best of our knowledge from each project's public materials. These tools move fast and we're a small team; if anything here looks wrong, email [email protected] with the correction and a source, and we'll update the page.

Questions people ask before downloading

Do I need an API key or a GPU?

No API key, ever. A GPU is optional. The wizard reads your hardware and recommends a model that fits what you have; with no usable GPU it picks a smaller model and runs it on the bundled CPU build of llama.cpp. NVIDIA uses the CUDA 12 runtime, AMD/Intel use Vulkan, and everything else falls back to CPU. Apple Silicon Macs can additionally run models on the MLX runtime.

Which models can I run?

The curated catalog spans the Gemma 4 and Qwen 3.6 generation, from Gemma 4 E2B for CPU up to the Qwen 3.6 35B-A3B mixture-of-experts model, each in Q4_K_M, Q6_K, and Q8_0. You can also point the Advanced field at any single-file GGUF on Hugging Face.

What if my download gets interrupted?

Model files are several gigabytes, so downloads are resumable. Drop your connection or quit the app mid-download and the partial file is kept; the next attempt continues from where it stopped with an HTTP range request. Nothing restarts from zero.
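
If you want to see what a resume looks like on the wire, here is a minimal sketch using only Python's standard library. The function names and the .part suffix are assumptions for illustration, not LumaBrowser's implementation. The idea is to ask for "bytes=N-" where N is the size of the partial file, then append if the server answers 206 Partial Content.

```python
import os
import urllib.request


def build_resume_request(url: str, part_path: str) -> urllib.request.Request:
    """Build a request that asks only for the bytes not yet on disk."""
    have = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    req = urllib.request.Request(url)
    if have > 0:
        req.add_header("Range", f"bytes={have}-")
    return req


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, continuing from dest + '.part' if it exists."""
    part = dest + ".part"
    req = build_resume_request(url, part)
    with urllib.request.urlopen(req) as resp:
        # 206: the server honored the range, so append.
        # 200: it sent the whole file again, so overwrite.
        mode = "ab" if resp.status == 206 else "wb"
        with open(part, mode) as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)
    os.replace(part, dest)  # rename only once the file is complete
```

This sketch does not handle a 416 response (the partial file is already complete) or verify a checksum; a production downloader would want both.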

Is anything sent to the cloud?

No. The model runs entirely on your machine through a local server. Prompts and responses never leave the device, there's no inference provider in the loop, and there are no per-token costs.

Can I still bring my own API key instead?

Yes. Local AI is the default in setup, but you can choose Anthropic or any OpenAI-compatible endpoint instead, or switch between local and hosted models per conversation. LumaByte never marks up API costs.