Local · Private · Offline · No API key

The local AI setup step that sets itself up.

Running a model locally usually means picking a GGUF, guessing a quantization, hunting for the right llama.cpp build, and praying it fits in VRAM. LumaBrowser does all of that for you. It reads your hardware, recommends a model that actually fits, and downloads it — resumably — in one click. Then it just runs.

Download LumaBrowser (free) · See what one click does
Bundled llama.cpp runtimes (CPU / CUDA 12 / Vulkan) · nothing leaves your machine

The reason most people give up on local AI

You read that you can run models on your own machine. Then you meet the wall: which model? Which quant: Q4, Q5, Q8? Do you need the CUDA build or the Vulkan one? How big a context window before it spills out of VRAM and crawls? You pick wrong, download nine gigabytes over a flaky connection, it dies at 80%, and you start over. Most people stop here and go back to paying per token.

What one click actually does

Answer three plain-language questions: what you'll use it for, the slowest speed you'd tolerate, and how much conversation memory you want. LumaBrowser turns those answers plus your hardware into a concrete, working setup.

Step 1

Reads your hardware

Detects your GPU, usable VRAM, system RAM, and whether a CUDA driver is present — before anything downloads.

Step 2

Picks a model that fits

Chooses the most capable curated model, quantization, and context size your machine can actually run, and tells you why in plain words.

Step 3

Installs the runtime

Fetches the matching llama.cpp build, but only if it's missing: CUDA 12 for NVIDIA, Vulkan for AMD/Intel, CPU as the universal fallback.

Step 4

Downloads & starts

Resumably downloads the model, saves the configuration, starts the local server, and drops you into a chat. Done.

No terminal. No config files. No Hugging Face account. If you'd rather pick the exact file yourself, an Advanced field accepts any single-file GGUF URL or owner/repo/file.gguf shorthand.
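For the curious, here is a minimal sketch of how an owner/repo/file.gguf shorthand could be expanded into a download URL. It assumes the shorthand maps onto Hugging Face's standard `resolve` URL layout and the `main` revision; the function name `resolve_gguf_source` is hypothetical, and LumaBrowser's actual parsing may differ.

```python
from urllib.parse import urlparse

HF_BASE = "https://huggingface.co"  # assumption: shorthand refers to Hugging Face


def resolve_gguf_source(value: str, revision: str = "main") -> str:
    """Return a download URL for a full URL or an owner/repo/file.gguf shorthand."""
    value = value.strip()
    parsed = urlparse(value)

    # Full URLs pass through unchanged, as long as the path names a .gguf file.
    if parsed.scheme in ("http", "https"):
        if not parsed.path.lower().endswith(".gguf"):
            raise ValueError("URL does not point at a .gguf file")
        return value

    # Shorthand: owner/repo/file.gguf (the file may sit in a subfolder).
    parts = value.split("/")
    if len(parts) < 3 or not all(parts) or not parts[-1].lower().endswith(".gguf"):
        raise ValueError("expected owner/repo/file.gguf")
    owner, repo, *file_parts = parts
    return f"{HF_BASE}/{owner}/{repo}/resolve/{revision}/{'/'.join(file_parts)}"
```

Rejecting anything that does not end in `.gguf` up front keeps a typo from turning into a multi-gigabyte download of the wrong file.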

A recommendation you can read, not a spec sheet

The recommender doesn't just print a filename. It sizes the model honestly, counting the real weight size plus the KV cache. When a smaller model leaves VRAM headroom, it stretches the context window toward roughly 49K tokens instead of quietly capping you at 4K. When the weights nearly fill VRAM, it tells you the smaller context it actually settled on. Then it explains the tradeoff in a sentence.

Recommended for you
Qwen 3.6 35B-A3B Q4_K_M
~20 GB download · 12,288 ctx · llama-cpp-cuda12

For development, Qwen 3.6 35B-A3B (Q4_K_M) is the most capable model that runs fully on your GPU. The weights nearly fill 24 GB of VRAM, so context is set to 12,288 tokens with a compressed (q8_0) KV cache — the honest figure that keeps it entirely on the GPU instead of spilling to system RAM.

NVIDIA GeForce RTX 4090 · 24 GB VRAM · 64 GB RAM · CUDA

Illustrative example. The actual pick depends on your machine and your three answers. When there's no usable GPU, the recommender chooses a smaller model and tells you it will run on CPU.
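The arithmetic behind a figure like that can be sketched roughly as follows. This is not LumaBrowser's recommender: the `ModelShape` values, the 1 GiB reserve, and the rounding step are all assumptions chosen for illustration. The only fixed facts are that a transformer KV cache stores one key and one value vector per layer per token, and that llama.cpp's q8_0 format packs 32 values into 34 bytes.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
# Bytes per cached element: f16 is 2 bytes; q8_0 stores 32 values in 34 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical attention shape; real models publish these in their metadata."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, cache_type: str) -> float:
    # Factor 2 = one key vector plus one value vector per layer.
    elements = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
    return elements * BYTES_PER_ELEMENT[cache_type]


def max_context(vram_bytes: int, weight_bytes: int, shape: ModelShape,
                cache_type: str = "q8_0", reserve_bytes: int = GIB,
                step: int = 1024) -> int:
    """Largest context (rounded down to `step`) whose KV cache fits beside the weights."""
    free = vram_bytes - weight_bytes - reserve_bytes  # reserve covers compute buffers
    if free <= 0:
        return 0
    tokens = int(free // kv_bytes_per_token(shape, cache_type))
    return tokens - tokens % step


# Made-up shape, 24 GiB card, 20 GiB of weights.
shape = ModelShape(n_layers=48, n_kv_heads=8, head_dim=128)
print(max_context(24 * GIB, 20 * GIB, shape, "f16"))
print(max_context(24 * GIB, 20 * GIB, shape, "q8_0"))
```

With these made-up numbers the compressed cache roughly doubles the context that fits, which is the reason a recommender might prefer q8_0 when the weights nearly fill VRAM.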

A curated lineup, not the whole of Hugging Face

Every model in the catalog is a verified single-file GGUF with a real, known download size — chosen so the picker can reason about it before anything downloads. The lineup tracks the Gemma 4 and Qwen 3.6 generation.

Model | Tier | Type | Context | Best for
Gemma 4 E2B Instruct | Small | Dense | up to 128K | The CPU / edge floor: tiny and fast when nothing larger fits
Gemma 4 E4B Instruct | Mid | Dense | up to 128K | A capable all-rounder and a great low-end laptop default
Gemma 4 26B A4B Instruct | Large | MoE (~4B active) | up to 128K | Fast for its size, broadly capable
Qwen 3.6 27B | Large | Dense | up to 256K | Deep reasoning, strong general & code work
Gemma 4 31B Instruct | XL | Dense | up to 128K | Complex reasoning and long documents on high-VRAM GPUs
Qwen 3.6 35B A3B | XL | MoE (~3B active) | up to 256K | The most capable pick: closest to frontier, offline

Each model is available in Q4_K_M, Q6_K, and Q8_0. Mixture-of-experts (MoE) models activate only a few billion parameters per token, so their per-token compute is closer to that of a small dense model, but the full weights must still fit in VRAM to run entirely on the GPU. Context windows shown are what the weights support; the recommender sizes the actual runtime context to your VRAM.
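As a back-of-the-envelope check, weight size can be estimated from total parameter count and bits per weight. The sketch below is an approximation, not the catalog's real file sizes: the Q8_0 and Q6_K figures follow from llama.cpp's block layouts, the Q4_K_M figure is a rough average because that format mixes tensor types, and the 2 GiB headroom is an assumption. It also treats the number in a model's name as its exact parameter count.

```python
GIB = 1024 ** 3

# Approximate bits per weight for each quantization format.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.5625, "Q8_0": 8.5}


def estimated_weight_gib(total_params_billion: float, quant: str) -> float:
    """Rough weight size in GiB from TOTAL parameters (not active ones)."""
    return total_params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GIB


def fits_fully_on_gpu(total_params_billion: float, quant: str,
                      vram_gib: float, headroom_gib: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom for KV cache fit in VRAM."""
    return estimated_weight_gib(total_params_billion, quant) + headroom_gib <= vram_gib


# An MoE model with ~3B active parameters is still sized by its ~35B total.
for quant in BITS_PER_WEIGHT:
    size = estimated_weight_gib(35, quant)
    print(f"35B {quant}: ~{size:.1f} GiB, fits in 24 GiB: {fits_fully_on_gpu(35, quant, 24)}")
```

The key point is the argument to `estimated_weight_gib`: memory follows total parameters, while speed follows active ones.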

Once it's running, it's a real chat — not a toy

The local model lands in a built-in chat with the features you'd expect from a hosted assistant, all rendered locally.

Live

Token streaming

Responses stream in as they're generated, with the same scroll-pause and auto-scroll behavior you're used to. No spinner-then-wall-of-text.

Build

Live artifact panel

Ask the model to build a page, chart, diagram, or document and it renders in a docked panel beside the conversation — building in real time as the model writes it, not pasted back as raw HTML. Pop it out to a real browser tab when you want.

Iterate

Regenerate with variants

Didn't love an answer? Regenerate it. Previous takes are kept as variants you can flip between — nothing is thrown away.

Agentic

Browser tools, optional

Flip on Tools and the same chat can drive the browser — navigate, click, fill forms, extract data — with a lenient parser that keeps small local models from tripping over their own tool-call formatting.
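To give a feel for what "lenient" can mean here, the sketch below pulls a tool call out of messy model output: it ignores surrounding chatter and code fences, tolerates trailing commas, and accepts arguments that were double-encoded as a string. The `{"name": ..., "arguments": ...}` shape and the function name `parse_tool_call` are assumptions for illustration, not LumaBrowser's actual format or parser.

```python
import json
import re
from typing import Optional


def parse_tool_call(text: str) -> Optional[dict]:
    """Best-effort extraction of {"name": str, "arguments": dict} from model output."""
    start = text.find("{")
    if start == -1:
        return None

    # Walk forward to the brace that closes the first object, skipping braces
    # that appear inside JSON strings.
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                candidate = text[start:i + 1]
                break
    else:
        return None  # object never closed

    # Drop trailing commas before } or ]. Crude: it would also touch a string
    # value that happens to contain ",}" which is acceptable for a sketch.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not isinstance(obj.get("name"), str):
        return None

    args = obj.get("arguments", {})
    if isinstance(args, str):  # some models emit arguments as a JSON string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return None
    if not isinstance(args, dict):
        return None
    return {"name": obj["name"], "arguments": args}
```

Returning `None` instead of raising lets the chat fall back to treating the output as ordinary text when nothing parseable is found.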

How this compares to the usual local-AI path

The practical options for “run a model on my own machine” — and where LumaBrowser sits.

Feature | Raw llama.cpp | Ollama | LM Studio | LumaBrowser
Picks a model for your hardware | No | No | Shows fit hints | Yes, one click
Picks the runtime build | You choose & install | Bundled | Bundled | Auto: CUDA / Vulkan / CPU
Resumable model download | Manual | Yes | Yes | Yes
Sizes context honestly to VRAM | You do the math | Fixed default | Manual slider | Automatic, explained
Built-in streaming chat + artifacts | No | No (CLI/API) | Chat, no artifacts | Yes
Same model drives the browser | No | No | No | Yes, agentic tools

This comparison reflects publicly available feature information gathered to the best of our knowledge from each project's public materials. These tools move fast and we're a small team — if anything here looks wrong, email [email protected] with the correction and a source, and we'll update the page.

Questions people ask before downloading

Do I need an API key or a GPU?

No API key, ever. A GPU is optional. The wizard reads your hardware and recommends a model that fits what you have — with no usable GPU it picks a smaller model and runs it on the bundled CPU build of llama.cpp. NVIDIA uses the CUDA 12 runtime, AMD/Intel use Vulkan, and everything else falls back to CPU.
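The runtime rule in that answer fits in a few lines. In this sketch only the name `llama-cpp-cuda12` comes from the example recommendation on this page; the other two runtime names are hypothetical, formed by analogy, and treating an NVIDIA card without a CUDA driver as a CPU case is an assumption.

```python
from typing import Optional


def pick_runtime(gpu_vendor: Optional[str], cuda_driver_present: bool) -> str:
    """Map detected hardware to a llama.cpp build name."""
    vendor = (gpu_vendor or "").strip().lower()
    if vendor == "nvidia" and cuda_driver_present:
        return "llama-cpp-cuda12"
    if vendor in ("amd", "intel"):
        return "llama-cpp-vulkan"   # hypothetical name
    return "llama-cpp-cpu"          # hypothetical name; the universal fallback


print(pick_runtime("NVIDIA", True))
print(pick_runtime("AMD", False))
print(pick_runtime(None, False))
```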

Which models can I run?

The curated catalog spans the Gemma 4 and Qwen 3.6 generation — from Gemma 4 E2B for CPU up to the Qwen 3.6 35B-A3B mixture-of-experts model — each in Q4_K_M, Q6_K, and Q8_0. You can also point the Advanced field at any single-file GGUF on Hugging Face.

What if my download gets interrupted?

Model files are several gigabytes, so downloads are resumable. Drop your connection or quit the app mid-download and the partial file is kept; the next attempt continues from where it stopped with an HTTP range request. Nothing restarts from zero.
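A minimal sketch of that resume logic, using only Python's standard library. The `.part` suffix and the function names are assumptions for illustration; LumaBrowser's downloader may work differently. The standard piece is the `Range: bytes=N-` request header and the `206 Partial Content` status a server returns when it honors it.

```python
import os
import urllib.request


def range_header(partial_path: str) -> dict:
    """Headers that ask the server to skip the bytes already on disk."""
    try:
        done = os.path.getsize(partial_path)
    except OSError:  # no partial file yet
        done = 0
    return {"Range": f"bytes={done}-"} if done > 0 else {}


def resume_download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download `url` to `dest`, continuing from `dest + '.part'` if it exists."""
    partial = dest + ".part"
    request = urllib.request.Request(url, headers=range_header(partial))
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honored, so append. A plain 200 means the
        # server is sending the whole file again, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(partial, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
    os.replace(partial, dest)  # only a complete file gets the final name
```

Writing to a separate partial file and renaming at the end means a half-finished download is never mistaken for a usable model. A production version would also handle a `416` response, which a server sends when the partial file is already complete.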

Is anything sent to the cloud?

No. The model runs entirely on your machine through a local server. Prompts and responses never leave the device, there's no inference provider in the loop, and there are no per-token costs.

Can I still bring my own API key instead?

Yes. Local AI is the default in setup, but you can choose Anthropic or any OpenAI-compatible endpoint instead — or switch between local and hosted models per conversation. LumaByte never marks up API costs.

Local AI without the setup tax

Download LumaBrowser, answer three questions, click once. A private model that fits your machine is running a couple of minutes later — no API key, no terminal, nothing leaving your device.

Download LumaBrowser (free) · Read the local AI docs