
The Cloud is Just Someone Else's Rate Limit

Goran Pavlović

Wednesday, April 15, 2026

8 min read

There's a particular frustration that has become increasingly common: you finally get into a flow on a non-trivial refactor with Claude Code, the agent is looping productively through edits and tests... and then you hit a rate limit. Or you glance at your usage dashboard and realize that the last hour of iteration cost more than a nice dinner. Agentic workflows demand large context windows, they loop constantly, and every tool call is another round trip through a paid API. In recent years we have all become very familiar with the term "token economics".

For a long time, the answer to "can I just run this locally?" was a polite "not really". Local models were interesting, but they were not professionally viable for the kind of agentic work that tools like Codex, Claude Code, and Cursor enable. That has changed. The combination of new consumer-grade hardware, mature inference engines, and a new generation of open-weights coding models has pushed local inference across the threshold into something you can actually use for real work.

This post is a walkthrough of the setup I am currently running: an AMD Strix Halo machine with 128GB of unified memory, Omarchy on the Linux Zen kernel, llama.cpp serving Qwen3 Coder Next with speculative decoding, and OpenCode pointed at the local server instead of a cloud provider. The result is a full agentic coding workflow running locally, at usable speeds, with zero recurring cost.

The Hardware: Why Strix Halo Matters

The hardware barrier for running large models locally has traditionally been VRAM. Discrete GPUs have fast memory, but not much of it unless you are willing to spend several thousand dollars on data center cards or stack multiple consumer GPUs together. A 70B parameter model at reasonable quantization simply does not fit on a 24GB card, and the models useful for agentic coding are only getting larger.

The AMD Ryzen AI Max+ 395, the chip at the heart of the Strix Halo platform, approaches the problem from a different angle. It is an APU with a unified memory architecture, meaning the CPU and integrated GPU share the same pool of system RAM. That pool can be significantly larger than what you get on a discrete card, and because it is unified, there is no PCIe transfer tax between system memory and the GPU.

On my machine, that pool is 128GB. Of that, I have allocated 100GB specifically for the iGPU to use for AI models. That is enough headroom to comfortably run models that would otherwise require a multi-GPU rig, and it leaves the remaining 28GB for the operating system and everything else I am doing at the same time.

The Environment: Omarchy and the Zen Kernel

The operating system and kernel choices here are not strictly necessary for local inference to work, but they meaningfully affect the experience.

I am running Omarchy, DHH's opinionated Arch setup. It is dev-focused, visually coherent out of the box, and it gets out of the way. If you have been curious about a Linux daily driver but did not want to spend a weekend configuring one, it is worth a look. On top of that, I am using the Linux Zen kernel, which is tuned for desktop responsiveness and tends to feel noticeably snappier under load.

Allocating GTT Memory via Limine

The most important piece of the environment, and probably the most useful piece of this post for anyone trying to replicate the setup, is how to actually convince the system to give the iGPU 100GB of memory to work with. On an APU, this is managed through GTT (Graphics Translation Table) memory, which you configure via a kernel parameter.

Before touching the bootloader, there is a BIOS-level step worth mentioning. I have set the dedicated VRAM allocation in my BIOS to 512MB, which is the minimum the board allows. This is counterintuitive at first glance, since more dedicated VRAM sounds like it should be better. In practice, dedicated VRAM on an APU is carved out of system memory and reserved exclusively for the GPU, which means it cannot be reclaimed or shared. GTT memory, on the other hand, is allocated dynamically and is what actually gives the iGPU access to the full 100GB pool. Minimizing the dedicated allocation leaves as much memory as possible available for GTT to use.

Omarchy uses Limine as its bootloader, so the GTT parameter goes into limine.conf. The relevant line looks like this:

cmdline: ... ttm.pages_limit=26214400 amdgpu.gttsize=100000

Without this, the amdgpu driver will default to a much smaller allocation, and you will hit out-of-memory errors the moment you try to load anything substantial. With it, the iGPU can address the full 100GB, and the rest of the stack can proceed as if you were running on a card that does not actually exist at this price point.
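The two numbers in that line encode the same budget in different units, which is easy to get wrong: ttm.pages_limit counts 4 KiB pages, while amdgpu.gttsize is specified in MiB, per the kernel module parameter documentation. A small sketch of the conversion (the 100GB target is this post's allocation; adjust for your own RAM):

```python
# Convert a desired GTT budget into the two kernel parameters.
# Assumption: ttm.pages_limit counts 4 KiB pages and amdgpu.gttsize
# is in MiB, per the kernel module parameter docs.

PAGE_SIZE = 4096  # bytes per TTM page

def gtt_params(budget_gib: int) -> tuple[int, int]:
    """Return (ttm.pages_limit, amdgpu.gttsize in MiB) for a GiB budget."""
    budget_bytes = budget_gib * 1024**3
    pages_limit = budget_bytes // PAGE_SIZE
    gttsize_mib = budget_bytes // 1024**2
    return pages_limit, gttsize_mib

print(gtt_params(100))  # → (26214400, 102400)
```

Note that the cmdline above uses gttsize=100000, a round number slightly under the exact 102400 MiB equivalent; either value works as long as it fits within physical RAM.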

The Engine: llama.cpp and Speculative Decoding

With the hardware addressable, the next piece is the inference engine. I am using llama.cpp in server mode, which exposes an OpenAI-compatible HTTP API on localhost. This is the key piece that makes the rest of the setup possible: any tool that can talk to the OpenAI API can, with a base URL change, talk to your local server instead.

One thing worth mentioning: I am not building llama.cpp from source. I am using a pre-built binary (llamacpp-rocm) that is tuned for this hardware, which gets you up and running in minutes rather than wrestling with compile flags and ROCm quirks. For anyone wanting to replicate the setup, this is the path of least resistance.

./llama-server -m /path/to/model/Qwen3-Coder-Next-Q4_K_M.gguf \
    -md /path/to/model/Qwen3-1.7B-Q8_0.gguf \
    -c 128000 \
    -fa on \
    -ngl 999 \
    -ngld 999 \
    --no-mmap \
    -b 4096 --ubatch-size 2048 \
    --host 0.0.0.0 --port 8081
  • -m / -md set the main and draft models for speculative decoding
  • -c sets the context size (128,000 tokens here)
  • -fa on enables flash attention
  • -ngl 999 / -ngld 999 offload all layers of both models to the iGPU
  • --no-mmap works around a ROCm quirk where memory-mapped model loading destroys performance
  • -b 4096 --ubatch-size 2048 set the batch sizes for prompt processing
  • --host 0.0.0.0 --port 8081 bind the server address and port
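Before pointing an agent at the server, it is worth a quick sanity check that the endpoint answers. A minimal stdlib sketch (the URL and model name match the server flags above; the chat_request helper is mine, not part of any library):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request("http://127.0.0.1:8081", "Qwen3-Coder-Next-Q4_K_M.gguf",
                   "Write a one-line Python hello world.")
# Uncomment while llama-server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```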

Your experience when running a large model on any hardware, local or cloud, is ultimately bottlenecked by tokens-per-second. For agentic coding this matters more than it does for chat because the agent is generating a lot of output: tool calls, file edits, test runs, and the reasoning between each step. Slow generation compounds across a loop.
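To put rough numbers on that compounding (the call and token counts below are illustrative assumptions, not benchmarks):

```python
# Illustrative only: how generation speed compounds across an agent loop.
# The call count and tokens-per-call figures are assumptions.

calls = 30            # tool-call round trips in one refactor session
tokens_per_call = 800  # generated tokens per round trip

def loop_minutes(tokens_per_second: float) -> float:
    """Minutes of pure generation time for the whole loop."""
    return calls * tokens_per_call / tokens_per_second / 60

for tps in (10, 25, 50):
    print(f"{tps:>3} tok/s -> {loop_minutes(tps):.0f} min of generation")
```

Doubling throughput halves the wall-clock time of every loop the agent runs, which is why speculative decoding (below) is worth the setup effort.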

Speculative decoding squeezes even more performance out of local inference. The idea is to run a smaller "draft" model alongside the main model. As long as the two share the same vocabulary, the draft model can cheaply and quickly generate candidate tokens that the main model then verifies in a single forward pass, rather than generating each token sequentially. When the draft model guesses right, which it often does on structured output like code, you get a significant throughput improvement for the small price of a little extra memory. When it guesses wrong, you fall back to normal generation. The net effect is that the main model's quality is preserved while the effective generation speed climbs toward something that feels interactive rather than something you watch happen.
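The draft/verify loop can be sketched in a few lines. This is a toy illustration of the idea, not llama.cpp's implementation: real engines compare probability distributions, while both "models" here are deterministic next-character functions so the accept/reject step is easy to see.

```python
# Toy sketch of speculative decoding over characters instead of tokens.

TARGET = "def add(a, b):\n    return a + b\n"

def main_model(prefix: str) -> str:
    # Stand-in for the large model: always emits the correct next char.
    return TARGET[len(prefix)]

def draft_model(prefix: str) -> str:
    # Stand-in for the cheap draft model: right most of the time,
    # but deliberately wrong at one position to show rejection.
    return "x" if len(prefix) == 10 else TARGET[len(prefix)]

def speculative_step(prefix: str, k: int = 4) -> str:
    # 1. Draft k candidate tokens cheaply.
    draft = prefix
    for _ in range(k):
        if len(draft) >= len(TARGET):
            break
        draft += draft_model(draft)
    # 2. Verify: accept the longest prefix the main model agrees with,
    #    then emit one corrected token at the first disagreement.
    accepted = prefix
    for ch in draft[len(prefix):]:
        if main_model(accepted) == ch:
            accepted += ch
        else:
            accepted += main_model(accepted)
            break
    return accepted

out, steps = "", 0
while len(out) < len(TARGET):
    out = speculative_step(out)
    steps += 1

print(steps, "verification passes for", len(out), "tokens")
```

Because each verification pass covers up to k drafted tokens, the output completes in far fewer main-model passes than its length, which is exactly where the throughput win comes from.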

The Brains: Qwen3 Coder Next and OpenCode

The model choice here is Qwen3 Coder Next (Qwen3-Coder-Next-Q4_K_M). There are plenty of capable open-weights coding models at this point, but Qwen3 Coder Next stands out for agentic work specifically. It is trained for tool use, bash execution, and file editing, which are exactly the capabilities that OpenCode leans on. A model that is generally strong at code but weak at following tool-call schemas will stall out in an agentic loop; Qwen3 Coder Next does not.

Connecting OpenCode to the local server is straightforward. Because llama.cpp exposes an OpenAI-compatible endpoint, it is really just a configuration change:

{
...
	"provider": {
		"llama.cpp": {
			"npm": "@ai-sdk/openai-compatible",
			"name": "llama-server (local)",
			"options": {
				"baseURL": "http://127.0.0.1:8081/v1"
			},
			"models": {
				"Qwen3-Coder-Next-Q4_K_M.gguf": {
					"name": "Qwen3-Coder-Next-Q4_K_M.gguf (local)",
					"limit": {
						"context": 128000,
						"output": 65536
					}
				}
			}
		}
	}
}
Once that is in place, nothing else about the OpenCode experience changes. The agent proposes edits, the model generates them, the edits are applied, tests run, the loop continues. Except the loop is running against a process on localhost that is not charging you for tokens and does not have a rate limit.

The Workflow: Using Frontier and Local Models Together

The most important thing to understand about this setup is that it is not a replacement for frontier models. It is a complement to them. Frontier models and local models are good at different things, and the workflow that has emerged for me leans into that difference rather than fighting it.

The loop starts with my roadmap. Work is broken down into epics, and I hand each epic off to a frontier model like Claude Opus with a clear ask: do the research, gather the relevant context from the codebase, and produce an implementation plan broken down into discrete tasks. This is where frontier models earn their cost. They are genuinely good at synthesis, at navigating ambiguity, at reasoning through architecture decisions and surfacing considerations that a smaller model will miss. The output is a thorough task list that represents a workable epic implementation plan.

From there, the tasks go to the local setup. Qwen3 Coder Next running locally is more than capable of executing well-defined tasks: making the file edits, running the tests, iterating on failures. This is work that does not require frontier-level reasoning but does require a lot of tokens, and that is exactly the kind of work you want to offload.

The result is that I get more out of both tiers. My cloud usage stays within sustainable limits because I am only spending frontier tokens on the work that actually benefits from frontier reasoning. And my local setup is not being asked to do things it would struggle with, like planning an epic from scratch. Each model is doing what it is best at, and the overall throughput of the workflow is higher than it would be if I were trying to do everything in one place.

The Zero-Guilt Refactor

The practical payoff of this setup is a psychological one as much as a technical one. When the usage meter is running, there is a constant low-level pressure to be efficient with your prompts, to scope tasks conservatively, and to not let the agent thrash. Local inference removes that pressure entirely. You are free to iterate, experiment, and keep going until you are satisfied with the work.

Taking Back Control

We are at an inflection point. The combination of consumer hardware with enough unified memory to matter, inference engines that have matured to the point of being drop-in replacements for cloud APIs, and open-weights models that are genuinely competitive for agentic work means that the dependency on cloud providers for day-to-day coding assistance is becoming a choice, one with a real alternative.

None of this is to say that cloud providers no longer have a place. As described above, frontier models still earn their keep at the planning and research layer, where their capabilities meaningfully outclass anything you can run locally. But for the execution layer of agentic work, local inference on hardware like the Strix Halo platform is not a compromise anymore.

If you are experimenting with local models for agentic coding, I would be curious to hear what setups you are running, what models you are finding most useful, or whether you have questions about the setup. You can get in touch via the contact form.
