Local LLMs on NPU Laptops: The 2026 Reality

Copilot+ PCs advertise 40-80 TOPS NPUs, but can they actually run a local LLM well? Here is what the numbers say in 2026.

Sam Carter · Jun 23, 2026 · 9 min read

Cover image for Local LLMs on NPU Laptops: The 2026 Reality — Photo: brewbooks / flickr (BY-SA 2.0)

Every Copilot+ PC sold in 2026 carries a neural processing unit rated at 40 TOPS or more, and the new Snapdragon X2 Elite pushes that to 80. The marketing implies these machines are local AI powerhouses. The reality is more nuanced, and if you buy one expecting to run a large language model at desktop-GPU speed, you will be disappointed. This piece separates what NPUs are genuinely good at from what they are not, so you can set expectations before you spend.

Quick answer

A Copilot+ PC NPU is excellent at low-power, always-on AI (live captions, camera effects, Microsoft's on-device Phi Silica model) but a poor general-purpose LLM engine. Text generation is bound by memory bandwidth, not TOPS, and most local runtimes like Ollama do not even use the NPU yet; they fall back to the CPU. An 8B model runs around 5 to 10 tokens per second on a Snapdragon X Elite versus roughly 100 on a used desktop GPU. If raw local-LLM speed is your goal in 2026, buy a discrete GPU, not an NPU laptop.

Key takeaways

Copilot+ PCs require a 40+ TOPS NPU, 16 GB of RAM, and a 256 GB SSD; qualifying chips include Snapdragon X Elite/Plus, Intel Core Ultra 200V, and AMD Ryzen AI 300.
The Snapdragon X2 Elite nearly doubles NPU throughput, from about 45 TOPS to about 80 TOPS, a fast generational jump.
LLM text generation is bound by memory bandwidth, not raw TOPS, and most mainstream runtimes do not even use the NPU yet.
An 8B model on a Snapdragon X Elite runs CPU-only at roughly 5-10 tokens per second; a used desktop GPU runs the same model at around 100 tokens per second.
NPUs shine at small, sustained, low-power tasks like background transcription, image effects, and Microsoft's on-device Phi Silica model.

What TOPS actually measures

TOPS, trillions of operations per second, is a peak throughput figure for low-precision integer math. It tells you how fast the NPU can crunch the fixed-function workloads it was designed for. It does not tell you how fast a chatbot will type, because LLM token generation is dominated by a different bottleneck.

The 2026 TOPS jump is real. Current Snapdragon X devices are rated at about 45 TOPS, clearing Microsoft's 40 TOPS bar; the X2 Elite generation lands near 80 TOPS, one of the fastest year-over-year improvements in any consumer chip category. But a bigger TOPS number does not automatically mean faster local text generation.
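A rough back-of-the-envelope sketch shows why TOPS alone says little about chatbot speed. It assumes about two operations (one multiply, one add) per model parameter per generated token, which is a common approximation rather than a measured figure, and the function name is made up for illustration.

```python
def compute_bound_tokens_per_sec(tops: float, params_billions: float,
                                 ops_per_param: float = 2.0) -> float:
    """Decode-speed ceiling if raw compute were the ONLY limit.

    ops_per_param=2.0 is an assumed approximation (multiply + add per weight).
    """
    ops_per_token = params_billions * 1e9 * ops_per_param
    return (tops * 1e12) / ops_per_token


# TOPS ratings taken from the article; the 8B model size is the article's example.
for label, tops in [("~45 TOPS NPU", 45), ("~80 TOPS NPU", 80)]:
    ceiling = compute_bound_tokens_per_sec(tops, params_billions=8)
    print(f"{label}: about {ceiling:,.0f} tokens/sec if compute were the bottleneck")
```

Under these assumptions the compute ceiling comes out in the thousands of tokens per second, hundreds of times above the 5-10 tokens per second people actually see. Whatever is holding generation back, it is not the TOPS rating.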

To make the gap concrete, here is roughly what an 8B model does on different hardware, and why:

Hardware	8B model speed	Bottleneck	Best for
Snapdragon X Elite (CPU fallback)	~5-10 tokens/sec	Memory bandwidth, no NPU use	Small models, background AI
Snapdragon X2 Elite (~80 TOPS)	Faster, still bandwidth-limited	Memory bandwidth	Efficient on-device tasks
Used desktop GPU (e.g. RTX 3060)	~100 tokens/sec	Memory bandwidth (far higher on VRAM)	Fast local LLM inference
High-end desktop GPU	Well above 100 tokens/sec	Memory bandwidth (higher still)	Large models, heavy local AI

The pattern is clear: the NPU rarely enters the LLM picture on mainstream tools, and even when it does, memory bandwidth caps how fast text can stream out.

The memory bandwidth wall

Here is the part the spec sheets bury. When an LLM generates text, it reads its entire weight set from memory for every single token it produces. That makes decoding a memory-bandwidth problem, not a compute problem. You can have a monster NPU and still crawl if memory cannot feed it fast enough.

Two facts follow from this:

Mainstream runtimes skip the NPU. Popular local tools like Ollama run on the CPU on these machines. On a Snapdragon X Elite, an 8B model lands around 5-10 tokens per second. A used desktop GPU runs the same 8B model at roughly 100 tokens per second, an order of magnitude faster.
RAM size and bandwidth gate what you can load. Local LLM deployments typically want 45+ TOPS paired with at least 32 GB of RAM, and serious work benefits from more. The 16 GB Copilot+ baseline is fine for the OS and small models, tight for anything larger.
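To put a number on the bandwidth wall, here is a minimal sketch of the ceiling it implies: if every token requires reading all the weights once, tokens per second cannot exceed memory bandwidth divided by the size of the weights. The bandwidth values below are hypothetical round placeholders, not the measured specs of any product, and 4-bit weights are an assumed quantization level. Look up the real figures for the hardware you are comparing.

```python
def bandwidth_bound_tokens_per_sec(bandwidth_gb_s: float, params_billions: float,
                                   bits_per_weight: float) -> float:
    """Decode-speed ceiling when each token needs one full pass over the weights."""
    weights_gb = params_billions * bits_per_weight / 8  # GB read per generated token
    return bandwidth_gb_s / weights_gb


# HYPOTHETICAL placeholder bandwidths for illustration only (GB/s).
placeholder_bandwidth_gb_s = {
    "laptop unified memory": 120.0,
    "desktop GPU VRAM": 480.0,
}

for label, bandwidth in placeholder_bandwidth_gb_s.items():
    # 8B parameters at an assumed 4 bits per weight is about 4 GB of weights.
    ceiling = bandwidth_bound_tokens_per_sec(bandwidth, params_billions=8, bits_per_weight=4)
    print(f"{label}: at most ~{ceiling:.0f} tokens/sec")
```

This is a best-case bound. Real runtimes land below it, and a CPU fallback that cannot saturate the memory bus lands well below it, which is consistent with the single-digit speeds quoted above.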

Note

If raw local-LLM speed is your goal, a discrete GPU still wins decisively in 2026. The NPU is not a GPU replacement; it is a different tool for a different job.

Close-up of a laptop system-on-chip highlighting the neural processing unit — Photo: Bob Mical / flickr (BY-NC 2.0)

What NPUs are genuinely great at

The NPU is a fixed-function accelerator built to run small, sustained AI tasks continuously at very low power. That is exactly the workload a laptop wants offloaded so the CPU and battery are spared. Real wins in 2026 include:

Background transcription and live captions that run all day without draining the battery.
Camera and video effects (blur, eye contact, auto-framing) handled on the NPU during calls.
On-device small language models. Microsoft preinstalls Phi Silica, a small model, on every Copilot+ PC, and it runs on the NPU for tasks like quick rewrites and summaries.
Image generation and editing features in apps tuned to use the NPU.

These are the same on-device, privacy-friendly jobs we covered in "Small language models for on-device agents." The NPU does them quietly and efficiently, which is the whole point.

Picking a Copilot+ PC for AI

If you have decided an NPU laptop fits your needs, weigh these:

Chip family. Snapdragon X2 Elite leads on raw TOPS in 2026; Intel Core Ultra and AMD Ryzen AI 300 are competitive and run x86 software natively, which matters for compatibility.
RAM. Buy 32 GB if you can. The 16 GB floor is genuinely minimal for AI experimentation.
Memory bandwidth. Rarely advertised, but it is the real predictor of local LLM speed. Higher is better.
Your actual use case. If you mostly want efficient background AI features and long battery life, a Copilot+ PC is excellent. If you want to run mid-size models fast and locally, budget for a discrete GPU instead.

The honest summary: NPUs in 2026 are superb at low-power, always-on AI and weak as general-purpose LLM engines. Match the tool to the task and you will be happy. Expect a desktop-GPU experience from a thin-and-light and you will not.

What to do right now

Before you buy, line up the hardware with what you actually want from local AI:

If you mainly want all-day battery, live captions, camera effects, and quick on-device rewrites, a Copilot+ PC is an excellent fit.
If you want to run mid-size LLMs fast and locally, plan for a discrete GPU instead. An NPU laptop will frustrate you.
Whichever way you lean, buy 32 GB of RAM, not the 16 GB floor. The whole model has to fit in memory and be read for every token.
Ask the seller for memory bandwidth, not just TOPS. It is the real predictor of local LLM speed.
If you go Copilot+ and still want to experiment with bigger models, set expectations at single-digit to low double-digit tokens per second on CPU fallback.
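Tokens per second is easier to judge as waiting time. The sketch below converts the speeds quoted in this article into seconds per answer; the 400-token answer length is an assumed example, not a measurement.

```python
def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to stream an answer of num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_sec


ANSWER_TOKENS = 400  # assumed length of a medium-sized answer

# Rates are the approximate figures quoted in the article.
for label, rate in [("CPU fallback, low end", 5),
                    ("CPU fallback, high end", 10),
                    ("used desktop GPU", 100)]:
    wait = seconds_to_generate(ANSWER_TOKENS, rate)
    print(f"{label}: about {wait:.0f} seconds for {ANSWER_TOKENS} tokens")
```

At 5 tokens per second that answer takes well over a minute to finish, while at 100 it is done in a few seconds. That difference is what "frustrating" means in practice.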

Frequently asked questions

Can a Copilot+ PC run a large language model locally?

It can run small models, and Microsoft's Phi Silica ships on every unit and uses the NPU. Larger models run, but mostly CPU-only on mainstream tools, at single-digit to low double-digit tokens per second. For fast local inference on bigger models, a discrete GPU is still far ahead.

Why is my NPU laptop slower at LLMs than a cheap GPU?

Because text generation is limited by memory bandwidth, and most local runtimes do not use the NPU at all; they fall back to the CPU. A GPU has both the bandwidth and the software support, so it can be roughly ten times faster on the same model.

How much RAM do I need for local AI on a laptop?

16 GB is the Copilot+ minimum and works for the OS plus small models. For comfortable experimentation with larger models, aim for 32 GB or more, since the entire model must fit in memory and be read for each token.
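As a rough way to check whether a model fits, the sketch below estimates its memory footprint from parameter count and bits per weight. The 20% overhead for context cache and runtime buffers and the 6 GB reserved for the OS and apps are assumptions chosen for illustration, and both function names are hypothetical.

```python
def estimate_model_ram_gb(params_billions: float, bits_per_weight: float,
                          overhead_fraction: float = 0.2) -> float:
    """Weights plus an ASSUMED fractional overhead for cache and buffers."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)


def fits_in_ram(ram_gb: float, params_billions: float, bits_per_weight: float,
                reserved_for_os_gb: float = 6.0) -> bool:
    """True if the estimate fits after an ASSUMED reserve for the OS and apps."""
    needed = estimate_model_ram_gb(params_billions, bits_per_weight)
    return needed <= ram_gb - reserved_for_os_gb


for ram_gb in (16, 32):
    for bits in (4, 16):
        verdict = "fits" if fits_in_ram(ram_gb, 8, bits) else "does not fit"
        print(f"8B model at {bits}-bit on {ram_gb} GB: {verdict}")
```

With these assumptions, an 8B model quantized to 4 bits fits comfortably in 16 GB, while the same model at 16-bit precision needs the 32 GB configuration.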

Are NPUs useless then?

No. They are very good at what they were built for: low-power, sustained tasks like transcription, camera effects, and small on-device models, all without hammering the CPU or battery. They are a poor substitute for a GPU but an excellent efficiency accelerator.
