Skip to content
WhySoGeek.
How To

Run a Local LLM With Ollama on Your PC (2026)

Run a private AI model on your own hardware with Ollama, free, offline, and no data leaving your machine. Here is the ten-minute setup and the right model.

Sam Carter 8 min read
Cover image for Run a Local LLM With Ollama on Your PC (2026)
Photo: HO JJ / flickr (BY-NC-SA 2.0)

Cloud AI is convenient until you think about what you are pasting into it. Ollama runs capable language models entirely on your own machine: no API bills, no data leaving your hardware, and it works offline. If you have a laptop with 8 GB of RAM, you can have a private assistant running in about ten minutes.

Quick answer

Download Ollama for Windows, macOS, or Linux from ollama.com and install it. Then open a terminal and run ollama run llama3.1:8b to download and start a model. Ollama handles quantization and GPU acceleration automatically, and it exposes an OpenAI-compatible API on localhost for your own apps. Start with a 3B to 8B model; anything larger needs serious RAM or VRAM.

Key takeaways

  • Ollama runs LLMs locally on Windows, macOS, and Linux, free and offline.
  • Nothing leaves your machine, so it is ideal for private or sensitive text.
  • Install it, then pull a model with one command; setup takes about ten minutes.
  • Start small: a quantized 7B or 8B model runs on a laptop with 8 GB of RAM.
  • It exposes an OpenAI-compatible API on localhost so you can wire it into your own tools.

Why run a model locally

The pitch is control. A local model costs nothing per token, runs with no internet connection, and keeps every byte of your prompts on your own disk. That matters for confidential documents, code you cannot upload, or just avoiding another subscription. The trade-off is that local models are smaller and slower than frontier cloud models, so match your expectations to your hardware.

FactorLocal (Ollama)Cloud API
Cost per tokenZeroMetered
PrivacyData stays on deviceSent to provider
Offline useYesNo
Model ceilingLimited by your RAM/VRAMFrontier models
SetupOne install, one commandAPI key

If you want to understand which open models are worth pulling, our roundup of the best open-weight LLMs of 2026 covers the current standouts and their strengths.

Match the model to your hardware

The fastest way to fail is downloading a model too big for your machine. Quantized models shrink the memory footprint dramatically, which is why an 8B model fits on modest laptops.

Your RAM/VRAMRecommended model sizeExample
8 GB3B to 8B quantizedllama3.1:8b, phi small
16 GB8B to 14Bmid-size chat models
24 GB+ VRAM30B and uplarger reasoning models

Start at the low end, confirm it runs smoothly, then move up. A model that swaps to disk because it does not fit will feel unusably slow.

A terminal window showing Ollama downloading and running a Llama model locally, ready to chat at a prompt
Photo: Qole Tech / flickr (BY 2.0)

Install and run your first model

    1. Go to ollama.com, download the installer for your OS, and run it.
    2. Open a terminal (or PowerShell on Windows) and type ollama --version to confirm it installed.
    3. Run ollama run llama3.1:8b to download and launch the model; the first pull takes a few minutes.
    4. When the prompt appears, type a question and press Enter to chat locally.
    5. Type /bye to exit, and use ollama list to see downloaded models.

Ollama automatically uses your GPU if it can, quantizes the model to fit, and caches it so future launches are instant. To try a different model, just run ollama run with its name.

Warning

Downloaded models are large, often several gigabytes each. Keep an eye on disk usage if you pull many of them. Use ollama rm modelname to delete ones you no longer need, and if space gets tight, see our guide to freeing up disk space in Windows 11.

Beyond the command line

The terminal is just the start. Ollama runs a local server that speaks the same API format as OpenAI, so you can point existing apps and scripts at localhost instead of a paid endpoint.

  • Chat UIs: front ends like Open WebUI give you a browser chat interface over your local models.
  • Code and automation: call the local endpoint from Python or any language to build private tools.
  • Structured output: Ollama can constrain responses to a JSON schema, which is handy for extraction tasks; our guide to Ollama structured outputs with JSON schema shows how.

If you are deciding between inference engines for a heavier workload, compare the options in vLLM vs Ollama vs llama.cpp.

What to do right now

  • Check your RAM or GPU VRAM and pick a model size that fits (start with 8B or smaller).
  • Download and install Ollama from ollama.com.
  • Run ollama run llama3.1:8b and confirm you can chat at the prompt.
  • Use ollama list and ollama rm to manage which models you keep.
  • Explore a chat UI or the localhost API if you want to build on top of it.

Frequently asked questions

Is Ollama really free?

Yes. Ollama itself is open-source and free, and the models you run locally cost nothing per token because they run on your own hardware. Your only costs are electricity and disk space.

Do I need a powerful GPU?

No. Quantized models let you run a capable 7B or 8B model on a laptop with 8 GB of RAM and no dedicated GPU, just slower. A GPU speeds things up considerably, and larger models genuinely need one with enough VRAM.

Does my data leave my computer?

No. That is the main reason to run models locally. Everything, including your prompts and the model's responses, stays on your machine, which makes Ollama suitable for confidential text.

Which model should I start with?

Start with a small, well-supported model like an 8B Llama variant. Confirm it runs smoothly, then experiment with larger models if your hardware allows. Beginning too big is the most common mistake.

Can I use Ollama in my own apps?

Yes. Ollama exposes an OpenAI-compatible API on localhost, so you can point existing scripts and tools at it, build custom automations, or constrain output to a JSON schema for data extraction.

#ai#ollama#local-llm

Sources & further reading

Keep reading