
Running Local AI Models on a Mac Mini M4 — The Ollama Performance Guide

We benchmark 12 models on the Mac Mini M4 with Ollama — from tiny 1B parameter models to 70B behemoths. Token speeds, memory usage, and which models are actually worth running locally.

AI Tool Dojo

Cloud AI is convenient until it isn’t. Rate limits at 2 AM when you’re racing a deadline. API costs that balloon when you’re running 50 agents. Privacy concerns when your prompts contain proprietary code. And the nagging feeling that you’re paying someone else to run software on hardware you already own.

The Mac Mini M4 changed the local AI equation. With 24GB of unified memory, Metal-accelerated GPU inference, and a $599 starting price, it’s the most practical local AI server most people can buy. We’ve been running Ollama on one for months. Here’s what actually works.

Why Ollama?

Ollama is the easiest way to run open-source language models locally. Install it, pull a model, and you’re running inference in under 5 minutes. No Python environments, no dependency hell, no CUDA drivers.

# Install
brew install ollama

# Start the server
ollama serve

# Pull and run a model (the llama3.3 tag is 70B-only; llama3.2 fits comfortably)
ollama run llama3.2

That’s it. Ollama handles model downloading, quantization, memory management, and API serving. It exposes an OpenAI-compatible API on localhost:11434, so any tool that works with OpenAI’s API works with Ollama out of the box.

The Hardware: Mac Mini M4

Our test machine:

  • Chip: Apple M4 (10-core CPU, 10-core GPU)
  • Memory: 24GB unified
  • Storage: 512GB SSD
  • OS: macOS Sequoia 15.3
  • Ollama version: 0.6.x

The key spec is the 24GB unified memory. In the Apple Silicon architecture, the CPU and GPU share the same memory pool. This means models that would need a $1,000 GPU on a PC can run on a sub-$800 machine: the M4’s 120 GB/s memory bandwidth is good enough, and there’s no PCIe bottleneck.
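
A rough back-of-envelope check before pulling a multi-gigabyte model: weights take roughly (parameters × bits per weight) / 8 bytes, plus runtime overhead for the KV cache and buffers. A minimal sketch, where the 4.5-bit average (typical of Q4_K_M) and the 20% overhead factor are our assumptions, not Ollama internals:

```python
# Rough heuristic: will a quantized model fit in unified memory?
# The bits-per-weight and overhead defaults are assumptions; real usage
# varies with context length and quantization details.

def estimate_memory_gb(params_billions: float, bits_per_weight: float = 4.5,
                       overhead: float = 1.2) -> float:
    """Approximate resident memory for a quantized model, in GB."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

def fits(params_billions: float, ram_gb: float = 24, reserve_gb: float = 6) -> bool:
    """Leave headroom for macOS and other apps (reserve_gb is an assumption)."""
    return estimate_memory_gb(params_billions) <= ram_gb - reserve_gb
```

For a 32B model this predicts about 21.6 GB, close to the 21.5 GB we measured below; treat it as a fit/no-fit heuristic, not a precise figure.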

The Benchmarks

We tested 12 models across three categories: speed (tokens/second), quality (human evaluation on coding + reasoning tasks), and memory usage. All models are quantized (Q4_K_M or Q5 where available).

Small Models (1B - 3B parameters)

| Model | Size | Speed (tok/s) | Memory | Quality | Best For |
|---|---|---|---|---|---|
| qwen3:1.7b | 1.1 GB | 85-95 | 2.1 GB | ⭐⭐ | Classification, extraction |
| phi-4-mini:3.8b | 2.5 GB | 60-70 | 3.8 GB | ⭐⭐⭐ | Summarization, simple Q&A |
| llama3.2:3b | 2.0 GB | 70-80 | 3.2 GB | ⭐⭐⭐ | General assistant, chat |

Verdict: These models fly on the M4. 60-95 tokens per second means responses feel instant. Quality is limited — they struggle with complex reasoning and multi-step tasks — but for classification, extraction, and simple generation, they’re more than enough. And you can run them alongside everything else without noticeable system impact.

Medium Models (7B - 14B parameters)

| Model | Size | Speed (tok/s) | Memory | Quality | Best For |
|---|---|---|---|---|---|
| qwen3:8b | 4.9 GB | 40-50 | 6.5 GB | ⭐⭐⭐⭐ | Coding, analysis, chat |
| llama3.1:8b | 4.7 GB | 42-52 | 6.3 GB | ⭐⭐⭐⭐ | General purpose, coding |
| mistral:7b | 4.1 GB | 45-55 | 5.8 GB | ⭐⭐⭐⭐ | Fast general purpose |
| gemma2:9b | 5.4 GB | 38-45 | 7.1 GB | ⭐⭐⭐⭐ | Instruction following |
| deepseek-coder-v2:16b | 8.9 GB | 25-32 | 11.2 GB | ⭐⭐⭐⭐⭐ | Coding specifically |

Verdict: The sweet spot. Qwen3 8B and Llama 3.1 8B deliver surprisingly good quality at 40-50 tok/s. DeepSeek Coder v2 at 16B is the best local coding model we’ve tested: it understands complex codebases and generates correct, idiomatic code. The tradeoff is speed (25-32 tok/s) and memory (11.2 GB, nearly half the machine’s RAM).

Large Models (30B - 70B parameters)

| Model | Size | Speed (tok/s) | Memory | Quality | Best For |
|---|---|---|---|---|---|
| qwen3:32b | 19.8 GB | 8-12 | 21.5 GB | ⭐⭐⭐⭐⭐ | Complex reasoning |
| llama3.3:70b-q2 | 24.6 GB | 3-5 | 25+ GB | ⭐⭐⭐⭐⭐ | Max quality (slow) |
| command-r:35b | 20.1 GB | 7-10 | 22 GB | ⭐⭐⭐⭐ | RAG, long context |

Verdict: This is where the M4’s 24GB starts to sweat. Qwen3 32B fits but barely — you’ll see system slowdown as macOS juggles memory between the model and everything else. The 70B models require aggressive quantization (Q2) that noticeably hurts quality, and at 3-5 tok/s, you’re waiting 30+ seconds for a paragraph. Not practical for interactive use, but fine for batch processing overnight.

The Practical Setup

Here’s what we actually run in production:

Always loaded (background):

  • qwen3:8b — handles 80% of requests (agent tools, classification, quick generation)

On-demand (loaded when needed):

  • deepseek-coder-v2:16b — code generation and review tasks
  • qwen3:1.7b — fast classification and extraction

Never (not worth it on 24GB):

  • 70B models — too slow, quality difference vs 32B doesn’t justify the cost
  • Multiple large models simultaneously — memory thrashing kills performance
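
In code, this policy is just a lookup table. A minimal sketch of the routing above (the task-category names and the dispatch helper are illustrative, not part of Ollama):

```python
# Sketch of the model policy: map task categories to the models above.
# Category names are illustrative assumptions, not an Ollama concept.

MODEL_FOR_TASK = {
    "classification": "qwen3:1.7b",            # tiny, loaded on demand
    "extraction":     "qwen3:1.7b",
    "coding":         "deepseek-coder-v2:16b",  # loaded on demand
    "default":        "qwen3:8b",               # always resident
}

def pick_model(task: str) -> str:
    """Return the model name for a task category, falling back to the default."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["default"])
```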

Memory Management Tips

Ollama keeps models in memory after loading. On 24GB, this matters:

# See what's loaded
ollama ps

# Unload a model manually
ollama stop qwen3:32b

# Set keep-alive (auto-unload after inactivity) via an
# environment variable before starting the server:
OLLAMA_KEEP_ALIVE=5m ollama serve

Our recommendation: set keepalive to 5 minutes. This auto-unloads models you haven’t used recently, freeing memory for the next request. The cold-start penalty (10-30 seconds to reload a model) is worth it to avoid memory pressure.
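
The keep-alive can also be overridden per request: Ollama’s native /api/chat endpoint accepts a keep_alive field ("0" unloads the model immediately after the call, "-1" pins it in memory). A small sketch building such a request body:

```python
# Build a request body for Ollama's native /api/chat endpoint with a
# per-request keep_alive override ("0" = unload now, "-1" = pin forever).

def chat_payload(model: str, prompt: str, keep_alive: str = "5m") -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,
    }

# Usage (assumes a running Ollama server):
# httpx.post("http://localhost:11434/api/chat",
#            json=chat_payload("qwen3:8b", "hi", keep_alive="0"), timeout=60.0)
```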

API Integration

Ollama exposes an OpenAI-compatible API. Any tool that supports base_url configuration works:

import httpx

response = httpx.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Explain Docker networking in 3 sentences"}],
        "temperature": 0.7,
    },
    timeout=60.0,  # httpx defaults to 5 seconds, too short for LLM generation
)
print(response.json()["choices"][0]["message"]["content"])

This works with LangChain, LlamaIndex, OpenAI’s Python SDK (just change the base URL), and any HTTP client.

Cost Comparison: Local vs Cloud

Let’s do the math for a typical month of moderate AI usage:

| Scenario | Cloud (OpenAI) | Cloud (Anthropic) | Local (Ollama) |
|---|---|---|---|
| 30 tasks/day, 30 days | ~$50 | ~$80 | $0* |
| 100 tasks/day, 30 days | ~$150 | ~$250 | $0* |
| 500 tasks/day, 30 days | ~$500+ | ~$800+ | $0* |

*Local cost is $0 marginal — you already bought the Mac Mini. Amortized over 3 years, the hardware cost is ~$17/month. Electricity adds maybe $5-8/month.

The breakeven point is roughly 2-3 months of moderate usage. After that, local AI is essentially free.
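
The arithmetic behind that breakeven claim, as a tiny sketch (dollar figures are the estimates from the table above; the electricity default is our assumption):

```python
# Back-of-envelope breakeven: months until the Mac Mini pays for itself
# versus a monthly cloud bill, net of estimated electricity cost.

def breakeven_months(hardware_cost: float, monthly_cloud_cost: float,
                     monthly_electricity: float = 6.5) -> float:
    return hardware_cost / (monthly_cloud_cost - monthly_electricity)

print(round(breakeven_months(599, 250), 1))  # heavier Anthropic usage → ~2.5 months
```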

The catch: Local models (even 32B) don’t match Claude Opus or GPT-5.3 on complex reasoning, creative writing, or nuanced analysis. You’re trading quality ceiling for cost and privacy. For many tasks — especially structured generation, classification, and code — the quality gap is negligible.

When Local Isn’t Enough

Be honest about what local models can’t do:

  • Complex multi-step reasoning — Cloud models (Opus, o3) are significantly better
  • Creative writing — Local models produce more generic, repetitive text
  • Very long context — 128K context in local models exists but quality degrades
  • Multimodal — Local vision models exist but lag behind GPT-4o and Claude
  • Cutting-edge knowledge — Local models have training cutoffs; cloud models browse the web

Our approach: local for volume, cloud for quality. Ollama handles 80% of requests (the structured, predictable ones). Claude and GPT handle the 20% that need maximum intelligence.
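
That split can be expressed as a one-line router; the task categories here are illustrative assumptions, not a fixed taxonomy:

```python
# Route predictable, structured work to the local Ollama server and
# escalate everything else to a cloud model.

LOCAL_TASKS = {"classification", "extraction", "summarization", "codegen"}

def route(task_type: str) -> str:
    """Return which backend a task should hit: "ollama" or "cloud"."""
    return "ollama" if task_type in LOCAL_TASKS else "cloud"
```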

Getting Started in 5 Minutes

# 1. Install Ollama
brew install ollama

# 2. Start the server
ollama serve &

# 3. Pull the recommended starter model
ollama pull qwen3:8b

# 4. Test it
ollama run qwen3:8b "What's the capital of France?"

# 5. Pull a coding model when you need it
ollama pull deepseek-coder-v2:16b

Total time: under 5 minutes (plus model download time, which depends on your internet speed).

The Mac Mini M4 won’t replace a cluster of H100s. But for personal AI infrastructure — agents, coding assistants, text processing, local embeddings — it’s the most cost-effective setup available in 2026. And once it’s running, you never think about API keys, rate limits, or monthly bills again.

Tags: ollama, mac mini m4, local ai, llm, benchmark, apple silicon, tutorial