Apple M1 Pro vs RTX 4070 Ti for local AI — GPU matmul and LLM benchmarks (2026)

RTX 4070 Ti hits 75 TFLOPS at fp16 via tensor cores; M1 Pro MPS reaches 4.2 TFLOPS — an 18× compute gap. But on real LLM inference (minicpm-v4.6, qwen3.5, gemma4:12b via ollama), the RTX is only 3–4× faster. LLM inference at batch=1 is memory-bandwidth-bound, not compute-bound, and the M1's unified-memory architecture closes most of the gap. Apple's AMX extensions also give its CPU 2.7× better matmul throughput than an Intel i5 with the same thread count.

TL;DR — A 4096×4096 float16 matrix multiply exposes an 18× raw compute gap: RTX 4070 Ti hits 75 TFLOPS via tensor cores; M1 Pro MPS reaches 4.2 TFLOPS. But run an actual LLM and the gap collapses to 3–4× — because token generation at batch size 1 is memory-bandwidth-bound, not compute-bound. Both machines can run useful models locally; they trade off on VRAM ceiling vs speed.

Companion to Rust vs Go vs C on the same machines — same hardware, different question.

Setup

| | M1 Pro (Metal/MPS) | RTX 4070 Ti (CUDA) |
|---|---|---|
| Machine | Apple MacBook Pro | Desktop PC |
| GPU | M1 Pro 16-core GPU | NVIDIA GeForce RTX 4070 Ti |
| VRAM / GPU mem | 16 GB unified (shared with CPU) | 12 GB GDDR6X |
| CPU | M1 Pro, 8-core (6P+2E) | Intel i5-13400F, 16 threads |
| RAM | 16 GB | 125 GB |
| OS | macOS 26.6 | Ubuntu 24.04, CUDA 12.1, driver 595 |
| PyTorch | 2.7.0, MPS backend | 2.5.1+cu121 |
| ollama | 0.30.8 | 0.30.8 |

Benchmark 1 — GPU matrix multiply (raw TFLOPS)

A single 4096×4096 dense matmul, 20 timed runs after 5 warmup runs. The device is synchronised before and after the timed region so the clock covers the full kernel execution rather than just the asynchronous launch. This is a proxy for raw tensor compute throughput: the ceiling everything else runs against.
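The procedure described here can be sketched roughly as follows. This is not the original benchmark script: `matmul_tflops`, `time_per_run` and `run_torch_benchmark` are hypothetical names, and the device-selection order (CUDA, then MPS, then CPU) is an assumption. The only arithmetic involved is that an n×n by n×n matmul costs 2·n³ floating-point operations.

```python
import time


def matmul_tflops(n: int, seconds_per_run: float) -> float:
    """TFLOPS for one n x n by n x n matmul: 2*n^3 ops (one multiply + one add each)."""
    return 2 * n**3 / seconds_per_run / 1e12


def time_per_run(fn, sync, warmup: int = 5, runs: int = 20) -> float:
    """Mean seconds per call of fn, with sync() fencing the timed region."""
    for _ in range(warmup):
        fn()
    sync()  # drain warmup work before starting the clock
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    sync()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / runs


def run_torch_benchmark(n: int = 4096, dtype_name: str = "float16") -> float:
    """Hypothetical driver; needs PyTorch installed. Returns measured TFLOPS."""
    import torch  # imported lazily so the helpers above work without it

    if torch.cuda.is_available():
        device, sync = "cuda", torch.cuda.synchronize
    elif torch.backends.mps.is_available():
        device, sync = "mps", torch.mps.synchronize
    else:
        device, sync = "cpu", (lambda: None)  # CPU ops are already synchronous

    dtype = getattr(torch, dtype_name)
    # Generate in float32, then cast, so this works for any dtype on any device.
    a = torch.randn(n, n, device=device).to(dtype)
    b = torch.randn(n, n, device=device).to(dtype)
    return matmul_tflops(n, time_per_run(lambda: a @ b, sync))
```

Calling `matmul_tflops(4096, 0.2378)` reproduces the i5-13400F row of the table below (about 0.58 TFLOPS), which is a quick way to check that the ms and TFLOPS columns agree.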

| Device | dtype | ms / run | TFLOPS |
|---|---|---|---|
| i5-13400F CPU | float32 | 237.8 ms | 0.58 |
| M1 Pro CPU | float32 | 86.4 ms | 1.59 |
| M1 Pro MPS | float32 | 35.8 ms | 3.84 |
| M1 Pro MPS | float16 | 33.0 ms | 4.16 |
| RTX 4070 Ti CUDA | float32 | 4.9 ms | 28.3 |
| RTX 4070 Ti CUDA | float16 | 1.8 ms | 75.3 |
| RTX 4070 Ti CUDA | bfloat16 | 1.8 ms | 74.9 |

What the numbers mean

M1 Pro MPS float16 → float32 barely differs (4.16 vs 3.84 TFLOPS). The M1 GPU does not have dedicated fp16 tensor cores the way NVIDIA does. Both precisions go through the same hardware; the fp16 path saves some memory bandwidth but gains little compute.

RTX 4070 Ti float16 → 75 TFLOPS: an 18× leap over M1 and a 2.7× leap over its own float32. This is the tensor core effect. NVIDIA’s Ada Lovelace generation (RTX 40xx) has 4th-gen tensor cores that operate natively on fp16/bf16 matrix tiles and reach roughly 4× the throughput of the fp32 CUDA cores on the same chip. PyTorch routes torch.float16 matmul through them automatically.

M1 Pro CPU beats Intel i5-13400F CPU 2.7× despite both having 8 active threads here. The M1 uses AMX (Apple Matrix Extensions), a dedicated matrix-multiply coprocessor attached to each CPU core cluster and separate from NEON SIMD. PyTorch’s macOS build reaches it through Apple’s Accelerate BLAS, so float32 matmul is routed there automatically. The Intel i5 uses AVX2 (256-bit SIMD repurposed for matmul), which is general-purpose and substantially less efficient for dense matrix work.
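As a sanity check on the ratios quoted in this section, the sketch below recomputes them from the TFLOPS column of the Benchmark 1 table. The dictionary layout and the `speedup` helper are illustrative names only.

```python
# TFLOPS column from the Benchmark 1 table, keyed by (device, dtype).
TFLOPS = {
    ("i5-13400F CPU", "float32"): 0.58,
    ("M1 Pro CPU", "float32"): 1.59,
    ("M1 Pro MPS", "float32"): 3.84,
    ("M1 Pro MPS", "float16"): 4.16,
    ("RTX 4070 Ti CUDA", "float32"): 28.3,
    ("RTX 4070 Ti CUDA", "float16"): 75.3,
}


def speedup(fast, slow):
    """How many times higher the first entry's throughput is than the second's."""
    return TFLOPS[fast] / TFLOPS[slow]


comparisons = {
    "RTX fp16 vs M1 MPS fp16": (("RTX 4070 Ti CUDA", "float16"), ("M1 Pro MPS", "float16")),
    "RTX fp32 vs M1 MPS fp32": (("RTX 4070 Ti CUDA", "float32"), ("M1 Pro MPS", "float32")),
    "RTX fp16 vs RTX fp32": (("RTX 4070 Ti CUDA", "float16"), ("RTX 4070 Ti CUDA", "float32")),
    "M1 MPS fp16 vs M1 MPS fp32": (("M1 Pro MPS", "float16"), ("M1 Pro MPS", "float32")),
    "M1 CPU vs i5 CPU": (("M1 Pro CPU", "float32"), ("i5-13400F CPU", "float32")),
}

for label, (fast, slow) in comparisons.items():
    print(f"{label}: {speedup(fast, slow):.1f}x")
```

The first line is the 18× headline gap, the second is the 7× gap at float32, and the fourth shows how little the M1 gains from dropping to float16.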

Benchmark 2 — LLM inference (tokens / second, ollama)

Fixed 300-token generation task, temperature=0, measuring eval_count / eval_duration from the ollama API response (excludes prompt-eval time, which is reported separately). The prompt asks for a detailed explanation of transformer internals to ensure the model generates dense, varied output.
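A minimal sketch of that measurement, assuming a local ollama server on its default port. `generate` and `tokens_per_second` are hypothetical helper names, and capping output with `num_predict` is an assumption about how the 300-token limit was enforced. The `eval_count` and `eval_duration` fields come from the ollama response, with durations reported in nanoseconds.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Decode throughput: generated tokens / generation time (eval_duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """POST to ollama's /api/generate and return the final, non-streamed JSON body."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object with the timing fields, not a token stream
        "options": {"temperature": 0, "num_predict": 300},  # assumed 300-token cap
    }
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With a server running, `tokens_per_second(generate("qwen3.5", "Explain transformer internals in detail."))` would give the decode rate for one run. The prompt-eval figures come from the separate `prompt_eval_duration` field in the same response.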

| Model | Size | M1 Pro tok/s | RTX 4070 Ti tok/s | RTX / M1 |
|---|---|---|---|---|
| minicpm-v4.6 | 1.6 GB | 95.3 | 278.9 | 2.9× |
| qwen3.5 | 6.6 GB | 20.3 | 73.5 | 3.6× |
| gemma4:12b | 7.6 GB | | 49.1 | |

Prompt eval (time to first token): M1 Pro 25 ms / 115 ms for minicpm-v4.6 and qwen3.5; RTX 4070 Ti 9 ms / 22 ms / 42 ms for the three models in table order.

The bandwidth story

The raw compute gap is 18×. The LLM throughput gap is only 3–4×. Why?

LLM decoding at batch=1 is memory-bandwidth-bound, not compute-bound. Each generated token requires loading the entire set of model weights from VRAM (or unified memory) once per forward pass. The compute per byte is low. The GPU spends almost all its time moving data, not multiplying.

The relevant number is memory bandwidth:

  • RTX 4070 Ti: ~504 GB/s (GDDR6X, dedicated VRAM bus)
  • M1 Pro: ~200 GB/s (LPDDR5, shared CPU/GPU)

Bandwidth ratio: 2.5×, which is close to the measured LLM throughput ratio of 2.9–3.6× and nowhere near the 18× compute ratio. The M1 is slower because it has less memory bandwidth, not because its GPU is 18× weaker. For LLM inference at small batch sizes, memory bandwidth is what you’re actually paying for.
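The bandwidth argument can be put into numbers with a back-of-envelope sketch. It assumes that every generated token streams the full weight file once, so the "Size" column is the number of bytes read per token, and it ignores KV-cache traffic and any weights not touched during decoding. `machine_balance` and `decode_ceiling_tok_s` are hypothetical helper names.

```python
def machine_balance(peak_tflops: float, bandwidth_gbs: float) -> float:
    """FLOPs the chip can execute per byte it can load from memory (FLOP/byte)."""
    return peak_tflops * 1e12 / (bandwidth_gbs * 1e9)


def decode_ceiling_tok_s(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tok/s if each token requires reading all weights once."""
    return bandwidth_gbs / model_gb


BANDWIDTH_GBS = {"M1 Pro": 200.0, "RTX 4070 Ti": 504.0}
PEAK_FP16_TFLOPS = {"M1 Pro": 4.16, "RTX 4070 Ti": 75.3}  # from Benchmark 1
MEASURED = {  # (model, size in GB) -> measured tok/s per device, from Benchmark 2
    ("minicpm-v4.6", 1.6): {"M1 Pro": 95.3, "RTX 4070 Ti": 278.9},
    ("qwen3.5", 6.6): {"M1 Pro": 20.3, "RTX 4070 Ti": 73.5},
}

for device, bandwidth in BANDWIDTH_GBS.items():
    balance = machine_balance(PEAK_FP16_TFLOPS[device], bandwidth)
    print(f"{device}: needs ~{balance:.0f} FLOP/byte to become compute-bound")

for (model, size_gb), results in MEASURED.items():
    for device, tok_s in results.items():
        ceiling = decode_ceiling_tok_s(BANDWIDTH_GBS[device], size_gb)
        print(f"{model} on {device}: {tok_s:.1f} of ~{ceiling:.0f} tok/s ({tok_s / ceiling:.0%})")
```

Batch-1 decoding performs about 2 FLOPs per weight read, which is only a few FLOP/byte even for 4-bit weights. That is far below the roughly 21 (M1 Pro) and 149 (RTX 4070 Ti) FLOP/byte this prints, so both chips sit on the bandwidth-limited side. Under these assumptions the measured rates land at roughly 67–96% of the bandwidth ceiling on both machines.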

The unified-memory advantage

The M1 Pro’s GPU draws on the same 16 GB unified pool as the CPU, so its ceiling is 16 GB minus whatever macOS and other applications are holding. An RTX 4070 Ti has 12 GB of dedicated VRAM with a hard cliff: models whose weights exceed ~11 GB don’t fit and fall back to slow system RAM via PCIe, which collapses throughput.

| Model (approx. weight size) | M1 Pro 16 GB | RTX 4070 Ti 12 GB |
|---|---|---|
| 1–3B fp16 (~2–6 GB) | Runs fast | Runs fast |
| 7B fp16 (~14 GB) | Runs fast | Overflows VRAM (~13.5 GB needed); spills to system RAM |
| 13B fp16 (~26 GB) | Doesn’t fit | Doesn’t fit |
| 7B quantised q4 (~4 GB) | Fast | Fast |
| 13B quantised q4 (~8 GB) | Fast | Fast |
| 30B quantised q4 (~18 GB) | Doesn’t fit | Doesn’t fit |

For the common use case — a quantised 7B or 13B model — both machines work comfortably. The M1 Pro’s unified memory lets it load an unquantised 7B model in fp16 that would overflow a 12GB RTX; the RTX runs the same model quantised to q4 faster overall.
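The sizes in the table follow from parameter count times bits per weight. The sketch below makes that explicit under two labelled assumptions: q4 is taken as about 4.5 bits per weight (quantised formats carry some scale metadata on top of the 4-bit values), and `headroom_gb` is a made-up 1 GB allowance for KV cache and runtime buffers. `weights_gb` and `fits` are hypothetical helpers, and real usable memory will be lower once the OS and other processes are counted.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in GB (10^9 bytes): parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


FP16_BITS = 16.0
Q4_BITS = 4.5  # assumption: 4-bit values plus per-block scale overhead

MACHINES = {"M1 Pro (16 GB unified)": 16.0, "RTX 4070 Ti (12 GB VRAM)": 12.0}
MODELS = [
    ("7B fp16", 7, FP16_BITS),
    ("13B fp16", 13, FP16_BITS),
    ("7B q4", 7, Q4_BITS),
    ("13B q4", 13, Q4_BITS),
    ("30B q4", 30, Q4_BITS),
]

for name, params, bits in MODELS:
    size = weights_gb(params, bits)
    verdicts = ", ".join(
        f"{machine}: {'fits' if fits(params, bits, mem) else 'does not fit'}"
        for machine, mem in MACHINES.items()
    )
    print(f"{name} (~{size:.1f} GB): {verdicts}")
```

Under these assumptions the only row where the two machines disagree is 7B at fp16, which is the case the unified-memory argument rests on.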

What 95 vs 279 tok/s actually feels like

minicpm-v4.6 at 95 tok/s on M1 Pro produces tokens faster than anyone reads. qwen3.5 at 20 tok/s on M1 Pro is borderline — you see words appearing but it feels slightly slow. On the RTX at 74 tok/s, qwen3.5 is comfortable.

For interactive use: 20–30 tok/s is the floor of what feels acceptable. Both machines clear it on the small model; only the RTX clears it comfortably on the 6.6GB model.

For a batch transcription or summarisation pipeline running continuously, the RTX wins clearly — 279 vs 95 tok/s means 3× the throughput with the same model. The Danish voice-log pipeline using Claude Code subagents as the summary layer runs exactly this kind of sustained batch workload; an RTX 4070 Ti would run the same workload in a third of the clock time.

Power draw

No power meter was connected for this run, so these figures are estimates that give the order of magnitude (for reference, the RTX 4070 Ti’s rated board power is 285 W):

| | Idle | Full GPU load |
|---|---|---|
| M1 Pro MacBook Pro | ~5 W | ~25–30 W |
| RTX 4070 Ti (card only) | ~15 W | ~200 W |

The M1 wins decisively on efficiency. At 95 tok/s for ~25 W and 279 tok/s for ~200 W, the RTX delivers 2.9× the throughput at 7–8× the power, so the M1 generates roughly 2.5× more tokens per watt on this model.
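The efficiency figure is simple division, sketched here with the estimated (not measured) wattages from the table and the minicpm-v4.6 throughput numbers. `tokens_per_joule` is an illustrative helper name.

```python
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Tokens generated per joule: (tokens/s) / (joules/s)."""
    return tok_per_s / watts


# minicpm-v4.6 throughput from Benchmark 2; power figures are the rough estimates above.
m1_at_30w = tokens_per_joule(95.3, 30.0)
m1_at_25w = tokens_per_joule(95.3, 25.0)
rtx_at_200w = tokens_per_joule(278.9, 200.0)

print(f"M1 Pro:      {m1_at_30w:.2f} to {m1_at_25w:.2f} tok/J")
print(f"RTX 4070 Ti: {rtx_at_200w:.2f} tok/J")
print(f"M1 advantage: {m1_at_30w / rtx_at_200w:.1f}x to {m1_at_25w / rtx_at_200w:.1f}x")
```

Across the 25–30 W range the M1 advantage works out to roughly 2.3× to 2.7×. The result is only as good as the power estimates, so a wall-socket meter would be needed to pin it down.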

Which machine for local AI?

| Use case | Pick |
|---|---|
| Interactive local assistant (7B–13B quantised) | Either: RTX faster, M1 quieter and more efficient |
| Sustained batch inference (transcription, summaries) | RTX 4070 Ti: 3× throughput |
| Loading large fp16 models without quantisation | M1 Pro: unified memory, no VRAM cliff |
| Training or fine-tuning (LoRA) | RTX 4070 Ti: tensor cores, CUDA ecosystem |
| Always-on home server (power bill matters) | M1 Pro: ~7–8× lower GPU power draw |
| Numerical ML (CV, embeddings, matmul-heavy) | RTX 4070 Ti: 7–18× faster |

Benchmarks run on personal hardware, June 2026. LLM throughput depends heavily on model quantisation, context length, and system load — treat these numbers as order-of-magnitude comparisons, not absolute specs.