Apple M1 Pro vs RTX 4070 Ti for local AI — GPU matmul and LLM benchmarks (2026)
RTX 4070 Ti hits 75 TFLOPS at fp16 via tensor cores; M1 Pro MPS reaches 4.2 TFLOPS, an 18× compute gap. But on real LLM inference (minicpm-v4.6, qwen3.5, gemma4:12b via ollama), the RTX is only 3–4× faster. LLM inference at batch=1 is memory-bandwidth-bound, not compute-bound, so the M1 Pro's unified-memory architecture closes most of the gap. Apple's AMX extensions also give its CPU 2.7× higher matmul throughput than an Intel i5 at the same thread count.
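
To make the bandwidth-bound argument concrete, here is a rough back-of-the-envelope sketch in Python. It compares the compute ratio quoted above with a memory-bandwidth ratio and derives a ceiling on batch=1 decode speed. The TFLOPS figures come from the text; the bandwidth values (504 GB/s and 200 GB/s) are vendor-published peak numbers, not measurements from this benchmark, and the 8 GB model footprint is a hypothetical placeholder. Treat this as an illustration of the reasoning, not as the benchmark method.

```python
# Back-of-the-envelope comparison: compute gap vs. memory-bandwidth gap.

# Figures quoted in the text.
RTX_FP16_TFLOPS = 75.0
M1_PRO_MPS_TFLOPS = 4.2

# ASSUMPTIONS (not from the text): vendor-published peak memory bandwidth.
RTX_4070_TI_BW_GB_S = 504.0
M1_PRO_BW_GB_S = 200.0

# ASSUMPTION: hypothetical weight footprint of a quantized model, in GB.
MODEL_SIZE_GB = 8.0


def ratio(fast: float, slow: float) -> float:
    """How many times larger `fast` is than `slow`."""
    return fast / slow


def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on batch=1 decode speed.

    Each generated token has to stream all model weights from memory once,
    so peak bandwidth divided by model size caps tokens per second.
    Real throughput is lower (KV cache reads, kernel overheads, and so on).
    """
    return bandwidth_gb_s / model_size_gb


compute_ratio = ratio(RTX_FP16_TFLOPS, M1_PRO_MPS_TFLOPS)
bandwidth_ratio = ratio(RTX_4070_TI_BW_GB_S, M1_PRO_BW_GB_S)
rtx_ceiling = decode_ceiling_tokens_per_sec(RTX_4070_TI_BW_GB_S, MODEL_SIZE_GB)
m1_ceiling = decode_ceiling_tokens_per_sec(M1_PRO_BW_GB_S, MODEL_SIZE_GB)

print(f"compute ratio:   {compute_ratio:.1f}x")
print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"decode ceiling (RTX 4070 Ti): {rtx_ceiling:.1f} tok/s")
print(f"decode ceiling (M1 Pro):      {m1_ceiling:.1f} tok/s")
```

Under these assumptions the bandwidth ratio comes out near 2.5×, so the measured 3–4× inference gap sits far closer to the bandwidth ratio than to the 18× compute ratio, which is what a bandwidth-bound workload would predict.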