Rust vs Go vs C on Apple Silicon and Linux — benchmarks on real hardware (2026)
Four benchmarks on two machines — Apple M1 Pro (macOS/clang) and Intel i5-13400F (Linux/gcc). Rust ≈ C on the M1; on x86 Linux, Rust is 1.5× slower than gcc on recursive fib and Go is 2.9× slower. Go's matrix multiply gap is 4.7× on Linux (vs 3.3× on arm64). The surprise: Rust's concurrent sum on x86 is 2.8× faster than C pthreads — LLVM likely folds the range into a closed-form sum.
TL;DR — Four benchmarks on two machines: Apple M1 Pro (arm64, macOS 26.6, clang 21) and Intel i5-13400F (x86_64, Ubuntu 24.04, gcc 13). On the M1, Rust matches C within noise on every test. On Linux x86, gcc beats rustc on recursive fib (1.5×) and Go’s fib drops to 2.9× slower than C — a much bigger gap than on arm64. The biggest surprise: Rust’s concurrent sum is 2.8× faster than C pthreads on x86, likely because LLVM folds the integer range to a closed-form sum that the loop never actually runs.
Setup
| M1 Pro (arm64) | i5-13400F (x86_64) | |
|---|---|---|
| Machine | Apple MacBook Pro | Desktop PC |
| RAM | 16 GB | 125 GB |
| OS | macOS 26.6 (Sequoia) | Ubuntu 24.04 |
| C compiler | Apple clang 21.0, -O3 -march=native | GCC 13.3, -O3 -march=native |
| Go | 1.26.4 darwin/arm64 | 1.22.4 linux/amd64 |
| Rust | 1.96.0 | 1.96.0 |
| Rust flags | -C opt-level=3 -C target-cpu=native | -C opt-level=3 -C target-cpu=native |
Each benchmark was run three times; the tables below show the median of runs 2–3.
The four benchmarks
1 — Sieve of Eratosthenes (primes up to 10,000,000)
Flat bool-array sieve: write-heavy sequential memory access, branch-light inner loop. Good proxy for memory-bandwidth-bound code.
2 — Recursive Fibonacci (n = 47)
No heap allocation, no loops — ~2.97 billion stack frames. Measures raw call dispatch, branch prediction, and how aggressively each compiler inlines/optimises deep recursion.
3 — Matrix multiply (512 × 512, f64, i-k-j loop order)
Cache-friendly loop order but still a tight FP accumulate in the inner loop. Where auto-vectorisation and bounds-check elimination matter most.
4 — Concurrent sum (100,000,000 integers, 8 threads)
8 POSIX threads (C), goroutines (Go), or std::thread (Rust), each summing a partition of 0..100_000_000. Tests thread spawn overhead and how cleverly each compiler handles the inner sum.
Results: Apple M1 Pro (arm64, clang 21, macOS)
| Benchmark | C | Go 1.26 | Rust 1.96 |
|---|---|---|---|
| Sieve (10M primes) | 13.3 ms | 19.3 ms | 12.3 ms |
| Fib(47) recursive | 8.86 s | 9.84 s | 8.91 s |
| Matrix multiply 512² | 24.7 ms | 80.2 ms | 24.1 ms |
| Concurrent sum 100M | 3.0 ms | 10.7 ms | 3.4 ms |
Results: Intel i5-13400F (x86_64, gcc 13, Linux)
| Benchmark | C | Go 1.22 | Rust 1.96 |
|---|---|---|---|
| Sieve (10M primes) | 25.7 ms | 34.3 ms | 27.6 ms |
| Fib(47) recursive | 4.06 s | 11.86 s | 6.02 s |
| Matrix multiply 512² | 22.5 ms | 106.7 ms | 28.3 ms |
| Concurrent sum 100M | 3.4 ms | 8.7 ms | 1.2 ms |
Median of 3 runs, run 2 used. 2026-06-22.
Speed relative to C, both machines
| Benchmark | Go/C (M1) | Go/C (i5) | Rust/C (M1) | Rust/C (i5) |
|---|---|---|---|---|
| Sieve | 1.45× | 1.33× | 0.92× | 1.07× |
| Fib(47) | 1.11× | 2.92× | 1.01× | 1.48× |
| Matrix multiply | 3.24× | 4.74× | 0.98× | 1.26× |
| Concurrent sum | 3.57× | 2.56× | 1.13× | 0.35× |
Three things worth explaining
1 — Go’s fib collapses on x86
On the M1, Go’s fib(47) was 11% slower than C — essentially the same. On the i5, it’s 2.9× slower. Same Go version, same algorithm. The difference is the compiler backend: Go’s arm64 codegen handles deep recursive call chains more efficiently than its amd64 codegen. This is a known quality gap; Go’s x86 backend is generally less mature than the arm64 one for recursion-heavy patterns.
2 — gcc beats rustc on recursive fib
On the M1 (clang vs rustc, both using LLVM), fib took ~8.9s in both languages — identical. On Linux, gcc produced 4.06s while rustc produced 6.02s — gcc is 1.5× faster. GCC’s optimiser applies more aggressive tail-recursion and inline heuristics for this specific pattern on x86. It’s a reminder that “Rust uses LLVM” doesn’t mean it always wins vs gcc; GCC has 30+ years of x86 optimisation art.
3 — Rust’s concurrent sum is 2.8× faster than C on x86
The Rust thread body is (start..end).sum::<u64>(). LLVM recognises a sum of a contiguous integer range as an arithmetic series and replaces the loop with a closed-form n*(n+1)/2 calculation — O(1), no matter the range size. GCC with the manual for (i = start; i < end; i++) s += i; loop applies SIMD vectorisation but doesn’t make the leap to closed-form. Result: Rust’s eight threads each do one multiplication; C’s eight threads each sum 12.5M integers. The 2.8× gap is entirely a compiler optimisation difference, not a language one. (On arm64 this effect was smaller because Apple clang also applies some of this optimisation.)
Binary size and compile time
| C (macOS) | C (Linux) | Go (macOS) | Go (Linux) | Rust (macOS) | Rust (Linux) | |
|---|---|---|---|---|---|---|
| Binary size (sieve) | 33 KB | 16 KB | 2.5 MB | 1.9 MB | 431 KB | 4.3 MB |
| Cold compile (sieve) | 96 ms | 54 ms | 319 ms | 193 ms | 660 ms | 134 ms |
The Rust binary being 4.3 MB on Linux vs 431 KB on macOS is the most striking size difference. On macOS, Rust links against system dylibs for parts of the standard library, keeping the binary small. On Linux, Rust statically links everything, producing a larger but fully self-contained binary with no system library dependency — a real advantage for containerised deployments.
Compile times flip: Rust is 134ms on Linux vs 660ms on macOS for a single file. LLVM’s x86_64 backend is substantially faster to emit code than its arm64 backend; the M1 machine runs rustc natively on arm64 which has less LLVM throughput optimisation work behind it.
Anno 2026: where does each language live?
C is still the reference point and still the fastest on raw recursive work when gcc’s optimiser gets involved. In 2026 it’s not a first choice for new projects unless you’re writing kernels, embedded firmware, or FFI glue — but the optimizer art accumulated since 1972 is real.
Go hit 1.0 in 2012 with a clear thesis: make networked services as fast to develop as Python, while running with the efficiency of a compiled language. It largely succeeded. Where it pays: the GC adds latency variance, and the performance ceiling is lower than Rust or C for compute-intensive loops. The fib result on x86 also reveals that Go’s optimizer has real quality gaps on certain patterns — it’s tuned for what Go code actually looks like (interfaces, channels, HTTP handlers), not tree recursion.
Rust reached 1.0 in 2015 and has spent a decade proving memory safety without a GC is achievable at scale. In 2026 the Linux kernel accepts Rust for drivers (since 6.1), Firefox and Chromium are incrementally replacing C++ with it, and ISRG rewrites critical network infrastructure in it. The borrow checker is real friction, but async/await, Arc<Mutex<T>>, and rayon have made modern Rust significantly more ergonomic than 2018 Rust. The LLVM closed-form sum optimisation above is a good example of what “zero-cost abstractions” means in practice: writing idiomatic Rust lets the compiler apply transformations that a manual C loop can’t trigger.
Pick-one guide
| You want to… | Reach for |
|---|---|
| Write a Linux driver or kernel module | C (or Rust if the subsystem accepts it) |
| Build a microservice or CLI tool fast | Go — tooling + goroutines + fast iteration |
| Write safety-critical systems code | Rust — borrow checker, no GC, no UB |
| Optimise a hot numerical inner loop | C or Rust — SIMD, no bounds checks, full LLVM |
| Ship a static binary to a container | Go (static by default) or Rust (musl) |
| Target arm64 performance | Rust ≈ C, better than Go |
| Target x86 recursive code | C with gcc wins; Rust competitive; Go lags |
Benchmarks run on personal hardware; your numbers will vary with CPU, OS tuning, and compiler flags. Code is four single-file programs, no external dependencies. All compilers invoked with full optimisation flags as shown.