5x Faster than Rust Standard Channel (MPSC)
The techniques used to achieve this speedup involve specialized, unsafe implementations and memory arena strategies tailored specifically for high-performance asynchronous task execution. This is not a robust, full-featured MPSC implementation, but rather an optimized channel that executes FnOnce. This is commonly implemented using MPSC over boxed closures, but memory allocation and thread contention were becoming the bottleneck.
The implementation is not a drop-in replacement for a channel: it doesn't support auto-flushing and makes many assumptions. Still, I believe it may be of use to some of you and may become a crate in the future.
Benchmarks
We performed several benchmarks to measure the performance differences between different ways of performing computation across threads, as well as our new communication layer in Burn. First, we isolated the channel implementation using random tasks. Then, we conducted benchmarks directly within Burn, measuring framework overhead by launching small tasks.
https://preview.redd.it/3d9fmws5bnog1.png?width=2048&format=png&auto=webp&s=949ecc004f58a0207c234684588860655416efba
The benchmarks reveal that a mutex remains the fastest way to perform computations with a single thread. This is expected, as it avoids data copying entirely and lacks contention when only one thread is active. When multiple threads are involved, however, it is a different story: the custom channel can be up to 10 times faster than the standard channel and roughly 2 times faster than the mutex. When measuring framework overhead with 8 threads, we can execute nearly twice as many tasks compared to using a reentrant mutex as the communication layer in Burn.
Why was a dedicated channel slower than a lock? The answer was memory allocation. Our API relies on sending closures over a channel. In standard Rust, this usually looks like Box<dyn FnOnce()>. Because these closures often exceeded 1000 bytes, we were placing massive pressure on the allocator. With multiple threads attempting to allocate and deallocate these boxes simultaneously, the contention was worse than the original mutex lock. To solve this, we moved away from the safety of standard trait objects and embraced pointer manipulation and pre-allocated memory.
Implementation Details
First, we addressed zero-allocation task enqueuing by replacing standard boxing with a tiered Double-Buffer Arena. Small closures (≤ 48 bytes) are now inlined directly into a 64-byte Task struct, aligned to CPU cache lines to prevent false sharing, while larger closures (up to 4KB) use a pre-allocated memory arena to bypass the global allocator entirely. We only fall back to a standard Box for closures larger than 4KB, which represent a negligible fraction of our workloads.
Second, we implemented lock-free double buffering to eliminate the contention typical of standard ring buffers. Using a Double-Buffering Swap strategy, producers write to a client buffer using atomic Acquire/Release semantics. When the runner thread is ready, it performs a single atomic swap to move the entire batch of tasks into a private server buffer, allowing the runner to execute tasks sequentially with zero interference from producers.
Finally, we ensured recursive safety via Thread Local Storage (TLS). To handle the recursion that originally necessitated reentrant mutexes, the runner thread now uses TLS to detect if it is attempting to submit a task to itself. If it is, the task is executed immediately and eagerly rather than being enqueued, preventing deadlocks without the heavy overhead of reentrant locking.
Conclusion
Should you implement a custom channel instead of relying on the standard library? Probably not. But can you significantly outperform general implementations when you have knowledge of the objects being transferred? Absolutely.
Full blog post: https://burn.dev/blog/faster-channel/
https://redd.it/1rrx1bx
@r_rust
Vite 8.0 is out. And it's full of 🦀 Rust
https://vite.dev/blog/announcing-vite8
https://redd.it/1rs2d4j
@r_rust
What's everyone working on this week (11/2026)?
New week, new Rust! What are you folks up to? Answer here or over at rust-users!
https://redd.it/1rv3nik
@r_rust
I am too stupid to use AVX-512
Recently I have been working on writing a simple 4x4 matrix implementation in rust using SIMD intrinsics. I defined the following struct to help with this.
```
#[repr(C, align(16))]
#[derive(Copy, Clone)]
union f32x16 {
    #[cfg(target_feature = "sse")]
    pub sse: [__m128; 4],
    #[cfg(target_feature = "avx")]
    pub avx: [__m256; 2],
    #[cfg(target_feature = "avx512f")]
    pub avx512: __m512,
    pub data: [f32; 16],
}

#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Mat4(f32x16);
```
My initial implementations of matrix multiplication were slower than a standard AVX2 implementation. I brushed it off as a skill issue on my part. But when I benchmarked two different matrix transpose implementations, I realized that something seriously wrong was going on.
The first AVX2 implementation
```
#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
#[inline(always)]
pub fn transpose(&self) -> Self {
    // Mat4 layout: avx[0] = [c0_0 c0_1 c0_2 c0_3 | c1_0 c1_1 c1_2 c1_3]
    //              avx[1] = [c2_0 c2_1 c2_2 c2_3 | c3_0 c3_1 c3_2 c3_3]
    // Goal rows:   res[0] = [c0_0 c1_0 c2_0 c3_0 | c0_1 c1_1 c2_1 c3_1] (rows 0, 1)
    //              res[1] = [c0_2 c1_2 c2_2 c3_2 | c0_3 c1_3 c2_3 c3_3] (rows 2, 3)
    use std::arch::x86_64::{_mm256_permutevar8x32_ps, _mm256_blend_ps, _mm256_set_epi32};
    unsafe {
        // Indices for rows 0/1: [idx5, idx1, idx5, idx1, idx4, idx0, idx4, idx0]
        let idx01 = _mm256_set_epi32(5, 1, 5, 1, 4, 0, 4, 0);
        let t0 = _mm256_permutevar8x32_ps(self.0.avx[0], idx01);
        let t1 = _mm256_permutevar8x32_ps(self.0.avx[1], idx01);
        let res01 = _mm256_blend_ps(t0, t1, 0b11001100);
        // Indices for rows 2/3: [idx7, idx3, idx7, idx3, idx6, idx2, idx6, idx2]
        let idx23 = _mm256_set_epi32(7, 3, 7, 3, 6, 2, 6, 2);
        let t2 = _mm256_permutevar8x32_ps(self.0.avx[0], idx23);
        let t3 = _mm256_permutevar8x32_ps(self.0.avx[1], idx23);
        let res23 = _mm256_blend_ps(t2, t3, 0b11001100);
        Mat4(f32x16 { avx: [res01, res23] })
    }
}
```
And the AVX-512 implementation
```
#[cfg(target_feature = "avx512f")]
#[inline(always)]
pub fn transpose(&self) -> Self {
    use std::arch::x86_64::{_mm512_permutexvar_ps, _mm512_set_epi32};
    unsafe {
        let indices = _mm512_set_epi32(
            15, 11, 7, 3, // Row 3 maps to Col 3
            14, 10, 6, 2, // Row 2 maps to Col 2
            13, 9, 5, 1,  // Row 1 maps to Col 1
            12, 8, 4, 0,  // Row 0 maps to Col 0
        );
        let transposed = _mm512_permutexvar_ps(indices, self.0.avx512);
        Mat4(f32x16 { avx512: transposed })
    }
}
```
Now the AVX-512 implementation is obviously much simpler: it's only 1 instruction compared to 6, and I assumed it would outperform the AVX2 implementation. But I was wrong.
```
Matrix Transpose/Mat4 Transpose
                        time:   [4.0170 ns 4.0173 ns 4.0176 ns]
                        change: [+600.21% +600.31% +600.45%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
```
600% slower. So I decided to paste both implementations into Godbolt and look at the asm output.
AVX2 Implementation
```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.LCPI0_3:
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.zero 4
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rsp]
vmovaps ymm1, ymmword ptr [rsp + 32]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_0]
vpermps ymm2, ymm2, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_1]
vpermps ymm3, ymm3, ymm0
vblendps ymm2, ymm3, ymm2, 204
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm1, ymm3, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vblendps ymm0, ymm0, ymm1, 204
vmovaps ymmword ptr [rsp + 64], ymm2
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
AVX-512 Implementation
```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.long 2
.long 6
.long 10
.long 14
.long 3
.long 7
.long 11
.long 15
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps zmm0, zmmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps zmmword ptr [rsp + 64], zmm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
Both look pretty normal to me. So then I tried letting LLVM autovectorize, targeting x86-64-v3 for AVX2 and x86-64-v4 for AVX-512.
x86-64-v3
```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.LCPI0_3:
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps ymm0, ymm0, ymmword ptr [rsp + 32]
vmovaps ymm1, ymmword ptr [rip + .LCPI0_1]
vpermps ymm2, ymm1, ymmword ptr [rsp]
vblendps ymm0, ymm2, ymm0, 204
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm2, qword ptr [rsp + 56]
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm2, ymm3, ymm2
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vpermps ymm1, ymm1, ymmword ptr [rsp + 8]
vblendps ymm0, ymm1, ymm0, 204
vblendps ymm0, ymm0, ymm2, 136
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
x86-64-v4
```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.LCPI0_1:
.long 8
.long 12
.long 0
.zero 4
.long 9
.long 13
.long 1
.zero 4
.LCPI0_2:
.long 0
.long 1
.long 2
.long 8
.long 4
.long 5
.long 6
.long 9
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm1, qword ptr [rsp + 56]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_1]
vpermi2ps ymm2, ymm0, ymmword ptr [rsp + 8]
vmovaps ymm0, ymmword ptr [rip + .LCPI0_2]
vpermi2ps ymm0, ymm2, ymm1
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
Looks about the same as my implementation, but more importantly, I observed the same performance regression.
For some reason LLVM literally generates SLOWER code if you target x86-64-v4 than if you target x86-64-v3. For context, my CPU is a Ryzen 7 9800X3D and I used the "criterion" crate for benchmarking.
Can someone more qualified try to explain this to me? Why is `_mm512_permutexvar_ps` SO SLOW? Is it only a Zen 5 thing? I assume it's an architecture thing, considering that LLVM also generated basically the same code as mine?
https://redd.it/1ryavdu
@r_rust
Symbolic derivatives and the Rust rewrite of the RE# regex engine
https://iev.ee/blog/symbolic-derivatives-and-the-rust-rewrite-of-resharp/
https://redd.it/1rygobk
@r_rust
https://iev.ee/blog/symbolic-derivatives-and-the-rust-rewrite-of-resharp/
https://redd.it/1rygobk
@r_rust
ian erik varatalu
symbolic derivatives and the rust rewrite of RE# | ian erik varatalu
Building an LSP Server with Rust is surprisingly easy and fun
https://codeinput.com/blog/lsp-server
https://redd.it/1rxy886
@r_rust
https://codeinput.com/blog/lsp-server
https://redd.it/1rxy886
@r_rust
Code Input
Building an LSP Server with Rust is surprisingly easy and fun | Blog | Code Input
A hands-on guide to building toy LSP servers in Rust
Hey Rustaceans! Got a question? Ask here (11/2026)!
Mystified about strings? Borrow checker has you in a headlock? Seek help here! There are no stupid questions, only docs that haven't been written yet. Please note that if you include code examples to e.g. show a compiler error or surprising result, linking a playground with the code will improve your chances of getting help quickly.
If you have a StackOverflow account, consider asking it there instead! StackOverflow shows up much higher in search results, so having your question there also helps future Rust users (be sure to give it the "Rust" tag for maximum visibility). Note that this site is very interested in question quality. I've been asked to read an RFC I authored once. If you want your code reviewed or review others' code, there's a codereview stackexchange, too. If you need to test your code, maybe the Rust playground is for you.
Here are some other venues where help may be found:
/r/learnrust is a subreddit to share your questions and epiphanies learning Rust programming.
The official Rust user forums: https://users.rust-lang.org/.
The official Rust Programming Language Discord: https://discord.gg/rust-lang
The unofficial Rust community Discord: https://bit.ly/rust-community
Also check out last week's thread with many good questions and answers. And if you believe your question to be either very complex or worthy of larger dissemination, feel free to create a text post.
Also if you want to be mentored by experienced Rustaceans, tell us the area of expertise that you seek. Finally, if you are looking for Rust jobs, the most recent thread is here.
https://redd.it/1rv3l49
@r_rust
I built a minimal process monitor in Rust with a real-time web UI (stdout / stderr)
https://redd.it/1ryp59z
@r_rust
Static-conduit: no dependency, small, type-safe, extensible data pipeline library
https://crates.io/crates/static-conduit
Part of my rust learning journey in depth. May it find its use in the community
https://redd.it/1ryvbdn
@r_rust
Why is Rust so Liberal with Heap Allocations?
Ever since learning about data oriented design and the potential cost associated with system calls and trying to apply it more to my own code I've noticed that in "idiomatic Rust" it's quite common to model data as nested trees of enums with Box, Vec & String. While this is a super intuitive way to model data it's not always the most efficient.
For a language that prides itself on performance, I rarely see libraries leverage alternative allocators such as arenas, or other advanced data-packing strategies, when I peek under the hood.
Is this just a situation of focusing on readability & correctness first and avoiding premature optimization or is there something deeper going on here? Curious what your guys' perspective is.
https://redd.it/1ryxxcg
@r_rust
What we heard about Rust's challenges, and how we can address them | Rust Blog
https://blog.rust-lang.org/2026/03/20/rust-challenges.md
https://redd.it/1rz15t3
@r_rust
Rust + HTML templates + vanilla JS for SPA-like apps – anyone doing this in production?
I've been pretty obsessed with performance for a while, especially when it comes to web apps.
On the frontend, I've been using Qwik for the past 3 years instead of React and similar frameworks. I've even used it in production for a client project, and the performance has been great.
Lately, I started questioning the efficiency of server-side rendering with JavaScript runtimes. From some experiments I ran, rendering HTML using templates in Rust (e.g. with Askama + Axum) can be dramatically faster (I've seen ~30–40x improvements) compared to SSR with modern JS frameworks.
So recently I picked up Rust and started building with Axum, and now I want to push this idea further.
I'm planning a side project (a Reddit-like social media app) with this approach:
- Backend in Rust (Axum)
- Server-rendered HTML using templates (Askama or similar)
- SPA-like UX on the frontend
- Minimal JavaScript – ideally vanilla JS with no libraries unless absolutely necessary
- Very small JS bundles for faster load times
My main questions are actually about the frontend side:
- Are any of you building apps like this (Rust backend + mostly vanilla JS frontend)?
- How do you structure the frontend as it grows without a framework?
- Do you end up building your own abstractions or lightweight framework?
- How do you handle things like state, navigation, and partial updates?
Also, from the Rust side:
- Any recommendations for this kind of architecture?
- Libraries/tools that fit well with an "HTML-over-the-wire + minimal JS" approach?
I'm trying to push performance as far as reasonably possible without making the project unmaintainable, so I'm interested in real-world tradeoffs, not just theory.
https://redd.it/1rz3u23
@r_rust
We replaced our Rust/WASM parser with TypeScript and it got 3x faster
https://www.openui.com/blog/rust-wasm-parser
https://redd.it/1rz64ug
@r_rust
einstellung - A configuration parsing and composing library
Introducing einstellung, a proc-macro based, flexible, strongly-typed configuration parser for Rust.
I built einstellung because I wanted a more ergonomic way to handle configuration in Rust applications, especially when dealing with multiple layers (defaults, files, user overrides) without losing type safety or control over how things are merged.
The goal is to keep configuration definitions simple while still allowing customized advanced behavior when needed.
einstellung works by generating an associated `Partial` configuration for your config in which every field is optional. These partial configs can then be arbitrarily loaded and merged until `.build()` is called, at which point your fully initialized config struct is produced.
- https://github.com/soruh/einstellung
- https://crates.io/crates/einstellung
- https://docs.rs/einstellung/latest/einstellung/
I'd be interested to hear how this compares to other config approaches people are using, or if there are gaps I should address.
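For illustration, here is a hand-written, std-only sketch of the layered-partial pattern the proc-macro generates: an all-optional `Partial` struct that is merged in layers (later layers win) and then built into the final config. The field names and the `merge`/`build` signatures are assumptions for the example, not einstellung's actual API:

```rust
// The fully initialized config the application uses.
#[derive(Debug, PartialEq)]
struct Config {
    host: String,
    port: u16,
}

// The generated-style partial: every field is optional so layers
// (defaults, files, user overrides) can each fill in a subset.
#[derive(Default, Clone)]
struct PartialConfig {
    host: Option<String>,
    port: Option<u16>,
}

impl PartialConfig {
    // Merge another layer on top: values in `other` override `self`.
    fn merge(mut self, other: PartialConfig) -> PartialConfig {
        if other.host.is_some() { self.host = other.host; }
        if other.port.is_some() { self.port = other.port; }
        self
    }

    // Build the final config, failing if any field is still unset.
    fn build(self) -> Result<Config, &'static str> {
        Ok(Config {
            host: self.host.ok_or("missing host")?,
            port: self.port.ok_or("missing port")?,
        })
    }
}

fn main() {
    let defaults = PartialConfig { host: Some("localhost".into()), port: Some(8080) };
    let user = PartialConfig { port: Some(9000), ..Default::default() };
    let config = defaults.merge(user).build().unwrap();
    assert_eq!(config, Config { host: "localhost".into(), port: 9000 });
}
```

The appeal of generating this boilerplate with a macro is that the merge and build logic stays in lockstep with the config struct as fields are added.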
https://redd.it/1rz7lur
@r_rust
Idiomatic Use of the `Default` Trait?
I've started working at a company where use of the `Default` trait is ubiquitous, as I understand to perhaps be standard in Rust. However, I somehow always wince at this when forced to read the code. If I see something as simple as `let x = false;`
I've immediately learned a lot about `x`: I know both its type, `bool`, and its value, `false`. In contrast, when I see `let x = Default::default();`
I learn none of that. I know that it must be of some type which can be inferred by its later use, and its value must be whatever that type's default is. Within some complicated logic, this can make code harder for me to read and understand sequentially.
I went looking for something I could cite to coworkers to say "this is well-established bad practice" and came up empty, which surprised me. I think of Rustaceans as having strong opinions on what code is proper. So for lack of a better source I've written my own maximalist diatribe version of this here, but I'm curious: if I'm truly so isolated in this belief, perhaps I'm just misguided.
Does the community in general think of `Default` as something to be encouraged or discouraged? In what scenarios is it seen as idiomatic, and how do you avoid this sort of confusion?
https://redd.it/1rz8pc6
@r_rust
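The trade-off being debated can be shown in a few lines. This is an illustrative std-only sketch (the `Settings` struct is hypothetical) comparing spellings that all produce the same value but differ in how much the reader learns at the use site:

```rust
#[derive(Debug, Default, PartialEq)]
struct Settings {
    verbose: bool, // defaults to false
    retries: u32,  // defaults to 0
}

fn main() {
    let a: Settings = Default::default();     // type only visible in the annotation
    let b = Settings::default();              // type named at the call site
    let c = <Settings as Default>::default(); // fully qualified, most explicit
    assert_eq!(a, b);
    assert_eq!(b, c);
}
```

Writing `Settings::default()` rather than a bare `Default::default()` is one common middle ground: it keeps the convenience of the trait while leaving the type readable in sequence.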
joel.place
The "Billion Dollar Mistake" Lives On In Rust - joel.place
Or, "why the Default trait is an anti-pattern"