r_rustπŸ¦€
59 subscribers
593 photos
65 videos
36.4K links
Posts top submissions from the rust subreddit every hour.

Powered by: @reddit2telegram
Chat: @r_channels
5x Faster than Rust Standard Channel (MPSC)

The techniques used to achieve this speedup involve specialized, unsafe implementations and memory-arena strategies tailored specifically for high-performance asynchronous task execution. This is not a robust, full-featured MPSC implementation, but rather an optimized channel that executes FnOnce closures. Such a channel is commonly implemented as MPSC over boxed closures, but memory allocation and thread contention had become the bottleneck.

The implementation is not a drop-in replacement for a channel: it doesn't support auto-flushing and makes many assumptions. Still, I believe it may be of use to some of you and may become a crate in the future.

Benchmarks

We performed several benchmarks to measure the performance differences between different ways of performing computation across threads, as well as our new communication layer in Burn. First, we isolated the channel implementation using random tasks. Then, we conducted benchmarks directly within Burn, measuring framework overhead by launching small tasks.

https://preview.redd.it/3d9fmws5bnog1.png?width=2048&format=png&auto=webp&s=949ecc004f58a0207c234684588860655416efba

The benchmarks reveal that a mutex remains the fastest way to perform computations with a single thread. This is expected, as it avoids data copying entirely and lacks contention when only one thread is active. When multiple threads are involved, however, it is a different story: the custom channel can be up to 10 times faster than the standard channel and roughly 2 times faster than the mutex. When measuring framework overhead with 8 threads, we can execute nearly twice as many tasks compared to using a reentrant mutex as the communication layer in Burn.

Why was a dedicated channel slower than a lock? The answer was memory allocation. Our API relies on sending closures over a channel. In standard Rust, this usually looks like Box<dyn FnOnce()>. Because these closures often exceeded 1000 bytes, we were placing massive pressure on the allocator. With multiple threads attempting to allocate and deallocate these boxes simultaneously, the contention was worse than the original mutex lock. To solve this, we moved away from the safety of standard trait objects and embraced pointer manipulation and pre-allocated memory.

Implementation Details

First, we addressed zero-allocation task enqueuing by replacing standard boxing with a tiered Double-Buffer Arena. Small closures (≤ 48 bytes) are now inlined directly into a 64-byte Task struct, aligned to CPU cache lines to prevent false sharing, while larger closures (up to 4KB) use a pre-allocated memory arena to bypass the global allocator entirely. We only fall back to a standard Box for closures larger than 4KB, which represent a negligible fraction of our workloads.

Second, we implemented lock-free double buffering to eliminate the contention typical of standard ring buffers. Using a Double-Buffering Swap strategy, producers write to a client buffer using atomic Acquire/Release semantics. When the runner thread is ready, it performs a single atomic swap to move the entire batch of tasks into a private server buffer, allowing the runner to execute tasks sequentially with zero interference from producers.

Finally, we ensured recursive safety via Thread Local Storage (TLS). To handle the recursion that originally necessitated reentrant mutexes, the runner thread now uses TLS to detect if it is attempting to submit a task to itself. If it is, the task is executed immediately and eagerly rather than being enqueued, preventing deadlocks without the heavy overhead of reentrant locking.

Conclusion

Should you implement a custom channel instead of relying on the standard library? Probably not. But can you significantly outperform general implementations when you have knowledge of the objects being transferred? Absolutely.

Full blog post: https://burn.dev/blog/faster-channel/

https://redd.it/1rrx1bx
@r_rust
I wrote a pure-Rust video codec that compiles to WASM, no FFI
https://redd.it/1rryw0h
@r_rust
What's everyone working on this week (11/2026)?

New week, new Rust! What are you folks up to? Answer here or over at rust-users!

https://redd.it/1rv3nik
@r_rust
I am too stupid to use AVX-512

Recently I have been working on writing a simple 4x4 matrix implementation in rust using SIMD intrinsics. I defined the following struct to help with this.

```
#[repr(C, align(16))]
#[derive(Copy, Clone)]
union f32x16 {
    #[cfg(target_feature = "sse")]
    pub sse: [__m128; 4],

    #[cfg(target_feature = "avx")]
    pub avx: [__m256; 2],

    #[cfg(target_feature = "avx512f")]
    pub avx512: __m512,

    pub data: [f32; 16],
}

#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Mat4(f32x16);
```

My initial implementations of matrix multiplication were slower than a standard AVX2 implementation. I brushed it off as a skill issue on my part. But when I benchmarked two different matrix-transpose implementations, I realized that something was seriously wrong.

The first AVX2 implementation

```
#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
#[inline(always)]
pub fn transpose(&self) -> Self {
    // Mat4 layout: avx[0] = [c0β‚€ c0₁ c0β‚‚ c0₃ | c1β‚€ c1₁ c1β‚‚ c1₃]
    //              avx[1] = [c2β‚€ c2₁ c2β‚‚ c2₃ | c3β‚€ c3₁ c3β‚‚ c3₃]
    // Goal rows:   res[0] = [c0β‚€ c1β‚€ c2β‚€ c3β‚€ | c0₁ c1₁ c2₁ c3₁] (Rows 0, 1)
    //              res[1] = [c0β‚‚ c1β‚‚ c2β‚‚ c3β‚‚ | c0₃ c1₃ c2₃ c3₃] (Rows 2, 3)
    use std::arch::x86_64::{_mm256_permutevar8x32_ps, _mm256_blend_ps, _mm256_set_epi32};

    unsafe {
        // Indices for Row 0/1: [idx5, idx1, idx5, idx1, idx4, idx0, idx4, idx0]
        let idx01 = _mm256_set_epi32(5, 1, 5, 1, 4, 0, 4, 0);
        let t0 = _mm256_permutevar8x32_ps(self.0.avx[0], idx01);
        let t1 = _mm256_permutevar8x32_ps(self.0.avx[1], idx01);
        let res01 = _mm256_blend_ps(t0, t1, 0b11001100);

        // Indices for Row 2/3: [idx7, idx3, idx7, idx3, idx6, idx2, idx6, idx2]
        let idx23 = _mm256_set_epi32(7, 3, 7, 3, 6, 2, 6, 2);
        let t2 = _mm256_permutevar8x32_ps(self.0.avx[0], idx23);
        let t3 = _mm256_permutevar8x32_ps(self.0.avx[1], idx23);
        let res23 = _mm256_blend_ps(t2, t3, 0b11001100);

        Mat4(f32x16 { avx: [res01, res23] })
    }
}
```

And the AVX-512 implementation

```
#[cfg(target_feature = "avx512f")]
#[inline(always)]
pub fn transpose(&self) -> Self {
    use std::arch::x86_64::{_mm512_permutexvar_ps, _mm512_set_epi32};

    unsafe {
        let indices = _mm512_set_epi32(
            15, 11, 7, 3, // Row 3 maps to Col 3
            14, 10, 6, 2, // Row 2 maps to Col 2
            13, 9, 5, 1,  // Row 1 maps to Col 1
            12, 8, 4, 0,  // Row 0 maps to Col 0
        );

        let transposed = _mm512_permutexvar_ps(indices, self.0.avx512);
        Mat4(f32x16 { avx512: transposed })
    }
}
```

Now, the AVX-512 implementation is obviously much simpler; it's only 1 instruction compared to 6. I assumed it would outperform the AVX2 implementation, but I was wrong.

```
Matrix Transpose/Mat4 Transpose
time: [4.0170 ns 4.0173 ns 4.0176 ns]
change: [+600.21% +600.31% +600.45%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
```

600% slower. So I decided to paste both implementations into Godbolt and look at the asm output.

AVX2 Implementation

```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.LCPI0_3:
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.zero 4
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rsp]
vmovaps ymm1, ymmword ptr [rsp + 32]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_0]
vpermps ymm2, ymm2, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_1]
vpermps ymm3, ymm3, ymm0
vblendps ymm2, ymm3, ymm2, 204
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm1, ymm3, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vblendps ymm0, ymm0, ymm1, 204
vmovaps ymmword ptr [rsp + 64], ymm2
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

AVX-512 Implementation

```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.long 2
.long 6
.long 10
.long 14
.long 3
.long 7
.long 11
.long 15
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps zmm0, zmmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps zmmword ptr [rsp + 64], zmm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

Both look pretty normal to me. So then I tried letting LLVM autovectorize, targeting x86-64-v3 for AVX2 and x86-64-v4 for AVX-512.

x86-64-v3
```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.LCPI0_3:
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps ymm0, ymm0, ymmword ptr [rsp + 32]
vmovaps ymm1, ymmword ptr [rip + .LCPI0_1]
vpermps ymm2, ymm1, ymmword ptr [rsp]
vblendps ymm0, ymm2, ymm0, 204
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm2, qword ptr [rsp + 56]
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm2, ymm3, ymm2
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vpermps ymm1, ymm1, ymmword ptr [rsp + 8]
vblendps ymm0, ymm1, ymm0, 204
vblendps ymm0, ymm0, ymm2, 136
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

x86-64-v4
```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.LCPI0_1:
.long 8
.long 12
.long 0
.zero 4
.long 9
.long 13
.long 1
.zero 4
.LCPI0_2:
.long 0
.long 1
.long 2
.long 8
.long 4
.long 5
.long 6
.long 9
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm1, qword ptr [rsp + 56]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_1]
vpermi2ps ymm2, ymm0, ymmword ptr [rsp + 8]
vmovaps ymm0, ymmword ptr [rip + .LCPI0_2]
vpermi2ps ymm0, ymm2, ymm1
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

Looks about the same as my implementation, but more importantly, I observed the same performance regression.

For some reason LLVM literally generates SLOWER code if you target x86-64-v4 than if you target x86-64-v3. For context, my CPU is a Ryzen 7 9800X3D and I used the "criterion" crate for benchmarking.

Can someone more qualified try to explain all of this to me? Why is `_mm512_permutexvar_ps` SO SLOW? Is it only a Zen 5 thing? I assume it's an architecture thing, considering that LLVM also generated basically the same code as me.

https://redd.it/1ryavdu
@r_rust
Hey Rustaceans! Got a question? Ask here (11/2026)!

Mystified about strings? Borrow checker has you in a headlock? Seek help here! There are no stupid questions, only docs that haven't been written yet. Please note that if you include code examples to e.g. show a compiler error or surprising result, linking a playground with the code will improve your chances of getting help quickly.

If you have a StackOverflow account, consider asking it there instead! StackOverflow shows up much higher in search results, so having your question there also helps future Rust users (be sure to give it the "Rust" tag for maximum visibility). Note that this site is very interested in question quality. I've been asked to read a RFC I authored once. If you want your code reviewed or want to review others' code, there's a codereview stackexchange, too. If you need to test your code, maybe the Rust playground is for you.

Here are some other venues where help may be found:

/r/learnrust is a subreddit to share your questions and epiphanies learning Rust programming.

The official Rust user forums: https://users.rust-lang.org/.

The official Rust Programming Language Discord: https://discord.gg/rust-lang

The unofficial Rust community Discord: https://bit.ly/rust-community

Also check out last week's thread with many good questions and answers. And if you believe your question to be either very complex or worthy of larger dissemination, feel free to create a text post.

Also if you want to be mentored by experienced Rustaceans, tell us the area of expertise that you seek. Finally, if you are looking for Rust jobs, the most recent thread is here.

https://redd.it/1rv3l49
@r_rust
I built a minimal process monitor in Rust with a real-time web UI (stdout / stderr)
https://redd.it/1ryp59z
@r_rust
Why is Rust so Liberal with Heap Allocations?

Ever since learning about data-oriented design and the potential cost of system calls, and trying to apply it more in my own code, I've noticed that in "idiomatic Rust" it's quite common to model data as nested trees of enums with Box, Vec & String. While this is a super intuitive way to model data, it's not always the most efficient.

For a language that prides itself on performance, I rarely see libraries leverage alternative allocators such as arenas or other advanced data-packing strategies when I peek under the hood.

Is this just a situation of focusing on readability & correctness first and avoiding premature optimization or is there something deeper going on here? Curious what your guys' perspective is.

https://redd.it/1ryxxcg
@r_rust
Rust + HTML templates + vanilla JS for SPA-like apps β€” anyone doing this in production?

I’ve been pretty obsessed with performance for a while, especially when it comes to web apps.
On the frontend, I’ve been using Qwik for the past 3 years instead of React and similar frameworks. I’ve even used it in production for a client project, and the performance has been great.
Lately, I started questioning the efficiency of server-side rendering with JavaScript runtimes. From some experiments I ran, rendering HTML using templates in Rust (e.g. with Askama + Axum) can be dramatically faster (I’ve seen ~30–40x improvements) compared to SSR with modern JS frameworks.
So recently I picked up Rust and started building with Axum, and now I want to push this idea further.
I’m planning a side project (a Reddit-like social media app) with this approach:
- Backend in Rust (Axum)
- Server-rendered HTML using templates (Askama or similar)
- SPA-like UX on the frontend
- Minimal JavaScript β€” ideally vanilla JS with no libraries unless absolutely necessary
- Very small JS bundles for faster load times
My main questions are actually about the frontend side:
- Are any of you building apps like this (Rust backend + mostly vanilla JS frontend)?
- How do you structure the frontend as it grows without a framework?
- Do you end up building your own abstractions or lightweight framework?
- How do you handle things like state, navigation, and partial updates?
Also, from the Rust side:
- Any recommendations for this kind of architecture?
- Libraries/tools that fit well with a β€œHTML-over-the-wire + minimal JS” approach?
I’m trying to push performance as far as reasonably possible without making the project unmaintainable, so I’m interested in real-world tradeoffs, not just theory.

https://redd.it/1rz3u23
@r_rust
einstellung - A configuration parsing and composing library

Introducing einstellung, a proc-macro-based, flexible, strongly-typed configuration parser for Rust.

I built einstellung because I wanted a more ergonomic way to handle configuration in Rust applications, especially when dealing with multiple layers (defaults, files, user overrides) without losing type safety or control over how things are merged.

The goal is to keep configuration definitions simple while still allowing customized advanced behavior when needed.

einstellung works by generating an associated `Partial` configuration for your config, in which every field is optional. These partial configs can then be arbitrarily loaded and merged until `.build()` is called, at which point your fully initialized config struct is produced.

- https://github.com/soruh/einstellung

- https://crates.io/crates/einstellung

- https://docs.rs/einstellung/latest/einstellung/

I’d be interested to hear how this compares to other config approaches people are using, or if there are gaps I should address.

https://redd.it/1rz7lur
@r_rust
Idiomatic Use of the Default Trait?

I've started working at a company where use of the Default trait is ubiquitous, which, as I understand it, is perhaps standard in Rust. However, I somehow always wince at this when forced to read the code. If I see something as simple as
let x = false;

I've immediately learned a lot about x: I know both its type, bool, and its value, false. In contrast, when I see
let x = Default::default();

I learn none of that. I know that it must be of some type which can be inferred by its later use, and its value must be whatever that type's default is. Within some complicated logic, this can make code harder for me to read and understand sequentially.

I went looking for something I could cite to coworkers to say "this is well-established bad practice" and came up empty, which surprised me. I think of Rustaceans as having strong opinions on what code is proper. For lack of a better source I've written my own maximalist diatribe version of this here, but I'm curious: if I'm truly so isolated in this belief, perhaps I'm just misguided.

Does the community in general think of Default as something to be encouraged or discouraged? In what scenarios is it seen as idiomatic, and how do you avoid this sort of confusion?

https://redd.it/1rz8pc6
@r_rust