5x Faster than Rust Standard Channel (MPSC)
The techniques used to achieve this speedup involve specialized, unsafe implementations and memory arena strategies tailored specifically for high-performance asynchronous task execution. This is not a robust, full-featured MPSC implementation, but rather an optimized channel that executes FnOnce. This is commonly implemented using MPSC over boxed closures, but memory allocation and thread contention were becoming the bottleneck.
The implementation is not a drop-in replacement for a channel: it doesn't support auto-flushing and makes many assumptions. Still, I believe it may be of use to some of you and may become a crate in the future.
Benchmarks
We performed several benchmarks to measure the performance differences between different ways of performing computation across threads, as well as our new communication layer in Burn. First, we isolated the channel implementation using random tasks. Then, we conducted benchmarks directly within Burn, measuring framework overhead by launching small tasks.
https://preview.redd.it/3d9fmws5bnog1.png?width=2048&format=png&auto=webp&s=949ecc004f58a0207c234684588860655416efba
The benchmarks reveal that a mutex remains the fastest way to perform computations with a single thread. This is expected, as it avoids data copying entirely and lacks contention when only one thread is active. When multiple threads are involved, however, it is a different story: the custom channel can be up to 10 times faster than the standard channel and roughly 2 times faster than the mutex. When measuring framework overhead with 8 threads, we can execute nearly twice as many tasks compared to using a reentrant mutex as the communication layer in Burn.
Why was a dedicated channel slower than a lock? The answer was memory allocation. Our API relies on sending closures over a channel. In standard Rust, this usually looks like Box<dyn FnOnce()>. Because these closures often exceeded 1000 bytes, we were placing massive pressure on the allocator. With multiple threads attempting to allocate and deallocate these boxes simultaneously, the contention was worse than the original mutex lock. To solve this, we moved away from the safety of standard trait objects and embraced pointer manipulation and pre-allocated memory.
Implementation Details
First, we addressed zero-allocation task enqueuing by replacing standard boxing with a tiered Double-Buffer Arena. Small closures (≤ 48 bytes) are now inlined directly into a 64-byte Task struct, aligned to CPU cache lines to prevent false sharing, while larger closures (up to 4KB) use a pre-allocated memory arena to bypass the global allocator entirely. We only fall back to a standard Box for closures larger than 4KB, which represent a negligible fraction of our workloads.
Second, we implemented lock-free double buffering to eliminate the contention typical of standard ring buffers. Using a Double-Buffering Swap strategy, producers write to a client buffer using atomic Acquire/Release semantics. When the runner thread is ready, it performs a single atomic swap to move the entire batch of tasks into a private server buffer, allowing the runner to execute tasks sequentially with zero interference from producers.
Finally, we ensured recursive safety via Thread Local Storage (TLS). To handle the recursion that originally necessitated reentrant mutexes, the runner thread now uses TLS to detect if it is attempting to submit a task to itself. If it is, the task is executed immediately and eagerly rather than being enqueued, preventing deadlocks without the heavy overhead of reentrant locking.
Conclusion
Should you implement a custom channel instead of relying on the standard library? Probably not. But can you significantly outperform general implementations when you have knowledge of the objects being transferred? Absolutely.
Full blog post: https://burn.dev/blog/faster-channel/
https://redd.it/1rrx1bx
@r_rust
Vite 8.0 is out. And it's full of 🦀 Rust
https://vite.dev/blog/announcing-vite8
https://redd.it/1rs2d4j
@r_rust
What's everyone working on this week (11/2026)?
New week, new Rust! What are you folks up to? Answer here or over at rust-users!
https://redd.it/1rv3nik
@r_rust
I am too stupid to use AVX-512
Recently I have been working on writing a simple 4x4 matrix implementation in rust using SIMD intrinsics. I defined the following struct to help with this.
```
#[repr(C, align(16))]
#[derive(Copy, Clone)]
union f32x16 {
    #[cfg(target_feature = "sse")]
    pub sse: [__m128; 4],
    #[cfg(target_feature = "avx")]
    pub avx: [__m256; 2],
    #[cfg(target_feature = "avx512f")]
    pub avx512: __m512,
    pub data: [f32; 16],
}

#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Mat4(f32x16);
```
My initial implementations of matrix multiplication were slower than a standard AVX2 implementation. I brushed it off as a skill issue on my part. But when I benchmarked two different matrix transpose implementations, I realized that something seriously wrong was going on.
The first AVX2 implementation
```
#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
#[inline(always)]
pub fn transpose(&self) -> Self {
    // Mat4 layout: avx[0] = [c0_0 c0_1 c0_2 c0_3 | c1_0 c1_1 c1_2 c1_3]
    //              avx[1] = [c2_0 c2_1 c2_2 c2_3 | c3_0 c3_1 c3_2 c3_3]
    // Goal rows:   res[0] = [c0_0 c1_0 c2_0 c3_0 | c0_1 c1_1 c2_1 c3_1] (rows 0, 1)
    //              res[1] = [c0_2 c1_2 c2_2 c3_2 | c0_3 c1_3 c2_3 c3_3] (rows 2, 3)
    use std::arch::x86_64::{_mm256_permutevar8x32_ps, _mm256_blend_ps, _mm256_set_epi32};
    unsafe {
        // Indices for rows 0/1: [idx5, idx1, idx5, idx1, idx4, idx0, idx4, idx0]
        let idx01 = _mm256_set_epi32(5, 1, 5, 1, 4, 0, 4, 0);
        let t0 = _mm256_permutevar8x32_ps(self.0.avx[0], idx01);
        let t1 = _mm256_permutevar8x32_ps(self.0.avx[1], idx01);
        let res01 = _mm256_blend_ps(t0, t1, 0b11001100);
        // Indices for rows 2/3: [idx7, idx3, idx7, idx3, idx6, idx2, idx6, idx2]
        let idx23 = _mm256_set_epi32(7, 3, 7, 3, 6, 2, 6, 2);
        let t2 = _mm256_permutevar8x32_ps(self.0.avx[0], idx23);
        let t3 = _mm256_permutevar8x32_ps(self.0.avx[1], idx23);
        let res23 = _mm256_blend_ps(t2, t3, 0b11001100);
        Mat4(f32x16 { avx: [res01, res23] })
    }
}
```
And the AVX-512 implementation
```
#[cfg(target_feature = "avx512f")]
#[inline(always)]
pub fn transpose(&self) -> Self {
    use std::arch::x86_64::{_mm512_permutexvar_ps, _mm512_set_epi32};
    unsafe {
        let indices = _mm512_set_epi32(
            15, 11, 7, 3, // Row 3 maps to Col 3
            14, 10, 6, 2, // Row 2 maps to Col 2
            13, 9, 5, 1,  // Row 1 maps to Col 1
            12, 8, 4, 0,  // Row 0 maps to Col 0
        );
        let transposed = _mm512_permutexvar_ps(indices, self.0.avx512);
        Mat4(f32x16 { avx512: transposed })
    }
}
```
Now the AVX-512 implementation is obviously much simpler: it's only 1 instruction compared to 6, and I assumed it would outperform the AVX2 implementation. But I was wrong.
```
Matrix Transpose/Mat4 Transpose
                        time:   [4.0170 ns 4.0173 ns 4.0176 ns]
                        change: [+600.21% +600.31% +600.45%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
```
600% slower. So I decided to paste both implementations into Godbolt and look at the asm output.
AVX2 Implementation
```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.LCPI0_3:
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.zero 4
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rsp]
vmovaps ymm1, ymmword ptr [rsp + 32]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_0]
vpermps ymm2, ymm2, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_1]
vpermps ymm3, ymm3, ymm0
vblendps ymm2, ymm3, ymm2, 204
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm1, ymm3, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vblendps ymm0, ymm0, ymm1, 204
vmovaps ymmword ptr [rsp + 64], ymm2
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
AVX-512 Implementation
```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.long 2
.long 6
.long 10
.long 14
.long 3
.long 7
.long 11
.long 15
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps zmm0, zmmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps zmmword ptr [rsp + 64], zmm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
Both look pretty normal to me. So then I tried letting LLVM autovectorize, targeting x86-64-v3 for AVX2 and x86-64-v4 for AVX-512.
x86-64-v3
```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.LCPI0_3:
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps ymm0, ymm0, ymmword ptr [rsp + 32]
vmovaps ymm1, ymmword ptr [rip + .LCPI0_1]
vpermps ymm2, ymm1, ymmword ptr [rsp]
vblendps ymm0, ymm2, ymm0, 204
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm2, qword ptr [rsp + 56]
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm2, ymm3, ymm2
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vpermps ymm1, ymm1, ymmword ptr [rsp + 8]
vblendps ymm0, ymm1, ymm0, 204
vblendps ymm0, ymm0, ymm2, 136
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
x86-64-v4
```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.LCPI0_1:
.long 8
.long 12
.long 0
.zero 4
.long 9
.long 13
.long 1
.zero 4
.LCPI0_2:
.long 0
.long 1
.long 2
.long 8
.long 4
.long 5
.long 6
.long 9
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm1, qword ptr [rsp + 56]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_1]
vpermi2ps ymm2, ymm0, ymmword ptr [rsp + 8]
vmovaps ymm0, ymmword ptr [rip + .LCPI0_2]
vpermi2ps ymm0, ymm2, ymm1
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```
Looks about the same as my implementation, but more importantly, I observed the same performance regression.
For some reason LLVM literally generates SLOWER code if you target x86-64-v4 than if you target x86-64-v3. For context, my CPU is a Ryzen 7 9800X3D and I used the "criterion" crate for benchmarking.
Can someone more qualified try to explain this to me? Why is `_mm512_permutexvar_ps` SO SLOW? Is it only a Zen 5 thing? I assume it's an architecture thing, considering that LLVM also generated basically the same code as mine?
https://redd.it/1ryavdu
@r_rust
Symbolic derivatives and the Rust rewrite of the RE# regex engine
https://iev.ee/blog/symbolic-derivatives-and-the-rust-rewrite-of-resharp/
https://redd.it/1rygobk
@r_rust
https://iev.ee/blog/symbolic-derivatives-and-the-rust-rewrite-of-resharp/
https://redd.it/1rygobk
@r_rust
ian erik varatalu
symbolic derivatives and the rust rewrite of RE# | ian erik varatalu
Building an LSP Server with Rust is surprisingly easy and fun
https://codeinput.com/blog/lsp-server
https://redd.it/1rxy886
@r_rust
https://codeinput.com/blog/lsp-server
https://redd.it/1rxy886
@r_rust
Code Input
Building an LSP Server with Rust is surprisingly easy and fun | Blog | Code Input
A hands-on guide to building toy LSP servers in Rust
Hey Rustaceans! Got a question? Ask here (11/2026)!
Mystified about strings? Borrow checker has you in a headlock? Seek help here! There are no stupid questions, only docs that haven't been written yet. Please note that if you include code examples to e.g. show a compiler error or surprising result, linking a playground with the code will improve your chances of getting help quickly.
If you have a StackOverflow account, consider asking it there instead! StackOverflow shows up much higher in search results, so having your question there also helps future Rust users (be sure to give it the "Rust" tag for maximum visibility). Note that this site is very interested in question quality. I've been asked to read an RFC I authored once. If you want your code reviewed or review others' code, there's a codereview stackexchange, too. If you need to test your code, maybe the Rust playground is for you.
Here are some other venues where help may be found:
/r/learnrust is a subreddit to share your questions and epiphanies learning Rust programming.
The official Rust user forums: https://users.rust-lang.org/.
The official Rust Programming Language Discord: https://discord.gg/rust-lang
The unofficial Rust community Discord: https://bit.ly/rust-community
Also check out last week's thread with many good questions and answers. And if you believe your question to be either very complex or worthy of larger dissemination, feel free to create a text post.
Also if you want to be mentored by experienced Rustaceans, tell us the area of expertise that you seek. Finally, if you are looking for Rust jobs, the most recent thread is here.
https://redd.it/1rv3l49
@r_rust
I built a minimal process monitor in Rust with a real-time web UI (stdout / stderr)
https://redd.it/1ryp59z
@r_rust
Static-conduit: no dependency, small, type-safe, extensible data pipeline library
https://crates.io/crates/static-conduit
Part of my rust learning journey in depth. May it find its use in the community
https://redd.it/1ryvbdn
@r_rust
Why is Rust so Liberal with Heap Allocations?
Ever since learning about data oriented design and the potential cost associated with system calls and trying to apply it more to my own code I've noticed that in "idiomatic Rust" it's quite common to model data as nested trees of enums with Box, Vec & String. While this is a super intuitive way to model data it's not always the most efficient.
For a language that prides itself on performance, I rarely see libraries leverage alternative allocators such as arenas, or other advanced data-packing strategies, when I peek under the hood.
Is this just a situation of focusing on readability & correctness first and avoiding premature optimization or is there something deeper going on here? Curious what your guys' perspective is.
https://redd.it/1ryxxcg
@r_rust
What we heard about Rust's challenges, and how we can address them | Rust Blog
https://blog.rust-lang.org/2026/03/20/rust-challenges.md
https://redd.it/1rz15t3
@r_rust
Rust + HTML templates + vanilla JS for SPA-like apps – anyone doing this in production?
I've been pretty obsessed with performance for a while, especially when it comes to web apps.
On the frontend, I've been using Qwik for the past 3 years instead of React and similar frameworks. I've even used it in production for a client project, and the performance has been great.
Lately, I started questioning the efficiency of server-side rendering with JavaScript runtimes. From some experiments I ran, rendering HTML using templates in Rust (e.g. with Askama + Axum) can be dramatically faster (I've seen ~30–40x improvements) compared to SSR with modern JS frameworks.
So recently I picked up Rust and started building with Axum, and now I want to push this idea further.
I'm planning a side project (a Reddit-like social media app) with this approach:
- Backend in Rust (Axum)
- Server-rendered HTML using templates (Askama or similar)
- SPA-like UX on the frontend
- Minimal JavaScript – ideally vanilla JS with no libraries unless absolutely necessary
- Very small JS bundles for faster load times
My main questions are actually about the frontend side:
- Are any of you building apps like this (Rust backend + mostly vanilla JS frontend)?
- How do you structure the frontend as it grows without a framework?
- Do you end up building your own abstractions or lightweight framework?
- How do you handle things like state, navigation, and partial updates?
Also, from the Rust side:
- Any recommendations for this kind of architecture?
- Libraries/tools that fit well with an "HTML-over-the-wire + minimal JS" approach?
I'm trying to push performance as far as reasonably possible without making the project unmaintainable, so I'm interested in real-world tradeoffs, not just theory.
https://redd.it/1rz3u23
@r_rust
We replaced our Rust/WASM parser with TypeScript and it got 3x faster
https://www.openui.com/blog/rust-wasm-parser
https://redd.it/1rz64ug
@r_rust
einstellung - A configuration parsing and composing library
Introducing einstellung, a proc-macro based, flexible, strongly-typed configuration parser for Rust.
I built einstellung because I wanted a more ergonomic way to handle configuration in Rust applications, especially when dealing with multiple layers (defaults, files, user overrides) without losing type safety or control over how things are merged.
The goal is to keep configuration definitions simple while still allowing customized advanced behavior when needed.
einstellung works by generating an associated `Partial` configuration for your config in which every field is optional. These partial configs can then be arbitrarily loaded and merged until `.build()` is called, at which point your fully initialized config struct is produced.
- https://github.com/soruh/einstellung
- https://crates.io/crates/einstellung
- https://docs.rs/einstellung/latest/einstellung/
I'd be interested to hear how this compares to other config approaches people are using, or if there are gaps I should address.
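For illustration, here is a hand-written, std-only sketch of the layered-partial pattern the proc-macro generates: an all-optional `Partial` struct that is merged in layers (later layers win) and then built into the final config. The field names and the `merge`/`build` signatures are assumptions for the example, not einstellung's actual API:

```rust
// The fully initialized config the application uses.
#[derive(Debug, PartialEq)]
struct Config {
    host: String,
    port: u16,
}

// The generated-style partial: every field is optional so layers
// (defaults, files, user overrides) can each fill in a subset.
#[derive(Default, Clone)]
struct PartialConfig {
    host: Option<String>,
    port: Option<u16>,
}

impl PartialConfig {
    // Merge another layer on top: values in `other` override `self`.
    fn merge(mut self, other: PartialConfig) -> PartialConfig {
        if other.host.is_some() { self.host = other.host; }
        if other.port.is_some() { self.port = other.port; }
        self
    }

    // Build the final config, failing if any field is still unset.
    fn build(self) -> Result<Config, &'static str> {
        Ok(Config {
            host: self.host.ok_or("missing host")?,
            port: self.port.ok_or("missing port")?,
        })
    }
}

fn main() {
    let defaults = PartialConfig { host: Some("localhost".into()), port: Some(8080) };
    let user = PartialConfig { port: Some(9000), ..Default::default() };
    let config = defaults.merge(user).build().unwrap();
    assert_eq!(config, Config { host: "localhost".into(), port: 9000 });
}
```

The appeal of generating this boilerplate with a macro is that the merge and build logic stays in lockstep with the config struct as fields are added.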
https://redd.it/1rz7lur
@r_rust
Idiomatic Use of the `Default` Trait?
I've started working at a company where use of the `Default` trait is ubiquitous, as I understand to perhaps be standard in Rust. However, I somehow always wince at this when forced to read the code. If I see something as simple as `let x = false;`
I've immediately learned a lot about `x`: I know both its type, `bool`, and its value, `false`. In contrast, when I see `let x = Default::default();`
I learn none of that. I know that it must be of some type which can be inferred by its later use, and its value must be whatever that type's default is. Within some complicated logic, this can make code harder for me to read and understand sequentially.
I went looking for something I could cite to coworkers to say "this is well-established bad practice" and came up empty, which surprised me. I think of Rustaceans as having strong opinions on what code is proper. So for lack of a better source I've written my own maximalist diatribe version of this here, but I'm curious: if I'm truly so isolated in this belief, perhaps I'm just misguided.
Does the community in general think of `Default` as something to be encouraged or discouraged? In what scenarios is it seen as idiomatic, and how do you avoid this sort of confusion?
https://redd.it/1rz8pc6
@r_rust
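The trade-off being debated can be shown in a few lines. This is an illustrative std-only sketch (the `Settings` struct is hypothetical) comparing spellings that all produce the same value but differ in how much the reader learns at the use site:

```rust
#[derive(Debug, Default, PartialEq)]
struct Settings {
    verbose: bool, // defaults to false
    retries: u32,  // defaults to 0
}

fn main() {
    let a: Settings = Default::default();     // type only visible in the annotation
    let b = Settings::default();              // type named at the call site
    let c = <Settings as Default>::default(); // fully qualified, most explicit
    assert_eq!(a, b);
    assert_eq!(b, c);
}
```

Writing `Settings::default()` rather than a bare `Default::default()` is one common middle ground: it keeps the convenience of the trait while leaving the type readable in sequence.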
joel.place
The "Billion Dollar Mistake" Lives On In Rust - joel.place
Or, "why the Default trait is an anti-pattern"