r_rust🦀
Posts top submissions from the rust subreddit every hour.

Powered by: @reddit2telegram
Chat: @r_channels
I got tired of refreshing Claude usage so I built physical analog gauges with Rust firmware instead
https://redd.it/1rrf2y9
@r_rust
RaTeX: A cross-platform LaTeX math rendering engine written in pure Rust

https://preview.redd.it/zfm3esmpqkog1.png?width=783&format=png&auto=webp&s=6388de4d1b62bdddbdefc8f39598bb2d24f2d515

I've been working on RaTeX, a LaTeX math rendering engine written in pure Rust.

# The problem

Rendering LaTeX math in cross-platform apps is surprisingly fragmented. The common solutions all have the same weakness:

* Web / cross-platform frameworks (Flutter, React Native) → WebView + KaTeX/MathJax, which adds tens of MB of memory overhead and JS startup latency
* Native-only renderers (SwiftMath on iOS, JLatexMath on Android) → each platform needs its own implementation, no shared code

With AI tools and education apps generating large amounts of math content, a single rendering core that works everywhere felt like a missing piece.

# What RaTeX does

RaTeX parses LaTeX and produces a renderer-agnostic display list — a flat sequence of drawing commands (glyphs, lines, rectangles, bezier paths) with absolute coordinates. Everything platform-specific lives at the edges.

The pipeline:

tokenization → parsing → layout → display list

The display list is then consumed by:

* `ratex-wasm` — compiles to WASM, returns the display list as JSON for Canvas 2D rendering in the browser ✅ working
* `ratex-render` — rasterizes to PNG via tiny-skia, no browser needed ✅ working
* `ratex-ffi` — C ABI for iOS (Swift/ObjC), Android (JNI), Flutter (Dart FFI), React Native 🚧 in progress
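To make the display-list idea concrete, here is a rough Rust sketch of what such a flat command list could look like. The type and field names are illustrative only, not RaTeX's actual API:

```rust
// Hypothetical display-list commands (illustrative, not RaTeX's real types).
// Coordinates are absolute, so a backend needs no layout knowledge at all.
#[derive(Debug, Clone, PartialEq)]
enum DrawCmd {
    Glyph { ch: char, x: f32, y: f32, size: f32 },
    Line { x1: f32, y1: f32, x2: f32, y2: f32, width: f32 },
    Rect { x: f32, y: f32, w: f32, h: f32 },
}

// A backend is just a loop that interprets each primitive.
fn describe(list: &[DrawCmd]) -> Vec<String> {
    list.iter()
        .map(|cmd| match cmd {
            DrawCmd::Glyph { ch, x, y, .. } => format!("glyph '{ch}' at ({x}, {y})"),
            DrawCmd::Line { .. } => "line".to_string(),
            DrawCmd::Rect { .. } => "rect".to_string(),
        })
        .collect()
}

fn main() {
    // Something like a fraction: numerator glyph, fraction bar, denominator glyph.
    let list = vec![
        DrawCmd::Glyph { ch: 'a', x: 2.0, y: 4.0, size: 10.0 },
        DrawCmd::Line { x1: 0.0, y1: 12.0, x2: 10.0, y2: 12.0, width: 0.6 },
        DrawCmd::Glyph { ch: 'b', x: 2.0, y: 20.0, size: 10.0 },
    ];
    assert_eq!(describe(&list).len(), 3);
    println!("{:?}", describe(&list));
}
```

Serializing a structure like this to JSON (for the WASM path) or walking it with tiny-skia (for the PNG path) keeps every backend a thin interpreter over the same data.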

# Code

Server-side PNG rendering today looks like this:

```
use ratex_parser::parser::parse;
use ratex_layout::{layout, to_display_list, LayoutOptions};
use ratex_render::{render_to_png, RenderOptions};

let ast = parse(r"\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}")?;
let layout_box = layout(&ast, &LayoutOptions::default());
let display_list = to_display_list(&layout_box);
let png = render_to_png(&display_list, &RenderOptions::default())?;

std::fs::write("formula.png", png)?;
```

# Current status

* ~99% of KaTeX formula syntax parses correctly
* ~80% visual similarity vs KaTeX reference renders (tested against 916 formulas, pixel-level comparison)
* The remaining gaps are mostly edge cases: accents, extensible arrows, some spacing rules

I built a live support table that renders all 916 formulas with both engines side-by-side, so it's easy to see exactly where RaTeX matches and where it falls short.

# Why this matters

As far as I know, there's no open-source cross-platform LaTeX renderer that runs natively without a browser. RaTeX is an attempt to fill that gap — one Rust core, every platform.

It's still early, but the foundation is there. If you've been building an app that needs math rendering and didn't want to ship a WebView, this might be useful to you.

Stars and issues welcome — especially if you hit a formula that doesn't render right.

GitHub: [https://github.com/erweixin/RaTeX](https://github.com/erweixin/RaTeX)

Demo: [https://erweixin.github.io/RaTeX/demo/index.html](https://erweixin.github.io/RaTeX/demo/index.html)

Support table: [https://erweixin.github.io/RaTeX/demo/support_table.html](https://erweixin.github.io/RaTeX/demo/support_table.html)

https://redd.it/1rrl9ka
@r_rust
Job-focused list of product companies using Rust in production — 2026 (ReadyToTouch)

Hi everyone! I've been manually maintaining a list of companies that hire and use Rust in production for over a year now, updating it weekly. Writing this up again for both new and returning readers.

**Why I built this**

I started the project against a backdrop of layoff news and posts about how hard job searching has become. I wanted to do something now — while I still have time — to make my future job search easier. So I started building a list of companies hiring Go engineers and connecting with people at companies I'd want to work at, where I'd be a strong candidate based on my expertise. I added Rust later, because I've been learning it and considering it for my own career going forward.

**The list:** [https://readytotouch.com/rust/companies](https://readytotouch.com/rust/companies) — sorted by most recent job openings. Product companies and startups only — no outsourcing, outstaffing, or recruiting agencies. Nearly 300 companies in the Rust list; for comparison, the Go list has 900+.

**The core idea**

The point isn't to chase open positions — it's to build your career deliberately over time.

If you have experience in certain industries and with certain cloud providers, the list has filters for exactly that: industry (MedTech, FinTech, PropTech, etc.) and cloud provider (AWS, GCP, Azure). You can immediately target companies where you'd be a strong candidate — even if they have no open roles right now. Then you can add their current employees on LinkedIn with a message like: *"Hi, I have experience with Rust and SomeTech, so I'm keeping Example Company on my radar for future opportunities."*

Each company profile on ReadyToTouch includes a link to current employees on LinkedIn. Browsing those profiles is useful beyond just making connections — you start noticing patterns in where people came from. If a certain company keeps appearing in employees' backgrounds, it might be a natural stepping stone to get there.

The same logic applies to former employees — there's a dedicated link for that in each profile too. Patterns in where people go next can help you understand which direction to move in. And former employees are worth connecting with early — they can give you honest insight into the company before you apply.

One more useful link in each profile: a search for employee posts on LinkedIn. This helps you find people who are active there and easier to reach.

If you're ever choosing between two offers, knowing where employees tend to go next can simplify the decision. And if the offers are from different industries, you can check ReadyToTouch to see which industry has more companies you'd actually want to work at — a small but useful data point for long-term career direction.

**What's in each company profile**

1. **Careers page** — direct applications are reportedly more effective for some candidates than applying through LinkedIn
2. **Glassdoor** — reviews and salaries; there's also a Glassdoor rating filter in both the company list and jobs list on ReadyToTouch
3. **Indeed / Blind** — more reviews
4. [**Levels.fyi**](https://Levels.fyi/) — another salary reference
5. **GitHub** — see what Rust projects the company is actually working on
6. **Layoffs** — quick Google searches for recent layoff news by company

Not every profile is 100% complete — some companies simply don't publish everything, and I can't always fill in the gaps manually. There's a "Google it" button on every profile for exactly that reason.

**Alternatives**

If ReadyToTouch doesn't fit your workflow, here are other resources worth knowing:

1. [https://filtra.io/](https://filtra.io/)
2. [https://rustyboard.com/](https://rustyboard.com/)
3. [https://jobs.letsgetrusty.com/](https://jobs.letsgetrusty.com/)
4. [https://rustjobs.dev/](https://rustjobs.dev/)
5. [https://rust.careers/](https://rust.careers/)
6. [https://wellfound.com/role/rust-developer](https://wellfound.com/role/rust-developer)
7. LinkedIn search: ["Rust" AND "Engineer"](https://www.linkedin.com/jobs/search/?f_TPR=r2592000&geoId=92000000&keywords=%22Rust%22%20AND%20%22Engineer%22&location=Worldwide&sortBy=DD)
8. LinkedIn search: ["Rust" AND "Developer"](https://www.linkedin.com/jobs/search/?f_TPR=r2592000&geoId=92000000&keywords=%22Rust%22%20AND%20%22Developer%22&location=Worldwide&sortBy=DD)
9. [https://github.com/omarabid/rust-companies](https://github.com/omarabid/rust-companies)
10. [https://github.com/ImplFerris/rust-in-production](https://github.com/ImplFerris/rust-in-production)


**One more tool**

If building a personal list of target companies and tracking connections is a strategy that works for you — the way it does for me — there's a separate tool for that: [https://readytotouch.com/companies-and-connections](https://readytotouch.com/companies-and-connections)

**What's new**

* Mobile-friendly (fixed after earlier feedback — happy to show before/after in comments)
* 1,500+ GitHub stars, ~7,000 visitors/month
* Open source, built with a small team

**What's next**

Continuing weekly updates to companies and job openings across all languages.

The project runs at $0 revenue. If your company is actively hiring Rust engineers, there's a paid option to feature it at the top of the list for a month — reach out if interested.

**Links**

* Companies: [https://readytotouch.com/rust/companies](https://readytotouch.com/rust/companies)
* Jobs: [https://readytotouch.com/rust/jobs](https://readytotouch.com/rust/jobs)
* Repository: [https://github.com/readytotouch/readytotouch](https://github.com/readytotouch/readytotouch)

*My native language is Ukrainian. I think and write in it, then translate with Claude's help and review the result — so please keep that in mind.*

Happy to answer questions! And I'd love to hear in the comments if the list has helped anyone find a job — or even just changed how they think about job searching.

https://redd.it/1rrnxfh
@r_rust
Need resources for building a Debugger

Hi everyone,

I am Abinash. I am interested in learning how a debugger works by building one of my own in Rust.

So, I am looking for some resources (Docs, Blog Posts, Videos, Repo) to understand and build a debugger with UI.

My Skills:

- Rust - Intermediate (Actively Learning)
- OS - Basic (Actively Learning)

Setup:

- Windows 11 (AMD Ryzen 5 7530U with Radeon Graphics (2.00 GHz, x64-based processor))
- Programming on WSL (Ubuntu)

Some resources I found:

- https://www.timdbg.com/posts/writing-a-debugger-from-scratch-part-1/
- https://www.dgtlgrove.com/t/demystifying-debuggers

Thank you.

https://redd.it/1rrmvgs
@r_rust
Is there a language similar to Rust but with a garbage collector?

Hi everyone,

I'm learning Rust and I really like its performance and safety model. I know Rust doesn't use a garbage collector and instead relies on ownership and borrowing.

I'm curious: are there programming languages that are similar to Rust but use a garbage collector instead?

I'd like to compare the approaches and understand the trade-offs.

Thanks!

https://redd.it/1rrsj4k
@r_rust
I built a real-time code quality grader in Rust — treemap visualization + 14 health metrics via tree-sitter

I built sentrux — a real-time code structure visualizer and quality grader.



What it does:

- Scans any codebase, renders a live interactive treemap (egui/wgpu)
- 14 quality dimensions graded A-F (coupling, cycles, cohesion, dead code, etc.)
- Dependency edges (import, call, inheritance) as animated polylines
- File watcher — files glow when modified, incremental rescan
- MCP server for AI agent integration
- 23 languages via tree-sitter



Tech stack:

- Pure Rust, single binary, no runtime dependencies
- egui + wgpu for rendering
- tree-sitter for parsing (23 languages)
- tokei for line counting
- notify for filesystem watching
- Squarified treemap layout + spatial index for O(1) hit testing



GitHub: https://github.com/sentrux/sentrux



MIT licensed. Would love feedback on the architecture or the Rust patterns used. Happy to answer any questions.

https://redd.it/1rrs7jp
@r_rust
5x Faster than Rust Standard Channel (MPSC)

The techniques used to achieve this speedup involve specialized, unsafe implementations and memory arena strategies tailored specifically for high-performance asynchronous task execution. This is not a robust, full-featured MPSC implementation, but rather an optimized channel that executes FnOnce. This is commonly implemented using MPSC over boxed closures, but memory allocation and thread contention were becoming the bottleneck.

The implementation is not a drop-in replacement for a channel, it doesn't support auto-flushing and has many assumptions, but I believe this may be of use for some of you and may become a crate in the future.

Benchmarks

We performed several benchmarks to measure the performance differences between different ways of performing computation across threads, as well as our new communication layer in Burn. First, we isolated the channel implementation using random tasks. Then, we conducted benchmarks directly within Burn, measuring framework overhead by launching small tasks.

https://preview.redd.it/3d9fmws5bnog1.png?width=2048&format=png&auto=webp&s=949ecc004f58a0207c234684588860655416efba

The benchmarks reveal that a mutex remains the fastest way to perform computations with a single thread. This is expected, as it avoids data copying entirely and lacks contention when only one thread is active. When multiple threads are involved, however, it is a different story: the custom channel can be up to 10 times faster than the standard channel and roughly 2 times faster than the mutex. When measuring framework overhead with 8 threads, we can execute nearly twice as many tasks compared to using a reentrant mutex as the communication layer in Burn.

Why was a dedicated channel slower than a lock? The answer was memory allocation. Our API relies on sending closures over a channel. In standard Rust, this usually looks like Box<dyn FnOnce()>. Because these closures often exceeded 1000 bytes, we were placing massive pressure on the allocator. With multiple threads attempting to allocate and deallocate these boxes simultaneously, the contention was worse than the original mutex lock. To solve this, we moved away from the safety of standard trait objects and embraced pointer manipulation and pre-allocated memory.
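To see the scale of the problem described above, here is a quick standalone experiment (not Burn's code): the heap allocation behind a `Box<dyn FnOnce()>` is exactly the size of the closure's captured state, so large captures mean large per-task allocations.

```rust
use std::mem::size_of_val;

// Not Burn's code: a quick check of how large a boxed closure's allocation is.
// The captured state is what gets heap-allocated, so a closure that moves a
// 1 KiB buffer costs roughly a 1 KiB allocation per task once boxed.
fn main() {
    let big = [0u8; 1024];
    let closure = move || big.len();

    // Size of the closure value itself, before boxing:
    assert_eq!(size_of_val(&closure), 1024);

    // Boxing moves those 1024 bytes to the heap — one allocation per task.
    let boxed: Box<dyn FnOnce() -> usize> = Box::new(closure);
    assert_eq!(boxed(), 1024);
}
```

With many threads creating and dropping such boxes, the global allocator itself becomes the contention point, which matches the observation that the mutex version was faster.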

Implementation Details

First, we addressed zero-allocation task enqueuing by replacing standard boxing with a tiered Double-Buffer Arena. Small closures (≤ 48 bytes) are now inlined directly into a 64-byte Task struct, aligned to CPU cache lines to prevent false sharing, while larger closures (up to 4KB) use a pre-allocated memory arena to bypass the global allocator entirely. We only fall back to a standard Box for closures larger than 4KB, which represent a negligible fraction of our workloads.
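The layout trick can be sketched in a few lines. The field names here are hypothetical, not Burn's actual struct: the point is a 64-byte, cache-line-aligned slot with a small header and inline storage for small closures.

```rust
use std::mem::{align_of, size_of};

// Hypothetical layout, not Burn's actual struct: a 16-byte header (think an
// erased call pointer plus a tag saying inline/arena/boxed) and 48 bytes of
// inline storage. `align(64)` pins each task slot to its own cache line so
// neighboring slots written by different producers cannot false-share.
#[repr(C, align(64))]
struct Task {
    header: [usize; 2], // stand-in for function pointer + storage tag
    inline: [u8; 48],   // closures <= 48 bytes are written here directly
}

fn main() {
    assert_eq!(size_of::<Task>(), 64);  // exactly one cache line
    assert_eq!(align_of::<Task>(), 64); // a slot never straddles two lines
}
```

Writing the closure bytes into `inline` (and calling them later) requires unsafe pointer code in the real implementation; the sketch only shows the size/alignment contract that makes it work.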

Second, we implemented lock-free double buffering to eliminate the contention typical of standard ring buffers. Using a Double-Buffering Swap strategy, producers write to a client buffer using atomic Acquire/Release semantics. When the runner thread is ready, it performs a single atomic swap to move the entire batch of tasks into a private server buffer, allowing the runner to execute tasks sequentially with zero interference from producers.
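The swap idea itself is simple to illustrate. This sketch uses a `Mutex` for brevity where the real implementation is lock-free with Acquire/Release atomics, and the type names are mine, not Burn's:

```rust
use std::mem;
use std::sync::Mutex;

// Simplified double buffer: producers append to `client`; the runner swaps
// the two buffers in one step and then drains `server` privately. The real
// implementation replaces this Mutex with atomic Acquire/Release writes and
// a single atomic swap, but the batching structure is the same.
struct DoubleBuffer<T> {
    client: Mutex<Vec<T>>, // producers append here
    server: Vec<T>,        // runner-private batch
}

impl<T> DoubleBuffer<T> {
    fn new() -> Self {
        Self { client: Mutex::new(Vec::new()), server: Vec::new() }
    }

    fn push(&self, task: T) {
        self.client.lock().unwrap().push(task);
    }

    // One swap moves the entire pending batch; the runner then iterates
    // `server` with zero interference from producers.
    fn swap_and_drain(&mut self) -> std::vec::Drain<'_, T> {
        mem::swap(&mut *self.client.lock().unwrap(), &mut self.server);
        self.server.drain(..)
    }
}

fn main() {
    let mut db = DoubleBuffer::new();
    db.push(1);
    db.push(2);
    let batch: Vec<i32> = db.swap_and_drain().collect();
    assert_eq!(batch, vec![1, 2]);
}
```

Compared with a ring buffer, producers and the consumer never contend on the same slots: they only meet at the swap.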

Finally, we ensured recursive safety via Thread Local Storage (TLS). To handle the recursion that originally necessitated reentrant mutexes, the runner thread now uses TLS to detect if it is attempting to submit a task to itself. If it is, the task is executed immediately and eagerly rather than being enqueued, preventing deadlocks without the heavy overhead of reentrant locking.
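The TLS check fits in a few lines. Again the names (`submit`, `ON_RUNNER`) are illustrative, not Burn's API:

```rust
use std::cell::Cell;

thread_local! {
    // True while this thread is the runner, currently executing tasks.
    static ON_RUNNER: Cell<bool> = Cell::new(false);
}

// Illustrative names, not Burn's API: a submit() issued from the runner
// thread itself is executed eagerly instead of enqueued, which is what
// prevents the runner from deadlocking on its own channel.
fn submit(task: impl FnOnce()) {
    if ON_RUNNER.with(|f| f.get()) {
        task(); // re-entrant submission: run immediately
    } else {
        // In the real system the task is enqueued and the runner picks it
        // up; for this sketch we just run it the way the runner would.
        run_as_runner(task);
    }
}

fn run_as_runner(task: impl FnOnce()) {
    ON_RUNNER.with(|f| f.set(true));
    task();
    ON_RUNNER.with(|f| f.set(false));
}

fn main() {
    use std::sync::atomic::{AtomicUsize, Ordering};
    static RAN: AtomicUsize = AtomicUsize::new(0);
    // A task that submits another task from inside the runner cannot deadlock:
    submit(|| {
        RAN.fetch_add(1, Ordering::SeqCst);
        submit(|| { RAN.fetch_add(1, Ordering::SeqCst); });
    });
    assert_eq!(RAN.load(Ordering::SeqCst), 2);
}
```

A plain `Cell<bool>` suffices because the flag is per-thread by construction; no atomics or reentrant lock bookkeeping are needed on this path.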

Conclusion

Should you implement a custom channel instead of relying on the standard library? Probably not. But can you significantly outperform general implementations when you have knowledge of the objects being transferred? Absolutely.

Full blog post: https://burn.dev/blog/faster-channel/

https://redd.it/1rrx1bx
@r_rust
I wrote a pure-Rust video codec that compiles to WASM, no FFI
https://redd.it/1rryw0h
@r_rust
What's everyone working on this week (11/2026)?

New week, new Rust! What are you folks up to? Answer here or over at rust-users!

https://redd.it/1rv3nik
@r_rust
I am too stupid to use AVX-512

Recently I have been working on writing a simple 4x4 matrix implementation in Rust using SIMD intrinsics. I defined the following struct to help with this.

```
#[repr(C, align(16))]
#[derive(Copy, Clone)]
union f32x16 {
    #[cfg(target_feature = "sse")]
    pub sse: [__m128; 4],

    #[cfg(target_feature = "avx")]
    pub avx: [__m256; 2],

    #[cfg(target_feature = "avx512f")]
    pub avx512: __m512,

    pub data: [f32; 16],
}

#[repr(transparent)]
#[derive(Copy, Clone)]
pub struct Mat4(f32x16);
```

My initial implementations of matrix multiplication were slower than a standard AVX2 implementation. I brushed it off as a skill issue on my part. But when I tried benchmarking two different matrix transpose implementations, I realized that something seriously wrong was going on.

The first AVX2 implementation

```
#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
#[inline(always)]
pub fn transpose(&self) -> Self {
    // Mat4 layout: avx[0] = [c0₀ c0₁ c0₂ c0₃ | c1₀ c1₁ c1₂ c1₃]
    //              avx[1] = [c2₀ c2₁ c2₂ c2₃ | c3₀ c3₁ c3₂ c3₃]
    // Goal rows:   res[0] = [c0₀ c1₀ c2₀ c3₀ | c0₁ c1₁ c2₁ c3₁] (Rows 0, 1)
    //              res[1] = [c0₂ c1₂ c2₂ c3₂ | c0₃ c1₃ c2₃ c3₃] (Rows 2, 3)
    use std::arch::x86_64::{_mm256_permutevar8x32_ps, _mm256_blend_ps, _mm256_set_epi32};

    unsafe {
        // Indices for Row 0/1: [idx5, idx1, idx5, idx1, idx4, idx0, idx4, idx0]
        let idx01 = _mm256_set_epi32(5, 1, 5, 1, 4, 0, 4, 0);
        let t0 = _mm256_permutevar8x32_ps(self.0.avx[0], idx01);
        let t1 = _mm256_permutevar8x32_ps(self.0.avx[1], idx01);
        let res01 = _mm256_blend_ps(t0, t1, 0b11001100);

        // Indices for Row 2/3: [idx7, idx3, idx7, idx3, idx6, idx2, idx6, idx2]
        let idx23 = _mm256_set_epi32(7, 3, 7, 3, 6, 2, 6, 2);
        let t2 = _mm256_permutevar8x32_ps(self.0.avx[0], idx23);
        let t3 = _mm256_permutevar8x32_ps(self.0.avx[1], idx23);
        let res23 = _mm256_blend_ps(t2, t3, 0b11001100);

        Mat4(f32x16 { avx: [res01, res23] })
    }
}
```

And the AVX-512 implementation

```
#[cfg(target_feature = "avx512f")]
#[inline(always)]
pub fn transpose(&self) -> Self {
    use std::arch::x86_64::{_mm512_permutexvar_ps, _mm512_set_epi32};

    unsafe {
        let indices = _mm512_set_epi32(
            15, 11, 7, 3, // Row 3 maps to Col 3
            14, 10, 6, 2, // Row 2 maps to Col 2
            13,  9, 5, 1, // Row 1 maps to Col 1
            12,  8, 4, 0, // Row 0 maps to Col 0
        );

        let transposed = _mm512_permutexvar_ps(indices, self.0.avx512);
        Mat4(f32x16 { avx512: transposed })
    }
}
```

Now the AVX-512 implementation is obviously much simpler (only one instruction compared to six), and I assumed it would outperform the AVX2 implementation, but I was wrong.

```
Matrix Transpose/Mat4 Transpose
                        time:   [4.0170 ns 4.0173 ns 4.0176 ns]
                        change: [+600.21% +600.31% +600.45%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
```

600% slower. So I decided to paste both implementations into Godbolt and look at the asm output.

AVX2 Implementation

```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.LCPI0_3:
.long 2
.long 6
.zero 4
.zero 4
.long 3
.long 7
.zero 4
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rsp]
vmovaps ymm1, ymmword ptr [rsp + 32]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_0]
vpermps ymm2, ymm2, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_1]
vpermps ymm3, ymm3, ymm0
vblendps ymm2, ymm3, ymm2, 204
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm1, ymm3, ymm1
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vblendps ymm0, ymm0, ymm1, 204
vmovaps ymmword ptr [rsp + 64], ymm2
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

AVX-512 Implementation

```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.long 2
.long 6
.long 10
.long 14
.long 3
.long 7
.long 11
.long 15
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps zmm0, zmmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps zmmword ptr [rsp + 64], zmm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

Both look pretty normal to me. So then I tried letting LLVM autovectorize, targeting x86-64-v3 for AVX2 and x86-64-v4 for AVX-512.

x86-64-v3
```
.LCPI0_0:
.zero 4
.zero 4
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.LCPI0_1:
.long 0
.long 4
.zero 4
.zero 4
.long 1
.long 5
.zero 4
.zero 4
.LCPI0_2:
.zero 4
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.LCPI0_3:
.zero 4
.zero 4
.long 0
.zero 4
.zero 4
.zero 4
.long 1
.zero 4
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 160
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps ymm0, ymm0, ymmword ptr [rsp + 32]
vmovaps ymm1, ymmword ptr [rip + .LCPI0_1]
vpermps ymm2, ymm1, ymmword ptr [rsp]
vblendps ymm0, ymm2, ymm0, 204
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm2, qword ptr [rsp + 56]
vmovaps ymm3, ymmword ptr [rip + .LCPI0_2]
vpermps ymm2, ymm3, ymm2
vmovaps ymm3, ymmword ptr [rip + .LCPI0_3]
vpermps ymm0, ymm3, ymm0
vpermps ymm1, ymm1, ymmword ptr [rsp + 8]
vblendps ymm0, ymm1, ymm0, 204
vblendps ymm0, ymm0, ymm2, 136
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

x86-64-v4
```
.LCPI0_0:
.long 0
.long 4
.long 8
.long 12
.long 1
.long 5
.long 9
.long 13
.LCPI0_1:
.long 8
.long 12
.long 0
.zero 4
.long 9
.long 13
.long 1
.zero 4
.LCPI0_2:
.long 0
.long 1
.long 2
.long 8
.long 4
.long 5
.long 6
.long 9
example::main::hb98fc185a6f3c541:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 192
vxorps xmm0, xmm0, xmm0
vmovaps ymmword ptr [rsp + 32], ymm0
vmovaps ymmword ptr [rsp], ymm0
mov rax, rsp
vmovaps ymm0, ymmword ptr [rip + .LCPI0_0]
vpermps zmm0, zmm0, zmmword ptr [rsp]
vmovaps ymmword ptr [rsp + 64], ymm0
vmovups xmm0, xmmword ptr [rsp + 40]
vmovsd xmm1, qword ptr [rsp + 56]
vmovaps ymm2, ymmword ptr [rip + .LCPI0_1]
vpermi2ps ymm2, ymm0, ymmword ptr [rsp + 8]
vmovaps ymm0, ymmword ptr [rip + .LCPI0_2]
vpermi2ps ymm0, ymm2, ymm1
vmovaps ymmword ptr [rsp + 96], ymm0
lea rax, [rsp + 64]
mov rsp, rbp
pop rbp
vzeroupper
ret
```

Looks about the same as my implementation, but more importantly, I observed the same performance regression.

For some reason LLVM literally generates SLOWER code if you target x86-64-v4 than if you target x86-64-v3. For context, my CPU is a Ryzen 7 9800X3D and I used the crate "criterion" for benchmarking.

Can someone more qualified try to explain this to me? Why is `_mm512_permutexvar_ps` SO SLOW? Is it only a Zen 5 thing? I assume it's an architecture thing, considering that LLVM also generated basically the same code as I did.

https://redd.it/1ryavdu
@r_rust