the last neural cell
we write about BCI, AI and brain research.

authors:
@kovalev_alvi - visual neural interfaces - UMH, Spain | CEO of ALVI Labs
@Altime - comp neuro phd @ GTC Tübingen

Our chat: @neural_cell_chat
Forwarded from LIFT feed
Fresh from Precision Neuroscience: they have again tested their thin-film, ultra-high-density microelectrode ECoG arrays in humans. Each array is the size of a postage stamp and contains 1024 electrodes. Through a narrow slit (900 µm) in the skull, up to four films were inserted per patient, covering roughly 8 cm² of cortex with more than 4,000 electrodes. Compared with their previous result, they have added neuromodulation. To date, Precision has implanted its device in more than 50 patients and has received FDA clearance for implantation.

#tech | #readout | #modulation | #brain
tasty next-token diffusion papers

Autoregressive Image Generation without Vector Quantization (MAR)
tl;dr: propose a diffusion head that models each token's continuous distribution instead of using cross-entropy (so no VQ-VAE needed)
- bidirectional attention (MAE-style) + random generation order lets the model see full context, unlike causal AR - generates 64 tokens at once
- diffusion head is tiny (2M params works as well as 45M) - proves the transformer backbone learned everything; the head just samples
link: https://arxiv.org/abs/2406.11838
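
A minimal sketch of the idea: the per-token loss is a denoising MSE on continuous tokens, conditioned on the transformer's hidden state, instead of cross-entropy over discrete codes. All names, shapes, and the linear noising schedule here are illustrative assumptions, not MAR's exact implementation.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Tiny MLP that predicts the noise added to a continuous token,
    conditioned on the backbone's hidden state z and a timestep t."""
    def __init__(self, token_dim=16, z_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(token_dim + z_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, z, t):
        # timestep enters as one extra scalar feature per token
        return self.mlp(torch.cat([x_t, z, t], dim=-1))

def diffusion_loss(head, x0, z):
    """Denoising MSE in place of cross-entropy (sketch)."""
    t = torch.rand(x0.shape[0], 1)    # random timestep in [0, 1]
    eps = torch.randn_like(x0)        # Gaussian noise
    x_t = (1 - t) * x0 + t * eps      # linear noising path (assumption)
    return ((head(x_t, z, t) - eps) ** 2).mean()
```

Because the loss only touches the head, the backbone can be trained exactly like any other transformer; this is why shrinking the head from 45M to 2M params costs nothing.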

Multimodal Latent Language Modeling with Next-Token Diffusion (LatentLM)
tl;dr: extend MAR's approach to multimodal LLMs, unifying text, image, and audio generation in a single framework
- make the VAE more stable: σ-VAE, where the encoder predicts only the mean and noise is added with a fixed σ, which fixes variance collapse
- use the same diffusion loss as in MAR paper
link: https://arxiv.org/abs/2412.08635
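
The σ-VAE trick in a few lines: since the encoder never predicts a variance, the variance cannot collapse to zero during training. A sketch with illustrative names and shapes, not LatentLM's actual module:

```python
import torch
import torch.nn as nn

class SigmaVAEEncoder(nn.Module):
    """Encoder predicts only the mean; noise is added with a fixed,
    non-learned sigma (the σ-VAE idea, sketched)."""
    def __init__(self, in_dim=784, latent_dim=16, sigma=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )
        self.sigma = sigma  # fixed constant, not an encoder output

    def forward(self, x):
        mu = self.net(x)
        # a standard VAE would also predict log-variance here;
        # keeping sigma constant removes the collapse failure mode
        return mu + self.sigma * torch.randn_like(mu)
```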

industry-level solution

VibeVoice: A Frontier Open-Source Text-to-Speech Model
tl;dr: apply LatentLM architecture to long-form conversational audio synthesis
- train a σ-VAE for audio compression: a 7.5 Hz latent rate is insane (~3200x compression)
- model can generate up to 90min with 4 speakers
- beats Gemini + ElevenLabs on human eval, 10x fewer steps than VALL-E 2
link: https://microsoft.github.io/VibeVoice/

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
tl;dr: scale next-token diffusion to 14B parameters with lightweight flow matching for state-of-the-art image generation
- switch the diffusion loss to a flow-matching objective
- 14B transformer + 157M flow head (same quality as a 528M head) - the ratio doesn't matter, confirming the transformer does all the modeling
- channel-wise norm in the tokenizer is critical for stability at high CFG
link: https://stepfun.ai/research/en/nextstep1
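
For contrast with the ε-prediction loss above: in flow matching the head regresses the constant velocity along a straight noise-to-data path instead of the noise itself. A sketch under assumed shapes, not NextStep-1's exact objective:

```python
import torch

def flow_matching_loss(head, x1, z):
    """head(x_t, z, t) should predict the velocity x1 - x0 of a
    straight interpolation path between noise x0 and data x1."""
    x0 = torch.randn_like(x1)          # noise endpoint
    t = torch.rand(x1.shape[0], 1)     # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                 # constant target velocity
    return ((head(x_t, z, t) - v_target) ** 2).mean()
```

The straight-path velocity target is what enables few-step sampling at generation time.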

my thoughts

The core win: continuous latents via a diffusion head = no VQ-VAE bottleneck. Smoother reconstruction, fewer artifacts, VAE training just works. Diffusion head size doesn't matter (2M vs 45M, same quality), which means the transformer has already learned everything; the head just samples.

Clean merge of AR and diffusion - not a Frankenstein hybrid, just "model sequences autoregressively, sample via diffusion instead of argmax."

In addition, this inherits the entire causal LLM toolkit (KV caching, flash attention, etc.) - the transformer backbone stays autoregressive; only the head changes.
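
The whole recipe fits in a toy generation loop: a normal causal backbone (so KV caching applies unchanged), with a few denoising steps replacing argmax at each position. Every component and the crude Euler sampler below are stand-ins, not any paper's actual API:

```python
import torch

def generate(backbone, head, prompt_tokens, n_new=8, denoise_steps=10):
    """AR loop over continuous tokens; sampling = short denoising run."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        z = backbone(torch.stack(tokens))[-1]      # hidden state for next slot
        x = torch.randn_like(tokens[-1])           # start from pure noise
        for i in range(denoise_steps):
            t = torch.full((1,), 1 - i / denoise_steps)
            x = x - head(x, z, t) / denoise_steps  # crude Euler step (sketch)
        tokens.append(x)
    return tokens
```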

I suspect this should work well for neural foundation models. Let's see.