Forwarded from LIFT feed
Fresh from Precision Neuroscience: they have again tested their thin-film, ultra-high-density ECoG microelectrode arrays in humans; each array is the size of a postage stamp and carries 1,024 electrodes. Through a narrow slit (900 µm) in the skull, up to four films were slid in per patient, covering roughly 8 cm² of cortex with more than 4,000 electrodes. Compared with their previous result, they added neuromodulation. To date, Precision has implanted its device in >50 patients and has received FDA clearance for implantation.
#tech | #readout | #modulation | #brain
Nature
Minimally invasive implantation of scalable high-density cortical microelectrode arrays for multimodal neural decoding and stimulation
Nature Biomedical Engineering - A 1,024-channel microelectrode array is delivered to the brain cortex via a minimally invasive incision in the skull and dura, and allows recording, stimulation and...
[image: next-token-diffusion.png]
tasty next-token diffusion papers
Autoregressive Image Generation without Vector Quantization (MAR)
tl;dr: proposes a diffusion head that models each token's continuous distribution instead of a cross-entropy softmax, so no VQ-VAE is needed (loss sketch below the link)
- bidirectional attention (MAE-style) + random generation order let the model see the full context, unlike causal AR
- generates 64 tokens at once
- the diffusion head is tiny (2M params works as well as 45M), which shows the transformer backbone learns everything and the head just samples
link: https://arxiv.org/abs/2406.11838
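To make the mechanics concrete, here is a minimal PyTorch sketch of the per-token diffusion loss, assuming a DDPM-style noise-prediction objective; the dims, MLP layout, and schedule are illustrative guesses, not MAR's exact config:
```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Tiny MLP that denoises one continuous token, conditioned on the
    transformer output z — this replaces the cross-entropy softmax head."""
    def __init__(self, token_dim=16, cond_dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),  # predicted noise ε̂
        )

    def forward(self, x_t, t, z):
        return self.net(torch.cat([x_t, t, z], dim=-1))

def diffusion_loss(head, x, z, alpha_bar):
    """L = E_{t,ε} ||ε - ε̂(x_t, t, z)||², with x the clean continuous token."""
    b = x.shape[0]
    t = torch.randint(0, len(alpha_bar), (b,))
    a = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps   # forward noising of the token
    t_norm = t.float().unsqueeze(-1) / len(alpha_bar)
    return ((eps - head(x_t, t_norm, z)) ** 2).mean()

# toy usage: batch of 8 clean latent tokens with their transformer outputs
head = DiffusionHead()
alpha_bar = torch.linspace(0.999, 0.01, 1000)   # toy noise schedule
loss = diffusion_loss(head, torch.randn(8, 16), torch.randn(8, 1024), alpha_bar)
```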
Multimodal Latent Language Modeling with Next-Token Diffusion (LatentLM)
tl;dr: extends MAR's approach to multimodal LLMs, unifying text, image, and audio generation in a single framework
- makes the VAE more stable: a σ-VAE where the encoder predicts only the mean and the variance is held fixed, which prevents variance collapse (sketch below the link)
- uses the same diffusion loss as the MAR paper
link: https://arxiv.org/abs/2412.08635
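A tiny sketch of the σ-VAE sampling step as I understand it; the fixed_sigma value is an assumption, not the paper's number:
```python
import torch

def sigma_vae_sample(mu: torch.Tensor, fixed_sigma: float = 0.5) -> torch.Tensor:
    """Encoder outputs only the mean; the variance is a fixed constant
    instead of a learned log-variance, so it cannot shrink toward zero
    (the variance-collapse failure mode of a standard VAE)."""
    return mu + fixed_sigma * torch.randn_like(mu)
```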
industry-level solution
VibeVoice: A Frontier Open-Source Text-to-Speech Model
tl;dr: applies the LatentLM architecture to long-form conversational audio synthesis
- trains a σ-VAE for audio compression: a 7.5 Hz token rate is insane (3200x compression; arithmetic check below the link)
- the model can generate up to 90 min of audio with 4 speakers
- beats Gemini and ElevenLabs on human eval; 10x fewer generation steps than VALL-E 2
link: https://microsoft.github.io/VibeVoice/
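Quick arithmetic check on that ratio (assuming a 24 kHz source sampling rate, which the post doesn't state):
```python
# 24 kHz waveform samples vs 7.5 latent frames per second
# (24 kHz is an assumed source rate, not stated above)
sample_rate_hz = 24_000
latent_rate_hz = 7.5
print(sample_rate_hz / latent_rate_hz)  # 3200.0 -> matches the quoted 3200x
```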
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
tl;dr: scales next-token diffusion to 14B parameters with a lightweight flow-matching head for state-of-the-art image generation
- swaps the DDPM-style diffusion loss for flow matching (sketch below the link)
- 14B transformer + 157M flow head (same quality as a 528M head): the head-to-backbone ratio doesn't matter, again confirming the transformer does all the modeling
- channel-wise normalization in the tokenizer is critical for stability at high CFG
link: https://stepfun.ai/research/en/nextstep1
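A minimal sketch of what a flow-matching head loss looks like, reusing the head signature from the MAR sketch above; the linear interpolant and velocity target are a common convention, not necessarily NextStep-1's exact parameterization:
```python
import torch

def flow_matching_loss(head, x, z):
    """Regress the head's predicted velocity onto a straight-line target;
    one regression per sampled t, no DDPM noise schedule needed."""
    t = torch.rand(x.shape[0], 1)        # uniform time in [0, 1]
    eps = torch.randn_like(x)            # noise endpoint of the path
    x_t = (1 - t) * x + t * eps          # linear interpolant: data -> noise
    v_target = eps - x                   # constant velocity along that path
    return ((head(x_t, t, z) - v_target) ** 2).mean()
```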
my thoughts
The core win: continuous latents via a diffusion head remove the VQ-VAE bottleneck. Smoother reconstruction, fewer artifacts, and VAE training just works. The diffusion head's size doesn't matter (2M vs 45M, same quality), which means the transformer has already learned everything; the head just samples.
Clean merge of AR and diffusion: not a Frankenstein hybrid, just "model sequences autoregressively, sample via diffusion instead of argmax."
In addition, this inherits the entire causal-LLM toolkit (KV caching, flash attention, etc.): the transformer backbone stays autoregressive, only the head changes (decoding sketch below).
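A hypothetical decoding loop to show why that carries over; the backbone(token, past_kv=...) API and sample_fn are made up for illustration, not a real library interface:
```python
import torch

@torch.no_grad()
def generate(backbone, head, prompt, n_new, sample_fn):
    # backbone(token, past_kv=...) -> (hidden z, updated cache) is a
    # hypothetical signature; only the sampling step differs from a
    # standard causal LLM loop (diffusion denoising instead of argmax)
    tokens, cache = list(prompt), None
    for tok in tokens:                      # prefill: build the KV cache
        z, cache = backbone(tok, past_kv=cache)
    for _ in range(n_new):
        tokens.append(sample_fn(head, z))   # few denoising steps, not argmax
        z, cache = backbone(tokens[-1], past_kv=cache)  # one cached AR step
    return tokens
```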
I guess it should work well for neural foundation models. Let's see.