Data Science | Machine Learning with Python for Researchers
31.5K subscribers
1.59K photos
102 videos
22 files
1.87K links
Admin: @HusseinSheikho

The Data Science and Python channel is for researchers and advanced programmers

Buy ads: https://telega.io/c/dataScienceT
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

9 May 2025 · Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li ·

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of the pretraining compute and 1/10 of the downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
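
To make the latent-action idea concrete, here is a minimal PyTorch sketch of an inverse-dynamics-style latent action model over DINO frame features, plus a small head that decodes a quantized latent action into robot commands. Module names, dimensions, and the nearest-neighbour quantization are illustrative assumptions, not the released implementation (see the Code link below).

```python
# Minimal sketch in the spirit of UniVLA's latent action model. Shapes, names,
# and the quantization scheme are assumptions, not the authors' code
# (see https://github.com/opendrivelab/univla for the real implementation).
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Infers a discrete, task-centric latent action from two DINO frame features."""
    def __init__(self, feat_dim=768, codebook_size=32, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.GELU(), nn.Linear(512, latent_dim)
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # latent action vocabulary

    def forward(self, feat_t, feat_tp1):
        z = self.encoder(torch.cat([feat_t, feat_tp1], dim=-1))
        dists = torch.cdist(z, self.codebook.weight)   # (B, K) distances to codes
        idx = dists.argmin(dim=-1)                     # nearest-neighbour quantization
        return idx, self.codebook(idx)

class LatentActionDecoder(nn.Module):
    """Maps a quantized latent action to a robot-specific action vector."""
    def __init__(self, latent_dim=128, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(latent_dim, 256), nn.GELU(),
                                  nn.Linear(256, action_dim))

    def forward(self, z_q):
        return self.head(z_q)

if __name__ == "__main__":
    lam, dec = LatentActionModel(), LatentActionDecoder()
    f_t, f_tp1 = torch.randn(4, 768), torch.randn(4, 768)
    idx, z_q = lam(f_t, f_tp1)
    print(idx.shape, dec(z_q).shape)   # torch.Size([4]) torch.Size([4, 7])
```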


Paper: https://arxiv.org/pdf/2505.06111v2.pdf

Code: https://github.com/opendrivelab/univla

Datasets: R2R - VLN-CE - Open-X-Embodiment

🔗 Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

24 Apr 2025 · Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang ·

Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. Meanwhile, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers, using both model-based and human evaluations (the latter from the original paper authors), with author-released repositories as ground truth where available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths on the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
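
The staged plan → analyze → generate flow can be mocked up in a few functions. The call_llm stub and the prompt strings below are placeholders of our own, not PaperCoder's agents or prompt templates; only the dependency-aware, three-stage structure is the point.

```python
# Toy sketch of a plan -> analyze -> generate pipeline in the spirit of PaperCoder.
# `call_llm` is a stub you would wire to any LLM client; prompts and stage
# boundaries are illustrative assumptions, not the authors' exact design.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder standing in for a real LLM call."""
    return f"[LLM output for: {prompt[:40]}...]"

@dataclass
class RepoPlan:
    architecture: str = ""
    file_list: list = field(default_factory=list)

def plan(paper_text: str) -> RepoPlan:
    arch = call_llm("Design the system architecture for:\n" + paper_text)
    files = call_llm("List the files (with dependencies) needed for:\n" + paper_text)
    return RepoPlan(architecture=arch, file_list=files.splitlines())

def analyze(paper_text: str, repo_plan: RepoPlan) -> dict:
    # One implementation note per planned file.
    return {f: call_llm(f"Implementation details for {f} given:\n{paper_text}")
            for f in repo_plan.file_list}

def generate(repo_plan: RepoPlan, notes: dict) -> dict:
    # Dependency-aware generation: files are produced in the planned order.
    return {f: call_llm(f"Write {f}.\nArchitecture: {repo_plan.architecture}\nNotes: {notes[f]}")
            for f in repo_plan.file_list}

if __name__ == "__main__":
    paper = "Abstract of a machine learning paper..."
    p = plan(paper)
    repo = generate(p, analyze(paper, p))
    print(list(repo))
```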


Paper: https://arxiv.org/pdf/2504.17192v2.pdf

Code: https://github.com/going-doer/paper2code

📱 WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Forwarded from Thomas
🎁❗️TODAY FREE❗️🎁

Entry to our VIP channel is completely free today. Tomorrow it will cost $500! 🔥

JOIN 👇

https://t.iss.one/+VKT2Gy3kE6A4NTE5
Absolute Zero: Reinforced Self-play Reasoning with Zero Data

6 May 2025 · Andrew Zhao, Yiran Wu, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang ·

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (#AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall #SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
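
The executor-grounded reward at the heart of this setup is simple to illustrate: a proposer emits a program, a solver predicts its output, and a code executor supplies the ground truth that makes the reward verifiable. Everything below is a stub of our own except that executor idea; in AZR a single learned model plays both the proposer and solver roles.

```python
# Minimal sketch of a propose / solve / verify loop with an executor-based,
# verifiable reward, in the spirit of Absolute Zero. The proposer and solver
# are hard-coded stubs; details are assumptions, not the AZR implementation.
import subprocess
import sys
import textwrap

def run_python(src: str, timeout: float = 2.0) -> str:
    """Code executor: run a snippet in a subprocess and capture its stdout."""
    out = subprocess.run([sys.executable, "-c", src],
                         capture_output=True, text=True, timeout=timeout)
    return out.stdout.strip()

def propose_task() -> str:
    # In AZR this comes from the model itself; here, a fixed verifiable toy program.
    return textwrap.dedent("""
        x = [3, 1, 4, 1, 5]
        print(sum(v * v for v in x))
    """)

def solve_task(program: str) -> str:
    # The solver predicts the program's output; stubbed with a fixed guess.
    return "52"

def verifiable_reward(program: str, predicted: str) -> float:
    truth = run_python(program)            # the executor provides ground truth
    return 1.0 if predicted == truth else 0.0

if __name__ == "__main__":
    prog = propose_task()
    print("reward:", verifiable_reward(prog, solve_task(prog)))  # 1.0 (9+1+16+1+25 = 52)
```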


Paper: https://arxiv.org/pdf/2505.03335v2.pdf

Code: https://github.com/LeapLabTHU/Absolute-Zero-Reasoner

📱 WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

14 May 2025 · Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu ·

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding, then on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
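
The central design choice, diffusing in CLIP feature space rather than a VAE latent space, can be sketched with a toy epsilon-prediction training step. The small MLP standing in for the diffusion transformer, the noise schedule, and the feature dimension are assumptions for illustration, not the released BLIP3-o model.

```python
# Toy epsilon-prediction step over CLIP-space features, illustrating the idea of
# generating semantic CLIP features with a diffusion model. All hyperparameters
# and the tiny denoiser are assumptions, not the BLIP3-o architecture.
import torch
import torch.nn as nn

CLIP_DIM = 768      # dimensionality of the CLIP image features being generated
T = 1000            # number of diffusion steps

betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                      # stand-in for a diffusion transformer
    nn.Linear(CLIP_DIM + 1, 1024), nn.SiLU(),
    nn.Linear(1024, CLIP_DIM),
)

def training_step(clip_feats: torch.Tensor) -> torch.Tensor:
    """One noise-prediction step on a batch of target CLIP features."""
    b = clip_feats.size(0)
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(clip_feats)
    a_bar = alphas_bar[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * clip_feats + (1 - a_bar).sqrt() * eps
    t_embed = (t.float() / T).unsqueeze(-1)    # crude timestep conditioning
    pred_eps = denoiser(torch.cat([noisy, t_embed], dim=-1))
    return nn.functional.mse_loss(pred_eps, eps)

if __name__ == "__main__":
    print(float(training_step(torch.randn(8, CLIP_DIM))))
```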


Paper: https://arxiv.org/pdf/2505.09568v1.pdf

Code: https://github.com/jiuhaichen/blip3o
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

13 May 2025 · Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng ·

While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, because supervised fine-tuning (SFT) on static demonstrations offers only limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework, V-ToolRL, to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".
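
A bare-bones REINFORCE loop conveys the V-ToolRL idea: the policy decides which vision tool to invoke and is rewarded only when that choice leads to task success. The tool list, the toy environment, and the tiny policy network are placeholders of our own, not the paper's setup or the OpenThinkIMG interfaces.

```python
# REINFORCE-style sketch of learning a tool-invocation policy from task-success
# reward, loosely in the spirit of V-ToolRL. Tools, environment, and policy are
# illustrative assumptions, not the OpenThinkIMG framework.
import torch
import torch.nn as nn

TOOLS = ["crop", "zoom", "draw_line", "read_chart_value"]

policy = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, len(TOOLS)))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def task_success(tool: str) -> float:
    # Placeholder environment: pretend chart questions need 'read_chart_value'.
    return 1.0 if tool == "read_chart_value" else 0.0

for step in range(200):
    obs = torch.randn(1, 16)                         # stand-in for an LVLM state embedding
    dist = torch.distributions.Categorical(logits=policy(obs))
    action = dist.sample()
    reward = task_success(TOOLS[action.item()])
    loss = (-dist.log_prob(action) * reward).mean()  # REINFORCE, no baseline for brevity
    opt.zero_grad()
    loss.backward()
    opt.step()

print("tool probabilities:", torch.softmax(policy(torch.randn(1, 16)), dim=-1))
```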


Paper: https://arxiv.org/pdf/2505.08617v1.pdf

Code: https://github.com/zhaochen0110/openthinkimg

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

🖥 Github: https://github.com/reml-group/deliberation-on-priors

📕 Paper: https://arxiv.org/abs/2505.15210v1

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
PIA S5 Proxy solves all problems for AI developers.

🔥 Why top AI teams choose PIA S5 Proxy:
🔹 SOCKS5 proxy: as low as $0.045/IP
Global high-quality IP | No traffic limit / static IP
High success rate >99.9% | Ultra-low latency | Stable anti-ban
Smart crawling API that supports seamless integration

🔹 Unlimited traffic proxy: only $79/day
Unlimited traffic | Unlimited concurrency | Bandwidth over 100Gbps | Customization supported
Best for large-scale AI / LLM data collection
Save up to 90% on crawling costs

Exclusive for new users:
Enter the coupon code [AI Python] to enjoy a 10% discount!

🚀 Buy now: https://www.piaproxy.com/?co=piaproxy&ck=?ai
🎯 Start your professional programming journey with
#Python_Mastery_Course 🐍
Do you want to learn the world's most in-demand programming language?
Do you dream of breaking into fields like artificial intelligence, data analysis, or UI design?
📢 This course was designed to be your launchpad toward the future!
________________________________________
🚀 What will you learn in this course?
🔹 Unit 1: Python fundamentals (variables, data types, operators, code basics)
🔹 Unit 2: Control flow (conditions, loops, control statements)
🔹 Unit 3: Data structures (lists, dictionaries, sets, tuples)
🔹 Unit 4: Functions (definition, parameters, scope, recursion)
🔹 Unit 5: Modules
🔹 Unit 6: Working with files and CSV files
🔹 Unit 7: Professional exception handling
🔹 Unit 8: Object-oriented programming (OOP)
🔹 Unit 9: Advanced concepts:
   Generators
   Iterators
   Decorators
💡 By the end of the course you will be able to:
✔️ Build real projects in Python
✔️ Move confidently into advanced fields such as artificial intelligence and data analysis
✔️ Automate tasks and work with data professionally

🎥 Course format:
• Live sessions with the instructor, Dr. Mohammed Emad Arafa
• All lectures are uploaded to the website so you can watch them whenever it suits you
🕒 Course duration: 25 training hours
📅 Start date: 15-6
💰 Early-bird discount
Contact us now and mention course code "001"
https://t.iss.one/Agartha_Support
💃 GENMO: Generalist Human Motion by NVIDIA 💃

NVIDIA introduces GENMO, a unified generalist model for human motion that seamlessly combines motion estimation and generation within a single framework. GENMO supports conditioning on videos, 2D keypoints, text, music, and 3D keyframes, enabling highly versatile motion understanding and synthesis.

Currently, no official code release is available.

Review:
https://t.ly/Q5T_Y

Paper:
https://lnkd.in/ds36BY49

Project Page:
https://lnkd.in/dAYHhuFU

#NVIDIA #GENMO #HumanMotion #DeepLearning #AI #ComputerVision #MotionGeneration #MachineLearning #MultimodalAI #3DReconstruction


✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Generative AI with LangChain 2025

Docker Deep Dive 2025

Available now in our paid channels

Join our paid channel for free books and courses ($2 monthly)
https://t.iss.one/+r_Tcx2c-oVU1OWNi
Forwarded from Thomas
🪙 +$30,560 from $300 in a month of trading! We can teach you how to earn, for FREE!

It was a challenge: a marathon from $300 to $30,000 in trading, together with Lisa!

What is the essence of the earnings? "Analyze and open a trade on the exchange, knowing where the currency rate will go. Lisa trades every day and posts signals on her channel for free."

🔹Start: $150
🔹 Goal: $20,000
🔹Period: 1.5 months.

Join and get started, there will be no second chance👇

https://t.iss.one/+HjHm7mxR5xllNTY5
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

🖥 Github: https://github.com/haoningwu3639/SpatialScore

📕 Paper: https://arxiv.org/abs/2505.17012v1

🔗 Tasks: https://paperswithcode.com/task/motion-estimation

https://t.iss.one/DataScienceT 🌟
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

20 May 2025 · Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang ·

Document image parsing is challenging due to its complex, intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
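
The analyze-then-parse control flow is easy to sketch: stage one returns layout anchors in reading order, and stage two parses each anchor in parallel with an element-specific prompt. The stub functions and prompt strings below are assumptions; the actual model, prompts, and weights are in the linked repository.

```python
# Schematic of an analyze-then-parse flow as described for Dolphin. The layout
# and parsing functions are stubs; see https://github.com/bytedance/dolphin for
# the real model.
from concurrent.futures import ThreadPoolExecutor

PROMPTS = {"text": "Transcribe this paragraph.",
           "table": "Parse this table to HTML.",
           "formula": "Transcribe this formula to LaTeX."}

def analyze_layout(page_image):
    """Stage 1 (stub): layout elements in reading order."""
    return [{"type": "text", "bbox": (0, 0, 500, 80)},
            {"type": "table", "bbox": (0, 100, 500, 300)},
            {"type": "formula", "bbox": (0, 320, 500, 360)}]

def parse_element(page_image, element):
    """Stage 2 (stub): parse one anchored region with its task-specific prompt."""
    return f"<{element['type']} parsed with prompt: {PROMPTS[element['type']]}>"

def parse_document(page_image):
    anchors = analyze_layout(page_image)
    with ThreadPoolExecutor() as pool:             # anchors are parsed in parallel
        return list(pool.map(lambda e: parse_element(page_image, e), anchors))

if __name__ == "__main__":
    for chunk in parse_document(page_image=None):
        print(chunk)
```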


Paper: https://arxiv.org/pdf/2505.14059v1.pdf

Code: https://github.com/bytedance/dolphin

Dataset: PubTabNet

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

20 May 2025 · Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu ·

Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving a seamless integration of visual and textual reasoning that mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning, without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvements in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe a distinct evolution of tool-calling behavior, from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
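
One concrete piece worth illustrating is the reward strategy that encourages successful tool-assisted trajectories. The sketch below shows a plausible shaping scheme (correctness plus a small bonus for image-tool calls that led to a correct answer); the weights and the Trajectory structure are assumptions, not the paper's exact reward.

```python
# Illustrative reward shaping for "thinking with images": reward correctness,
# and add a bonus only when a grounding/zoom tool call contributed to a correct
# answer. Weights and data structures are assumptions, not DeepEyes' reward.
from dataclasses import dataclass

@dataclass
class Trajectory:
    answer: str
    used_image_tool: bool          # did the model crop/zoom during its reasoning?

def reward(traj: Trajectory, gold: str,
           correct_w: float = 1.0, tool_bonus: float = 0.2) -> float:
    r = correct_w if traj.answer.strip() == gold.strip() else 0.0
    if r > 0 and traj.used_image_tool:
        r += tool_bonus            # only reward tool use that paid off
    return r

print(reward(Trajectory("42", True), "42"))    # 1.2
print(reward(Trajectory("42", False), "42"))   # 1.0
print(reward(Trajectory("7", True), "42"))     # 0.0
```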


Paper: https://arxiv.org/pdf/2505.14362v1.pdf

Code: https://github.com/visual-agent/deepeyes

Datasets: MS COCO - RefCOCO - MathVista - MathVerse

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
MMaDA: Multimodal Large Diffusion Language Models

21 May 2025 · Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang ·

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
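
UniGRPO builds on group-relative policy optimization, where each prompt's sampled responses are scored and then normalized against their own group's statistics. The snippet below shows that plain group-relative advantage computation; the diffusion-specific modeling and diversified reward design of UniGRPO are not reproduced here.

```python
# Group-relative advantages as used by GRPO-family methods. The rewards are made
# up; only the per-group normalization (the group acts as its own baseline) is
# the point. This is not MMaDA's UniGRPO implementation.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one group of sampled answers per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # prompt 1: two of four samples correct
                        [0.2, 0.9, 0.4, 0.5]])  # prompt 2: graded rewards
print(group_relative_advantages(rewards))
```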


Paper: https://arxiv.org/pdf/2505.15809v1.pdf

Code: https://github.com/gen-verse/mmada

Colab: https://huggingface.co/spaces/Gen-Verse/MMaDA

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

26 May 2025 · Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, Li Yuan ·

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
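
As a rough illustration of what a subject-consistency metric such as NexusScore measures, the snippet below compares a reference-subject embedding against per-frame embeddings with cosine similarity. The real metric's detection and embedding pipeline is different; this only shows the general shape of the computation.

```python
# Toy subject-consistency score: mean cosine similarity between a reference
# subject embedding and per-frame embeddings. The embeddings here are random
# stand-ins; the actual NexusScore pipeline is more involved.
import torch
import torch.nn.functional as F

def subject_consistency(ref_embed: torch.Tensor, frame_embeds: torch.Tensor) -> float:
    """ref_embed: (D,), frame_embeds: (num_frames, D) -> mean cosine similarity."""
    sims = F.cosine_similarity(frame_embeds, ref_embed.unsqueeze(0), dim=-1)
    return sims.mean().item()

ref = torch.randn(512)
frames = ref + 0.1 * torch.randn(16, 512)      # frames that stay close to the subject
print(round(subject_consistency(ref, frames), 3))
```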


Paper: https://arxiv.org/pdf/2505.20292v1.pdf

Codes:
https://github.com/PKU-YuanGroup/ConsisID
https://github.com/PKU-YuanGroup/OpenS2V-Nexus

Datasets: OpenS2V-5M - OpenS2V-Eval

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
🔥 Accelerate Your IT Career with FREE Certification Kits!
🚀 Get Hired Faster—Zero Cost!
Grab expert guides, labs, and courses for AWS, Azure, AI, Python, Cyber Security, and beyond—100% FREE, no hidden fees!
CLICK your field👇
DOWNLOAD & dominate your goals!
🔗 AWS + Azure Cloud Mastery: https://bit.ly/44S0dNS
🔗 AI & Machine Learning Starter Kit: https://bit.ly/3FrKw5H
🔗 Python, Excel, Cyber Security Courses: https://bit.ly/4mFrA4g
📘 FREE Career Hack: IT Success Roadmap E-book ➔ https://bit.ly/3Z6JS49

🚨 Limited Time! Act FAST!
📱 Join Our IT Study Group: https://bit.ly/43piMq8
💬 1-on-1 Exam Help: https://wa.link/sbpp0m
Your dream job won’t wait—GRAB YOUR RESOURCES NOW! 💻
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

28 May 2025 · Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong ·

We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted into a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays into the corresponding pixel values, guided by the triangle sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
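
The token flow (triangle tokens through a view-independent encoder, then ray-bundle tokens attending to them in a view-dependent decoder that emits pixel patches) can be mocked up with stock PyTorch transformer layers. Dimensions, token featurization, and patch size below are illustrative assumptions, not the paper's configuration.

```python
# Skeleton of a triangles-to-pixel-patches sequence model in the spirit of
# RenderFormer. Sizes and featurization are assumptions, not the real model.
import torch
import torch.nn as nn

D, PATCH = 256, 8 * 8 * 3                       # token width, 8x8 RGB patch

class TinyRenderFormer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tri_embed = nn.Linear(9 + 3, D)    # 3 vertices (xyz) + RGB reflectance
        self.ray_embed = nn.Linear(6, D)        # ray-bundle origin + direction
        enc_layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(D, nhead=8, batch_first=True)
        self.view_independent = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.view_dependent = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.to_pixels = nn.Linear(D, PATCH)

    def forward(self, triangles, rays):
        scene = self.view_independent(self.tri_embed(triangles))    # light transport
        patches = self.view_dependent(self.ray_embed(rays), scene)  # per-view decoding
        return self.to_pixels(patches)

model = TinyRenderFormer()
tris = torch.randn(1, 128, 12)                  # 128 triangles, 12 features each
rays = torch.randn(1, 64, 6)                    # 64 pixel-patch ray bundles
print(model(tris, rays).shape)                  # torch.Size([1, 64, 192])
```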


Paper: https://arxiv.org/pdf/2505.21925v1.pdf

Code: https://github.com/microsoft/renderformer

Dataset: Objaverse

https://t.iss.one/DataScienceT 🔗