Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
24 Apr 2025 · Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang ·
Paper: https://arxiv.org/pdf/2504.17192v2.pdf
Code: https://github.com/going-doer/paper2code
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
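For intuition, here is a minimal sketch of such a planning, analysis, and generation pipeline in plain Python. It is not PaperCoder's implementation: `call_llm`, the prompts, and the `RepoPlan` container are placeholders you would swap for a real LLM client and the prompts from the repository.
```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder: route the prompt to your LLM of choice and return its reply."""
    raise NotImplementedError("plug in your LLM client here")

@dataclass
class RepoPlan:
    architecture: str = ""
    file_order: list[str] = field(default_factory=list)
    config: str = ""

def plan(paper_text: str) -> RepoPlan:
    # Planning stage: roadmap, architecture, file dependencies, config files.
    arch = call_llm(f"Design a system architecture for this paper:\n{paper_text}")
    files = call_llm(f"List the implementation files in dependency order:\n{arch}").splitlines()
    cfg = call_llm(f"Write a YAML config file for:\n{arch}")
    return RepoPlan(architecture=arch, file_order=files, config=cfg)

def analyze(paper_text: str, repo_plan: RepoPlan) -> dict[str, str]:
    # Analysis stage: per-file, implementation-specific notes.
    return {f: call_llm(f"Extract implementation details for {f} from:\n{paper_text}")
            for f in repo_plan.file_order}

def generate(repo_plan: RepoPlan, notes: dict[str, str]) -> dict[str, str]:
    # Generation stage: emit code file by file, in dependency order.
    repo: dict[str, str] = {}
    for f in repo_plan.file_order:
        context = "\n\n".join(repo.values())  # previously generated files as context
        repo[f] = call_llm(f"Write {f}.\nNotes: {notes[f]}\nExisting code:\n{context}")
    return repo
```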
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
6 May 2025 · Andrew Zhao, Yiran Wu, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang ·
Paper: https://arxiv.org/pdf/2505.03335v2.pdf
Code: https://github.com/LeapLabTHU/Absolute-Zero-Reasoner
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall state-of-the-art (SOTA) performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
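As a rough illustration of the propose-solve-verify loop described above, here is a minimal self-play step in Python. The `llm` stub, the JSON task format, and the 0/1 reward are assumptions for the sketch; AZR's actual task types, sandboxing, and reward shaping differ.
```python
import json

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in the policy model here")

def run_program(src: str, inp):
    """Execute an untrusted snippet in a throwaway namespace; return solve(inp), or None on failure."""
    ns: dict = {}
    try:
        exec(src, ns)                      # a real system would sandbox this step
        return str(ns["solve"](inp))
    except Exception:
        return None

def self_play_step() -> float:
    # 1) Propose: the model writes a task as JSON, e.g. {"program": "...", "input": 3}.
    reply = llm('Propose a Python function `solve(x)` and one input, as JSON with keys "program" and "input".')
    task = json.loads(reply)
    gold = run_program(task["program"], task["input"])
    if gold is None:
        return 0.0                         # invalid proposal: no verifiable reward
    # 2) Solve: the same model predicts the output without executing the code.
    pred = llm(f'What does solve({task["input"]!r}) return?\n{task["program"]}')
    # 3) Verify: the code executor is the single source of verifiable reward.
    return 1.0 if pred.strip() == gold else 0.0
```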
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
14 May 2025 · Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu ·
Paper: https://arxiv.org/pdf/2505.09568v1.pdf
Code: https://github.com/jiuhaichen/blip3o
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models, first training on image understanding and subsequently on image generation, offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
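To make the core design concrete, below is a toy PyTorch sketch of training a diffusion-style transformer to reconstruct CLIP image features instead of VAE latents. The linear noising schedule, shapes, and hyperparameters are illustrative assumptions, not the BLIP3-o recipe, and the frozen CLIP encoder producing `clip_feats` is not shown.
```python
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    """Toy transformer that denoises CLIP feature tokens conditioned on a noise level t."""
    def __init__(self, dim: int = 768, layers: int = 4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.time_embed = nn.Linear(1, dim)      # embed the scalar noise level

    def forward(self, noisy_feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        h = noisy_feats + self.time_embed(t[:, None, None])   # broadcast (B,1,D) over tokens
        return self.backbone(h)                               # predict the clean features

def diffusion_loss(model: nn.Module, clip_feats: torch.Tensor) -> torch.Tensor:
    # clip_feats: (B, N, D) tokens from a frozen CLIP image encoder (not shown here).
    t = torch.rand(clip_feats.size(0), device=clip_feats.device)            # noise level in [0, 1]
    noise = torch.randn_like(clip_feats)
    noisy = (1 - t)[:, None, None] * clip_feats + t[:, None, None] * noise  # linear interpolation
    return nn.functional.mse_loss(model(noisy, t), clip_feats)

model = FeatureDenoiser()
loss = diffusion_loss(model, torch.randn(2, 16, 768))
print(loss.item())
```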
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
13 May 2025 · Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng ·
Paper: https://arxiv.org/pdf/2505.08617v1.pdf
Dataset: https://github.com/zhaochen0110/openthinkimg
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering that supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".
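For a sense of what V-ToolRL optimizes, here is a hedged sketch of a single tool-augmented rollout whose return is task success. The `policy` stub, the `CALL <tool> <args>` action format, and the two toy tools are assumptions for illustration; they are not the OpenThinkIMG interfaces.
```python
def policy(history: list[str]) -> str:
    raise NotImplementedError("the LVLM policy goes here")

# Stand-ins for real vision tools (crop, OCR, chart readers, ...).
TOOLS = {
    "crop": lambda image, args: f"<observation: cropped region {args}>",
    "ocr":  lambda image, args: "<observation: extracted text>",
}

def rollout(image, question: str, gold: str, max_steps: int = 6) -> float:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        action = policy(history)
        history.append(action)
        if action.startswith("CALL "):              # e.g. "CALL crop 10,10,50,50"
            parts = action.split(" ", 2)
            name, args = parts[1], parts[2] if len(parts) > 2 else ""
            history.append(TOOLS[name](image, args))  # tool feedback re-enters the context
        else:                                       # the model committed to a final answer
            return 1.0 if action.strip() == gold.strip() else 0.0
    return 0.0                                      # no answer within the step budget
```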
Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
🖥 Github: https://github.com/reml-group/deliberation-on-priors
📕 Paper: https://arxiv.org/abs/2505.15210v1
Forwarded from Python | Machine Learning | Coding | R
🎯 Start your professional programming journey with
#Python_Mastery_Course 🐍
Want to learn the most in-demand programming language in the world?
Dreaming of moving into fields like artificial intelligence, data analysis, or UI design?
📢 This course was built to be your launchpad to the future!
________________________________________
🚀 What will you learn in this course?
🔹 Module 1: Python basics (variables, data types, operations, code fundamentals)
🔹 Module 2: Control flow (conditions, loops, control statements)
🔹 Module 3: Data structures (lists, dictionaries, sets, tuples)
🔹 Module 4: Functions (defining functions, parameters, scope, recursion)
🔹 Module 5: Modules
🔹 Module 6: Working with files and CSV files
🔹 Module 7: Professional exception handling
🔹 Module 8: Object-oriented programming (OOP)
🔹 Module 9: Advanced concepts:
✅ Generators
✅ Iterators
✅ Decorators
💡 By the end you will be able to:
✔️ Build real projects in Python
✔️ Move confidently into advanced fields such as AI and data analysis
✔️ Automate tasks and work with data professionally
🎥 Course format:
• Live sessions with the instructor Dr. Mohammed Emad Arafa
• All lectures are uploaded to the site so you can watch them whenever it suits you
🕒 Course length: 25 training hours
📅 Start date: 15 June
💰 Early-bird discount
Contact us now and mention course code "001"
https://t.iss.one/Agartha_Support
NVIDIA introduces GENMO, a unified generalist model for human motion that seamlessly combines motion estimation and generation within a single framework. GENMO supports conditioning on videos, 2D keypoints, text, music, and 3D keyframes, enabling highly versatile motion understanding and synthesis.
Currently, no official code release is available.
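Since no official code is out, here is a purely speculative sketch of what a unified conditioning interface for such a model could look like: any subset of the supported modalities is passed in, and absent ones are dropped before a single motion backbone consumes them. Names and structure are guesses, not NVIDIA's API.
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class MotionConditions:
    """Optional conditioning signals; estimation and generation share one entry point."""
    video: Optional[Any] = None
    keypoints_2d: Optional[Any] = None
    text: Optional[str] = None
    music: Optional[Any] = None
    keyframes_3d: Optional[Any] = None

    def present(self) -> dict:
        # Keep only the modalities the caller actually supplied.
        return {name: value for name, value in self.__dict__.items() if value is not None}

# Motion estimation would pass `video`; text- or music-driven generation passes those instead.
print(MotionConditions(text="a person waves").present())
```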
Review:
https://t.ly/Q5T_Y
Paper:
https://lnkd.in/ds36BY49
Project Page:
https://lnkd.in/dAYHhuFU
#NVIDIA #GENMO #HumanMotion #DeepLearning #AI #ComputerVision #MotionGeneration #MachineLearning #MultimodalAI #3DReconstruction
Forwarded from Data Science Machine Learning Data Analysis
Generative AI with LangChain 2025
Docker Deep Dive 2025
Available now in our paid channels
Join our paid channel for free books and courses (2$ monthly)
https://t.iss.one/+r_Tcx2c-oVU1OWNi
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
🖥 Github: https://github.com/haoningwu3639/SpatialScore
📕 Paper: https://arxiv.org/abs/2505.17012v1
🔗 Tasks: https://paperswithcode.com/task/motion-estimation
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
20 May 2025 · Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang ·
Paper: https://arxiv.org/pdf/2505.14059v1.pdf
Code: https://github.com/bytedance/dolphin
Dataset: PubTabNet
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
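The analyze-then-parse flow can be pictured with a short sketch like the one below. The `model` stub, the prompt table, the `type: box` layout format, and `crop_region` are placeholders for illustration, not Dolphin's actual interfaces.
```python
from concurrent.futures import ThreadPoolExecutor

def model(image, prompt: str) -> str:
    raise NotImplementedError("the multimodal parser goes here")

def crop_region(image, box: str):
    raise NotImplementedError("hypothetical helper that crops the element's bounding box")

PROMPTS = {                       # task-specific prompts keyed by element type
    "paragraph": "Transcribe this paragraph.",
    "table":     "Parse this table into HTML.",
    "formula":   "Transcribe this formula as LaTeX.",
}

def parse_page(page_image) -> list[str]:
    # Stage 1 (analyze): layout elements in reading order, e.g. "table: 40,120,560,300".
    layout = model(page_image, "List the layout elements in reading order as 'type: box'.").splitlines()

    # Stage 2 (parse): each element acts as an anchor and is parsed in parallel.
    def parse_element(line: str) -> str:
        etype, box = line.split(":", 1)
        crop = crop_region(page_image, box.strip())
        return model(crop, PROMPTS.get(etype.strip(), "Transcribe this region."))

    with ThreadPoolExecutor() as pool:
        return list(pool.map(parse_element, layout))
```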
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
20 May 2025 · Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu ·
Paper: https://arxiv.org/pdf/2505.14362v1.pdf
Code: https://github.com/visual-agent/deepeyes
Datasets: MS COCO - RefCOCO - MathVista - MathVerse
Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
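As a minimal illustration of the kind of reward strategy the abstract mentions, the sketch below rewards correct answers and adds a small bonus only when the correct answer was reached through tool (image) calls. The weights are made-up values, not DeepEyes' actual reward.
```python
def trajectory_reward(answer: str, gold: str, num_tool_calls: int, tool_bonus: float = 0.2) -> float:
    """Outcome reward plus a bonus for successful tool-assisted trajectories."""
    correct = answer.strip() == gold.strip()
    reward = 1.0 if correct else 0.0
    if correct and num_tool_calls > 0:     # encourage tool use only when it leads to success
        reward += tool_bonus
    return reward

# A correct answer reached after two zoom-in calls scores higher than a tool-free one.
print(trajectory_reward("42", "42", num_tool_calls=2))   # 1.2
print(trajectory_reward("42", "42", num_tool_calls=0))   # 1.0
```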
MMaDA: Multimodal Large Diffusion Language Models
21 May 2025 · Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang ·
Paper: https://arxiv.org/pdf/2505.15809v1.pdf
Code: https://github.com/gen-verse/mmada
Demo: https://huggingface.co/spaces/Gen-Verse/MMaDA
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
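For the RL piece, here is a tiny sketch of the group-relative advantage computation that GRPO-style algorithms (of which UniGRPO is described as a unified variant) rely on: several responses are sampled per prompt and each reward is normalized against its own group. The reward values below are invented for illustration.
```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled response's reward against its own group (one prompt)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled responses scored by task-specific (diversified) reward models.
print(group_advantages([1.0, 0.0, 0.5, 1.0]))
```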
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation
26 May 2025 · Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, Li Yuan ·
Paper: https://arxiv.org/pdf/2505.20292v1.pdf
Codes:
https://github.com/PKU-YuanGroup/ConsisID
https://github.com/PKU-YuanGroup/OpenS2V-Nexus
Datasets: OpenS2V-5M - OpenS2V-Eval
Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
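As a toy example of how the three automatic metrics could be folded into a single number when ranking models, see the sketch below; the equal weighting is an assumption for illustration and is not the OpenS2V-Eval formula.
```python
def overall_score(nexus: float, natural: float, gme: float,
                  weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted mix of subject consistency (NexusScore), naturalness, and text relevance (GmeScore)."""
    w1, w2, w3 = weights
    return w1 * nexus + w2 * natural + w3 * gme

print(round(overall_score(0.82, 0.74, 0.91), 3))   # 0.823
```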
Forwarded from Python | Machine Learning | Coding | R
🔥 Accelerate Your IT Career with FREE Certification Kits!
🚀 Get Hired Faster—Zero Cost!
Grab expert guides, labs, and courses for AWS, Azure, AI, Python, Cyber Security, and beyond—100% FREE, no hidden fees!
✅ CLICK your field👇
✅ DOWNLOAD & dominate your goals!
🔗 AWS + Azure Cloud Mastery: https://bit.ly/44S0dNS
🔗 AI & Machine Learning Starter Kit: https://bit.ly/3FrKw5H
🔗 Python, Excel, Cyber Security Courses: https://bit.ly/4mFrA4g
📘 FREE Career Hack: IT Success Roadmap E-book ➔ https://bit.ly/3Z6JS49
🚨 Limited Time! Act FAST!
📱 Join Our IT Study Group: https://bit.ly/43piMq8
💬 1-on-1 Exam Help: https://wa.link/sbpp0m
Your dream job won’t wait—GRAB YOUR RESOURCES NOW! 💻✨
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination
28 May 2025 · Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong ·
Paper: https://arxiv.org/pdf/2505.21925v1.pdf
Code: https://github.com/microsoft/renderformer
Dataset: Objaverse
We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
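A rough PyTorch sketch of that two-stage layout is below: a view-independent encoder over triangle tokens, then a view-dependent decoder in which ray-bundle tokens cross-attend to the triangle sequence and are projected to pixel patches. All sizes, the token featurizations, and the patch size are illustrative guesses, not the released model.
```python
import torch
import torch.nn as nn

class TwoStageRenderer(nn.Module):
    def __init__(self, dim: int = 256, patch_pixels: int = 8 * 8 * 3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.tri_embed = nn.Linear(9 + 3, dim)     # 3 vertices (xyz) + a toy 3-channel reflectance
        self.ray_embed = nn.Linear(6, dim)         # ray bundle: origin + direction
        self.stage1 = nn.TransformerEncoder(enc, num_layers=4)   # view-independent: triangle-to-triangle transport
        self.stage2 = nn.TransformerDecoder(dec, num_layers=4)   # view-dependent: rays attend to triangles
        self.to_patch = nn.Linear(dim, patch_pixels)

    def forward(self, triangles: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        scene = self.stage1(self.tri_embed(triangles))           # (B, n_tris, dim)
        pix = self.stage2(self.ray_embed(rays), scene)           # (B, n_rays, dim)
        return self.to_patch(pix)                                # (B, n_rays, 8*8*3)

model = TwoStageRenderer()
out = model(torch.randn(1, 128, 12), torch.randn(1, 64, 6))
print(out.shape)   # torch.Size([1, 64, 192])
```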