Forwarded from Python Courses
Forwarded from Python | Machine Learning | Coding | R
Dive deep into the world of Transformers with this comprehensive PyTorch implementation guide. Whether you're a seasoned ML engineer or just starting out, this resource breaks down the complexities of the Transformer model, inspired by the groundbreaking paper "Attention Is All You Need".
https://www.k-a.in/pyt-transformer.html
The guide walks through the implementation step by step. By following along, you'll gain a solid understanding of how Transformers work and how to implement them from scratch.
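To give a flavour of what you'll build, here is a minimal, self-contained PyTorch sketch of the core block from "Attention Is All You Need", multi-head scaled dot-product attention. It is an illustrative simplification written for this post, not the guide's exact code.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention, as in "Attention Is All You Need"."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to Q, K, V in one go
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        # scaled dot-product attention: softmax(QK^T / sqrt(d_head)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)

x = torch.randn(2, 10, 512)                   # (batch, tokens, d_model)
print(MultiHeadSelfAttention()(x).shape)      # torch.Size([2, 10, 512])
```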
#MachineLearning #DeepLearning #PyTorch #Transformer #AI #NLP #AttentionIsAllYouNeed #Coding #DataScience #NeuralNetworks
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Paper: https://arxiv.org/pdf/2502.05512v1.pdf
Code: https://github.com/index-tts/index-tts
https://t.iss.one/DataScienceT
8 Feb 2025 · Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang
Recently, large language model (#LLM) based text-to-speech (#TTS) systems have gradually become mainstream in industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, with several novel improvements. Specifically, for Chinese scenarios we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciation of polyphonic and long-tail characters controllable. We also perform a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speech-code decoder with BigVGAN2. Compared with #XTTS, IndexTTS achieves significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS, and F5-TTS, IndexTTS has a simpler training process, more controllable usage, and faster inference, while also surpassing them in performance. Our demos are available at https://index-tts.github.io.
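The hybrid character + pinyin idea is easy to picture: ordinary characters stay as characters, while polyphonic characters can be replaced by explicit pinyin tokens so their pronunciation is controllable by the caller. The toy sketch below (tiny made-up lexicon, invented token format) only illustrates that frontend idea and is not IndexTTS code.

```python
# Toy sketch of hybrid character + pinyin input modeling (illustrative only, not IndexTTS code).
# Idea: keep ordinary characters as-is, but replace polyphonic characters with explicit
# pinyin tokens so their pronunciation becomes controllable.

POLYPHONIC = {"行": ["xing2", "hang2"], "重": ["zhong4", "chong2"]}  # tiny made-up lexicon

def to_hybrid_tokens(text, forced_pinyin=None):
    """Return a token sequence mixing raw characters and <py:...> pinyin tokens."""
    forced_pinyin = forced_pinyin or {}          # maps character index -> pinyin string
    tokens = []
    for i, ch in enumerate(text):
        if i in forced_pinyin:                   # caller pins down the reading
            tokens.append(f"<py:{forced_pinyin[i]}>")
        elif ch in POLYPHONIC:                   # default to the first listed reading
            tokens.append(f"<py:{POLYPHONIC[ch][0]}>")
        else:
            tokens.append(ch)                    # ordinary character token
    return tokens

print(to_hybrid_tokens("银行重要"))                               # default readings
print(to_hybrid_tokens("银行重要", forced_pinyin={1: "hang2"}))   # force 行 to read hang2
```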
LettuceDetect: A Hallucination Detection Framework for RAG Applications
24 Feb 2025 · Ádám Kovács, Gábor Recski
Paper: https://arxiv.org/pdf/2502.17125v1.pdf
Code: https://github.com/KRLabsOrg/LettuceDetect
Colab: https://colab.research.google.com/drive/1Ubca5aMaBGdHtJ1rpqj3Ke9SLEr-PaDn?usp=sharing
https://t.iss.one/DataScienceT
Retrieval Augmented Generation (#RAG) systems remain vulnerable to hallucinated answers despite incorporating external knowledge sources. We present LettuceDetect, a framework that addresses two critical limitations in existing hallucination detection methods: (1) the context window constraints of traditional encoder-based methods, and (2) the computational inefficiency of #LLM based approaches. Building on ModernBERT's extended context capabilities (up to 8k tokens) and trained on the RAGTruth benchmark dataset, our approach outperforms all previous encoder-based models and most prompt-based models, while being approximately 30 times smaller than the best models. LettuceDetect is a token-classification model that processes context-question-answer triples, allowing for the identification of unsupported claims at the token level. Evaluations on the RAGTruth corpus demonstrate an F1 score of 79.22% for example-level detection, which is a 14.8% improvement over Luna, the previous state-of-the-art encoder-based architecture. Additionally, the system can process 30 to 60 examples per second on a single GPU, making it more practical for real-world RAG applications.
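The underlying recipe, a token-classification head over the concatenated context/question/answer text that flags unsupported answer tokens, can be sketched with the Hugging Face transformers API. The checkpoint name, two-label scheme, and prompt format below are assumptions for illustration, not the authors' released model (which ships as its own package), and the head here is untrained, so the output is only structural.

```python
# Sketch of hallucination detection as token classification over a (context, question, answer)
# triple, in the spirit of LettuceDetect. Checkpoint name and 2-label scheme are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "answerdotai/ModernBERT-base"   # assumed long-context encoder; any encoder works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)  # 0=supported, 1=hallucinated

context = "The Eiffel Tower is 330 metres tall and located in Paris."
question = "How tall is the Eiffel Tower?"
answer = "It is 330 metres tall and was built in 1850."   # the date is unsupported by the context

# Concatenate the triple; a trained model would label the answer tokens only.
text = f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"
enc = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits                 # (1, seq_len, 2)
pred = logits.argmax(-1)[0]

# Tokens predicted as label 1 are flagged as hallucinated (random here, meaningful after fine-tuning).
flagged = [tokenizer.decode(int(t)) for t, p in zip(enc["input_ids"][0], pred) if int(p) == 1]
print(flagged)
```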
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
7 Apr 2025 · Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
Paper: https://arxiv.org/pdf/2504.08791v1.pdf
Code: https://github.com/lizonghang/prima.cpp
https://t.iss.one/DataScienceT
The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (#LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand #GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's #CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as #Llama 3, #DeepSeek R1, #Qwen 2.5, and #QwQ to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.
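The paper's Halda algorithm solves the layer-to-device assignment optimally under a full cost model (compute, communication, disk, memory). As a rough intuition pump only, here is a naive greedy sketch that spreads transformer layers across heterogeneous home devices; the device list and per-layer costs are made-up assumptions, and this is not prima.cpp's algorithm.

```python
# Naive greedy sketch of assigning model layers to heterogeneous home devices.
# A rough stand-in for the intuition behind layer assignment, not the Halda solver.

DEVICES = {  # made-up per-layer compute cost (ms) and memory budget (number of layers)
    "laptop-gpu":  {"ms_per_layer": 6.0,  "max_layers": 20},
    "desktop-cpu": {"ms_per_layer": 15.0, "max_layers": 40},
    "phone":       {"ms_per_layer": 40.0, "max_layers": 10},
    "mini-pc":     {"ms_per_layer": 20.0, "max_layers": 30},
}

def assign_layers(n_layers: int) -> dict:
    """Greedily give each next layer to the device whose running time stays lowest."""
    load = {d: 0.0 for d in DEVICES}
    used = {d: 0 for d in DEVICES}
    plan = {d: [] for d in DEVICES}
    for layer in range(n_layers):
        # pick the device that would finish earliest and still has memory budget left
        candidates = [d for d in DEVICES if used[d] < DEVICES[d]["max_layers"]]
        best = min(candidates, key=lambda d: load[d] + DEVICES[d]["ms_per_layer"])
        plan[best].append(layer)
        load[best] += DEVICES[best]["ms_per_layer"]
        used[best] += 1
    return plan

plan = assign_layers(80)  # e.g. an 80-layer 70B-scale model
for dev, layers in plan.items():
    print(dev, len(layers), "layers")
```

Unlike this toy, the real system also models link bandwidth, disk, and OS behavior, and hides disk loading with piped-ring parallelism and prefetching.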
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
22 Nov 2024 · Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Zhizhong Su
Paper: https://arxiv.org/pdf/2411.14869v2.pdf
Code: https://github.com/HorizonRobotics/BIP3D
HF: https://huggingface.co/spaces/AGC2024/visual-grounding-2024
Dataset: 10,000 People - Human Pose Recognition Data
https://t.iss.one/DataScienceT
In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, #BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
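The image-centric idea, fusing 2D foundation-model features with an explicit 3D position encoding, can be sketched roughly as follows. The camera intrinsics, depth sampling, and MLP below are illustrative assumptions, not BIP3D's actual spatial enhancer.

```python
# Rough sketch: add an explicit 3D position encoding to 2D image features by
# back-projecting each feature-map cell along its camera ray (illustrative, not BIP3D code).
import torch
import torch.nn as nn

def pixel_rays(h: int, w: int, fx: float, fy: float, cx: float, cy: float) -> torch.Tensor:
    """Unit-depth ray directions in camera coordinates for an h x w feature map."""
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.stack([(u - cx) / fx, (v - cy) / fy, torch.ones(h, w)], dim=-1)  # (h, w, 3)

class Image3DPositionEncoder(nn.Module):
    def __init__(self, d_model: int = 256, n_depths: int = 8, max_depth: float = 10.0):
        super().__init__()
        self.depths = torch.linspace(0.5, max_depth, n_depths)   # candidate depths per ray
        self.mlp = nn.Sequential(nn.Linear(n_depths * 3, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, feats: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        # feats: (h, w, d_model) 2D backbone features; rays: (h, w, 3) camera rays
        pts = rays.unsqueeze(-2) * self.depths.view(1, 1, -1, 1)  # (h, w, n_depths, 3) 3D samples
        pos = self.mlp(pts.flatten(-2))                           # (h, w, d_model) position encoding
        return feats + pos                                        # position-aware image features

h, w, d = 16, 16, 256
enc = Image3DPositionEncoder(d_model=d)
out = enc(torch.randn(h, w, d), pixel_rays(h, w, fx=200.0, fy=200.0, cx=8.0, cy=8.0))
print(out.shape)  # torch.Size([16, 16, 256])
```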
Forwarded from Python | Machine Learning | Coding | R
LLM Engineer's Handbook (2024)
Unlock the Future of AI with the LLM Engineer's Handbook
Step into the world of Large Language Models (LLMs) with this comprehensive guide that takes you from foundational concepts to deploying advanced applications using LLMOps best practices. Whether you're an AI engineer, NLP professional, or LLM enthusiast, this book offers practical insights into designing, training, and deploying LLMs in real-world scenarios.
Why Choose the LLM Engineer's Handbook?
Comprehensive Coverage: Learn about data engineering, supervised fine-tuning, and deployment strategies.
Hands-On Approach: Implement MLOps components through practical examples, including building an LLM-powered twin that's cost-effective, scalable, and modular.
Cutting-Edge Techniques: Explore inference optimization, preference alignment, and real-time data processing to apply LLMs effectively in your projects.
Real-World Applications: Move beyond isolated Jupyter notebooks and focus on building production-grade end-to-end LLM systems.
Limited-Time Offer
Originally priced at $55, the LLM Engineer's Handbook is now available for just $25, a 55% discount! This special offer is available for a limited quantity, so act fast to secure your copy.
Who Should Read This Book?
This handbook is ideal for AI engineers, NLP professionals, and LLM engineers looking to deepen their understanding of LLMs. A basic knowledge of LLMs, Python, and AWS is recommended. Whether you're new to AI or seeking to enhance your skills, this book provides comprehensive guidance on implementing LLMs in real-world scenarios.
Don't miss this opportunity to advance your expertise in LLM engineering. Secure your discounted copy today and take the next step in your AI journey!
Buy book: https://www.patreon.com/DataScienceBooks/shop/llm-engineers-handbook-2024-1582908
ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
Github: https://github.com/cdjkim/respec
Paper: https://arxiv.org/abs/2504.14875v1
Dataset: https://paperswithcode.com/task/informativeness
Forwarded from Data Science Premium (Books & Courses)
Join our WhatsApp channel
Tell your friends
https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
6 May 2025 · Andrew Zhao, Yiran Wu, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
Paper: https://arxiv.org/pdf/2505.03335v2.pdf
Code: https://arxiv.org/pdf/2505.03335v2.pdf
https://t.iss.one/DataScienceT
Reinforcement learning with verifiable rewards (#RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where #AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (#AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall #SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
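The key mechanism is using a code executor as the single source of verifiable reward: a proposed program plus an input defines a task, executing it yields the ground truth, and the solver is rewarded only when its prediction matches. A stripped-down sketch of that reward loop (stub data in place of real model calls, and no sandboxing):

```python
# Stripped-down sketch of code-executor-based verifiable reward, in the spirit of
# Absolute Zero. The proposer/solver are stubs; in AZR a single LLM plays both roles.

def execute(program: str, inp) -> object:
    """Run a proposed program and return f(inp); the executor is the source of ground truth."""
    namespace: dict = {}
    exec(program, namespace)          # NOTE: sandbox this in any real system
    return namespace["f"](inp)

def reward(program: str, inp, predicted_output) -> float:
    """Verifiable reward: 1.0 iff the solver's prediction matches the executed result."""
    try:
        return 1.0 if execute(program, inp) == predicted_output else 0.0
    except Exception:
        return 0.0                    # invalid programs give no learning signal

# A self-proposed task: the proposer emits (program, input); the solver predicts the output.
task_program = "def f(x):\n    return sorted(x)[::-1]"
task_input = [3, 1, 2]
solver_prediction = [3, 2, 1]         # stub for the solver model's answer

print(reward(task_program, task_input, solver_prediction))  # 1.0
```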
FastVLM: Efficient Vision Encoding for Vision Language Models
17 Dec 2024 · Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
Paper: https://arxiv.org/pdf/2412.13303v1.pdf
Code: https://github.com/apple/ml-fastvlm
Datasets: GQA - TextVQA - ScienceQA
https://t.iss.one/DataScienceT
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (#VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as #ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and #LLM size, we introduce #FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates #FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVa-OneVision at the highest resolution (1152×1152), FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same 0.5B LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller.
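The latency argument is easy to check numerically: for a plain ViT-style encoder, the number of visual tokens grows roughly quadratically with input resolution, and self-attention cost grows quadratically in the token count, which is exactly the pressure FastViTHD is designed to relieve. The patch size and resolutions below are illustrative, not FastVLM's actual configuration.

```python
# Back-of-the-envelope: visual token count vs. input resolution for a plain ViT-style
# encoder with 14x14 patches. Illustrative numbers only, not FastViTHD's configuration.
PATCH = 14
BASE_TOKENS = (336 // PATCH) ** 2                # token count at the 336px baseline

for res in (336, 672, 1152):
    tokens = (res // PATCH) ** 2                 # tokens grow ~quadratically with resolution
    attn_cost = tokens ** 2                      # self-attention is quadratic in token count
    print(f"{res:>4}px -> {tokens:>5} tokens, relative attention cost {attn_cost / BASE_TOKENS ** 2:.1f}x")
```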
Forwarded from Python | Machine Learning | Coding | R
Your balance has been credited with $4,000; the owner of the channel wants to contact you!
Dear subscriber, we would like to thank you very much for supporting our channel, and as a token of our gratitude we would like to provide you with free access to Lisa's investor channel, with the help of which you can earn today
t.iss.one/Lisainvestor
Be sure to take advantage of our gift, admission is free, don't miss the opportunity, change your life for the better.
You can follow the link :
https://t.iss.one/+0DQSCADFTUA3N2Qx
Forwarded from Python | Machine Learning | Coding | R
Join our WhatsApp channel:
https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
FastVLM: Efficient Vision Encoding for Vision Language Models
17 Dec 2024 · Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
Paper: https://arxiv.org/pdf/2412.13303v1.pdf
Code: https://github.com/apple/ml-fastvlm
Datasets: GQA - TextVQA - ScienceQA
https://t.iss.one/DataScienceT
WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as #ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVa-OneVision at the highest resolution (1152×1152), #FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same 0.5B #LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller.
Generating Physically Stable and Buildable LEGO Designs from Text
8 May 2025 · Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu
Paper: https://arxiv.org/pdf/2505.05469v1.pdf
Code: https://github.com/AvaLovelace1/LegoGPT
Quick start: https://huggingface.co/spaces/cmu-gil/LegoGPT-Demo
Dataset: StableText2Lego
WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
We introduce #LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/LegoGPT/
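The inference-time trick, pruning next-brick candidates that violate constraints and rolling back when generation gets stuck, can be sketched generically. The proposal stub and the toy support/collision checks below are hypothetical placeholders, not LegoGPT's physics checker.

```python
# Generic sketch of autoregressive generation with a validity check and rollback,
# in the spirit of LegoGPT's physics-aware inference. The proposal stub and the
# toy constraints are hypothetical placeholders, not the paper's constraint solver.
import random

def propose_bricks(state: list) -> list:
    """Stub for the model's ranked next-brick candidates (x, y, z) on a 6x6 grid."""
    return [(random.randint(0, 5), random.randint(0, 5), len(state) and random.randint(0, 3))
            for _ in range(8)]

def is_valid(state: list, brick) -> bool:
    """Toy constraints: no duplicate position; every brick above z=0 needs support below it."""
    x, y, z = brick
    if brick in state:
        return False
    return z == 0 or (x, y, z - 1) in state

def generate(n_bricks: int = 10, max_rollbacks: int = 50) -> list:
    state, rollbacks = [], 0
    while len(state) < n_bricks:
        candidates = [b for b in propose_bricks(state) if is_valid(state, b)]  # prune infeasible
        if candidates:
            state.append(candidates[0])          # take the top remaining candidate
        elif state and rollbacks < max_rollbacks:
            state.pop()                          # roll back the last brick and retry
            rollbacks += 1
        else:
            break
    return state

print(generate())
```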
Saint-Étienne University has introduced a new 3D human body pose estimation pipeline designed specifically for dance analysis.
Check out the project page featuring results and an interactive demo!
#DanceAnalysis #3DPoseEstimation #DeepLearning #HumanPose #AI #MachineLearning #ComputerVisionResearch
Forwarded from Python | Machine Learning | Coding | R
This channel is for Programmers, Coders, and Software Engineers.
- Python
- Data Science
- Machine Learning
- Data Visualization
- Artificial Intelligence
- Data Analysis
- Statistics
- Deep Learning
- Programming Languages
https://t.iss.one/addlist/8_rRW2scgfRhOTc0
https://t.iss.one/Codeprogrammer
Flow-GRPO: Training Flow Matching Models via Online RL
8 May 2025 · Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
Paper: https://arxiv.org/pdf/2505.05470v2.pdf
Code: https://github.com/yifan123/flow_grpo
HF: https://huggingface.co/spaces/jieliu/SD3.5-M-Flow-GRPO
Datasets: DrawBench - GenEval - T2I-CompBench
Notes: Ranked #1 on Text-to-Image Generation on GenEval
Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
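The first ingredient, turning the deterministic sampling ODE into an SDE with the same per-timestep marginals so that RL can explore, boils down to adding noise while compensating with a score-dependent drift. Here is a deliberately tiny 1-D illustration where the marginal is N(0, 1) at every step (so the ODE velocity is zero and the score is -x); it is a schematic of the idea, not the paper's SD3.5 formulation.

```python
# Toy 1-D illustration of the ODE-to-SDE idea: inject noise for RL exploration while
# preserving the per-step marginal distribution. Here the marginal is N(0, 1) at every t,
# so the ODE velocity is zero and the score is -x. Schematic only, not Flow-GRPO's exact math.
import numpy as np

rng = np.random.default_rng(0)
sigma, dt, steps = 0.8, 0.01, 500
x = rng.standard_normal(10_000)       # samples from the t=0 marginal N(0, 1)

for _ in range(steps):
    score = -x                        # grad log p(x) for a standard normal
    drift = 0.5 * sigma**2 * score    # compensating drift that preserves the marginal
    x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)  # Euler-Maruyama step

# The ODE (zero velocity) would leave samples frozen; the SDE moves them stochastically,
# yet the marginal stays approximately standard normal:
print(round(x.mean(), 3), round(x.std(), 3))   # ~0.0, ~1.0
```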
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
9 May 2025 · Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
Paper: https://arxiv.org/pdf/2505.06111v2.pdf
Code: https://github.com/opendrivelab/univla
Datasets: R2R - VLN-CE - Open-X-Embodiment
Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
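The latent action model idea, compressing the change between consecutive frame features into a small discrete code that stands in for the action so that raw video becomes training data, can be sketched as a tiny vector-quantized transition model. Feature size, codebook size, and the plain nearest-neighbour quantization below are illustrative assumptions, not UniVLA's implementation.

```python
# Tiny sketch of a latent action model: quantize the transition between consecutive
# frame features into a discrete "latent action" code (illustrative, not UniVLA's model).
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, feat_dim: int = 384, n_codes: int = 16, code_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
        self.codebook = nn.Embedding(n_codes, code_dim)     # discrete latent actions
        self.decoder = nn.Sequential(nn.Linear(feat_dim + code_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def forward(self, f_t: torch.Tensor, f_next: torch.Tensor):
        z = self.encoder(torch.cat([f_t, f_next], dim=-1))      # encode the frame-to-frame transition
        dists = torch.cdist(z, self.codebook.weight)            # nearest-neighbour quantization
        action_id = dists.argmin(dim=-1)
        z_q = self.codebook(action_id)
        f_pred = self.decoder(torch.cat([f_t, z_q], dim=-1))    # predict the next frame's features
        return action_id, f_pred

model = LatentActionModel()
f_t, f_next = torch.randn(4, 384), torch.randn(4, 384)          # stand-ins for DINO-style features
action_id, f_pred = model(f_t, f_next)
recon_loss = nn.functional.mse_loss(f_pred, f_next)             # training signal from video alone
# (a real VQ model would also need a straight-through estimator and commitment loss)
print(action_id.tolist(), recon_loss.item())
```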
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
24 Apr 2025 · Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Paper: https://arxiv.org/pdf/2504.17192v2.pdf
Code: https://github.com/going-doer/paper2code
WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
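The three-stage planning / analysis / generation pipeline is essentially a chain of specialized LLM calls. The `call_llm` helper and file list below are hypothetical stubs for illustration, not PaperCoder's actual agent interface.

```python
# Schematic of a planning -> analysis -> generation pipeline in the spirit of PaperCoder.
# `call_llm` is a hypothetical stub standing in for whatever LLM client you use.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; each 'role' is a specialized agent prompt."""
    return f"[{role} output for: {prompt[:40]}...]"

def paper_to_repo(paper_text: str) -> dict:
    # Stage 1: planning - roadmap, architecture, file dependencies, config files.
    plan = call_llm("planner", f"Draft a repo plan (files, dependencies, configs) for:\n{paper_text}")
    # Stage 2: analysis - interpret implementation-specific details given the plan.
    analysis = call_llm("analyst", f"Given this plan, extract implementation details:\n{plan}")
    # Stage 3: generation - produce modular, dependency-aware code, file by file.
    files = {}
    for filename in ("config.yaml", "model.py", "train.py"):     # illustrative file list
        files[filename] = call_llm("coder", f"Write {filename} following:\n{plan}\n{analysis}")
    return files

repo = paper_to_repo("Abstract: We propose a multi-agent framework ...")
for name, content in repo.items():
    print(name, "->", content)
```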