Data Science | Machine Learning with Python for Researchers
31.5K subscribers
1.58K photos
102 videos
22 files
1.86K links
Admin: @HusseinSheikho

The Data Science and Python channel is for researchers and advanced programmers

Buy ads: https://telega.io/c/dataScienceT
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation

🖥 Github: https://github.com/EnVision-Research/Kiss3DGen

📕 Paper: https://arxiv.org/abs/2503.01370v1

🌟 Dataset: https://paperswithcode.com/dataset/nerf
MonSter: Marry Monodepth to Stereo Unleashes Power

15 Jan 2025 · Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen, Zhipeng Cai, Xin Yang

Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-branch architecture to iteratively improve each other. Confidence-based guidance adaptively selects reliable stereo cues for monodepth scale-shift recovery. The refined monodepth in turn guides stereo matching effectively in ill-posed regions. Such iterative mutual enhancement enables MonSter to evolve monodepth priors from coarse object-level structures to pixel-level geometry, fully unlocking the potential of stereo matching. As shown in Fig. 1, MonSter ranks 1st across the five most commonly used leaderboards -- SceneFlow, KITTI 2012, KITTI 2015, Middlebury, and ETH3D -- achieving up to a 49.5% improvement (Bad 1.0 on ETH3D) over the previous best method. Comprehensive analysis verifies the effectiveness of MonSter in ill-posed regions. In terms of zero-shot generalization, MonSter significantly and consistently outperforms the state of the art across the board. The code is publicly available at: https://github.com/Junda24/MonSter.
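
To make the mutual-enhancement idea concrete, here is a minimal NumPy sketch of confidence-guided scale-shift recovery and fusion: high-confidence stereo pixels anchor a least-squares scale/shift fit for the relative monodepth, and the aligned monodepth then backs up stereo wherever confidence is low. The function names, the simple weighted fusion rule, and the synthetic data are illustrative assumptions, not the authors' iterative dual-branch implementation.

```python
# Minimal sketch of confidence-guided scale-shift recovery (assumed form).
import numpy as np

def align_monodepth_to_stereo(mono_inv_depth, stereo_disp, confidence, thresh=0.8):
    """Fit scale and shift mapping relative monocular inverse depth onto
    stereo disparity, using only high-confidence stereo pixels."""
    mask = confidence > thresh
    x, y = mono_inv_depth[mask], stereo_disp[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * mono_inv_depth + shift

def fuse(stereo_disp, aligned_mono, confidence):
    """Lean on the aligned monodepth prior where stereo confidence is low
    (occlusions, textureless areas) and on stereo where it is high."""
    return confidence * stereo_disp + (1.0 - confidence) * aligned_mono

# Toy usage with synthetic maps
rng = np.random.default_rng(0)
mono = rng.uniform(0.1, 1.0, (64, 64))                   # relative inverse depth
disp = 2.0 * mono + 0.5 + rng.normal(0, 0.01, (64, 64))  # "metric" disparity
conf = rng.uniform(0.0, 1.0, (64, 64))                   # stereo confidence
refined = fuse(disp, align_monodepth_to_stereo(mono, disp, conf), conf)
print(refined.shape)  # (64, 64)
```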


Paper: https://arxiv.org/pdf/2501.08643v1.pdf

Code: https://github.com/junda24/monster

Datasets: KITTI - TartanAir

https://t.iss.one/DataScienceT ✉️
๐ŸŽโ—๏ธTODAY FREEโ—๏ธ๐ŸŽ

Entry to our VIP channel is completely free today. Tomorrow it will cost $500! 🔥

JOIN 👇

https://t.iss.one/+1TWrwFRud4U1YTVi
๐Ÿ‘1
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

🖥 Github: https://github.com/yunncheng/MMRL

📕 Paper: https://arxiv.org/abs/2503.08497v1

🌟 Dataset: https://paperswithcode.com/dataset/imagenet-s

https://t.iss.one/DataScienceT 💫
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

3 Mar 2025 · Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, Wei Xue


Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
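
As a rough illustration of the decoupled, single-stream token layout described above, the sketch below flattens text tokens, a fixed-length block of global speaker tokens, and variable-length semantic tokens into one sequence an autoregressive LLM could model. The block size, separator ids, and helper names are assumptions for illustration, not BiCodec's actual vocabulary or Spark-TTS's prompt format.

```python
# Illustrative token layout for a decoupled codec (assumed sizes and ids).
from dataclasses import dataclass
from typing import List

N_GLOBAL_TOKENS = 32  # fixed-length speaker/attribute block (assumed size)

@dataclass
class BiCodecTokens:
    semantic: List[int]  # low-bitrate content tokens; length varies with duration
    global_: List[int]   # fixed-length speaker-attribute tokens

    def __post_init__(self):
        assert len(self.global_) == N_GLOBAL_TOKENS, "global block must be fixed-length"

def build_lm_sequence(text_ids: List[int], tokens: BiCodecTokens) -> List[int]:
    """Flatten text + global + semantic tokens into a single stream,
    using placeholder sentinel ids as separators."""
    TEXT_END, GLOBAL_END = -1, -2  # placeholder separator ids
    return text_ids + [TEXT_END] + tokens.global_ + [GLOBAL_END] + tokens.semantic

# Toy usage: 5 text tokens, a fixed global block, 12 semantic tokens
seq = build_lm_sequence(
    list(range(5)),
    BiCodecTokens(semantic=list(range(100, 112)), global_=[7] * N_GLOBAL_TOKENS),
)
print(len(seq))  # 5 + 1 + 32 + 1 + 12 = 51
```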


Paper: https://arxiv.org/pdf/2503.01710v1.pdf

Code: https://github.com/sparkaudio/spark-tts

#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents #GPT4

https://t.iss.one/DataScienceT
๐Ÿ‘6
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

7 Mar 2025 · Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu

Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
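
A hedged PyTorch sketch of the dual-stream idea above: a small trainable context encoder reads the masked-video tokens and its per-layer features are added residually into a frozen pre-trained backbone. Layer counts, dimensions, and the injection rule are illustrative assumptions, not the released VideoPainter architecture.

```python
# Context-encoder injection into a frozen backbone (assumed shapes and rule).
import torch
import torch.nn as nn

class TinyContextEncoder(nn.Module):
    def __init__(self, dim=64, n_layers=2):  # small fraction of backbone depth
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])

    def forward(self, masked_tokens):
        feats, h = [], masked_tokens
        for layer in self.layers:
            h = torch.relu(layer(h))
            feats.append(h)           # per-layer context features to inject
        return feats

class FrozenBackboneWithInjection(nn.Module):
    def __init__(self, dim=64, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        for p in self.parameters():
            p.requires_grad_(False)   # backbone stays frozen; only the encoder trains

    def forward(self, tokens, context_feats):
        h = tokens
        for i, layer in enumerate(self.layers):
            h = torch.relu(layer(h))
            if i < len(context_feats):
                h = h + context_feats[i]  # residual, plug-and-play injection
        return h

# Toy usage: batch of 2 clips, 16 tokens each, 64-dim features
tokens = torch.randn(2, 16, 64)
enc, backbone = TinyContextEncoder(), FrozenBackboneWithInjection()
print(backbone(tokens, enc(tokens)).shape)  # torch.Size([2, 16, 64])
```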


Paper: https://arxiv.org/pdf/2503.05639v2.pdf

Code: https://github.com/TencentARC/VideoPainter

Datasets: VPData - VPBench

https://t.iss.one/DataScienceT 🎙
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

10 Mar 2025 · Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang

Enhancing reasoning in Large Multimodal Models (#LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
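
A minimal sketch of the two ingredients named above, under stated assumptions: a rule-based reward that scores a rollout by exact answer match (no learned reward model), and a trivial scheduler that switches from text-only data (FRE) to multimodal data (MGT). The <answer> tag format and the step count are hypothetical, not the paper's exact setup.

```python
# Rule-based reward + two-stage data schedule (assumed tag format and steps).
import re

def rule_based_reward(model_output: str, reference: str) -> float:
    """Extract the last <answer>...</answer> span and compare it to the
    reference answer string; return 1.0 on exact match, else 0.0."""
    matches = re.findall(r"<answer>(.*?)</answer>", model_output, flags=re.S)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0

def two_stage_schedule(step: int, fre_steps: int = 1000) -> str:
    """FRE: text-only RL first; MGT: switch to multimodal data afterwards."""
    return "text_only" if step < fre_steps else "multimodal"

print(rule_based_reward("reasoning ... <answer>42</answer>", "42"))  # 1.0
print(two_stage_schedule(10), two_stage_schedule(2000))              # text_only multimodal
```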


Paper: https://arxiv.org/pdf/2503.07536v1.pdf

Code: https://github.com/tidedra/lmm-r1

https://t.iss.one/DataScienceT 🧡
โšก๏ธ TxAgent: An AI agent for therapeutic reasoning across a universe of tools

๐Ÿ–ฅ Github: https://github.com/mims-harvard/TxAgent

๐Ÿ“• Paper: https://arxiv.org/abs/2503.10970v1

๐ŸŒŸ Methods: https://paperswithcode.com/method/align
Executable Code Actions Elicit Better LLM Agents

1 Feb 2024 · Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji

Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating #JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source #LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with a #Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug.
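
A minimal sketch of a CodeAct-style interaction loop as described above: the agent's action is a Python snippet, an interpreter executes it, and the captured output (or error) becomes the next observation while variables persist across turns. The `llm` function below is a stub standing in for a real model call, not CodeActAgent itself.

```python
# Executable code actions with a persistent interpreter namespace (sketch).
import io
import contextlib

def run_code_action(code: str, namespace: dict) -> str:
    """Execute a code action and return captured stdout as the observation."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)  # state persists in `namespace` across turns
    except Exception as e:         # errors are observations the agent can react to
        return f"Error: {e!r}"
    return buf.getvalue()

def llm(history):
    """Stub: a real agent would prompt an LLM with the interaction history."""
    return "x = sum(range(10))\nprint(x)"

history, namespace = [], {}
for _ in range(1):  # multi-turn in practice
    action = llm(history)
    observation = run_code_action(action, namespace)
    history.append((action, observation))
print(history[-1][1].strip())  # 45
```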


Paper: https://arxiv.org/pdf/2402.01030v4.pdf

Code:
https://github.com/epfllm/megatron-llm
https://github.com/xingyaoww/code-act

Datasets: MMLU - GSM8K - HumanEval - MATH

https://t.iss.one/DataScienceT ⚠️
โšก๏ธ MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

๐Ÿ–ฅ Github: https://github.com/hustvl/MaTVLM

๐Ÿ“• Paper: https://arxiv.org/abs/2503.13440v1

๐ŸŒŸ Methods: https://paperswithcode.com/method/speed
PiEEG kit - bioscience Lab in home for your Brain and Body

🖥 Github: https://github.com/pieeg-club/PiEEG_Kit

📕 Paper: https://arxiv.org/abs/2503.13482

🌟 Task: https://paperswithcode.com/task/eeg-1
FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

🖥 Github: https://github.com/nick7nlp/FastCuRL

📕 Paper: https://arxiv.org/abs/2503.17287v1

🌟 Tasks: https://paperswithcode.com/task/language-modeling
Greetings.
As part of our research, we plan to write a review article in the field of pathology. Colleagues interested in the 2nd or 3rd authorship position on this topic are welcome to participate.

✅ Approximate start date: April 10th.

Journal: Scientific Reports, https://www.nature.com/srep/

Price:
2nd author: $400
3rd author: $300

I will provide full guidance and explain how to write each section.

@Raminmousa
@Machine_learn
@Paper4money
๐Ÿ‘4โค1
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

20 Mar 2025 · Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu

Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
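
A hedged PyTorch sketch of residual identity injection in the spirit of what InfuseNet is described as doing: a face-identity embedding is projected per block and added as a residual to the base model's hidden states, leaving the backbone weights untouched. The dimensions and the simple per-block linear projection are assumptions for illustration only.

```python
# Residual identity injection into a DiT-like backbone (assumed shapes).
import torch
import torch.nn as nn

class ResidualIdentityInjector(nn.Module):
    def __init__(self, id_dim=512, hidden_dim=64, n_blocks=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(id_dim, hidden_dim) for _ in range(n_blocks)])

    def forward(self, block_idx, hidden, id_embed):
        # Residual add keeps the base model's generation behaviour intact
        return hidden + self.proj[block_idx](id_embed).unsqueeze(1)

# Toy usage: 2 images, 16 latent tokens, 64-dim hidden, 512-dim identity embedding
inj = ResidualIdentityInjector()
hidden = torch.randn(2, 16, 64)
id_embed = torch.randn(2, 512)
print(inj(0, hidden, id_embed).shape)  # torch.Size([2, 16, 64])
```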


Paper: https://arxiv.org/pdf/2503.16418v1.pdf

Code: https://github.com/bytedance/infiniteyou

Dataset: 10,000 People - Human Pose Recognition Data

https://t.iss.one/DataScienceT ⚠️
LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

13 Mar 2025 · Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, GuanYing Chen, Zilong Dong, Liefeng Bo

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
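
A hedged sketch of the multi-scale head-feature aggregation idea mentioned above: pool a head-region feature map at several resolutions and concatenate the flattened results into one descriptor. The scales and shapes are illustrative assumptions, not the paper's exact pyramid encoder.

```python
# Multi-scale pooling of a head-region feature map (assumed scales).
import torch
import torch.nn.functional as F

def pyramid_encode(head_feat, scales=(1, 2, 4)):
    """head_feat: (B, C, H, W) features of the head crop -> (B, C * sum(s*s))."""
    pooled = [
        F.adaptive_avg_pool2d(head_feat, (s, s)).flatten(1)  # (B, C*s*s)
        for s in scales
    ]
    return torch.cat(pooled, dim=1)  # concatenated multi-scale descriptor

feat = torch.randn(2, 32, 16, 16)
print(pyramid_encode(feat).shape)  # torch.Size([2, 672]) = 32 * (1 + 4 + 16)
```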


Paper: https://arxiv.org/pdf/2503.10625v1.pdf

Code: https://github.com/aigc3d/LHM

https://t.iss.one/DataScienceT ⚠️
Long-Context Autoregressive Video Modeling with Next-Frame Prediction

25 Mar 2025 · YuChao Gu, Weijia Mao, Mike Zheng Shou

Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.
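
A hedged sketch of the temporal-decay idea attributed to FlexRoPE above, written here as an additive attention bias that grows with the frame distance between query and key tokens. The linear per-frame penalty and the decay rate are assumptions for illustration; the paper's actual formulation on top of RoPE may differ.

```python
# Additive temporal-decay bias for attention over video frames (assumed form).
import torch

def temporal_decay_bias(frame_ids_q, frame_ids_k, decay_rate=0.1):
    """Return a (Lq, Lk) bias: 0 for same-frame pairs, increasingly negative
    with temporal distance; add it to attention logits before softmax."""
    dist = (frame_ids_q[:, None] - frame_ids_k[None, :]).abs().float()
    return -decay_rate * dist

# Toy usage: 3 query tokens from frames [0, 1, 2], 6 key tokens from frames [0..5]
q_ids, k_ids = torch.arange(3), torch.arange(6)
logits = torch.randn(3, 6) + temporal_decay_bias(q_ids, k_ids)
attn = torch.softmax(logits, dim=-1)
print(attn.shape)  # torch.Size([3, 6])
```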


Paper: https://arxiv.org/pdf/2503.19325v1.pdf

Code: https://github.com/showlab/FAR

Dataset: UCF101

Ranked #2 on Video Generation on UCF-101

https://t.iss.one/DataScienceT ⚠️
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

🖥 Github: https://github.com/devoallen/awesome-reasoning-economy-papers

📕 Paper: https://arxiv.org/abs/2503.24377v1
This channel is for programmers, coders, and software engineers.

0๏ธโƒฃ Python
1๏ธโƒฃ Data Science
2๏ธโƒฃ Machine Learning
3๏ธโƒฃ Data Visualization
4๏ธโƒฃ Artificial Intelligence
5๏ธโƒฃ Data Analysis
6๏ธโƒฃ Statistics
7๏ธโƒฃ Deep Learning
8๏ธโƒฃ programming Languages

โœ… https://t.iss.one/addlist/8_rRW2scgfRhOTc0

โœ… https://t.iss.one/Codeprogrammer
๐Ÿ™๐Ÿ’ธ 500$ FOR THE FIRST 500 WHO JOIN THE CHANNEL! ๐Ÿ™๐Ÿ’ธ

Join our channel today for free! Tomorrow it will cost 500$!

https://t.iss.one/+vhF2zNz5GBw3NTU1

You can join at this link! 👆
๐Ÿ‘3