Article Title:
MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments
Article Date: 1 Feb 2024
Article Description:
With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior works on embodied intelligence typically encode scene or historical memory in an unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module, facilitating the integration of embodied control with large models through the visual-language memory of scenes. This capability enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of the large language model. In this virtual environment, we conduct several experiments, utilizing multiple large models through zero-shot learning, and carefully design scenarios for various situations. The experimental results showcase the promising performance of our MEIA in various embodied interactive tasks.
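The post ships no code, but the core idea of a multimodal environment memory, storing paired visual and textual records of scenes and retrieving them to ground an LLM planner, can be sketched roughly as follows. All class and function names here are illustrative assumptions, not the MEIA codebase:

```python
# Minimal sketch of a multimodal environment memory for an embodied planner.
# Names (MultimodalMemory, plan_actions) are illustrative, not the authors' API.
import numpy as np

class MultimodalMemory:
    def __init__(self):
        self.visual_keys = []   # image embeddings of observed scenes
        self.text_values = []   # language descriptions of the same scenes

    def add(self, image_embedding: np.ndarray, caption: str) -> None:
        self.visual_keys.append(image_embedding / np.linalg.norm(image_embedding))
        self.text_values.append(caption)

    def retrieve(self, query_embedding: np.ndarray, k: int = 3) -> list[str]:
        # Cosine similarity between the current observation and stored scenes.
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = np.array([key.dot(q) for key in self.visual_keys])
        top = sims.argsort()[::-1][:k]
        return [self.text_values[i] for i in top]

def plan_actions(task: str, memory: MultimodalMemory, current_obs: np.ndarray) -> str:
    # The retrieved visual-language memory grounds the LLM's action plan.
    context = "\n".join(memory.retrieve(current_obs))
    return f"Scene memory:\n{context}\nTask: {task}\nPlan:"  # fed to an LLM in practice

# Toy usage with random embeddings standing in for a vision encoder.
mem = MultimodalMemory()
rng = np.random.default_rng(0)
mem.add(rng.normal(size=512), "A cafe counter with a coffee machine and cups.")
mem.add(rng.normal(size=512), "A dining table with two chairs near the window.")
print(plan_actions("serve an espresso to the customer", mem, rng.normal(size=512)))
```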
PDF Download Link:
https://arxiv.org/pdf/2402.00290v3.pdf
GitHub:
• https://github.com/hcplab-sysu/causalvlr
Datasets:
• No datasets information available
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
SkillBlender is a hierarchical reinforcement learning framework that uses pretrained primitive skills to efficiently solve diverse loco-manipulation tasks for humanoid robots. (AI-generated summary)

Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: https://usc-gvl.github.io/SkillBlender-web/.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09366
• PDF: https://arxiv.org/pdf/2506.09366
• Project Page: https://usc-gvl.github.io/SkillBlender-web/
• Github: https://usc-gvl.github.io/SkillBlender-web/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
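To make the skill-blending idea above concrete, here is a minimal numpy sketch of a high-level policy weighting the actions of pretrained low-level primitive skills. The linear "skills" and softmax blender are stand-ins, not the paper's trained networks:

```python
# Sketch: a high-level policy blends actions from pretrained primitive skills.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, N_SKILLS = 32, 12, 4

# Pretend these matrices are frozen, goal-conditioned primitive skill policies
# (e.g. walking, reaching, squatting, stepping).
skill_policies = [rng.normal(scale=0.1, size=(OBS_DIM, ACT_DIM)) for _ in range(N_SKILLS)]

# High-level policy: maps the observation (plus task goal) to blending weights.
blender_w = rng.normal(scale=0.1, size=(OBS_DIM, N_SKILLS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def blended_action(obs: np.ndarray) -> np.ndarray:
    weights = softmax(obs @ blender_w)                           # one weight per skill
    skill_actions = np.stack([obs @ W for W in skill_policies])  # (N_SKILLS, ACT_DIM)
    return weights @ skill_actions                               # weighted sum of low-level actions

obs = rng.normal(size=OBS_DIM)
print("blend weights sum to 1:", np.isclose(softmax(obs @ blender_w).sum(), 1.0))
print("blended action shape:", blended_action(obs).shape)
```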
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Article Date: 5 Jun 2025
Article Description:
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
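The SRR decomposition can be pictured as three small stages chained together. The function names and data layout below are illustrative placeholders, not MonkeyOCR's actual API:

```python
# Sketch of a Structure-Recognition-Relation (SRR) style parsing pipeline.
# Each stage is stubbed; in the real system they are model calls.
from dataclasses import dataclass

@dataclass
class Region:
    box: tuple          # "Where is it?"  -> layout analysis
    kind: str           # "What is it?"   -> text / table / formula
    content: str = ""   # filled in by recognition

def detect_structure(page_image) -> list[Region]:
    # Placeholder for the layout model; returns regions with boxes and types.
    return [Region((0, 0, 100, 20), "text"), Region((0, 30, 100, 80), "table")]

def recognize(region: Region, page_image) -> Region:
    # Placeholder for the recognition model applied per region (not full page).
    region.content = f"<recognized {region.kind}>"
    return region

def order_relations(regions: list[Region]) -> list[Region]:
    # "How is it organized?" -> a simple top-to-bottom reading order here.
    return sorted(regions, key=lambda r: r.box[1])

def parse_document(page_image):
    regions = detect_structure(page_image)
    regions = [recognize(r, page_image) for r in regions]
    return order_relations(regions)

print(parse_document(page_image=None))
```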
PDF Download Link:
https://arxiv.org/pdf/2506.05218v1.pdf
GitHub:
• https://github.com/yuliang-liu/monkeyocr
Datasets:
• No datasets information available
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
TradingAgents: Multi-Agents LLM Financial Trading Framework
Article Date: 28 Dec 2024
Article Description:
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.
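A rough sketch of such a role-based multi-agent loop is shown below. The `ask_llm` stub stands in for a real LLM call, and the roles and flow only approximate the paper's framework:

```python
# Sketch of a role-based multi-agent trading round (not the TradingAgents codebase).
def ask_llm(role: str, prompt: str) -> str:
    return f"[{role}] analysis of: {prompt[:40]}..."  # replace with an actual LLM API call

ANALYST_ROLES = ["fundamental analyst", "sentiment analyst", "technical analyst"]

def trading_round(market_snapshot: str) -> str:
    # 1. Specialist analysts each produce a report.
    reports = [ask_llm(role, market_snapshot) for role in ANALYST_ROLES]
    # 2. Bull and bear researchers debate over the pooled reports.
    debate = [ask_llm(side + " researcher", "\n".join(reports)) for side in ("bull", "bear")]
    # 3. A trader synthesizes the debate into a proposed order.
    proposal = ask_llm("trader", "\n".join(debate))
    # 4. A risk manager can veto or resize the position before execution.
    return ask_llm("risk manager", proposal)

print(trading_round("AAPL: price 190, RSI 62, earnings beat, positive news flow"))
```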
PDF Download Link:
https://arxiv.org/pdf/2412.20138v7.pdf
GitHub:
• https://github.com/tauricresearch/tradingagents
Datasets:
• No datasets information available
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning
🔹 Publication Date: Published on Dec 21, 2021
🔹 Abstract:
A fine-tuning solution for generalized few-shot semantic segmentation improves performance beyond meta-learning by addressing saturation and minimizing the performance gap between novel and base categories. (AI-generated summary)

Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes to include testing their ability to remember base classes. While the current state-of-the-art approach is based on meta-learning, it performs poorly and saturates in learning after observing only a few shots. We propose the first fine-tuning solution, and demonstrate that it addresses the saturation problem while achieving state-of-the-art results on two datasets, PASCAL-5i and COCO-20i. We also show that it outperforms existing methods, whether fine-tuning multiple final layers or only the final layer. Finally, we present a triplet loss regularization that shows how to redistribute the balance of performance between novel and base categories so that there is a smaller gap between them.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2112.10982
• PDF: https://arxiv.org/pdf/2112.10982
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
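The basic recipe, freeze most of a pretrained segmentation network and fine-tune only the final layer(s) on the few novel-class shots, looks roughly like this in PyTorch. The backbone and data here are simplified stand-ins, and the paper's triplet-loss regularization for balancing base and novel categories is omitted:

```python
# Sketch: freeze a pretrained segmentation backbone, fine-tune only the classifier head.
import torch
import torch.nn as nn

NUM_BASE, NUM_NOVEL = 15, 5  # e.g. a PASCAL-5i style class split

backbone = nn.Sequential(            # stand-in for a pretrained encoder-decoder
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
classifier = nn.Conv2d(64, NUM_BASE + NUM_NOVEL, 1)  # final 1x1 classification layer

for p in backbone.parameters():      # only the final layer is updated
    p.requires_grad = False

optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# One toy fine-tuning step on a "few-shot" batch of images and masks.
images = torch.randn(2, 3, 64, 64)
masks = torch.randint(0, NUM_BASE + NUM_NOVEL, (2, 64, 64))
logits = classifier(backbone(images))
loss = criterion(logits, masks)
loss.backward()
optimizer.step()
print("fine-tuning loss:", float(loss))
```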
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
🔹 Publication Date: Published on Jul 10
🔹 Abstract:
RLEP, a reinforcement learning framework with experience replay, enhances large language model training by focusing on high-quality examples, leading to faster convergence and improved performance on math-related benchmarks. (AI-generated summary)

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.07451
• PDF: https://arxiv.org/pdf/2507.07451
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset
🔹 Spaces citing this paper:
No spaces found
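The core mechanism, mixing replayed verified trajectories into every fresh mini-batch, can be sketched as below. The buffer layout and mixing ratio are illustrative choices, not the released implementation:

```python
# Sketch: blend newly generated rollouts with replayed verified successes
# in each policy-update mini-batch (the RLEP idea, simplified).
import random

replay_buffer = [  # phase 1: previously collected, verified successful trajectories
    {"prompt": "AIME-style problem 1", "response": "...verified solution...", "reward": 1.0},
    {"prompt": "AIME-style problem 2", "response": "...verified solution...", "reward": 1.0},
]

def generate_rollouts(policy, prompts):
    # Stand-in for sampling fresh responses from the current policy.
    return [{"prompt": p, "response": f"rollout for {p}", "reward": random.random()} for p in prompts]

def build_minibatch(policy, prompts, replay_fraction=0.25):
    fresh = generate_rollouts(policy, prompts)
    n_replay = max(1, int(replay_fraction * len(fresh)))
    replayed = random.sample(replay_buffer, k=min(n_replay, len(replay_buffer)))
    batch = fresh + replayed          # the update step optimizes over both
    random.shuffle(batch)
    return batch

batch = build_minibatch(policy=None, prompts=[f"problem {i}" for i in range(8)])
print(len(batch), "samples in the mixed mini-batch")
```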
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding
🔹 Publication Date: Published on Jul 16
🔹 Abstract:
A large-scale benchmark, MMHU, is proposed for human behavior analysis in autonomous driving, featuring rich annotations and diverse data sources, and benchmarking multiple tasks including motion prediction and behavior question answering. (AI-generated summary)

Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior, such as motion, trajectories, and intention, a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text description for human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks, ranging from motion prediction to motion generation and human behavior question answering, thereby offering a broad evaluation suite. Project page: https://MMHU-Benchmark.github.io.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.12463
• PDF: https://arxiv.org/pdf/2507.12463
• Project Page: https://mmhu-benchmark.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
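To picture what a richly annotated clip might look like, here is a hypothetical per-clip record layout. Field names are guesses for illustration only and do not reflect the benchmark's actual release format:

```python
# Hypothetical per-clip annotation record for a human-behavior driving benchmark.
# Field names are illustrative only, not the MMHU schema.
from dataclasses import dataclass, field

@dataclass
class HumanBehaviorClip:
    clip_id: str
    source: str                      # e.g. "waymo", "youtube", "self_collected"
    trajectory: list[tuple]          # per-frame (x, y) positions of the pedestrian
    motion_caption: str              # text description of the human motion
    intention: str                   # e.g. "crossing", "waiting", "waving"
    critical_behavior: bool          # safety-relevant flag for the ego vehicle
    qa_pairs: list[dict] = field(default_factory=list)  # behavior question answering

clip = HumanBehaviorClip(
    clip_id="clip_000001",
    source="youtube",
    trajectory=[(0.0, 1.0), (0.2, 1.1), (0.4, 1.3)],
    motion_caption="A pedestrian steps off the curb while looking at their phone.",
    intention="crossing",
    critical_behavior=True,
    qa_pairs=[{"q": "Is the pedestrian aware of the vehicle?", "a": "No"}],
)
print(clip.clip_id, clip.intention, len(clip.trajectory), "frames")
```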
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
Embedding Atlas: Low-Friction, Interactive Embedding Visualization
Article Date: 9 May 2025
Article Description:
Embedding projections are popular for visualizing large datasets and models. However, people often encounter "friction" when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms -- including density-based clustering, and automated labeling -- to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas's feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.
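Embedding Atlas's own API is not shown in this post, but the two techniques the description highlights, density-based clustering of a projection plus automated cluster labeling, can be approximated with scikit-learn as a rough illustration (this is not the library's interface):

```python
# Rough illustration of density-based clustering + automatic labeling of an
# embedding projection, in the spirit of the description (not Embedding Atlas's API).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

rng = np.random.default_rng(0)
# Pretend these are 2-D projections of document embeddings, with their texts.
points = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ((0, 0), (5, 5))])
texts = ["credit card fraud report"] * 50 + ["soccer match highlights"] * 50

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(points)

# Label each cluster by its highest-weight TF-IDF terms.
vec = TfidfVectorizer()
tfidf = vec.fit_transform(texts)
terms = np.array(vec.get_feature_names_out())
for c in sorted(set(labels) - {-1}):
    mask = labels == c
    mean_weights = np.asarray(tfidf[mask].mean(axis=0)).ravel()
    top = terms[mean_weights.argsort()[::-1][:2]]
    print(f"cluster {c}: {mask.sum()} points, label guess: {' '.join(top)}")
```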
PDF Download Link:
https://arxiv.org/pdf/2505.06386v1.pdf
GitHub:
• https://github.com/apple/embedding-atlas
Datasets:
• No datasets information available
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
OmniGen2: Exploration to Advanced Multimodal Generation
Article Date: 23 Jun 2025
Article Description:
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
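The key architectural choice, separate decoding pathways for text and image tokens on top of a shared multimodal backbone, can be sketched schematically in PyTorch. Dimensions and module names below are invented for illustration and are not OmniGen2's actual architecture:

```python
# Schematic sketch of decoupled text/image decoding pathways over a shared
# multimodal backbone (dimensions and names are illustrative).
import torch
import torch.nn as nn

class DualPathwayGenerator(nn.Module):
    def __init__(self, d_model=256, text_vocab=32000, image_vocab=8192):
        super().__init__()
        self.backbone = nn.TransformerEncoder(  # stand-in for the understanding model
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # Unshared parameters: one head per modality, fed by the same hidden states.
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, image_vocab)  # predicts decoupled image-tokenizer codes

    def forward(self, hidden_inputs, modality: str):
        h = self.backbone(hidden_inputs)
        return self.text_head(h) if modality == "text" else self.image_head(h)

model = DualPathwayGenerator()
x = torch.randn(1, 16, 256)                  # a toy sequence of multimodal hidden states
print(model(x, "text").shape, model(x, "image").shape)
```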
PDF Download Link:
https://arxiv.org/pdf/2506.18871v2.pdf
GitHub:
• https://github.com/vectorspacelab/omnigen2
Datasets:
• MM-Vet
• GenEval
• MagicBrush
• ImgEdit
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Article Date: 20 May 2025
Article Description:
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
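The analyze-then-parse flow can be sketched as a two-stage pipeline in which stage one predicts layout anchors in reading order and stage two parses each anchor in parallel with a type-specific prompt. Model calls are stubbed here, and prompts and element types are illustrative, not Dolphin's:

```python
# Sketch of an analyze-then-parse document pipeline with parallel element parsing.
from concurrent.futures import ThreadPoolExecutor

def analyze_layout(page_image):
    # Stage 1 stub: a sequence of layout elements in reading order.
    return [("paragraph", (0, 0, 600, 120)), ("table", (0, 140, 600, 300)), ("formula", (0, 320, 600, 360))]

PROMPTS = {
    "paragraph": "Transcribe the text in this region.",
    "table": "Parse this table into HTML.",
    "formula": "Transcribe this formula as LaTeX.",
}

def parse_element(element, page_image):
    kind, box = element
    # Stage 2 stub: the anchor (crop + type-specific prompt) goes back to the model.
    return {"type": kind, "box": box, "content": f"<parsed with prompt: {PROMPTS[kind]}>"}

def parse_page(page_image):
    elements = analyze_layout(page_image)
    with ThreadPoolExecutor() as pool:                      # parallel content parsing
        return list(pool.map(lambda e: parse_element(e, page_image), elements))

for item in parse_page(page_image=None):
    print(item["type"], "->", item["content"])
```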
PDF Download Link:
https://arxiv.org/pdf/2505.14059v1.pdf
GitHub:
• https://github.com/bytedance/dolphin
Datasets:
• PubTabNet
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View
Article Date: 5 Jun 2025
Article Description:
Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.
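The overall loop, reconstruct, render outside the input cone, inpaint consistently in RGB and semantics, then lift back into 3D, can be laid out structurally as below. Every function is a stub; the real method uses Gaussian splatting and coupled RGB/semantic diffusion models:

```python
# High-level structural sketch of the reconstruct -> render -> inpaint -> lift loop.
def init_open_gaussians(sparse_views):
    return {"gaussians": "initialized from sparse views"}

def render_outside_view(scene, view):
    return {"rgb": "partial render", "semantics": "partial semantic map", "mask": "unseen region"}

def rgb_semantic_inpaint(render):
    # Bidirectional control: the RGB diffusion model and the semantic diffusion
    # model condition on each other so the inpainted region stays consistent.
    return {"rgb": "inpainted RGB", "semantics": "inpainted semantics"}

def lift_to_3d(scene, view, inpainted):
    scene.setdefault("updates", []).append((view, inpainted))  # progressive optimization
    return scene

def expand_field_of_view(sparse_views, novel_views):
    scene = init_open_gaussians(sparse_views)
    for view in novel_views:
        render = render_outside_view(scene, view)
        scene = lift_to_3d(scene, view, rgb_semantic_inpaint(render))
    return scene

print(expand_field_of_view(["view_a", "view_b"], ["left_of_a", "right_of_b"]))
```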
PDF Download Link:
https://arxiv.org/pdf/2506.05204v1.pdf
GitHub:
• https://github.com/Yanbo-23/OGGSplat
Datasets:
• S3DIS
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
Facial Appearance Capture at Home with Patch-Level Reflectance Prior
Article Date: 4 Jun 2025
Article Description:
Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.
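The patch-level idea, sample the reflectance map in overlapping patches and blend them into a seamless full-resolution map, can be illustrated with a simple feathered-overlap blend. The patch "sampler" here is a stub standing in for the patch-level diffusion prior, and the linear feather weights are a common choice rather than the paper's exact scheme:

```python
# Sketch: assemble a full-resolution map from overlapping patch samples using
# feathered (linearly weighted) blending.
import numpy as np

def sample_patch(y, x, size, rng):
    return rng.random((size, size))            # stub for one diffusion-sampled patch

def blend_patches(height, width, patch=64, stride=48, seed=0):
    rng = np.random.default_rng(seed)
    out = np.zeros((height, width))
    weight = np.zeros((height, width))
    # Feather mask: highest weight at the patch center, tapering toward the borders.
    ramp = np.minimum(np.arange(1, patch + 1), np.arange(patch, 0, -1))
    feather = np.outer(ramp, ramp).astype(float)
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            out[y:y + patch, x:x + patch] += sample_patch(y, x, patch, rng) * feather
            weight[y:y + patch, x:x + patch] += feather
    return out / np.maximum(weight, 1e-8)

full_map = blend_patches(256, 256)
print(full_map.shape)
```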
PDF Download Link:
https://arxiv.org/pdf/2506.03478v1.pdf
GitHub:
• https://github.com/yxuhan/dora
Datasets:
• NeRF
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
🔹 Publication Date: Published on Jul 15
🔹 Abstract:
DIJA is a framework that exploits safety weaknesses in diffusion-based large language models by constructing adversarial prompts, demonstrating significant vulnerabilities in their alignment mechanisms. (AI-generated summary)

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.11097
• PDF: https://arxiv.org/pdf/2507.11097
• Github: https://github.com/ZichenWen1/DIJA
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
4KAgent: Agentic Any Image to 4K Super-Resolution
🔹 Publication Date: Published on Jul 9
🔹 Abstract:
4KAgent, a unified agentic super-resolution system, enhances low-resolution images to 4K using profiling, perception, and restoration agents, achieving state-of-the-art performance across various imaging domains. (AI-generated summary)

We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.
🔹 Links:
• arXiv Page: https://arxivexplained.com/papers/4kagent-agentic-any-image-to-4k-super-resolution
• PDF: https://arxiv.org/pdf/2507.07105
• Project Page: https://huggingface.co/collections/tonton5093/side-project-67eb59863bd520640f423b9a
• Github: https://4kagent.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
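The execution-reflection loop with a quality-driven expert choice described in the abstract can be sketched as follows. The experts and the scoring function are toy stand-ins; a real system would use trained restoration models and learned IQA metrics such as NIQE or MUSIQ:

```python
# Sketch of a quality-driven expert-selection loop for agentic restoration.
import random

def expert_denoise(img): return img + ["denoised"]
def expert_deblur(img): return img + ["deblurred"]
def expert_upscale_2x(img): return img + ["upscaled_x2"]

EXPERTS = [expert_denoise, expert_deblur, expert_upscale_2x]

def quality_score(img) -> float:
    # Stand-in for perceptual quality assessment of the candidate output.
    return random.random() + 0.5 * img.count("upscaled_x2")

def restore(img, steps=3, seed=0):
    random.seed(seed)
    for step in range(steps):
        # Execution: every expert proposes a candidate; reflection: keep the best one.
        candidates = [expert(list(img)) for expert in EXPERTS]
        img = max(candidates, key=quality_score)
        print(f"step {step}: kept {img[-1]} (score-driven choice)")
    return img

print(restore(["low_res_input"]))
```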
==================================
For more data science resources:
https://t.iss.one/DataScienceT
πΉ Title:
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
πΉ Publication Date: Published on Jun 17
πΉ Abstract:
This study investigates long-context performance of diffusion LLMs compared to auto-regressive LLMs, identifies their unique characteristics, and proposes LongLLaDA, a training-free method for extending context windows. AI-generated summary Large Language Diffusion Models, or diffusion LLMs , have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs . We first identify a unique characteristic of diffusion LLMs , unlike auto-regressive LLMs , they maintain remarkably \textit{ stable perplexity } during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct \textit{ local perception } phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA , a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs . Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs .
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.14429
• PDF: https://arxiv.org/pdf/2506.14429
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
❤3
🔹 Title:
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
🔹 Publication Date: Published on Jul 8
🔹 Abstract:
T-LoRA, a timestep-dependent low-rank adaptation framework, enhances diffusion model personalization with a dynamic fine-tuning strategy and orthogonal initialization, achieving better concept fidelity and text alignment in data-limited settings.
AI-generated summary: While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work, we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
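The two ingredients named in the abstract, timestep-dependent rank-constrained updates and orthogonally initialized adapter factors, can be pictured with a small LoRA-style layer. The sketch below is an illustration under assumed details (a linear rank schedule, the `TimestepLoRALinear` class name, and the dimensions in the usage lines), not the official T-LoRA implementation.

```python
import torch
import torch.nn as nn

class TimestepLoRALinear(nn.Module):
    """Illustrative sketch of a timestep-dependent LoRA layer (not the official
    T-LoRA code): the effective rank of the low-rank update shrinks as the
    diffusion timestep grows, since high timesteps overfit more easily."""

    def __init__(self, base: nn.Linear, rank: int = 16, max_timestep: int = 1000):
        super().__init__()
        self.base = base
        self.rank = rank
        self.max_timestep = max_timestep
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no update at start
        nn.init.orthogonal_(self.A)  # orthogonal init keeps rank components independent

    def forward(self, x: torch.Tensor, timestep: int) -> torch.Tensor:
        # Assumed linear schedule: full rank at t = 0, fewer active components near t = T.
        active = max(1, int(self.rank * (1 - timestep / self.max_timestep)))
        delta = x @ self.A[:active].T @ self.B[:, :active].T
        return self.base(x) + delta

# Hypothetical usage: wrap one projection of a diffusion UNet attention block.
layer = TimestepLoRALinear(nn.Linear(320, 320), rank=8)
out = layer(torch.randn(2, 77, 320), timestep=800)  # high timestep -> low effective rank
```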
🔹 Links:
• arXiv Page: https://arxivexplained.com/papers/t-lora-single-image-diffusion-model-customization-without-overfitting
• PDF: https://arxiv.org/pdf/2507.05964
• Github: https://github.com/ControlGenAI/T-LoRA
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
❤1
🔹 Title:
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
🔹 Publication Date: Published on Jul 2
🔹 Abstract:
Multimodal foundation models, despite being primarily trained on image-text tasks, demonstrate respectable performance across various vision tasks when adapted through prompt chaining, though they fall short compared to specialized models.
AI-generated summary: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc.). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
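Prompt chaining here means decomposing a vision task into a sequence of text-only queries that an API-gated model can answer. As a hedged sketch of that general idea (not the paper's actual evaluation harness), the snippet below narrows a large label set down to a single class over several rounds; `query_vlm` is a hypothetical wrapper around whichever vision-language API is being benchmarked.

```python
# Minimal sketch of prompt chaining for image classification with a text-only
# output interface. `query_vlm(image, prompt) -> str` is a hypothetical stand-in
# for a real API call; the chunked, multi-round narrowing below is a generic
# recipe, not the paper's exact protocol.

def classify_by_prompt_chaining(image, labels, query_vlm, chunk_size=20):
    """Narrow a large label set down to one class over several rounds, because
    long label lists rarely fit (or work well) in a single prompt."""
    candidates = list(labels)
    while len(candidates) > 1:
        survivors = []
        for i in range(0, len(candidates), chunk_size):
            chunk = candidates[i:i + chunk_size]
            prompt = (
                "Which of the following labels best describes the image? "
                "Answer with the label only: " + ", ".join(chunk)
            )
            answer = query_vlm(image, prompt).strip()
            # Keep the model's pick if it is a valid label; otherwise fall back.
            survivors.append(answer if answer in chunk else chunk[0])
        candidates = survivors
    return candidates[0]
```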
🔹 Links:
• arXiv Page: https://arxivexplained.com/papers/how-well-does-gpt-4o-understand-vision-evaluating-multimodal-foundation-models-on-standard-computer-vision-tasks
• PDF: https://arxiv.org/pdf/2507.01955
• Project Page: https://fm-vision-evals.epfl.ch/
• Github: https://github.com/EPFL-VILAB/fm-vision-evals
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
❤2
🔹 Title:
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
🔹 Publication Date: Published on Jul 18
🔹 Abstract:
Franca, an open-source vision foundation model, achieves high performance using a transparent training pipeline and novel clustering and disentanglement techniques.
AI-generated summary: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
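The multi-head clustering projector over nested Matryoshka representations can be pictured as several heads that read increasingly long prefixes of the same feature vector and assign them to increasingly fine codebooks. The sketch below illustrates only that structural idea; the prefix dimensions, cluster counts, and class name are made-up assumptions rather than Franca's actual configuration.

```python
import torch
import torch.nn as nn

class MatryoshkaClusteringProjector(nn.Module):
    """Illustrative sketch (not the official Franca code): each head looks at a
    nested prefix of the backbone feature and predicts assignments over an
    increasingly fine set of clusters, so finer heads reuse coarser dimensions."""

    def __init__(self, feat_dim=768, prefix_dims=(192, 384, 768),
                 num_clusters=(1024, 4096, 16384)):
        super().__init__()
        assert len(prefix_dims) == len(num_clusters)
        assert max(prefix_dims) == feat_dim
        self.prefix_dims = prefix_dims
        self.heads = nn.ModuleList(
            nn.Linear(d, k, bias=False) for d, k in zip(prefix_dims, num_clusters)
        )

    def forward(self, features: torch.Tensor):
        # features: [batch, feat_dim]; return one logit tensor per granularity level.
        return [head(features[:, :d]) for d, head in zip(self.prefix_dims, self.heads)]

# Example: three nested heads over a 768-d feature.
projector = MatryoshkaClusteringProjector()
logits_per_level = projector(torch.randn(8, 768))  # shapes: [8,1024], [8,4096], [8,16384]
```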
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.14137
• PDF: https://arxiv.org/pdf/2507.14137
• Project Page: https://huggingface.co/papers?q=multi-head%20clustering%20projector
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
❤2
Article Title:
Cautious Optimizers: Improving Training with One Line of Code
Article Date: 25 Nov 2024
Article Description:
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only constrained positive outcomes. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename the cautious optimizer, e.g., C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only a speed-up of up to 1.47x on Llama and MAE pretraining, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
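The described change masks out the components of a momentum-based update whose sign disagrees with the current gradient. The sketch below shows that masking idea in isolation; the rescaling by the fraction of surviving components is my reading of the C-Optim approach and should be treated as an assumption, and the commented usage line is hypothetical.

```python
import torch

def cautious_update(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Sketch of the 'cautious' masking idea: zero out update components whose
    sign disagrees with the current gradient, then rescale so the overall step
    size is roughly preserved. The rescaling detail is an assumption; see the
    official C-Optim repository for the exact one-line change."""
    mask = (update * grad > 0).to(update.dtype)           # keep sign-consistent components
    mask = mask * (mask.numel() / (mask.sum() + 1e-8))    # compensate for dropped components
    return update * mask

# Hypothetical use inside a momentum-based optimizer step, where `u` is the
# proposed parameter update (e.g. Adam's m_hat / (sqrt(v_hat) + eps)):
#     p.data.add_(-lr * cautious_update(u, p.grad))
```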
PDF Download Link:
https://arxiv.org/pdf/2411.16085v3.pdf
GitHub:
• https://github.com/kyleliang919/c-optim
• https://github.com/huggingface/pytorch-image-models
• https://github.com/zhaoolee/garss
Datasets:
• GLUE
• QNLI
• C4
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
❤3