Data Science | Machine Learning with Python for Researchers

The Data Science and Python channel is for researchers and advanced programmers

Article Title:
TradingAgents: Multi-Agents LLM Financial Trading Framework

Article Date: 28 Dec 2024

Article Description:
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.
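
As a loose illustration of the role-based orchestration described above, the sketch below wires analyst, researcher, risk-management, and trader roles around a stub query_llm function. The roles come from the abstract, but the control flow, prompts, and the query_llm stub are assumptions of this post, not the released TradingAgents code.

# Illustrative sketch of a role-based multi-agent trading loop (not the
# official TradingAgents implementation). `query_llm` is a hypothetical stub.
from dataclasses import dataclass

def query_llm(role: str, prompt: str) -> str:
    # Placeholder for a real LLM call (e.g., an API client); returns canned text here.
    return f"[{role}] analysis of: {prompt[:40]}..."

@dataclass
class Agent:
    role: str
    def analyze(self, market_state: str) -> str:
        return query_llm(self.role, market_state)

def trading_round(market_state: str) -> str:
    analysts = [Agent("fundamental analyst"), Agent("sentiment analyst"),
                Agent("technical analyst")]
    reports = [a.analyze(market_state) for a in analysts]

    # Bull and Bear researchers debate over the pooled analyst reports.
    debate = [Agent("bull researcher").analyze("\n".join(reports)),
              Agent("bear researcher").analyze("\n".join(reports))]

    # Risk management reviews the debate before the trader acts.
    risk_view = Agent("risk manager").analyze("\n".join(debate))

    # The trader synthesizes everything into a final decision.
    return query_llm("trader", "\n".join(reports + debate + [risk_view]))

print(trading_round("AAPL: earnings beat, rising volume, mixed news sentiment"))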

PDF Download Link:
https://arxiv.org/pdf/2412.20138v7.pdf

GitHub:
https://github.com/tauricresearch/tradingagents

Datasets:
• No datasets information available
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning

🔹 Publication Date: Published on Dec 21, 2021

🔹 Abstract:
A fine-tuning solution for generalized few-shot semantic segmentation improves performance beyond meta-learning by addressing saturation and minimizing the performance gap between novel and base categories. AI-generated summary: Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes to include testing their ability to remember base classes. While the current state-of-the-art approach is based on meta-learning, it performs poorly and saturates in learning after observing only a few shots. We propose the first fine-tuning solution, and demonstrate that it addresses the saturation problem while achieving state-of-the-art results on two datasets, PASCAL-5i and COCO-20i. We also show that it outperforms existing methods, whether fine-tuning multiple final layers or only the final layer. Finally, we present a triplet loss regularization that shows how to redistribute the balance of performance between novel and base categories so that there is a smaller gap between them.
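
Since the abstract leans on a triplet loss regularization, here is a minimal PyTorch sketch of the generic triplet objective. The choice of anchors, positives, and negatives (novel-class features vs. base/novel prototypes) and the margin value are assumptions for illustration, not the paper's exact formulation.

# Minimal triplet-loss sketch on class prototypes, assuming anchors come from
# novel-class features and positives/negatives from same-class / competing prototypes.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Hinge on the distance gap: pull anchor toward positive, push it from negative.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

anchor = torch.randn(8, 256)     # e.g., features of a novel class
positive = torch.randn(8, 256)   # prototype of the same class
negative = torch.randn(8, 256)   # prototype of a competing base class
print(triplet_loss(anchor, positive, negative).item())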

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2112.10982
• PDF: https://arxiv.org/pdf/2112.10982

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

🔹 Publication Date: Published on Jul 10

🔹 Abstract:
RLEP, a reinforcement learning framework with experience replay, enhances large language model training by focusing on high-quality examples, leading to faster convergence and improved performance on math-related benchmarks. AI-generated summary: Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP -- Reinforcement Learning with Experience rePlay -- a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
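
The core mechanism (mini-batches that blend fresh rollouts with replayed verified successes) can be sketched in a few lines. The 50/50 replay ratio, the dict format, and the reward field below are illustrative assumptions, not RLEP's released data pipeline.

# Sketch of RLEP-style mini-batch construction: blend freshly generated rollouts
# with replayed verified successes.
import random

experience_pool = [  # phase 1: previously collected trajectories verified as correct
    {"prompt": "2+2?", "response": "4", "reward": 1.0},
    {"prompt": "10*3?", "response": "30", "reward": 1.0},
]

def build_minibatch(new_rollouts, replay_ratio=0.5, batch_size=8):
    n_replay = min(int(batch_size * replay_ratio), len(experience_pool))
    replayed = random.sample(experience_pool, n_replay)
    fresh = random.sample(new_rollouts, min(batch_size - n_replay, len(new_rollouts)))
    batch = fresh + replayed
    random.shuffle(batch)
    return batch

new_rollouts = [{"prompt": "7*6?", "response": "42", "reward": 1.0},
                {"prompt": "7*6?", "response": "40", "reward": 0.0}]
for item in build_minibatch(new_rollouts, batch_size=4):
    print(item)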

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.07451
• PDF: https://arxiv.org/pdf/2507.07451

🔹 Datasets citing this paper:
https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding

🔹 Publication Date: Published on Jul 16

🔹 Abstract:
A large-scale benchmark, MMHU, is proposed for human behavior analysis in autonomous driving, featuring rich annotations and diverse data sources, and benchmarking multiple tasks including motion prediction and behavior question answering. AI-generated summary: Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior -- such as motion, trajectories, and intention -- a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text description for human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks -- ranging from motion prediction to motion generation and human behavior question answering -- thereby offering a broad evaluation suite. Project page: https://MMHU-Benchmark.github.io.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.12463
• PDF: https://arxiv.org/pdf/2507.12463
• Project Page: https://mmhu-benchmark.github.io/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Embedding Atlas: Low-Friction, Interactive Embedding Visualization

Article Date: 9 May 2025

Article Description:
Embedding projections are popular for visualizing large datasets and models. However, people often encounter "friction" when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms -- including density-based clustering, and automated labeling -- to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas's feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.
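
For a feel of the kind of pipeline such a tool builds on, the rough sketch below projects embeddings to 2D, clusters them density-wise, and auto-labels each cluster from its member texts. The specific tools (PCA, DBSCAN) and the word-count labeling rule are stand-ins chosen for this post, not Embedding Atlas's actual algorithms.

# Toy pipeline: 2D projection -> density-based clustering -> automatic cluster labels.
from collections import Counter
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))
texts = [f"sample document {i % 5}" for i in range(200)]

coords = PCA(n_components=2).fit_transform(embeddings)       # 2D projection for plotting
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(coords)  # density-based clusters

for cluster_id in sorted(set(labels) - {-1}):
    members = [texts[i] for i in np.where(labels == cluster_id)[0]]
    top_word, _ = Counter(" ".join(members).split()).most_common(1)[0]
    print(f"cluster {cluster_id}: {len(members)} points, auto-label ~ '{top_word}'")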

PDF Download Link:
https://arxiv.org/pdf/2505.06386v1.pdf

GitHub:
https://github.com/apple/embedding-atlas

Datasets:
• No datasets information available
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
OmniGen2: Exploration to Advanced Multimodal Generation

Article Date: 23 Jun 2025

Article Description:
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
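
A toy PyTorch sketch of what "two distinct decoding pathways with unshared parameters" can look like structurally: a shared multimodal context feeding separate text and image heads. The module choices and dimensions are invented for illustration and do not reflect OmniGen2's actual architecture or its decoupled image tokenizer.

# Toy dual-pathway module: one shared context encoder, two unshared decoders.
import torch
import torch.nn as nn

class DualPathwayModel(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000, image_tokens=64):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Unshared decoders: one head per modality.
        self.text_head = nn.Linear(d_model, vocab_size)      # stands in for the text pathway
        self.image_head = nn.Linear(d_model, image_tokens)   # stands in for the image pathway

    def forward(self, context):
        h = self.backbone(context)
        return self.text_head(h), self.image_head(h)

model = DualPathwayModel()
context = torch.randn(2, 16, 256)          # batch of multimodal context embeddings
text_logits, image_logits = model(context)
print(text_logits.shape, image_logits.shape)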

PDF Download Link:
https://arxiv.org/pdf/2506.18871v2.pdf

GitHub:
https://github.com/vectorspacelab/omnigen2

Datasets:
• MM-Vet
• GenEval
• MagicBrush
• ImgEdit
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Article Date: 20 May 2025

Article Description:
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
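
The analyze-then-parse flow is easy to picture as a two-stage function: stage 1 returns layout anchors in reading order, stage 2 parses each anchor in parallel with a task-specific prompt. Both stages below are hypothetical stubs written to show the control flow only; they are not Dolphin's API or prompts.

# Two-stage analyze-then-parse sketch with stubbed model calls.
from concurrent.futures import ThreadPoolExecutor

def analyze_layout(page_image) -> list[dict]:
    # Stage 1 (stub): a real model would return layout anchors in reading order.
    return [{"type": "paragraph", "bbox": (0, 0, 100, 40)},
            {"type": "table", "bbox": (0, 50, 100, 90)},
            {"type": "formula", "bbox": (0, 95, 100, 110)}]

PROMPTS = {"paragraph": "Transcribe the text.",
           "table": "Parse the table to HTML.",
           "formula": "Transcribe the formula to LaTeX."}

def parse_element(page_image, element) -> str:
    # Stage 2 (stub): anchor crop + task-specific prompt -> parsed content.
    return f"<{element['type']} parsed with prompt: {PROMPTS[element['type']]}>"

def parse_page(page_image) -> list[str]:
    elements = analyze_layout(page_image)
    with ThreadPoolExecutor() as pool:  # element-level parsing is parallelizable
        return list(pool.map(lambda e: parse_element(page_image, e), elements))

print("\n".join(parse_page(page_image=None)))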

PDF Download Link:
https://arxiv.org/pdf/2505.14059v1.pdf

GitHub:
https://github.com/bytedance/dolphin

Datasets:
• PubTabNet
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Article Date: 5 Jun 2025

Article Description:
Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

PDF Download Link:
https://arxiv.org/pdf/2506.05204v1.pdf

GitHub:
https://github.com/Yanbo-23/OGGSplat

Datasets:
• S3DIS
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Facial Appearance Capture at Home with Patch-Level Reflectance Prior

Article Date: 4 Jun 2025

Article Description:
Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.
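
To make the patch-level idea concrete, here is a small NumPy sketch that assembles a full-resolution map from overlapping patch samples with feathered averaging. The per-patch sampler is a random stub standing in for the paper's patch-level diffusion-prior posterior sampling, and the patch size, stride, and blending window are assumptions.

# Stitch a full-resolution map from overlapping patch-level samples.
import numpy as np

def sample_patch(y0, x0, size):
    # Stub for "sample a reflectance patch from the patch-level prior,
    # conditioned on the captured images"; here just random values.
    return np.random.rand(size, size)

def stitch_full_map(height=256, width=256, patch=64, stride=32):
    acc = np.zeros((height, width))
    weight = np.zeros((height, width))
    window = np.outer(np.hanning(patch), np.hanning(patch)) + 1e-6  # feathered blending weights
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            acc[y:y+patch, x:x+patch] += sample_patch(y, x, patch) * window
            weight[y:y+patch, x:x+patch] += window
    return acc / weight  # weighted average where patches overlap

full_map = stitch_full_map()
print(full_map.shape, float(full_map.min()), float(full_map.max()))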

PDF Download Link:
https://arxiv.org/pdf/2506.03478v1.pdf

GitHub:
https://github.com/yxuhan/dora

Datasets:
• NeRF
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

🔹 Publication Date: Published on Jul 15

🔹 Abstract:
DIJA is a framework that exploits safety weaknesses in diffusion-based large language models by constructing adversarial prompts, demonstrating significant vulnerabilities in their alignment mechanisms. AI-generated summary: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.11097
• PDF: https://arxiv.org/pdf/2507.11097
• Github: https://github.com/ZichenWen1/DIJA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
4KAgent: Agentic Any Image to 4K Super-Resolution

🔹 Publication Date: Published on Jul 9

🔹 Abstract:
4KAgent, a unified agentic super-resolution system, enhances low-resolution images to 4K using profiling, perception, and restoration agents, achieving state-of-the-art performance across various imaging domains. AI-generated summary: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.
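
As a loose sketch of the quality-driven selection inside an execution-reflection loop, the toy code below runs a few candidate "experts" on an image and keeps whichever output a quality scorer prefers. The experts, the variance-based scorer, and the loop structure are placeholders made up for this post, not 4KAgent's mixture-of-expert policy or its NIQE/MUSIQ assessors.

# Toy quality-driven expert selection loop.
import numpy as np

def expert_denoise(img):  return np.clip(img + 0.01, 0, 1)
def expert_sharpen(img):  return np.clip(img * 1.1, 0, 1)
def expert_identity(img): return img

def quality_score(img):
    # Toy no-reference proxy (global contrast); a real system would use learned IQA metrics.
    return float(np.var(img))

def restoration_step(img, experts):
    candidates = {fn.__name__: fn(img) for fn in experts}
    best_name = max(candidates, key=lambda k: quality_score(candidates[k]))
    return best_name, candidates[best_name]

img = np.random.rand(64, 64)
for _ in range(3):  # apply, score, keep the best candidate, repeat
    name, img = restoration_step(img, [expert_denoise, expert_sharpen, expert_identity])
    print(name, round(quality_score(img), 4))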

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.07105
• PDF: https://arxiv.org/pdf/2507.07105
• Project Page: https://4kagent.github.io/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

🔹 Publication Date: Published on Jun 17

🔹 Abstract:
This study investigates long-context performance of diffusion LLMs compared to auto-regressive LLMs, identifies their unique characteristics, and proposes LongLLaDA, a training-free method for extending context windows. AI-generated summary: Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
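
Since the method hinges on NTK-based RoPE extrapolation, here is a short sketch of the common NTK-aware base-scaling heuristic (new_base = base * s^(d/(d-2))). The exact scaling rule and constants LongLLaDA uses may differ, so treat this as background rather than the paper's recipe.

# NTK-aware RoPE base scaling: stretch the rotary base so long-range (low-frequency)
# components are interpolated while local (high-frequency) components stay intact.
import numpy as np

def rope_inv_freq(dim, base=10000.0):
    # Standard RoPE inverse frequencies for head dimension `dim`.
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def ntk_scaled_inv_freq(dim, scale, base=10000.0):
    new_base = base * scale ** (dim / (dim - 2))
    return rope_inv_freq(dim, new_base)

dim, scale = 128, 4.0            # e.g., extend a 4k-token window toward 16k
orig = rope_inv_freq(dim)
ntk = ntk_scaled_inv_freq(dim, scale)
print("lowest-frequency ratio:", ntk[-1] / orig[-1])   # ~1/scale: long-range components stretched
print("highest-frequency ratio:", ntk[0] / orig[0])    # ~1: local components preserved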

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.14429
• PDF: https://arxiv.org/pdf/2506.14429

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
T-LoRA: Single Image Diffusion Model Customization Without Overfitting

🔹 Publication Date: Published on Jul 8

🔹 Abstract:
T-LoRA, a timestep-dependent low-rank adaptation framework, enhances diffusion model personalization with a dynamic fine-tuning strategy and orthogonal initialization, achieving better concept fidelity and text alignment in data-limited settings. AI-generated summary: While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
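
To illustrate a timestep-dependent rank-constrained update, the sketch below masks the adapter's effective rank as a function of the diffusion timestep and orthogonally initializes the down projection. The linear rank schedule and the module layout are assumptions for illustration, not T-LoRA's exact parametrization.

# Timestep-dependent LoRA-style linear layer (illustrative).
import torch
import torch.nn as nn

class TimestepLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=16, max_t=1000):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                     # stand-in for the frozen pretrained weight
        self.down = nn.Parameter(torch.empty(rank, d_in))
        self.up = nn.Parameter(torch.zeros(d_out, rank))
        nn.init.orthogonal_(self.down)                  # orthogonal initialization of adapter components
        self.rank, self.max_t = rank, max_t

    def forward(self, x, t):
        # Fewer active rank components at high (noisier) timesteps, which overfit more easily.
        active = max(1, int(self.rank * (1 - t / self.max_t)))
        mask = torch.zeros(self.rank, device=x.device)
        mask[:active] = 1.0
        delta = (self.up * mask) @ self.down            # rank-constrained update
        return self.base(x) + x @ delta.T

layer = TimestepLoRALinear(64, 64)
x = torch.randn(4, 64)
print(layer(x, t=100).shape, layer(x, t=900).shape)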

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.05964
• PDF: https://arxiv.org/pdf/2507.05964
• Github: https://github.com/ControlGenAI/T-LoRA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

🔹 Publication Date: Published on Jul 2

🔹 Abstract:
Multimodal foundation models, despite being primarily trained on image-text tasks, demonstrate respectable performance across various vision tasks when adapted through prompt chaining, though they fall short compared to specialized models. AI-generated summary: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc.). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.01955
• PDF: https://arxiv.org/pdf/2507.01955
• Project Page: https://fm-vision-evals.epfl.ch/
• Github: https://github.com/EPFL-VILAB/fm-vision-evals

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

🔹 Publication Date: Published on Jul 18

🔹 Abstract:
Franca, an open-source vision foundation model, achieves high performance using a transparent training pipeline and novel clustering and disentanglement techniques. AI-generated summary: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
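
To picture a nested, Matryoshka-style multi-head clustering projector, the sketch below gives each clustering head a growing prefix of the same feature vector and a larger cluster codebook. The prefix lengths, cluster counts, and plain linear heads are assumptions for illustration, not Franca's released design.

# Coarse-to-fine cluster logits from nested feature prefixes.
import torch
import torch.nn as nn

class NestedClusteringProjector(nn.Module):
    def __init__(self, feat_dim=256, prefixes=(64, 128, 256), clusters=(256, 1024, 4096)):
        super().__init__()
        self.prefixes = prefixes
        self.heads = nn.ModuleList(nn.Linear(p, k) for p, k in zip(prefixes, clusters))

    def forward(self, feats):
        # Each head only sees the first `p` feature dimensions of the shared representation.
        return [head(feats[:, :p]) for p, head in zip(self.prefixes, self.heads)]

proj = NestedClusteringProjector()
feats = torch.randn(8, 256)
for logits in proj(feats):
    print(logits.shape)   # (8, 256), (8, 1024), (8, 4096)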

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.14137
• PDF: https://arxiv.org/pdf/2507.14137
• Github: https://github.com/valeoai/Franca

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Cautious Optimizers: Improving Training with One Line of Code

Article Date: 25 Nov 2024

Article Description:
AdamW has been the default optimizer for transformer pretraining. For many years, our community searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we propose a single-line modification in Pytorch to any momentum-based optimizer, which we rename cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only speed-up on Llama and MAE pretraining up to 1.47 times, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
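
The single-line idea, as I understand it from the paper and the C-Optim repository, is to mask out update components whose sign disagrees with the current gradient before applying the step. The sketch below shows that masking (with a rescale by the kept fraction) as a standalone function; treat it as an approximation of the released code rather than a verbatim copy.

# "Cautious" masking applied to a proposed momentum-based update.
import torch

def cautious_update(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    mask = (update * grad > 0).to(update.dtype)          # keep only sign-aligned components
    mask = mask * (mask.numel() / (mask.sum() + 1))      # rescale so the overall step size is roughly preserved
    return update * mask

# Usage idea: given a proposed AdamW step `u` and the raw gradient `g`,
# apply p.data.add_(cautious_update(u, g), alpha=-lr) instead of the plain step.
u = torch.tensor([0.1, -0.2, 0.3])
g = torch.tensor([0.5, 0.4, -0.1])
print(cautious_update(u, g))   # second and third components are masked out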

PDF Download Link:
https://arxiv.org/pdf/2411.16085v3.pdf

GitHub:
https://github.com/kyleliang919/c-optim
https://github.com/huggingface/pytorch-image-models
https://github.com/zhaoolee/garss

Datasets:
• GLUE
• QNLI
• C4
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

🔹 Publication Date: Published on Jul 6

🔹 Abstract:
DreamVLA improves robot manipulation through a VLA framework that incorporates world knowledge, dynamic-region guidance, and a diffusion-based transformer to ensure clear, disentangled representations for action planning. AI-generated summary: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
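
A block-wise structured attention mask of the kind described can be built in a few lines: tokens for the dynamic, spatial, and semantic blocks attend within their own block (and to shared observation tokens) but not across blocks. The exact connectivity DreamVLA uses may differ; this only illustrates how such a mask is constructed.

# Build a boolean attention mask with mutually isolated knowledge blocks.
import torch

def block_mask(n_obs=8, block_sizes=(4, 4, 4)):
    n = n_obs + sum(block_sizes)
    allowed = torch.zeros(n, n, dtype=torch.bool)
    allowed[:, :n_obs] = True                 # every token may attend to the observation tokens
    start = n_obs
    for size in block_sizes:                  # each knowledge block only attends within itself
        allowed[start:start+size, start:start+size] = True
        start += size
    # True = "may attend"; convert/invert as required by the attention API you use.
    return allowed

mask = block_mask()
print(mask.shape, int(mask.sum()))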

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.04447
• PDF: https://arxiv.org/pdf/2507.04447
• Project Page: https://zhangwenyao1.github.io/DreamVLA/
• Github: https://github.com/Zhangwenyao1/DreamVLA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
How to Train Your LLM Web Agent: A Statistical Diagnosis

🔹 Publication Date: Published on Jul 5

🔹 Abstract:
A study on compute allocation for post-training LLM-based web agents finds that combining supervised fine-tuning with on-policy reinforcement learning improves performance and reduces computational costs compared to either method alone. AI-generated summary: LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
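
The bootstrapping step can be illustrated with a few lines of plain Python: resample the sampled runs with replacement and see how often one training strategy's best configuration beats another's. The synthetic scores, family names, and win-rate criterion below are an illustration made up for this post, not the paper's exact statistical procedure.

# Bootstrap comparison of two configuration families from a limited set of runs.
import random

random.seed(0)
# Synthetic (score, family) results standing in for sampled training configurations.
runs = [(random.gauss(0.55, 0.05), "sft_only") for _ in range(40)] + \
       [(random.gauss(0.60, 0.05), "sft_plus_rl") for _ in range(40)]

def bootstrap_win_rate(runs, n_boot=2000):
    wins = 0
    for _ in range(n_boot):
        sample = [random.choice(runs) for _ in runs]            # resample runs with replacement
        best = {}
        for score, family in sample:                            # best observed config per family
            best[family] = max(best.get(family, float("-inf")), score)
        wins += best.get("sft_plus_rl", float("-inf")) > best.get("sft_only", float("-inf"))
    return wins / n_boot

print("P(SFT+RL beats SFT-only at the best observed config):", bootstrap_win_rate(runs))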

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.04103
• PDF: https://arxiv.org/pdf/2507.04103

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services

🔹 Publication Date: Published on Jul 13

🔹 Abstract:
RedOne, a domain-specific LLM, enhances performance across multiple SNS tasks through a three-stage training strategy, improving generalization and reducing harmful content exposure. AI-generated summary: As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has posed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions, but existing studies focus on isolated tasks, which not only encounter diminishing benefits from data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world contexts. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities and achieves an average improvement of up to 14.02% across 8 major SNS tasks and 7.56% on an SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-task fine-tuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.10605
• PDF: https://arxiv.org/pdf/2507.10605

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Article Date: 26 Jun 2025

Article Description:
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with professionals in the creative industries, such generation requires sophisticated reasoning about elements such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels on the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Demo.github.io.

PDF Download Link:
https://arxiv.org/pdf/2506.21448v1.pdf

GitHub:
https://github.com/FunAudioLLM/ThinkSound

Datasets:
• AudioSet
• VGG-Sound
==================================

For more data science resources:

https://t.iss.one/DataScienceT