Article Title:
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Article Date: 29 May 2025
Article Description:
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
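To make the parallel-decoding idea concrete, below is a minimal, self-contained sketch of MaskGIT-style iterative unmasking, the general mechanism this family of discrete diffusion generators relies on. The toy denoiser, cosine schedule, and all sizes are illustrative assumptions, not Muddit's actual architecture or sampler.

```python
# Sketch: iterative parallel decoding with a masked-token denoiser (MaskGIT-style).
import math
import torch

def parallel_decode(model, seq_len, vocab_size, mask_id, steps=8, device="cpu"):
    """Start fully masked; each step unmasks the most confident predictions in parallel."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        logits = model(tokens)                         # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)        # per-position confidence and argmax
        still_masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0)   # only masked slots compete
        # Cosine schedule: how many positions should remain masked after this step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        num_unmask = still_masked.sum().item() - keep_masked
        if num_unmask <= 0:
            continue
        idx = conf.topk(num_unmask, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))   # commit the confident tokens
    return tokens

class ToyDenoiser(torch.nn.Module):
    """Stand-in for the discrete diffusion transformer."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)
    def forward(self, tokens):
        return self.head(self.emb(tokens))

vocab, mask_id = 1024, 1023
out = parallel_decode(ToyDenoiser(vocab), seq_len=16, vocab_size=vocab, mask_id=mask_id)
print(out.shape)   # torch.Size([1, 16]) produced in `steps` passes, not 16 sequential ones
```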
Article Download Link: https://arxiv.org/pdf/2505.23606v1.pdf
GitHub:
- https://github.com/m-e-agi-lab/muddit
- https://github.com/viiika/Meissonic
Datasets:
- No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Article Date: 29 May 2025
Article Description:
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
Article Download Link: https://arxiv.org/pdf/2505.23759v1.pdf
GitHub:
- https://github.com/kyunnilee/visual_puzzles
Datasets:
- No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
Article Date: 28 May 2025
Article Description:
Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
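The core recipe -- freeze an off-the-shelf spotter and train only a small matcher on top of its per-detection embeddings -- can be sketched as below. The stub spotter, matcher head, and one-step training loop are illustrative assumptions, not GoMatching++'s actual rescoring mechanism or LST-Matcher.

```python
# Sketch: frozen spotter + lightweight trainable cross-frame matcher.
import torch
import torch.nn as nn

class FrozenSpotterStub(nn.Module):
    """Stand-in for an image text spotter whose weights stay frozen."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Linear(32, dim)
    @torch.no_grad()
    def forward(self, crops):              # crops: (N, 32) toy detection features
        return self.backbone(crops)        # (N, dim) per-detection embeddings

class LightweightMatcher(nn.Module):
    """Scores whether a detection in frame t matches one in frame t+1."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, a, b):               # a: (N, dim), b: (M, dim)
        pairs = torch.cat([a.unsqueeze(1).expand(-1, b.size(0), -1),
                           b.unsqueeze(0).expand(a.size(0), -1, -1)], dim=-1)
        return self.score(pairs).squeeze(-1)   # (N, M) match logits

spotter, matcher = FrozenSpotterStub(), LightweightMatcher()
for p in spotter.parameters():
    p.requires_grad_(False)                # only the matcher is optimized
opt = torch.optim.AdamW(matcher.parameters(), lr=1e-4)

feats_t, feats_t1 = spotter(torch.randn(5, 32)), spotter(torch.randn(5, 32))
target = torch.eye(5)                      # toy ground-truth frame-to-frame assignment
loss = nn.functional.binary_cross_entropy_with_logits(matcher(feats_t, feats_t1), target)
opt.zero_grad()
loss.backward()
opt.step()
```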
Article Download Link: https://arxiv.org/pdf/2505.22228v1.pdf
GitHub:
- https://github.com/hxyz-123/gomatching
Datasets:
- ICDAR 2013
- BOVText
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Article Date: 28 May 2025
Article Description:
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potentially promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting -- when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.
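A toy sketch of the surrogate idea with a generic decoder: inherit the shallow blocks of the target model to form a cheap surrogate, train the vision encoder against it, then plug the encoder into the full model unchanged. Module names, depths, and the placeholder objective are assumptions, not the paper's configuration.

```python
# Sketch: build a surrogate from the shallow layers of a "large" decoder.
import copy
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab=1000, dim=64, depth=12):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth))
        self.lm_head = nn.Linear(dim, vocab)
    def forward_features(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def build_surrogate(full_model: ToyDecoder, keep_layers: int) -> ToyDecoder:
    """The surrogate shares the target's embedding space: copy only the shallow blocks."""
    surrogate = copy.deepcopy(full_model)
    surrogate.blocks = surrogate.blocks[:keep_layers]
    return surrogate

full = ToyDecoder(depth=12)
surrogate = build_surrogate(full, keep_layers=3)     # far cheaper to backprop through

vision_encoder = nn.Linear(768, 64)                  # maps image features into the LLM space
opt = torch.optim.AdamW(vision_encoder.parameters(), lr=1e-4)
img_feats = torch.randn(2, 16, 768)
out = surrogate.forward_features(vision_encoder(img_feats))
loss = out.pow(2).mean()                             # placeholder training objective
opt.zero_grad()
loss.backward()
opt.step()
# "Zero-shot grafting": the trained encoder is later plugged into `full` unchanged, e.g.
# full.forward_features(vision_encoder(img_feats)).
```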
PDF Download Link:
https://arxiv.org/pdf/2505.22664v1.pdf
GitHub:
• https://github.com/facebookresearch/zero
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment
Article Date: 30 Mar 2021
Article Description:
Alphas are stock prediction models capturing trading signals in a stock market. A set of effective alphas can generate weakly correlated high returns to diversify the risk. Existing alphas can be categorized into two classes: Formulaic alphas are simple algebraic expressions of scalar features, and thus can generalize well and be mined into a weakly correlated set. Machine learning alphas are data-driven models over vector and matrix features. They are more predictive than formulaic alphas, but are too complex to mine into a weakly correlated set. In this paper, we introduce a new class of alphas to model scalar, vector, and matrix features which possess the strengths of these two existing classes. The new alphas predict returns with high accuracy and can be mined into a weakly correlated set. In addition, we propose a novel alpha mining framework based on AutoML, called AlphaEvolve, to generate the new alphas. To this end, we first propose operators for generating the new alphas and selectively injecting relational domain knowledge to model the relations between stocks. We then accelerate the alpha mining by proposing a pruning technique for redundant alphas. Experiments show that AlphaEvolve can evolve initial alphas into the new alphas with high returns and weak correlations.
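The evolve-evaluate-prune loop behind alpha mining can be sketched on synthetic data as below. The primitive alphas, mutation operator, scoring by correlation with next-step returns, and redundancy threshold are simplified assumptions, not AlphaEvolve's actual operators or search space.

```python
# Sketch: evolutionary search over simple formulaic alphas on synthetic market data.
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 50                                        # timesteps, stocks
close = np.cumprod(1 + 0.01 * rng.standard_normal((T, N)), axis=0)
volume = rng.lognormal(size=(T, N))
ret_fwd = np.roll(close, -1, axis=0) / close - 1      # next-step return (prediction target)

PRIMITIVES = [
    lambda c, v: c / np.roll(c, 5, axis=0) - 1,       # 5-step momentum
    lambda c, v: -(c - c.mean(1, keepdims=True)),     # cross-sectional reversal
    lambda c, v: v / np.roll(v, 5, axis=0) - 1,       # volume change
]

def mutate(alpha):
    """Blend an existing alpha with a random primitive (a toy mutation operator)."""
    other = PRIMITIVES[rng.integers(len(PRIMITIVES))]
    w = rng.uniform(0.2, 0.8)
    return lambda c, v, a=alpha, b=other, w=w: w * a(c, v) + (1 - w) * b(c, v)

def score(alpha):
    """Absolute correlation between the alpha signal and next-step returns."""
    sig, tgt = alpha(close, volume)[10:-1], ret_fwd[10:-1]
    return abs(np.corrcoef(sig.ravel(), tgt.ravel())[0, 1])

population, kept = list(PRIMITIVES), []
for _ in range(30):                                   # evolve and keep the fittest
    population += [mutate(population[rng.integers(len(population))]) for _ in range(5)]
    population.sort(key=score, reverse=True)
    population = population[:10]
for a in population:                                  # prune mutually redundant alphas
    if all(abs(np.corrcoef(a(close, volume).ravel(),
                           k(close, volume).ravel())[0, 1]) < 0.9 for k in kept):
        kept.append(a)
print(f"kept {len(kept)} weakly correlated alphas, best score {score(kept[0]):.3f}")
```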
PDF Download Link:
https://arxiv.org/pdf/2103.16196v2.pdf
GitHub:
• https://github.com/codelion/openevolve
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models
Article Date: 16 May 2025
Article Description:
We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Additionally, we introduce GenoAdv, a new adversarial sample dataset designed to improve GFM safety. Empirically, classification models exhibit greater robustness to adversarial perturbations compared to generative models, highlighting the impact of task type on model vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.
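The kind of evaluation such a benchmark standardizes can be sketched as follows: a toy genomic classifier is attacked with greedy single-nucleotide substitutions under a small edit budget, and robustness is whether the prediction survives. The model and attack below are illustrative stand-ins, not the benchmark's GFMs, attack algorithms, or defenses.

```python
# Sketch: greedy substitution attack on a toy DNA sequence classifier.
import torch
import torch.nn as nn

BASES = "ACGT"

def encode(seq):                                      # one-hot encode a DNA string
    idx = torch.tensor([BASES.index(b) for b in seq])
    return nn.functional.one_hot(idx, 4).float().flatten()

class ToyGFMClassifier(nn.Module):
    def __init__(self, seq_len=30):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4 * seq_len, 32), nn.ReLU(), nn.Linear(32, 2))
    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def greedy_attack(model, seq, label, budget=3):
    """Try to flip the prediction with at most `budget` substitutions, chosen greedily."""
    seq = list(seq)
    for _ in range(budget):
        best = None
        for i in range(len(seq)):
            for b in BASES:
                if b == seq[i]:
                    continue
                cand = seq[:i] + [b] + seq[i + 1:]
                loss = nn.functional.cross_entropy(
                    model(encode("".join(cand)).unsqueeze(0)), torch.tensor([label]))
                if best is None or loss > best[0]:
                    best = (loss, cand)
        seq = best[1]
        if model(encode("".join(seq)).unsqueeze(0)).argmax().item() != label:
            return "".join(seq), True                 # attack succeeded
    return "".join(seq), False

torch.manual_seed(0)
model = ToyGFMClassifier()
seq = "ACGTACGTACGTACGTACGTACGTACGTAC"                # length-30 toy sequence
label = model(encode(seq).unsqueeze(0)).argmax().item()
adv, flipped = greedy_attack(model, seq, label)
print("prediction flipped under a 3-edit budget:", flipped)
```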
PDF Download Link:
https://arxiv.org/pdf/2505.10983v1.pdf
GitHub:
• https://github.com/MAGICS-LAB/GenoArmory
Datasets:
• GenoAdv
• GUE
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Article Date: 6 May 2025
Article Description:
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
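The multi-token idea can be sketched as a set of lightweight prediction heads on top of a single backbone pass, so K audio tokens come out of one forward instead of K sequential passes. The GRU stand-in, head count, and vocabulary size are assumptions, not the released model's MCTP design.

```python
# Sketch: K lightweight heads predict the next K audio tokens from one hidden state.
import torch
import torch.nn as nn

class MultiTokenAudioHead(nn.Module):
    def __init__(self, hidden=256, audio_vocab=1024, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, audio_vocab) for _ in range(k))
    def forward(self, h_last):                        # h_last: (B, hidden), last hidden state
        return torch.stack([head(h_last) for head in self.heads], dim=1)  # (B, K, vocab)

backbone = nn.GRU(input_size=64, hidden_size=256, batch_first=True)  # stand-in for the LLM
mctp = MultiTokenAudioHead()

x = torch.randn(2, 10, 64)                            # (batch, context_len, embedding)
_, h_n = backbone(x)                                  # one forward pass of the backbone
logits = mctp(h_n[-1])                                # logits for 4 future audio tokens
next_audio_tokens = logits.argmax(-1)                 # (2, 4): K tokens without K passes
print(next_audio_tokens.shape)
```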
PDF Download Link:
https://arxiv.org/pdf/2505.03739v1.pdf
GitHub:
• https://github.com/vita-mllm/vita-audio
Datasets:
• LibriSpeech
• TriviaQA
• LibriTTS
• AISHELL-1
• FLEURS
• VoxPopuli
• LIMA
• GigaSpeech
• Multilingual LibriSpeech
• AISHELL-2
• WenetSpeech
• MathInstruct
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
Article Date: 27 Mar 2025
Article Description:
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
PDF Download Link:
https://arxiv.org/pdf/2503.21761v1.pdf
GitHub:
• https://github.com/Davidyao99/uni4d
Datasets:
• KITTI
• DAVIS
• TUM RGB-D
• MPI Sintel
• Bonn RGB-D Dynamic
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Article Date: 27 Apr 2025
Article Description:
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
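The tunable ensemble can be sketched as a weighted combination of scorers that each emit a confidence in [0, 1], with the weights tuned on labelled data for a given use case. The scorer implementations below are simplified stand-ins; the paper's actual scorers live in the UQLM toolkit linked below.

```python
# Sketch: black-box, white-box, and judge scorers combined into one tunable confidence.
from dataclasses import dataclass
from typing import Callable, Sequence
import math

@dataclass
class EnsembleScorer:
    scorers: Sequence[Callable[[str, dict], float]]   # each maps (response, context) -> [0, 1]
    weights: Sequence[float]
    def __call__(self, response: str, context: dict) -> float:
        total = sum(self.weights)
        return sum(w * s(response, context) for w, s in zip(self.weights, self.scorers)) / total

def blackbox_consistency(response, ctx):
    """Fraction of resampled answers that agree with the response (no logits needed)."""
    samples = ctx.get("resampled_answers", [])
    return sum(a == response for a in samples) / max(len(samples), 1)

def whitebox_avg_prob(response, ctx):
    """Mean token probability, from log-probs exposed by the generating model."""
    logprobs = ctx.get("token_logprobs", [])
    return math.exp(sum(logprobs) / max(len(logprobs), 1)) if logprobs else 0.0

def llm_judge(response, ctx):
    """Score returned by a separate judge model, already normalized to [0, 1]."""
    return ctx.get("judge_score", 0.5)

ensemble = EnsembleScorer([blackbox_consistency, whitebox_avg_prob, llm_judge],
                          weights=[0.4, 0.3, 0.3])
context = {"resampled_answers": ["Paris", "Paris", "Lyon"],
           "token_logprobs": [-0.1, -0.2, -0.05], "judge_score": 0.9}
print(f"confidence: {ensemble('Paris', context):.2f}")
```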
PDF Download Link:
https://arxiv.org/pdf/2504.19254v2.pdf
GitHub:
• https://github.com/cvs-health/uqlm
Datasets:
• GSM8K
• SVAMP
• PopQA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
s3: You Don't Need That Much Data to Train a Search Agent via RL
Article Date: 20 May 2025
Article Description:
Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
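The Gain Beyond RAG reward is simple to state in code: score the frozen generator once with naive-RAG documents and once with the searcher's documents, and reward the difference. The toy generator and exact-match scorer below are stand-ins for the paper's actual components.

```python
# Sketch: the Gain Beyond RAG reward for training the searcher.
from typing import Callable, List

def gain_beyond_rag(question: str,
                    gold: str,
                    naive_docs: List[str],
                    searcher_docs: List[str],
                    generate: Callable[[str, List[str]], str],
                    score: Callable[[str, str], float]) -> float:
    """Reward = answer quality with the searcher's docs minus quality with naive RAG's docs."""
    baseline = score(generate(question, naive_docs), gold)
    improved = score(generate(question, searcher_docs), gold)
    return improved - baseline

# Toy components so the sketch runs end to end.
def toy_generate(question, docs):
    return "paris" if any("capital of france" in d.lower() for d in docs) else "unknown"

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

reward = gain_beyond_rag(
    question="What is the capital of France?",
    gold="Paris",
    naive_docs=["The Eiffel Tower is a landmark."],
    searcher_docs=["Paris is the capital of France."],
    generate=toy_generate,
    score=exact_match,
)
print(reward)   # 1.0: the searcher's documents let the frozen generator answer correctly
```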
PDF Download Link:
https://arxiv.org/pdf/2505.14146v1.pdf
GitHub:
• https://github.com/pat-jj/s3
Datasets:
• Natural Questions
• TriviaQA
• HotpotQA
• MedQA
• PubMedQA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Vision as LoRA
Article Date: 26 Mar 2025
Article Description:
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
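The mechanics of absorbing vision capability into LoRA layers and merging them away at inference can be sketched with a single adapted linear layer; ranks, scales, and shapes below are illustrative, not VoRA's configuration.

```python
# Sketch: a LoRA-adapted linear layer that can be merged back into the base weight.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # the LLM weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # standard zero init
        self.scale = alpha / rank
    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the adapter into the base weight: no extra modules left at inference."""
        merged = nn.Linear(self.base.in_features, self.base.out_features)
        merged.weight.copy_(self.base.weight + self.scale * self.lora_b @ self.lora_a)
        merged.bias.copy_(self.base.bias)
        return merged

layer = LoRALinear(nn.Linear(64, 64))
with torch.no_grad():
    layer.lora_b.normal_(0, 0.02)                     # pretend some training has happened
x = torch.randn(2, 64)
y_adapted = layer(x)                                  # training-time path (adapter active)
y_merged = layer.merge()(x)                           # deployment path (adapter folded in)
print(torch.allclose(y_adapted, y_merged, atol=1e-5)) # True: identical outputs
```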
PDF Download Link:
https://arxiv.org/pdf/2503.20680v1.pdf
GitHub:
• https://github.com/hon-wong/vora
Datasets:
• MM-Vet
• Google Landmarks Dataset v2
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Harnessing the Universal Geometry of Embeddings
Article Date: 18 May 2025
Article Description:
We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.PDFAbstract
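A heavily simplified sketch of the translation setup: per-space encoders into a shared latent and decoders back out, trained on unpaired batches with reconstruction and cycle-consistency objectives. The paper's actual method combines further objectives, and the random data here only shows the training wiring, not meaningful learning.

```python
# Sketch: unpaired translation between two embedding spaces via a shared latent.
import torch
import torch.nn as nn

def mlp(d_in, d_out, hidden=256):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

dim_a, dim_b, dim_latent = 384, 768, 256              # two unrelated embedding models
enc_a, dec_a = mlp(dim_a, dim_latent), mlp(dim_latent, dim_a)
enc_b, dec_b = mlp(dim_b, dim_latent), mlp(dim_latent, dim_b)
params = [*enc_a.parameters(), *dec_a.parameters(), *enc_b.parameters(), *dec_b.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(200):
    xa = torch.randn(64, dim_a)                       # unpaired batch from space A ...
    xb = torch.randn(64, dim_b)                       # ... and from space B
    recon = nn.functional.mse_loss(dec_a(enc_a(xa)), xa) + \
            nn.functional.mse_loss(dec_b(enc_b(xb)), xb)
    # Cycle consistency: A -> latent -> B -> latent -> A should return to the start.
    cycle = nn.functional.mse_loss(dec_a(enc_b(dec_b(enc_a(xa)))), xa) + \
            nn.functional.mse_loss(dec_b(enc_a(dec_a(enc_b(xb)))), xb)
    loss = recon + cycle
    opt.zero_grad()
    loss.backward()
    opt.step()

a_to_b = dec_b(enc_a(torch.randn(1, dim_a)))          # translate an A-embedding into B-space
print(a_to_b.shape)                                   # torch.Size([1, 768])
```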
PDF Download Link:
https://arxiv.org/pdf/2505.12540v2.pdf
GitHub:
• https://github.com/rjha18/vec2vec
• https://github.com/zhaoolee/garss
Datasets:
• Natural Questions
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MTGS: Multi-Traversal Gaussian Splatting
Article Date: 16 Mar 2025
Article Description:
Multi-traversal data, commonly collected through daily commutes or by self-driving fleets, provides multiple viewpoints for scene reconstruction within a road block. This data offers significant potential for high-quality novel view synthesis, which is crucial for applications such as autonomous vehicle simulators. However, inherent challenges in multi-traversal data often result in suboptimal reconstruction quality, including variations in appearance and the presence of dynamic objects. To address these issues, we propose Multi-Traversal Gaussian Splatting (MTGS), a novel approach that reconstructs high-quality driving scenes from arbitrarily collected multi-traversal data by modeling a shared static geometry while separately handling dynamic elements and appearance variations. Our method employs a multi-traversal dynamic scene graph with a shared static node and traversal-specific dynamic nodes, complemented by color correction nodes with learnable spherical harmonics coefficient residuals. This approach enables high-fidelity novel view synthesis and provides flexibility to navigate any viewpoint. We conduct extensive experiments on a large-scale driving dataset, nuPlan, with multi-traversal data. Our results demonstrate that MTGS improves LPIPS by 23.5% and geometry accuracy by 46.3% compared to single-traversal baselines. The code and data will be made available to the public.
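The multi-traversal scene graph can be sketched as a data structure: one shared static node, per-traversal dynamic nodes, and per-traversal color-correction residuals added to the spherical-harmonics coefficients at render time. The tensor fields are illustrative placeholders for the full 3DGS parameterization, and rasterization itself is omitted.

```python
# Sketch: a multi-traversal scene graph with a shared static node and per-traversal extras.
from dataclasses import dataclass, field
from typing import Dict, List
import torch

@dataclass
class GaussianNode:
    means: torch.Tensor        # (N, 3) Gaussian positions
    sh_coeffs: torch.Tensor    # (N, C) view-dependent color coefficients

@dataclass
class MultiTraversalSceneGraph:
    static: GaussianNode                                    # shared across all traversals
    dynamic: Dict[str, List[GaussianNode]] = field(default_factory=dict)
    color_residuals: Dict[str, torch.Tensor] = field(default_factory=dict)

    def gather(self, traversal_id: str) -> GaussianNode:
        """Compose the Gaussians to render for one traversal."""
        sh = self.static.sh_coeffs + self.color_residuals.get(
            traversal_id, torch.zeros_like(self.static.sh_coeffs))
        nodes = [GaussianNode(self.static.means, sh)] + self.dynamic.get(traversal_id, [])
        return GaussianNode(torch.cat([n.means for n in nodes]),
                            torch.cat([n.sh_coeffs for n in nodes]))

static = GaussianNode(torch.randn(1000, 3), torch.randn(1000, 16))
graph = MultiTraversalSceneGraph(static)
graph.dynamic["morning_run"] = [GaussianNode(torch.randn(50, 3), torch.randn(50, 16))]
graph.color_residuals["morning_run"] = torch.zeros(1000, 16, requires_grad=True)
scene = graph.gather("morning_run")
print(scene.means.shape, scene.sh_coeffs.shape)             # (1050, 3) and (1050, 16)
```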
PDF Download Link:
https://arxiv.org/pdf/2503.12552v3.pdf
GitHub:
• https://github.com/OpenDriveLab/MTGS
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
ImgEdit: A Unified Image Editing Dataset and Benchmark
Article Date: 26 May 2025
Article Description:
Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, and a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model that uses a vision-language model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic test suite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on https://github.com/PKU-YuanGroup/ImgEdit.
PDF Download Link:
https://arxiv.org/pdf/2505.20275v1.pdf
GitHub:
• https://github.com/pku-yuangroup/imgedit
Datasets:
• MagicBrush
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data
Article Date: 24 May 2025
Article Description:
Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to the commercial state-of-the-art model GPT-4o.
PDF Download Link:
https://arxiv.org/pdf/2505.18445v1.pdf
GitHub:
• https://github.com/showlab/omniconsistency
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
SWE-bench Goes Live
🖥 Github: https://github.com/microsoft/swe-bench-live
📕 Paper: https://arxiv.org/abs/2505.23419v1
🔗 Tasks: https://paperswithcode.com/dataset/humaneval
For more data science resources:
✓ https://t.iss.one/DataScienceT