Article Title:
UniVST: A Unified Framework for Training-free Localized Video Style Transfer
Article Date: 26 Oct 2024
Article Description:
This paper presents UniVST, a unified framework for localized video style transfer based on diffusion models. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos. The contributions of this paper comprise: (1) A point-matching mask propagation strategy that leverages the feature maps from the DDIM inversion. This streamlines the model's architecture by obviating the need for tracking models. (2) A training-free AdaIN-guided video style transfer mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window consistent smoothing scheme that harnesses optical flow within the pixel representation and refines predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in stylized videos. Our proposed UniVST has been validated to be superior to existing methods in both quantitative and qualitative evaluations. It adeptly addresses the challenges of preserving the primary object's style while ensuring temporal consistency and detail preservation. Our code is available at https://github.com/QuanjianSong/UniVST.
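As a concrete illustration of the AdaIN operation mentioned in point (2), below is a minimal PyTorch sketch that aligns the channel-wise statistics of content features to a style reference and restricts the effect to a mask; the function names and mask handling are illustrative assumptions, not the official UniVST implementation.

```python
# Minimal AdaIN sketch (illustrative, not the official UniVST code).
# Aligns the channel-wise mean/std of content features to those of style features.
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """content, style: (B, C, H, W) latent or feature tensors."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

def localized_adain(content, style, mask):
    """Apply AdaIN only inside a binary mask (B, 1, H, W); keep the rest unchanged.
    This mirrors the 'localized' idea in the description, not the exact UniVST scheme."""
    stylized = adain(content, style)
    return mask * stylized + (1 - mask) * content

if __name__ == "__main__":
    content = torch.randn(1, 4, 64, 64)   # e.g. a latent video frame
    style = torch.randn(1, 4, 64, 64)     # latent of the style image
    mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
    print(localized_adain(content, style, mask).shape)
```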
Article Download Link: https://arxiv.org/pdf/2410.20084v3.pdf
GitHub:
- https://github.com/QuanjianSong/UniVST
Datasets:
- WikiArt
- LAION-Aesthetics V2 6.5+
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
syftr: Pareto-Optimal Generative AI
Article Date: 26 May 2025
Article Description:
Retrieval-Augmented Generation (RAG) pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, building effective RAG flows is complex, requiring careful selection among vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. The challenge deepens with the rise of agentic paradigms. Modules like verifiers, rewriters, and rerankers, each with intricate hyperparameter dependencies, have to be carefully tuned. Balancing tradeoffs between latency, accuracy, and cost becomes increasingly difficult in performance-sensitive applications. We introduce syftr, a framework that performs efficient multi-objective search over a broad space of agentic and non-agentic RAG configurations. Using Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple RAG benchmarks, syftr finds flows that are, on average, approximately 9 times cheaper while preserving most of the accuracy of the most accurate flows on the Pareto frontier. Furthermore, syftr's ability to design and optimize flows allows new modules to be integrated, making it even easier and faster to realize high-performing generative AI pipelines.
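To make "Pareto-optimal flows" concrete, here is a small self-contained sketch that keeps only the non-dominated (accuracy, cost) points among a set of evaluated RAG configurations; the candidate flows and their scores are made up, and this filter is not the syftr optimizer itself, which uses multi-objective Bayesian Optimization.

```python
# Toy Pareto-front filter over evaluated RAG flows (illustrative, not the syftr optimizer).
# A flow is kept if no other flow is at least as accurate and at least as cheap,
# with one of the two strictly better.
def pareto_front(flows):
    front = []
    for f in flows:
        dominated = any(
            g["accuracy"] >= f["accuracy"] and g["cost"] <= f["cost"]
            and (g["accuracy"] > f["accuracy"] or g["cost"] < f["cost"])
            for g in flows
        )
        if not dominated:
            front.append(f)
    return sorted(front, key=lambda f: f["cost"])

if __name__ == "__main__":
    evaluated = [  # hypothetical (accuracy, cost-per-query) results
        {"name": "dense+large-llm", "accuracy": 0.82, "cost": 0.040},
        {"name": "hybrid+small-llm", "accuracy": 0.78, "cost": 0.004},
        {"name": "agentic+reranker", "accuracy": 0.83, "cost": 0.120},
        {"name": "naive", "accuracy": 0.70, "cost": 0.006},
    ]
    for flow in pareto_front(evaluated):
        print(flow)   # "naive" is dominated and dropped
```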
Article Download Link: https://arxiv.org/pdf/2505.20266v1.pdf
GitHub:
- https://github.com/datarobot/syftr
Datasets:
- FinanceBench
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Article Date: 20 May 2025
Article Description:
Document image parsing is challenging due to its complex, intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin.
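The analyze-then-parse paradigm can be pictured with the sketch below: stage one returns layout anchors in reading order, and stage two parses each anchor in parallel. The anchors and the parse_element stub are hypothetical placeholders, not the Dolphin model or its API.

```python
# Sketch of an analyze-then-parse pipeline (hypothetical stubs, not the Dolphin model).
from concurrent.futures import ThreadPoolExecutor

def analyze_layout(page_image):
    """Stage 1 (stub): return layout anchors in reading order with task-specific prompts."""
    return [
        {"type": "paragraph", "bbox": (50, 40, 550, 160), "prompt": "Read the text in the box."},
        {"type": "table",     "bbox": (50, 180, 550, 420), "prompt": "Parse the table to HTML."},
        {"type": "formula",   "bbox": (60, 440, 300, 480), "prompt": "Parse the formula to LaTeX."},
    ]

def parse_element(page_image, anchor):
    """Stage 2 (stub): parse one element given its anchor and prompt."""
    return {"type": anchor["type"], "content": f"<parsed {anchor['type']}>"}

def parse_document(page_image):
    anchors = analyze_layout(page_image)             # sequential: one pass over the page
    with ThreadPoolExecutor() as pool:               # parallel: one parsing call per anchor
        results = list(pool.map(lambda a: parse_element(page_image, a), anchors))
    return results

if __name__ == "__main__":
    print(parse_document(page_image=None))
```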
Article Download Link: https://arxiv.org/pdf/2505.14059v1.pdf
GitHub:
- https://github.com/bytedance/dolphin
Datasets:
- PubTabNet
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis
Article Date: 29 Nov 2024
Article Description:
Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. In addition, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control over the generation process and correction of the results. Moreover, we jointly optimize the holistic framework to enable streaming processing, real-time inference, and low first-frame delay, offering functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.
Article Download Link: https://arxiv.org/pdf/2411.19509v3.pdf
GitHub:
- https://github.com/antgroup/ditto-talkinghead
- https://github.com/KwaiVGI/LivePortrait
Datasets:
- No datasets information available
==================================
For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL
Article Date: 17 Feb 2025
Article Description:
Text-to-SQL (Text2SQL) aims to map natural language questions to executable SQL queries. Although large language models (LLMs) have driven significant progress, current approaches struggle with poor transferability to open-source LLMs, limited robustness against logic and function errors in complex queries, and inefficiencies in structured search. We introduce SQL-o1, a self-reward-driven heuristic search framework built on an agent-based architecture to enhance model reasoning capabilities. SQL-o1 leverages Monte Carlo Tree Search (MCTS) for structured, multi-step exploration, and incorporates a dynamic pruning strategy to accelerate inference without sacrificing accuracy. On the Spider and Bird benchmarks, SQL-o1 achieves a +10.8 execution accuracy improvement on the complex Bird dataset, surpassing even GPT-4-based models. Notably, it exhibits strong few-shot generalization and robust cross-model transferability across open-source LLMs. Our code is available at: https://github.com/ShuaiLyu0110/SQL-o1.
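For readers unfamiliar with MCTS, the sketch below shows the standard UCT rule used to pick which partial SQL continuation to explore next; the node values and candidate clauses are hypothetical, and SQL-o1's self-reward model and pruning strategy are not shown.

```python
# UCT selection used in MCTS (generic sketch; SQL-o1's self-reward model is not shown).
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Upper Confidence bound for Trees: mean value plus an exploration bonus."""
    if visits == 0:
        return float("inf")        # always try unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """Pick the partial-SQL continuation with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))

if __name__ == "__main__":
    # Hypothetical children of a node holding the partial query "SELECT name FROM singer".
    children = [
        {"sql": "... WHERE age > 30",    "value": 2.1, "visits": 3},
        {"sql": "... ORDER BY age",      "value": 1.2, "visits": 2},
        {"sql": "... GROUP BY country",  "value": 0.0, "visits": 0},
    ]
    print(select_child(children, parent_visits=5)["sql"])   # unvisited child is explored first
```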
Article Download Link: https://arxiv.org/pdf/2502.11741v3.pdf
GitHub:
- https://github.com/shuailyu0110/sql-o1
Datasets:
- No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Article Date: 29 May 2025
Article Description:
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
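A rough sketch of how answer correctness and a token-level probability score could be combined when selecting preference pairs is given below; the scoring formula is a simplified stand-in, since PCPO's exact probability-consistency metric is defined in the paper.

```python
# Simple preference-pair selection sketch combining answer correctness with a
# token-probability score. This is a hedged stand-in: PCPO's actual token-level
# consistency metric is more involved and is defined in the paper.
import math

def mean_token_prob(token_logprobs):
    """Average per-token probability of a sampled response."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def build_preference_pair(candidates):
    """candidates: list of dicts with 'correct' (bool) and 'token_logprobs' (list of floats)."""
    scored = [(c["correct"], mean_token_prob(c["token_logprobs"]), c) for c in candidates]
    chosen = max((s for s in scored if s[0]), key=lambda s: s[1], default=None)
    rejected = min(scored, key=lambda s: (s[0], s[1]))   # prefer incorrect, low-probability responses
    return (chosen[2] if chosen else None), rejected[2]

if __name__ == "__main__":
    candidates = [  # hypothetical sampled solutions with per-token log-probabilities
        {"correct": True,  "token_logprobs": [-0.1, -0.3, -0.2]},
        {"correct": True,  "token_logprobs": [-0.9, -1.2, -0.8]},
        {"correct": False, "token_logprobs": [-0.4, -0.5, -0.6]},
    ]
    chosen, rejected = build_preference_pair(candidates)
    print(chosen, rejected)
```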
Article Download Link: https://arxiv.org/pdf/2505.23540v1.pdf
GitHub:
- https://github.com/yunqiaoyang/pcpo
Datasets:
- No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Article Date: 29 May 2025
Article Description:
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
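To illustrate the fast, parallel decoding that discrete diffusion enables, here is a toy confidence-based unmasking loop in the spirit of masked generative transformers; the random "model" is a placeholder, and this is not Muddit's actual sampler.

```python
# Toy parallel unmasking loop (placeholder model, not Muddit's sampler).
# At each step, the model scores every masked position and the most confident
# fraction of them is filled in, so many tokens are decoded per forward pass.
import numpy as np

MASK, VOCAB, LENGTH, STEPS = -1, 1000, 16, 4
rng = np.random.default_rng(0)

def model(tokens):
    """Placeholder: return logits over VOCAB for every position."""
    return rng.normal(size=(len(tokens), VOCAB))

tokens = np.full(LENGTH, MASK)
for step in range(STEPS):
    logits = model(tokens)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    pred = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)
    masked = np.where(tokens == MASK)[0]
    keep = max(1, int(np.ceil(len(masked) * (step + 1) / STEPS)))   # unmask more positions each step
    chosen = masked[np.argsort(-conf[masked])[:keep]]
    tokens[chosen] = pred[chosen]
print(tokens)   # fully decoded after STEPS forward passes, not LENGTH
```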
Article Download Link: https://arxiv.org/pdf/2505.23606v1.pdf
GitHub:
- https://github.com/m-e-agi-lab/muddit
- https://github.com/viiika/Meissonic
Datasets:
- No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Article Date: 29 May 2025
Article Description:
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
Article Download Link: https://arxiv.org/pdf/2505.23759v1.pdf
GitHub:
- https://github.com/kyunnilee/visual_puzzles
Datasets:
- No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
Article Date: 28 May 2025
Article Description:
Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
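The parameter-efficient recipe (freeze the image text spotter, train only a lightweight tracker) can be summarized with the PyTorch sketch below; the spotter and tracker modules are toy placeholders, not the GoMatching++ architecture.

```python
# Parameter-efficient training sketch: freeze a pretrained spotter, train only a small
# tracker head (toy placeholder modules, not the actual GoMatching++ architecture).
import torch
import torch.nn as nn

spotter = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())        # stands in for the frozen spotter
tracker = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))  # lightweight, trainable

for p in spotter.parameters():          # keep the spotter frozen
    p.requires_grad = False
spotter.eval()

optimizer = torch.optim.AdamW(tracker.parameters(), lr=1e-4)   # only tracker params are optimized

frames = torch.randn(4, 3, 64, 64)      # dummy video frames
target = torch.randn(4, 8)              # dummy association targets
with torch.no_grad():
    feats = spotter(frames)             # frozen features from the image text spotter
loss = nn.functional.mse_loss(tracker(feats), target)
loss.backward()
optimizer.step()
print(f"trainable params: {sum(p.numel() for p in tracker.parameters())}")
```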
Article Download Link: https://arxiv.org/pdf/2505.22228v1.pdf
GitHub:
- https://github.com/hxyz-123/gomatching
Datasets:
- ICDAR 2013
- BOVText
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Article Date: 28 May 2025
Article Description:
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potentially promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting -- when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.
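One way to picture a surrogate that inherits the shallow layers of a target LLM is the toy sketch below, where a small decoder reuses the embedding, the first k blocks, and the output head of a larger one by reference; the toy modules are assumptions for illustration, not the paper's Llama-based setup.

```python
# Toy surrogate construction: share the embedding, the first k blocks, and the head
# of a larger decoder (illustrative modules, not the paper's Llama setup).
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        x = self.embed(ids)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

def make_surrogate(target: ToyLM, k: int) -> ToyLM:
    """The surrogate shares the target's embedding space: it reuses the embedding,
    the first k blocks, and the output head by reference."""
    surrogate = ToyLM(vocab=target.embed.num_embeddings,
                      dim=target.embed.embedding_dim, n_layers=0)
    surrogate.embed = target.embed
    surrogate.blocks = nn.ModuleList(list(target.blocks)[:k])
    surrogate.head = target.head
    return surrogate

if __name__ == "__main__":
    target = ToyLM(n_layers=8)
    surrogate = make_surrogate(target, k=2)     # train the vision encoder against this
    ids = torch.randint(0, 1000, (1, 5))
    print(surrogate(ids).shape, target(ids).shape)
```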
PDF Download Link:
https://arxiv.org/pdf/2505.22664v1.pdf
GitHub:
• https://github.com/facebookresearch/zero
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment
Article Date: 30 Mar 2021
Article Description:
Alphas are stock prediction models capturing trading signals in a stock market. A set of effective alphas can generate weakly correlated high returns to diversify the risk. Existing alphas can be categorized into two classes: Formulaic alphas are simple algebraic expressions of scalar features, and thus can generalize well and be mined into a weakly correlated set. Machine learning alphas are data-driven models over vector and matrix features. They are more predictive than formulaic alphas, but are too complex to mine into a weakly correlated set. In this paper, we introduce a new class of alphas to model scalar, vector, and matrix features which possess the strengths of these two existing classes. The new alphas predict returns with high accuracy and can be mined into a weakly correlated set. In addition, we propose a novel alpha mining framework based on AutoML, called AlphaEvolve, to generate the new alphas. To this end, we first propose operators for generating the new alphas and selectively injecting relational domain knowledge to model the relations between stocks. We then accelerate the alpha mining by proposing a pruning technique for redundant alphas. Experiments show that AlphaEvolve can evolve initial alphas into the new alphas with high returns and weak correlations.
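The "weakly correlated set" requirement can be illustrated with a greedy filter that drops any alpha whose return series is too correlated with one already kept; the return data is random, and this simple filter is only an illustration, not AlphaEvolve's pruning technique for redundant alphas.

```python
# Greedy correlation filter: keep alphas whose return series are weakly correlated
# with those already kept (simple illustration; not AlphaEvolve's pruning technique).
import numpy as np

def prune_correlated(alpha_returns: dict, max_corr: float = 0.3) -> list:
    """alpha_returns: name -> 1-D array of daily returns produced by that alpha."""
    kept = []
    for name, series in alpha_returns.items():
        if all(abs(np.corrcoef(series, alpha_returns[k])[0, 1]) <= max_corr for k in kept):
            kept.append(name)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=250)
    alphas = {                               # hypothetical daily return series
        "alpha_momentum": base,
        "alpha_momentum_v2": base + 0.05 * rng.normal(size=250),   # nearly a duplicate
        "alpha_value": rng.normal(size=250),
    }
    print(prune_correlated(alphas))          # the near-duplicate should be dropped
```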
PDF Download Link:
https://arxiv.org/pdf/2103.16196v2.pdf
GitHub:
• https://github.com/codelion/openevolve
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models
Article Date: 16 May 2025
Article Description:
We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Additionally, we introduce GenoAdv, a new adversarial sample dataset designed to improve GFM safety. Empirically, classification models exhibit greater robustness to adversarial perturbations compared to generative models, highlighting the impact of task type on model vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.
PDF Download Link:
https://arxiv.org/pdf/2505.10983v1.pdf
GitHub:
• https://github.com/MAGICS-LAB/GenoArmory
Datasets:
• GenoAdv
• GUE
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Article Date: 6 May 2025
Article Description:
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
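A minimal sketch of the multi-token-prediction idea: several lightweight heads read the same hidden state and each predicts one of the next audio tokens, so a single forward pass emits several tokens. The module below is a generic stand-in, not the MCTP module from the paper.

```python
# Generic multi-token prediction head: k small heads predict the next k audio tokens
# from one hidden state (a stand-in, not the paper's MCTP module).
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, hidden_dim=256, audio_vocab=1024, num_tokens=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, audio_vocab) for _ in range(num_tokens))

    def forward(self, hidden_state):            # (batch, hidden_dim) from the LLM's last layer
        logits = [head(hidden_state) for head in self.heads]
        return torch.stack(logits, dim=1)       # (batch, num_tokens, audio_vocab)

if __name__ == "__main__":
    head = MultiTokenHead()
    h = torch.randn(2, 256)                     # dummy hidden states for two sequences
    next_audio_tokens = head(h).argmax(dim=-1)  # 4 audio tokens per sequence in a single pass
    print(next_audio_tokens.shape)              # torch.Size([2, 4])
```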
PDF Download Link:
https://arxiv.org/pdf/2505.03739v1.pdf
GitHub:
• https://github.com/vita-mllm/vita-audio
Datasets:
• LibriSpeech
• TriviaQA
• LibriTTS
• AISHELL-1
• FLEURS
• VoxPopuli
• LIMA
• GigaSpeech
• Multilingual LibriSpeech
• AISHELL-2
• WenetSpeech
• MathInstruct
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
Article Date: 27 Mar 2025
Article Description:
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
PDF Download Link:
https://arxiv.org/pdf/2503.21761v1.pdf
GitHub:
• https://github.com/Davidyao99/uni4d
Datasets:
• KITTI
• DAVIS
• TUM RGB-D
• MPI Sintel
• Bonn RGB-D Dynamic
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Article Date: 27 Apr 2025
Article Description:
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
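The tunable ensemble can be pictured as a weighted average of normalized confidence scores, as in the sketch below; the scorer names and weights are made up, and the actual scorers and tuning utilities live in the UQLM toolkit rather than in this snippet.

```python
# Generic weighted ensemble of response-level confidence scores in [0, 1]
# (scorer names and weights are hypothetical; the real scorers live in the UQLM toolkit).
def ensemble_confidence(scores: dict, weights: dict) -> float:
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total

if __name__ == "__main__":
    scores = {                      # hypothetical per-scorer confidences for one response
        "black_box_consistency": 0.72,
        "white_box_token_prob": 0.61,
        "llm_judge": 0.90,
    }
    weights = {"black_box_consistency": 0.5, "white_box_token_prob": 0.2, "llm_judge": 0.3}
    conf = ensemble_confidence(scores, weights)
    print(f"ensemble confidence: {conf:.3f}")   # flag as a possible hallucination below a threshold
```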
PDF Download Link:
https://arxiv.org/pdf/2504.19254v2.pdf
GitHub:
• https://github.com/cvs-health/uqlm
Datasets:
• GSM8K
• SVAMP
• PopQA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
s3: You Don't Need That Much Data to Train a Search Agent via RL
Article Date: 20 May 2025
Article Description:
Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
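The Gain Beyond RAG reward can be written down directly: score the generator's answer given the searcher's retrieved context and subtract its score under naive RAG for the same question. The token-overlap scorer below is a simple stand-in, not the paper's exact evaluation function.

```python
# Gain-Beyond-RAG style reward: generation quality with the learned searcher's context
# minus quality with naive RAG context (the token-overlap scorer is a simple stand-in).
def soft_match(answer: str, gold: str) -> float:
    a, g = set(answer.lower().split()), set(gold.lower().split())
    return len(a & g) / max(len(g), 1)

def gain_beyond_rag(answer_with_searcher: str, answer_with_naive_rag: str, gold: str) -> float:
    return soft_match(answer_with_searcher, gold) - soft_match(answer_with_naive_rag, gold)

if __name__ == "__main__":
    gold = "the eiffel tower is in paris"
    reward = gain_beyond_rag(
        answer_with_searcher="The Eiffel Tower is located in Paris",
        answer_with_naive_rag="The Eiffel Tower is a famous landmark",
        gold=gold,
    )
    print(f"reward: {reward:+.2f}")   # positive when the searcher's context helps generation
```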
PDF Download Link:
https://arxiv.org/pdf/2505.14146v1.pdf
GitHub:
• https://github.com/pat-jj/s3
Datasets:
• Natural Questions
• TriviaQA
• HotpotQA
• MedQA
• PubMedQA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Vision as LoRA
Article Date: 26 Mar 2025
Article Description:
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, codes, and model weights will be released at https://github.com/Hon-Wong/VoRA.
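The claim that the added parameters "can be seamlessly merged into the LLM during inference" is the standard LoRA merge, sketched below for a single linear layer; the rank and dimensions are arbitrary, and VoRA's vision-specific placement of these layers is not shown.

```python
# Standard LoRA linear layer with a merge step: W' = W + (alpha / r) * B @ A.
# Rank and dimensions are arbitrary; VoRA's vision-specific placement is not shown.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, r))
        self.scale = alpha / r
        self.base.weight.requires_grad = False      # only A and B are trained

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    @torch.no_grad()
    def merge(self):
        """Fold the low-rank update into the frozen weight; inference then uses base only."""
        self.base.weight += (self.B @ self.A) * self.scale

if __name__ == "__main__":
    layer = LoRALinear(32, 64)
    layer.B.data.normal_(std=0.01)                  # pretend training moved B off zero
    x = torch.randn(2, 32)
    before = layer(x)
    layer.merge()
    after = layer.base(x)                           # merged weights reproduce the LoRA output
    print(torch.allclose(before, after, atol=1e-5))
```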
PDF Download Link:
https://arxiv.org/pdf/2503.20680v1.pdf
GitHub:
• https://github.com/hon-wong/vora
Datasets:
• MM-Vet
• Google Landmarks Dataset v2
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Harnessing the Universal Geometry of Embeddings
Article Date: 18 May 2025
Article Description:
We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.
PDF Download Link:
https://arxiv.org/pdf/2505.12540v2.pdf
GitHub:
• https://github.com/rjha18/vec2vec
• https://github.com/zhaoolee/garss
Datasets:
• Natural Questions
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MTGS: Multi-Traversal Gaussian Splatting
Article Date: 16 Mar 2025
Article Description:
Multi-traversal data, commonly collected through daily commutes or by self-driving fleets, provides multiple viewpoints for scene reconstruction within a road block. This data offers significant potential for high-quality novel view synthesis, which is crucial for applications such as autonomous vehicle simulators. However, inherent challenges in multi-traversal data often result in suboptimal reconstruction quality, including variations in appearance and the presence of dynamic objects. To address these issues, we propose Multi-Traversal Gaussian Splatting (MTGS), a novel approach that reconstructs high-quality driving scenes from arbitrarily collected multi-traversal data by modeling a shared static geometry while separately handling dynamic elements and appearance variations. Our method employs a multi-traversal dynamic scene graph with a shared static node and traversal-specific dynamic nodes, complemented by color correction nodes with learnable spherical harmonics coefficient residuals. This approach enables high-fidelity novel view synthesis and provides flexibility to navigate any viewpoint. We conduct extensive experiments on a large-scale driving dataset, nuPlan, with multi-traversal data. Our results demonstrate that MTGS improves LPIPS by 23.5% and geometry accuracy by 46.3% compared to single-traversal baselines. The code and data will be made publicly available.
PDF Download Link:
https://arxiv.org/pdf/2503.12552v3.pdf
GitHub:
• https://github.com/OpenDriveLab/MTGS
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT