Article Title:
Visual Causal Scene Refinement for Video Question Answering
Article Date: 7 May 2023
Article Description:
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). Particularly, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
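As a rough illustration of the QGR idea, the sketch below lets a pooled question embedding attend over per-frame features to produce a question-conditioned segment feature and per-frame relevance weights. Shapes, dimensions, and the single cross-attention layer are assumptions for illustration, not the released VCSR code.

import torch
import torch.nn as nn

class QuestionGuidedRefiner(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats, question_feat):
        # frame_feats: (B, T, D) per-frame visual features; question_feat: (B, D) pooled question embedding
        q = question_feat.unsqueeze(1)                  # question acts as the query over frames
        refined, weights = self.attn(q, frame_feats, frame_feats)
        return refined.squeeze(1), weights.squeeze(1)   # segment feature (B, D), per-frame relevance (B, T)

refiner = QuestionGuidedRefiner()
segment, relevance = refiner(torch.randn(2, 16, 512), torch.randn(2, 512))
print(segment.shape, relevance.shape)   # torch.Size([2, 512]) torch.Size([2, 16])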
PDF Download Link:
https://arxiv.org/pdf/2305.04224v2.pdf
GitHub:
• https://github.com/yangliu9208/vcsr
• https://github.com/hcplab-sysu/causal-vlreasoning
Datasets:
• NExT-QA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
Article Date: 23 May 2025
Article Description:
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
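The window-local attention below is a toy sketch of the SSA idea: only occupied voxels carry tokens, and each token attends within its own spatial window. The Python bucketing, window size, and single-head attention are illustrative assumptions; the released implementation differs.

import torch
import torch.nn.functional as F

def spatial_sparse_attention(coords, feats, window=8):
    # coords: (N, 3) integer coordinates of occupied voxels; feats: (N, D) their token features
    out = torch.zeros_like(feats)
    buckets = {}
    for i, c in enumerate(coords):
        buckets.setdefault(tuple((c // window).tolist()), []).append(i)   # assign tokens to spatial windows
    for idx in buckets.values():                                          # attend only within each window
        idx = torch.tensor(idx)
        x = feats[idx]
        attn = F.softmax(x @ x.t() / x.shape[-1] ** 0.5, dim=-1)
        out[idx] = attn @ x
    return out

coords = torch.randint(0, 64, (1000, 3))
feats = torch.randn(1000, 64)
print(spatial_sparse_attention(coords, feats).shape)   # torch.Size([1000, 64])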
PDF Download Link:
https://arxiv.org/pdf/2505.17412v2.pdf
GitHub:
• https://github.com/DreamTechAI/Direct3D-S2
Datasets:
• ShapeNet
• Objaverse
• Objaverse-XL
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Sparsified State-Space Models are Efficient Highway Networks
🔹 Publication Date: Published on May 27
🔹 Abstract:
Simba, a hierarchical sparsification method for state-space models, enhances efficiency and information flow in natural language tasks by pruning tokens more aggressively in upper layers. AI-generated summary: State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
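The sketch below illustrates one way to read the pruning criterion: score each token by how much of its contribution survives the remaining recurrence steps (a product of per-step decays in an assumed diagonal SSM), combine that with token magnitude, and keep fewer tokens in upper layers. The scoring formula and keep ratios are illustrative, not Simba's exact criterion.

import torch

def prune_tokens(x, decay, keep_ratio):
    # x: (T, D) token states; decay: (T,) per-step recurrence decay in (0, 1)
    # survival[t] ~ product of decays applied after step t, i.e. how much of token t
    # can still reach the final output through the recurrence
    log_survival = torch.flip(torch.cumsum(torch.flip(decay.log(), [0]), 0), [0]) - decay.log()
    scores = log_survival + x.norm(dim=-1).log()      # combine survival with token magnitude
    k = max(1, int(keep_ratio * x.shape[0]))
    keep = scores.topk(k).indices.sort().values       # keep top-k tokens, preserving temporal order
    return x[keep], keep

x, decay = torch.randn(128, 64), torch.rand(128) * 0.5 + 0.5
for ratio in (0.9, 0.7, 0.5):                         # upper layers pruned more aggressively
    x, kept = prune_tokens(x, decay, ratio)
    decay = decay[kept]
print(x.shape)                                        # roughly 128 * 0.9 * 0.7 * 0.5 tokens remain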
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.20698
• PDF: https://arxiv.org/pdf/2505.20698
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
LeVo: High-Quality Song Generation with Multi-Preference Alignment
Article Date: 9 Jun 2025
Article Description:
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of modeling two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/. Code is released at https://github.com/tencent-ailab/songgeneration.
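For reference, the standard DPO objective that the multi-preference alignment stage builds on fits in a few lines; LeVo's semi-automatic preference-data construction and music-token details are not reproduced here.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Each argument: (B,) summed log-probabilities of a full token sequence under the
    # trained policy (logp_*) or the frozen reference model (ref_logp_*).
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())   # lower when the policy prefers the chosen sample more than the reference does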
PDF Download Link:
https://arxiv.org/pdf/2506.07520v2.pdf
GitHub:
• https://github.com/tencent-ailab/songgeneration
Datasets:
• 100style
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Forwarded from Python | Machine Learning | Coding | R
This channel is for Programmers, Coders, and Software Engineers.
0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages
✅ https://t.iss.one/addlist/8_rRW2scgfRhOTc0
✅ https://t.iss.one/Codeprogrammer
🔹 Title:
MoCha: Towards Movie-Grade Talking Character Synthesis
🔹 Publication Date: Published on Mar 30
🔹 Abstract:
MoCha generates realistic talking character animations from speech and text using a speech-video attention mechanism and joint training on speech-labeled and text-labeled data, enabling multi-character conversations and superior realism. AI-generated summary: Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.
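The sketch below shows the general shape of a speech-video window attention mask: each video token may only attend to speech tokens near its aligned time step. The linear alignment and window size are assumptions, not MoCha's exact scheme.

import torch

def speech_window_mask(n_video, n_speech, window=4):
    mask = torch.full((n_video, n_speech), float("-inf"))
    for v in range(n_video):
        center = round(v * (n_speech - 1) / max(n_video - 1, 1))  # assumed linear time alignment
        lo, hi = max(0, center - window), min(n_speech, center + window + 1)
        mask[v, lo:hi] = 0.0   # allowed positions get 0, everything else -inf before the softmax
    return mask

print(speech_window_mask(6, 24, window=3))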
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.23307
• PDF: https://arxiv.org/pdf/2503.23307
• Project Page: https://congwei1230.github.io/MoCha/
• Github: https://github.com/congwei1230/MoChaBench
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/CongWei1230/MoCha-Generation-on-MoChaBench-Visualizer
• https://huggingface.co/datasets/CongWei1230/MoChaBench
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution
🔹 Publication Date: Published on Jun 24
🔹 Abstract:
Researchers propose design principles for cascaded video super-resolution models to improve high-resolution video generation by introducing degradation strategies, timestep sampling, noise augmentation, and interleaving temporal units with sparse local attention. AI-generated summary: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the latter cascaded VSR models, which are underexplored currently. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
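The pipeline below is an illustrative degradation strategy for building (LR, HR) training pairs with noise augmentation on the LR input; the blur kernel, downscale factor, and noise level are placeholders, not the paper's recipe.

import torch
import torch.nn.functional as F

def make_training_pair(hr_frames, scale=4, blur_sigma=1.0, noise_std=0.05):
    # hr_frames: (T, C, H, W) clean high-resolution frames in [0, 1]
    c = hr_frames.shape[1]
    k = torch.arange(5).float() - 2
    g = torch.exp(-k ** 2 / (2 * blur_sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / g.sum() ** 2            # 5x5 Gaussian kernel, sums to 1
    kernel = kernel.expand(c, 1, 5, 5).contiguous()
    blurred = F.conv2d(hr_frames, kernel, padding=2, groups=c)   # soften frames to mimic base-model outputs
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic", align_corners=False)
    lr = (lr + noise_std * torch.randn_like(lr)).clamp(0, 1)     # noise augmentation on the LR input
    return lr, hr_frames

lr, hr = make_training_pair(torch.rand(8, 3, 256, 256))
print(lr.shape, hr.shape)   # torch.Size([8, 3, 64, 64]) torch.Size([8, 3, 256, 256])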
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.19838
• PDF: https://arxiv.org/pdf/2506.19838
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Forwarded from Python | Machine Learning | Coding | R
Top 50 LLM Interview Questions!
A comprehensive resource that covers traditional ML basics, model architectures, real-world case studies, and theoretical foundations.
👇👇👇👇👇👇
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk 📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Forwarded from Python | Machine Learning | Coding | R
LLM Interview Questions.pdf
71.2 KB
Top 50 LLM Interview Questions!
#LLM #AIInterviews #MachineLearning #DeepLearning #NLP #LLMInterviewPrep #ModelArchitectures #AITheory #TechInterviews #MLBasics #InterviewQuestions #LargeLanguageModels
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk 📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
🔹 Title:
Is a PET all you need? A multi-modal study for Alzheimer's disease using 3D CNNs
🔹 Publication Date: Published on Jul 5, 2022
🔹 Abstract:
A systematic evaluation of multi-modal deep neural networks for Alzheimer's disease diagnosis shows that FDG-PET performs better than sMRI and that multi-modal fusion does not improve accuracy. AI-generated summary: Alzheimer's Disease (AD) is the most common form of dementia and often difficult to diagnose due to the multifactorial etiology of dementia. Recent works on neuroimaging-based computer-aided diagnosis with deep neural networks (DNNs) showed that fusing structural magnetic resonance images (sMRI) and fluorodeoxyglucose positron emission tomography (FDG-PET) leads to improved accuracy in a study population of healthy controls and subjects with AD. However, this result conflicts with the established clinical knowledge that FDG-PET better captures AD-specific pathologies than sMRI. Therefore, we propose a framework for the systematic evaluation of multi-modal DNNs and critically re-evaluate single- and multi-modal DNNs based on FDG-PET and sMRI for binary healthy vs. AD, and three-way healthy/mild cognitive impairment/AD classification. Our experiments demonstrate that a single-modality network using FDG-PET performs better than MRI (accuracy 0.91 vs 0.87) and does not show improvement when combined. This conforms with the established clinical knowledge on AD biomarkers, but raises questions about the true benefit of multi-modal DNNs. We argue that future work on multi-modal fusion should systematically assess the contribution of individual modalities following our proposed evaluation framework. Finally, we encourage the community to go beyond healthy vs. AD classification and focus on differential diagnosis of dementia, where fusing multi-modal image information conforms with a clinical need.
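For context, a bare-bones single-modality 3D CNN classifier of the kind compared in the study (healthy / MCI / AD from a single PET or MRI volume) looks like this; the architecture is a placeholder, not the paper's network.

import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, n_classes=3):   # healthy / mild cognitive impairment / AD
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, volume):          # volume: (B, 1, D, H, W) FDG-PET or sMRI scan
        return self.classifier(self.features(volume).flatten(1))

logits = Simple3DCNN()(torch.randn(2, 1, 64, 64, 64))
print(logits.shape)   # torch.Size([2, 3])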
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2207.02094
• PDF: https://arxiv.org/pdf/2207.02094
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Solving Inequality Proofs with Large Language Models
🔹 Publication Date: Published on Jun 9
🔹 Abstract:
The investigation into inequality proving using large language models uncovers significant challenges in constructing rigorous proofs, revealing gaps between finding answers and generating valid step-wise solutions. AI-generated summary: Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
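The aggregation logic described above reduces to a simple rule: a solution counts as correct only when the final-answer judge and every step-wise judge pass. A stubbed sketch (judge internals omitted):

from dataclasses import dataclass
from typing import List

@dataclass
class JudgedSolution:
    final_answer_ok: bool           # verdict of the final-answer judge
    step_verdicts: List[bool]       # one verdict per step-wise judge (e.g. logic gaps, numeric errors)

def overall_correct(sol: JudgedSolution) -> bool:
    return sol.final_answer_ok and all(sol.step_verdicts)

sols = [JudgedSolution(True, [True, True, False, True]),   # right answer, flawed reasoning
        JudgedSolution(True, [True, True, True, True])]
print([overall_correct(s) for s in sols])                  # [False, True]
print(sum(overall_correct(s) for s in sols) / len(sols))   # overall accuracy under step-wise scrutiny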
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.07927
• PDF: https://arxiv.org/pdf/2506.07927
• Github: https://ineqmath.github.io/#visualization
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/AI4Math/IneqMath
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🙏💸 500$ FOR THE FIRST 500 WHO JOIN THE CHANNEL! 🙏💸
Join our channel today for free! Tomorrow it will cost 500$!
https://t.iss.one/+Cl8uwGkD0l5lMGNl
You can join at this link! 👆👇
https://t.iss.one/+Cl8uwGkD0l5lMGNl
Article Title:
TradingAgents: Multi-Agents LLM Financial Trading Framework
Article Date: 28 Dec 2024
Article Description:
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.
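A minimal orchestration sketch of the role-based pipeline described above; ask_llm is a hypothetical stub standing in for any chat-completion call and is not part of the released framework.

def ask_llm(role: str, prompt: str) -> str:
    # Placeholder: in practice this would call an LLM with a role-specific system prompt.
    return f"[{role}] analysis of: {prompt[:40]}..."

def trading_round(market_data: str) -> str:
    analysts = ["fundamental analyst", "sentiment analyst", "technical analyst"]
    reports = [ask_llm(r, market_data) for r in analysts]
    bull = ask_llm("bull researcher", "\n".join(reports))      # argue for entering the position
    bear = ask_llm("bear researcher", "\n".join(reports))      # argue against it
    debate = f"BULL: {bull}\nBEAR: {bear}"
    decision = ask_llm("trader", debate)                       # synthesize the debate into a trade
    return ask_llm("risk manager", f"Review and approve or veto: {decision}")

print(trading_round("AAPL daily OHLCV and news headlines for 2024-12-27"))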
PDF Download Link:
https://arxiv.org/pdf/2412.20138v7.pdf
GitHub:
• https://github.com/tauricresearch/tradingagents
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Configurable Preference Tuning with Rubric-Guided Synthetic Data
🔹 Publication Date: Published on Jun 13
🔹 Abstract:
Configurable Preference Tuning enables language models to dynamically adjust their behavior based on human-interpretable directives, using rubric-guided preference data for fine-tuning and inference-time modulation. AI-generated summary: Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models, are released at https://github.com/vicgalle/configurable-preference-tuning
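The sketch below shows one way rubric-conditioned preference pairs could be assembled in the spirit of CPT; the Rubric fields, system-prompt template, and scoring function are illustrative assumptions, not the released data format.

from dataclasses import dataclass

@dataclass
class Rubric:
    attribute: str      # e.g. "writing style"
    target: str         # e.g. "terse, technical, no filler"

def system_prompt(rubric: Rubric) -> str:
    return f"Adjust your {rubric.attribute} to be: {rubric.target}."

def make_preference_pair(prompt, response_a, response_b, score_fn, rubric):
    # Rank the two candidate responses under the rubric and store them as a DPO-style pair.
    sa, sb = score_fn(response_a, rubric), score_fn(response_b, rubric)
    chosen, rejected = (response_a, response_b) if sa >= sb else (response_b, response_a)
    return {"system": system_prompt(rubric), "prompt": prompt,
            "chosen": chosen, "rejected": rejected}

rubric = Rubric("writing style", "terse, technical, no filler")
pair = make_preference_pair("Explain DPO.",
                            "DPO optimizes a preference objective over chosen/rejected pairs.",
                            "Well, let me tell you a long story about preferences...",
                            lambda r, _: -len(r.split()),   # toy scorer: shorter is better here
                            rubric)
print(pair["chosen"])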
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.11702
• PDF: https://arxiv.org/pdf/2506.11702
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
🔹 Publication Date: Published on Mar 10
🔹 Abstract:
PLADIS leverages sparse attention in cross-attention layers to enhance pre-trained text-to-image diffusion models, improving text alignment and human preference without additional training. AI-generated summary: Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that need identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.
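The sketch below illustrates inference-time extrapolation between dense and sparse cross-attention, assuming a form like sparse + lambda * (sparse - dense); sparsemax stands in for the alpha-entmax transform and the scale is arbitrary, so this is not PLADIS's exact formulation.

import torch

def sparsemax(z, dim=-1):
    # Sparsemax (Martins & Astudillo, 2016), used here as a stand-in for a sparse attention transform.
    z_sorted, _ = torch.sort(z, descending=True, dim=dim)
    rng = torch.arange(1, z.shape[dim] + 1, device=z.device, dtype=z.dtype)
    z_cumsum = z_sorted.cumsum(dim)
    support = 1 + rng * z_sorted > z_cumsum
    k_z = support.sum(dim=dim, keepdim=True)
    tau = (z_cumsum.gather(dim, k_z - 1) - 1) / k_z
    return torch.clamp(z - tau, min=0)

def pladis_style_cross_attention(q, k, v, lam=2.0):
    # Assumed extrapolation of the form sparse + lam * (sparse - dense).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    dense = torch.softmax(scores, dim=-1)
    sparse = sparsemax(scores, dim=-1)
    return (sparse + lam * (sparse - dense)) @ v

img_q = torch.randn(2, 1024, 64)        # image-token queries
txt_k = txt_v = torch.randn(2, 77, 64)  # text-token keys/values
print(pladis_style_cross_attention(img_q, txt_k, txt_v).shape)   # torch.Size([2, 1024, 64])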
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.07677
• PDF: https://arxiv.org/pdf/2503.07677
• Github: https://cubeyoung.github.io/pladis-proejct/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries
Article Date: 11 Jun 2025
Article Description:
Historical satellite imagery, such as mid-20$^{th}$ century Keyhole data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and annotation absence have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce $\textbf{Urban1960SatBench}$, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, $\textbf{Urban1960SatUSM}$. First, $\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering 1,240 km$^2$ and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, $\textbf{Urban1960SatUSM}$ (Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Experiments show Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, paving the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at https://github.com/Tianxiang-Hao/Urban1960SatSeg.
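A hedged sketch of a focal, confidence-weighted pseudo-label loss in the spirit of the focal-confidence loss described above; the exact weighting used by Urban1960SatUSM may differ.

import torch
import torch.nn.functional as F

def focal_confidence_loss(logits, pseudo_labels, confidence, gamma=2.0):
    # logits: (B, C, H, W); pseudo_labels: (B, H, W) int64; confidence: (B, H, W) in [0, 1]
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")   # per-pixel cross-entropy
    pt = torch.exp(-ce)                                             # probability of the pseudo class
    loss = confidence * (1 - pt) ** gamma * ce   # down-weight easy pixels and unreliable pseudo-labels
    return loss.mean()

logits = torch.randn(2, 4, 32, 32)
labels = torch.randint(0, 4, (2, 32, 32))
conf = torch.rand(2, 32, 32)
print(focal_confidence_loss(logits, labels, conf).item())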
PDF Download Link:
https://arxiv.org/pdf/2506.09476v1.pdf
GitHub:
• https://github.com/tianxiang-hao/urban1960satseg
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
MMSearch-R1: Incentivizing LMMs to Search
🔹 Publication Date: Published on Jun 25
🔹 Abstract:
MMSearch-R1, a reinforcement learning framework, enables large multimodal models to perform efficient, on-demand, multi-turn search in real-world environments, outperforming existing approaches. AI-generated summary: Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
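One plausible shaping of the outcome-based reward with a search penalty described above; the correctness check and penalty coefficient are placeholders, and the paper's exact reward may differ.

def search_reward(prediction: str, answer: str, num_search_calls: int, penalty: float = 0.1) -> float:
    # Reward answering correctly while discouraging unnecessary search calls.
    correct = float(prediction.strip().lower() == answer.strip().lower())
    return correct - penalty * num_search_calls

print(search_reward("Eiffel Tower", "eiffel tower", num_search_calls=1))   # 0.9
print(search_reward("Louvre", "eiffel tower", num_search_calls=3))         # -0.3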
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.20670
• PDF: https://arxiv.org/pdf/2506.20670
• Github: https://github.com/EvolvingLMMs-Lab/multimodal-search-r1
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
🔹 Publication Date: Published on Jun 26
🔹 Abstract:
The Mind2Web 2 benchmark evaluates agentic search systems with a suite of realistic, long-horizon tasks, introducing an Agent-as-a-Judge framework to assess accuracy and source attribution. AI-generated summary: Agentic search, such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, showing great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
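To make the tree-structured rubric idea concrete, here is a minimal Python sketch of how leaf criteria could be scored and aggregated up a rubric tree; the class, weights, and example criteria are hypothetical and not the authors' implementation.

# Hypothetical rubric tree: leaves carry scores (e.g., from a judge agent),
# internal nodes aggregate them by weighted average. Illustrative only.

from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    score: float = 0.0                       # set directly on leaves, in [0, 1]
    children: List["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        """Weighted average of child scores; leaves return their own score."""
        if not self.children:
            return self.score
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total_weight

# Example rubric with correctness and source attribution as top-level criteria.
rubric = RubricNode("task", children=[
    RubricNode("answer_correctness", weight=2.0, children=[
        RubricNode("fact_1_present", score=1.0),
        RubricNode("fact_2_present", score=0.0),
    ]),
    RubricNode("source_attribution", weight=1.0, score=1.0),
])
print(round(rubric.aggregate(), 3))  # 0.667 with these illustrative scores

In the benchmark itself, the leaf judgments would come from task-specific judge agents checking answer correctness and source attribution against live web evidence, rather than from hard-coded scores.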
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.21506
• PDF: https://arxiv.org/pdf/2506.21506
• Project Page: https://osu-nlp-group.github.io/Mind2Web-2
• Github: https://github.com/OSU-NLP-Group/Mind2Web-2/
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/osunlp/Mind2Web-2
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Intelligent Operation and Maintenance and Prediction Model Optimization for Improving Wind Power Generation Efficiency
🔹 Publication Date: Published on Jun 19
🔹 Abstract:
This study explores the effectiveness of predictive maintenance models and the optimization of intelligent Operation and Maintenance (O&M) systems in improving wind power generation efficiency. Through qualitative research, structured interviews were conducted with five wind farm engineers and maintenance managers, each with extensive experience in turbine operations. Using thematic analysis, the study revealed that while predictive maintenance models effectively reduce downtime by identifying major faults, they often struggle with detecting smaller, gradual failures. Key challenges identified include false positives, sensor malfunctions, and difficulties in integrating new models with older turbine systems. Advanced technologies such as digital twins, SCADA systems, and condition monitoring have significantly enhanced turbine maintenance practices. However, these technologies still require improvements, particularly in AI refinement and real-time data integration. The findings emphasize the need for continuous development to fully optimize wind turbine performance and support the broader adoption of renewable energy.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.16095
• PDF: https://arxiv.org/pdf/2506.16095
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
🔹 Publication Date: Published on Jun 25
🔹 Abstract:
HiWave enhances ultra-high-resolution image synthesis using pretrained diffusion models through a two-stage pipeline involving DDIM inversion and wavelet-based detail enhancement, improving visual fidelity and reducing artifacts. AI-generated summary: Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model, followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave's performance: it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.
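As a loose sketch of the wavelet-domain mixing described above (keep low-frequency structure from the base image, take high-frequency detail from the new sampling pass), here is a single-level 2D DWT example using PyWavelets on plain arrays; it is an assumption-laden illustration, not the HiWave implementation.

# Hypothetical single-level wavelet mix: approximation band from the base
# image (structure), detail bands from the detail-rich pass (textures).

import numpy as np
import pywt  # PyWavelets

def mix_wavelet_bands(base: np.ndarray, detail_rich: np.ndarray,
                      wavelet: str = "haar") -> np.ndarray:
    """Combine low frequencies of `base` with high frequencies of `detail_rich`."""
    base_low, _ = pywt.dwt2(base, wavelet)            # keep global structure
    _, detail_high = pywt.dwt2(detail_rich, wavelet)  # keep fine detail
    return pywt.idwt2((base_low, detail_high), wavelet)

# Example with random grayscale arrays standing in for decoded images.
rng = np.random.default_rng(0)
base = rng.random((256, 256))
detail = rng.random((256, 256))
print(mix_wavelet_bands(base, detail).shape)  # (256, 256)

The actual method applies this kind of frequency-selective guidance during patch-wise diffusion sampling rather than as a one-shot post-process on finished images.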
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.20452
• PDF: https://arxiv.org/pdf/2506.20452
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT