🔹 Title:
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
🔹 Publication Date: Published on Jun 13
🔹 Abstract:
A diffusion-based framework generates aligned novel views of images and geometry using warping-and-inpainting with cross-modal attention distillation and proximity-based mesh conditioning, achieving high-fidelity synthesis and 3D completion. AI-generated summary: We introduce a diffusion-based framework that performs aligned novel-view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
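Code sketch (not from the paper): a minimal illustration of the cross-modal attention distillation idea, where the geometry branch reuses the attention probabilities computed by the image branch; the module layout, tensor shapes, and single-head attention are assumptions for illustration only.
```python
# Minimal sketch of cross-modal attention distillation (illustrative only).
# Assumption: both branches share the same token layout, so the image branch's
# attention probabilities can be reused directly by the geometry branch.
import torch


def attention(q, k, v, probs=None):
    """Scaled dot-product attention; optionally reuse precomputed attention probs."""
    if probs is None:
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        probs = scores.softmax(dim=-1)
    return probs @ v, probs


# Toy tensors: batch=1, tokens=16, dim=64 (shapes are arbitrary for the sketch).
q_img, k_img, v_img = (torch.randn(1, 16, 64) for _ in range(3))
v_geo = torch.randn(1, 16, 64)

# Image branch computes attention normally.
out_img, probs_img = attention(q_img, k_img, v_img)

# Geometry branch "inherits" the image attention map instead of computing its own,
# which is the alignment mechanism the abstract calls attention distillation.
out_geo, _ = attention(None, None, v_geo, probs=probs_img)
```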
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.11924
• PDF: https://arxiv.org/pdf/2506.11924
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
🔹 Publication Date: Published on Jun 26
🔹 Abstract:
FairyGen generates story-driven cartoon videos from a single drawing by disentangling character modeling and background styling, employing an MLLM for storyboards, style propagation for consistency, and MMDiT-based diffusion models for motion. AI-generated summary: We propose FairyGen, an automatic system for generating story-driven cartoon videos from a single child's drawing, while faithfully preserving its unique artistic style. Unlike previous storytelling methods that primarily focus on character consistency and basic motion, FairyGen explicitly disentangles character modeling from stylized background generation and incorporates cinematic shot design to support expressive and coherent storytelling. Given a single character sketch, we first employ an MLLM to generate a structured storyboard with shot-level descriptions that specify environment settings, character actions, and camera perspectives. To ensure visual consistency, we introduce a style propagation adapter that captures the character's visual style and applies it to the background, faithfully retaining the character's full visual identity while synthesizing style-consistent scenes. A shot design module further enhances visual diversity and cinematic quality through frame cropping and multi-view synthesis based on the storyboard. To animate the story, we reconstruct a 3D proxy of the character to derive physically plausible motion sequences, which are then used to fine-tune an MMDiT-based image-to-video diffusion model. We further propose a two-stage motion customization adapter: the first stage learns appearance features from temporally unordered frames, disentangling identity from motion; the second stage models temporal dynamics using a timestep-shift strategy with frozen identity weights. Once trained, FairyGen directly renders diverse and coherent video scenes aligned with the storyboard. Extensive experiments demonstrate that our system produces animations that are stylistically faithful and narratively structured, with natural motion, highlighting its potential for personalized and engaging story animation. The code will be available at https://github.com/GVCLab/FairyGen
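Code sketch (illustrative): the kind of shot-level storyboard record the MLLM stage is described as producing, with environment, character action, and camera fields; field names and values are assumptions, not FairyGen's actual schema.
```python
# Illustrative shot-level storyboard record; field names are assumptions,
# not FairyGen's actual output format.
from dataclasses import dataclass, asdict
import json


@dataclass
class Shot:
    environment: str       # scene/background setting
    character_action: str  # what the drawn character does
    camera: str            # camera perspective / movement


storyboard = [
    Shot("sunny meadow", "the character waves and walks forward", "wide shot, slow pan right"),
    Shot("forest at dusk", "the character looks around, surprised", "medium close-up, static"),
]

print(json.dumps([asdict(s) for s in storyboard], indent=2))
```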
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.21272
• PDF: https://arxiv.org/pdf/2506.21272
• Project Page: https://jayleejia.github.io/FairyGen/
• Github: https://github.com/GVCLab/FairyGen
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Seedance 1.0: Exploring the Boundaries of Video Generation Models
🔹 Publication Date: Published on Jun 10
🔹 Abstract:
Seedance 1.0 offers high-performance video generation by integrating advanced data curation, efficient architecture, post-training optimization, and model acceleration, resulting in superior quality and speed. AI-generated summary: Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundation models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; and (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (on an NVIDIA L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation, offering superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.
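Code sketch (illustrative): a toy aggregation of multi-dimensional RLHF rewards into a single scalar, hinting at the "multi-dimensional reward mechanisms" mentioned above; the reward dimensions and weights are assumptions, not Seedance's.
```python
# Toy aggregation of multi-dimensional rewards into a single RLHF training signal.
# The dimensions and weights below are illustrative assumptions, not Seedance's.
rewards = {"prompt_following": 0.82, "motion_plausibility": 0.74, "visual_quality": 0.91}
weights = {"prompt_following": 0.4, "motion_plausibility": 0.3, "visual_quality": 0.3}

scalar_reward = sum(weights[k] * rewards[k] for k in rewards)
print(f"aggregated reward: {scalar_reward:.3f}")  # 0.4*0.82 + 0.3*0.74 + 0.3*0.91 = 0.823
```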
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09113
• PDF: https://arxiv.org/pdf/2506.09113
• Project Page: https://seed.bytedance.com/seedance
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Large Language Models for Cyber Security: A Systematic Literature Review
Article Date: 8 May 2024
Article Description:
The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in various domains, including cybersecurity. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks. In this survey, we conduct a comprehensive review of the literature on the application of LLMs in cybersecurity (LLM4Security). By comprehensively collecting over 30K relevant papers and systematically analyzing 127 papers from top security and software engineering venues, we aim to provide a holistic view of how LLMs are being used to solve diverse problems across the cybersecurity domain. Through our analysis, we identify several key findings. First, we observe that LLMs are being applied to a wide range of cybersecurity tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection. Second, we find that the datasets used for training and evaluating LLMs in these tasks are often limited in size and diversity, highlighting the need for more comprehensive and representative datasets. Third, we identify several promising techniques for adapting LLMs to specific cybersecurity domains, such as fine-tuning, transfer learning, and domain-specific pre-training. Finally, we discuss the main challenges and opportunities for future research in LLM4Security, including the need for more interpretable and explainable models, the importance of addressing data privacy and security concerns, and the potential for leveraging LLMs for proactive defense and threat hunting. Overall, our survey provides a comprehensive overview of the current state-of-the-art in LLM4Security and identifies several promising directions for future research.
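Code sketch (not from the survey): a hedged example of one LLM4Security task the review covers, fine-tuning a small pretrained encoder as a phishing classifier with Hugging Face Transformers; the model name, labels, and toy data are assumptions.
```python
# Hedged sketch: fine-tuning a small pretrained LM as a phishing-text classifier,
# one of the cybersecurity tasks the survey covers. Model name, labels, and data
# are illustrative; the survey itself does not prescribe this pipeline.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # assumption: any small encoder works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

data = Dataset.from_dict({
    "text": ["Your account is locked, verify at http://bad.example", "Team lunch at noon"],
    "label": [1, 0],  # 1 = phishing, 0 = benign (toy examples)
}).map(lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phish-clf", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```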
PDF Download Link:
https://arxiv.org/pdf/2405.04760v4.pdf
GitHub:
• https://github.com/hiyouga/llama-efficient-tuning
Datasets:
• 10,000 People - Human Pose Recognition Data
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
PosterCraft improves aesthetic poster generation through a unified framework with enhanced text rendering, region-aware fine-tuning, aesthetic reinforcement learning, and joint vision-language refinement. AI-generated summary: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: https://ephemeral182.github.io/PosterCraft
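Code sketch (illustrative): the best-of-n preference selection pattern behind the aesthetic-text reinforcement learning stage; generate_poster and reward are hypothetical stand-ins for the diffusion sampler and reward model, not PosterCraft's code.
```python
# Sketch of best-of-n preference selection: sample n candidates, score them,
# and keep a (preferred, dispreferred) pair for preference optimization.
import random


def generate_poster(prompt: str, seed: int) -> str:
    return f"poster(prompt={prompt!r}, seed={seed})"  # placeholder for a diffusion sample


def reward(candidate: str) -> float:
    return random.random()  # placeholder for an aesthetic/text-accuracy reward model


def best_of_n(prompt: str, n: int = 4):
    candidates = [generate_poster(prompt, seed=i) for i in range(n)]
    scored = sorted(candidates, key=reward, reverse=True)
    chosen, rejected = scored[0], scored[-1]
    return chosen, rejected  # training pair for preference optimization


print(best_of_n("jazz festival poster, bold typography"))
```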
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10741
• PDF: https://arxiv.org/pdf/2506.10741
• Project Page: https://ephemeral182.github.io/PosterCraft/
• Github: https://ephemeral182.github.io/PosterCraft
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Scaling Reasoning without Attention
🔹 Publication Date: Published on May 28
🔹 Abstract:
A new attention-free language model using state space dual layers and a two-phase curriculum fine-tuning strategy outperforms Transformer models on complex reasoning tasks. AI-generated summary: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on LiveCodeBench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.
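Code sketch (didactic, not Mamba-2's SSD): a toy linear recurrence showing why an attention-free, state-space-style decoder keeps fixed memory and constant time per token, in contrast to a KV cache that grows with sequence length.
```python
# Toy recurrent state-space-style update: a fixed-size state carries all context,
# so per-token memory and compute do not grow with sequence length.
import numpy as np

d = 8                      # hidden size
A = np.eye(d) * 0.9        # toy state transition
B = np.random.randn(d, d) * 0.1
C = np.random.randn(d, d) * 0.1

state = np.zeros(d)        # fixed-size memory, independent of sequence length
for t in range(1000):      # decode 1000 tokens with constant memory per step
    x_t = np.random.randn(d)          # embedding of the current token
    state = A @ state + B @ x_t       # recurrent state update (O(1) memory)
    y_t = C @ state                   # output used to predict the next token

print("state size stays", state.shape, "regardless of sequence length")
```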
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.22425
• PDF: https://arxiv.org/pdf/2505.22425
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents
Article Date: 17 May 2025
Article Description:
Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4× higher throughput and 5× lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.
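Code sketch (illustrative): a minimal priority-aware scheduler in the spirit described above; the priority function and request fields are assumptions, not the paper's implementation.
```python
# Minimal sketch of priority-aware scheduling for interleaved reasoning/retrieval
# requests. The priority heuristic below is an assumption for illustration.
import heapq
import itertools

counter = itertools.count()
queue = []  # min-heap ordered by (priority, arrival order)


def submit(request_id: str, remaining_steps: int, waiting_time: float):
    # Assumption: favor requests that are nearly finished and have waited longest,
    # so retrieval stalls on one request do not delay everyone else.
    priority = remaining_steps - waiting_time
    heapq.heappush(queue, (priority, next(counter), request_id))


submit("agent-42", remaining_steps=1, waiting_time=3.0)
submit("agent-17", remaining_steps=8, waiting_time=0.5)

while queue:
    _, _, rid = heapq.heappop(queue)
    print("schedule", rid)   # agent-42 runs first
```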
PDF Download Link:
https://arxiv.org/pdf/2505.12065v1.pdf
GitHub:
• https://github.com/tiannuo-yang/searchagent-x
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Sekai: A Video Dataset towards World Exploration
🔹 Publication Date: Published on Jun 18
🔹 Abstract:
Sekai, a worldwide video dataset with comprehensive annotations, is introduced to support world exploration applications, enhancing video generation models. AI-generated summary: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well suited for world exploration training, as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person-view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the areas of video generation and world exploration, and motivate valuable applications.
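Code sketch (illustrative): the kind of per-clip annotation record the abstract lists (location, scene, weather, crowd density, caption, camera trajectory); field names and values are assumptions about the schema, not the actual Sekai format.
```python
# Illustrative per-clip annotation record; field names and values are assumptions,
# not the actual Sekai dataset schema.
clip_annotation = {
    "video_id": "tokyo_walk_0001",
    "location": {"country": "Japan", "city": "Tokyo"},
    "scene": "shopping street",
    "weather": "light rain",
    "crowd_density": "high",
    "caption": "First-person walk through a crowded covered arcade at dusk.",
    "camera_trajectory": [  # per-frame pose, e.g. (tx, ty, tz, qx, qy, qz, qw)
        (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0),
        (0.1, 0.0, 0.8, 0.0, 0.02, 0.0, 0.9998),
    ],
}
print(clip_annotation["scene"], clip_annotation["crowd_density"])
```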
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.15675
• PDF: https://arxiv.org/pdf/2506.15675
• Project Page: https://huggingface.co/datasets/Lixsp11/Sekai-Project
• Github: https://github.com/Lixsp11/sekai-codebase
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/Lixsp11/Sekai-Project
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Article Date: Mar 2025
Article Description:
DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained stereo 3D reconstruction of arbitrary image collections with no prior information about camera calibration or viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system. The number of pairs, growing quadratically, is an inherent limitation that becomes especially concerning for robust and fast optimization in the case of large image collections. In this paper, we propose an extension of DUSt3R from pairs to multiple views that addresses all the aforementioned concerns. Indeed, we propose a Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, that modifies the DUSt3R architecture by making it symmetric and extending it to directly predict 3D structure for all views in a common coordinate frame. Second, we endow the model with a multi-layer memory mechanism that reduces computational complexity and scales the reconstruction to large collections, inferring thousands of 3D pointmaps at high frame rates with limited added complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and visual SLAM scenarios, showing state-of-the-art performance on various 3D downstream tasks, including uncalibrated visual odometry, relative camera pose, scale and focal estimation, 3D reconstruction, and multi-view depth estimation. (CVPR 2025)
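Code sketch (hedged): the general online pattern the description suggests, predicting per-view pointmaps while maintaining a bounded memory of past views; the function and memory policy are placeholders, not MUSt3R's architecture (see the official repo for the real model).
```python
# Hedged sketch of online multi-view inference with a bounded memory.
from collections import deque


def predict_pointmap(image, memory):
    """Placeholder for the network: returns a per-pixel 3D pointmap in a shared frame."""
    return f"pointmap({image}, mem={len(memory)})"


memory = deque(maxlen=32)   # multi-layer memory approximated as a bounded buffer
pointmaps = []
for image in ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]:
    pm = predict_pointmap(image, memory)   # each view can reuse past memory entries
    memory.append(pm)                      # store features so later views benefit
    pointmaps.append(pm)

print(pointmaps)
```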
PDF Download Link:
https://arxiv.org/pdf/2503.01661v1.pdf
GitHub:
• https://github.com/naver/must3r
Datasets:
• KITTI
• ScanNet
• TUM RGB-D
• MegaDepth
• ETH3D
• BlendedMVS
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Spiking Graph Convolutional Networks
Article Date: 5 May 2022
Article Description:
Graph Convolutional Networks (GCNs) achieve impressive performance due to their remarkable representation ability in learning graph information. However, GCNs, when implemented as deep networks, require expensive computation power, making them difficult to deploy on battery-powered devices. In contrast, Spiking Neural Networks (SNNs), which perform a bio-fidelity inference process, offer an energy-efficient neural architecture. In this work, we propose SpikingGCN, an end-to-end framework that aims to integrate the embedding of GCNs with the bio-fidelity characteristics of SNNs. The original graph data are encoded into spike trains based on the incorporation of graph convolution. We further model biological information processing by utilizing a fully connected layer combined with neuron nodes. In a wide range of scenarios (e.g., citation networks, image graph classification, and recommender systems), our experimental results show that the proposed method gains competitive performance against state-of-the-art approaches. Furthermore, we show that SpikingGCN on a neuromorphic chip brings a clear advantage in energy efficiency to graph data analysis, which demonstrates its great potential for constructing environmentally friendly machine learning models.
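Code sketch (illustrative): encoding graph-convolved features into spike trains via Bernoulli rate coding, the general idea behind SpikingGCN's input encoding; the exact encoder in the paper may differ.
```python
# Sketch: one graph-convolution step followed by Bernoulli rate coding into spikes.
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes, normalized adjacency (with self-loops) and 3-dim features.
A_hat = np.full((4, 4), 0.25)          # uniform normalized adjacency for the sketch
X = rng.random((4, 3))

H = A_hat @ X                          # one step of graph convolution (propagation)
rates = H / H.max()                    # map activations to firing probabilities in [0, 1]

T = 10                                 # number of time steps in the spike train
spikes = rng.random((T, *rates.shape)) < rates   # Bernoulli spikes, shape (T, 4, 3)
print("mean firing rate per node:\n", spikes.mean(axis=(0, 2)))
```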
PDF Download Link:
https://arxiv.org/pdf/2205.02767v2.pdf
GitHub:
• https://github.com/zulunzhu/spikinggcn
Datasets:
• MNIST
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Unified Vision-Language-Action Model
🔹 Publication Date: Published on Jun 24
🔹 Abstract:
UniVLA is a multimodal VLA model that autoregressively processes vision, language, and action as token sequences, incorporating world modeling for effective long-horizon policy learning and achieving state-of-the-art results across simulation and real-world benchmarks. AI-generated summary: Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
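Code sketch (illustrative): packing vision, language, and action into one discrete token sequence for autoregressive modeling, as the abstract describes; the special markers and token ids are invented for illustration and are not UniVLA's actual vocabulary.
```python
# Toy unified token sequence: vision, language, and action tokens in one stream.
VISION = ["<img>", 101, 102, 103, "</img>"]       # e.g., VQ codes of an observation
LANGUAGE = ["<txt>", 7, 8, 9, "</txt>"]           # tokenized instruction
ACTION = ["<act>", 3001, 3002, "</act>"]          # discretized action bins

sequence = VISION + LANGUAGE + ACTION             # one flat sequence, modeled left-to-right
print(sequence)
# During training the model predicts each next token, so action tokens are
# conditioned on the vision and language tokens that precede them.
```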
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.19850
• PDF: https://arxiv.org/pdf/2506.19850
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content
🔹 Publication Date: Published on Jun 25
🔹 Abstract:
A biomedical text dataset, constructed from PubMed, uses a two-stage annotation process involving large and small language models to fine-tune and extract subsets for clinical NLP, improving pretraining efficiency and performance. AI-generated summary: We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs, over 450K of which are high quality and come from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Hence, our dataset provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements, with clinical upsampling boosting performance by ~5% on MMLU ProfMed and educational quality filtering improving MedQA and MedMCQA by ~1%. Combinations of these techniques led to faster convergence, reaching the same performance with a third of the training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.
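Code sketch (illustrative): the metadata-driven quality filtering and clinical upsampling described above; the field names mirror the described annotations, but the schema and upsampling factor are assumptions.
```python
# Sketch: filter paragraphs by annotated domain and educational quality, then
# upsample the high-quality clinical subset for the pretraining mix.
paragraphs = [
    {"text": "A 54-year-old patient presented with ...", "domain": "clinical",
     "type": "clinical case", "edu_quality": 5},
    {"text": "We review recent advances in ...", "domain": "biomedical",
     "type": "review", "edu_quality": 2},
]

high_quality_clinical = [p for p in paragraphs
                         if p["domain"] == "clinical" and p["edu_quality"] >= 4]

UPSAMPLE_FACTOR = 3                     # illustrative value, not the paper's setting
training_mix = paragraphs + high_quality_clinical * (UPSAMPLE_FACTOR - 1)
print(len(training_mix), "paragraphs after clinical upsampling")
```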
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.20331
• PDF: https://arxiv.org/pdf/2506.20331
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/almanach/Biomed-Enriched
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
PixelsDB: Serverless and NL-Aided Data Analytics with Flexible Service Levels and Prices
Article Date: 30 May 2024
Article Description:
Serverless query processing has become increasingly popular due to its advantages, including automated resource management, high elasticity, and pay-as-you-go pricing. For users who are not system experts, serverless query processing greatly reduces the cost of owning a data analytic system. However, it is still a significant challenge for non-expert users to transform their complex and evolving data analytic needs into proper SQL queries and select a serverless query service that delivers satisfactory performance and price for each type of query. This paper presents PixelsDB, an open-source data analytic system that allows users who lack system or SQL expertise to explore data efficiently. It allows users to generate and debug SQL queries using a natural language interface powered by fine-tuned language models. The queries are then executed by a serverless query engine that offers varying prices for different performance service levels (SLAs). The performance SLAs are natively supported by dedicated architecture design and heterogeneous resource scheduling that can apply cost-efficient resources to process non-urgent queries. We demonstrate that the combination of a serverless paradigm, a natural-language-aided interface, and flexible SLAs and prices will substantially improve the usability of cloud data analytic systems.
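Code sketch (illustrative): routing a query to a resource pool and price based on its SLA tier, the kind of policy the description implies; the tiers, pools, and prices here are invented, not PixelsDB's actual configuration.
```python
# Toy mapping from per-query SLA tiers to resource pools and prices.
SLA_TIERS = {
    "urgent":   {"pool": "on-demand-workers", "price_per_gb": 0.010},
    "standard": {"pool": "shared-workers",    "price_per_gb": 0.004},
    "relaxed":  {"pool": "spot-workers",      "price_per_gb": 0.001},
}


def route(query_sql: str, sla: str, scanned_gb: float):
    tier = SLA_TIERS[sla]
    cost = tier["price_per_gb"] * scanned_gb
    return {"sql": query_sql, "pool": tier["pool"], "estimated_cost": round(cost, 4)}


print(route("SELECT region, COUNT(*) FROM orders GROUP BY region", "relaxed", 120.0))
```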
PDF Download Link:
https://arxiv.org/pdf/2405.19784v2.pdf
GitHub:
• https://github.com/pixelsdb/pixels
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Optuna: A Next-generation Hyperparameter Optimization Framework
Article Date: 25 Jul 2019
Article Description:
The purpose of this study is to introduce new design-criteria for next-generation hyperparameter optimization software. The criteria we propose include (1) define-by-run API that allows users to construct the parameter search space dynamically, (2) efficient implementation of both searching and pruning strategies, and (3) easy-to-setup, versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to light-weight experiment conducted via interactive interface. In order to prove our point, we will introduce Optuna, an optimization software which is a culmination of our effort in the development of a next generation optimization software. As an optimization software designed with define-by-run principle, Optuna is particularly the first of its kind. We will present the design-techniques that became necessary in the development of the software that meets the above criteria, and demonstrate the power of our new design through experimental results and real world applications. Our software is available under the MIT license (https://github.com/pfnet/optuna/).
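Code sketch: a small, runnable example of Optuna's define-by-run API with pruning, where the search space is constructed dynamically inside the objective (the toy objective itself is not from the paper).
```python
# Define-by-run: the search space is built while the objective executes,
# and intermediate values drive early pruning of unpromising trials.
import optuna


def objective(trial):
    n_layers = trial.suggest_int("n_layers", 1, 3)
    loss = 0.0
    for i in range(n_layers):                      # space depends on earlier suggestions
        width = trial.suggest_int(f"width_{i}", 16, 256, log=True)
        loss += abs(width - 64) / 64               # toy "validation loss"
        trial.report(loss, step=i)                 # feed intermediate values to the pruner
        if trial.should_prune():                   # stop unpromising trials early
            raise optuna.TrialPruned()
    return loss


study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=30)
print(study.best_params)
```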
PDF Download Link:
https://arxiv.org/pdf/1907.10902v1.pdf
GitHub:
• https://github.com/pfnet/optuna
• https://github.com/optuna/optuna
• https://github.com/Automunge/AutoMunge
• https://github.com/optuna/optuna-integration
• https://github.com/crcrpar/benchmark-runner-ci
• https://github.com/crcrpar/optuna-mirror
• https://github.com/himkt/optuna-test-rtds
• https://github.com/crcrpar/ci-example-execution
• https://github.com/yqian4/optuna
• https://github.com/brethvoice/optuna_demo_MNIST
• https://github.com/rickyHong/optuna-repl
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
🔹 Publication Date: Published on Jun 25
🔹 Abstract:
Investigating mid-training strategies reveals that high-quality mathematical corpora and well-formatted chain-of-thought reasoning examples enhance reinforcement learning performance in language models, leading to the development of OctoThinker. AI-generated summary: Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning-rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
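As a rough illustration of the Stable-then-Decay recipe, the sketch below maps tokens seen to a learning rate: constant for the first 200B tokens, then decayed over the next 20B. The peak learning rate, the cosine decay shape, and the floor value are assumptions for illustration only; the abstract does not specify them.

```python
# Illustrative token-level learning-rate schedule for Stable-then-Decay:
# a constant LR for the first 200B tokens, then a decay phase over 20B tokens.
# PEAK_LR, the cosine shape, and MIN_LR are assumptions, not values from the paper.
import math

STABLE_TOKENS = 200e9   # constant-LR phase (from the abstract)
DECAY_TOKENS = 20e9     # decay phase per CoT-focused branch (from the abstract)
PEAK_LR = 3e-4          # assumed
MIN_LR = 3e-5           # assumed


def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR                                          # stable phase
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))         # 1 -> 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine                 # decay phase


for t in (50e9, 200e9, 205e9, 215e9, 220e9):
    print(f"{t / 1e9:>5.0f}B tokens -> lr = {stable_then_decay_lr(t):.2e}")
```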
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.20512
• PDF: https://arxiv.org/pdf/2506.20512
• Github: https://github.com/GAIR-NLP/OctoThinker
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/OctoThinker/MegaMath-Web-Pro-Max
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
pySLAM: An Open-Source, Modular, and Extensible Framework for SLAM
Article Date: 17 Feb 2025
Article Description:
pySLAM is an open-source Python framework for Visual SLAM, supporting monocular, stereo, and RGB-D cameras. It provides a flexible interface for integrating both classical and modern local features, making it adaptable to various SLAM tasks. The framework includes different loop closure methods, a volumetric reconstruction pipeline, and support for depth prediction models. Additionally, it offers a suite of tools for visual odometry and SLAM applications. Designed for both beginners and experienced researchers, pySLAM encourages community contributions, fostering collaborative development in the field of Visual SLAM.
PDF Download Link:
https://arxiv.org/pdf/2502.11955v2.pdf
GitHub:
• https://github.com/luigifreda/pyslam
Datasets:
• KITTI
• Replica
• TUM RGB-D
• EuRoC MAV
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
PlayerOne: Egocentric World Simulator
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
PlayerOne is an egocentric realistic world simulator that constructs and generates videos from user-captured images, using a coarse-to-fine training pipeline and advanced motion injection and reconstruction frameworks. AI-generated summary: We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its strong generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09995
• PDF: https://arxiv.org/pdf/2506.09995
• Project Page: https://playerone-hku.github.io/
• Github: https://playerone-hku.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
🔹 Publication Date: Published on Mar 20
🔹 Abstract:
A method using an autoencoder and Gumbel-Softmax selection identifies and retains only the most informative visual tokens, enabling efficient multimodal pruning with minimal performance loss. AI-generated summary: Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or if some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism that allows identifying and retaining only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model using features selected by our method against randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly affects the model's capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
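A conceptual PyTorch sketch of the mechanism described above follows: a per-token score, a hard Gumbel-Softmax keep/drop decision, and a small decoder trained to reconstruct all tokens from the kept ones. The layer sizes, the zero-out treatment of dropped tokens, and the loss weighting are illustrative assumptions rather than the paper's exact architecture.

```python
# Conceptual sketch: score visual tokens, sample keep/drop with Gumbel-Softmax,
# and train a decoder to reconstruct the full token set from the kept ones.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSelector(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.scorer = nn.Linear(dim, 2)           # logits for (drop, keep) per token
        self.decoder = nn.Sequential(             # reconstructs all tokens from kept ones
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, tokens: torch.Tensor, tau: float = 1.0):
        # tokens: (batch, num_tokens, dim) visual tokens from the vision encoder
        logits = self.scorer(tokens)                                   # (B, N, 2)
        keep = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1:]   # (B, N, 1) in {0, 1}
        kept_tokens = tokens * keep                # dropped tokens are zeroed (straight-through)
        recon = self.decoder(kept_tokens)          # try to recover every token
        recon_loss = F.mse_loss(recon, tokens)
        keep_ratio = keep.mean()                   # fraction of tokens retained
        return kept_tokens, recon_loss, keep_ratio


tokens = torch.randn(2, 576, 1024)                 # e.g. a ViT-style token grid
selector = TokenSelector(dim=1024)
kept, recon_loss, ratio = selector(tokens)
loss = recon_loss + 0.1 * ratio                    # penalize keeping too many tokens (assumed weight)
loss.backward()
print(f"kept {ratio.item():.1%} of tokens, recon loss {recon_loss.item():.3f}")
```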
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.16660
• PDF: https://arxiv.org/pdf/2503.16660
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
🔹 Publication Date: Published on Jun 23
🔹 Abstract:
FilMaster is an AI system that integrates cinematic principles to generate professional-grade films, featuring camera language design and cinematic rhythm control using real-world film data and generative AI models. AI-generated summary: AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage, which transforms user input into video clips, and a Generative Post-Production Stage, which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models such as (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster's superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.
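The retrieval step behind a RAG-style camera-language module can be pictured as nearest-neighbor search over clip embeddings, as in the minimal sketch below. The embedding dimension, the small stand-in corpus, and the metadata fields are assumptions for illustration; they do not reflect FilMaster's actual implementation.

```python
# Conceptual sketch of RAG-style retrieval: embed the scene description, then
# fetch the most similar reference clips from a pre-embedded corpus by cosine
# similarity. The corpus is a small random stand-in for the 440,000-clip corpus.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLIPS, DIM = 10_000, 512                        # stand-in sizes (assumed)
corpus_embeddings = rng.normal(size=(NUM_CLIPS, DIM)).astype(np.float32)
corpus_meta = [{"clip_id": i, "camera": "dolly-in"} for i in range(NUM_CLIPS)]  # hypothetical


def retrieve_reference_clips(query_embedding: np.ndarray, k: int = 5):
    """Return the k reference clips whose embeddings are most similar to the query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = c @ q                                   # cosine similarity against every clip
    top = np.argsort(-scores)[:k]
    return [(corpus_meta[i]["clip_id"], float(scores[i])) for i in top]


query = rng.normal(size=DIM).astype(np.float32)      # embedding of the user's scene description
print(retrieve_reference_clips(query, k=3))
```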
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.18899
• PDF: https://arxiv.org/pdf/2506.18899
• Github: https://filmaster-ai.github.io
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Hyperbolic Dataset Distillation
Article Date: 30 May 2025
Article Description:
To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness.
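The Lorentz-model quantities the description relies on (embedding Euclidean features onto the hyperboloid, the Lorentzian inner product, and the geodesic distance between centroids) can be sketched in a few lines of NumPy. The centroid below is the standard normalized Lorentzian mean and the random features stand in for shallow-network features; neither is necessarily the paper's exact formulation.

```python
# NumPy sketch of the Lorentz (hyperboloid) model: map features onto the
# hyperboloid, compute centroids, and measure the geodesic distance between
# the centroids of "original" and "synthetic" features.
import numpy as np


def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)


def exp_map_origin(v):
    """Map Euclidean feature vectors (tangent at the origin) onto the hyperboloid."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True).clip(min=1e-9)
    return np.concatenate([np.cosh(norm), np.sinh(norm) * v / norm], axis=-1)


def lorentz_centroid(x):
    """Normalized mean: a standard choice of centroid on the hyperboloid."""
    s = x.mean(axis=0)
    return s / np.sqrt(np.maximum(-lorentz_inner(s, s), 1e-9))


def geodesic_distance(x, y):
    """Hyperbolic (geodesic) distance d(x, y) = arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))


# Toy stand-ins for features of the original and synthetic datasets.
rng = np.random.default_rng(0)
real_feats = exp_map_origin(rng.normal(size=(512, 64)))
synth_feats = exp_map_origin(rng.normal(size=(50, 64)) * 0.9)

# The distillation objective would minimize this centroid-to-centroid distance.
print(geodesic_distance(lorentz_centroid(real_feats), lorentz_centroid(synth_feats)))
```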
PDF Download Link:
https://arxiv.org/pdf/2505.24623v1.pdf
GitHub:
• https://github.com/Guang000/Awesome-Dataset-Distillation
Datasets:
• CIFAR-100
• Fashion-MNIST
• Tiny ImageNet
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT