Slamming: Training a Speech Language Model on One GPU in a Day
19 Feb 2025 · Gallil Maimon, Avishai Elmakies, Yossi Adi ·
Paper: https://arxiv.org/pdf/2502.15814v1.pdf
Code: https://github.com/slp-rl/slamkit
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaking of all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform the predicted compute-optimal performance, giving an optimistic view of SLM feasibility. See code, data, models, and samples at https://pages.cs.huji.ac.il/adiyoss-lab/slamming .
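To make the recipe concrete, here is a minimal, hedged sketch of the core idea (not the authors' slamkit implementation): initialise from a small pretrained text LM and continue training it on discrete speech units, assuming unit sequences have already been extracted by a speech tokenizer. The base-model name, unit-vocabulary size, and token naming below are illustrative assumptions.

```python
# Hedged sketch of the core SLM training idea (not the slamkit code):
# continue training a small pretrained text LM on discrete speech units,
# treating each unit id as a new vocabulary token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2.5-0.5B"   # assumed text-LM initialisation; swap as needed
NUM_SPEECH_UNITS = 500             # assumed size of the speech-unit vocabulary

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Add one token per discrete speech unit and resize the embedding matrix.
tokenizer.add_tokens([f"<unit_{i}>" for i in range(NUM_SPEECH_UNITS)])
model.resize_token_embeddings(len(tokenizer))

# One toy training step on a batch of (placeholder) pre-extracted unit sequences.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
unit_ids = torch.randint(0, NUM_SPEECH_UNITS, (2, 128))
input_ids = torch.tensor([
    tokenizer.convert_tokens_to_ids([f"<unit_{int(u)}>" for u in seq])
    for seq in unit_ids
])
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
```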
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
🎨 Can AI design truly novel concepts like humans? Check out SYNTHIA, a breakthrough in T2I generation!
🤖 SYNTHIA composes affordances to create visually novel & functionally coherent designs.
📄 https://arxiv.org/pdf/2502.17793
💻 https://github.com/HyeonjeongHa/SYNTHIA
🎥 https://youtube.com/watch?v=KvsOx44WdzM
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
Magma: A Foundation Model for Multimodal AI Agents
18 Feb 2025 · Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao ·
Paper: https://arxiv.org/pdf/2502.13130v1.pdf
Code: https://github.com/microsoft/Magma
Datasets: Something-Something V2 - EPIC-KITCHENS-100 - Open-X-Embodiment - Ego4D
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM achieve strong synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig. 1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.
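As a rough illustration of what Set-of-Mark labelling produces (an assumption-laden sketch, not Magma's pipeline), one can overlay numbered marks on actionable regions of a screenshot so that actions can be grounded to indices. The file names and box coordinates below are hypothetical.

```python
# Hedged illustration of Set-of-Mark (SoM) style labelling (not Microsoft's code):
# overlay numbered marks on actionable regions of a UI screenshot so an agent
# can refer to them by index.
from PIL import Image, ImageDraw

def draw_set_of_mark(image_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered rectangle over each actionable region (x0, y0, x1, y1)."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")
    return img

# Example: two hypothetical clickable buttons detected on a screenshot.
marked = draw_set_of_mark("screenshot.png", [(40, 40, 160, 90), (200, 40, 320, 90)])
marked.save("screenshot_som.png")
```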
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
From System 1 to System 2: A Survey of Reasoning Large Language Models
24 Feb 2025 · Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, Cheng-Lin Liu ·
Paper: https://arxiv.org/pdf/2502.17419v1.pdf
Code: https://github.com/zzli2022/awesome-slow-reason-system
Datasets: GSM8K - MedQA - MathVista - GPQA - MMLU-Pro - PGPS9K
Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub repository (https://github.com/zzli2022/Awesome-Slow-Reason-System) to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
Hawk: Learning to Understand Open-World Video Anomalies
27 May 2024 · Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, Ying-Cong Chen ·
Paper: https://arxiv.org/pdf/2405.16886v1.pdf
Code: https://github.com/jqtangust/hawk
Dataset: Hawk Annotation Dataset
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs. However, current VAD systems are often limited by their superficial semantic understanding of scenes and minimal user interaction. Additionally, the prevalent data scarcity in existing datasets restricts their applicability in open-world scenarios. In this paper, we introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLMs) to interpret video anomalies precisely. Recognizing the difference in motion information between abnormal and normal videos, Hawk explicitly integrates motion modality to enhance anomaly identification. To reinforce motion attention, we construct an auxiliary consistency loss within the motion and video space, guiding the video branch to focus on the motion modality. Moreover, to improve the interpretation of motion-to-language, we establish a clear supervisory relationship between motion and its linguistic representation. Furthermore, we have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions. The final results demonstrate that Hawk achieves SOTA performance, surpassing existing baselines in both video description generation and question-answering. Our code, dataset, and demo will be released at https://github.com/jqtangust/hawk.
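For intuition, here is a minimal, hedged sketch of an auxiliary consistency loss between motion and video embeddings; it is only an assumption about the idea described above (cosine distance is one simple choice), not Hawk's released code.

```python
# Hedged sketch of an auxiliary motion-video consistency loss (not Hawk's code).
import torch
import torch.nn.functional as F

def consistency_loss(video_emb: torch.Tensor, motion_emb: torch.Tensor) -> torch.Tensor:
    """Pull video-branch features toward the corresponding motion features.

    Both tensors are (batch, dim); cosine distance is used here for simplicity.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    return (1.0 - (video_emb * motion_emb).sum(dim=-1)).mean()

# Toy usage with random features standing in for the two branches.
loss = consistency_loss(torch.randn(4, 512), torch.randn(4, 512))
```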
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
20 Feb 2025 · Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, Mike Zheng Shou, Jiaming Liu ·
Paper: https://arxiv.org/pdf/2502.14397v1.pdf
Code: https://github.com/showlab/PhotoDoodle
We introduce PhotoDoodle, a novel image editing framework designed to facilitate photo doodling by enabling artists to overlay decorative elements onto photographs. Photo doodling is challenging because the inserted elements must appear seamlessly integrated with the background, requiring realistic blending, perspective alignment, and contextual coherence. Additionally, the background must be preserved without distortion, and the artist's unique style must be captured efficiently from limited training data. These requirements are not addressed by previous methods that primarily focus on global style transfer or regional inpainting. The proposed method, PhotoDoodle, employs a two-stage training strategy. Initially, we train a general-purpose image editing model, OmniEditor, using large-scale data. Subsequently, we fine-tune this model with EditLoRA using a small, artist-curated dataset of before-and-after image pairs to capture distinct editing styles and techniques. To enhance consistency in the generated results, we introduce a positional encoding reuse mechanism. Additionally, we release a PhotoDoodle dataset featuring six high-quality styles. Extensive experiments demonstrate the advanced performance and robustness of our method in customized image editing, opening new possibilities for artistic creation.
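The second-stage idea (adapting a large pretrained editor to an artist's style with only a few pairs) can be illustrated with a generic low-rank adapter. This is a hedged, minimal LoRA layer, not the PhotoDoodle/EditLoRA implementation; rank, scaling, and layer sizes are assumptions.

```python
# Hedged sketch of the EditLoRA idea: freeze a pretrained projection and train
# only a low-rank update on a few artist-curated pairs (generic LoRA, not the
# PhotoDoodle code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Wrap one projection of a hypothetical pretrained editor; only the adapter trains.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
```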
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator
26 Feb 2025 · Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, Chi Zhang ·
Paper: https://arxiv.org/pdf/2502.19204v1.pdf
Code: https://github.com/Westlake-AGI-Lab/Distill-Any-Depth
Datasets: ScanNet - NYUv2 - ETH3D
Note: Ranked #1 on Depth Estimation on ScanNetV2
Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation, relying on global normalization, can amplify noisy pseudo-labels, reducing distillation effectiveness. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
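To illustrate what "global vs. local normalization" of a pseudo-labelled depth map means, here is a hedged NumPy sketch (median/MAD normalization applied globally or per patch); it is an illustration of the concept, not the authors' implementation, and the patch size is an assumption.

```python
# Hedged sketch contrasting global vs. local depth normalisation for pseudo-labels.
import numpy as np

def global_normalise(depth: np.ndarray) -> np.ndarray:
    """Shift/scale the whole map by its median and mean absolute deviation."""
    t = np.median(depth)
    s = np.mean(np.abs(depth - t)) + 1e-6
    return (depth - t) / s

def local_normalise(depth: np.ndarray, patch: int = 64) -> np.ndarray:
    """Normalise each non-overlapping patch independently, keeping local structure."""
    out = np.empty_like(depth, dtype=np.float64)
    h, w = depth.shape
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            out[y:y + patch, x:x + patch] = global_normalise(depth[y:y + patch, x:x + patch])
    return out

pseudo_depth = np.random.rand(256, 256)          # placeholder pseudo-label
g, l = global_normalise(pseudo_depth), local_normalise(pseudo_depth)
```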
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents
https://t.iss.one/DataScienceT
Escaping The Big Data Paradigm in Self-Supervised Representation Learning
25 Feb 2025 · Carlos Vélez García, Miguel Cazorla, Jorge Pomares ·
Paper: https://arxiv.org/pdf/2502.18056v1.pdf
Code: https://github.com/inescopresearch/scott
Datasets: Oxford 102 Flower - Oxford-IIIT Pets - Imagenet100
The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside, we propose to train with a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), operating in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required, without relying on massive external datasets for pretraining. We validate our method on three small-size, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve competitive results with SOTA approaches that rely on large-scale pretraining, complex image augmentations and bigger model sizes. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer vision applications in resource-constrained environments such as medical imaging or robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a new pathway toward more accessible and inclusive advancements in the field.
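The general idea of a shallow convolutional tokenizer for a ViT can be sketched as below. This is a hedged illustration using plain (dense) convolutions rather than the sparse convolutions SCOTT is named for, and the channel sizes are assumptions; it is not the released code.

```python
# Hedged sketch of a shallow convolutional tokenizer for a ViT (illustration only;
# SCOTT uses sparse convolutions, this toy version uses dense ones).
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Turn an image into patch tokens with a small conv stem instead of linear patchify."""
    def __init__(self, embed_dim: int = 192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.stem(x)                      # (B, embed_dim, H/4, W/4)
        return feats.flatten(2).transpose(1, 2)   # (B, num_tokens, embed_dim)

tokens = ConvTokenizer()(torch.randn(2, 3, 224, 224))   # -> (2, 3136, 192)
```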
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #AutoML #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #DeepSeek #RAG #Agents #GPT4
https://t.iss.one/DataScienceT
A-MEM: Agentic Memory for LLM Agents
17 Feb 2025 · Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang ·
Paper: https://arxiv.org/pdf/2502.12110v3.pdf
Code: https://github.com/wujiangxu/agenticmemory
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution: as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show clear improvements over existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/AgenticMemory, while the source code of the agentic memory system is available at https://github.com/agiresearch/A-mem.
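A toy, hedged sketch of the note-and-link idea follows: each memory is a note with attributes, and new notes are linked to sufficiently similar old ones by embedding similarity. The embedding function here is a deterministic stand-in (an assumption), not the A-MEM implementation.

```python
# Hedged toy sketch of agentic memory: notes with attributes, linked by similarity.
from dataclasses import dataclass, field

import numpy as np

@dataclass
class MemoryNote:
    text: str
    keywords: list[str]
    embedding: np.ndarray
    links: list[int] = field(default_factory=list)   # indices of related notes

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic stand-in for a real text-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def add_note(store: list[MemoryNote], text: str, keywords: list[str], threshold: float = 0.3) -> None:
    """Create a note, link it to similar existing notes, and append it to the store."""
    note = MemoryNote(text, keywords, embed(text))
    for i, old in enumerate(store):
        if float(note.embedding @ old.embedding) > threshold:   # cosine on unit vectors
            note.links.append(i)
            old.links.append(len(store))
    store.append(note)

memory: list[MemoryNote] = []
add_note(memory, "User prefers short answers.", ["preference"])
add_note(memory, "User asked about short answer style again.", ["preference", "style"])
```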
#DataScience #ArtificialIntelligence #MachineLearning #PythonProgramming #DeepLearning #LLM #AIResearch #BigData #NeuralNetworks #DataAnalytics #NLP #DataVisualization #ScikitLearn #Pandas #NumPy #TensorFlow #AIethics #PredictiveModeling #GPUComputing #OpenSourceAI #RAG #Agents #GPT4
https://t.iss.one/DataScienceT