DeepSeek-V3 Technical Report
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in #DeepSeek V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Paper: https://arxiv.org/pdf/2412.19437v1.pdf
Code: https://github.com/deepseek-ai/deepseek-v3
@Machine_learn
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in #DeepSeek V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
Paper: https://arxiv.org/pdf/2412.19437v1.pdf
Code: https://github.com/deepseek-ai/deepseek-v3
@Machine_learn
๐2โค1
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper submitted by #DeepSeek team has generated significant attention in the AI community.
This work addresses the enhancement of reasoning capabilities in Large Language Models (LLMs) through the application of reinforcement learning techniques. The authors introduce a novel framework, DeepSeek-R1, which aims to improve LLM reasoning abilities by incorporating incentives for logical reasoning processes within their training. This integration of reinforcement learning allows LLMs to go beyond basic linguistic processing, developing sophisticated reasoning methods that can boost performance across a wide array of complex applications.
This approach has cause lots of discussions in different communities, but it definitely opens up the whole new direction of development for the research.
Paper: https://arxiv.org/abs/2501.12948
#nn #LLM
@Machine_learn
Paper submitted by #DeepSeek team has generated significant attention in the AI community.
This work addresses the enhancement of reasoning capabilities in Large Language Models (LLMs) through the application of reinforcement learning techniques. The authors introduce a novel framework, DeepSeek-R1, which aims to improve LLM reasoning abilities by incorporating incentives for logical reasoning processes within their training. This integration of reinforcement learning allows LLMs to go beyond basic linguistic processing, developing sophisticated reasoning methods that can boost performance across a wide array of complex applications.
This approach has cause lots of discussions in different communities, but it definitely opens up the whole new direction of development for the research.
Paper: https://arxiv.org/abs/2501.12948
#nn #LLM
@Machine_learn
arXiv.org
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via...
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning...
โค2
Forwarded from Github LLMs
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
13 Dec 2024 ยท Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan ยท
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage #DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, #DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Paper: https://arxiv.org/pdf/2412.10302v1.pdf
Code: https://github.com/deepseek-ai/deepseek-vl2
Datasets: RefCOCO TextVQA MMBench
DocVQA
๐
https://t.iss.one/deep_learning_proj
13 Dec 2024 ยท Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan ยท
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage #DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, #DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Paper: https://arxiv.org/pdf/2412.10302v1.pdf
Code: https://github.com/deepseek-ai/deepseek-vl2
Datasets: RefCOCO TextVQA MMBench
DocVQA
https://t.iss.one/deep_learning_proj
Please open Telegram to view this post
VIEW IN TELEGRAM
๐4โค1