Using some publicly available data, and assuming each of these models is trained in a way similar to Chinchilla, we can compare the performance of GPT-4 to GPT-2, GPT-3, Chinchilla, and PaLM.
Let's calculate what GPT-4's performance would be if it used 10x more parameters without retrieval, and naively assume that will be its performance with retrieval. This chart is what we get.
With the algorithmic adjustment, the qualitative improvement from GPT-3 (vanilla) to GPT-4 is comparable to the improvement from GPT-2 to GPT-3. Since that was a rather big jump, I expect many will be stunned by GPT-4, especially those who expected strong diminishing returns.
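The extrapolation above can be sketched with the Chinchilla parametric loss fit, L(N, D) = E + A/N^α + B/D^β (Hoffmann et al. 2022, Approach 3). The model sizes below and the 20-tokens-per-parameter ratio are illustrative assumptions, not official figures:

```python
# Chinchilla-style loss fit (Hoffmann et al. 2022, Approach 3 constants).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted pretraining loss (nats/token) for N parameters, D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Illustrative sizes: GPT-3 at 175B params, a hypothetical 10x-larger model,
# both assumed compute-optimally trained at ~20 tokens per parameter.
gpt3_n = 175e9
gpt4_n = 10 * gpt3_n  # the "10x more parameters" assumption from the post

l3 = loss(gpt3_n, 20 * gpt3_n)
l4 = loss(gpt4_n, 20 * gpt4_n)
print(f"GPT-3-scale loss: {l3:.3f}, 10x-scale loss: {l4:.3f}")
```

Under this fit the loss keeps falling smoothly with scale rather than flattening out, which is the sense in which strong diminishing returns are not yet expected.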
"In short: Training runs of large Machine Learning systems are likely to last less than 14-15 months. This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms."
https://www.lesswrong.com/posts/RihYwmskuJT9Rkbjq/the-longest-training-run