Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
"Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities."
"We show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot without documentation."
From the author:
"Our new paper finds something quite neat: We easily scale up how many tools LLMs can use to over 200 tools (APIs, models, python functions, etc.) ...without any training, without a single tool-use demonstration!!"
Arxiv Link
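To make this concrete, here is a minimal sketch of the idea (not the paper's code; the tool names, docstrings, and prompt wording below are invented for illustration): the prompt carries only documentation for each tool, with no worked tool-use examples.

```python
# Zero-shot tool use from documentation alone: the model sees what each tool
# does, never an example of it being used. All tool names are hypothetical.

TOOLS = {
    "image_captioner": "image_captioner(image_path) -> str: one-sentence caption of the image.",
    "text_detector": "text_detector(image_path) -> list[str]: all text strings found in the image.",
    "calculator": "calculator(expression) -> float: evaluates an arithmetic expression.",
}

def build_zero_shot_prompt(task: str) -> str:
    # Documentation-only prompt: one line per tool, no demonstrations.
    docs = "\n".join(f"- {doc}" for doc in TOOLS.values())
    return (
        "You can call the following tools. Documentation:\n"
        f"{docs}\n\n"
        f"Task: {task}\n"
        "Plan the sequence of tool calls that solves the task."
    )

print(build_zero_shot_prompt("How many words appear on the sign in photo.jpg?"))
```

Note that each tool costs only one documentation line of context, so growing the toolbox to 200+ tools scales linearly, while few-shot prompting would need demonstrations for every tool and quickly exhaust the context window.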
Does better natural language modeling transfer to better mathematical reasoning? Yes.
"We assume that performance follows RFT > SFT > ICL; from the findings in this paper we know the improvement speed follows RFT < SFT < ICL. And if we have an omnipotent language model which has a pre-training loss that is the same as the corpus randomness, it could have RFT = SFT = ICL = 100. Thus when you pre-train a better language model (i.e. smaller pre-training loss), your model's performance still follows RFT > SFT > ICL but their performance gaps are diminishing. Since you can obtain an RFT model without too much effort (compared to pre-training), then the most important thing we should do is to decrease the model's pre-training loss."
Translation: Simply starting with a far more powerful foundation model, e.g. starting with GPT-4 rather than Llama, has a much bigger impact on model performance than increasing the amount of supervised fine-tuning you do on top.
I.e. Getting someone to spend a massive amount to create a bigger foundation model crushes all else.
I.e. Specialized fine-tuning isnβt enough to eliminate the need for foundation models that have greater general intelligence.
I.e. General intelligence dominates all.
Arxiv Link
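To see why the gaps shrink, here is a toy numerical illustration (all constants invented, not from the paper) where RFT (rejection-sampling fine-tuning), SFT (supervised fine-tuning), and ICL (in-context learning) each close a fixed fraction of the remaining headroom above a base accuracy that improves as pre-training loss falls:

```python
# Toy model of the claim: accuracy keeps the order RFT > SFT > ICL at every
# pre-training loss, but the gaps between them shrink as the loss approaches
# the corpus entropy floor. All numbers are made up for illustration.

def accuracy(pretrain_loss: float, method_gain: float) -> float:
    # Hypothetical base accuracy that rises as pre-training loss falls.
    base = 100.0 * (1.0 - pretrain_loss / 3.0)
    # Each method closes a fixed fraction of the remaining headroom to 100.
    return base + method_gain * (100.0 - base)

for loss in [2.0, 1.6, 1.2, 0.8]:
    icl = accuracy(loss, 0.00)   # in-context learning: base model only
    sft = accuracy(loss, 0.15)   # supervised fine-tuning
    rft = accuracy(loss, 0.30)   # rejection-sampling fine-tuning
    print(f"loss={loss:.1f}  ICL={icl:5.1f}  SFT={sft:5.1f}  "
          f"RFT={rft:5.1f}  RFT-ICL gap={rft - icl:4.1f}")
```

As the pre-training loss approaches the corpus entropy, the base accuracy approaches 100 and all three methods converge (the RFT = SFT = ICL = 100 limit in the quote), which is exactly why driving down pre-training loss dominates fine-tuning effort.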
Congrats to Chad Coin, the coin that keeps our AI chatbots free for all 510K+ users, up another 246% since yesterday!
More soon.
@chadgptcoin