Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
“Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation—descriptions for the individual tool usage—over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities.”
“we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot without documentation.”
From the author:
“Our new paper finds something quite neat: We easily scale up how many tools LLMs can use to over 200 tools (APIs, models, python functions, etc.) ...without any training, without a single tool-use demonstration!!”
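To make that concrete, here is a minimal Python sketch of documentation-based zero-shot prompting (our illustration, not the paper's actual prompt format; the tool names, docstrings, and the call_llm stub are assumptions). The point it shows: each tool costs one line of documentation, so scaling from 3 tools to 200+ means appending documentation lines, not collecting demonstrations per tool.

# A minimal sketch of the documentation-based idea: instead of packing
# few-shot demonstrations into the prompt, give the model one short
# docstring per tool and ask it zero-shot to pick and call one.
# Tool names, docs, and the call_llm stub are illustrative assumptions,
# not the paper's actual prompt format.

TOOL_DOCS = {
    "image_captioner": "image_captioner(image_path) -> str. Returns a one-sentence caption for the image.",
    "calculator": "calculator(expression) -> float. Evaluates an arithmetic expression, e.g. '3 * (4 + 5)'.",
    "web_search": "web_search(query) -> str. Returns a short snippet answering a factual query.",
}

def build_prompt(task: str) -> str:
    """Assemble a zero-shot prompt from tool documentation alone (no demonstrations)."""
    docs = "\n".join(f"- {doc}" for doc in TOOL_DOCS.values())
    return (
        "You can use the following tools. Each is described by its documentation:\n"
        f"{docs}\n\n"
        f"Task: {task}\n"
        "Respond with a single tool call that solves the task."
    )

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g. an OpenAI or local model client)."""
    raise NotImplementedError

if __name__ == "__main__":
    # Adding a 201st tool would be one more TOOL_DOCS entry, not K more demos.
    print(build_prompt("What is 17 * 24?"))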
arXiv Link
Does better natural language modeling transfer to better mathematical reasoning? Yes.
“We assume that performance follows RFT>SFT>ICL, from the findings in this paper we know the improvement speed follows RFT<SFT<ICL. And if we have an omnipotent language model which has a pre-training loss that is the same as the corpus randomness, it could have RFT = SFT = ICL = 100. Thus when you pre-train a better language model (i.e. smaller pre-training loss), your model’s performance still follows RFT>SFT>ICL but their performance gaps are diminishing. Since you can obtain an RFT model without too much effort (compared to pre-training), then the most important thing we should do is to decrease the model’s pre-training loss.”
Translation: Simply starting with a far more powerful foundation model, e.g. starting with GPT-4 rather than Llama, has a much bigger impact on model performance than increasing the amount of supervised fine-tuning you do on top.
I.e. Spending a massive amount to create larger foundation models crushes all else.
I.e. Specialized fine-tuning isn’t enough to eliminate the need for foundation models that have greater general intelligence.
I.e. General intelligence dominates all.
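A toy model of the diminishing-gaps argument above (our illustration, not the paper's fitted scaling law): suppose each method m ∈ {ICL, SFT, RFT} gains accuracy linearly as pre-training loss L falls toward the corpus entropy L_0, where the hypothetical "omnipotent" model lives:

\[ \mathrm{Acc}_m(L) = 100 - k_m\,(L - L_0), \qquad k_{\mathrm{RFT}} < k_{\mathrm{SFT}} < k_{\mathrm{ICL}}. \]

At any fixed L > L_0 the smaller slope wins, giving RFT > SFT > ICL in performance, while the steeper slopes gain fastest as L shrinks, giving RFT < SFT < ICL in improvement speed. The gap between any two methods,

\[ \mathrm{Acc}_{\mathrm{RFT}}(L) - \mathrm{Acc}_{\mathrm{ICL}}(L) = (k_{\mathrm{ICL}} - k_{\mathrm{RFT}})\,(L - L_0), \]

shrinks to zero as L \to L_0: all three converge to 100, which is exactly why lowering pre-training loss dominates any choice of fine-tuning.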
arXiv Link
Congrats to Chad Coin, the coin that keeps our AI chatbots free for all 510K+ users, up another 246% since yesterday!
More soon.🤐
@chadgptcoin