Preliminary results showing that OpenAI’s newest GPT-4 Turbo upgrade has totally failed to solve the “laziness” problem
OpenAI must be either totally incompetent at benchmarking, or total liars.
Leaning toward the latter.
“Overall, the new gpt-4-0125-preview model does worse on the lazy coding benchmark as compared to the November gpt-4-1106-preview model”
Lazy coding benchmark for gpt-4-0125-preview
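For a sense of what “lazy coding” means in practice, here is a minimal probe one could run, without assuming anything about the quoted benchmark’s methodology: send the same refactoring prompt to both model snapshots via the OpenAI Python SDK and count replies that elide code behind placeholder comments. The prompt and the elision markers below are illustrative assumptions, not the benchmark’s actual test set.
```python
# Hedged sketch: probe two GPT-4 Turbo snapshots for "lazy" replies that
# elide code behind placeholder comments. Illustrative only -- the prompt
# and markers are assumptions, not the quoted benchmark's methodology.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical laziness markers; real benchmarks grep for similar patterns.
LAZY_MARKERS = ["# ...", "// ...", "rest of the code", "remainder unchanged"]

PROMPT = (  # hypothetical probe prompt
    "Refactor this function for readability and return the COMPLETE file, "
    "with no omissions:\n"
    "def f(xs):\n    return [x * 2 for x in xs if x > 0]\n"
)

def lazy_rate(model: str, trials: int = 10) -> float:
    """Fraction of replies containing an elision marker."""
    lazy = 0
    for _ in range(trials):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
        ).choices[0].message.content
        lazy += any(m in reply for m in LAZY_MARKERS)
    return lazy / trials

for model in ("gpt-4-1106-preview", "gpt-4-0125-preview"):
    print(model, lazy_rate(model))
```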
Show me a feminist
Sam says NOW they really got rid of the laziness
Unlike the last 3 times they claimed to have gotten rid of it, but hadn’t, trust-me-bro.
People are skeptical
Princeton on ChatGPT-4 for real-world coding: it generated a working solution only 1.7% of the time.
“We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.”
“Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever.”
arXiv Paper
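The task format is easy to inspect directly: each SWE-bench instance pairs a GitHub issue with the gold pull-request patch that resolved it. A minimal sketch using the Hugging Face datasets library; the dataset ID matches Princeton’s public release, but treat the exact field names as assumptions to be checked against the dataset card.
```python
# Hedged sketch: peek at SWE-bench task instances via Hugging Face datasets.
# Field names ("problem_statement", "patch", ...) are assumptions based on
# the public dataset card; verify against the actual schema.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(ds))  # should be the 2,294 instances cited in the paper

ex = ds[0]
print(ex["repo"], ex["instance_id"])
print(ex["problem_statement"][:300])  # the GitHub issue text the model sees
print(ex["patch"][:300])              # the gold PR diff that resolved it
```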
Google presents MusicRL: the first music generation system finetuned with RLHF
Paper
Project Page
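The RLHF loop itself is generic: sample outputs from the generator, score them with a reward model trained on human preferences, and push the policy toward high-reward samples. A toy REINFORCE-style sketch in PyTorch; the tiny GRU “policy” and the random reward below are stand-ins to show the shape of the update, not MusicRL’s actual architecture or reward signals.
```python
# Hedged toy sketch of an RLHF-style update on a token-based generator:
# sample sequences from the policy, score them with a (preference-trained)
# reward model, and reweight log-likelihood by reward (REINFORCE).
# The tiny GRU and random reward are placeholders, not MusicRL internals.
import torch
import torch.nn as nn

VOCAB, HID, SEQ = 256, 128, 32  # toy "audio token" vocabulary

class TinyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.head = nn.Linear(HID, VOCAB)

    def forward(self, tokens):                  # (B, T) -> (B, T, VOCAB)
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reward_model(seqs):
    # Stand-in for a reward model trained on human ratings.
    return torch.randn(seqs.shape[0])

@torch.no_grad()
def sample(batch=8):
    toks = torch.zeros(batch, 1, dtype=torch.long)  # token 0 as BOS
    for _ in range(SEQ):
        probs = torch.softmax(policy(toks)[:, -1], dim=-1)
        toks = torch.cat([toks, torch.multinomial(probs, 1)], dim=1)
    return toks

# One REINFORCE step: sample, score, reweight log-likelihood by reward.
toks = sample()
logp = torch.log_softmax(policy(toks[:, :-1]), dim=-1)
logp_tok = logp.gather(-1, toks[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)
r = reward_model(toks)
loss = -((r - r.mean()) * logp_tok).mean()      # mean reward as baseline
opt.zero_grad(); loss.backward(); opt.step()
```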
Until 1 year ago, this vision of the AI future seemed like a joke
Not anymore. Thanks Sam.
Humans may be making AI stupider over time as a result of its interactions with humans - AI researcher James Zou