Preliminary results showing that OpenAI’s newest GPT-4 Turbo upgrade has totally failed to solve the “laziness” problem
OpenAI must be either totally incompetent at benchmarking, or total liars.
Leaning toward the latter.
“Overall, the new gpt-4-0125-preview model does worse on the lazy coding benchmark as compared to the November gpt-4-1106-preview model”
Lazy coding benchmark for gpt-4-0125-preview
Sam says NOW they really got rid of the laziness
Unlike the last 3 times they claimed to have gotten rid of it, but hadn’t, trust-me-bro.
People are skeptical
Princeton on GPT-4 for real-world coding: it generated a working solution only 1.7% of the time.
“We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.”
“Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever”
Arxiv Paper
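The quoted percentages are just pass rates over the 2,294 benchmark instances. A minimal sketch of that arithmetic, assuming one resolved/unresolved flag per instance (the helper below is illustrative, not SWE-bench’s actual evaluation harness; the 39-instance count is back-derived from the paper’s 1.7% figure):

```python
def resolution_rate(resolved_flags):
    """Percentage of benchmark instances whose generated patch resolved the issue."""
    if not resolved_flags:
        return 0.0
    return 100.0 * sum(resolved_flags) / len(resolved_flags)

# GPT-4 resolving ~39 of 2,294 instances gives the paper's 1.7% figure.
total = 2294
gpt4_flags = [True] * 39 + [False] * (total - 39)
print(f"{resolution_rate(gpt4_flags):.1f}%")  # → 1.7%
```

The point of the post stands either way: at these rates, roughly 98 out of every 100 real GitHub issues go unresolved.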
Google presents MusicRL: the first music generation system finetuned with RLHF (reinforcement learning from human feedback)
Paper
Project Page
Until 1 year ago, this vision of the AI future seemed like a joke
Not anymore. Thanks Sam.