Chat GPT
Thank God for Grok. You really outdid yourself, Elon.
It’s real. I just reproduced it.
They manually censored the exact original prompt, so it now refuses to even follow the directions properly.
But if you make slight rewordings of the original, you can reproduce it on the first try.
Preliminary results show that OpenAI’s newest GPT-4 Turbo upgrade has totally failed to solve the “laziness” problem.
OpenAI must be either totally incompetent at benchmarking or total liars.
Leaning toward the latter.
“Overall, the new gpt-4-0125-preview model does worse on the lazy coding benchmark as compared to the November gpt-4-1106-preview model”
Lazy coding benchmark for gpt-4-0125-preview
Sam says NOW they’ve really gotten rid of the laziness.
Unlike the last three times they claimed to have gotten rid of it but hadn’t, trust-me-bro.
People are skeptical.
Princeton on GPT-4 for real-world coding: it generated a working solution only 1.7% of the time.
“We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.”
“Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever”
arXiv Paper
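For anyone who wants to poke at the task format themselves, here is a minimal sketch of loading the SWE-bench instances through the Hugging Face `datasets` library. The dataset path and field names are assumptions based on the public release, not quoted from the paper.

```python
from datasets import load_dataset

# Minimal sketch: inspect SWE-bench task instances via Hugging Face `datasets`.
# Dataset path and field names are assumptions based on the public release.
ds = load_dataset("princeton-nlp/SWE-bench", split="test")

task = ds[0]
print(task["repo"])               # source repository (one of the 12 Python repos)
print(task["base_commit"])        # codebase snapshot the model has to edit
print(task["problem_statement"])  # GitHub issue text given to the model
print(task["patch"])              # gold pull-request diff used for grading
```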
Google presents MusicRL, the first music generation system finetuned with RLHF.
Paper
Project Page
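For intuition about what “finetuned with RLHF” means here, below is a minimal sketch of one REINFORCE-style update against a learned reward model. The `policy.sample` API and `reward_model` are hypothetical stand-ins for illustration, not the MusicRL implementation.

```python
import torch

def rlhf_step(policy, reward_model, prompts, optimizer):
    # Sample token sequences from the current generator, keeping per-token
    # log-probs. `policy.sample` is a hypothetical API: returns (B, T), (B, T).
    tokens, log_probs = policy.sample(prompts)
    with torch.no_grad():
        rewards = reward_model(prompts, tokens)  # scalar reward per sample: (B,)
    # REINFORCE with a mean baseline: raise the log-probability of
    # high-reward samples, lower it for low-reward ones.
    advantage = rewards - rewards.mean()
    loss = -(advantage * log_probs.sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```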