Absolute state of prompting rn
Thank God for Grok. Really outdid yourself, Elon.
It’s real - just reproduced it
They manually censored the exact original prompt, so it now refuses to even follow the directions properly.
But with slight rewordings of the original, you can reproduce it on the first try.
Preliminary results showing that OpenAI’s newest GPT-4 Turbo upgrade has totally failed to solve the “laziness” problem
OpenAI must be either totally incompetent at benchmarking, or total liars.
Leaning toward the latter.
“Overall, the new gpt-4-0125-preview model does worse on the lazy coding benchmark as compared to the November gpt-4-1106-preview model”
Lazy coding benchmark for gpt-4-0125-preview
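For anyone new to the thread: “lazy coding” means the model replaces parts of the code it was asked to write with placeholder comments like “# ... rest of code unchanged”, and a benchmark can flag that mechanically. Here is a minimal sketch of such a check; the marker list and function are illustrative only, not the quoted benchmark’s actual implementation:

```python
import re

# Phrases that signal the model elided code instead of writing it.
# Illustrative list only - the real benchmark has its own heuristics.
LAZY_MARKERS = [
    r"#\s*\.\.\.",                   # "# ..."
    r"rest of (the )?code",          # "# ... rest of the code unchanged"
    r"remains? (the )?same",         # "# remains the same"
    r"(implementation|logic) here",  # "# implementation here"
]

def is_lazy(completion: str) -> bool:
    """Return True if a code completion contains elision markers."""
    return any(re.search(p, completion, re.IGNORECASE) for p in LAZY_MARKERS)

print(is_lazy("def f():\n    # ... rest of code unchanged\n"))  # True
print(is_lazy("def f():\n    return 42\n"))                     # False
```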
Show me a feminist
Sam says NOW they really got rid of the laziness
Unlike the last 3 times they claimed to have gotten rid of it, but hadn’t, trust-me-bro.
People are skeptical
Princeton on ChatGPT-4 for real-world coding: it generated a working solution only 1.7% of the time.
“We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.”
“Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever”
arXiv Paper
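If you want to poke at this yourself, SWE-bench is published on the Hugging Face Hub. A minimal sketch for inspecting one task instance follows; the dataset id and field names match the public release at the time of writing, but verify against the current schema before relying on them:

```python
from datasets import load_dataset  # pip install datasets

# Each SWE-bench instance pairs a real GitHub issue with the gold PR patch.
swe = load_dataset("princeton-nlp/SWE-bench", split="test")

task = swe[0]
print(task["repo"])                     # source repository
print(task["instance_id"])              # unique repo + PR identifier
print(task["problem_statement"][:300])  # the issue text the model is given
# task["patch"] is the reference diff from the merged pull request; a model
# is scored on whether its own generated patch makes the repo's tests pass.
```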
Google presents MusicRL, the first music generation system finetuned with RLHF
Paper
Project Page
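For readers who haven’t seen RLHF outside of chatbots: the idea is to finetune a generator against a learned reward model of human preferences rather than a fixed loss. Below is a toy REINFORCE-style illustration in PyTorch with a stand-in reward; it shows the shape of the technique only and is not MusicRL’s actual models or training loop:

```python
import torch

# Toy RLHF sketch: a tiny categorical "policy" over 8 token types is
# finetuned with REINFORCE against a stand-in reward model. Everything
# here is didactic - a real system uses a pretrained generator and a
# reward model trained on human preference data.
vocab, seq_len, batch = 8, 16, 32
logits = torch.zeros(vocab, requires_grad=True)  # the entire "policy"
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_model(tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in for a learned preference model: rewards sequences
    # that use token 3 often.
    return (tokens == 3).float().mean(dim=1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample((batch, seq_len))             # sample sequences
    rewards = reward_model(tokens)                     # score them
    baseline = rewards.mean()                          # variance reduction
    log_probs = dist.log_prob(tokens).sum(dim=1)       # sequence log-probs
    loss = -((rewards - baseline) * log_probs).mean()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("mean reward after finetuning:",
      reward_model(dist.sample((batch, seq_len))).mean().item())
```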