Princeton on GPT-4 for real-world coding: it generated a working solution only 1.7% of the time.
“We therefore introduce SWE-bench, an evaluation framework including 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue.”
“Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere 4.8% and 1.7% of instances respectively, even when provided with an oracle retriever”
arXiv Paper
Google presents MusicRL: MusicRL is the first music generation system finetuned with RLHF
Paper
Project Page
Until 1 year ago, this vision of the AI future seemed like a joke
Not anymore. Thanks Sam.
AI may be getting stupider over time as a result of interacting with humans - AI researcher James Zou