Data Science | Machine Learning with Python for Researchers

✨RiddleBench: A New Generative Reasoning Benchmark for LLMs

📝 Summary:
RiddleBench, a new benchmark of 1,737 puzzles, reveals fundamental weaknesses in state-of-the-art LLMs, including hallucination cascades and poor self-correction. Models achieve only about 60% accuracy, underscoring the need for more robust and reliable reasoning capabilities.

🔹 Publication Date: Published on Oct 28

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.24932
• PDF: https://arxiv.org/pdf/2510.24932

✨ Datasets citing this paper:
• https://huggingface.co/datasets/ai4bharat/RiddleBench

==================================

For more data science resources:
✓ https://t.iss.one/DataScienceT

#LLMs #GenerativeAI #AIResearch #Benchmarks #NLP


✨miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward

📝 Summary:
An analysis of miniF2F found that AI systems reached only 36% accuracy, a result attributed to errors in the benchmark's problems. Correcting those errors produced miniF2F-v2, on which accuracy rose to 70%. High-quality benchmarks such as miniF2F-v2 are crucial for evaluating progress in formal reasoning.

🔹 Publication Date: Published on Nov 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03108
• PDF: https://arxiv.org/pdf/2511.03108
• Github: https://github.com/roozbeh-yz/miniF2F_v2

✨ Datasets citing this paper:
• https://huggingface.co/datasets/roozbeh-yz/miniF2F_v2

==================================

#AI #FormalReasoning #Benchmarks #MachineLearning #Dataset


✨Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

📝 Summary:
Current video-model benchmarks do not assess Chain-of-Frames (CoF) reasoning, which is crucial for world simulators. Gen-ViRe is a new benchmark that decomposes CoF reasoning into cognitive subtasks, offering the first quantitative assessment. It reveals poor reasoning depth despite impressive visual quali...

🔹 Publication Date: Published on Nov 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13853
• PDF: https://arxiv.org/pdf/2511.13853

==================================

#AI #WorldSimulators #VisualReasoning #GenerativeAI #Benchmarks
