✨RiddleBench: A New Generative Reasoning Benchmark for LLMs
📝 Summary:
RiddleBench, a new benchmark of 1,737 puzzles, reveals fundamental weaknesses in state-of-the-art LLMs, including hallucination cascades and poor self-correction. Models achieve only about 60% accuracy, underscoring the need for more robust and reliable reasoning capabilities.
🔹 Publication Date: Published on Oct 28
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.24932
• PDF: https://arxiv.org/pdf/2510.24932
✨ Datasets citing this paper:
• https://huggingface.co/datasets/ai4bharat/RiddleBench
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#LLMs #GenerativeAI #AIResearch #Benchmarks #NLP
✨miniF2F-Lean Revisited: Reviewing Limitations and Charting a Path Forward
📝 Summary:
An analysis of miniF2F found that AI systems reached only 36% accuracy because of errors in the benchmark's problems. Correcting these errors produced miniF2F-v2, on which accuracy rises to 70%. High-quality benchmarks like miniF2F-v2 are crucial for evaluating progress in formal reasoning.
🔹 Publication Date: Published on Nov 5
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.03108
• PDF: https://arxiv.org/pdf/2511.03108
• Github: https://github.com/roozbeh-yz/miniF2F_v2
✨ Datasets citing this paper:
• https://huggingface.co/datasets/roozbeh-yz/miniF2F_v2
#AI #FormalReasoning #Benchmarks #MachineLearning #Dataset
✨Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
📝 Summary:
Current video model benchmarks do not assess Chain-of-Frames (CoF) reasoning, which is crucial for world simulators. Gen-ViRe is a new benchmark that decomposes CoF reasoning into cognitive subtasks, offering the first quantitative assessment. It reveals poor reasoning depth despite impressive visual quali...
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13853
• PDF: https://arxiv.org/pdf/2511.13853
#AI #WorldSimulators #VisualReasoning #GenerativeAI #Benchmarks