ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho
Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora

📝 Summary:
Wasm is a pipeline that builds a new structured Arabic multimodal dataset from Common Crawl. It preserves document structure and supports both text-only and multimodal pre-training, addressing the shortage of high-quality Arabic datasets.
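
To make "interleaved" and "supports both text-only and multimodal pre-training" more concrete, here is a minimal sketch of how one structured document could serve both uses. It is an illustration only, not the Wasm pipeline or its data format: the names TextBlock, ImageRef, text_only_view, and multimodal_view, the field names, and the example URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextBlock:
    text: str  # a heading or paragraph, kept in page order (hypothetical field)


@dataclass
class ImageRef:
    url: str  # where the image was found on the page (hypothetical field)


Block = Union[TextBlock, ImageRef]


def text_only_view(blocks: List[Block]) -> str:
    """Drop images and join the text blocks in their original order."""
    return "\n\n".join(b.text for b in blocks if isinstance(b, TextBlock))


def multimodal_view(blocks: List[Block]) -> List[dict]:
    """Keep text and images interleaved in page order."""
    out = []
    for b in blocks:
        if isinstance(b, TextBlock):
            out.append({"type": "text", "text": b.text})
        else:
            out.append({"type": "image", "url": b.url})
    return out


# A made-up three-block document: heading, image, paragraph.
doc = [
    TextBlock("# عنوان المقال"),
    ImageRef("https://example.com/figure.jpg"),
    TextBlock("فقرة تشرح الصورة."),
]

plain = text_only_view(doc)
interleaved = multimodal_view(doc)
```

Because the block order is stored once, the text-only view is just a filter over the same record, so the two training formats cannot drift apart.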

🔹 Publication Date: Nov 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07080
• PDF: https://arxiv.org/pdf/2511.07080

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ArabicNLP #MultimodalAI #DatasetCreation #Corpora #DataScience