Hugging Face
Hugging Face (Twitter)

RT @LeRobotHF: 🚀 Big news: we just added Reachy 2 to LeRobot!

Huge thanks to our friends at @pollenrobotics 💛🤗
Reachy 2 is also available in simulation, so you can try it out right away.
🎥 Check out the teleop & autonomous demo below!
Hugging Face (Twitter)

RT @QuixiAI: Cannot wait to try the new Kimi K2! @Kimi_Moonshot
Hugging Face (Twitter)

RT @Laz4rz: Brand new, fresh out of a French printer
Hugging Face (Twitter)

RT @Thom_Wolf: This is huge

Continuing our foundational work to enable anyone to train state-of-the-art AI models, we’re thrilled to release « FinePDFs »

3T tokens of textual data that until now was locked away in PDFs, arguably some of the highest quality publicly available data out there.

We gathered FinePDFs to create the largest permissively licensed corpus sourced exclusively from PDFs.

Amazingly challenging infra and processing work, h/t to the fineweb team https://twitter.com/HKydlicek/status/1964584936524124645#m
Hugging Face (Twitter)

RT @HKydlicek: We are releasing 📄 FinePDFs:
the largest PDF dataset spanning over half a billion documents!

- Long context: Documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science.
- Substantially improves over SoTA when mixed with FW-EDU & DCLM web corpora.
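A quick back-of-envelope check of the numbers quoted in these posts (3T tokens across ~475M documents) is consistent with the "documents are 2x longer than web text" claim — the ~1–3k-token average for web pages used for comparison is an assumption here, not a figure from the posts:

```python
# Sanity-check the FinePDFs stats quoted above: ~3T tokens over ~475M PDFs.
total_tokens = 3_000_000_000_000   # ~3T tokens (from the announcement)
num_documents = 475_000_000        # ~475M PDF documents (from the announcement)

avg_tokens_per_doc = total_tokens / num_documents
print(f"average tokens per PDF document: {avg_tokens_per_doc:,.0f}")
# → average tokens per PDF document: 6,316
```

At roughly 6.3k tokens per document, PDFs are indeed several times longer than a typical web page extract.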
Hugging Face (Twitter)

RT @gpj: Released a new synthetic dataset: 1.5k [human] → 10k [synthetic] children’s stories.

Pipeline generated by @Kilo_Code, with model switching via the @poe_platform API 🙏🤗

https://huggingface.co/datasets/garethpaul/children-stories-dataset
Hugging Face (Twitter)

RT @maximelabonne: Pheww, another banger dataset from @huggingface!

> 3T tokens, 475M PDFs, 1733 languages

> Close to Nemotron-CC v2 and FineWeb-Edu+DCLM on its own (‼️)

> Greatly boosts perf when combined, likely because it provides high diversity that complements the other datasets well
Hugging Face (Twitter)

RT @TrackioApp: Trackio represents @huggingface's effort to democratize experiment tracking for the community:

> absolutely free
> open-source
> local-first
> a drop-in alternative to commercial solutions
Hugging Face (Twitter)

RT @OfirPress: 3 out of the top 6 most downloaded datasets on @huggingface are SWE-bench related.

Thanks!!! ♥️
Hugging Face (Twitter)

RT @TencentHunyuan: We did it! We now have two models in the top two spots on the @huggingface trending charts.

🥇 Hunyuan-MT-7B
🥈 HunyuanWorld-Voyager

Download and deploy the models for free on Hugging Face and GitHub. Your stars and feedback are welcome! 🌟👍❤️

This is just the beginning. Stay tuned for our next open-source release next week!
Hugging Face (Twitter)

RT @Thom_Wolf: wow, total BoM cost $660, folks

open-source community >> closed source hyped robots
Hugging Face (Twitter)

RT @LeRobotHF: Almost 10,000 followers here! Let's build the biggest and most active community of Robotics AI builders thanks to open-source!
Hugging Face (Twitter)

RT @Thom_Wolf: 3 trillion tokens, finely distilled from more than a petabyte of PDF files

We’ve just released FinePDFs, the latest addition to the FineWeb family of datasets
Hugging Face (Twitter)

RT @cgeorgiaw: 🚨 Big news in ML for biotech 🚨

Today, we're launching the Antibody Developability Prediction Competition with @Ginkgo + @huggingface!

💧 Hydrophobicity
🎯 Polyreactivity
🧲 Self-association
🔥 Thermostability
🧪 Titer

🏆 Up to $60k in prizes
📅 Submit by Nov 1, 2025
Hugging Face (Twitter)

RT @charlesbben: Recently finished writing a new blogpost about @PyTorch compilation in ZeroGPU Spaces.

Worth reading if you're interested in learning about:

- PyTorch ahead-of-time compilation
- ZeroGPU internals

https://huggingface.co/blog/zerogpu-aoti
Hugging Face (Twitter)

RT @HuggingPapers: Here's your recap of the hottest AI papers on @huggingface for September 1-7! This week, we dive into LLM comprehension, hallucination, robotics, and more:

- Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
- From Editor to Dense Geometry Estimator
- Open Data Synthesis For Deep Research (mentioning @Google Gemini)
- Towards a Unified View of Large Language Model Post-Training
- ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
- Why Language Models Hallucinate
- Robix: A Unified Model for Robot Interaction, Reasoning and Planning (outperforming @OpenAI GPT-4o & @Google Gemini 2.5 Pro)
- DeepResearch Arena: The First Exam of LLMs' Research Abilities
Hugging Face (Twitter)

RT @rohanpaul_ai: MASSIVE. The largest open-source PDF dataset just dropped on @huggingface: FinePDFs

3 trillion tokens across 475 million documents in 1733 languages.

This is the largest publicly available corpus sourced exclusively from PDFs.
The data was sourced from 105 CommonCrawl snapshots spanning summer 2013 to February 2025, as well as refetched from the internet, and processed using 🏭 datatrove, Hugging Face's large-scale data processing library.

This carefully deduplicated and filtered dataset comprises roughly 3.65 terabytes of text, about 3T tokens. For PII and opt-out policies, see the Personal and Sensitive Information and opt-out sections of the dataset card.

The dataset is fully reproducible and released under the ODC-By 1.0 license. The reproduction code, ablation, and evaluation setup will be available in the accompanying GitHub repository soon 👷.

Compared to HTML datasets, despite being only mildly filtered, it achieves results nearly on par with...
