Allgemein

Hugging Face Releases FineTranslations, a Trillion-Token Multilingual Parallel Text Dataset

Hugging Face Releases FineTranslations, a Trillion-Token Multilingual Parallel Text Dataset

Hugging Face has released FineTranslations, a large-scale multilingual dataset containing more than 1 trillion tokens of parallel text across English and 500+ languages. The dataset was created by translating non-English content from the FineWeb2 corpus into English using Gemma3 27B, with the full data generation pipeline designed to be reproducible and publicly documented.

By Robert KrzaczyƄski