A 2.46-trillion-token, AI-curated STEM pretraining dataset.
AutoMathText-V2 is a large-scale text collection of 2.46 trillion tokens spanning web, mathematics, code, and reasoning content.
It is useful for pretraining and fine-tuning NLP models that require broad STEM knowledge and reasoning capabilities.
Train or continue pretraining large models on the full 2.46T tokens to improve mathematical and scientific reasoning capabilities.
Fine-tune models on the reasoning and mathematics portions to create specialized QA tools for STEM problems.
Use the included code and math text to train models that generate or explain code in scientific computing contexts.
from datasets import load_dataset
ds = load_dataset("OpenSQZ/AutoMathText-V2")
A 2.46-trillion-token deduplicated text dataset covering web, mathematics, code, and reasoning for STEM language-model pretraining.
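At 2.46 trillion tokens, the collection may be impractical to download in full just to inspect it. The sketch below shows one possible way to look at a few records first, using the `streaming=True` option of `load_dataset`. The helper names `peek` and `open_stream` are hypothetical. The `"train"` split, and the assumption that the repository loads without a configuration name, are not confirmed by this listing and may need adjusting.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most the first n records without consuming the rest of the stream."""
    return list(islice(records, n))


def open_stream(repo_id: str = "OpenSQZ/AutoMathText-V2", split: str = "train"):
    """Open the dataset lazily instead of downloading it up front.

    Assumption: a "train" split exists and no configuration name is required.
    """
    # Imported here so that peek() stays usable without the datasets package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    # Print the field names of the first few records to learn the schema.
    for record in peek(open_stream(), n=3):
        print(sorted(record.keys()))
```

Checking the field names this way, before committing to a full download, can help when deciding which portions (web, mathematics, code, or reasoning) to use for a given training run.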