Language, translation and text-corpus APIs and datasets.
Text & NLP · KakologArchives
Archive of 11 years of Nico Nico Jikkyo live commentary logs.
Text & NLP · Salesforce
Over 100 million tokens from Wikipedia for language modeling benchmarks.
Text & NLP · openai
8.5K grade-school math word problems requiring multi-step arithmetic reasoning.
Text & NLP · allenai
Cleaned Common Crawl corpus with multiple language variants for NLP training.
Text & NLP · SWE-bench
Multilingual benchmark for AI models resolving GitHub issues in code repositories.
Text & NLP · m-a-p
Fine-grained multi-domain web corpus with iteration-wise token statistics.