Vietnamese web text extracted from Common Crawl in plaintext and Markdown.
The dataset comprises Vietnamese text extracted from the full Common Crawl archive and converted to plaintext and Markdown formats.
It supports large-scale Vietnamese NLP work including language-model pretraining and web-text analysis.
Use the roughly 10 TB corpus to pretrain language models, or continue pretraining existing ones, on Vietnamese web text.
Index the extracted Vietnamese pages to create domain-specific retrieval or question-answering datasets.
Analyze the provided quality filters and Vietnamese subset to benchmark Common Crawl processing pipelines.
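As a rough illustration of the retrieval use case, the sketch below selects records that mention domain keywords. It is not part of the dataset's own tooling: the helper names and the `text` field name are assumptions, so check the actual column names before using it. Text is normalized to NFC first because Vietnamese web pages mix precomposed and decomposed diacritics.

```python
import unicodedata


def normalize(text):
    """Put text in NFC form and casefold it so accent encodings and case do not block matches."""
    return unicodedata.normalize("NFC", text).casefold()


def select_domain_records(records, keywords, text_field="text"):
    """Yield records whose text contains any keyword.

    `text_field` is an assumed column name, not confirmed by the dataset.
    """
    wanted = [normalize(k) for k in keywords]
    for record in records:
        body = normalize(record.get(text_field, ""))
        if any(k in body for k in wanted):
            yield record
```

Because the function is a generator, it can be applied to a streamed dataset without holding the corpus in memory.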
from datasets import load_dataset
ds = load_dataset("Symato/cc")

A filtered Vietnamese subset of Common Crawl containing roughly 10 TB of plaintext and Markdown text.
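At this size, downloading the full corpus before inspecting it may be impractical. A minimal sketch using the streaming mode of the `datasets` library follows. The `train` split name is an assumption, and `peek` is a hypothetical helper introduced here for illustration.

```python
from itertools import islice


def stream_symato_cc(split="train"):
    """Open the corpus lazily. The split name is an assumption; adjust it if the repository differs."""
    # Imported here so that `peek` stays usable without the `datasets` package installed.
    from datasets import load_dataset

    return load_dataset("Symato/cc", split=split, streaming=True)


def peek(records, n=3):
    """Return the first n records of any iterable without reading further."""
    return list(islice(records, n))


# Usage (needs network access; records are fetched on demand):
# for record in peek(stream_symato_cc(), n=3):
#     print(record)
```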