Is the dataset free to use?

Yes, it is hosted publicly on the Hugging Face Hub and is free to download; the summary does not state usage terms.

How do I access the data?

Load it with the Hugging Face datasets library using load_dataset('Symato/cc').

What license applies?

License information is not specified in the dataset card summary; check the full page for details.

cc

Vietnamese web text extracted from Common Crawl in plaintext and Markdown.

Dataset · Text & NLP · ↓ 234K/mo · Free


Updated 2026-06-18

What is cc?

The dataset comprises Vietnamese text extracted from the full Common Crawl archive and converted to plaintext and Markdown formats.

It supports large-scale Vietnamese NLP work including language-model pretraining and web-text analysis.

What you can build with cc

Train Vietnamese LLMs

Use the roughly 10 TB plaintext corpus to pretrain, or continue pretraining, language models on Vietnamese web text.
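A common preparation step for pretraining is packing documents into fixed-length token sequences. The sketch below is only an illustration of that idea: whitespace splitting stands in for a real tokenizer, and the `<eos>` marker and the `pack_tokens` name are hypothetical, not part of the dataset or any library.

```python
from typing import Iterable, Iterator, List


def pack_tokens(documents: Iterable[str], seq_len: int, eos: str = "<eos>") -> Iterator[List[str]]:
    """Concatenate documents and cut the stream into sequences of seq_len tokens.

    Assumption: str.split() is a placeholder for a real subword tokenizer.
    A trailing remainder shorter than seq_len is dropped.
    """
    buffer: List[str] = []
    for doc in documents:
        buffer.extend(doc.split())
        buffer.append(eos)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```

Because the function is a generator, it can consume a streamed corpus without holding more than one sequence's worth of tokens in memory.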

Build search and QA systems

Index the extracted Vietnamese pages to create domain-specific retrieval or question-answering datasets.
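As a rough illustration of indexing, the following sketch builds a tiny in-memory inverted index over page records. It assumes each record is a dict with a `text` field, which the listing does not confirm, and it tokenizes on whitespace only; a real retrieval system would use a proper Vietnamese tokenizer and a search engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(pages: Iterable[Dict[str, str]], text_field: str = "text") -> Dict[str, Set[int]]:
    """Map each lowercased whitespace token to the set of page positions containing it.

    Assumption: `text_field` names the column holding the page text.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, page in enumerate(pages):
        for token in page[text_field].lower().split():
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return sorted page positions that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)
```

The returned positions refer to the order in which pages were passed to `build_index`, so the caller needs to keep that order (or store an ID alongside each page) to map hits back to documents.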

Study web data filtering

Analyze the provided quality filters and Vietnamese subset to benchmark Common Crawl processing pipelines.
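For comparison experiments, a baseline heuristic filter can be useful. The sketch below is a hypothetical example, not the dataset's own filtering: the function names and the thresholds (`min_chars`, `min_alpha_ratio`) are assumptions chosen for illustration.

```python
from typing import Iterable


def alpha_ratio(text: str) -> float:
    """Fraction of characters that are letters (Unicode-aware via str.isalpha)."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def passes_filter(text: str, min_chars: int = 50, min_alpha_ratio: float = 0.6) -> bool:
    """Keep text that is long enough and mostly made of letters.

    Both thresholds are illustrative assumptions, not values from the dataset.
    """
    stripped = text.strip()
    return len(stripped) >= min_chars and alpha_ratio(stripped) >= min_alpha_ratio


def keep_rate(texts: Iterable[str], **thresholds) -> float:
    """Share of texts that pass the filter; 0.0 for an empty input."""
    items = list(texts)
    if not items:
        return 0.0
    return sum(passes_filter(t, **thresholds) for t in items) / len(items)
```

Running `keep_rate` over a sample of the corpus with different thresholds gives a simple number to compare against other Common Crawl processing pipelines.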

Load cc

Python

from datasets import load_dataset

ds = load_dataset("Symato/cc")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('Symato/cc')
4. Access the plaintext or Markdown splits for Vietnamese content.
5. Stream or download subsets as needed for training.
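Given the corpus size, streaming is usually more practical than a full download. The sketch below shows one way to do that with the `streaming=True` option of `load_dataset`. The listing does not document split or column names, so the code simply takes whichever split comes first; `take` and `stream_first_split` are illustrative helper names, not part of the library.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def take(records: Iterable[Dict], n: int) -> List[Dict]:
    """Return at most the first n records without consuming the rest."""
    return list(islice(records, n))


def stream_first_split(repo_id: str = "Symato/cc") -> Iterator[Dict]:
    """Lazily iterate over one split without downloading the whole corpus.

    Assumption: the repository loads with its default configuration. Split
    names are not documented in the listing, so the first one found is used.
    """
    from datasets import load_dataset  # lazy import; needs `pip install datasets`

    splits = load_dataset(repo_id, streaming=True)  # dict-like: split name -> iterable dataset
    first_split = next(iter(splits.values()))
    return iter(first_split)


# Example (requires network access):
#     for record in take(stream_first_split(), 3):
#         print(record)
```

Inspecting a few records this way is also a cheap method to discover the actual column names before writing any training code.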

cc: pros & cons

Pros

+ Massive Vietnamese web corpus (~10 TB plaintext)
+ Sourced directly from Common Crawl WARC files
+ Includes both plaintext and Markdown formats
+ Publicly hosted on Hugging Face

Cons

- Extremely large size requires substantial storage and bandwidth
- Web data remains noisy despite basic filters
- No detailed license or usage terms provided in the summary


Frequently asked questions

What is Symato/cc?

A filtered Vietnamese subset of Common Crawl containing roughly 10 TB of plaintext and Markdown text.



Similar datasets

KakologArchives

wikitext

gsm8k
