zhongyangribao
Historical Chinese newspaper texts from Zhongyang Ribao for NLP tasks.
What is zhongyangribao?
The zhongyangribao dataset consists of text from the historical Chinese newspaper Zhongyang Ribao archived by banned-historical-archives.
It is useful for NLP research and machine learning work involving Chinese historical newspaper and archival text data.
What you can build with zhongyangribao
Historical Chinese NLP training
Fine-tune language models on authentic mid-20th-century newspaper text for improved handling of classical-modern Chinese transitions and period-specific vocabulary.
Topic modeling of political discourse
Run LDA or BERTopic pipelines to track evolving themes such as propaganda, international relations, and domestic policy across decades of articles.
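Topic models such as LDA need tokenized input, and Chinese text has no whitespace between words. The sketch below is a dependency-free stand-in for a proper word segmenter: it counts overlapping character bigrams per period, which gives a rough view of shifting vocabulary before a full pipeline is built. The period labels and sample strings are hypothetical; the dataset's actual date field and text field are not documented and would need to be checked first.

```python
from collections import Counter
import re

# Runs of CJK Unified Ideographs (basic block); punctuation and Latin text are skipped.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def char_bigrams(text):
    """Return overlapping two-character terms found inside each CJK run."""
    terms = []
    for run in CJK_RUN.findall(text):
        terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms

def top_terms_by_period(docs, k=3):
    """docs: iterable of (period_label, text). Returns {period: [(term, count), ...]}."""
    counts = {}
    for period, text in docs:
        counts.setdefault(period, Counter()).update(char_bigrams(text))
    return {period: c.most_common(k) for period, c in counts.items()}

# Hypothetical toy records, not taken from the dataset.
sample = [
    ("1940s", "國際關係與國際合作"),
    ("1950s", "國內政策與經濟政策"),
]
result = top_terms_by_period(sample, k=1)
print(result)
```

The resulting per-period counts can be passed on to an LDA or BERTopic pipeline once a real segmenter replaces the bigram step.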
OCR and layout analysis benchmarking
Where page scans are available alongside transcripts, use them to evaluate document-understanding models on noisy historical print layouts and traditional Chinese characters.
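A common metric for this kind of OCR benchmarking is character error rate (CER): the edit distance between a reference transcript and the model output, divided by the reference length. The following is a minimal pure-Python sketch; the function names are illustrative and the sample strings are made up, not drawn from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical example: the OCR output confuses 日 with the similar-looking 曰.
score = cer("中央日報", "中央曰報")
print(score)  # one substitution in four characters
```

CER is computed per character, which suits Chinese text because no word segmentation is needed before scoring.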
Load zhongyangribao
# Install the library first (shell): pip install datasets
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("banned-historical-archives/zhongyangribao")

# Inspect the first record, then preview the split as a DataFrame
print(ds["train"][0])
ds["train"].to_pandas().head()
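Because the listing reports no dataset card, the column names are not documented. One way to get oriented after loading is to guess which field of a record holds the article body by picking the longest string value. This is only a heuristic sketch: guess_text_field is a hypothetical helper, and the example record below is invented for illustration.

```python
from typing import Any, Mapping, Optional

def guess_text_field(record: Mapping[str, Any]) -> Optional[str]:
    """Return the key whose value is the longest string, or None if there are no string fields."""
    best_key, best_len = None, -1
    for key, value in record.items():
        if isinstance(value, str) and len(value) > best_len:
            best_key, best_len = key, len(value)
    return best_key

# Hypothetical record; the real field names may differ.
example = {"date": "1948-01-01", "page": 1, "body": "本報訊：" + "新聞內容" * 10}
field = guess_text_field(example)
print(field)
```

With the dataset loaded, the same helper could be applied to a real record, for example guess_text_field(ds["train"][0]), assuming a train split exists as in the snippet above.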
zhongyangribao: pros & cons
Pros
- Large collection of real historical Chinese newspaper text
- Directly loadable via Hugging Face datasets
- Useful for period-specific language and political analysis
- Maintained under banned-historical-archives org
Cons
- No dataset card or description provided
- Content is likely Chinese-only, which limits use for other languages
- Potential copyright or sensitivity restrictions on redistribution
Frequently asked questions
What is zhongyangribao?
A Hugging Face dataset containing issues of the historical Chinese newspaper Zhongyang Ribao, hosted by banned-historical-archives.