arxiv-papers-by-subject
Reorganized arXiv metadata partitioned by subject, year, and month.
What is arxiv-papers-by-subject?
It is a reorganized version of the nick007x/arxiv-papers dataset with entries partitioned into directories by subject, year, and month.
It is useful for researchers needing targeted access to arXiv paper metadata without retrieving the full collection.
What you can build with arxiv-papers-by-subject
Subject-specific text generation
Train language models on subsets of arXiv abstracts filtered by subject code and date range to generate domain-specific scientific text.
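Selecting a date range means naming one slice per month. The sketch below enumerates month labels in the `YYYY-MM` form that appears in the loading example on this page. `month_labels` is a hypothetical helper, not part of the dataset or the `datasets` library, and whether a given label exists for a given subject is an assumption to check against the repository.

```python
def month_labels(start, end):
    """Return 'YYYY-MM' labels from start to end, inclusive.

    start and end are (year, month) tuples. Hypothetical helper for illustration.
    """
    (y, m), (end_y, end_m) = start, end
    labels = []
    while (y, m) <= (end_y, end_m):
        labels.append(f"{y:04d}-{m:02d}")
        # Roll over to January of the next year after December.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return labels


labels = month_labels((2022, 11), (2023, 2))
print(labels)  # ['2022-11', '2022-12', '2023-01', '2023-02']
```

Each label could then be requested one slice at a time, so only the months needed for training are downloaded.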
Feature extraction pipelines
Load monthly batches of paper metadata to compute embeddings or extract keywords for downstream academic search or recommendation systems.
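As a rough illustration of keyword extraction over one monthly batch, the sketch below counts frequent terms with the standard library only. The `abstract` field name, the tiny stop-word list, and the sample records are assumptions made for the example; a real pipeline would more likely use embeddings or TF-IDF.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "we", "with"}


def top_keywords(records, k=5, field="abstract"):
    """Count the most frequent non-stop-word tokens in one text field."""
    counts = Counter()
    for record in records:
        text = record.get(field) or ""  # tolerate missing or None values
        tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)


# Made-up records standing in for one monthly batch.
batch = [
    {"title": "A", "abstract": "We study graph neural networks for molecules."},
    {"title": "B", "abstract": "Graph attention improves neural retrieval."},
    {"title": "C", "abstract": None},
]
print(top_keywords(batch, k=2))
```

Because the function only needs an iterable of dict-like rows, the same code could run over each monthly slice in turn and keep memory use bounded.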
Temporal topic analysis
Analyze trends by loading year/month slices within a subject to track research evolution and build visualization dashboards.
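One simple way to track a topic over time is to measure, per month, the share of papers whose abstract mentions a term. The sketch below assumes the slices have already been loaded into a dict keyed by `YYYY-MM` label and that rows expose an `abstract` field; both the helper name and the sample data are hypothetical.

```python
def term_share_by_month(slices, term, field="abstract"):
    """Return {month: fraction of records whose field mentions term}."""
    term = term.lower()
    shares = {}
    for month in sorted(slices):  # 'YYYY-MM' labels sort chronologically
        records = slices[month]
        hits = sum(1 for r in records if term in (r.get(field) or "").lower())
        shares[month] = hits / len(records) if records else 0.0
    return shares


# Made-up records standing in for two monthly slices of one subject.
slices = {
    "2023-02": [{"abstract": "Diffusion models for audio."},
                {"abstract": "Diffusion priors."}],
    "2023-01": [{"abstract": "Diffusion models for images."},
                {"abstract": "Sparse attention."}],
}
print(term_share_by_month(slices, "diffusion"))  # {'2023-01': 0.5, '2023-02': 1.0}
```

The resulting mapping is already ordered by month, so it can be handed directly to a plotting library or dashboard.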
Load arxiv-papers-by-subject
from datasets import load_dataset
ds = load_dataset("permutans/arxiv-papers-by-subject")
1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('permutans/arxiv-papers-by-subject', 'cs.AI', split='2023-01')
4. Filter or iterate over the returned metadata columns
5. Use abstracts or titles for model training or feature extraction
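The last two steps, iterating over rows and turning titles and abstracts into usable text, might look like the sketch below. It works on any iterable of dict-like rows, which is how a loaded `datasets` split yields records when iterated. The `title` and `abstract` field names are assumptions, since the listing does not give the exact column names, and `iter_training_text` is a hypothetical helper.

```python
def iter_training_text(rows, title_field="title", abstract_field="abstract"):
    """Yield one text sample per row, skipping rows without an abstract.

    Each sample is the title and abstract joined by a blank line.
    """
    for row in rows:
        # Collapse the hard line breaks that abstracts often contain.
        abstract = " ".join((row.get(abstract_field) or "").split())
        if not abstract:
            continue
        title = " ".join((row.get(title_field) or "").split())
        yield f"{title}\n\n{abstract}" if title else abstract


# Made-up rows standing in for a loaded slice.
rows = [
    {"title": "Sparse  Attention", "abstract": "We propose\na sparse variant."},
    {"title": "No abstract here", "abstract": ""},
    {"title": None, "abstract": "Untitled note."},
]
samples = list(iter_training_text(rows))
print(len(samples))  # 2
```

Writing this as a generator keeps one row in memory at a time, which matters when a slice holds many records.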
arxiv-papers-by-subject: pros & cons
Pros
- 2.5M+ papers with hierarchical slicing
- Selective subset downloads reduce bandwidth
- Direct HF datasets integration
- Clean metadata suited for NLP tasks
Cons
- Contains only metadata, no full-text PDFs
- Subject codes follow arXiv taxonomy only
- Requires internet for initial load
Frequently asked questions
A Hugging Face dataset of over 2.5 million arXiv paper metadata records organized by subject, year, and month.