Is this dataset free to use?

Yes, it is freely available via the Hugging Face datasets library.

What is the license?

Metadata follows arXiv's own terms; check the dataset card for exact usage conditions.

How do I access specific subjects?

Pass the subject code (e.g. 'cs.AI') as the configuration name when calling load_dataset.
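As a rough sketch of how subject selection might look in practice, the snippet below narrows a list of configuration names to one arXiv archive. `get_dataset_config_names` is a function in the `datasets` library; the helper names `subjects_in_archive` and `list_remote_subjects` and the hand-written sample list are illustrative assumptions, so confirm the real configuration names on the dataset card.

```python
def subjects_in_archive(config_names, archive):
    """Return the subject codes that belong to one arXiv archive, e.g. 'cs'."""
    prefix = archive + "."
    return sorted(
        name for name in config_names
        if name == archive or name.startswith(prefix)
    )


def list_remote_subjects(repo_id="permutans/arxiv-papers-by-subject"):
    """Ask the Hub which configurations exist (needs network access)."""
    # Imported lazily so the offline helper above works without the library.
    from datasets import get_dataset_config_names
    return get_dataset_config_names(repo_id)


# Offline illustration with a hand-written sample list (an assumption,
# not the dataset's actual configuration list).
sample = ["cs.AI", "cs.CL", "math.AG", "stat.ML", "cs.LG"]
cs_subjects = subjects_in_archive(sample, "cs")
print(cs_subjects)
```

With network access, `subjects_in_archive(list_remote_subjects(), "cs")` would apply the same filter to whatever configurations the repository actually exposes.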

arxiv-papers-by-subject — Free Dataset Docs, Examples & Alternatives (2026)

What is arxiv-papers-by-subject?

It is a reorganized version of the nick007x/arxiv-papers dataset with entries partitioned into directories by subject, year, and month.

It is useful for researchers needing targeted access to arXiv paper metadata without retrieving the full collection.

What you can build with arxiv-papers-by-subject

Subject-specific text generation

Train language models on subsets of arXiv abstracts filtered by subject code and date range to generate domain-specific scientific text.
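A minimal sketch of the data-preparation step for such training, assuming each record exposes `title` and `abstract` fields (the column names are an assumption; check the dataset card). The helper `to_training_text` is hypothetical and only normalizes whitespace before joining the two fields.

```python
def to_training_text(record, title_key="title", abstract_key="abstract"):
    """Turn one metadata record into a single training string.

    Returns None when the abstract is empty so the caller can skip it.
    """
    title = " ".join(str(record.get(title_key, "")).split())
    abstract = " ".join(str(record.get(abstract_key, "")).split())
    if not abstract:
        return None
    return f"Title: {title}\nAbstract: {abstract}"


# Toy records standing in for rows of one subject/month slice.
records = [
    {"title": "A  toy\n title", "abstract": "First line\nsecond line."},
    {"title": "No abstract", "abstract": ""},
]
corpus = [t for t in (to_training_text(r) for r in records) if t is not None]
print(corpus)
```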

Feature extraction pipelines

Load monthly batches of paper metadata to compute embeddings or extract keywords for downstream academic search or recommendation systems.
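One possible shape for the keyword-extraction step, shown here as plain term counting over a batch of abstracts. This is a deliberately simple stand-in, not a recommended extractor: the stopword list is a small illustrative assumption, and the sample abstracts are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on",
             "we", "is", "with", "that", "this"}


def top_keywords(abstracts, k=3):
    """Return the k most frequent non-stopword tokens across the abstracts."""
    counts = Counter()
    for text in abstracts:
        # Lowercase tokens of two or more letters, hyphens allowed inside.
        tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


# Invented abstracts standing in for one monthly batch.
batch = [
    "We study graph neural networks for molecules.",
    "Graph attention improves neural retrieval.",
    "A survey of graph methods.",
]
keywords = top_keywords(batch, k=2)
print(keywords)
```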

Temporal topic analysis

Analyze trends by loading year/month slices within a subject to track research evolution and build visualization dashboards.
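A small sketch of the slicing and counting behind such an analysis. It assumes monthly slices are named `YYYY-MM`, as in the `split='2023-01'` example on this page; `month_splits` and `monthly_term_share` are hypothetical helpers, and the abstracts are invented.

```python
def month_splits(start, end):
    """Return 'YYYY-MM' names from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    names = []
    while (year, month) <= (end_year, end_month):
        names.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def monthly_term_share(abstracts_by_month, term):
    """Fraction of abstracts per month that mention term (case-insensitive)."""
    term = term.lower()
    share = {}
    for month, abstracts in abstracts_by_month.items():
        hits = sum(term in a.lower() for a in abstracts)
        share[month] = hits / len(abstracts) if abstracts else 0.0
    return share


months = month_splits("2022-11", "2023-02")

# Invented abstracts keyed by month, standing in for loaded slices.
toy = {
    "2023-01": ["Diffusion models for images", "Graph networks"],
    "2023-02": ["Diffusion for audio"],
}
trend = monthly_term_share(toy, "diffusion")
print(months, trend)
```

The resulting per-month shares are the kind of series a dashboard could plot for one subject code.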

Load arxiv-papers-by-subject

Python

from datasets import load_dataset

ds = load_dataset("permutans/arxiv-papers-by-subject")

1. pip install datasets
2. from datasets import load_dataset
3. ds = load_dataset('permutans/arxiv-papers-by-subject', 'cs.AI', split='2023-01')
4. Filter or iterate over the returned metadata columns
5. Use abstracts or titles for model training or feature extraction
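The last two steps can be sketched as follows. `load_month` simply wraps the `load_dataset` call shown above (the `streaming` argument is a real `load_dataset` parameter, though whether streaming suits this repository is untested here), and `preview` is a hypothetical helper. The column names `title` and `abstract` are assumptions to verify against the dataset card.

```python
from itertools import islice


def load_month(subject, month,
               repo_id="permutans/arxiv-papers-by-subject", streaming=False):
    """Load one subject/month slice (needs network access)."""
    # Imported lazily so preview() can be tried without the library.
    from datasets import load_dataset
    return load_dataset(repo_id, subject, split=month, streaming=streaming)


def preview(rows, n=2, columns=None):
    """Return the first n rows, optionally keeping only selected columns."""
    out = []
    for row in islice(rows, n):
        out.append({k: row[k] for k in columns} if columns else dict(row))
    return out


# Toy rows standing in for a loaded slice; column names are assumed.
toy_rows = [
    {"title": "T1", "abstract": "A1"},
    {"title": "T2", "abstract": "A2"},
    {"title": "T3", "abstract": "A3"},
]
first = preview(toy_rows, n=2, columns=["title"])
print(first)
```

With network access, `preview(load_month("cs.AI", "2023-01"), n=2)` would show the first two real records and their actual column names.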

arxiv-papers-by-subject: pros & cons

Pros

+ 2.5M+ papers with hierarchical slicing
+ Selective subset downloads reduce bandwidth
+ Direct HF datasets integration
+ Clean metadata suited for NLP tasks

Cons

- Contains only metadata, no full-text PDFs
- Subject codes follow arXiv taxonomy only
- Requires internet for initial load

Frequently asked questions

What is arxiv-papers-by-subject?

A Hugging Face dataset of over 2.5 million arXiv paper metadata records organized by subject, year, and month.

Similar datasets

Other Text & NLP options worth comparing.

KakologArchives

Text & NLP · KakologArchives

Verified

Archive of 11 years of Nico Nico Jikkyo live commentary logs.

Dataset · ↓ 1.8M · Free

wikitext

Text & NLP · Salesforce

Verified

Over 100 million tokens from Wikipedia for language modeling benchmarks.

Dataset · ↓ 1.3M · Free

gsm8k

Text & NLP · openai

Verified

8.5K grade school math word problems requiring multi-step arithmetic reasoning.

Dataset · ↓ 901K · Free
