Is this dataset free?

Yes, it is publicly available at no cost on the Hugging Face Hub.

How do I access the dataset?

Load it directly with the Hugging Face datasets library using load_dataset('bluuebunny/arxiv_metadata_by_year').

What license applies?

Check the dataset card on Hugging Face for the specific license and usage terms.

arxiv_metadata_by_year

arXiv paper metadata, organized by publication year.

Dataset | Text & NLP | 140K downloads/month | Free

Updated 2026-06-18

What is arxiv_metadata_by_year?

arxiv_metadata_by_year supplies arXiv paper metadata structured by publication year.

Researchers and developers use it for natural language processing and text analysis on scientific literature.

What you can build with arxiv_metadata_by_year

Research Trend Analysis

Analyze shifts in topics like machine learning or NLP by aggregating paper titles, abstracts, and categories across yearly splits.
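As a minimal sketch of this kind of aggregation, the snippet below computes the fraction of each year's papers that list a given category. It assumes each record is a dict with a space-separated `categories` string; that field name and the sample rows are illustrative assumptions, not confirmed columns of this dataset, so check the dataset card before relying on them.

```python
def category_share_by_year(records_by_year, category):
    """Return {year: fraction of that year's papers listing `category`}.

    records_by_year maps a year label to an iterable of dict records.
    Assumption: each record has a space-separated "categories" string.
    """
    shares = {}
    for year, records in records_by_year.items():
        records = list(records)
        if not records:
            continue  # skip empty years rather than divide by zero
        hits = sum(
            category in (record.get("categories") or "").split()
            for record in records
        )
        shares[year] = hits / len(records)
    return shares


# Hypothetical sample rows standing in for two yearly splits.
sample = {
    "2022": [
        {"title": "Paper A", "categories": "cs.LG stat.ML"},
        {"title": "Paper B", "categories": "hep-th"},
    ],
    "2023": [
        {"title": "Paper C", "categories": "cs.CL cs.LG"},
        {"title": "Paper D", "categories": "cs.LG"},
    ],
}

shares = category_share_by_year(sample, "cs.LG")
print(shares)
```

Comparing the resulting fractions across years gives a rough view of how a topic's share of submissions changes over time.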

Author Network Construction

Extract author lists (and affiliations, where the metadata provides them) to build collaboration graphs filtered by publication year for network analysis.
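One way to sketch the graph-building step is to count, for every pair of authors, how many papers they share within a year. The example assumes each record carries an `authors` list of name strings. That is a hypothetical shape: raw arXiv metadata often stores authors as a single string that would need splitting first.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(records):
    """Count shared papers for every unordered pair of co-authors.

    Assumption: each record has an "authors" list of name strings.
    Returns a Counter keyed by (author_a, author_b) with author_a < author_b.
    """
    edges = Counter()
    for record in records:
        # Deduplicate and sort so each pair has one canonical ordering.
        authors = sorted(set(record.get("authors") or []))
        for pair in combinations(authors, 2):
            edges[pair] += 1
    return edges


# Hypothetical records standing in for one year's split.
sample_year = [
    {"authors": ["Author A", "Author B"]},
    {"authors": ["Author B", "Author A", "Author C"]},
    {"authors": ["Author C"]},
]

edges = coauthor_edges(sample_year)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The weighted pairs can then be loaded into a graph library of your choice as weighted edges for further network analysis.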

Category Classification Benchmarking

Train and evaluate NLP models on paper metadata fields such as titles and primary categories using the year-based partitions.
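Year-based partitions lend themselves to a temporal split: train on earlier years and evaluate on later ones. Before training a real model, a majority-class baseline gives a floor to beat. The sketch below assumes the first entry of a space-separated `categories` string is the primary category; both the field name and that convention are assumptions to verify against the actual data.

```python
from collections import Counter


def primary_category(record):
    """Return the first listed category, or None if there is none.

    Assumption: "categories" is a space-separated string, primary first.
    """
    categories = (record.get("categories") or "").split()
    return categories[0] if categories else None


def majority_baseline(train_records, test_records):
    """Predict the most common training label for every test record.

    Returns (majority_label, accuracy_on_test).
    """
    train_labels = [primary_category(r) for r in train_records]
    train_labels = [label for label in train_labels if label is not None]
    test_labels = [primary_category(r) for r in test_records]
    test_labels = [label for label in test_labels if label is not None]
    if not train_labels or not test_labels:
        raise ValueError("need at least one labeled record in each split")

    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(label == majority for label in test_labels)
    return majority, correct / len(test_labels)


# Hypothetical records: an earlier year for training, a later one for testing.
train = [
    {"categories": "cs.LG stat.ML"},
    {"categories": "cs.LG"},
    {"categories": "hep-th"},
]
test = [
    {"categories": "cs.LG cs.CL"},
    {"categories": "cs.CL"},
]

label, accuracy = majority_baseline(train, test)
print(label, accuracy)
```

Any classifier trained on titles or abstracts should score above this baseline on the same later-year split to be considered useful.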

Load arxiv_metadata_by_year

Python

from datasets import load_dataset

ds = load_dataset("bluuebunny/arxiv_metadata_by_year")

1. pip install datasets
2. from datasets import load_dataset
3. dataset = load_dataset('bluuebunny/arxiv_metadata_by_year')
4. Select a year split with dataset['2023'] or similar
5. Process records with pandas or export to JSON/CSV
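The steps above can be sketched roughly as follows. The helper names `load_year_split` and `records_to_csv` are hypothetical, and the assumption that splits are keyed by year strings such as "2023" comes from the step list rather than from a verified schema. The loader is defined but not called here, since it downloads data.

```python
import csv


def load_year_split(year, repo_id="bluuebunny/arxiv_metadata_by_year"):
    """Download the dataset and return one year's split.

    Assumption: splits are keyed by year strings such as "2023".
    Requires `pip install datasets` and network access.
    """
    # Imported lazily so the CSV helper below works without the library.
    from datasets import load_dataset

    dataset = load_dataset(repo_id)
    return dataset[year]


def records_to_csv(records, path, fieldnames):
    """Write the chosen fields of each dict record to a CSV file.

    Fields missing from a record are written as empty cells; fields not
    listed in `fieldnames` are ignored. Returns the number of rows written.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, extrasaction="ignore"
        )
        writer.writeheader()
        for record in records:
            writer.writerow(record)
            count += 1
    return count


# Example usage (commented out because it downloads the dataset;
# the column names are assumptions):
# split = load_year_split("2023")
# records_to_csv(split, "arxiv_2023.csv", fieldnames=["title", "categories"])
```

For pandas-based processing, a loaded split can also be converted with its `to_pandas()` method instead of going through CSV.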