Is the dataset free?

Yes. It is publicly available at no cost and can be loaded through the Hugging Face datasets library.

How do I access the PDFs?

Load the dataset using the Hugging Face datasets library and iterate over the returned objects to retrieve the PDF files.
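Because the listing does not document splits or column names, a cautious first step is to stream a few records and look at their structure. The sketch below assumes a split named "train" exists; the helper names first_records and describe_record are illustrative and not part of any library.

```python
from itertools import islice


def first_records(repo_id="Chelsea707/arxiv-cs-2020-2025-pdfs", split="train", limit=3):
    """Stream a few records instead of downloading every PDF.

    The split name "train" is an assumption; the listing does not document splits.
    """
    from datasets import load_dataset  # lazy import; requires `pip install datasets`

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, limit))


def describe_record(record):
    """Map each column name to the Python type name of its value."""
    return {name: type(value).__name__ for name, value in record.items()}
```

Calling describe_record(first_records(limit=1)[0]) shows which column holds the PDF payload and in what form, which determines how the files should be read afterwards.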

What license applies?

License information is not specified in the dataset description.

arxiv-cs-2020-2025-pdfs — Free Dataset Docs, Examples & Alternatives (2026)

What is arxiv-cs-2020-2025-pdfs?

The arxiv-cs-2020-2025-pdfs dataset consists of PDF files from arXiv computer science submissions dated 2020 through 2025.

It supports researchers and model developers working with academic literature in AI and machine learning.

What you can build with arxiv-cs-2020-2025-pdfs

Training PDF parsing models

Developers can fine-tune layout detection or OCR models on the raw PDF files to improve extraction of equations, tables, and figures from scientific documents.

Domain-specific LLM pretraining

Use the full-text content of recent CS papers to continue pretraining language models on technical vocabulary and research writing styles.
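Text pulled out of PDFs usually carries layout artifacts such as ligature characters, words hyphenated across line breaks, and hard-wrapped lines. A minimal cleanup sketch using only the standard library is shown here; the function name clean_page_text and the specific rules are illustrative choices, not a prescribed pipeline for this dataset.

```python
import re
import unicodedata


def clean_page_text(text):
    """Normalize text extracted from one PDF page before using it as training data."""
    # Compatibility normalization folds ligatures such as U+FB01 into "fi".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline is treated as line wrapping, not a paragraph break.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    # Keep exactly one blank line between paragraphs.
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()
```

The hyphen rule is deliberately naive: a genuinely hyphenated term that happens to break at a line end (for example "fine-" followed by "tuning") is also joined, so a dictionary check may be worth adding for production use.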

Building academic search tools

Index the papers to create semantic search or citation recommendation systems focused on 2020-2025 computer science literature.
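As a starting point for indexing, the sketch below builds a small in-memory keyword index with TF-IDF weighting and cosine similarity. This is a lexical baseline rather than semantic search; an embedding model would replace the weighting step in a real system. The class name TfidfIndex and the tokenizer are illustrative, and the input is assumed to be already-extracted plain text keyed by a paper identifier.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfIndex:
    """Tiny in-memory TF-IDF index ranked by cosine similarity."""

    def __init__(self, docs):
        # docs maps a document id to its plain text.
        self.ids = list(docs)
        counts = [Counter(tokenize(docs[doc_id])) for doc_id in self.ids]
        n_docs = len(counts)
        doc_freq = Counter(term for c in counts for term in c)
        # Smoothed IDF so terms that occur in every document keep a small weight.
        self.idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        # Build a unit-length TF-IDF vector; unknown terms are dropped.
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query, k=3):
        """Return up to k (doc_id, score) pairs, best match first."""
        q = self._weigh(Counter(tokenize(query)))
        scored = []
        for doc_id, vec in zip(self.ids, self.vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((doc_id, score))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

An index is built with TfidfIndex({"paper-id": "extracted text", ...}) and queried with index.search("graph networks"). Documents that share no terms with the query are left out of the results.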

Load arxiv-cs-2020-2025-pdfs

Python

from datasets import load_dataset

ds = load_dataset("Chelsea707/arxiv-cs-2020-2025-pdfs")

1. Install the datasets library via pip install datasets
2. Import load_dataset from the datasets package
3. Load with load_dataset('Chelsea707/arxiv-cs-2020-2025-pdfs')
4. Iterate over the dataset to access individual PDF files
5. Extract text using pdfplumber or PyMuPDF for downstream tasks
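The last two steps can be sketched as follows. The listing does not say how each PDF is stored, so payload_to_bytes handles three guessed shapes (raw bytes, a dict with "bytes" or "path" keys, or a file path); these shapes and both helper names are assumptions for illustration. Text extraction uses PyMuPDF, one of the two libraries named in the steps above.

```python
from pathlib import Path


def payload_to_bytes(value):
    """Normalize a PDF cell to raw bytes.

    The accepted shapes are guesses, not documented facts about this dataset.
    """
    if isinstance(value, (bytes, bytearray)):
        data = bytes(value)
    elif isinstance(value, dict) and value.get("bytes"):
        data = bytes(value["bytes"])
    elif isinstance(value, dict) and value.get("path"):
        data = Path(value["path"]).read_bytes()
    elif isinstance(value, (str, Path)):
        data = Path(value).read_bytes()
    else:
        raise TypeError(f"unsupported PDF payload: {type(value).__name__}")
    # Cheap sanity check: PDF files normally begin with the "%PDF-" marker.
    if not data.startswith(b"%PDF-"):
        raise ValueError("payload does not start with a PDF header")
    return data


def pdf_text(data):
    """Extract plain text from PDF bytes, one string per page, using PyMuPDF."""
    import fitz  # lazy import; requires `pip install pymupdf`

    with fitz.open(stream=data, filetype="pdf") as doc:
        return [page.get_text() for page in doc]
```

In a loop over the dataset, each record's PDF column (its name has to be checked first, since it is not documented) would be passed through payload_to_bytes and then pdf_text. Scanned papers without a text layer return empty strings and would need OCR instead.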

arxiv-cs-2020-2025-pdfs: pros & cons

Pros

+ Full PDFs of recent arXiv CS papers
+ Straightforward Hugging Face loading
+ Covers computer science submissions from 2020 through 2025
+ Suited to large-scale document AI experiments

Cons

– No size, splits, or metadata details provided
– PDFs need extra processing before text use
– License and redistribution terms unspecified


Frequently asked questions

What is this dataset?

A collection of PDF files of computer science articles submitted to arXiv between 2020 and 2025.


Similar datasets

Other AI & machine learning options worth comparing.

FineNews

AI & Machine Learning · ksolovev

Verified

News dataset for AI and machine learning workflows.

Dataset · ↓ 1.5M · Free

hd_tmp

AI & Machine Learning · ayuo

Verified

Temporary AI/ML dataset for Hugging Face prototyping.

Dataset · ↓ 1.5M · Free

results

AI & Machine Learning · mteb

Verified

MTEB benchmark results for text embedding model evaluations.

Dataset · ↓ 1.3M · Free
