arxiv-cs-2020-2025-pdfs
arXiv computer science PDFs from 2020 to 2025.
What is arxiv-cs-2020-2025-pdfs?
The arxiv-cs-2020-2025-pdfs dataset consists of PDF files from arXiv computer science submissions dated 2020 through 2025.
It supports researchers and model developers working with academic literature in computer science, including AI and machine learning.
What you can build with arxiv-cs-2020-2025-pdfs
Training PDF parsing models
Developers can fine-tune layout detection or OCR models on the raw PDF files to improve extraction of equations, tables, and figures from scientific documents.
Domain-specific LLM pretraining
Use the full-text content of recent CS papers to continue pretraining language models on technical vocabulary and research writing styles.
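As a rough illustration of preparing extracted paper text for continued pretraining, the sketch below splits a document into overlapping word windows. The helper name `chunk_words` and the whitespace tokenization are assumptions made for illustration; a real pipeline would normally count tokens with the target model's tokenizer instead.

```python
from typing import Iterator


def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> Iterator[str]:
    """Yield overlapping windows of whitespace-separated words.

    Hypothetical helper: word counts only approximate model token counts.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a chunk boundary still appears whole in one of them.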
Building academic search tools
Index the papers to create semantic search or citation recommendation systems focused on 2020-2025 computer science literature.
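A minimal sketch of the indexing idea, using only the Python standard library. It ranks documents by TF-IDF cosine similarity, which is lexical matching rather than true semantic search; an embedding model would be the usual replacement for the weighting step. The class name `TfidfIndex` and the document ids in the usage note are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfIndex:
    """Tiny in-memory TF-IDF index (illustrative, not tuned for scale)."""

    def __init__(self, docs: Dict[str, str]) -> None:
        counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        doc_freq = Counter(term for c in counts.values() for term in c)
        n_docs = len(docs)
        # Smoothed inverse document frequency, always positive.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
        self.vectors = {doc_id: self._weigh(c) for doc_id, c in counts.items()}

    def _weigh(self, counts: Counter) -> Dict[str, float]:
        """Return an L2-normalised TF-IDF vector; unknown terms are dropped."""
        vec = {t: n * self.idf[t] for t, n in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        """Return up to top_k (doc_id, score) pairs with a non-zero score."""
        q = self._weigh(Counter(tokenize(query)))
        scores = {
            doc_id: sum(w * vec.get(t, 0.0) for t, w in q.items())
            for doc_id, vec in self.vectors.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(doc_id, s) for doc_id, s in ranked[:top_k] if s > 0]
```

For example, `TfidfIndex({"paper-a": "graph neural networks", "paper-b": "transformer language models"}).search("language models")` ranks `paper-b` first, and a query that shares no terms with any document returns an empty list.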
Load arxiv-cs-2020-2025-pdfs
from datasets import load_dataset
ds = load_dataset("Chelsea707/arxiv-cs-2020-2025-pdfs")
- 1. Install the datasets library via pip install datasets
- 2. Import load_dataset from the datasets package
- 3. Load with load_dataset('Chelsea707/arxiv-cs-2020-2025-pdfs')
- 4. Iterate over the dataset to access individual PDF files
- 5. Extract text using pdfplumber or PyMuPDF for downstream tasks
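The steps above can be sketched roughly as follows. The listing does not document the dataset's splits or column names, so the `train` split and the helpers `find_pdf_bytes` and `preview_dataset` are assumptions for illustration; inspect a record before relying on them. If the loader decodes PDFs into objects rather than raw bytes, the helper simply reports the available fields. PyMuPDF (`pip install pymupdf`) and `datasets` are imported inside the functions that need them.

```python
from typing import Optional


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF magic header."""
    return data[:5] == b"%PDF-"


def find_pdf_bytes(record: dict) -> Optional[bytes]:
    """Hypothetical helper: return the first field holding raw PDF bytes.

    The column layout is undocumented, so every field is scanned, including
    one level inside dict values such as {"bytes": ..., "path": ...}.
    """
    for value in record.values():
        candidates = value.values() if isinstance(value, dict) else [value]
        for item in candidates:
            if isinstance(item, (bytes, bytearray)) and looks_like_pdf(bytes(item)):
                return bytes(item)
    return None


def extract_text(pdf_bytes: bytes) -> str:
    """Extract plain text from every page with PyMuPDF."""
    import fitz  # PyMuPDF

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def preview_dataset(n: int = 3) -> None:
    """Print the first 200 characters of text for the first n records."""
    from datasets import load_dataset

    # Assumption: a "train" split exists. Streaming avoids a full download.
    ds = load_dataset(
        "Chelsea707/arxiv-cs-2020-2025-pdfs", split="train", streaming=True
    )
    for record in ds.take(n):
        pdf_bytes = find_pdf_bytes(record)
        if pdf_bytes is None:
            print("No raw PDF bytes found; fields:", list(record.keys()))
            continue
        print(extract_text(pdf_bytes)[:200])
```

Calling `preview_dataset()` needs network access and downloads only the records it reads.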
arxiv-cs-2020-2025-pdfs: pros & cons
Pros
- Full PDFs of recent arXiv CS papers
- Straightforward Hugging Face loading
- Covers computer science output from 2020 through 2025
- Ready for large-scale document AI experiments
Cons
- No size, splits, or metadata details provided
- PDFs need extra processing before text use
- License and redistribution terms unspecified
Frequently asked questions
A collection of PDF files containing computer science articles from arXiv published between 2020 and 2025.