Is ACL-OCL free to use?

Yes. The dataset is hosted publicly on Hugging Face and can be loaded with the datasets library at no cost.

What license applies to ACL-OCL?

Usage follows ACL Anthology terms; check the dataset page for specific redistribution rules.

How do I access the PDFs?

Load the dataset with load_dataset from the datasets library, then read the pdf field of each example once the download finishes.

ACL-OCL — Free Dataset Docs, Examples & Alternatives (2026)

What is ACL-OCL?

ACL-OCL supplies the ACL Anthology corpus with PDF files, full text, references, and additional fields obtained through Grobid processing of the original PDFs.

It supports NLP research on scholarly documents, citation analysis, and information extraction from scientific publications.

What you can build with ACL-OCL

Train domain-specific language models

Use the full-text extractions to fine-tune BERT-style models on computational linguistics papers for tasks like scientific entity recognition.
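Full papers are usually much longer than the input limit of a BERT-style encoder, so the text generally has to be split into windows before fine-tuning. The sketch below shows one simple way to do that with overlapping word windows. The helper name chunk_words and the default window sizes are illustrative assumptions, and a real pipeline would likely count tokenizer tokens rather than whitespace-separated words.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Hypothetical helper: max_words and overlap are illustrative defaults,
    not values taken from ACL-OCL or from any particular model.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Toy input standing in for the full text of one paper.
sample = " ".join(str(i) for i in range(10))
windows = chunk_words(sample, max_words=4, overlap=2)
```

The overlap keeps sentences that straddle a window boundary visible in at least one window, at the cost of some duplicated text.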

Build citation and reference graphs

Leverage Grobid-extracted references and metadata to construct citation networks for analyzing research trends in NLP.
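As a rough illustration of the graph-building step, the sketch below turns a list of paper records into an adjacency map and counts in-corpus citations. The record keys "id" and "references" are assumptions made for the example, not confirmed column names, and the references are assumed to be already resolved to paper ids. Grobid references are bibliographic entries, so a matching step would be needed first.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def build_citation_graph(papers: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each paper id to the set of in-corpus paper ids it cites.

    Assumes hypothetical "id" and "references" keys, with references
    already resolved to paper ids.
    """
    records = list(papers)
    known = {p["id"] for p in records}
    graph: Dict[str, Set[str]] = {}
    for p in records:
        cited = graph.setdefault(p["id"], set())  # keep papers that cite nothing
        for ref in p.get("references", []):
            if ref in known and ref != p["id"]:  # drop out-of-corpus and self links
                cited.add(ref)
    return graph


def most_cited(graph: Dict[str, Set[str]], k: int = 3) -> List[Tuple[str, int]]:
    """Return the k papers with the highest in-corpus citation counts."""
    counts = Counter(ref for cited in graph.values() for ref in cited)
    return counts.most_common(k)


toy = [
    {"id": "P1", "references": []},
    {"id": "P2", "references": ["P1"]},
    {"id": "P3", "references": ["P1", "P2", "X9"]},  # X9 is outside the corpus
]
graph = build_citation_graph(toy)
top = most_cited(graph, k=2)  # [("P1", 2), ("P2", 1)]
```

Grouping the same counts by publication year would be one way to approach the trend analysis mentioned above.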

Develop PDF parsing benchmarks

Compare custom PDF-to-text pipelines against the provided Grobid outputs on the 80k ACL articles.
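One simple way to score such a comparison is bag-of-words F1 between the text from a custom pipeline and the Grobid text for the same paper. The sketch below assumes both texts are already available as plain strings; the function name token_f1 is illustrative. Because Grobid output can itself contain parsing errors, the score measures agreement with a baseline rather than accuracy against ground truth.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Bag-of-words F1 between a candidate extraction and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings standing in for one paper's extracted text.
custom_text = "we propose a new parser"
grobid_text = "We propose a new parser for PDFs"
score = token_f1(custom_text, grobid_text)
```

This metric ignores word order, so it will not catch reading-order mistakes such as interleaved columns; a sequence-based measure would be needed for that.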

Load ACL-OCL

Python

from datasets import load_dataset

ds = load_dataset("WINGNUS/ACL-OCL")

1. Install the library: pip install datasets
2. Import the loader: from datasets import load_dataset
3. Load the dataset: dataset = load_dataset('WINGNUS/ACL-OCL')
4. Access the 'train' split for the full corpus with pdfs and grobid fields
5. Filter by year or venue metadata for targeted subsets
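Filtering by year or venue could be sketched as a reusable predicate. The column names "year" and "venue" are assumptions made for illustration and should be checked against the dataset's actual columns. The example runs on plain dictionaries so it does not need the dataset to be downloaded.

```python
from typing import Callable, Collection, Optional


def make_subset_filter(
    years: Optional[Collection[int]] = None,
    venues: Optional[Collection[str]] = None,
    year_key: str = "year",    # hypothetical column name
    venue_key: str = "venue",  # hypothetical column name
) -> Callable[[dict], bool]:
    """Build a predicate that keeps rows matching the given years and venues."""
    def keep(row: dict) -> bool:
        if years is not None and int(row[year_key]) not in years:
            return False
        if venues is not None and row[venue_key] not in venues:
            return False
        return True
    return keep


# Toy rows standing in for dataset examples.
rows = [
    {"year": 2019, "venue": "ACL"},
    {"year": "2021", "venue": "EMNLP"},  # a year stored as a string is converted
    {"year": 2021, "venue": "ACL"},
]
keep = make_subset_filter(years={2021}, venues={"ACL"})
subset = [r for r in rows if keep(r)]
```

With the datasets library, the same predicate can be passed to the filter method of a split, for example dataset['train'].filter(keep), after confirming the real column names via dataset['train'].column_names.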

ACL-OCL: pros & cons

Pros

+ Includes full PDFs and Grobid extractions beyond abstracts
+ Large scale: 80k ACL articles as of 2022
+ Ready to use with the Hugging Face datasets library
+ Provides references and structured metadata

Cons

- Grobid extractions can contain parsing errors
- Dataset size requires significant storage for PDFs
- Updates depend on external ACL Anthology releases

Frequently asked questions

What is ACL-OCL?

A Hugging Face dataset providing full text, PDFs, and Grobid extractions for the ACL Anthology collection of 80k papers.

Similar datasets

Other Text & NLP datasets worth comparing.

KakologArchives

Text & NLP · KakologArchives

Verified

Archive of 11 years of Nico Nico Jikkyo live commentary logs.

Dataset · 1.8M downloads · Free

wikitext

Text & NLP · Salesforce

Verified

Over 100 million tokens from Wikipedia for language modeling benchmarks.

Dataset · 1.3M downloads · Free

gsm8k

Text & NLP · openai

Verified

8.5K grade school math word problems requiring multi-step arithmetic reasoning.

Dataset · 901K downloads · Free
