ACL-OCL
Full-text ACL Anthology papers with Grobid extractions, PDFs, and metadata.
What is ACL-OCL?
ACL-OCL augments the ACL Anthology corpus with PDF files, full text, references, and additional fields obtained by processing the original PDFs with Grobid.
It supports NLP research on scholarly documents, citation analysis, and information extraction from scientific publications.
What you can build with ACL-OCL
Train domain-specific language models
Use the full-text extractions to fine-tune BERT-style models on computational linguistics papers for tasks like scientific entity recognition.
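Before fine-tuning, long paper texts usually have to be cut into passages that fit a model's input limit. The sketch below shows one simple way to do that with overlapping word windows. The helper name `chunk_text` and its default sizes are illustrative assumptions, and the text is assumed to come from whichever column holds the full-text extraction.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (hypothetical helper).

    max_words and overlap are illustrative defaults, not values
    recommended by the dataset authors.
    """
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")

    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Word windows are only a rough proxy for token counts; a real pipeline would typically measure length with the tokenizer of the model being fine-tuned.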
Build citation and reference graphs
Leverage Grobid-extracted references and metadata to construct citation networks for analyzing research trends in NLP.
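As a minimal sketch of the graph-building idea, the code below turns a list of paper records into an adjacency map and counts incoming citations. The field names `paper_id` and `references`, and the assumption that references are already resolved to paper identifiers, are hypothetical; the actual Grobid reference fields may need matching against titles or other metadata first.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set


def build_citation_graph(
    papers: Iterable[Mapping],
    id_field: str = "paper_id",      # assumed field name
    refs_field: str = "references",  # assumed field name
) -> Dict[str, Set[str]]:
    """Map each paper ID to the set of paper IDs it cites."""
    graph: Dict[str, Set[str]] = {}
    for paper in papers:
        source = paper[id_field]
        cited = graph.setdefault(source, set())
        for target in paper.get(refs_field) or []:
            if target != source:  # ignore self-citations
                cited.add(target)
    return graph


def citation_counts(graph: Mapping[str, Set[str]]) -> Counter:
    """Count how many papers in the graph cite each paper."""
    counts: Counter = Counter()
    for targets in graph.values():
        counts.update(targets)
    return counts
```

Sorting the result of `citation_counts` by value gives a quick view of the most-cited papers within whatever subset was loaded.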
Develop PDF parsing benchmarks
Compare custom PDF-to-text pipelines against the provided Grobid outputs on the 80k ACL articles.
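One simple way to score such a comparison is token-level F1 between a custom pipeline's output and the Grobid text for the same paper. This is a sketch under assumptions: the function names are illustrative, the metric is a bag-of-words measure chosen for brevity, and because Grobid output can itself contain parsing errors the score reflects agreement with Grobid rather than correctness.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_f1(candidate: str, reference: str) -> float:
    """Bag-of-words F1 between a candidate extraction and a reference."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Multiset intersection: tokens shared by both texts, with counts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A bag-of-words score ignores word order, so it will not catch problems such as scrambled columns; an order-aware measure would be needed for that.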
Load ACL-OCL
from datasets import load_dataset
ds = load_dataset("WINGNUS/ACL-OCL")
1. pip install datasets
2. from datasets import load_dataset
3. dataset = load_dataset('WINGNUS/ACL-OCL')
4. Access the 'train' split for the full corpus with PDFs and Grobid fields
5. Filter by year or venue metadata for targeted subsets
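The filtering step can be sketched in plain Python as follows. The column names `year` and `venue` are assumptions rather than confirmed parts of the dataset schema, so it is worth checking `dataset['train'].column_names` first and adjusting the field arguments accordingly.

```python
from typing import Iterable, List, Mapping, Optional, Union


def matches(
    record: Mapping,
    year: Optional[Union[int, str]] = None,
    venue: Optional[str] = None,
    year_field: str = "year",    # assumed column name
    venue_field: str = "venue",  # assumed column name
) -> bool:
    """Return True if a record satisfies the given year/venue criteria."""
    # Compare years as strings so that 2020 and "2020" are treated alike.
    if year is not None and str(record.get(year_field, "")).strip() != str(year):
        return False
    if venue is not None:
        value = str(record.get(venue_field) or "")
        # Case-insensitive substring match on the venue string.
        if venue.lower() not in value.lower():
            return False
    return True


def filter_records(records: Iterable[Mapping], **criteria) -> List[Mapping]:
    """Keep only the records that match all given criteria."""
    return [r for r in records if matches(r, **criteria)]
```

With the dataset loaded, the same predicate could be passed to the library's own filter method, for example `dataset['train'].filter(lambda r: matches(r, year=2020))`, assuming those columns exist.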
ACL-OCL: pros & cons
Pros
- Includes full PDFs and Grobid extractions beyond abstracts
- Large scale: 80k ACL articles as of 2022
- Ready to use with the Hugging Face datasets library
- Provides references and structured metadata
Cons
- Grobid extractions can contain parsing errors
- Dataset size requires significant storage for PDFs
- Updates depend on external ACL Anthology releases
Frequently asked questions
A Hugging Face dataset providing full-text, PDFs, and Grobid extractions for the ACL Anthology collection of 80k papers.