Is this dataset free to use?

Yes, it is publicly available via the Hugging Face Hub at no cost.

How do I access the dataset?

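Load it with the Hugging Face datasets library. The dataset is divided into subsets, so a subset name is likely required as the second argument, for example load_dataset('jat-project/jat-dataset-tokenized', 'atari-alien').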

What license applies?

License information is not provided in the dataset card; check the repository for updates.

jat-dataset-tokenized

A tokenized dataset from jat-project in the 10M to 100M row size range (31,962,087 rows in total).

Dataset | AI & Machine Learning | 511K downloads/mo | Free

Updated 2026-06-15

What is jat-dataset-tokenized?

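jat-dataset-tokenized consists of tokenized data prepared by the jat-project team. Judging by the data preview, each record holds a sequence of image observations, rewards, and discrete actions, together with an attention mask and per-step loss weights (32 steps per record in the previewed rows).

It suits AI/ML workflows that need pre-tokenized inputs at this scale, that is, tens of millions of rows.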

Data preview

A real sample from the dataset, showing 5 columns. The image_observations cells are truncated.

image_observations (List)	rewards (List)	discrete_actions (List)	attention_mask (List)	loss_weight (List)
[[[[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-	[0,0,0,0,0,0,0,10,0,0,10,0,0,0,10,10,0,0,0,0,0,0,10,0,0,10,0,0,0,10,0,0]	[8,8,8,8,8,6,6,6,7,15,11,11,11,11,11,11,14,11,14,14,14,14,14,7,7,7,12,12,12,12,12,12]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
[[[[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-	[0,10,0,10,0,0,0,0,10,0,0,10,0,10,0,0,0,0,0,0,10,0,0,0,10,0,0,0,10,0,0,0]	[12,12,17,17,17,17,17,4,7,7,7,7,7,7,2,2,2,2,2,2,9,9,9,5,5,5,5,5,9,4,9,9]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
[[[[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-	[10,0,0,0,0,10,0,0,0,10,0,0,10,0,0,0,10,0,0,10,0,0,0,10,0,0,10,0,0,0,10,0]	[9,9,2,9,4,9,9,9,9,16,16,16,16,16,16,16,5,5,5,5,9,9,5,5,5,5,9,7,7,17,17,17]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
[[[[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-	[0,0,10,0,0,0,10,0,0,10,0,10,0,0,0,0,10,0,0,10,0,0,10,0,0,0,0,10,0,0,0,0]	[17,17,17,8,8,8,14,10,3,8,8,8,5,8,8,8,8,8,8,8,8,8,8,8,16,16,16,16,11,16,5,0]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
[[[[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-	[0,0,10,0,0,0,0,0,10,0,0,0,0,10,0,0,0,10,0,0,10,0,0,0,0,10,0,10,0,0,10,0]	[0,16,6,7,7,2,2,4,4,2,0,2,7,7,7,3,4,4,6,6,6,6,6,6,6,6,6,6,2,2,7,6]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]	[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
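As a rough illustration of how the preview columns relate to each other, the sketch below sums the rewards of one record while skipping masked positions. It assumes that attention_mask marks valid (non-padded) timesteps with 1, which the preview suggests but the page does not state. The helper name masked_return is hypothetical.

```python
def masked_return(rewards, attention_mask):
    """Sum rewards over positions where attention_mask is 1.

    Assumption: a mask value of 1 marks a valid timestep and 0 marks padding.
    """
    if len(rewards) != len(attention_mask):
        raise ValueError("rewards and attention_mask must have the same length")
    return sum(r for r, m in zip(rewards, attention_mask) if m == 1)


# Values copied from the first preview row above.
rewards = [0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 10, 0, 0, 0, 10, 10,
           0, 0, 0, 0, 0, 0, 10, 0, 0, 10, 0, 0, 0, 10, 0, 0]
attention_mask = [1] * 32

window_return = masked_return(rewards, attention_mask)
print(len(rewards), window_return)  # prints: 32 70
```

In the previewed rows every mask entry is 1, so the masked sum equals the plain sum; the mask would only matter for records that contain padding.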

Dataset structure

Total rows: 31,962,087

Columns: 5 (as shown in the data preview)

Size on disk: 152 GB

Subset	Split	Rows
atari-alien	train	15,614
atari-alien	test	15,614
atari-amidar	train	15,634
atari-amidar	test	15,634
atari-assault	train	15,636
atari-assault	test	15,636
atari-asterix	train	15,861
atari-asterix	test	15,861
atari-asteroids	train	15,334
atari-asteroids	test	15,334
atari-atlantis	train	15,412
atari-atlantis	test	15,412
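The listed splits each hold roughly 15,000 rows, so a fixed sample size can exceed what a split contains. The sketch below clamps a requested sample size to the available rows before selecting. It assumes the subset names in the table are valid configuration names for load_dataset; the helper names clamp_sample_size and load_sample are hypothetical.

```python
# Row counts copied from the table above (only the subsets listed there).
LISTED_ROWS = {
    ("atari-alien", "train"): 15614,
    ("atari-alien", "test"): 15614,
    ("atari-amidar", "train"): 15634,
    ("atari-amidar", "test"): 15634,
    ("atari-assault", "train"): 15636,
    ("atari-assault", "test"): 15636,
    ("atari-asterix", "train"): 15861,
    ("atari-asterix", "test"): 15861,
    ("atari-asteroids", "train"): 15334,
    ("atari-asteroids", "test"): 15334,
    ("atari-atlantis", "train"): 15412,
    ("atari-atlantis", "test"): 15412,
}


def clamp_sample_size(subset, split, requested):
    """Return the requested size, capped at the listed row count."""
    return min(requested, LISTED_ROWS[(subset, split)])


def load_sample(subset, split="train", requested=10_000):
    """Load one subset/split and keep at most `requested` rows.

    Needs network access and the `datasets` package; it is imported here
    so that the helper above works without the library installed.
    """
    from datasets import load_dataset

    ds = load_dataset("jat-project/jat-dataset-tokenized", subset, split=split)
    # Use the real row count rather than the listed one, in case they differ.
    return ds.select(range(min(requested, ds.num_rows)))


print(clamp_sample_size("atari-asteroids", "test", 20_000))  # prints: 15334
```

Calling load_sample("atari-alien") would download that subset, so it is left uncalled here.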

What you can build with jat-dataset-tokenized

Pre-train transformer models

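Use the tokenized records to pre-train or continue training transformer sequence models without additional tokenization. The previewed records contain image observations, rewards, and discrete actions from Atari subsets rather than text, so they fit agent or decision models more naturally than language models.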
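A model trained on these records would plausibly predict the discrete_actions column. The sketch below scores such predictions while respecting loss_weight. It assumes loss_weight scales each timestep's contribution, which the column name suggests but the page does not confirm; the function name and the predicted values are invented for illustration.

```python
def weighted_action_accuracy(predicted, target, loss_weight):
    """Weighted fraction of timesteps where the predicted action matches."""
    if not (len(predicted) == len(target) == len(loss_weight)):
        raise ValueError("all three sequences must have the same length")
    total = sum(loss_weight)
    if total == 0:
        return 0.0
    matched = sum(w for p, t, w in zip(predicted, target, loss_weight) if p == t)
    return matched / total


# First eight discrete_actions values from the first preview row.
target = [8, 8, 8, 8, 8, 6, 6, 6]
# Hypothetical model output, not taken from any real model.
predicted = [8, 8, 8, 8, 6, 6, 6, 6]
loss_weight = [1] * 8

accuracy = weighted_action_accuracy(predicted, target, loss_weight)
print(accuracy)  # prints: 0.875
```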

Benchmark tokenization pipelines

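Load the dataset to measure the throughput and memory usage of data loaders at scale. Because the records are already tokenized, they can also serve as a reference point when comparing custom tokenizers.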
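One simple way to measure loader throughput is to time how fast records can be pulled from any iterable. The sketch below is a minimal version using only the standard library; measure_throughput is a hypothetical helper, and the synthetic records stand in for a real split.

```python
import time
from itertools import islice


def measure_throughput(records, max_records=None):
    """Iterate over `records` and return (count, records_per_second)."""
    start = time.perf_counter()
    count = 0
    # islice with a stop of None consumes the whole iterable.
    for _ in islice(records, max_records):
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


# Synthetic stand-in records shaped like the preview (32 steps each).
synthetic = ({"rewards": [0] * 32, "attention_mask": [1] * 32} for _ in range(5000))

count, rate = measure_throughput(synthetic, max_records=1000)
print(f"{count} records at {rate:.0f} records/s")
```

The same function accepts a loaded split or a streaming dataset in place of the synthetic generator, since it only relies on iteration.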

Build data-mixture experiments

Combine subsets such as atari-alien or atari-amidar with other datasets to study how data composition affects model performance.
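A data mixture can be sketched as weighted sampling across several record sources. The example below uses only the standard library and placeholder sources; sample_mixture and the mixing weights are hypothetical and chosen for illustration.

```python
import random


def sample_mixture(sources, weights, n, seed=0):
    """Draw up to n (source_name, record) pairs with the given mixing weights.

    Stops early if the chosen source runs out of records.
    """
    rng = random.Random(seed)
    names = list(sources)
    iterators = {name: iter(sources[name]) for name in names}
    probs = [weights[name] for name in names]
    mixed = []
    while len(mixed) < n:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            mixed.append((name, next(iterators[name])))
        except StopIteration:
            break
    return mixed


# Placeholder sources; real use would pass loaded subsets instead.
sources = {"atari-alien": range(1000), "atari-amidar": range(1000)}
weights = {"atari-alien": 0.7, "atari-amidar": 0.3}

mixed = sample_mixture(sources, weights, n=100)
share = sum(1 for name, _ in mixed if name == "atari-alien") / len(mixed)
print(len(mixed), share)
```

For Hugging Face datasets objects, the library's interleave_datasets function accepts probabilities and seed arguments and may be a more direct fit than a hand-written sampler.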

Load jat-dataset-tokenized

Python

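from datasets import load_dataset

# The dataset is organized into subsets; pass one by name (atari-alien is listed above).
ds = load_dataset("jat-project/jat-dataset-tokenized", "atari-alien")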

Shell

pip install datasets

Python

from datasets import load_dataset

# Pass a subset name; atari-alien is one of the subsets listed above.
dataset = load_dataset('jat-project/jat-dataset-tokenized', 'atari-alien')
print(dataset['train'][0])

Use dataset['train'].select(range(10000)) for quick experiments.