Yes, it is publicly available on the Hugging Face Hub at no cost.

How do I load the data?

Use the Hugging Face datasets library with the repository ID mvp-lab/LLaVA-OneVision-2-Data.

What license applies?

Check the dataset card on Hugging Face for the exact license and usage terms.

LLaVA-OneVision-2-Data

Video clips and captions for LLaVA-OneVision-2 mid-training.

Dataset | Images & Vision | ↓ 210K/mo | Free


Updated 2026-06-18

What is LLaVA-OneVision-2-Data?

LLaVA-OneVision-2-Data provides video clips stored in WebDataset tar format together with separate JSONL caption files for 30 s and 60 s segments.

It supports model developers building or fine-tuning systems for video-text-to-text, visual question answering, and image-text-to-text workloads.

What you can build with LLaVA-OneVision-2-Data

Mid-train video-text models

Use the 60-second clips and paired captions to continue pretraining multimodal models on video-language alignment.

Build short-video reasoning pipelines

Train or evaluate models on the provided 30-second and 60-second caption data for temporal reasoning tasks.

Prepare WebDataset shards for large-scale training

Load the 10,809 shards directly into training loops that expect tar-based video and JSONL metadata.
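
As a rough illustration of what a tar-based loader does with each shard, the sketch below groups tar members into samples using only the Python standard library. It assumes the usual WebDataset convention that files sharing a name up to the first dot belong to one sample. The helper names `split_key` and `iter_samples` are made up for this example, and the actual member names and extensions inside the shards are not confirmed here, so check the dataset card before relying on it.

```python
import tarfile
from typing import Any, BinaryIO, Dict, Iterator, Optional, Tuple


def split_key(name: str) -> Optional[Tuple[str, str]]:
    """Split 'dir/clip.mp4' into ('dir/clip', 'mp4'); return None if there is no extension."""
    directory, _, base = name.rpartition("/")
    stem, dot, ext = base.partition(".")  # split at the FIRST dot of the file name
    if not dot or not stem:
        return None
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def iter_samples(fileobj: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Yield one dict per sample from an open tar shard (hypothetical helper)."""
    sample: Dict[str, Any] = {}
    # "r|*" reads the archive as a forward-only stream, so nothing is unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            parsed = split_key(member.name)
            if parsed is None:
                continue
            key, ext = parsed
            # A new key means the previous sample is complete.
            if sample.get("__key__") != key:
                if sample:
                    yield sample
                sample = {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample
```

In practice the webdataset package performs this grouping itself; the sketch only shows the idea behind it.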

Load LLaVA-OneVision-2-Data

Python

from datasets import load_dataset

ds = load_dataset("mvp-lab/LLaVA-OneVision-2-Data")

pip install datasets webdataset

from datasets import load_dataset
ds = load_dataset('mvp-lab/LLaVA-OneVision-2-Data')
# Iterate over the WebDataset shards to stream video clips
# Read the accompanying JSONL files for the 30 s and 60 s captions
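
The caption-reading step could look roughly like this. JSONL holds one JSON object per line. The function name `index_captions` and the field names `id` and `caption` are placeholders, not confirmed names from the dataset, so inspect a few lines of the real caption files first.

```python
import json
from typing import Dict, Iterable


def index_captions(
    lines: Iterable[str],
    key_field: str = "id",        # hypothetical field name
    text_field: str = "caption",  # hypothetical field name
) -> Dict[str, str]:
    """Map each clip identifier to its caption, skipping blank lines."""
    captions: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # one JSON object per line
        captions[record[key_field]] = record[text_field]
    return captions
```

An open text file can be passed directly as `lines`, since iterating over it yields one line at a time.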