Video clips and captions for LLaVA-OneVision-2 mid-training.
LLaVA-OneVision-2-Data provides video clips stored in WebDataset tar format, together with separate JSONL caption files for 30-second and 60-second segments.
It supports model developers building or fine-tuning systems for video-text-to-text, visual question answering, and image-text-to-text workloads.
Use the 60-second clips and paired captions to continue pretraining multimodal models on video-language alignment.
Train or evaluate models on the provided 30-second and 60-second caption data for temporal reasoning tasks.
Load the 10,809 shards directly into training loops that expect tar-based video and JSONL metadata.
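As a rough illustration of the tar-plus-JSONL layout described above, the sketch below reads one WebDataset-style shard and joins each sample to a caption using only the Python standard library. It relies on the general WebDataset convention that files sharing a name up to the first dot of the basename belong to one sample. The JSONL field names `video_id` and `caption` and the helper names are assumptions made for illustration, since the listing does not state the caption schema; check the actual files before relying on them.

```python
import io
import json
import posixpath
import tarfile


def split_key(name):
    """Split a tar member name into (sample key, extension).

    Follows the WebDataset convention: the key is the path up to the
    first dot of the basename, the extension is everything after it.
    """
    directory, base = posixpath.split(name)
    stem, _, ext = base.partition(".")
    return posixpath.join(directory, stem), ext


def load_captions(jsonl_text, key_field="video_id", caption_field="caption"):
    """Parse JSONL text into {key: caption}.

    The field names are hypothetical; adjust them to the real schema.
    """
    captions = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        captions[record[key_field]] = record[caption_field]
    return captions


def iter_shard_samples(tar_fileobj):
    """Yield (key, {extension: bytes}) for each sample in one shard.

    Assumes members of the same sample are stored consecutively.
    """
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as tar:
        current_key, files = None, {}
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if current_key is not None and key != current_key:
                yield current_key, files
                files = {}
            current_key = key
            files[ext] = tar.extractfile(member).read()
        if current_key is not None:
            yield current_key, files


def pair_with_captions(samples, captions):
    """Attach a caption to each sample, skipping samples without one."""
    for key, files in samples:
        caption = captions.get(key)
        if caption is None:
            continue
        yield {"key": key, "files": files, "caption": caption}
```

To use it, open a downloaded shard in binary mode, pass the file object to `iter_shard_samples`, and feed the result to `pair_with_captions` along with the parsed caption file. For training at scale, a dedicated loader such as the `webdataset` package is likely a better fit; this version only makes the pairing logic explicit.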
from datasets import load_dataset
ds = load_dataset("mvp-lab/LLaVA-OneVision-2-Data")
A collection of video clips and captions released to train the LLaVA-OneVision-2 multimodal model family.