85M samples from eight vision datasets for LLaVA-OneVision-1.5 mid-training.
LLaVA-OneVision-1.5-Mid-Training-85M consists of image and text data compiled from eight public sources for use in multimodal model training.
It is intended for researchers developing or reproducing open multimodal large language models that require large-scale mid-training stages.
Use the 85M aggregated samples to continue pre-training models like LLaVA-OneVision-1.5 on diverse image-text pairs from multiple public sources.
Measure the impact of mid-training on model performance by subsampling the 85M samples at several scales and comparing against smaller curated sets.
Leverage the combined ImageNet-21k, SA-1B, and web-scale sources to train or evaluate image-text retrieval components.
from datasets import load_dataset
ds = load_dataset("mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M")

An 85-million-sample vision dataset aggregated from ImageNet-21k, LAIONCN, DataComp-1B and other public collections to support mid-training of the LLaVA-OneVision-1.5 framework.
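For the subsampling use case, downloading all 85M samples first is usually unnecessary. The following is a minimal sketch, not the dataset authors' method: it draws a reproducible subsample from any iterable by hashing each sample's position in the stream. The helper names (`keep_index`, `subsample`, `stream_subsample`) are hypothetical, the `"train"` split name is an assumption about this repository, and the approach assumes the stream is returned in the same order on every run.

```python
import hashlib
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def keep_index(index: int, fraction: float, seed: int = 0) -> bool:
    """Deterministically decide whether stream position `index` is kept."""
    digest = hashlib.sha256(f"{seed}:{index}".encode("utf-8")).digest()
    # Compare a 64-bit integer derived from the hash against a threshold.
    # Thresholds grow with `fraction`, so a smaller subsample is always
    # contained in a larger one drawn with the same seed.
    return int.from_bytes(digest[:8], "big") < int(fraction * 2**64)


def subsample(samples: Iterable[T], fraction: float, seed: int = 0) -> Iterator[T]:
    """Lazily yield roughly `fraction` of `samples`, reproducibly."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    for index, sample in enumerate(samples):
        if keep_index(index, fraction, seed):
            yield sample


def stream_subsample(repo_id: str, fraction: float, split: str = "train", seed: int = 0):
    """Stream a Hub dataset and subsample it without a full download.

    Assumption: the repository exposes a split named `split`.
    """
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    stream = load_dataset(repo_id, split=split, streaming=True)
    return subsample(stream, fraction, seed)
```

Calling `stream_subsample("mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M", 0.01)` would then iterate over roughly 1% of the samples. Because the subsets are nested for a fixed seed, runs at 1%, 10%, and 100% differ only in how much data was added, which keeps scale comparisons cleaner.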