Cache-to-Cache
Direct KV-cache sharing lets LLMs exchange knowledge without generating text.
What is Cache-to-Cache?
Cache-to-Cache is an open-source system that lets separate large language models exchange semantic information by operating directly on their KV caches instead of passing generated tokens. A learned projector maps one model’s cache into the representation space of another, after which the caches are fused so the receiver can continue generation with enriched context.
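The project-then-fuse idea can be illustrated with a toy sketch. This is not the project's fuser: the linear projector, the scalar gate, and the names `project` and `fuse` are assumptions chosen for illustration, and real KV caches are per-layer, per-head tensors rather than short lists.

```python
# Toy illustration of cache projection and fusion (NOT the real C2C fuser).
# Each "cache" is a list of per-token vectors; real caches hold keys and
# values per layer and per attention head.

def project(vec, weight):
    """Map a sharer vector (length d_s) into receiver space (length d_r).

    `weight` is a d_r x d_s matrix given as a list of rows.
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse(receiver_cache, sharer_cache, weight, gate=0.5):
    """Add the gated, projected sharer entry to each receiver entry."""
    if len(receiver_cache) != len(sharer_cache):
        raise ValueError("caches must cover the same token positions")
    fused = []
    for r_vec, s_vec in zip(receiver_cache, sharer_cache):
        p_vec = project(s_vec, weight)
        fused.append([r + gate * p for r, p in zip(r_vec, p_vec)])
    return fused


# Two token positions; sharer dimension 3, receiver dimension 2.
receiver_cache = [[1.0, 0.0], [0.0, 1.0]]
sharer_cache = [[1.0, 2.0, 3.0], [0.0, 0.0, 2.0]]
weight = [[1.0, 0.0, 0.0],   # receiver dim 0 reads sharer dim 0
          [0.0, 0.0, 1.0]]   # receiver dim 1 reads sharer dim 2

fused_cache = fuse(receiver_cache, sharer_cache, weight, gate=0.5)
```

In this sketch the receiver's own entries are kept and the sharer's information arrives as a residual, so a gate of 0 leaves the receiver cache unchanged.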
The approach works with existing model checkpoints and with published fuser weights hosted on Hugging Face. Users load a base model (the receiver) together with a teacher model (the sharer), supply an index that marks which cache segments to share, and run generation as usual. Recent updates also allow multiple sharer models to contribute to a single receiver.
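The exact index format is defined by the project's inference code and is not reproduced here. As a rough sketch of what "marking cache segments" could look like, the example below assumes a hypothetical per-token format in which each position holds a `(sharer_id, receiver_id)` pair and `NO_SHARE` means the receiver keeps its own entry. The helper name `build_share_index` and its parameters are illustrative assumptions.

```python
# Hypothetical index format, for illustration only: one (sharer_id,
# receiver_id) pair per token position, with NO_SHARE meaning the receiver
# keeps its own cache entry at that position.

NO_SHARE = -1


def build_share_index(seq_len, shared_spans, sharer_id=1, receiver_id=0):
    """Return a per-position index marking which tokens receive shared cache.

    `shared_spans` is a list of half-open (start, end) token ranges.
    """
    index = [(NO_SHARE, receiver_id)] * seq_len
    for start, end in shared_spans:
        if not 0 <= start <= end <= seq_len:
            raise ValueError(f"span {(start, end)} outside 0..{seq_len}")
        for pos in range(start, end):
            index[pos] = (sharer_id, receiver_id)
    return index


# Share the cache for the first 5 tokens of a 6-token input, but not for
# the final token, from which the receiver starts generating.
index = build_share_index(seq_len=6, shared_spans=[(0, 5)])
```

Keeping the index as explicit per-position pairs makes it easy to share only part of a prompt, for example the instruction but not a trailing user turn.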
Researchers and developers who need tighter model collaboration or lower communication overhead will find the toolkit useful. It ships with inference scripts, a Gradio demo, and an interactive chat example that demonstrate cache-level communication between models such as Qwen variants.
Capabilities
What you can build with Cache-to-Cache
Combined reasoning
Fuse caches from two models so a philosophical or multi-step question receives input from both knowledge bases before any text is produced.
Low-latency multi-model chat
Run an interactive session where one model instantly incorporates latent information from another without waiting for full text responses.
Multi-sharer knowledge transfer
Let several smaller models contribute their KV caches to a single larger receiver for improved accuracy on specialized tasks.
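Multi-sharer transfer can be pictured as repeating the single-sharer step once per contributing model. The sketch below is an assumption about the general shape, not the project's implementation: each hypothetical sharer brings its own projection matrix and gate, and the contributions are accumulated onto one receiver vector. The name `fuse_many` is illustrative.

```python
# Toy multi-sharer accumulation for a single token position (illustrative
# only). Sharers may have different hidden sizes, so each one gets its own
# projection matrix into the receiver's space.

def fuse_many(receiver_vec, contributions):
    """Accumulate gated, projected sharer vectors onto the receiver vector.

    `contributions` is a list of (sharer_vec, weight, gate) tuples, where
    `weight` has one row per receiver dimension.
    """
    fused = list(receiver_vec)  # copy so the input is not mutated
    for sharer_vec, weight, gate in contributions:
        if len(weight) != len(fused):
            raise ValueError("projection must output the receiver dimension")
        for i, row in enumerate(weight):
            fused[i] += gate * sum(w * x for w, x in zip(row, sharer_vec))
    return fused


receiver_vec = [1.0, 1.0]
contributions = [
    # 2-dim sharer with an identity projection and a gate of 0.5
    ([2.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    # 3-dim sharer whose components are summed into receiver dim 0
    ([1.0, 1.0, 1.0], [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], 1.0),
]
fused_vec = fuse_many(receiver_vec, contributions)
```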
Install Cache-to-Cache
pip install -e .
1. Create a fresh conda environment with Python 3.10 and activate it.
2. Clone the repository and run pip install -e . inside the project folder.
3. For training or evaluation features, add the extras: pip install -e ".[training,evaluation]".
4. Download published fuser weights from Hugging Face and point the inference script at the checkpoint directory.
5. Launch the live chat example or Gradio demo script to test cache-to-cache communication between supported model pairs.
Cache-to-Cache: pros & cons
Pros
- Measurable accuracy gains over both single models and text-based communication
- Roughly 2x lower latency because no intermediate text is generated
- Works with released fuser weights and supports multiple sharers
- Includes ready-to-run inference and demo scripts
Cons
- Currently limited to specific model-pair fusers published on Hugging Face
- Multi-sharer support is still experimental
- Requires manual index construction for cache segments in custom code
Frequently asked questions
No, only the lightweight projector fuser is trained; the base and teacher models stay frozen.