Cache-to-Cache

Direct KV-cache sharing lets LLMs exchange knowledge without generating text.

Autonomous Agents · Research 405 · Open source
Updated 2026-06-15

What is Cache-to-Cache?

Cache-to-Cache is an open-source system that lets separate large language models exchange semantic information by operating directly on their KV caches instead of passing generated tokens. A learned projector maps one model’s cache into the representation space of another, after which the caches are fused so the receiver can continue generation with enriched context.
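As a rough illustration of the project-then-fuse idea, the toy sketch below uses plain Python lists in place of real KV tensors. The function names, the single linear projection, and the scalar `gate` are assumptions made for this example; they are not the project's API and do not describe its actual fuser architecture.

```python
import random


def project_cache(sharer_cache, weight):
    """Map each sharer vector (length d_sharer) into the receiver's width.

    weight is a d_sharer x d_receiver matrix stored as nested lists.
    """
    d_receiver = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) for j in range(d_receiver)]
        for row in sharer_cache
    ]


def fuse_caches(receiver_cache, projected_cache, gate):
    """Add the projected sharer cache to the receiver cache, scaled by gate."""
    return [
        [r + gate * p for r, p in zip(r_row, p_row)]
        for r_row, p_row in zip(receiver_cache, projected_cache)
    ]


def random_matrix(rows, cols):
    """Build a rows x cols matrix of Gaussian noise as stand-in data."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
SEQ_LEN, D_SHARER, D_RECEIVER = 4, 6, 3

sharer_cache = random_matrix(SEQ_LEN, D_SHARER)      # one vector per token
receiver_cache = random_matrix(SEQ_LEN, D_RECEIVER)
weight = random_matrix(D_SHARER, D_RECEIVER)         # stands in for learned weights

projected = project_cache(sharer_cache, weight)
fused = fuse_caches(receiver_cache, projected, gate=0.5)
print(len(fused), len(fused[0]))  # 4 3
```

The fused cache keeps the receiver's shape, which is what lets the receiver continue decoding from it without any change to its own weights.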

The approach works with existing model checkpoints and published fuser weights hosted on Hugging Face. Users load a base model together with a teacher model, supply an index that marks which cache segments to share, and run generation as usual. Recent updates also allow multiple sharer models to contribute to a single receiver.
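One way to picture the sharing index is as a per-position flag that says whether a cache entry should come from the fused cache or stay as the receiver's own. The sketch below assumes that representation; `build_share_index`, `apply_share_index`, and the (start, end) span format are hypothetical helpers for illustration, and the index format the toolkit actually expects may differ.

```python
def build_share_index(seq_len, shared_spans):
    """Return a per-position list that is True where sharing is enabled.

    shared_spans is a list of (start, end) pairs with end exclusive.
    """
    index = [False] * seq_len
    for start, end in shared_spans:
        if not 0 <= start <= end <= seq_len:
            raise ValueError(f"span ({start}, {end}) is outside 0..{seq_len}")
        for pos in range(start, end):
            index[pos] = True
    return index


def apply_share_index(receiver_entries, fused_entries, index):
    """Take the fused entry where index is True, else keep the receiver's own."""
    return [
        fused if share else own
        for own, fused, share in zip(receiver_entries, fused_entries, index)
    ]


# Strings stand in for per-token cache entries so the selection is easy to read.
receiver_entries = ["r0", "r1", "r2", "r3", "r4", "r5"]
fused_entries = ["f0", "f1", "f2", "f3", "f4", "f5"]

# Share the first four positions (for example, the prompt) and keep the rest.
index = build_share_index(len(receiver_entries), [(0, 4)])
mixed = apply_share_index(receiver_entries, fused_entries, index)
print(mixed)  # ['f0', 'f1', 'f2', 'f3', 'r4', 'r5']
```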

Researchers and developers who need tighter model collaboration or lower communication overhead will find the toolkit useful. It ships with inference scripts, a Gradio demo, and an interactive chat example that demonstrate cache-level communication between models such as Qwen variants.

Capabilities

enable direct semantic communication between LLMs
fuse KV-caches across models
avoid the latency of generating intermediate text token by token
support efficient multi-agent collaboration

What you can build with Cache-to-Cache

Combined reasoning

Fuse caches from two models so a philosophical or multi-step question receives input from both knowledge bases before any text is produced.

Low-latency multi-model chat

Run an interactive session where one model instantly incorporates latent information from another without waiting for full text responses.

Multi-sharer knowledge transfer

Let several smaller models contribute their KV caches to a single larger receiver for improved accuracy on specialized tasks.
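A minimal sketch of the multi-sharer case, assuming each sharer's cache has already been projected into the receiver's space and that contributions are combined by a simple gated sum. `fuse_many` and the per-sharer `gates` are hypothetical names for this example; the project's experimental multi-sharer fusion may work differently.

```python
def fuse_many(receiver_cache, projected_caches, gates):
    """Add each projected sharer cache to a copy of the receiver cache in turn."""
    if len(projected_caches) != len(gates):
        raise ValueError("need exactly one gate per sharer")
    fused = [list(row) for row in receiver_cache]  # leave the input untouched
    for cache, gate in zip(projected_caches, gates):
        for t, row in enumerate(cache):
            for j, value in enumerate(row):
                fused[t][j] += gate * value
    return fused


receiver = [[1.0, 1.0], [2.0, 2.0]]
sharer_a = [[1.0, 0.0], [0.0, 1.0]]  # already in the receiver's space
sharer_b = [[2.0, 2.0], [4.0, 4.0]]

fused = fuse_many(receiver, [sharer_a, sharer_b], gates=[1.0, 0.5])
print(fused)  # [[3.0, 2.0], [4.0, 5.0]]
```

Giving each sharer its own gate lets a weaker or less relevant sharer contribute less without being dropped entirely.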

Install Cache-to-Cache

Install
pip install -e .
Quick start
  1. Create a fresh conda environment with Python 3.10 and activate it.
  2. Clone the repository and run pip install -e . inside the project folder.
  3. For training or evaluation features, add the extras: pip install -e ".[training,evaluation]".
  4. Download published fuser weights from Hugging Face and point the inference script at the checkpoint directory.
  5. Launch the live chat example or Gradio demo script to test cache-to-cache communication between supported model pairs.

Cache-to-Cache: pros & cons

Pros

  • Measurable accuracy gains over both single models and text-based communication
  • Roughly 2x lower latency because no intermediate text is generated
  • Works with released fuser weights and supports multiple sharers
  • Includes ready-to-run inference and demo scripts

Cons

  • Currently limited to specific model-pair fusers published on Hugging Face
  • Multi-sharer support is still experimental
  • Requires manual index construction for cache segments in custom code

Frequently asked questions

Do the base models need to be retrained?

No, only the lightweight projector fuser is trained; the base and teacher models stay frozen.
