Which model pairs are currently supported?

Published weights exist for several Qwen2.5 and Qwen3 size combinations; the repository lists all available pairs.

Can I use my own fine-tuned models?

You can train new fusers following the provided training scripts, but pre-trained weights are only released for the listed pairs.

Is the package name really Rosetta?

Yes, the installed Python package is named rosetta to reflect the idea of translating between different model representations.

Cache-to-Cache

Direct KV-cache sharing lets LLMs exchange knowledge without generating text.

Updated 2026-06-15

What is Cache-to-Cache?

Cache-to-Cache is an open-source system that lets separate large language models exchange semantic information by operating directly on their KV caches instead of passing generated tokens. A learned projector maps one model’s cache into the representation space of another, after which the caches are fused so the receiver can continue generation with enriched context.
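As a rough illustration of the project-then-fuse idea described above, the following is a minimal toy sketch in plain Python. It assumes a single linear map as the projector and a fixed scalar gate for fusion; the function names `project` and `fuse`, the `gate` parameter, and the tiny matrices are all hypothetical and are not taken from the repository, whose learned fuser is likely more elaborate.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(cache: Matrix, weight: Matrix) -> Matrix:
    """Map each sharer cache vector (length d_s) into the receiver space (length d_r).

    `weight` has shape d_r x d_s and stands in for a learned projector (assumption).
    """
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in cache]


def fuse(receiver: Matrix, projected: Matrix, gate: float) -> Matrix:
    """Blend projected sharer vectors into the receiver cache, position by position."""
    if len(receiver) != len(projected):
        raise ValueError("caches must cover the same number of positions")
    return [
        [(1.0 - gate) * r + gate * p for r, p in zip(r_vec, p_vec)]
        for r_vec, p_vec in zip(receiver, projected)
    ]


# Toy caches: 2 positions, sharer width d_s = 3, receiver width d_r = 2.
sharer_cache = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
receiver_cache = [[0.0, 0.0], [4.0, 4.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical projector weights

projected = project(sharer_cache, weight)
fused = fuse(receiver_cache, projected, gate=0.5)
print(fused)
```

In a real system the caches are per-layer key and value tensors and the blend weights are learned; the sketch only shows the data flow from sharer space to receiver space.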

The approach works with existing model checkpoints and published fuser weights hosted on Hugging Face. Users load a base model together with a teacher model, supply an index that marks which cache segments to share, and run generation as usual. Recent updates also allow multiple sharer models to contribute to a single receiver.
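The listing does not specify the format of the index that marks shared cache segments. One plausible shape, shown here purely as an assumption, is a per-position boolean mask built from half-open `(start, end)` ranges: masked positions take the fused cache entry and all others keep the receiver's own entry. The helper names `build_share_mask` and `apply_mask` are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_share_mask(seq_len: int, segments: List[Tuple[int, int]]) -> List[bool]:
    """Return a per-position mask; True marks positions inside a shared [start, end) segment."""
    mask = [False] * seq_len
    for start, end in segments:
        if not 0 <= start <= end <= seq_len:
            raise ValueError(f"segment ({start}, {end}) is outside 0..{seq_len}")
        for pos in range(start, end):
            mask[pos] = True
    return mask


def apply_mask(receiver: Sequence[T], fused: Sequence[T], mask: Sequence[bool]) -> List[T]:
    """Pick the fused entry where the mask is True, otherwise keep the receiver entry."""
    return [f if m else r for r, f, m in zip(receiver, fused, mask)]


# Share positions 0-1 and position 4 of a 6-token context (illustrative values).
mask = build_share_mask(6, [(0, 2), (4, 5)])
selected = apply_mask(list("rrrrrr"), list("ffffff"), mask)
print(mask)
print(selected)
```

The actual index expected by the inference scripts may differ, so the repository's examples remain the reference for the real format.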

Researchers and developers who need tighter model collaboration or lower communication overhead will find the toolkit useful. It ships with inference scripts, a Gradio demo, and an interactive chat example that demonstrate cache-level communication between models such as Qwen variants.

Capabilities

enable direct semantic communication between LLMs

fuse KV-caches across models

avoid the latency of generating intermediate text between models

support efficient multi-agent collaboration

What you can build with Cache-to-Cache

Combined reasoning

Fuse caches from two models so a philosophical or multi-step question draws on the knowledge of both before any text is produced.

Low-latency multi-model chat

Run an interactive session where one model incorporates latent information from another without waiting for it to generate a full text response.

Multi-sharer knowledge transfer

Let several smaller models contribute their KV caches to a single larger receiver for improved accuracy on specialized tasks.

Install Cache-to-Cache

Install

pip install -e .

Quick start

1. Create a fresh conda environment with Python 3.10 and activate it.
2. Clone the repository and run pip install -e . inside the project folder.
3. For training or evaluation features, add the extras: pip install -e ".[training,evaluation]".
4. Download published fuser weights from Hugging Face and point the inference script at the checkpoint directory.
5. Launch the live chat example or Gradio demo script to test cache-to-cache communication between supported model pairs.

Cache-to-Cache: pros & cons

Pros

+ Measurable accuracy gains over both single models and text-based communication
+ Roughly 2x lower latency because no intermediate text is generated
+ Works with released fuser weights and supports multiple sharers
+ Includes ready-to-run inference and demo scripts

Cons

- Currently limited to specific model-pair fusers published on Hugging Face
- Multi-sharer support is still experimental
- Requires manual index construction for cache segments in custom code

Frequently asked questions

Only the lightweight projector fuser is trained; the base and teacher models stay frozen.
