Imagine attempting to renovate the foundation of a towering skyscraper without asking its occupants to leave or pause their work. That is exactly what MoonshotAI's Checkpoint Engine does for AI models. It allows large language models to update their brains, the weights, while still running, so there is no downtime. This breakthrough lets developers improve their AI quickly and efficiently, even on models with over a trillion parameters running on thousands of GPUs. It is fast, reliable, and designed to keep AI systems running smoothly while evolving in real time, making it a significant tool for cutting-edge AI applications. This article goes over what it is, how it works, and why it matters for the future of large-scale AI systems.
What Is Moonshot AI's Checkpoint Engine?
Moonshot AI's Checkpoint Engine is specialized middleware designed to update the weights of large language models (LLMs) in real time during inference without interrupting ongoing operations. This capability is crucial in reinforcement learning scenarios where model weights need to be updated frequently. The Checkpoint Engine currently integrates seamlessly with the vLLM inference framework and provides optimized performance through pipelining and memory management techniques. It also offers features like reusing weights from existing instances to reduce overhead in scaling scenarios.
Architecture
The core of the Checkpoint Engine is the ParameterServer class, which handles the weight-update logic and orchestrates the data flow.
- H2D (Host to Device): Moves updated weights from CPU memory or storage to GPU memory, using optimized transfer pipelines.
- Broadcast: Distributes the weights across all inference engine instances efficiently, leveraging CUDA IPC buffers for shared-memory communication.
- Reload: Each inference engine then selectively reloads the relevant weight shards from the broadcast data according to its sharding pattern.
This three-stage pipeline ensures efficient, overlapped communication and copying for speed.
When GPU memory is limited, the system can fall back to serial execution to maintain reliability.
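The three stages above can be sketched in plain Python. This is a minimal, purely illustrative model: every function name here is hypothetical, and the real ParameterServer performs CUDA host-to-device copies and shares CUDA IPC buffers rather than copying Python dicts.

```python
# Illustrative sketch of the H2D -> broadcast -> reload stages.
# All names are hypothetical; real transfers use CUDA, not dicts.

def h2d(bucket):
    # Stage 1: copy a bucket of weights from host (CPU) to device (GPU).
    return dict(bucket)  # stands in for a host-to-device memcpy

def broadcast(device_bucket, num_instances):
    # Stage 2: make the bucket visible to every inference instance
    # (the real engine shares one CUDA IPC buffer instead of copying).
    return [dict(device_bucket) for _ in range(num_instances)]

def reload_shard(rank, num_instances, bucket):
    # Stage 3: each instance keeps only the shard it owns, per its
    # sharding pattern (a hash partition stands in for real sharding).
    return {name: w for name, w in bucket.items()
            if hash(name) % num_instances == rank}

def update_weights(buckets, num_instances):
    # Drive every bucket through the three stages; in the real engine
    # these stages overlap in a pipeline rather than running serially.
    shards = [dict() for _ in range(num_instances)]
    for bucket in buckets:
        replicas = broadcast(h2d(bucket), num_instances)
        for rank, replica in enumerate(replicas):
            shards[rank].update(reload_shard(rank, num_instances, replica))
    return shards
```

The key invariant the sketch preserves is that, after an update, the shards across all instances together cover every parameter exactly once.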
Methods Used
The Checkpoint Engine uses two main methods to update model weights during inference.
- Broadcast Method: This is the fastest and the default approach, ideal when a large number of inference instances need to be updated simultaneously. It broadcasts the updated weights from CPU memory to all inference GPUs synchronously, ensuring all instances stay perfectly in sync with minimal delay.
- P2P (Peer-to-Peer) Method: This is used when inference instances are added or removed dynamically during runtime. It avoids disrupting existing inference workloads by sending weights directly from CPUs in existing instances to GPUs in new instances through a peer-to-peer transfer system, allowing smooth and flexible updates.
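The trade-off between the two methods can be captured in a tiny helper. This function is purely illustrative (the engine's actual selection is driven by its CLI flags and deployment setup, not by an API like this):

```python
# Hypothetical helper illustrating when each update method applies;
# the real engine selects the method via configuration, not this API.

def choose_update_method(instances_joining_or_leaving: bool) -> str:
    """Broadcast for synchronized fleet-wide updates; P2P when
    instances join or leave dynamically during runtime."""
    return "p2p" if instances_joining_or_leaving else "broadcast"
```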
How It Works
The Checkpoint Engine orchestrates the entire transfer process. It first gathers the necessary metadata to create a plan, including deciding the right bucket size for data transfer. It then executes the transfer, controlling the inference engine through a ZeroMQ socket to maximize performance. Data transfer is organized into pipelines with overlapped communication and copying, enabling fast and efficient weight updates even under heavy workload.
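The planning step can be illustrated with a simple bucket-packing sketch. The greedy strategy and the byte limit below are assumptions for illustration; the real planner also accounts for available GPU memory and pipeline depth.

```python
def plan_buckets(param_sizes, bucket_limit):
    """Greedily pack parameters into transfer buckets so no bucket
    exceeds bucket_limit bytes (an oversized parameter gets its own
    bucket). Returns a list of lists of parameter names."""
    buckets, current, current_size = [], [], 0
    for name, size in param_sizes.items():
        # Close the current bucket if adding this param would overflow it.
        if current and current_size + size > bucket_limit:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets
```

Larger buckets mean fewer transfers but more staging memory per step, which is exactly the knob the metadata-gathering phase tunes.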
By combining these methods and this architecture, the Checkpoint Engine enables live weight updates for LLMs across thousands of GPUs with minimal latency and service disruption.
Installation and Usage
Installation
To use the fastest broadcast implementation:
Code:
pip install checkpoint-engine
To use the flexible P2P implementation:
Code:
pip install 'checkpoint-engine[p2p]'
This will also install mooncake-transfer-engine to support RDMA transfers between different ranks.
Example Use Case
Step 1:
Prepare an H800 or H20 machine with 8 GPUs and the latest vLLM. Be sure to include the /collective_rpc API endpoint commit (available in the main branch), since checkpoint-engine uses this endpoint to update weights.
Step 2:
Install checkpoint-engine:
Code:
uv pip install 'checkpoint-engine[p2p]'
Step 3:
For this walkthrough, we'll use Qwen/Qwen3-235B-A22B-Instruct-2507 as the test model.
Code:
hf download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Step 4:
Start vLLM in dev mode with --load-format dummy, and make sure to set --worker-extension-cls=checkpoint_engine.worker.VllmColocateWorkerExtension.
Code:
VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 19730 --trust-remote-code \
    --tensor-parallel-size=8 --max-model-len 4096 --load-format dummy \
    --served-model-name checkpoint-engine-demo --model /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/ \
    --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension
Step 5:
Update the weights through checkpoint-engine. There is no need to wait for vLLM to be ready; use the command below.
Code:
torchrun --nproc-per-node 8 examples/update.py --update-method all --checkpoint-path /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/
Reusing Weights from Existing Instances
New checkpoint-engine instances can join existing instances and reuse their weights, using the steps below:
Step 1: Start the existing instance with --save-metas-file global_metas.pkl to save the global metas to a file.
Step 2: Use --sleep-time 300 to make sure it stays alive.
Code:
torchrun --nproc-per-node 8 examples/update.py --checkpoint-path $MODEL_PATH \
    --sleep-time 300 --save-metas-file global_metas.pkl
Step 3: After a checkpoint is registered, new instances can obtain a copy of the checkpoint by setting --load-metas-file global_metas.pkl.
Code:
torchrun --nproc-per-node 8 examples/update.py --load-metas-file global_metas.pkl
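Conceptually, the metas file is just serialized metadata describing where each registered weight lives, which a new instance reads to fetch the checkpoint from its peers. A minimal sketch of the save/load round-trip (the actual global_metas.pkl layout is internal to checkpoint-engine and will differ; the metadata shape here is an assumption):

```python
import os
import pickle
import tempfile

# Hypothetical metadata: parameter name -> (owner rank, tensor shape).
# The real metas format is internal to checkpoint-engine.
metas = {"layers.0.weight": (0, (4096, 4096)),
         "layers.1.weight": (1, (4096, 4096))}

path = os.path.join(tempfile.mkdtemp(), "global_metas.pkl")

# Existing instance: what --save-metas-file amounts to conceptually.
with open(path, "wb") as f:
    pickle.dump(metas, f)

# New instance: what --load-metas-file amounts to conceptually.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```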
FP8 Quantization
Currently, FP8 quantization does not work in vLLM when updating weights. Checkpoint Engine ships a simple patch in patches/vllm_fp8.patch to handle correct weight updates. Note that this patch has only been tested with DeepSeek-V3.1 and Kimi-K2, so there may be compatibility issues with other models.
Test
Run a simple correctness test for checkpoint_engine:
Code:
torchrun --nproc-per-node 8 tests/test_update.py
Benchmark
| Model | Device Setup | Metadata Gathering | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|
| GLM-4.5-Air (BF16) | 8x H800 TP8 | 0.17 seconds | 3.94 seconds (1.42 GiB) | 8.83 seconds (4.77 GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8x H800 TP8 | 0.46 seconds | 6.75 seconds (2.69 GiB) | 16.47 seconds (4.05 GiB) |
| DeepSeek-V3.1 (FP8) | 16x H20 TP16 | 1.44 seconds | 12.22 seconds (2.38 GiB) | 25.77 seconds (3.61 GiB) |
| Kimi-K2-Instruct (FP8) | 16x H20 TP16 | 1.81 seconds | 15.45 seconds (2.93 GiB) | 36.24 seconds (4.46 GiB) |
| DeepSeek-V3.1 (FP8) | 256x H20 TP16 | 1.40 seconds | 13.88 seconds (2.54 GiB) | 33.30 seconds (3.86 GiB) |
| Kimi-K2-Instruct (FP8) | 256x H20 TP16 | 1.88 seconds | 21.50 seconds (2.99 GiB) | 34.49 seconds (4.57 GiB) |
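As a rough sanity check on the table, a BF16 model stores two bytes per parameter, so the Qwen3-235B broadcast row implies an aggregate copy rate on the order of 70 GB/s across the node. This is a back-of-the-envelope estimate that ignores metadata and pipeline overheads:

```python
# Back-of-the-envelope: implied aggregate rate for the Qwen3-235B row.
params = 235e9          # roughly 235B parameters
bytes_per_param = 2     # BF16 stores 2 bytes per parameter
update_seconds = 6.75   # broadcast update time from the table

total_gb = params * bytes_per_param / 1e9   # about 470 GB of weights
rate_gb_s = total_gb / update_seconds       # about 69.6 GB/s implied
```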
Insights
Here are a few observations:
- The broadcast method generally provides the fastest update time, being optimized for synchronous weight updates across many inference instances.
- The P2P method takes longer but enables dynamic updates when instances join or leave during runtime.
- These benchmarks show the scalability of the Checkpoint Engine, which handles trillion-parameter models efficiently on clusters ranging from 8 to 256 GPUs.
Limitations of Checkpoint Engine
While Checkpoint Engine is a powerful solution for live weight updates in LLMs, it currently has some limitations.
- Works Best with vLLM for Now: The engine is mainly tested with the vLLM framework. If you're hoping to use it with other AI frameworks or custom setups, you might need extra work to get it running smoothly.
- Pipeline Still Improving: The ideal seamless pipeline that perfectly overlaps data movement isn't fully finished yet, which means there's still potential to make updates even faster.
- P2P Updates Could Be Smoother: The peer-to-peer method funnels data through a bottleneck at one main node before sharing it with others, which can slow things down when you have lots of GPUs.
- Needs Extra GPU Memory: The broadcast system uses additional GPU memory to speed things up. On machines with less memory, it falls back to a slower, less efficient serial process.
- Limited Support for FP8 Models: If you're working with newer FP8 quantized models, you'll need experimental patches, and even then compatibility isn't guaranteed beyond a few tested models.
Conclusion
Moonshot AI's Checkpoint Engine is a game-changer for updating huge AI models without stopping them. It keeps everything running smoothly, even while the model's "brain" is getting smarter in real time. While it still has a few areas to improve, the potential is huge. If you're working with large AI systems, this tool is definitely worth watching. It's helping make the future of AI faster and more efficient, without any downtime.
Frequently Asked Questions
Q1. What does the Checkpoint Engine do?
A. It lets large language models update weights in real time during inference without downtime, so AI systems stay online while improving.
Q2. Which inference frameworks does it support?
A. Right now, it is primarily integrated and tested with the vLLM inference framework.
Q3. When should I use the broadcast method versus P2P?
A. Broadcast is faster for synchronized updates across many GPUs, while P2P enables flexible updates when instances join or leave.