Matrix-Game 2.0:
An Open-Source, Real-Time, and Streaming
Interactive World Model

Xianglong He*  Chunli Peng*  Zexiang Liu*  Boyang Wang*†  Yifan Zhang  Qi Cui
Fei Kang  Biao Jiang  Mengyin An  Yangyang Ren  Baixin Xu  Hao-Xiang Guo
Kaixiong Gong  Xuchen Song  Yang Liu  Eric Li  Yahui Zhou

Skywork AI

Technical Report GitHub 🤗HuggingFace

Abstract

Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models that capture complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, which severely limits real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) a scalable data production pipeline for Unreal Engine and GTA5 environments that produces massive amounts (∼1200 hours) of interactive video data; (2) an action injection module that enables frame-level mouse and keyboard inputs as interactions; (3) few-step distillation based on a causal architecture for real-time, streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.

Model Overview

The foundation model is derived from WanX. With the text branch removed and action modules added, the model predicts the next frames from visual content and the corresponding actions alone.
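To make "frame-level" action conditioning concrete, here is a minimal sketch of how per-frame keyboard and mouse inputs could be turned into conditioning vectors. Everything in it is an assumption made for illustration: the key vocabulary `KEYS`, the `FrameAction` type, and the `encode_action` and `encode_sequence` helpers are hypothetical and are not taken from the Matrix-Game 2.0 codebase.

```python
from dataclasses import dataclass

# Hypothetical key vocabulary; the real model's action space may differ.
KEYS = ("W", "A", "S", "D")


@dataclass(frozen=True)
class FrameAction:
    """Input for a single frame: the pressed keys plus a mouse movement."""
    keys: frozenset = frozenset()
    mouse_dx: float = 0.0
    mouse_dy: float = 0.0


def encode_action(action: FrameAction) -> list[float]:
    """Encode one frame's action as a multi-hot key vector followed by mouse deltas."""
    key_bits = [1.0 if key in action.keys else 0.0 for key in KEYS]
    return key_bits + [action.mouse_dx, action.mouse_dy]


def encode_sequence(actions: list[FrameAction]) -> list[list[float]]:
    """Produce one vector per frame so the conditioning stays frame-aligned."""
    return [encode_action(action) for action in actions]
```

Keeping one vector per frame, rather than one per clip, is what would let a control signal change from one frame to the next.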


Performance Comparison

GameWorld Score Benchmark Comparison

On the GameWorld Score benchmark in Minecraft scenes, Matrix-Game 2.0 scores higher than Oasis on six of the eight metrics and trails slightly only on motion smoothness and scenario consistency.

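| Model | Image Quality ↑ | Aesthetic Quality ↑ | Temporal Cons. ↑ | Motion Smooth. ↑ | Keyboard Acc. ↑ | Mouse Acc. ↑ | Object Cons. ↑ | Scenario Cons. ↑ |
|-------|-----------------|---------------------|------------------|------------------|-----------------|--------------|----------------|------------------|
| Oasis | 0.27 | 0.27 | 0.82 | 0.99 | 0.73 | 0.56 | 0.18 | 0.84 |
| Ours | 0.61 | 0.50 | 0.94 | 0.98 | 0.91 | 0.95 | 0.64 | 0.80 |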
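The per-metric gaps in the table are easy to misread at a glance, so the short sketch below recomputes them. The scores are copied from the table; the `compare` helper and the variable names are illustrative only and are not part of the benchmark's tooling.

```python
METRICS = [
    "Image Quality", "Aesthetic Quality", "Temporal Cons.", "Motion Smooth.",
    "Keyboard Acc.", "Mouse Acc.", "Object Cons.", "Scenario Cons.",
]
# Scores copied from the table above (higher is better for every metric).
oasis = [0.27, 0.27, 0.82, 0.99, 0.73, 0.56, 0.18, 0.84]
ours = [0.61, 0.50, 0.94, 0.98, 0.91, 0.95, 0.64, 0.80]


def compare(baseline, candidate, names):
    """Return candidate minus baseline for each metric, rounded to two decimals."""
    return {name: round(c - b, 2) for name, b, c in zip(names, baseline, candidate)}


deltas = compare(oasis, ours, METRICS)
wins = [name for name, delta in deltas.items() if delta > 0]
losses = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name:<18} {delta:+.2f}")
```

The largest gains are in object consistency (+0.46) and mouse accuracy (+0.39), while the two deficits are small (-0.01 and -0.04).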

Generation across Diverse Scene Styles

Matrix-Game 2.0 demonstrates strong generative capabilities across diverse scene styles, featuring varying visual aesthetics and terrains.

Generation across GTA Scenes

Matrix-Game 2.0 can generate precisely controlled videos in GTA scenarios while also modeling scene dynamics.

Long Video Generation

Matrix-Game 2.0 demonstrates strong auto-regressive generation capabilities for producing long videos.
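As a rough illustration of what streaming auto-regressive generation means in practice, the sketch below produces frames one at a time, each conditioned on recent history and the current action. It is a toy under stated assumptions and not the authors' implementation: the `step` callable stands in for the few-step diffusion model, and the fixed-size rolling context (`context_len`) is a hypothetical choice, since this page does not say how history is bounded. Only the 25 FPS figure comes from the text.

```python
from collections import deque
from typing import Any, Callable, Iterable, Iterator

FPS = 25  # generation speed reported for Matrix-Game 2.0


def frames_for(seconds: float, fps: int = FPS) -> int:
    """Number of frames needed to cover a clip of the given duration."""
    return int(seconds * fps)


def stream_rollout(
    first_frame: Any,
    actions: Iterable[Any],
    step: Callable[[list, Any], Any],
    context_len: int = 4,  # assumed rolling-window size, for illustration only
) -> Iterator[Any]:
    """Yield frames one by one; each depends on recent frames and the current action."""
    context = deque([first_frame], maxlen=context_len)
    for action in actions:
        frame = step(list(context), action)  # placeholder for the generative model
        context.append(frame)                # the oldest frame is dropped automatically
        yield frame
```

Because frames are yielded as soon as they are produced, a caller can display each one while supplying the next action. At 25 FPS, `frames_for(60)` gives 1500 frames for a one-minute video.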

Generation across MC Scenes

Matrix-Game 2.0 demonstrates strong generative capabilities in Minecraft scenes, adapting to diverse visual styles and terrains.

Generation across TempleRun Scenes

Matrix-Game 2.0 can also be applied to generate interactive videos in TempleRun scenes.

Acknowledgement

We would like to express our gratitude to:

We are grateful to the broader research community for their open exploration and contributions to the field of interactive world generation.