Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.
However, simply training the diffusion model on offline paired state-action data does not work well: it often produces unstable motions, with the character frequently falling.
What makes this challenging?
Difficult bipedal control – Controlling a full-body character is hard: the action space is large (69 dimensions, excluding finger control), and even small errors can throw off balance and cause falls.
Compounding error in behavior cloning – Training on offline data suffers from the classic compounding-error problem: as the state drifts out of the training distribution over time, predictions become less accurate, leading to unstable and unreliable control.
Latent action representation – Instead of directly learning the high-dimensional action distribution, we employ a compact latent action representation for ease of learning.
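To make the idea of a compact latent action space concrete, here is a minimal sketch. The paper learns this representation (e.g., with an autoencoder-style model); as a stand-in, this example uses PCA to compress 69-dimensional actions into a low-dimensional latent. The latent size of 16 and the synthetic data are illustrative assumptions, not values from the paper.

```python
import numpy as np

ACTION_DIM = 69   # full-body action dimension (excluding fingers)
LATENT_DIM = 16   # assumed latent size, purely illustrative

rng = np.random.default_rng(0)
# Fake offline action data with low intrinsic dimensionality.
basis = rng.normal(size=(LATENT_DIM, ACTION_DIM))
actions = rng.normal(size=(1000, LATENT_DIM)) @ basis

# "Encoder": project onto the top principal components of the data.
mean = actions.mean(axis=0)
_, _, vt = np.linalg.svd(actions - mean, full_matrices=False)
components = vt[:LATENT_DIM]          # (LATENT_DIM, ACTION_DIM)

def encode(a):
    """Map full actions to the compact latent space."""
    return (a - mean) @ components.T  # (..., LATENT_DIM)

def decode(z):
    """Map latents back to full 69-d actions."""
    return z @ components + mean      # (..., ACTION_DIM)

recon = decode(encode(actions))
err = np.abs(recon - actions).max()
```

The diffusion model then only has to model the 16-dimensional latent distribution instead of the full 69-dimensional action distribution, which is considerably easier to learn.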
Injecting independent noise per frame during training – Inspired by Diffusion Forcing, when training the model to predict clean output from corrupted noisy input, we corrupt the input sequence by injecting independent noise into each frame.
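The key departure from standard sequence diffusion is that each frame gets its own diffusion timestep, rather than one shared level for the whole sequence. A minimal sketch of the corruption step, assuming a cosine noise schedule (the actual schedule in the paper may differ):

```python
import numpy as np

T, D = 32, 16  # sequence length, latent action dim (illustrative sizes)
N_STEPS = 1000

rng = np.random.default_rng(0)
x0 = rng.normal(size=(T, D))          # clean latent-action sequence

# Independent noise level per frame (Diffusion Forcing), instead of a
# single timestep shared across the sequence.
t = rng.integers(0, N_STEPS, size=T)             # per-frame timestep
alpha_bar = np.cos(0.5 * np.pi * t / N_STEPS) ** 2  # in (0, 1]

noise = rng.normal(size=(T, D))
xt = np.sqrt(alpha_bar)[:, None] * x0 + np.sqrt(1 - alpha_bar)[:, None] * noise
# The model is trained to recover x0 (or the noise) from xt, conditioned
# on the per-frame timesteps t.
```

Because every combination of noise levels across frames appears in training, the model learns to denoise a clean-but-drifted history alongside a fully noisy future, which is exactly the situation it faces at rollout time.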
This enables flexible denoising settings and a stabilization trick to mitigate compounding error issues at inference time:
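One such stabilization trick can be sketched as follows: re-noise the executed history to a small diffusion timestep before denoising, so that simulator-induced drift looks like the kind of noise the model was trained to remove. The function name, the timestep value, and the cosine schedule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 8, 16  # history length, latent dim (illustrative sizes)

def stabilize_history(history, k_small=50, n_steps=1000):
    """Re-noise the executed history to a small timestep k_small, so that
    off-distribution drift from the physics simulator is absorbed as
    noise the denoiser knows how to handle (a sketch, not the exact method)."""
    alpha_bar = np.cos(0.5 * np.pi * k_small / n_steps) ** 2
    eps = rng.normal(size=history.shape)
    return np.sqrt(alpha_bar) * history + np.sqrt(1 - alpha_bar) * eps

history = rng.normal(size=(H, D))   # latents observed from the simulator
noisy_hist = stabilize_history(history)
# The model then denoises [noisy_hist, fully noisy future frames] jointly,
# producing the next actions while staying robust to accumulated drift.
```

Treating the history as slightly noisy rather than exact is what lets a single trained model absorb the domain gap between its own predictions and the states the simulator actually returns.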
UniPhys produces stable long-horizon rollouts, covering diverse skills.