UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

ETH Zurich     Carnegie Mellon University


UniPhys is a diffusion-based unified planner and text-driven controller for physics-based character control. A single model generalizes across diverse tasks, from short-term reactive control to long-term planning, without requiring task-specific training.




Abstract

Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.


Method

Key Idea

We're building a diffusion-based model that learns both kinematic motion (state) and physical control actions (action), acting as a unified planner and text-driven controller.
By directly modeling the text-conditioned action distribution, we can control a character end-to-end with text instructions.
Plus, since we’re learning the full state-action distribution, we can predict future motions and apply flexible guidance during denoising to steer actions toward a desired goal.
This approach works for any task as long as we can design a differentiable guidance loss over the state-action space.
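
As an illustration, here is a minimal sketch of guidance during denoising, assuming a PyTorch denoiser `model(x, t, text_emb)` that predicts the clean state-action sequence and a user-supplied differentiable `guidance_loss`; the interface, schedule, and scale are hypothetical stand-ins, not the released implementation.

```python
import torch

def guided_sample(model, text_emb, seq_len, dim, num_steps=50,
                  guidance_loss=None, guide_scale=1.0, device="cpu"):
    """Denoise a state-action sequence while steering it toward a task goal.

    model(x, t, text_emb) -> predicted clean sequence (hypothetical interface).
    guidance_loss(x0)     -> scalar task cost, e.g. distance of the final
                             root position to a target location.
    """
    x = torch.randn(1, seq_len, dim, device=device)  # start from pure noise
    for step in reversed(range(num_steps)):
        t = torch.full((1,), step, device=device, dtype=torch.long)
        with torch.no_grad():
            x0 = model(x, t, text_emb)  # predicted clean state-action sequence

        if guidance_loss is not None:
            # Differentiate the task cost w.r.t. the noisy sample and nudge
            # the prediction toward lower cost (classifier-guidance style).
            x_in = x.detach().requires_grad_(True)
            cost = guidance_loss(model(x_in, t, text_emb))
            grad = torch.autograd.grad(cost, x_in)[0]
            x0 = x0 - guide_scale * grad

        # Toy renoising to the next (lower) noise level; a real sampler
        # (e.g. DDIM) would use the proper posterior here.
        x = x0 + (step / num_steps) * torch.randn_like(x0)
    return x0
```

Because the loss is defined over predicted states and actions, swapping in a new differentiable cost (goal reaching, velocity tracking, obstacle avoidance) requires no retraining.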

Key Challenges

However, simply training the diffusion model with offline paired state-action data doesn't work well: it often leads to unstable motions, with the character frequently falling.

What makes this challenging?

Difficult bipedal control – Controlling a full-body character is hard. The action space is large (69 dimensions, excluding finger control), and even small errors can throw off balance, causing falls.

Compounding error in behavior cloning – Training on offline data suffers from the classic compounding error problem. As the state drifts beyond the training distribution over time, predictions become less accurate, leading to unstable and unreliable control.

Our Solutions

Latent action representation – Instead of directly learning the high-dimensional action distribution, we employ a compact latent action representation that is easier to learn (a minimal sketch follows this list).

Injecting independent noise per frame during training – Inspired by Diffusion Forcing, we corrupt the input sequence at training time by injecting independently sampled noise into each frame before the model predicts the clean output.
This enables flexible denoising settings and a stabilization trick that mitigates compounding errors at inference time (see the second sketch after this list):

  • Flexible denoising settings – Because the model has seen many random noise configurations during training, we can adjust noise levels freely at inference time. This allows denoising schedules beyond the standard uniform schedule over a fixed-length sequence: we can design custom schedules and even condition on varying lengths of past motion to predict variable-length future motions, simply by setting different per-frame noise levels.
  • Stabilization method – When generating future motions autoregressively, predictions are conditioned on previously predicted frames. However, these frames may slightly deviate from the training distribution. To prevent the model from treating them as perfect ground truth, we introduce a small amount of noise to indicate potential uncertainty. This helps mitigate compounding errors in long-horizon rollouts, leading to more stable motion generation.
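
For the latent action representation, one plausible realization is a small action autoencoder: the diffusion model predicts compact latents, and a decoder maps them back to full-dimensional actions for the simulator. Layer sizes and `latent_dim` below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionAutoencoder(nn.Module):
    """Compress 69-D control actions into a compact latent space.

    The diffusion model operates on the latent `z`; `decode` recovers
    simulator actions. Dimensions are illustrative, not the paper's choices.
    """
    def __init__(self, action_dim=69, latent_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def encode(self, action):
        return self.encoder(action)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, action):
        return self.decode(self.encode(action))
```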
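And here is a minimal sketch of the per-frame noise injection and the inference-time stabilization trick, assuming a simple additive-noise corruption with a linear schedule and a denoiser that accepts per-frame noise levels; all of these are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_forcing_loss(model, clean_seq, num_levels=50):
    """Training: every frame gets an independently sampled noise level, so the
    model learns to denoise arbitrary per-frame noise configurations."""
    B, T, D = clean_seq.shape
    k = torch.randint(0, num_levels, (B, T), device=clean_seq.device)  # per-frame levels
    sigma = (k.float() / num_levels).unsqueeze(-1)                     # toy linear schedule
    noisy = clean_seq + sigma * torch.randn_like(clean_seq)
    return F.mse_loss(model(noisy, k), clean_seq)

def noised_context(past_frames, num_levels=50, context_level=2):
    """Inference-time stabilization: lightly renoise previously generated frames
    so the model does not treat its own predictions as perfect ground truth."""
    sigma = context_level / num_levels
    noisy = past_frames + sigma * torch.randn_like(past_frames)
    levels = torch.full(past_frames.shape[:2], context_level,
                        dtype=torch.long, device=past_frames.device)
    return noisy, levels  # feed these alongside fully-noisy future frames
```

Variable-length prediction then comes for free: assigning low noise levels to however many past frames are available and high levels to the frames being generated realizes the flexible schedules described above.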

Method Overview.


Results


Long-horizon random rollouts

UniPhys produces stable long-horizon rollouts, covering diverse skills.

Text-driven control

Multi-modal text-driven atomic skills



Interactive text-driven control

Goal reaching

Velocity control

Dynamic object avoidance