UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

ETH Zurich     Carnegie Mellon University


UniPhys is a diffusion-based unified planner and text-driven controller for physics-based character control. A single model generalizes across diverse tasks, from short-term reactive control to long-term planning, without requiring task-specific training.




Abstract

Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.


Method

Key Idea

We're building a diffusion-based model that learns both kinematic motion (state) and physical control actions (action), acting as a unified planner and text-driven controller.
By directly modeling the text-conditioned action distribution, we can control a character end-to-end with text instructions.
Plus, since we’re learning the full state-action distribution, we can predict future motions and apply flexible guidance during denoising to steer actions toward a desired goal.
This approach works for any task as long as we can design a differentiable guidance loss over the state-action space.
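
As an illustration, here is a minimal sketch of guidance during denoising, assuming a PyTorch denoiser `model(x, t, text_emb)` that predicts the clean state-action sequence and a user-supplied differentiable `guidance_loss`; the interface, schedule, and scale are hypothetical stand-ins, not the released implementation.

```python
import torch

def guided_sample(model, text_emb, seq_len, dim, num_steps=50,
                  guidance_loss=None, guide_scale=1.0, device="cpu"):
    """Denoise a state-action sequence while steering it toward a task goal.

    model(x, t, text_emb) -> predicted clean sequence (hypothetical interface).
    guidance_loss(x0)     -> scalar task cost, e.g. distance of the final
                             root position to a target location.
    """
    x = torch.randn(1, seq_len, dim, device=device)  # start from pure noise
    for step in reversed(range(num_steps)):
        t = torch.full((1,), step, device=device, dtype=torch.long)
        with torch.no_grad():
            x0 = model(x, t, text_emb)  # predicted clean state-action sequence

        if guidance_loss is not None:
            # Differentiate the task cost w.r.t. the noisy sample and nudge
            # the prediction toward lower cost (classifier-guidance style).
            x_in = x.detach().requires_grad_(True)
            cost = guidance_loss(model(x_in, t, text_emb))
            grad = torch.autograd.grad(cost, x_in)[0]
            x0 = x0 - guide_scale * grad

        # Toy renoising to the next (lower) noise level; a real sampler
        # (e.g. DDIM) would use the proper posterior here.
        x = x0 + (step / num_steps) * torch.randn_like(x0)
    return x0
```

Because the loss is defined over predicted states and actions, swapping in a new differentiable cost (goal reaching, velocity tracking, obstacle avoidance) requires no retraining.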

Key Challenges

However, simply training the diffusion model with offline paired state-action data doesn't work well: it often leads to unstable motions, with the character frequently falling.

What makes this challenging?

Difficult bipedal control – Controlling a full-body character is hard. The action space is large (69 dimensions, excluding finger control), and even small errors can throw off balance, causing falls.

Compounding error in behavior cloning – Training on offline data suffers from the classic compounding error problem. As the state drifts beyond the training distribution over time, predictions become less accurate, leading to unstable and unreliable control.

Our Solutions

Latent action representation – Instead of directly learning the high-dimensional action distribution, we employ a compact latent action representation that is easier to learn (a minimal sketch follows this list).

Injecting independent noise per frame during training – Inspired by Diffusion Forcing, we corrupt the input sequence at training time by injecting independently sampled noise into each frame before the model predicts the clean output.
This enables flexible denoising settings and a stabilization trick that mitigates compounding errors at inference time (see the second sketch after this list):

  • Flexible denoising settings – Because the model has seen many random noise configurations during training, we can adjust noise levels freely at inference time. This allows denoising schedules beyond the standard uniform schedule over a fixed-length sequence: we can design custom schedules and even condition on varying lengths of past motion to predict variable-length future motions, simply by setting different per-frame noise levels.
  • Stabilization method – When generating future motions autoregressively, predictions are conditioned on previously predicted frames. However, these frames may slightly deviate from the training distribution. To prevent the model from treating them as perfect ground truth, we introduce a small amount of noise to indicate potential uncertainty. This helps mitigate compounding errors in long-horizon rollouts, leading to more stable motion generation.
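
For the latent action representation, one plausible realization is a small action autoencoder: the diffusion model predicts compact latents, and a decoder maps them back to full-dimensional actions for the simulator. Layer sizes and `latent_dim` below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionAutoencoder(nn.Module):
    """Compress 69-D control actions into a compact latent space.

    The diffusion model operates on the latent `z`; `decode` recovers
    simulator actions. Dimensions are illustrative, not the paper's choices.
    """
    def __init__(self, action_dim=69, latent_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def encode(self, action):
        return self.encoder(action)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, action):
        return self.decode(self.encode(action))
```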
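And here is a minimal sketch of the per-frame noise injection and the inference-time stabilization trick, assuming a simple additive-noise corruption with a linear schedule and a denoiser that accepts per-frame noise levels; all of these are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def diffusion_forcing_loss(model, clean_seq, num_levels=50):
    """Training: every frame gets an independently sampled noise level, so the
    model learns to denoise arbitrary per-frame noise configurations."""
    B, T, D = clean_seq.shape
    k = torch.randint(0, num_levels, (B, T), device=clean_seq.device)  # per-frame levels
    sigma = (k.float() / num_levels).unsqueeze(-1)                     # toy linear schedule
    noisy = clean_seq + sigma * torch.randn_like(clean_seq)
    return F.mse_loss(model(noisy, k), clean_seq)

def noised_context(past_frames, num_levels=50, context_level=2):
    """Inference-time stabilization: lightly renoise previously generated frames
    so the model does not treat its own predictions as perfect ground truth."""
    sigma = context_level / num_levels
    noisy = past_frames + sigma * torch.randn_like(past_frames)
    levels = torch.full(past_frames.shape[:2], context_level,
                        dtype=torch.long, device=past_frames.device)
    return noisy, levels  # feed these alongside fully-noisy future frames
```

Variable-length prediction then comes for free: assigning low noise levels to however many past frames are available and high levels to the frames being generated realizes the flexible schedules described above.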

Method Overview.


Results


Long-horizon random rollouts

UniPhys produces stable long-horizon rollouts, covering diverse skills.

Text-driven control

Multi-modal text-driven atomic skills



Interactive text-driven control

Goal reaching

Velocity control

Dynamic object avoidance