Canonical Policy enables vision-conditioned policies to generalize across object appearances and viewpoints by learning canonical 3D representations, with improved sample efficiency.
We introduce canonical policy, a principled framework for 3D equivariant imitation learning that unifies point cloud observations through a canonical representation. Built upon a rigorous theory of 3D canonical mappings, our method enables end-to-end learning of spatially equivariant policies from demonstrations. By leveraging geometric consistency through canonicalization and the expressiveness of generative policy models, such as diffusion models, canonical policy improves generalization and data efficiency in imitation learning.
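To make the idea of canonicalization concrete, here is a minimal sketch of one common way to bring a point cloud into a pose-independent frame: center it at its centroid and align it with its principal axes. This is an illustrative stand-in, not the canonical mapping used by canonical policy. The function names `canonicalize` and `to_world` are hypothetical. The sketch assumes the cloud has three distinct principal variances and non-zero skewness along the first two axes.

```python
import numpy as np


def canonicalize(points: np.ndarray):
    """Map an (N, 3) point cloud to a pose-independent canonical frame.

    Returns (canonical, R, t) with canonical = (points - t) @ R,
    where t is the centroid and R is a rotation (det = +1).
    Assumption: PCA alignment, used here only for illustration.
    """
    t = points.mean(axis=0)
    centered = points - t
    # Principal axes are the eigenvectors of the 3x3 covariance matrix.
    cov = centered.T @ centered / len(points)
    _, eigvecs = np.linalg.eigh(cov)  # eigh sorts eigenvalues in ascending order
    a1, a2 = eigvecs[:, 2].copy(), eigvecs[:, 1].copy()
    # Eigenvectors are defined only up to sign. Fix the sign of the first two
    # axes using the third moment of the points projected onto each axis.
    for axis in (a1, a2):
        if np.sum((centered @ axis) ** 3) < 0:
            axis *= -1.0
    # The third axis follows from right-handedness, which gives det(R) = +1.
    R = np.stack([a1, a2, np.cross(a1, a2)], axis=1)
    return centered @ R, R, t


def to_world(position_canonical: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Map a position predicted in the canonical frame back to the input frame."""
    return R @ position_canonical + t


# Usage: the same cloud seen under a different rigid transform.
rng = np.random.default_rng(0)
cloud = rng.exponential(size=(500, 3)) * np.array([3.0, 2.0, 1.0])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
G = Q * np.sign(np.linalg.det(Q))  # force a proper rotation
moved = cloud @ G.T + np.array([0.5, -1.0, 2.0])
print(np.allclose(canonicalize(cloud)[0], canonicalize(moved)[0], atol=1e-8))
```

Under these assumptions, a policy that only sees the canonical cloud receives the same input regardless of the original pose, and mapping its output back with `to_world` yields an action that moves with the scene. PCA alignment is discontinuous for near-symmetric shapes, which is one reason a more careful canonical mapping is needed in practice.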
We benchmark Canonical Policy against several point cloud baselines across 12 simulation tasks. Canonical Policy consistently outperforms all baselines, improving average task success by 18%.
Stack D1 [1]
Mug Cleanup D1 [1]
Nut Assembly D0 [1]
Stack Three D1 [1]
Threading D2 [2]
Square D2 [1]
Coffee D2 [1]
Hammer Cleanup D1 [1]
Push T [2]
Cloth Folding [3]
Object Covering [3]
Box Closing [3]
In the block stacking task, the robot stacks an I-shaped block onto a T-shaped block. In the top video, under the original setup, the canonical policy performs reliably. In the bottom video, the policy continues to execute the task accurately even when the block colors change, showing robustness to appearance shifts.
Next is the shoe alignment task, where the goal is to pick up the right shoe and align it side by side with the left shoe. In the top video, the policy succeeds under the original setup. In the bottom video, it maintains high performance even when the shoe color is completely changed, demonstrating strong appearance invariance.
The can insertion task demands more precision. The robot must pick up a can on the right and insert it into a hole on the left. In the top video, the canonical policy often succeeds under the original setup. In the bottom video, it maintains comparable performance under color-shifted conditions, highlighting its robustness in grasping and precise placement.
Table organization is the most complex task, requiring a sequence of precise operations: object placement and drawer manipulation. Canonical policy handles this long-horizon task effectively in both the original and color-variant setups, highlighting its capacity for robust, multi-step decision-making.
Beyond color variations, we also test shape generalization in the shoe alignment task by introducing unseen objects (leather shoes and hiking shoes) that shift both appearance and geometry. Despite these challenges, canonical policy achieves the highest alignment accuracy, demonstrating strong generalization to both shape and color shifts.
We further evaluate the policy's SE(3) equivariance using a mobile UR5 platform with camera viewpoint shifts. As the camera gradually rotates from the original angle on the left to a significantly shifted one on the right, canonical policy maintains stable and accurate predictions, confirming its robustness to egocentric view changes.
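As a rough illustration of what SE(3) equivariance means numerically, the sketch below checks whether a policy's output moves rigidly with its input: for a rotation G and translation b, an equivariant policy satisfies policy(G·X + b) = G·policy(X) + b. The helper `equivariance_error` and the two toy policies are hypothetical and are not part of canonical policy. The sketch also assumes that observations and predicted positions are expressed in the same frame.

```python
import numpy as np


def equivariance_error(policy, points, G, b):
    """Largest deviation between policy(g . points) and g . policy(points)."""
    transformed_input = points @ G.T + b
    expected = G @ policy(points) + b
    return float(np.max(np.abs(policy(transformed_input) - expected)))


def rotation_z(angle_rad):
    """Rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def toward_farthest_point(points):
    """Toy equivariant policy: midpoint of the centroid and the farthest point."""
    centroid = points.mean(axis=0)
    distances = np.linalg.norm(points - centroid, axis=1)
    return 0.5 * (centroid + points[np.argmax(distances)])


def fixed_world_offset(points):
    """Toy non-equivariant policy: always offsets along the fixed x-axis."""
    return points.mean(axis=0) + np.array([0.1, 0.0, 0.0])


# Usage: sweep a simulated viewpoint rotation and report the error per policy.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(200, 3))
shift = np.array([0.2, 0.0, -0.1])
for degrees in (0, 30, 60, 90):
    G = rotation_z(np.deg2rad(degrees))
    print(
        degrees,
        equivariance_error(toward_farthest_point, cloud, G, shift),
        equivariance_error(fixed_world_offset, cloud, G, shift),
    )
```

The first toy policy depends only on the geometry of the cloud, so its error stays at floating-point level across the sweep. The second hard-codes a direction in the observation frame, so its error grows as the rotation angle increases.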