Canonical Policy learns canonical 3D representations that enable vision-conditioned policies to generalize across object appearances and viewpoints with improved sample efficiency.
We introduce canonical policy, a principled framework for 3D equivariant imitation learning that unifies point cloud observations through a canonical representation. Built on a rigorous theory of 3D canonical mappings, our method enables end-to-end learning of spatially equivariant policies from demonstrations. By combining the geometric consistency of canonicalization with the expressiveness of generative policy models such as diffusion models, canonical policy improves generalization and data efficiency in imitation learning.
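To make the idea concrete, below is a minimal sketch of point cloud canonicalization, assuming NumPy and a simple centroid-plus-principal-axes mapping as an illustrative stand-in for the learned canonical mapping in the paper; the `canonicalize` and `random_se3` helpers and the synthetic cloud are hypothetical.

import numpy as np

def canonicalize(points: np.ndarray) -> np.ndarray:
    """Illustrative canonical mapping: centroid shift + principal-axis alignment.

    A stand-in for a learned canonical mapping; for a fully observed, asymmetric
    cloud it sends every rigidly transformed copy to the same canonical cloud.
    """
    centered = points - points.mean(axis=0)               # remove translation
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt.T                                            # columns = principal axes
    # Resolve the +/- ambiguity of each axis using the sign of the third moment.
    signs = np.sign(((centered @ axes) ** 3).sum(axis=0))
    signs[signs == 0] = 1.0
    axes = axes * signs
    if np.linalg.det(axes) < 0:                            # keep a right-handed frame
        axes[:, -1] *= -1
    return centered @ axes                                 # coordinates in the canonical frame

def random_se3(rng: np.random.Generator):
    """Random rotation (via QR) and translation, simulating a new object/camera pose."""
    rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(rot) < 0:
        rot[:, -1] *= -1
    return rot, rng.normal(size=3)

rng = np.random.default_rng(0)
# Synthetic, asymmetric "object" cloud standing in for a segmented observation.
cloud = rng.normal(size=(2048, 3)) * [0.30, 0.10, 0.04] + 0.05 * rng.normal(size=(2048, 3)) ** 2

rot, trans = random_se3(rng)
moved = cloud @ rot.T + trans                              # same object under an SE(3) change

# Both observations collapse to (numerically) the same canonical cloud, so a
# policy conditioned on the canonical cloud sees a pose-invariant input.
print(np.abs(canonicalize(cloud) - canonicalize(moved)).max())   # ~1e-15

The actual method builds on a theory of 3D canonical mappings and is trained end to end; this fixed PCA mapping only illustrates the pose invariance that a canonical representation provides.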
We benchmark canonical policy against several point cloud baselines across 12 simulation tasks. Canonical policy consistently outperforms all baselines, achieving an average improvement of 18% in task success rate.
Simulation tasks: Stack D1, Mug Cleanup D1, Nut Assembly D0, Stack Three D1, Threading D2, Square D2, Coffee D2, Hammer Cleanup D1, Push T, Cloth Folding, Object Covering, Box Closing.
In the block stacking task, the robot stacks an I-shaped block onto a T-shaped block. On the left, under the original setup, canonical policy performs reliably. On the right, even with changes in block color, the policy continues to execute the task accurately, showcasing robustness to appearance shifts.
Next is the shoe alignment task, where the goal is to pick up the right shoe and align it side by side with the left shoe. The policy succeeds under the original setup on the left, and maintains high performance even when the shoe color is completely changed on the right, demonstrating strong appearance invariance.
The can insertion task is more precision-demanding. The robot must pick up a can on the right and insert it into a hole on the left. Despite the challenge, canonical policy consistently succeeds under both the original and color-shifted settings, validating its grasping and placement accuracy.
Table organization is the most complex task, requiring a sequence of precise operations: object placement and drawer manipulation. Canonical policy handles this long-horizon task effectively in both the original and color-variant setups, highlighting its capacity for robust, multi-step decision-making.
Beyond color variations, we also test shape generalization in the shoe alignment task by introducing unseen leather shoes and hiking shoes that differ from the training objects in both appearance and geometry. Despite these challenges, canonical policy achieves the highest alignment accuracy, demonstrating strong generalization to both shape and color shifts.
We further evaluate the policy's SE(3) equivariance using a mobile UR5 platform with camera viewpoint shifts. As the camera gradually rotates from the original angle on the left to a significantly shifted one on the right, canonical policy maintains stable and accurate predictions, confirming its robustness to egocentric view changes.
Failure case analysis: The performance drop in the Right-Angled View stems from the single-camera setup, where large viewpoint shifts lead to significantly different point clouds. These new observations often include previously unseen regions, making it difficult for the policy to generalize. In contrast, under small-angle shifts (Frontal View), there is substantial overlap between training and test point clouds, allowing the canonical policy to exploit geometric equivariance and consistently map inputs to the same canonical pose.
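The toy sketch from above (reusing the hypothetical `canonicalize`, `cloud`, and `moved`) illustrates this argument: with full overlap the two observations map to one canonical cloud, but once a large view shift hides part of the object, the crop's centroid and principal axes change and the canonical poses no longer agree.

# Continuing the canonicalization sketch above (canonicalize, cloud, moved in scope).

# Large view shift: pretend the far half of the object is now self-occluded.
visible = cloud[:, 0] > 0.0                   # crude single-camera visibility mask
partial = moved[visible]                      # the newly observed, cropped cloud

# The crop has a different centroid and different principal axes, so even the
# points still shared with training no longer land on the same canonical
# coordinates, unlike the full-overlap case shown earlier.
drift = np.abs(canonicalize(cloud)[visible] - canonicalize(partial)).max()
print(drift)                                  # clearly non-zero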
@article{zhang2025canonical,
title={Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy},
author={Zhang, Zhiyuan and Xu, Zhengtong and Lakamsani, Jai Nanda and She, Yu},
journal={arXiv preprint arXiv:2505.18474},
year={2025}
}