Canonical Policy: Learning Canonical 3D Representation for SE(3)-Equivariant Policy

Teaser

Canonical Policy enables vision-conditioned policies to generalize across object appearances and viewpoints by learning canonical 3D representations, with improved sample efficiency.

Abstract

We introduce canonical policy, a principled framework for 3D equivariant imitation learning that unifies point cloud observations through a canonical representation. Built upon a rigorous theory of 3D canonical mappings, our method enables end-to-end learning of spatially equivariant policies from demonstrations. By leveraging geometric consistency through canonicalization and the expressiveness of generative policy models, such as diffusion models, canonical policy improves generalization and data efficiency in imitation learning.

Simulation Benchmark

We benchmarked Canonical Policy against several point cloud baselines across 12 simulation tasks. Canonical Policy consistently outperforms all baselines, achieving an average task success improvement of 18%.

Stack D1¹

Mug Cleanup D1¹

Nut Assembly D0¹

Stack Three D1¹

Threading D2²

Square D2¹

Coffee D2¹

Hammer Cleanup D1¹

Push T²

Cloth Folding³

Object Covering³

Box Closing³

¹Robomimic ²Diffusion Policy ³Equibot

Real Robot Experiments

Effectiveness of Canonical Policy

Block Stacking

In the block stacking task, the robot stacks an I-shaped block onto a T-shaped block. On the left, under the original setup, the canonical policy performs reliably. On the right, even with changes in block color, the policy continues to execute the task accurately, showing robustness to appearance shifts.

Shoe Alignment

Next is the shoe alignment task, where the goal is to pick up the right shoe and align it side by side with the left shoe. The policy succeeds under the original setup on the left, and maintains high performance even when the shoe color is completely changed on the right, demonstrating strong appearance invariance.

Can Insertion

The can insertion task is more precision-demanding. The robot must pick up a can on the right and insert it into a hole on the left. Despite the challenge, canonical policy consistently succeeds under both the original and color-shifted settings, validating its grasping and placement accuracy.

Table Organization

Table organization is the most complex task, requiring a sequence of precise operations: object placement and drawer manipulation. Canonical policy handles this long-horizon task effectively in both the original and color-variant setups, highlighting its capacity for robust, multi-step decision-making.

Generalization to Unseen Object Appearances

Beyond color variations, we also test shape generalization in the shoe alignment task by introducing unseen objects (leather shoes and hiking shoes) that differ from the training shoes in both appearance and geometry. Despite these challenges, canonical policy achieves the highest alignment accuracy, demonstrating strong generalization to both shape and color shifts.

Robustness to Viewpoint Variations

We further evaluate the policy's SE(3) equivariance using a mobile UR5 platform with camera viewpoint shifts. As the camera gradually rotates from the original angle on the left to a significantly shifted one on the right, canonical policy maintains stable and accurate predictions, confirming its robustness to egocentric view changes.

Left-Angled View

Frontal View

Right-Angled View

Failure case analysis: The performance drop in the Right-Angled View stems from the single-camera setup, where large viewpoint shifts lead to significantly different point clouds. These new observations often include previously unseen regions, making it difficult for the policy to generalize. In contrast, under small-angle shifts (Frontal View), there is substantial overlap between training and test point clouds, allowing the canonical policy to exploit geometric equivariance and consistently map inputs to the same canonical pose.
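The overlap argument above can be illustrated with a small sketch. It is not the learned canonical mapping used by canonical policy; it swaps in a classical PCA-based canonicalization as a stand-in, and the names `canonicalize` and `random_rotation`, the synthetic point cloud, and the cropping rule are all assumptions made for illustration. Under these assumptions, a rigidly transformed copy of the full cloud maps to the same canonical coordinates, while a cropped view (mimicking regions lost after a large viewpoint shift) does not.

```python
import numpy as np


def canonicalize(points):
    """Map an (N, 3) point cloud to a PCA-aligned frame.

    Illustrative stand-in only, not the learned mapping from the paper.
    """
    centered = points - points.mean(axis=0)
    # eigh returns eigenvalues in ascending order; columns are eigenvectors.
    _, vecs = np.linalg.eigh(centered.T @ centered)
    v1, v2 = vecs[:, 2], vecs[:, 1]
    # Resolve the sign ambiguity of each axis with the third moment.
    if np.sum((centered @ v1) ** 3) < 0:
        v1 = -v1
    if np.sum((centered @ v2) ** 3) < 0:
        v2 = -v2
    # The cross product keeps the frame right-handed.
    v3 = np.cross(v1, v2)
    frame = np.stack([v1, v2, v3], axis=1)
    return centered @ frame


def random_rotation(rng):
    """Sample a proper 3D rotation matrix (determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


rng = np.random.default_rng(0)
# Synthetic, asymmetric cloud so the principal axes and signs are well defined.
cloud = rng.exponential(size=(500, 3)) * np.array([3.0, 2.0, 1.0])

# Apply a rigid SE(3) transform: rotation plus translation.
rotation = random_rotation(rng)
translation = np.array([0.5, -1.0, 2.0])
moved = cloud @ rotation.T + translation

# Full overlap: both clouds land on the same canonical coordinates.
same_when_full = np.allclose(canonicalize(cloud), canonicalize(moved), atol=1e-6)

# Partial overlap: keep only half of the transformed cloud.
mask = moved[:, 0] > np.median(moved[:, 0])
same_when_cropped = np.allclose(
    canonicalize(cloud)[mask], canonicalize(moved[mask]), atol=1e-6
)

print("full overlap maps to same canonical form:", same_when_full)
print("cropped view maps to same canonical form:", same_when_cropped)
```

The cropped case fails because the centroid and principal axes are computed from a different set of points, which is the same mechanism the failure analysis attributes to large single-camera viewpoint shifts.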

BibTeX


      @article{zhang2025canonical,
        title={Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy},
        author={Zhang, Zhiyuan and Xu, Zhengtong and Lakamsani, Jai Nanda and She, Yu},
        journal={arXiv preprint arXiv:2505.18474},
        year={2025}
      }