EquiForm enables robots to act consistently under both SE(3) transformations and noisy point cloud observations. This robustness leads to reliable manipulation behavior across diverse objects, layouts, and real-world sensing conditions.
Equivariance alone is not sufficient in practice. As shown above, non-equivariant policies fail to align their actions under transformations, while equivariant methods that lack noise robustness become unstable under realistic point cloud noise and partial observations. EquiForm addresses this gap by explicitly modeling how noise disrupts equivariance, enabling consistent behavior across diverse poses, layouts, and sensing conditions.
We introduce EquiForm, a framework for noise-robust SE(3)-equivariant policy learning from point clouds. EquiForm combines geometric denoising with equivariant contrastive learning to counteract the way realistic perception noise disrupts equivariance, which in turn stabilizes the canonical representations. This enables consistent action generation under SE(3) transformations and improves robustness and generalization in imitation learning. EquiForm is modular and is designed to integrate with existing policy architectures.
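For readers less familiar with the term, the equivariance property referred to here can be written in its standard form. The notation below (policy \pi, point cloud P, group action \rho) is our own shorthand for illustration and is not taken from the paper; in particular, the exact form of \rho depends on how actions are parameterized, which this page does not specify.

```latex
% Standard definition of SE(3) equivariance for a policy \pi
% mapping a point cloud P = \{p_i\}_{i=1}^{N} \subset \mathbb{R}^3 to an action a.
% Notation is illustrative, not the paper's.
\[
  g = (R, t) \in SE(3), \qquad
  g \cdot P = \{\, R\,p_i + t \,\}_{i=1}^{N},
\]
\[
  \pi(g \cdot P) = \rho(g)\,\pi(P) \quad \text{for all } g \in SE(3),
\]
% where \rho(g) denotes the action of g on the policy's output space.
% With a noisy observation \tilde{P} = \{\, p_i + \epsilon_i \,\},
% \epsilon_i \sim \mathcal{N}(0, \sigma^2 I), the quantity of interest is how far
% \pi(g \cdot \tilde{P}) deviates from \rho(g)\,\pi(P).
```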
Qualitative visualization of geometric denoising under increasing Gaussian noise. Point cloud observations are corrupted with isotropic Gaussian noise of increasing standard deviation (left to right). We compare the noisy input, farthest point sampling (FPS), and the proposed geometric denoising. Geometric denoising preserves surface structure and spatial consistency under severe noise, whereas FPS alone fails to recover coherent geometry.
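The two baseline operations named in this caption, isotropic Gaussian corruption and farthest point sampling, can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions: the function names `add_gaussian_noise` and `farthest_point_sampling` are hypothetical, the noise is applied independently to every coordinate, and the proposed geometric denoising itself is not reproduced here because the page does not describe its implementation.

```python
import numpy as np


def add_gaussian_noise(points, sigma, rng):
    """Corrupt an (N, 3) point cloud with isotropic Gaussian noise of std `sigma`."""
    return points + rng.normal(0.0, sigma, size=points.shape)


def farthest_point_sampling(points, num_samples, start_index=0):
    """Return indices of `num_samples` points chosen by greedy farthest point sampling.

    Each new point is the one whose distance to the already selected set is largest.
    """
    n = points.shape[0]
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be in the range [1, N]")

    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start_index
    # Squared distance from every point to its nearest selected point so far.
    min_sq_dist = np.full(n, np.inf)

    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        sq_dist = np.einsum("ij,ij->i", diff, diff)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
        selected[i] = np.argmax(min_sq_dist)

    return selected


# Example: corrupt a synthetic cloud at several noise levels, then downsample.
# The sigma values are placeholders, not the levels used in the experiments.
rng = np.random.default_rng(0)
clean = rng.uniform(-1.0, 1.0, size=(1024, 3))
for sigma in (0.0, 0.01, 0.05):
    noisy = add_gaussian_noise(clean, sigma, rng)
    subset = noisy[farthest_point_sampling(noisy, 128)]
```

Because FPS only selects a subset of the input points, any noise in the selected points carries over unchanged, which is consistent with the caption's observation that FPS alone does not recover coherent geometry.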
Noise levels. We evaluate robustness under progressively increasing observation noise:
Robustness of equivariant representations under increasing observation noise. Equivariant feature embeddings are visualized across training epochs and noise levels, comparing models trained with and without contrastive learning.
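To make the role of contrastive learning concrete, the sketch below shows a generic InfoNCE-style objective that pulls together embeddings of a clean observation and its noisy counterpart while pushing apart embeddings of different observations. This is an assumption on our part: the page does not state which contrastive loss EquiForm uses or how it is adapted to equivariant features, and the names `info_nce_loss`, `z_clean`, `z_noisy`, and `temperature` are hypothetical.

```python
import numpy as np


def info_nce_loss(z_clean, z_noisy, temperature=0.1):
    """Generic InfoNCE loss between two (B, D) batches of embeddings.

    Row i of `z_clean` and row i of `z_noisy` form the positive pair;
    all other rows in the batch act as negatives.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    z_clean = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    z_noisy = z_noisy / np.linalg.norm(z_noisy, axis=1, keepdims=True)

    logits = z_clean @ z_noisy.T / temperature
    # Subtract the row-wise maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the matching (diagonal) pairs.
    return -np.mean(np.diag(log_prob))
```

Under this objective, a representation that maps a noisy observation close to the embedding of its clean counterpart receives a lower loss than one that confuses it with other samples in the batch.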
Under SE(3) layout variations, many policies either fail to align their actions or become unstable due to sensing noise. As shown above, EquiForm consistently follows the intended manipulation strategy, producing reliable real-world behavior across diverse scene configurations.
Finally, we analyze representative failure cases to understand the remaining limitations of EquiForm. Failures mainly occur when small objects are under-represented after point cloud downsampling, or when thin deformable objects introduce geometric ambiguity under geometry-only perception. These cases highlight inherent challenges of point cloud-based manipulation and suggest future directions such as adaptive sampling and complementary sensing modalities.