Learning Upright and Forward-Facing Object Poses using Category-level Canonical Representations

Bing Han, Ruitao Pan, Xinyu Zhang, Chenxi Wang, Zhi Zhai, Zhibin Zhao*, Xuefeng Chen
Xi'an Jiaotong University
[Teaser figure]

TL;DR: We propose a category-level point cloud canonicalization method that learns a unified canonical pose across different instances within the same category.

Abstract

Constructing a unified canonical pose representation for 3D object categories is crucial for pose estimation and robotic scene understanding. Previous unified pose representations often relied on manual alignment, as in ShapeNet and ModelNet. Recently, self-supervised canonicalization methods have been proposed. However, they are sensitive to intra-class shape variations, and their canonical pose representations cannot be aligned to an object-centered coordinate system.

In this paper, we propose a category-level canonicalization method that alleviates the impact of shape variation and extends the canonical pose representation to an upright and forward-facing state. First, we design a Siamese VN Module (SVNM) that achieves SE(3)-equivariant modeling and self-supervised disentanglement of 3D shape and pose attributes. Next, we introduce a Siamese equivariant constraint that addresses the pose alignment bias caused by shape deformation. Finally, we propose a method for generating upright surface labels from pose-unknown in-the-wild data, and use upright and symmetry losses to correct the canonical pose.
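To make the equivariance ingredient concrete, here is a minimal sketch of a Vector Neuron (VN) style rotation-equivariant linear layer of the kind SVNM builds on. The class name, feature shapes, and the check at the end are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class VNLinear(nn.Module):
    """Rotation-equivariant linear layer over vector-valued features.

    Features have shape (B, C, 3, N): C vector channels of 3D vectors for
    each of N points. Mixing channels with a shared linear map leaves the
    3D axis untouched, so the layer commutes with any rotation R:
    f(R x) = R f(x).
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.map = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C_in, 3, N) -> (B, C_out, 3, N)
        return self.map(x.transpose(1, -1)).transpose(1, -1)

# Equivariance check: rotating the input rotates the output identically.
layer = VNLinear(8, 16)
x = torch.randn(2, 8, 3, 64)
Q, _ = torch.linalg.qr(torch.randn(3, 3))  # random orthogonal matrix
rotate = lambda pts: torch.einsum('ij,bcjn->bcin', Q, pts)
assert torch.allclose(layer(rotate(x)), rotate(layer(x)), atol=1e-4)
```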

Experimental results show that our method not only achieves state-of-the-art consistency performance but also aligns canonical poses with the object-centered coordinate system.

Overview

[Architecture figure]

Overall architecture. \(X \rightarrow \widetilde{X}\): Learning neutral canonical representations of point clouds with arbitrary input poses. \(\widetilde{X} \rightarrow \widehat{X}\): Learning residual rotations to align the canonical pose to the upright and forward-facing state.
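A minimal sketch of this two-stage flow is given below; the function signature, rotation parameterization, and network interfaces are our own placeholder assumptions, not the paper's released code.

```python
import torch

def canonicalize(x, pose_net, residual_net):
    """Two-stage canonicalization sketch matching the figure.

    x:            (B, N, 3) point clouds in arbitrary input poses.
    pose_net:     predicts a rotation R in SO(3) that disentangles pose
                  from shape (X -> X~, the neutral canonical pose).
    residual_net: predicts the residual rotation that brings X~ to the
                  upright, forward-facing state (X~ -> X^).
    """
    R = pose_net(x)                                        # (B, 3, 3)
    x_tilde = torch.einsum('bij,bnj->bni', R.transpose(1, 2), x)
    R_res = residual_net(x_tilde)                          # (B, 3, 3)
    x_hat = torch.einsum('bij,bnj->bni', R_res, x_tilde)
    return x_tilde, x_hat
```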

Data Generation

Results


[Animated results: inputs with pose variations alongside our canonical shape representations.]

[Figure: inputs with shape variations, our canonical shape representations, and manually aligned ShapeNet references.]

We offer a novel perspective: transforming in-the-wild, unaligned point cloud data into physically meaningful, aligned datasets without relying on any pose annotations. In these datasets, the aligned axes are synchronized with the Euclidean coordinate system (e.g., airplane cockpits are always aligned with \( +\mathrm{Y} \), and car wheels always touch \( \mathrm{Z} = 0 \)).
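As one hedged illustration of what such an invariant means in practice, a dataset consumer could verify ground contact with a check like the following; the function name and tolerance are our own assumptions.

```python
import numpy as np

def touches_ground(points: np.ndarray, tol: float = 1e-2) -> bool:
    """Ground-contact invariant of an aligned shape: its lowest surface
    points should lie on the Z = 0 plane (tolerance is an assumption)."""
    return abs(float(points[:, 2].min())) < tol

# e.g., every aligned car in such a dataset should satisfy this check.
```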

Robot Placement Task


[Videos: first-person and third-person views of the robot placement task.]