Learning Upright and Forward-Facing Object Poses using Category-level Canonical Representations

Bing Han, Ruitao Pan, Xinyu Zhang, Chenxi Wang, Zhi Zhai, Zhibin Zhao*, Xuefeng Chen
Xi'an Jiaotong University
[Teaser figure]

TL;DR: We propose a category-level point cloud canonicalization method that learns a unified canonical pose across different instances within the same category.

Abstract

Constructing a unified canonical pose representation for 3D object categories is crucial for pose estimation and robotic scene understanding. Previous unified pose representations often relied on manual alignment, as in ShapeNet and ModelNet. Recently, self-supervised canonicalization methods have been proposed. However, they are sensitive to intra-class shape variations, and their canonical pose representations cannot be aligned to an object-centered coordinate system.

In this paper, we propose a category-level canonicalization method that alleviates the impact of shape variation and extends the canonical pose representation to an upright and forward-facing state. First, we design a Siamese VN Module (SVNM) that achieves SE(3)-equivariant modeling and self-supervised disentanglement of 3D shape and pose attributes. Next, we introduce a Siamese equivariant constraint that addresses the pose alignment bias caused by shape deformation. Finally, we propose a method for generating upright surface labels from pose-unknown in-the-wild data, and use upright and symmetry losses to correct the canonical pose.
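To make the equivariance ingredient concrete, here is a minimal sketch of a Vector Neuron (VN) style rotation-equivariant linear layer of the kind SVNM builds on. The class name, feature shapes, and the check at the end are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class VNLinear(nn.Module):
    """Rotation-equivariant linear layer over vector-valued features.

    Features have shape (B, C, 3, N): C vector channels of 3D vectors for
    each of N points. Mixing channels with a shared linear map leaves the
    3D axis untouched, so the layer commutes with any rotation R:
    f(R x) = R f(x).
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.map = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C_in, 3, N) -> (B, C_out, 3, N)
        return self.map(x.transpose(1, -1)).transpose(1, -1)

# Equivariance check: rotating the input rotates the output identically.
layer = VNLinear(8, 16)
x = torch.randn(2, 8, 3, 64)
Q, _ = torch.linalg.qr(torch.randn(3, 3))  # random orthogonal matrix
rotate = lambda pts: torch.einsum('ij,bcjn->bcin', Q, pts)
assert torch.allclose(layer(rotate(x)), rotate(layer(x)), atol=1e-4)
```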

Experimental results show that our method not only achieves state-of-the-art consistency performance but also aligns canonical poses with the object-centered coordinate system.

Overview

[Architecture figure]

Overall architecture. \(X \rightarrow \widetilde{X}\): Learning neutral canonical representations of point clouds with arbitrary input poses. \(\widetilde{X} \rightarrow \widehat{X}\): Learning residual rotations to align the canonical pose to the upright and forward-facing state.
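A minimal sketch of this two-stage flow is given below; the function signature, rotation parameterization, and network interfaces are our own placeholder assumptions, not the paper's released code.

```python
import torch

def canonicalize(x, pose_net, residual_net):
    """Two-stage canonicalization sketch matching the figure.

    x:            (B, N, 3) point clouds in arbitrary input poses.
    pose_net:     predicts a rotation R in SO(3) that disentangles pose
                  from shape (X -> X~, the neutral canonical pose).
    residual_net: predicts the residual rotation that brings X~ to the
                  upright, forward-facing state (X~ -> X^).
    """
    R = pose_net(x)                                        # (B, 3, 3)
    x_tilde = torch.einsum('bij,bnj->bni', R.transpose(1, 2), x)
    R_res = residual_net(x_tilde)                          # (B, 3, 3)
    x_hat = torch.einsum('bij,bnj->bni', R_res, x_tilde)
    return x_tilde, x_hat
```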

Data Generation

Results


[Animated results: inputs with pose variations alongside our canonical shape representations.]

[Figure: inputs with shape variations, our canonical shape representations, and manually aligned ShapeNet references.]

We offer a novel perspective: transforming in-the-wild, unaligned point cloud data into physically meaningful, aligned datasets without relying on any pose annotations. In these datasets, the aligned axes are synchronized with the Euclidean coordinate system (e.g., airplane cockpits are always aligned with \( +\mathrm{Y} \), and car wheels always touch \( \mathrm{Z} = 0 \)).
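As one hedged illustration of what such an invariant means in practice, a dataset consumer could verify ground contact with a check like the following; the function name and tolerance are our own assumptions.

```python
import numpy as np

def touches_ground(points: np.ndarray, tol: float = 1e-2) -> bool:
    """Ground-contact invariant of an aligned shape: its lowest surface
    points should lie on the Z = 0 plane (tolerance is an assumption)."""
    return abs(float(points[:, 2].min())) < tol

# e.g., every aligned car in such a dataset should satisfy this check.
```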

Robot Placement Task


[Videos: first-person and third-person views of the robot placement task.]