CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation
Georgia Tech, Toyota Research Institute

CenterSnap: A framework for real-time 6D pose and size estimation and 3D shape reconstruction of novel object instances.

Abstract

We study the complex task of simultaneous multi-object 3D reconstruction and 6D pose and size estimation from a single-view RGB-D observation.

In contrast to instance-level pose estimation, we focus on a more challenging problem where CAD models are not available at inference time. Existing approaches mainly follow a complex multi-stage pipeline that first localizes and detects each object instance in the image and then regresses to either its 3D mesh or its 6D pose. These approaches suffer from high computational cost and low performance in challenging scenarios, such as occlusions or clutter. Hence, we present a simple one-stage approach that predicts the 3D shape and estimates the 6D pose and size jointly in a bounding-box-free manner. In particular, our method regards object instances as spatial centers, where each center denotes the complete shape of an object along with its 6D pose and size.

Through this per-pixel representation, our approach can reconstruct multiple novel object instances in real time (40 FPS) and predict their 6D poses and sizes in a single forward pass. Through extensive experiments, we demonstrate that our approach significantly outperforms all shape completion and categorical 6D pose and size estimation baselines on the multi-object ShapeNet and NOCS datasets, respectively, with a 12.6% absolute improvement in mAP for 6D pose on novel real-world object instances.

Method

CenterSnap is an anchor-free, single-shot approach to jointly optimize for 3D shape reconstruction and 6D pose and size.
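
The idea of treating objects as centers can be illustrated with a minimal decoding sketch. It assumes the network outputs a center heatmap and a per-pixel parameter map, and that each parameter vector concatenates a shape code with pose and size values. The function name `extract_centers`, the 0.3 score threshold, and the 128-value shape code are illustrative assumptions, not details stated on this page.

```python
import numpy as np

def extract_centers(heatmap, params, threshold=0.3, latent_dim=128):
    """Pick local maxima of a center heatmap and read the vector stored at each peak.

    heatmap: (H, W) float array of center scores.
    params:  (H, W, C) float array; the first `latent_dim` channels are treated
             as a shape code and the rest as pose and size (an assumed layout).
    """
    h, w = heatmap.shape
    # Pad with -inf so border pixels still have a full 3x3 neighbourhood.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views and take the maximum over each 3x3 window.
    windows = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    )
    local_max = windows.max(axis=0)
    # A peak equals the maximum of its neighbourhood and clears the threshold.
    is_peak = (heatmap >= local_max) & (heatmap > threshold)
    ys, xs = np.nonzero(is_peak)

    objects = []
    for y, x in zip(ys, xs):
        vec = params[y, x]
        objects.append({
            "center": (int(y), int(x)),
            "score": float(heatmap[y, x]),
            "shape_code": vec[:latent_dim],   # would be decoded to a shape
            "pose_size": vec[latent_dim:],    # rotation, translation, scale
        })
    return objects
```

Because every detected peak already carries its own shape and pose vector, no bounding boxes or per-object second stage are needed in this sketch; all objects are read out of a single pair of output maps.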

Video

Fast Real-time Reconstruction and Sim2Real transfer

CenterSnap performs effective sim2real transfer using limited real-world examples. Our technique runs at 40 FPS on an NVIDIA Quadro RTX 5000 GPU.

3D Shape Completion

6D Pose and Size Comparison

CenterSnap performs accurate 6D pose and size estimation on the NOCS-Real275 dataset.
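
Pose accuracy on benchmarks of this kind is commonly reported against rotation and translation thresholds such as 5° and 5 cm. The sketch below shows how the two errors can be computed for a single predicted and ground-truth pose, assuming both are 4x4 rigid transforms with scale already removed. The helper name `pose_errors` is hypothetical, and the sketch ignores the special handling of symmetric object categories that a full evaluation would need.

```python
import numpy as np

def pose_errors(pred, gt):
    """Return (rotation error in degrees, translation error in input units).

    pred, gt: 4x4 homogeneous transforms whose upper-left 3x3 blocks are
    proper rotation matrices (assumed; no scale factor baked in).
    """
    R_pred, t_pred = pred[:3, :3], pred[:3, 3]
    R_gt, t_gt = gt[:3, :3], gt[:3, 3]
    # Geodesic angle between two rotations: trace(R_pred R_gt^T) = 1 + 2 cos(theta).
    cos_theta = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    rot_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    trans = np.linalg.norm(t_pred - t_gt)
    return rot_deg, trans
```

A prediction would then count as correct at a given threshold pair when both returned errors fall below their respective bounds.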

Texture Reconstruction

As a byproduct of our method, we can also reconstruct complete 3D models with textures.

Sim2Real Canonical Reconstruction

CenterSnap results in real home environments captured with a ZED 2 camera. Note that in this case, the model was trained only in simulation.

BibTeX

@inproceedings{irshad2022centersnap,
  title={CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation},
  author={Muhammad Zubair Irshad and Thomas Kollar and Michael Laskey and Kevin Stone and Zsolt Kira},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2022},
  url={https://arxiv.org/abs/2203.01929},
}