CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation
1Georgia Tech, 2Toyota Research Institute
CenterSnap: A framework for real-time 6D pose and size estimation and 3D shape reconstruction of novel object instances.

Abstract

We study the complex task of simultaneous multi-object 3D reconstruction and 6D pose and size estimation from a single-view RGB-D observation.

In contrast to instance-level pose estimation, we focus on a more challenging problem where CAD models are not available at inference time. Existing approaches mainly follow a complex multi-stage pipeline that first localizes and detects each object instance in the image and then regresses to either its 3D mesh or its 6D pose. These approaches suffer from high computational cost and low performance in challenging scenarios such as occlusion or clutter. Hence, we present a simple one-stage approach that jointly predicts the 3D shape and estimates the 6D pose and size in a bounding-box-free manner. In particular, our method treats object instances as spatial centers, where each center denotes the complete shape of an object along with its 6D pose and size.

Through this per-pixel representation, our approach can reconstruct multiple novel object instances in real time (40 FPS) and predict their 6D poses and sizes in a single forward pass. Through extensive experiments, we demonstrate that our approach significantly outperforms all shape completion and categorical 6D pose and size estimation baselines on the multi-object ShapeNet and NOCS datasets, respectively, with a 12.6% absolute improvement in mAP for 6D pose on novel real-world object instances.

Method

CenterSnap is an anchor-free, single-shot approach to jointly optimize for 3D shape reconstruction and 6D pose and size.
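To illustrate the idea of treating objects as centers in a per-pixel output, here is a minimal NumPy sketch. It is not the CenterSnap implementation: the function names `extract_centers` and `gather_object_vectors`, the 3x3 peak window, the 0.5 threshold, and the layout of the per-pixel output as an `(H, W, D)` array are all assumptions made for illustration.

```python
import numpy as np


def extract_centers(heatmap, threshold=0.5, window=3):
    """Return (row, col) indices of local maxima at or above `threshold`.

    `heatmap` is a float array of shape (H, W). The window size and the
    threshold are illustrative choices, not values from the paper.
    """
    h, w = heatmap.shape
    pad = window // 2
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Windowed maximum computed from shifted views of the padded heatmap.
    local_max = np.full_like(heatmap, -np.inf)
    for dy in range(window):
        for dx in range(window):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])
    # A pixel is a center if it equals its neighborhood maximum and is
    # confident enough. Flat plateaus would yield several adjacent peaks.
    is_peak = (heatmap == local_max) & (heatmap >= threshold)
    return np.argwhere(is_peak)


def gather_object_vectors(per_pixel_output, centers):
    """Read one D-dimensional vector per detected center.

    `per_pixel_output` has shape (H, W, D). In this sketch the vector is
    a stand-in for whatever shape, pose, and size parameters the network
    predicts at each pixel.
    """
    return per_pixel_output[centers[:, 0], centers[:, 1]]
```

Because every object is read out at its center pixel, no bounding boxes or anchors are needed, and all objects in the image are recovered from the same forward pass.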

Video

Fast Real-time Reconstruction and Sim2Real transfer

CenterSnap performs effective sim2real transfer using limited real-world examples. Our technique runs at 40 FPS on an NVIDIA Quadro RTX 5000 GPU.

3D Shape Completion

6D Pose and Size Comparison

CenterSnap performs accurate 6D pose and size estimation on the NOCS-Real275 dataset.
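As a rough illustration of what a 6D pose and size estimate means geometrically, the sketch below places a canonical point cloud into the camera frame. The parameterization (a 3x3 rotation matrix, a translation vector, and a single uniform scale) and the helper names `place_canonical_shape` and `axis_aligned_size` are assumptions for this example, not necessarily the convention used by CenterSnap.

```python
import numpy as np


def place_canonical_shape(points, rotation, translation, scale):
    """Map canonical points (N, 3) into the camera frame.

    Each point p becomes scale * (rotation @ p) + translation.
    A uniform scalar scale is an assumption of this sketch.
    """
    return scale * points @ rotation.T + translation


def axis_aligned_size(points):
    """Extent of a point cloud along x, y, z (its axis-aligned box size)."""
    return points.max(axis=0) - points.min(axis=0)
```

For example, with a 90 degree rotation about the z axis, a scale of 2, and a translation of (1, 1, 1), the canonical point (1, 0, 0) lands at (1, 3, 1).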

Texture Reconstruction

As a byproduct of our method, we can also reconstruct complete 3D models with textures.

Sim2Real Canonical Reconstruction

CenterSnap results in real home environments, captured with a ZED 2 camera. Note that in this case the model was trained only in simulation.

BibTeX

If you find our paper or PyTorch code repository useful, please consider citing:

@inproceedings{irshad2022centersnap,
  title={CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation},
  author={Muhammad Zubair Irshad and Thomas Kollar and Michael Laskey and Kevin Stone and Zsolt Kira},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2022},
  url={https://arxiv.org/abs/2203.01929},
}