We study the complex task of simultaneous multi-object 3D reconstruction and 6D pose and size estimation from a single-view RGB-D observation.
In contrast to instance-level pose estimation, we focus on the more challenging setting in which CAD models are not available at inference time. Existing approaches mainly follow a complex multi-stage pipeline that first localizes and detects each object instance in the image and then regresses to either its 3D mesh or its 6D pose. These approaches suffer from high computational cost and low performance in challenging scenarios such as occlusion or clutter. Hence, we present a simple one-stage approach that predicts the 3D shape and estimates the 6D pose and size jointly in a bounding-box-free manner. In particular, our method treats object instances as spatial centers, where each center denotes the complete shape of an object along with its 6D pose and size.
Through this per-pixel representation, our approach can reconstruct multiple novel object instances in real time (40 FPS) and predict their 6D poses and sizes in a single forward pass. Through extensive experiments, we demonstrate that our approach significantly outperforms all shape completion and categorical 6D pose and size estimation baselines on the multi-object ShapeNet and NOCS datasets, respectively, with a 12.6% absolute improvement in mAP for 6D pose on novel real-world object instances.
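To make the "objects as centers" idea concrete, here is a minimal NumPy sketch of how a per-pixel representation could be decoded: find local maxima in a center heatmap, then read out the per-pixel vector stored at each peak. This is an illustration under stated assumptions, not the CenterSnap implementation. The function names `extract_centers` and `gather_object_vectors`, the 3x3 window, the 0.5 threshold, and the array layouts are all hypothetical choices made for this example.

```python
import numpy as np


def extract_centers(heatmap, threshold=0.5, window=3):
    """Return (row, col) indices of local maxima above `threshold`.

    Assumes `heatmap` is a 2D array of center scores and `window` is odd.
    Both the threshold and the window size are illustrative values.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    pad = window // 2
    # Pad with -inf so border pixels are compared only against real values.
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Stack every shifted view of the window; the max over them is a max filter.
    shifted = [
        padded[dy:dy + h, dx:dx + w]
        for dy in range(window)
        for dx in range(window)
    ]
    local_max = np.max(shifted, axis=0)
    is_peak = (heatmap == local_max) & (heatmap >= threshold)
    return np.argwhere(is_peak)


def gather_object_vectors(feature_map, centers):
    """Read the per-pixel vector at each detected center.

    Assumes `feature_map` has shape (H, W, D) and `centers` has shape (N, 2).
    In this sketch each D-dimensional vector stands in for whatever a network
    would predict per pixel (for example a shape code plus pose and size).
    """
    return feature_map[centers[:, 0], centers[:, 1]]
```

Because every detected peak indexes directly into the dense output, all objects are decoded from one forward pass without a separate per-box stage, which is the property the abstract attributes to the bounding-box-free design.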
CenterSnap is an anchor-free, single-shot approach that jointly optimizes 3D shape reconstruction and 6D pose and size estimation.
CenterSnap performs effective sim2real transfer using limited real-world examples. Our technique runs at 40 FPS on an NVIDIA Quadro RTX 5000 GPU.
CenterSnap performs accurate 6D pose and size estimation on the NOCS-Real275 dataset.
As a byproduct of our method, we can also reconstruct complete 3D models with textures.
CenterSnap results in real home environments captured with a ZED2 camera. Note that in this case, the model was trained only in simulation.
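As a rough illustration of what a "6D pose and size" output means for a reconstructed shape, the sketch below places a canonical point cloud into the camera frame using a rotation, a translation, and a scale, then measures its axis-aligned extent. It assumes a single scalar scale and a shape expressed in a normalized canonical frame. The function names `canonical_to_camera` and `axis_aligned_size` are hypothetical and do not come from the CenterSnap code.

```python
import numpy as np


def canonical_to_camera(points, rotation, translation, scale):
    """Map canonical-frame points of shape (N, 3) into the camera frame.

    Assumes `rotation` is a 3x3 rotation matrix, `translation` is a
    3-vector, and `scale` is one scalar (an assumption of this sketch).
    Computes scale * R @ p + t for every point p.
    """
    points = np.asarray(points, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    return (scale * points) @ rotation.T + translation


def axis_aligned_size(points):
    """Return the (x, y, z) extent of the tight axis-aligned box."""
    points = np.asarray(points, dtype=float)
    return points.max(axis=0) - points.min(axis=0)
```

Here the rotation and translation account for the six pose degrees of freedom, while the scale recovers metric size from the normalized shape.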
If you find our paper or PyTorch code repository useful, please consider citing:
@inproceedings{irshad2022centersnap,
title = {CenterSnap: Single-Shot Multi-Object 3D Shape Reconstruction and Categorical 6D Pose and Size Estimation},
author = {Muhammad Zubair Irshad and Thomas Kollar and Michael Laskey and Kevin Stone and Zsolt Kira},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
year = {2022}
}