ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization

European Conference on Computer Vision (ECCV), 2022

*Equal Contribution

¹Georgia Tech, ²Toyota Research Institute

ShAPO: A framework for holistic object-centric 3D scene understanding. ShAPO reconstructs the 3D shape and appearance of novel object instances and estimates their 6D pose and size.

Abstract

Our method studies the complex task of holistic object-centric 3D understanding from a single RGB-D observation. Because the problem is ill-posed, existing methods suffer from low performance for both 3D shape and 6D pose estimation in complex multi-object scenarios with occlusions. We present ShAPO, a method for joint multi-object detection, 3D textured reconstruction, and 6D object pose and size estimation.

Key to ShAPO is a single-shot pipeline that regresses shape, appearance and pose latent codes along with the masks of each object instance; these predictions are then further refined in a sparse-to-dense fashion. A novel disentangled shape and appearance code is first learned to embed objects in their respective shape and appearance spaces. We also propose a novel, octree-based differentiable optimization step, allowing us to further improve object shape, pose and appearance simultaneously under the learned latent space, in an analysis-by-synthesis fashion. Our novel joint implicit textured object representation allows us to accurately identify and reconstruct novel unseen objects without having access to their 3D meshes. Through extensive experiments, we show that our method, trained on simulated indoor scenes, accurately regresses the shape, appearance and pose of novel objects in the real world with minimal fine-tuning.
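The octree-based, sparse-to-dense idea described above can be pictured with a minimal sketch. It assumes an analytic sphere signed distance function (SDF) in place of the learned implicit shape decoder, and the names `sphere_sdf` and `octree_surface_cells` are hypothetical, chosen for illustration only; this is not the authors' implementation. Starting from one coarse cell, each level splits the surviving cells into eight children and keeps only those that could still contain the surface.

```python
import numpy as np

def sphere_sdf(points, radius=0.5):
    """Toy signed distance function (assumed stand-in for a learned shape decoder)."""
    return np.linalg.norm(points, axis=1) - radius

def octree_surface_cells(sdf, levels=4, extent=1.0):
    """Coarse-to-fine search for cells of [-extent, extent]^3 near the zero level set."""
    centers = np.zeros((1, 3))   # center of the root cell
    half = extent                # half the edge length of the current cells
    # Unit offsets from a parent center to its eight children.
    offsets = np.array([[x, y, z] for x in (-1.0, 1.0)
                        for y in (-1.0, 1.0) for z in (-1.0, 1.0)])
    for _ in range(levels):
        half /= 2.0
        centers = (centers[:, None, :] + half * offsets[None, :, :]).reshape(-1, 3)
        # A true SDF is 1-Lipschitz, so a cell can contain surface only if
        # |sdf(center)| is at most the center-to-corner distance.
        keep = np.abs(sdf(centers)) <= half * np.sqrt(3.0)
        centers = centers[keep]
    return centers, half

cells, half = octree_surface_cells(sphere_sdf)
```

Because cells far from the surface are discarded early, the number of SDF queries grows with the surface area rather than with the full volume, which is the usual motivation for this kind of sparse-to-dense sampling.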

Our method significantly outperforms all baselines on the NOCS dataset with an 8% absolute improvement in mAP for 6D pose estimation.

Method

ShAPO uses two stages for 3D shape and appearance reconstruction and 6D pose and size estimation. First, a single-shot network predicts 3D shape, pose and size codes along with segmentation masks in a per-pixel manner. Second, test-time optimization jointly refines the shape, pose and size codes given a single-view RGB-D observation of a new instance.
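The second stage can be illustrated with a toy analysis-by-synthesis loop. This is a sketch under strong assumptions: a sphere's radius stands in for the shape code, its center for the translation part of the pose, and an analytic SDF replaces the learned decoder. The function `fit_sphere` and all values are hypothetical. The initial estimate plays the role of the first-stage prediction, and gradient descent on the squared SDF of the observed surface points refines it.

```python
import numpy as np

def fit_sphere(points, center, radius, lr=0.3, steps=2000):
    """Refine (center, radius) by gradient descent on the mean squared SDF of the points."""
    center = np.asarray(center, dtype=float).copy()
    radius = float(radius)
    for _ in range(steps):
        diff = points - center
        dist = np.linalg.norm(diff, axis=1) + 1e-12
        sdf = dist - radius  # zero when a point lies exactly on the surface
        # Analytic gradients of mean(sdf ** 2) with respect to center and radius.
        grad_center = np.mean(2.0 * sdf[:, None] * (-diff / dist[:, None]), axis=0)
        grad_radius = np.mean(-2.0 * sdf)
        center -= lr * grad_center
        radius -= lr * grad_radius
    return center, radius

# Synthetic "single-view" observation: only the half of the sphere facing -z is seen.
rng = np.random.default_rng(0)
true_center, true_radius = np.array([0.2, -0.1, 0.5]), 0.3
normals = rng.normal(size=(500, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
normals[:, 2] = -np.abs(normals[:, 2])
observed = true_center + true_radius * normals

# Assumed coarse first-stage estimate, refined at test time.
init_center = true_center + np.array([0.05, -0.04, 0.03])
center, radius = fit_sphere(observed, init_center, radius=0.4)
```

In the actual method the optimized quantities are latent codes and a full 6D pose rather than a radius and a translation, but the structure is the same: start from the network's prediction and minimize a discrepancy between the decoded object and the observation.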

Reconstruction Results

From left to right, we show an interactive visualization of the 6D poses and 5 reconstructed objects along with their appearance from a single-view RGB-D observation.


Shape and Texture Latent Space Traversal

Joint Shape, Appearance and 6D Pose Optimization

We perform joint shape, appearance and 6D object pose optimization with our technique, given a single view of a scene containing novel objects.

Shape and Texture Reconstruction

Here we show more qualitative results of multi-object shape and appearance reconstruction on real-world novel object instances.

Appearance Optimization

We further compare three ways of obtaining appearance: direct inference, latent code optimization, and fine-tuning of the texture network.
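The difference between these three options can be seen in a small toy problem. The sketch below assumes a linear map `W` as a stand-in for the texture network and a random `target` as the observed colors; both are hypothetical and unrelated to the real model. Inference uses the predicted code as-is, latent optimization updates only the code with the decoder frozen, and fine-tuning updates the decoder weights with the code frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(30, 3))   # toy linear "texture decoder": 3-D code -> 10 RGB values
target = rng.normal(size=30)   # observed colors, generally outside the decoder's range

def error(weights, code):
    """Sum of squared color differences between decoded and observed values."""
    return float(np.sum((weights @ code - target) ** 2))

# 1) Inference only: use the predicted code as-is (here an assumed all-zero guess).
z = np.zeros(3)
err_inference = error(W, z)

# 2) Latent optimization: gradient descent on the code, decoder weights frozen.
step_z = 1.0 / np.linalg.norm(W, 2) ** 2
for _ in range(500):
    z -= step_z * (W.T @ (W @ z - target))
err_latent = error(W, z)

# 3) Fine-tuning: gradient descent on the decoder weights, code frozen.
W_ft = W.copy()
step_w = 0.5 / (z @ z)
for _ in range(100):
    residual = W_ft @ z - target
    W_ft -= step_w * np.outer(residual, z)
err_finetune = error(W_ft, z)
```

Latent optimization can only reach outputs the frozen decoder is able to produce, so some error remains, while fine-tuning can fit this single observation almost exactly. The toy says nothing about how well a fine-tuned network behaves on other views or objects.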

6D pose and size estimation comparison

Our inference-time optimization yields more accurate 6D pose and size estimates than strong baselines.
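For readers unfamiliar with how 6D pose accuracy is usually scored, the following sketch computes the rotation and translation error between a predicted and a ground-truth pose. The helper names `pose_errors` and `rot_z` are hypothetical, and the 10 degree and 5 cm thresholds are assumed here only as an example of a typical accuracy criterion. The sketch ignores the special handling that symmetric object categories normally require.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Return (rotation error in degrees, translation error in the units of t)."""
    # Angle of the relative rotation R_pred^T R_gt, from its trace.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    rot_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans = np.linalg.norm(np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float))
    return float(rot_deg), float(trans)

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Example: prediction is off by 8 degrees about z and 3 cm along z (units: meters).
rot_err, trans_err = pose_errors(rot_z(8.0), [0.0, 0.0, 0.53], np.eye(3), [0.0, 0.0, 0.50])
is_correct = rot_err <= 10.0 and trans_err <= 0.05  # assumed example thresholds
```

A mean average precision figure is then obtained by counting how many detections satisfy such thresholds across a dataset.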

3D only optimization

Here we show shape and texture latent optimization for 3D models starting from a randomized model within a category and optimizing towards a target model.

Zero-shot results on HSR Robot

Our method generalizes to unseen real-world objects within the same category, as shown by qualitative results on images captured by an HSR robot. These results are zero-shot, i.e., no re-training was done.

BibTeX

If you find our paper or PyTorch code repository useful, please consider citing:

@inproceedings{irshad2022shapo,
  title={ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization},
  author={Muhammad Zubair Irshad and Sergey Zakharov and Rares Ambrus and Thomas Kollar and Zsolt Kira and Adrien Gaidon},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022},
  url={https://arxiv.org/abs/2207.13691},
}