Explore the text-to-3D generation results of Prometheus🔥 on various scenes. You can select a scene by clicking the corresponding multi-view image.
In this work, we introduce Prometheus🔥, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it on a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation.
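Concretely, "pixel-aligned" means that every pixel of every view contributes one 3D Gaussian whose center is obtained by unprojecting that pixel's depth into world space. The sketch below is our own illustration of that unprojection step, not the released code; the function name and tensor layout are assumptions.

```python
import torch

def unproject_pixel_aligned_gaussians(depth, K, c2w):
    """Illustrative sketch: lift a per-pixel depth map into pixel-aligned
    3D Gaussian centers (one Gaussian per pixel).

    depth: (H, W) depth map for one view
    K:     (3, 3) camera intrinsics
    c2w:   (4, 4) camera-to-world extrinsics
    returns: (H*W, 3) Gaussian means in world space
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Back-project each pixel to a camera-space ray and scale it by depth.
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3)
    cam = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T).T            # (H*W, 3)
    cam = cam * depth.reshape(-1, 1)
    # Transform camera-space points into world space.
    cam_h = torch.cat([cam, torch.ones(H * W, 1)], dim=-1)          # (H*W, 4)
    world = (c2w @ cam_h.T).T[:, :3]
    return world
```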
Method Overview: Our training process is divided into two stages. In stage 1, we train a GS-VAE. Given multi-view images along with their corresponding pseudo depth maps and camera poses, the GS-VAE encodes these multi-view RGB-D images, fuses cross-view information, and decodes them into pixel-aligned 3DGS. In stage 2, we train an MV-LDM. With the trained MV-LDM, we generate multi-view RGB-D latents by denoising randomly sampled noise.
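At inference time the two stages compose: the MV-LDM samples multi-view RGB-D latents conditioned on the text prompt and cameras, and the GS-VAE decoder turns those latents into pixel-aligned 3D Gaussians. The pseudocode below is a hedged sketch of this flow; all module and method names (e.g. `mv_ldm.denoise_step`, `gs_vae.decode`) are illustrative placeholders, not the actual Prometheus API.

```python
import torch

@torch.no_grad()
def text_to_3d(prompt, cameras, mv_ldm, gs_vae, num_steps=50):
    """Illustrative inference loop; module interfaces are assumed."""
    # Stage 2 at inference: start from Gaussian noise in the multi-view
    # RGB-D latent space and iteratively denoise it, conditioned on the
    # text prompt and the camera poses.
    latents = torch.randn(len(cameras), mv_ldm.latent_channels,
                          mv_ldm.latent_h, mv_ldm.latent_w)
    for t in mv_ldm.scheduler_timesteps(num_steps):
        latents = mv_ldm.denoise_step(latents, t, prompt, cameras)

    # Stage 1 decoder at inference: fuse cross-view information and map
    # the RGB-D latents to pixel-aligned 3D Gaussian parameters
    # (means, scales, rotations, opacities, colors).
    gaussians = gs_vae.decode(latents, cameras)
    return gaussians
```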
Quantitatively: We compare our GS-VAE with baselines for generalizable reconstruction on TartanAir.
Qualitatively: We compare Prometheus🔥 against baselines under settings of varying difficulty. As the overlap between input views decreases, the advantage of our method grows. Moreover, as shown in the depth maps, our method exhibits superior geometry quality across all settings.
Quantitatively: We compare Prometheus🔥 with baselines for text-to-3D generation, using text prompts from T3Bench.
Qualitatively (Object-level): Prometheus🔥 generates objects that align with the given description, incorporating rich background information and intricate details.
Qualitatively (Scene-level): Compared with Director3D, our results align better with the text prompt and capture more details.
@article{yang2024prometheus,
  title={Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation},
  author={Yang, Yuanbo and Shao, Jiahao and Li, Xinyang and Shen, Yujun and Geiger, Andreas and Liao, Yiyi},
  journal={arXiv preprint arXiv:2412.21117},
  year={2024}
}
Acknowledgements: We borrow this template from MonST3R, which in turn originates from DreamBooth. The interactive 3DGS visualization is inspired by Robot-See-Robot-Do and powered by Viser. We sincerely thank Brent Yi for his support in setting up the online visualization.