In this work, we introduce Prometheus🔥, a 3D-aware latent diffusion model that performs text-to-3D generation at both the object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it on a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation.
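To make "pixel-aligned" concrete, here is a minimal sketch of how a per-pixel depth map can be lifted to one 3D Gaussian center per pixel. It assumes a pinhole camera, z-depth along the camera's +z axis, and rays through pixel centers. These are common conventions for pixel-aligned Gaussians, not details confirmed by this page. The function name `unproject_depth` and all of its parameters are hypothetical and are not taken from the Prometheus code.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, ...]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Vec3]:
    """Return one 3D center per pixel, in row-major order.

    depth        -- H x W grid of z-depths (assumed convention)
    fx, fy       -- focal lengths in pixels
    cx, cy       -- principal point in pixels
    cam_to_world -- 3 x 4 matrix [R | t] mapping camera to world coordinates
    """
    centers: List[Vec3] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Ray through the pixel center, scaled so its z-component equals z.
            x_cam = (u + 0.5 - cx) / fx * z
            y_cam = (v + 0.5 - cy) / fy * z
            p_cam = (x_cam, y_cam, z, 1.0)
            # Apply [R | t] to the homogeneous camera-space point.
            world = tuple(
                sum(m * p for m, p in zip(mat_row, p_cam))
                for mat_row in cam_to_world
            )
            centers.append(world)
    return centers


# Toy 2 x 2 depth map, identity rotation, camera shifted 1 unit along world x.
depth_map = [[2.0, 2.0], [4.0, 4.0]]
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
centers = unproject_depth(depth_map, fx=1.0, fy=1.0, cx=1.0, cy=1.0, cam_to_world=pose)
```

In a full pipeline, each center would be paired with per-pixel appearance and shape attributes (color, opacity, scale, rotation) to form a complete Gaussian. The sketch covers only the geometric step, which is the part a depth channel most directly determines.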
Quantitatively: We compare our GS-VAE with baselines for generalizable reconstruction on TartanAir.
Qualitatively: We compare Prometheus🔥 against baselines under varying difficulty settings. As the overlap between input views decreases, the advantage of our method grows. Moreover, as the depth maps show, our method produces better geometry across all settings.
Quantitatively: We compare Prometheus🔥 with baselines for text-to-3D generation using text prompts from T3Bench.
Qualitatively (Object-level): Prometheus🔥 generates objects that align with the given description, incorporating rich background information and intricate details.
Qualitatively (Scene-level): Compared with Director3D, our results align better with the text prompt and capture more detail.
@article{yang2024prometheus,
title={Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation},
author={Yang, Yuanbo and Shao, Jiahao and Li, Xinyang and Shen, Yujun and Geiger, Andreas and Liao, Yiyi},
journal={arXiv preprint arXiv:2412.21117},
year={2024}
}
Acknowledgements: We borrow this template from MonST3R, which was originally adapted from DreamBooth. The interactive 3DGS visualization is inspired by Robot-See-Robot-Do and powered by Viser. We sincerely thank Brent Yi for his support in setting up the online visualization.