Prometheus: 3D-Aware Latent Diffusion Models for
Feed-Forward Text-to-3D Scene Generation

Yuanbo Yang1,* Jiahao Shao1,* Xinyang Li2 Yujun Shen3 Andreas Geiger4 Yiyi Liao1,+
1 Zhejiang University 2 Xiamen University 3 Ant Group 4 University of Tübingen (+: corresponding author, *: equal contribution)



We present a novel method for feed-forward scene-level 3D generation. At its core, our approach harnesses the power of 2D priors to fuel generalizable and efficient 3D synthesis – hence our name, Prometheus🔥.

[Paper]      [Interactive Results🔥]      [Code]     [Video]     [BibTeX]

Interactive 3D Scene Visualization

Explore the text-to-3D generation results of Prometheus🔥 on various scenes. You can select a scene by clicking the corresponding multiview image.

A bridge spans across the width of the image, connecting two sides separated by water⛰️
A large, white luxury yacht floats serenely above the calm waters of an ocean 🛥️
A desert landscape is visible in the image, with houses scattered across it. The terrain is flat and covered with sparse grass 🏜️
A bustling coral reef, alive with the vibrant dance of fish and swaying plants🪸
A cozy living room with warm lighting has a wooden floor and features two armchairs centered on a plush rug 🪑
A gray-roofed house with a garage sits amidst green grass and scattered trees🏡
A blooming potted orchid with purple flowers🌷
A campfire 🔥

Interactive 3D Scene

Left Click Drag with left click to rotate view
Scroll Wheel Scroll to zoom in/out
Right Click Drag with right click to move view
W S Move forward and backward
A D Move left and right
Q E Move up and down

Abstract

In this work, we introduce Prometheus🔥, a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds. We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm. To ensure generalizability, we build our model upon a pre-trained text-to-image generation model with only minimal adjustments, and further train it using a large number of images from both single-view and multi-view datasets. Furthermore, we introduce an RGB-D latent space into 3D Gaussian generation to disentangle appearance and geometry information, enabling efficient feed-forward generation of 3D Gaussians with better fidelity and geometry. Extensive experimental results demonstrate the effectiveness of our method in both feed-forward 3D Gaussian reconstruction and text-to-3D generation.

Method

Method Overview: Our training process is divided into two stages. In stage 1, our objective is to train a GS-VAE. Utilizing multi-view images along with their corresponding pseudo depth maps and camera poses, our GS-VAE is designed to encode these multi-view RGB-D images, integrate cross-view information, and ultimately decode them into pixel-aligned 3DGS. In stage 2, we focus on training an MV-LDM. At inference, we generate multi-view RGB-D latents by denoising randomly sampled noise with the trained MV-LDM.
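The two-stage inference pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the released implementation: all function names, tensor shapes, and channel counts are hypothetical placeholders, and the denoiser and decoder bodies are stubs standing in for the trained MV-LDM and GS-VAE.

```python
import numpy as np

# Hypothetical dimensions -- chosen for illustration, not the paper's config.
N_VIEWS, H, W = 4, 8, 8   # number of views; latent height/width
C_LAT = 4                 # RGB-D latent channels
C_GS = 12                 # per-pixel Gaussian parameters (position, scale, rotation, opacity, color)

def mv_ldm_denoise(latents, text_embedding, step):
    """Stub for one MV-LDM denoising step (placeholder dynamics)."""
    return latents * 0.9  # the real model predicts and removes noise conditioned on text

def gs_vae_decode(latents, cameras):
    """Stub GS-VAE decoder: maps multi-view RGB-D latents to pixel-aligned 3D Gaussians."""
    n, c, h, w = latents.shape
    return np.zeros((n, h, w, C_GS))  # one Gaussian per latent pixel per view

def text_to_3d(text_embedding, cameras, num_steps=10):
    # Stage 2 at inference: start from random noise in the RGB-D latent space
    latents = np.random.randn(N_VIEWS, C_LAT, H, W)
    for t in range(num_steps):
        latents = mv_ldm_denoise(latents, text_embedding, t)
    # Stage 1 decoder: turn the denoised multi-view latents into pixel-aligned 3DGS
    gaussians = gs_vae_decode(latents, cameras)
    return gaussians.reshape(-1, C_GS)  # merge all views into one Gaussian set

gaussians = text_to_3d(text_embedding=None, cameras=None)
print(gaussians.shape)  # (256, 12): one Gaussian per pixel across all views
```

The key design point this sketch reflects is that generation is feed-forward: a fixed number of latent denoising steps followed by a single decoder pass, with no per-scene optimization loop.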

Stage 1 Results: Feed-Forward 3D Reconstruction

Quantitatively: We compare our GS-VAE with baselines for generalizable reconstruction on TartanAir.

Qualitatively: We compare Prometheus🔥 against baselines under varying difficulty settings. As overlap gradually decreases, the advantages of our method continue to grow. Moreover, as shown in the depth map, our method exhibits superior geometry quality across all settings.

Stage 2 Results: Text-to-3D Generation

Quantitatively: We compare Prometheus🔥 with baselines for text-to-3D generation utilizing text prompts from T3Bench.

Qualitatively (Object-level): Prometheus🔥 generates objects that align with the given description, incorporating rich background information and intricate details.

Qualitatively (Scene-level): Compared with Director3D, our results align better with the text prompt and capture more details.

More Text-to-3D Results: Here🔥

BibTeX

@article{yang2024prometheus,
  title={Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation},
  author={Yang, Yuanbo and Shao, Jiahao and Li, Xinyang and Shen, Yujun and Geiger, Andreas and Liao, Yiyi},
  journal={arXiv preprint arXiv:2412.21117},
  year={2024}
}

Acknowledgements: We borrow this template from Monst3R, which is originally from DreamBooth. The interactive 3DGS visualization is inspired by Robot-See-Robot-Do, and powered by Viser. We sincerely thank Brent Yi for his support in setting up the online visualization.