UrbanGen: Urban Generation with Compositional and Controllable Neural Fields

Zhejiang University · Ant Group · University of Tübingen

Overview. In this paper, we propose UrbanGen, a solution to the challenging task of generating urban radiance fields with photorealistic rendering, accurate geometry, high controllability, and diverse city styles. Our key idea is to leverage a coarse 3D panoptic prior, represented by a semantic voxel grid for stuff and bounding boxes for countable objects, to condition a compositional generative radiance field. This panoptic prior simplifies the task of learning complex urban geometry, enables the disentanglement of stuff and objects, and provides versatile control over both. Moreover, by combining semantic and geometry losses with adversarial training, our method faithfully adheres to the input conditions, allowing semantic and depth maps to be rendered jointly with RGB images. In addition, we collect a unified dataset with images and their panoptic priors in the same format from three diverse real-world datasets, KITTI-360, nuScenes, and Waymo, and train a city-style-aware model on this data. Our systematic study shows that UrbanGen outperforms state-of-the-art generative radiance field baselines in terms of image fidelity and geometry accuracy for urban scene generation. Furthermore, UrbanGen offers a new set of controllability features, including large camera movements, stuff editing, and city style control.
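To make the conditioning concrete, below is a minimal sketch of how such a coarse 3D panoptic prior (a semantic voxel grid for stuff plus oriented bounding boxes for countable objects) might be represented. The class and field names (PanopticPrior, ObjectLayout, world_to_voxel) are illustrative assumptions, not the actual data structures used by UrbanGen.

```python
# Illustrative sketch of a coarse 3D panoptic prior: a semantic voxel grid
# for stuff plus oriented bounding boxes for countable objects.
# All names here are assumptions for exposition, not UrbanGen's real code.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class ObjectLayout:
    """Oriented 3D bounding box for one countable object (e.g. a car)."""
    center: np.ndarray   # (3,) box center in world coordinates
    size: np.ndarray     # (3,) box extents (length, width, height) in meters
    yaw: float           # rotation around the up-axis, in radians
    semantic_class: int  # class index, e.g. "car"


@dataclass
class PanopticPrior:
    """Coarse scene prior: semantic voxels for stuff + boxes for objects."""
    voxel_labels: np.ndarray     # (X, Y, Z) integer semantic labels for stuff
    voxel_origin: np.ndarray     # (3,) world position of voxel (0, 0, 0)
    voxel_size: float            # edge length of one voxel in meters
    objects: List[ObjectLayout]  # K object bounding boxes

    def world_to_voxel(self, xyz: np.ndarray) -> np.ndarray:
        """Map world-space points of shape (N, 3) to fractional voxel coords."""
        return (xyz - self.voxel_origin) / self.voxel_size
```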

Method


Method. We utilize a panoptic prior, in the form of semantic voxel grids and instance object layouts, to construct a generative urban radiance field. Our model accepts as input a global noise vector $\mathbf{z}_{wld}$ for the entire scene, $K$ noise vectors $\{\mathbf{z}^k_{obj}\}_{k=1}^K$ for objects, a scene domain class $\mathbf{c} \sim p_{\mathcal{C}}$, and a sampled panoptic prior $\mathbf{V},\mathbf{O} \sim p_{\mathcal{V},\mathcal{O}}$. We decompose the scene into background, stuff, and objects. The stuff generator is conditioned on the semantic voxel grid $\mathbf{V}$ to preserve its semantic and geometric information, while objects are generated in the canonical object coordinate system guided by $\mathbf{O}$. Combined with the background generator, a feature map $\hat{\mathbf{I}}_\mathbf{F}$, a depth map $\hat{\mathbf{I}}_{D}$, and a semantic map $\hat{\mathbf{I}}_{\mathbf{L}}$ are obtained through volume rendering. We further employ neural rendering to produce the RGB image $\hat{\mathbf{I}}$ and object patches $\hat{\mathbf{P}}_k$. The entire model is optimized jointly with adversarial losses $\mathcal{L}_{adv}^{\mathbf{I}}$ and $\mathcal{L}_{adv}^{\mathbf{P}}$ applied to the full image and object patches, respectively, a geometry loss $\mathcal{L}_{geo}$ for improved underlying geometry, and a semantic loss $\mathcal{L}_{seg}$ for alignment between the rendered appearance and semantics.
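For illustration, the following is a hedged sketch of how the generator objective described above could be assembled from the rendered outputs. The choice of a non-saturating GAN loss, the L1 and cross-entropy forms of $\mathcal{L}_{geo}$ and $\mathcal{L}_{seg}$, and the loss weights are assumptions for exposition, not necessarily the paper's exact formulation.

```python
# Sketch of the joint training objective: adversarial losses on full images
# and object patches, plus geometry and semantic supervision from the prior.
# Loss forms and weights are assumed, not taken from the paper.
import torch
import torch.nn.functional as F


def nonsaturating_gan_loss(fake_logits: torch.Tensor) -> torch.Tensor:
    """Non-saturating generator loss (assumed; the paper's GAN loss may differ)."""
    return F.softplus(-fake_logits).mean()


def total_generator_loss(
    img_logits: torch.Tensor,      # discriminator scores for full rendered images
    patch_logits: torch.Tensor,    # discriminator scores for rendered object patches
    rendered_depth: torch.Tensor,  # \hat{I}_D from volume rendering, shape (B, H, W)
    prior_depth: torch.Tensor,     # coarse depth derived from the panoptic prior
    rendered_sem: torch.Tensor,    # \hat{I}_L semantic logits, shape (B, C, H, W)
    prior_sem: torch.Tensor,       # semantic labels from the prior, shape (B, H, W)
    w_geo: float = 1.0,            # illustrative weight for L_geo
    w_seg: float = 1.0,            # illustrative weight for L_seg
) -> torch.Tensor:
    loss_adv_img = nonsaturating_gan_loss(img_logits)      # L_adv^I
    loss_adv_patch = nonsaturating_gan_loss(patch_logits)  # L_adv^P
    loss_geo = F.l1_loss(rendered_depth, prior_depth)      # L_geo (assumed L1)
    loss_seg = F.cross_entropy(rendered_sem, prior_sem)    # L_seg (assumed CE)
    return loss_adv_img + loss_adv_patch + w_geo * loss_geo + w_seg * loss_seg
```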

Object Control: KITTI-360, Waymo/nuScenes

Camera Control: KITTI-360, Waymo/nuScenes

Style Interpolation: KITTI-360, Waymo/nuScenes

City-Style Transfer

Comparison with Baselines: EG3D, DiscoScene, CC3D, UrbanGen