SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting
Abstract
Task:
Text-driven 3D scene generation
Problems with existing methods:
However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. This independent generation for each view often results in spatial inconsistency and implausibility in the 3D scenes.
Proposed method:
Our proposed method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images.
multi-view panoramic images?
Method details:
Specifically, SceneDreamer360 enhances the fine-tuned PanFusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images.
During the 3D scene construction, a novel point cloud fusion initialization method is used, producing higher quality and spatially consistent point clouds.
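The paper's fusion algorithm isn't detailed in this note; as a rough sketch of the usual first step, each equirectangular pixel can be back-projected with an estimated depth to seed the point cloud (the coordinate convention below is an assumption, not the paper's):

```python
import numpy as np

def panorama_to_points(depth):
    """Back-project an equirectangular depth map (H, W) to 3D points.

    Each pixel (u, v) maps to spherical angles: azimuth theta in
    [-pi, pi) and elevation phi in [-pi/2, pi/2].
    """
    H, W = depth.shape
    u = (np.arange(W) + 0.5) / W          # horizontal coordinate in [0, 1)
    v = (np.arange(H) + 0.5) / H          # vertical coordinate in [0, 1)
    theta = (u - 0.5) * 2.0 * np.pi       # azimuth
    phi = (0.5 - v) * np.pi               # elevation
    theta, phi = np.meshgrid(theta, phi)  # each (H, W)

    # Unit ray directions on the sphere, scaled by per-pixel depth.
    x = np.cos(phi) * np.sin(theta)
    y = np.sin(phi)
    z = np.cos(phi) * np.cos(theta)
    pts = depth[..., None] * np.stack([x, y, z], axis=-1)  # (H, W, 3)
    return pts.reshape(-1, 3)
```

With depth from a monocular estimator, fusing points from several such views then reduces to transforming each set by its camera pose and concatenating.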
Experimental validation:
Our extensive experiments demonstrate that compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher quality, spatially consistent, and visually appealing 3D scenes from any text prompt.
Intro
Task background:
The scarcity of annotated 3D text-point cloud datasets poses substantial challenges to training 3D point cloud models directly from user queries, particularly for complex 3D scenes. This difficulty primarily stems from the high costs associated with acquiring and annotating 3D data, in addition to the precision required for modeling detailed 3D environments.
training 3D point cloud models?
To address these limitations, existing approaches frequently extend advanced 2D generative models by leveraging 2D object priors as a bridge to 3D neural radiance fields.
For instance, CLIP-NeRF [1] uses the CLIP model [2] to control the shape and appearance of 3D objects based on textual prompts or images. Similarly, Dream Fields [3] integrates neural rendering with image and text representations to produce diverse 3D objects from natural language descriptions. Dreamfusion [4] further advances this approach by optimizing NeRF without the need for 3D training data, instead relying solely on a pre-trained 2D text-to-image diffusion model for text-to-3D synthesis.
Problems with existing methods:
Despite significant advancements in NeRF-based point cloud generation methods, achieving consistent, fine-grained point cloud details remains a challenge.
For example, the Text2Room method [5] often produces point clouds that are discontinuous, lack detailed features, and require prolonged rendering times.
Similarly, LucidDreamer [6] generates panoramic point clouds with numerous inconsistencies, resulting in incomplete and fragmented panoramas.
PanFusion [7] also struggles with generating high-quality panoramic images from complex, extended text inputs, frequently yielding blurred areas and deformed objects. Moreover, PanFusion's panoramic images are relatively low-resolution (512×1024), leading to a reduced spatial-to-pixel point ratio, which further contributes to blurred, suboptimal point cloud rendering.
To address these challenges, we propose incorporating 3D Gaussian Splatting (3DGS) [8] for multi-scene generation, specifically aimed at producing more finely detailed and consistent complex scene point clouds. By employing a Gaussian distribution for scene modeling, 3DGS enables more precise representation of complex structures and improves reconstruction accuracy. Additionally, 3DGS’s optimized rendering algorithm significantly reduces rendering times while preserving high quality, resulting in faster and more efficient generation of detailed point clouds.
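For context, the core of 3DGS rendering is front-to-back alpha compositing of depth-sorted, projected Gaussians at each pixel; a minimal single-pixel sketch of that blending (the colors and alphas here are illustrative inputs, assumed already sorted near to far):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing as in 3DGS rasterization.

    colors: (N, 3) per-Gaussian colors, sorted near to far.
    alphas: (N,) per-Gaussian opacities after the 2D Gaussian falloff.
    Returns the pixel color C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
    """
    C = np.zeros(3)
    T = 1.0  # accumulated transmittance
    for c, a in zip(colors, alphas):
        C += T * a * np.asarray(c, dtype=float)
        T *= (1.0 - a)
        if T < 1e-4:  # early termination once the pixel is saturated
            break
    return C
```

The early-termination check is one reason the 3DGS rasterizer is fast: fully occluded Gaussians behind an opaque surface are skipped entirely.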
This paper's approach:
In this paper, we introduce SceneDreamer360, a novel framework for text-driven 3D-consistent scene generation using panoramic Gaussian splatting (3DGS). Our method operates in two stages, as illustrated in Fig. 1: first, we enhance panoramic images to establish 3D consistency priors, and second, we apply 3DGS to reconstruct high-quality, text-aligned point clouds. This dual-stage approach ensures both accurate 3D structure and enhanced visual quality and realism in the generated scenes.

During the panorama generation and enhancement stage, existing techniques frequently rely on iterative, progressive scene rendering, which can lead to consistency issues across successive renderings. To address this challenge, we enhance the PanFusion model [7] by integrating a multi-layer perceptron (MLP) [9] and a LoRA layer [10] into its final processing stage. These additions, combined with training on the Habitat Matterport Dataset [11], produce a model checkpoint fine-tuned to the specific demands of panoramic image generation. This enhancement step produces high-quality panoramas, forming a solid foundation for subsequent point cloud rendering.

Further, we upscale and refine the generated panoramas using ControlNet (CN) [12] and RealESRGAN [13] techniques, achieving resolutions up to 6K. This upscaling process preserves fine details and enhances image fidelity, which is crucial for realistic point cloud rendering. High-resolution panoramas contribute to more detailed and visually appealing scenes, which are important for achieving photorealism in the final output.

To optimize the point cloud rendering process, we introduce a novel point cloud initialization method that improves 3D consistency and reduces rendering time. This step ensures better alignment of the resulting point clouds with the panoramic views, creating a more cohesive and immersive 3D experience.
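The exact placement of the MLP and LoRA layer inside PanFusion isn't specified in this note; as a generic sketch, a LoRA adapter wraps a frozen base weight with a low-rank trainable update (the rank and scaling values below are illustrative, not the paper's):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a low-rank trainable update.

    Forward pass: y = x W^T + (alpha / r) * (x A^T) B^T,
    where only A (r x in) and B (out x r) are trained.
    """
    def __init__(self, W, r=8, alpha=16.0, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = W  # (out, in), kept frozen during fine-tuning
        self.A = rng.standard_normal((r, W.shape[1])) * 0.01
        self.B = np.zeros((W.shape[0], r))  # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the adapted layer initially reproduces the pretrained output exactly, which is what makes LoRA fine-tuning stable on a model like PanFusion.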
Additionally, the robust 3D representation capabilities of 3DGS enable the creation of complete, high-quality point cloud images that are consistent with the input text prompts. Our contributions can be summarised as follows:
Related Work about 3D Scene Generation
The field of 3D scene generation has evolved significantly, drawing inspiration from various breakthroughs in image generation techniques. Early approaches leveraged Generative Adversarial Networks (GANs) [14], attempting to create multi-view consistent images [15], [16] or directly generate 3D representations like voxels [17], [18] and point clouds [19], [20]. However, these methods were hindered by GANs' inherent training instability and the memory constraints of 3D representations, limiting the quality of generated scenes.

The advent of diffusion models [21], [22], [23] and their success in image generation [24], [25] sparked a new wave of research in 3D scene generation. Researchers began applying diffusion models to various 3D representations, including voxels [26], point clouds [27], and implicit neural networks [28]. While these approaches showed promise, they often focused on simple, object-centric examples. To address more complex scenarios, some generative diffusion models employed meshes as proxies, diffusing in UV space. This approach enabled the creation of large portrait scenes through continuous mesh building [29], as well as the generation of indoor scenes [30], [31] and more realistic objects [32].

Currently, an increasing number of studies [6], [33], [34], [35] have incorporated 3DGS [8] as a 3D representation to complement diffusion models in generating more complex 3D scenes. For instance, LucidDreamer [6] pioneered the integration of 3D Gaussian Splatting into scene generation, demonstrating the efficacy of 3DGS. RealmDreamer [36] optimizes a 3D Gaussian Splatting representation to align with complex text prompts. However, because diffusion models generate one viewpoint at a time before merging and lifting to 3D space, the resulting 3D scenes often lack realism and consistency.
To address this issue, some recent methods such as DreamScene360 [33] and FastScene [34] have introduced panoramic images to generate more coherent 3D scenes. Nevertheless, these approaches still produce scenes lacking in detail and quality, primarily due to limitations in panoramic image generation models and view fusion algorithms. In contrast, to address the issues of quality and spatial consistency, our method leverages the promising 3D Gaussian Splatting technique and implements a three-stage enhancement process for the generated panoramic images. Concurrent work HoloDreamer [37] has explored similar concepts. However, our approach is distinguished by an innovative design in the panoramic image enhancement stage and a novel point cloud fusion algorithm. Our approach improves the overall quality and spatial coherence of the generated 3D scenes, advancing the state-of-the-art in text-driven 3D scene generation.
Thoughts
- A panorama can be understood as an image captured from a single fixed position with a 360° horizontal and 90° vertical FoV. The NFoV views sampled from a panorama arguably cannot be called multi-view images, since the baseline between them is zero.
- Opening the intro from the background of 3D point cloud models is not a great fit.
- 3D Gaussians are not an upgraded version of 3D point clouds. 3D Gaussians are only suited to modeling appearance; their geometric modeling is less accurate than that of 3D point clouds, whereas point clouds obtained via depth estimation are typically geometrically accurate but sparse.
- The main contribution lies in the first-stage optimizations, which are practical but not particularly novel.
- With the same input and output representations, this paper's framework is close to DreamScene360.
- The 3D scene generation part of the related work is not comprehensive.
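The zero-baseline point above can be made concrete: NFoV views sampled from one panorama all share a single camera center, so every "view" is just a re-projection of panorama pixels rather than a new observation. A minimal sketch of mapping perspective ray directions back to equirectangular coordinates (the coordinate convention is assumed):

```python
import numpy as np

def ray_to_pano_uv(dirs):
    """Map ray directions (..., 3) to equirectangular (u, v) in [0, 1].

    All rays from an NFoV crop share one camera center, so sampling
    only reads panorama pixels; there is no parallax between crops.
    """
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    theta = np.arctan2(x, z)                            # azimuth
    phi = np.arcsin(y / np.linalg.norm(dirs, axis=-1))  # elevation
    u = theta / (2.0 * np.pi) + 0.5
    v = 0.5 - phi / np.pi
    return u, v
```

Since this mapping involves no translation term, two NFoV crops differ only by rotation, which is why they carry no stereo (multi-view) information.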
References