HDSL: Hierarchical Domain-Specific Language and LLM Agent-Based 3D Indoor Scene Generation and Local Editing

This paper addresses the challenge of precisely localizing local geometric structures in text-driven 3D indoor scene generation and editing, where existing large language model (LLM) systems rely on scene graphs or global constraint lists that lack fine-grained spatial specificity. The authors propose the Hierarchical Description Scene Language (HDSL), an XML/CSS-inspired domain-specific language that represents rooms, areas, objects, and supporting surfaces as a tree with local coordinates, which simplifies both recursive planning and edit retrieval. They build an LLM agent-based pipeline that generates HDSL subtrees under bounded verification, grounds non-fictional nodes via multimodal asset retrieval, and resolves collisions using force-directed layout optimization. For editing, the proposed Hierarchical Retrieval-Augmented Generation (HRAG) technique retrieves only the relevant subtrees for localized rewriting and integrates the results through deterministic three-way merging. Experiments show that HDSL outperforms full text-to-scene baselines in object coverage, text-scene alignment, and generation time, while matching state-of-the-art layout methods on geometric metrics. During editing, HRAG reduces token consumption by a factor of 5.22 and runtime by a factor of 6.19 while preserving unrelated scene objects.

Background and Context

The intersection of natural language processing and computer graphics has recently focused on using natural language instructions to drive the generation and editing of 3D indoor scenes. While this capability promises to democratize 3D content creation, a significant technical bottleneck remains: the lack of an intermediate representation that is both efficiently generatable by Large Language Models (LLMs) and sufficiently precise for localized modifications. Current systems predominantly rely on scene graphs or global constraint lists as their structural backbone. Although these representations are compact, they frequently lack the fine-grained spatial specificity required to describe local geometric details accurately. Consequently, when users issue instruction-based edits, the system struggles to pinpoint specific regions or objects, often resulting in erroneous global changes where a minor adjustment triggers unintended alterations across the entire scene.

To address these limitations, the paper reframes scene construction as a task of structured program generation and local program repair. This reframing motivates the Hierarchical Description Scene Language (HDSL), a domain-specific language inspired by the design of XML and CSS. HDSL is engineered for structured 3D indoor environments and offers a hierarchical, semantically clear framework. By organizing indoor spatial planning into recursively processable units, HDSL also serves as an index for subsequent local editing. This approach keeps the generative flexibility of LLMs while improving control over geometric structure, and it mitigates the "ripple effect" common in earlier global reconstruction methods, where one small edit propagates across the whole scene.

Deep Analysis

At the core of the HDSL framework is its ability to model rooms, functional areas, specific objects, and supporting surfaces as a tree structure enriched with local coordinate information. This hierarchical topology allows for a fine-grained description of scene geometry, moving beyond simple object lists to a spatially aware tree. The generation pipeline is orchestrated by multiple collaborating LLM agents. Initially, these agents generate HDSL subtrees, employing a bounded verification mechanism to check that both syntax and logical constraints are satisfied. This step is critical for preventing the hallucinations and structural inconsistencies that often plague unconstrained LLM outputs in complex spatial tasks.
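The idea of a tree with local coordinates can be illustrated with a minimal sketch. The class name `Node`, its fields, and the example scene are assumptions made for illustration and are not the actual HDSL syntax; the sketch only shows how a position stored relative to a parent resolves to a world position by walking up the tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    """One node of a hypothetical HDSL-like tree (illustrative names only)."""
    kind: str                        # "room", "area", "object", or "surface"
    name: str
    offset: Vec3 = (0.0, 0.0, 0.0)   # position relative to the parent node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child and return it so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def world_position(self) -> Vec3:
        """Sum the local offsets from this node up to the root."""
        x, y, z = self.offset
        node = self.parent
        while node is not None:
            x, y, z = x + node.offset[0], y + node.offset[1], z + node.offset[2]
            node = node.parent
        return (x, y, z)

    def path(self) -> str:
        """Slash-separated path from the root, usable as a retrieval key."""
        parts = []
        node: Optional["Node"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def walk(self) -> Iterator["Node"]:
        """Depth-first traversal of this subtree."""
        yield self
        for child in self.children:
            yield from child.walk()


# Example scene: room -> area -> object -> supporting surface.
room = Node("room", "bedroom")
area = room.add(Node("area", "sleeping_area", offset=(1.0, 0.0, 2.0)))
bed = area.add(Node("object", "bed", offset=(0.5, 0.0, 0.5)))
top = bed.add(Node("surface", "bed_top", offset=(0.0, 0.45, 0.0)))

print(top.path())            # bedroom/sleeping_area/bed/bed_top
print(top.world_position())  # (1.5, 0.45, 2.5)
```

Because each offset is local, moving `sleeping_area` moves the bed and everything on it without touching their own entries, which is the property that makes subtree-level edits cheap.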

Following the structural generation, the pipeline addresses the grounding of abstract descriptions into concrete 3D assets. For non-fictional nodes within the HDSL tree, the system utilizes multimodal asset retrieval to map textual descriptors to specific 3D model resources. This ensures that the generated scene is not only structurally sound but also visually consistent with the user's intent. To handle physical plausibility, the pipeline incorporates a force-directed layout optimization algorithm. This component automatically detects and resolves boundary conflicts or object collisions, ensuring that the final scene adheres to basic physical rules without requiring manual intervention from the user.
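A force-directed collision pass can be sketched as follows. This is a deliberately simplified stand-in, not the paper's optimizer: it assumes circular 2D footprints, a rectangular room, and a pure repulsion rule, and the function name `resolve_collisions` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def resolve_collisions(
    centers: Sequence[Tuple[float, float]],
    radii: Sequence[float],
    room: Tuple[float, float],
    max_iters: int = 200,
) -> Tuple[List[Tuple[float, float]], bool]:
    """Push overlapping circular footprints apart inside a width x depth room.

    Returns the adjusted centers and a flag that is True when no overlap
    remained before max_iters was reached.
    """
    pts = [list(c) for c in centers]
    width, depth = room
    for _ in range(max_iters):
        moved = False
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dx = pts[j][0] - pts[i][0]
                dy = pts[j][1] - pts[i][1]
                dist = math.hypot(dx, dy)
                overlap = radii[i] + radii[j] - dist
                if overlap <= 1e-9:
                    continue
                if dist < 1e-9:
                    # Coincident centers: pick a fixed direction to separate them.
                    dx, dy, dist = 1.0, 0.0, 1.0
                # Each object moves half the overlap along the center line.
                push = overlap / 2.0
                ux, uy = dx / dist, dy / dist
                pts[i][0] -= ux * push
                pts[i][1] -= uy * push
                pts[j][0] += ux * push
                pts[j][1] += uy * push
                moved = True
        # Keep every footprint inside the room boundary.
        for p, r in zip(pts, radii):
            p[0] = min(max(p[0], r), width - r)
            p[1] = min(max(p[1], r), depth - r)
        if not moved:
            return [(p[0], p[1]) for p in pts], True
    return [(p[0], p[1]) for p in pts], False


# Two overlapping objects in a 5 m x 5 m room.
layout, ok = resolve_collisions([(1.0, 1.0), (1.2, 1.0)], [0.5, 0.5], (5.0, 5.0))
print(ok, layout)
```

The iteration cap keeps the pass bounded: if the room is too crowded for a collision-free layout, the function reports failure instead of looping forever.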

The editing capabilities of HDSL are powered by a newly proposed technique called Hierarchical Retrieval-Augmented Generation (HRAG). When a user submits a modification instruction, the system does not regenerate the entire scene. Instead, HRAG precisely retrieves the specific HDSL subtrees affected by the change. The LLM is then guided to rewrite only within this localized context, drastically reducing the computational overhead associated with full-scene regeneration. The modified subtree is subsequently integrated back into the original scene structure using a deterministic three-way merging algorithm. This method ensures the atomicity of the edit while preserving the stability of unrelated scene components, effectively isolating changes to their relevant spatial domains.
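The retrieve, rewrite, and merge flow can be sketched under some loud assumptions: the scene is flattened to a dictionary keyed by node path, retrieval is a plain path-prefix lookup standing in for HRAG's actual retrieval, and the edited subtree is hard-coded where an LLM rewrite would be. The helper names `retrieve_subtree` and `three_way_merge` are hypothetical. The merge itself follows the standard three-way rule: take whichever side changed relative to the base, and report a conflict when both changed differently.

```python
from typing import Any, Dict, List, Tuple

Scene = Dict[str, Dict[str, Any]]
_MISSING = object()  # sentinel for "node does not exist in this version"


def retrieve_subtree(scene: Scene, root_path: str) -> Scene:
    """Copy the node at root_path and all of its descendants."""
    prefix = root_path + "/"
    return {
        path: dict(attrs)
        for path, attrs in scene.items()
        if path == root_path or path.startswith(prefix)
    }


def three_way_merge(base: Scene, ours: Scene, theirs: Scene) -> Tuple[Scene, List[str]]:
    """Merge per node path; keys are visited in sorted order for determinism."""
    merged: Scene = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t or t == b:
            result = o            # both agree, or only "ours" changed
        elif o == b:
            result = t            # only "theirs" (the local edit) changed
        else:
            conflicts.append(key)  # both changed differently: keep ours, report
            result = o
        if result is not _MISSING:
            merged[key] = result
    return merged, conflicts


scene: Scene = {
    "bedroom": {"size": (4.0, 3.0)},
    "bedroom/bed": {"pos": (0.5, 0.0, 0.5)},
    "bedroom/desk": {"pos": (3.0, 0.0, 0.5)},
    "bedroom/desk/lamp": {"pos": (0.1, 0.75, 0.1)},
}

# 1. Retrieve only the subtree the instruction refers to.
base = retrieve_subtree(scene, "bedroom/desk")

# 2. Stand-in for the LLM rewrite: move the lamp and add a monitor.
edited = {path: dict(attrs) for path, attrs in base.items()}
edited["bedroom/desk/lamp"]["pos"] = (0.6, 0.75, 0.1)
edited["bedroom/desk/monitor"] = {"pos": (0.3, 0.75, 0.2)}

# 3. Merge the edit back into the full scene.
merged, conflicts = three_way_merge(base, scene, edited)
print(conflicts)              # []
print(merged["bedroom/bed"])  # unchanged: {'pos': (0.5, 0.0, 0.5)}
```

Nodes outside the retrieved subtree are absent from both `base` and `edited`, so the merge rule always keeps the current scene's version of them; that is how unrelated objects are preserved in this sketch.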

Industry Impact

Empirical evaluations conducted on reproduced benchmarks indicate that HDSL offers substantial improvements over existing methodologies. In generation tasks, HDSL outperforms full text-to-scene baselines on average object coverage, text-scene alignment, and generation time. These results suggest that the hierarchical structure not only aids editing but also improves initial creation by giving the LLM a more organized scaffold to populate. On hard geometric fidelity metrics, HDSL remains competitive with state-of-the-art layout-only reproduction methods, suggesting that the added semantic richness does not come at the cost of geometric quality.

The efficiency gains in the editing phase are particularly noteworthy for industrial applications. The experiments report that the HRAG mechanism reduces token consumption by a factor of 5.22 and runtime by a factor of 6.19 compared to full-regeneration approaches. This improvement translates directly into faster interaction response times and brings iterative, interactive design closer to practical use. In a series of eight paired editing tests, HDSL consistently generated valid domain-specific language code. Crucially, it preserved the state of unrelated objects in the scene, avoiding the accidental modifications that are commonplace in methods relying on global reconstruction.
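To put the reported factors in percentage terms, the following worked arithmetic assumes that each factor is the ratio of the full-regeneration cost to the HRAG cost; the symbols $T$ (tokens) and $R$ (runtime) are introduced here for illustration only.

```latex
\frac{T_{\text{HRAG}}}{T_{\text{full}}} = \frac{1}{5.22} \approx 0.192
\quad\Rightarrow\quad \text{about } 80.8\% \text{ fewer tokens},
\qquad
\frac{R_{\text{HRAG}}}{R_{\text{full}}} = \frac{1}{6.19} \approx 0.162
\quad\Rightarrow\quad \text{about } 83.8\% \text{ less runtime}.
```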

These technical advancements have notable implications for the 3D content creation community and related industries. By proposing HDSL as an intermediate representation, the research offers a candidate common interface between LLMs and 3D engines. If widely adopted, such a representation could become foundational infrastructure for future intelligent 3D creation tools. For sectors such as game development, virtual reality interior design, and digital twin construction, high-fidelity generation combined with precise editing could significantly lower the cost of manual modeling. It could also accelerate the workflow from conceptual design to final rendering, allowing creators to focus on high-level artistic direction rather than low-level geometric adjustments.

Outlook

The introduction of HDSL and the associated LLM agent pipeline offers a new perspective on managing the cognitive load of large models in long-context scenarios. By adopting concepts akin to "local program repair" from software engineering, the study demonstrates that structured constraints and localized processing can effectively mitigate issues of hallucination and inconsistency. This approach suggests a broader trend in AI-driven graphics: moving away from monolithic generation towards modular, verifiable, and editable components. As LLMs continue to evolve, the integration of such structured intermediate languages will likely become a standard practice for ensuring reliability in complex generative tasks.

Looking forward, the open-source potential of HDSL presents significant opportunities for community-driven innovation. Developers can build plugins and toolchains atop this standardized language, further enriching 3D asset libraries and expanding editing functionalities. This ecosystem growth will be essential for pushing the boundaries of AIGC in three-dimensional space understanding and generation. As more tools adopt HDSL, the interoperability between different 3D software packages and AI models will improve, fostering a more cohesive and efficient workflow for professionals.

Ultimately, the success of HDSL hinges on its ability to balance flexibility with precision. The current results indicate that this balance is achievable, offering a viable path toward scalable and standardized 3D content production. Future research may explore extending HDSL to outdoor environments or dynamic scenes, further testing the limits of hierarchical domain-specific languages in graphics. For now, the framework stands as a significant step forward in making 3D scene generation not just an automated process, but a controllable and interactive design partner.
