OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics

Sungkyunkwan University, Yonsei University

Abstract

TL;DR: We present OpenMonoGS-SLAM, enabling monocular 3DGS SLAM with open-set semantics via VFMs, without depth or 3D semantic supervision.

Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which enables the effective construction of Gaussian semantic feature maps and leads to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary inputs such as depth maps or semantic annotations.

System Overview

Top: Given the previous keyframe and the current frame, MASt3R estimates a point map. The 3D Gaussian scene is initialized by assigning Gaussian attributes and a learnable semantic feature vector to each point. Differentiable rendering produces an RGB image and a semantic feature map, supervised by the ground-truth RGB image and multi-scale SAM masks, respectively. Bottom: When the current frame becomes a new keyframe, SAM predicts instance masks, and masked CLIP features are extracted from the RGB image. These features update the memory bank online and provide targets for the language-guided loss. Memory attention then retrieves high-dimensional semantics to refine the Gaussian semantic map.
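Below is a minimal PyTorch sketch of the memory mechanism described above, assuming a cosine-similarity novelty test for online updates and a single-head attention retrieval that lifts the low-dimensional per-Gaussian features into CLIP space. The class name, capacity policy, thresholds, and projection module are illustrative, not the paper's exact formulation.

import torch
import torch.nn.functional as F

class SemanticMemoryBank:
    # Toy memory bank for high-dimensional (e.g., CLIP) mask features.
    # Hypothetical sketch: the actual update rule, capacity policy, and
    # attention formulation in OpenMonoGS-SLAM may differ.

    def __init__(self, feat_dim=512, capacity=256):
        self.capacity = capacity
        self.feats = torch.empty(0, feat_dim)  # stored CLIP mask features

    @torch.no_grad()
    def update(self, new_feats, sim_thresh=0.9):
        # Insert keyframe mask features that are not redundant with memory.
        new_feats = F.normalize(new_feats, dim=-1)
        if self.feats.numel() == 0:
            self.feats = new_feats[: self.capacity]
            return
        sim = new_feats @ self.feats.T                # (N_new, N_mem) similarities
        novel = sim.max(dim=1).values < sim_thresh    # keep only novel features
        self.feats = torch.cat([self.feats, new_feats[novel]])[-self.capacity:]

    def retrieve(self, gauss_feats, proj):
        # Attend over memory to lift low-dimensional per-Gaussian semantic
        # features (N, d) to CLIP space; proj is a learnable map d -> feat_dim.
        q = F.normalize(proj(gauss_feats), dim=-1)    # queries from Gaussians
        k = v = self.feats                            # keys/values from memory
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                               # (N, feat_dim) refined semantics

In this sketch, the retrieved high-dimensional features play the role of the refined Gaussian semantic map, which the language-guided loss would compare against the masked CLIP targets.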

Novel-view Synthesis

Qualitative results of novel view synthesis on the Replica dataset.

Class-agnostic Prompt Segmentation

Qualitative results of class-agnostic prompt segmentation on the Replica dataset. Once the 3D semantic features are projected into a 2D map, cosine similarity is computed between the per-pixel 2D features and the sampled point features from all classes in the scene. Thresholding the resulting similarity maps produces the final segmentation masks.
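A minimal sketch of this step, assuming PyTorch tensors; the function name, shapes, and threshold are illustrative:

import torch
import torch.nn.functional as F

def prompt_segmentation(feat_map, class_feats, thresh=0.6):
    # Class-agnostic prompt segmentation sketch.
    # feat_map:    (C, H, W) rendered 2D semantic feature map
    # class_feats: (K, C) features sampled at one prompt point per class
    # returns:     (K, H, W) boolean masks, one per prompted class
    C, H, W = feat_map.shape
    pix = F.normalize(feat_map.reshape(C, -1), dim=0)  # per-pixel unit features (C, H*W)
    cls = F.normalize(class_feats, dim=-1)              # unit prompt features (K, C)
    sim = (cls @ pix).reshape(-1, H, W)                 # cosine similarity maps
    return sim > thresh                                 # threshold -> final masks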

Open-set Semantic Segmentation

Qualitative results of open-set semantic segmentation on the ScanNet dataset. The text prompt for the current frame is used to query the rendered 2D feature map. The resulting similarity map is then thresholded to obtain the final segmentation mask.
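A minimal sketch of the text query, assuming the rendered feature map already lives in CLIP embedding space (e.g., after memory-attention refinement) and using OpenAI's CLIP text encoder; the model choice and threshold are illustrative:

import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

@torch.no_grad()
def text_query_segmentation(feat_map, prompt, thresh=0.25, device="cpu"):
    # Open-set segmentation sketch: threshold the similarity between a CLIP
    # text embedding and the rendered per-pixel feature map.
    # feat_map: (C, H, W); C must match the text embedding dim (512 for ViT-B/16).
    model, _ = clip.load("ViT-B/16", device=device)
    tokens = clip.tokenize([prompt]).to(device)
    txt = F.normalize(model.encode_text(tokens).float(), dim=-1)  # (1, C)

    feat_map = feat_map.to(device)
    C, H, W = feat_map.shape
    pix = F.normalize(feat_map.reshape(C, -1), dim=0)             # (C, H*W)
    sim = (txt @ pix).reshape(H, W)                               # per-pixel similarity
    return sim > thresh                                           # final mask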

BibTeX


@article{yoo2025openmonogs,
  title={OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics},
  author={Yoo, Jisang and Kang, Gyeongjin and Ko, Hyun-kyu and Yu, Hyeonwoo and Park, Eunbyung},
  journal={arXiv preprint arXiv:2512.08625},
  year={2025}
}