2023 32nd IEEE International Conference on Robot and
Human Interactive Communication (RO-MAN)
August 28-31, 2023. Paradise Hotel, Busan, Korea
Instance-Level Semantic Maps for Vision Language Navigation
Laksh Nanwani1, Anmol Agarwal1, Kanishk Jain1, Raghav Prabhakar1,
Aaron Monis1, Aditya Mathur1, Krishna Murthy Jatavallabhula2,
A. H. Abdul Hafez3, Vineet Gandhi1, K. Madhava Krishna1
Fig. 1: VLMaps [1], being a semantic top-view map, cannot distinguish between different instances of the same object. On
the other hand, SI Maps (ours) are directly amenable for handling such queries as they contain instance-specific information
for all objects in the environment. For the scene on the extreme left, the instances of the object ‘chair’ as detected by SI
Maps are shown in different colors in the rightmost figure.
1KCIS, International Institute of Information Technology, Hyderabad, India.
2CSAIL, MIT, Cambridge, United States.
3Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey.
979-8-3503-3670-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/RO-MAN57019.2023.10309534

Abstract— Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment and to navigate on demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, their representations are limited for practical applicability as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into spatial map representation using a community detection algorithm and utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves the navigation performance by two-fold (233%) on realistic language commands with instance-specific descriptions compared to the baseline. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.

I. INTRODUCTION

Advancements in machine learning research have brought about rapid changes in the field of robotics, allowing for the development of sophisticated autonomous agents. However, making this technology practically viable for large-scale adoption requires a natural mechanism to interact with humans. Vision Language Navigation (VLN) research aims to achieve this goal by incorporating natural language understanding into autonomous agents to navigate the environment based on linguistic commands. Prior approaches to VLN have addressed this task by harnessing the capabilities of visual grounding models, which allow the navigating agents to localize objects in the visual scene or directly ground navigable regions based on linguistic descriptions. However, these approaches fail to address linguistic commands which require spatial precision to identify the goal region. Furthermore, these approaches assume that the object referred to by the linguistic command is always visible in the current scene. Such an assumption rarely holds in realistic scenarios, where things can move in or out of the current scene as we navigate the environment.

Consider the example in Figure 1 with the language command, “walk to the fourth chair in your field of view”. To execute this command, we first need to explore the entire room to find all instances of chairs and then find the fourth instance from where the command was given. For visual grounding-based approaches, it is non-trivial to handle such scenarios as there is no way to rank the localized chairs based on distance. To counteract the above issues, geometric maps, which create a global mapping of the surrounding environment, provide a direct mechanism to ground all the objects present in the scene, including those not visible in the current view, and additionally, are readily amenable for planning and navigation purposes. In this work, we propose a memory-efficient mechanism for creating a semantic spatial representation of the environment, which is directly applicable to robots navigating in real-world scenes.

Recent works like VLMaps [1] and NLMap [2] propose a mechanism to build semantic spatial maps without any labeled data by fusing pre-trained vision-language features with the 3D point cloud of the physical world. They compute the similarity between visual and linguistic features in a common semantic space of a large-scale pre-trained vision-language model and utilize large language models to convert the natural language command to a sequence of navigation goals for planning. However, their map representation does not allow them to differentiate between different instances of the same object, and hence to handle language queries that describe an instance-specific navigation goal, like the ones mentioned in Figure 2, as the visual encodings are instance-agnostic. Moreover, their mechanism is memory intensive as they require high-dimensional feature embeddings to make semantic associations for the objects in the visual scene.

Fig. 2: Our top-view map representation allows indoor embodied agents to perform complex instance-specific goal navigation in object-rich environments. The language queries can refer to individual instances based on spatial and viewpoint configuration with respect to other objects of the same type while preserving the navigation performance on standard language queries.

Our work focuses on creating spatial maps of the environment with instance-level semantics. We achieve this in a memory-efficient manner, bypassing the use of feature embeddings altogether. We show that Semantic Instance Maps (SI Maps) are computationally efficient to construct and allow for a wide range of complex and realistic commands that evade prior works.

II. RELATED WORK

A. Semantic Mapping

With the recent progress in computer vision and natural language processing literature, there has been considerable interest in augmenting the semantic understanding of traditional SLAM algorithms. Earlier works like SLAM++ [3] propose an object-oriented SLAM, which utilizes prior knowledge about the domain-specific objects and structures in the environment. Later works like [4] assign instance-level semantics using Mask-RCNN [5] to 3D volumetric maps. Some methods [1], [2] have also explored transferring predictions from CNNs in 2D pixel space to 3D space for 3D reconstruction. Concurrent to our work, [6] proposes a deep reinforcement learning-based approach for multi-object instance navigation, albeit without linguistic commands. VLMaps [1] and NLMap-Saycan [2] propose a natural language queryable scene representation with Visual Language models (VLMs). These methods utilize large language models (LLMs) to parse language instructions and identify the involved objects to query the scene representation for object availability and location.

B. Instance Segmentation

The ability to identify and localize different instances of similar objects is crucial for visual perception tasks in robotics. In the Computer Vision literature, the task of instance segmentation serves to evaluate such capabilities formally. Earlier works [5] utilized region proposal networks to predict candidate bounding boxes followed by a mask head to regress the instance-level segmentation mask for each proposal. While initial approaches designed task-specific architectures, more recent methods [7] have moved towards generalized architectures for different image segmentation tasks like semantic, instance, and panoptic segmentation. Mask2Former [7] employs an attention mechanism to extract localized object-centric features in an end-to-end manner. In this work, we utilize segmentation masks from Mask2Former to create instance-level semantic maps which are directly amenable for planning during autonomous navigation.

C. Vision Language Navigation

Most of the work in Vision Language Navigation (VLN) has focused on navigating in the environment using semantic perception based on the front camera view of the autonomous agent. Specifically, these works take the front camera image and the language command as input, and the navigation task is reduced to a sequence modeling task where at each time stamp, the optimal action is predicted to complete the navigation task successfully. Subsequent works have tackled the VLN problem using sequence-to-sequence learning [8], reinforcement learning [9] or behavior cloning methods [10]. However, these methods are non-trivial to interpret, and recent works [8] have found that such methods are unable to utilize the visual modality effectively for the navigation task. Consequently, recent works [1], [2] on VLN have focused on creating a semantic map of the environment for motion planning and utilizing visual grounding capabilities of large-scale vision-language models [11] to ground the semantic concepts in a visual world. In this work, we focus on creating a semantic mapping representation of the environment using large-scale language models. Unlike prior works, we create these maps in an embedding-free manner, thus reducing the computational cost significantly.
Fig. 3: In STEP 1, we create a semantic level map of the environment by back projecting the Mask2Former semantic labels
of the RGB pixels across different images onto the grid map. In STEP 2, we extract the subgraph concerned with object o
and run a community detection algorithm to break the grid cells containing object o into instances.
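STEP 1 of the caption above can be illustrated with a minimal sketch. It is not the authors' implementation: the pinhole intrinsics, the 4x4 camera-to-world pose layout, the axis convention (world z taken as the vertical axis), the dict-based grid, and all function names are assumptions made for illustration. Each depth pixel is back-projected, moved to the world frame, and binned into a top-down grid; when several points fall into the same cell, the label of the highest point is kept.

```python
import math

def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def to_world(point, pose):
    """Apply a 4x4 camera-to-world pose, given as row-major nested lists."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in pose[:3])

def update_grid(grid, heights, pixels, pose, intrinsics, cell_size=0.05):
    """Write the semantic labels of one frame into a top-down grid.

    pixels: iterable of (u, v, depth, label); grid and heights: dicts keyed by (row, col).
    """
    fx, fy, cx, cy = intrinsics
    for u, v, depth, label in pixels:
        if depth <= 0:  # invalid depth reading
            continue
        xw, yw, zw = to_world(back_project(u, v, depth, fx, fy, cx, cy), pose)
        cell = (math.floor(yw / cell_size), math.floor(xw / cell_size))
        # Several points can land in one cell: keep the label of the highest one.
        if cell not in heights or zw > heights[cell]:
            heights[cell] = zw
            grid[cell] = label
    return grid

# Hypothetical frame: camera 1.5 m above the floor, looking along world x.
# Assumed axes: camera z (forward) -> world x, camera x (right) -> world -y,
# camera y (down) -> world -z.
pose = [[0, 0, 1, 0.0],
        [-1, 0, 0, 0.0],
        [0, -1, 0, 1.5],
        [0, 0, 0, 1.0]]
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (made-up values)
pixels = [(51, 50, 2.02, "table"), (51, 20, 2.02, "chair")]

grid = update_grid({}, {}, pixels, pose, intrinsics)
print(grid)
```

Both example pixels fall into the same grid cell; the second one is higher above the floor, so its label is the one stored. A real pipeline would use the calibrated intrinsics and poses of the RGB-D sensor instead of the made-up values here.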
III. METHOD

A. Problem Statement

In this work, we aim to create a semantic map of the surrounding environment containing instance-level information for the various objects. Maps containing both instance-level and semantic information are necessary to handle linguistic commands which are frequently used in the daily vernacular. For example, consider the command, “Go to the empty chair near the third table”. We are required to identify “which instance of the table” is being talked about and then point out the instance of the empty chair. Our approach is equipped to handle such scenarios through an instance-specific mapping representation of the environment. We build SI Maps using only RGB-D sensors, pose information, and an off-the-shelf panoptic segmentation model. SI Maps creation involves two steps: (1) Occupancy map creation with semantic labels and (2) Community detection to separate instances of a given semantic label. The whole pipeline is illustrated in Figure 3.

B. SI Map Creation

Building Occupancy Grid: We define SI Maps as $M \in \mathbb{R}^{\bar{H} \times \bar{W} \times 2}$, where $\bar{H}$ and $\bar{W}$ represent the size of the top-down grid map. Similar to VLMaps, with the scale parameter $s$ (= 0.05 m in our experiments), an SI Map $M$ represents an area with size $s\bar{H} \times s\bar{W}$ square meters. $M_{i,j} = \langle o, t \rangle$ means that grid cell $(i, j)$ is occupied by the $t$-th instance of object $o$ in the environment. Since we are using the Mask2Former panoptic segmentation model trained on the COCO dataset [12], $o \in O$ (where $O$ is the set of objects present in the COCO dataset). To build our map, similar to VLMaps, for each RGB-D frame we back-project all the depth pixels $\mathbf{u} = (u, v)$ to form a local depth point cloud that we transform to the world frame using the pose information. For depth pixel $\mathbf{u} = (u, v)$ belonging to the $i$-th RGB-D frame, let $(p_x^{i,u,v}, p_y^{i,u,v})$ represent the coordinates of the projected point in the grid map $M$.

Integrating Instance-level information: With the occupancy map defined, we now utilize community detection algorithms to separate out the different instances in the environment. Specifically, we use the modularity-based Louvain method, a greedy, hierarchical optimization method that iteratively refines communities to maximize the modularity value. The modularity value is a measure of the density of links within communities compared to links between communities.

Let the output of the panoptic segmentation model for $\mathbf{u} = (u, v)$ be $\langle o_{i,u,v}, t_{i,u,v} \rangle$. This means that the $t_{i,u,v}$-th instance of object $o_{i,u,v}$ within the frame is present at pixel $\mathbf{u}$. We use this information to set the object label $o$ of $M_{p_x^{i,u,v},\, p_y^{i,u,v}}$ to $o_{i,u,v}$. When there exist multiple 3D depth pixels projecting to the same grid location in the map, we retain the label of the pixel with the highest vertical height.

To divide the grid cells labeled as object $o$ into different instances, we construct an undirected weighted graph $G = (V, E, W)$, where each grid cell $(i, j)$ whose object label in $M_{i,j}$ is equal to $o$ is included as a node in the set of vertices $V$. Whenever two neighbouring pixels $\mathbf{u}_1 = (u_1, v_1)$ and $\mathbf{u}_2 = (u_2, v_2)$ belong to the same entity in the $i$-th RGB-D frame, their corresponding grid cells $(p_x^{i,u_1,v_1}, p_y^{i,u_1,v_1})$ and $(p_x^{i,u_2,v_2}, p_y^{i,u_2,v_2})$ should also belong to the same instance in the real world. Hence, whenever pixels $\mathbf{u}_1$ and $\mathbf{u}_2$ have the semantic label $o$ and $\langle o_{i,u_1,v_1}, t_{i,u_1,v_1} \rangle = \langle o_{i,u_2,v_2}, t_{i,u_2,v_2} \rangle$, i.e., depth pixels $\mathbf{u}_1$ and $\mathbf{u}_2$ belong to the same entity within the image, we increase the edge weight between grid cells $(p_x^{i,u_1,v_1}, p_y^{i,u_1,v_1})$ and $(p_x^{i,u_2,v_2}, p_y^{i,u_2,v_2})$ by one. This helps us in transferring the instance segmentation information present in the panoptic segmentation outputs of the RGB-D frames to our map and also helps us to track the same instance across frames using the pose data. To prevent the frequency of visiting a particular area in the environment during mapping from unfairly affecting any edge weight, we normalize all the edge weights by the number of times their constituent nodes (grid cells) were observed across all RGB-D images for that scene. Ideally, in our graph, all grid cells belonging to the same connected component should belong to the same real-world entity. But Mask2Former masks are not perfect at a pixel level; hence it is possible for spurious edges to be drawn between nodes belonging to different real-world
entities. However, such edges are likely to be few in number.
To disregard such spurious edges, we group the nodes in
V using community detection algorithms instead of naively
breaking them into connected components.
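The following is a minimal sketch of the graph described above, under stated assumptions; it is not the authors' code and it does not implement the Louvain method itself. The function names, the toy cells, and the exact normalisation (dividing by the larger observation count of the two endpoints) are hypothetical choices. It builds the normalised edge weights, shows that naive connected components merge two chairs joined by a single spurious edge, and evaluates the modularity value that a community detection method would try to maximise.

```python
from collections import defaultdict

def build_instance_graph(frames, observations):
    """Accumulate and normalise edge weights between grid cells.

    frames: list of per-frame lists of (cell_a, cell_b) pairs, one pair for each pair of
        neighbouring pixels that share the same <object, instance> label in that frame.
    observations: dict mapping a cell to the number of frames in which it was observed.
    """
    weights = defaultdict(float)
    for pairs in frames:
        for a, b in pairs:
            if a != b:  # pixels falling into the same cell add no edge
                weights[frozenset((a, b))] += 1.0
    # Assumed normalisation: divide by the larger observation count of the two endpoints.
    return {edge: w / max(observations[c] for c in edge) for edge, w in weights.items()}

def connected_components(weights):
    """Naive grouping: any edge, however weak, joins two cells."""
    neighbours = defaultdict(set)
    for edge in weights:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

def modularity(weights, communities):
    """Weighted modularity Q = sum_c [ w_in(c) / m - (deg(c) / 2m)^2 ]."""
    m = sum(weights.values())
    degree = defaultdict(float)
    for edge, w in weights.items():
        for node in edge:
            degree[node] += w
    q = 0.0
    for community in communities:
        internal = sum(w for edge, w in weights.items() if edge <= community)
        total_degree = sum(degree[node] for node in community)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q

# Two hypothetical chairs of three cells each, observed in three frames.
chair_a = [(0, 0), (0, 1), (1, 1)]
chair_b = [(0, 2), (0, 3), (1, 2)]
clean = [(c[i], c[j]) for c in (chair_a, chair_b) for i in range(3) for j in range(i + 1, 3)]
frames = [clean, clean, clean + [((0, 1), (0, 2))]]  # one spurious edge in the last frame
observations = {cell: 3 for cell in chair_a + chair_b}

weights = build_instance_graph(frames, observations)
merged = connected_components(weights)
split = [set(chair_a), set(chair_b)]
print(len(merged), modularity(weights, merged), modularity(weights, split))
```

In this toy scene the single weak edge is enough for connected components to return one group, whereas the two-chair partition has a clearly higher modularity than the merged one, which is why a modularity-maximising method is expected to keep the chairs apart.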
We initialize the graph with a separate community for each
node. We use the Louvain community detection method,
which involves two phases: (1) Modularity optimization and
(2) Community aggregation. During modularity optimiza-
tion, for each node in the graph, we compute the change in
modularity by moving it to neighboring communities. The
node is transferred to the community, which results in the
highest increase in modularity. This procedure is repeated
for all nodes until no further improvement in modularity
is possible. In the community aggregation phase, the com-
munities formed in the modularity optimization phase are
considered single nodes. The weights of the edges between
the new nodes are determined by the sum of the weights of the edges between the nodes in the original communities. The two phases are iteratively repeated until the modularity value converges. After convergence, we get a labeled graph, where the nodes are grouped based on their community membership, i.e., occupancy grid cells belonging to the same instance are grouped together for all the objects in the environment. To correct the over-segmentation of communities, a post-processing step is applied to merge communities C1 and C2 if more than K% of the members of C1 are neighbors of some member of C2.

In contrast to VLMaps, our approach does not utilize the high-dimensional LSeg [13] feature embeddings for semantic map creation, which provides a memory-efficient mechanism to construct the instance-level semantic occupancy grid. For comparison, the VLMaps representation requires an average storage of about 2 gigabytes for a 1000×1000 map, whereas SI Maps need only about 16 megabytes for the same map size. Additionally, the proposed approach is highly flexible and adaptable, as it can easily incorporate other types of sensor data like LiDAR and IMU and plug in different segmentation models. The tunable hyper-parameter K further provides controllability in our approach, which is a desired capability for real-world deployment. In the next section, we show how SI Maps can be directly used for language-conditioned navigation.

C. Language-based Navigation

The significance of Semantic Instance Maps becomes apparent when dealing with commands that necessitate instance-level grounding. For a given language command, we would like to identify the region in SI Maps where the robot must navigate to execute the command successfully. Additionally, since different commands can refer to different navigational maneuvers, we must also determine the maneuvers required for a specific language query. To achieve this, we define function primitives for each possible maneuver, reducing the task to classifying the appropriate function primitive for each sub-command. For this classification, we utilize the powerful large language model (LLM), ChatGPT [14], for motion planning.

LLMs, trained on billions of lines of text and code, demonstrate advanced natural language understanding, reasoning, and coding capabilities. Similar to the approach with VLMaps, we repurpose LLMs to generate executable Python code for the robot. Specifically, we supply ChatGPT with the list of function primitives and their respective descriptions. We then prompt ChatGPT with several language queries accompanied by the corresponding ground truth Python code containing a sequence of function primitives based on the language command. During inference, for each language command, we provide ChatGPT with the list of objects present in the SI Maps and generate Python code that refers to the specific instances involved in the language command.

In Figure 4, we show a few examples of the Python executable code generated by ChatGPT for the given commands. ChatGPT successfully generates the correct executable code after prompting it with a few examples of language queries and corresponding ground truth Python executable code. To ground instances, our function primitive calls also include an instance parameter to handle instance-specific queries. The instance parameter is directly inferred from the language command by ChatGPT along with the object of interest. Overall, we define 23 function primitives for complex navigational maneuvers like moving between two objects, navigating to the nth closest object, etc., and the essential turning and moving primitives.

Fig. 4: An example of the executable Python code generated by ChatGPT for the given language commands. The generated code includes an instance parameter in the function primitive call for navigating to the specified instance in the environment.

IV. EXPERIMENTS

A. Experimental Setup

We showcase the effectiveness of our approach on multiple scenes from the Matterport3D [15] dataset in the Habitat [16] simulator. Matterport3D is a commonly used dataset for evaluating the navigational capabilities of existing VLN agents in an indoor environment. The robot must maneuver in a continuous environment, performing navigational maneuvers
specified by the natural language command. For top-view map creation, we collect 5,267 RGB-D frames from 5 different scenes and store the camera pose for each frame.

Baseline: We evaluate against a logical baseline where the semantic top-view maps from the VLMaps-based approach are separated into separate instances. If the objects in the environment are well separated, the semantic segmentation output should already contain the information required to separate different instances of similar objects by simply applying connected components. As a result, our baseline involves applying connected components over the VLMaps output. However, in realistic scenarios, different instances of the same object can be close to each other; for example, in a restaurant, chairs belonging to the same table are close to each other. In such a scenario, just computing connected components will not work, as multiple instances will get clubbed into a single instance.

Evaluation Metrics: Like prior approaches [1], [8], [17] in VLN literature, we use the gold standard Success Rate metric, also known as the Task Completion metric, to measure the success ratio for the navigation task. We compute the Success Rate metric through human and automatic evaluations. For automatic evaluation, we use the ground truth environment map and compute the Success Rate using a pre-defined heuristic where the navigation sub-goal is considered successful if we stop within a threshold distance of the ground truth object. For human evaluation, we verify if the agent ends up in a position desired according to the query.

B. Evaluation Results

In this section, we perform quantitative and qualitative comparisons of SI Maps against VLMaps and VLMaps with connected components (CC). We compare the performance of each scene representation for the downstream language-based navigation task using the Success Rate in Table I. We use the same function primitives for all the methods.

Human evaluation was done because, for a few queries, we observed that the agent ended up close to the target object but did not complete the task in the desired way.

Method            Success Rate          Success Rate
                  (Human Evaluation)    (Automatic Evaluation)
VLMaps            0.24                  0.46
VLMaps with CC    0.34                  0.48
SI Maps (K=5)     0.80                  0.88
SI Maps (K=9)     0.76                  0.88

TABLE I: SI Maps outperform other baseline methods by significantly large margins on the Success Rate metric. The best result in each column is achieved by SI Maps.

We observe that SI Maps exhibit a remarkable improvement in performance compared to other approaches. On human evaluation, SI Maps reach a success rate of 80% against the 24% obtained by VLMaps, a 233% relative increase, demonstrating a substantial leap in instance-specific goal navigation. Since VLMaps only contain semantic information, they fail on queries that refer to specific instances of an object, like “navigate to the second counter”. Our logical baseline, VLMaps with connected components, can handle some instance-specific queries, resulting in a gain of 10 percentage points (0.24 to 0.34) over vanilla VLMaps on human evaluation. However, the success of this method is observed in scenes where neighboring instances of the same object have ample room between them. In contrast, real-life environments such as offices, restaurants, and hospitals often have objects in close proximity to each other. In these cases, instance-level information is essential for distinguishing between neighboring objects. SI Maps demonstrate robustness to object placement in the environment by directly utilizing the instance-level information provided by the instance segmentation model during the occupancy grid creation.

C. Qualitative Results

Fig. 5: The above figure shows the agent in different scenes in a simulated environment with three different queries. Images on the top show the RGB top-down view map, along with the segmented goal object instance. The corresponding images on the bottom represent the path taken by the agent to reach the desired object from the initial location.

In this section, we showcase qualitative examples of our approach for the vision language navigation task. The results are illustrated in Figure 5 with the corresponding A-star trajectory using SI Maps for navigation. SI Maps allow navigating to specific instances in the scene based on their relative distance with respect to other objects (left, center)
and direction-based specification in the global map (right). The downstream navigation, as a consequence of SI Maps, is agnostic to the starting pose and orientation of the agent in the environment.

We also show qualitative comparisons of different methods on the quality of instance-level top-view maps in Figure 6 for different seating objects (chair, couch, sofa) in the simulated environment. Our approach effectively captures the instance-level semantics of objects in the environment, producing 32 segments for the 29 instances present in the map (3 of them are extra noisy segments). In contrast, the baseline of VLMaps with connected components detects 26 instances, but most of them are noisy segments, and it merges several separate instances (for the same object in close proximity) into a single instance. Our results are particularly impressive in the middle region of the map, which corresponds to the dining area in the environment. Here, the chairs are in close proximity to each other, and the vanilla VLMaps approach fails when a particular instance of chair is queried. Similarly, applying connected components-based heuristics to separate instances is not enough, as the semantic segmentation masks of the chairs end up being connected with each other, resulting in multiple instances being merged.

Fig. 6: Qualitative example of the instance-level semantics captured by different methods for all the chairs in the environment. SI Maps clearly localize the different instances in the map.

The VLMaps-based approaches rely on alignment between per-pixel visual embeddings and linguistic feature embeddings, which can be sensitive to noise due to the unconstrained nature of the association. The benefit of our feature-embedding-free approach becomes evident as we directly constrain the occupancy grid creation with the instance segmentation masks. As a result, SI Maps have considerably less noise than derivative VLMaps approaches. Community detection further helps reduce noise by filtering out spurious communities formed due to noise, leading to a much cleaner map, which can also be observed in Figures 1 and 6.

V. CONCLUSION

In this study, we introduce a novel instance-focused scene representation for indoor settings, enabling seamless language-based navigation across various environments. Our representation accommodates language commands that refer to specific instances within the environment. Furthermore, our map creation method is more memory-efficient, resulting in an impressive 128-fold decrease in storage, as it does not rely on high-dimensional feature embeddings for visual and linguistic modalities. Additionally, our approach demonstrates robustness in relation to object placement in the environment and is less vulnerable to noise than previous methods. We showcase the practicality of the proposed SI Maps using success rate and panoptic quality metrics. Future research could investigate 3D instance segmentation techniques to incorporate instance-level semantics into the occupancy grid creation process directly.

ACKNOWLEDGEMENT

We acknowledge iHub-Data IIIT Hyderabad for their support to this work.

REFERENCES

[1] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), (London, UK), 2023.
[2] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler, “Open-vocabulary queryable scene representations for real world planning,” in arXiv preprint arXiv:2209.09874, 2022.
[3] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “Slam++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1352–1359, 2013.
[4] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, “Fusion++: Volumetric object-level slam,” in 2018 international conference on 3D vision (3DV), pp. 32–41, IEEE, 2018.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.
[6] N. Gireesh, A. Agrawal, A. Datta, S. Banerjee, M. Sridharan, B. Bhowmick, and M. Krishna, “Sequence-agnostic multi-object navigation,” in IEEE International Conference on Robotics and Automation, 2023. (to be published).
[7] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” 2022.
[8] R. Schumann and S. Riezler, “Analyzing generalization of vision and language navigation to unseen outdoor areas,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Dublin, Ireland), pp. 7519–7532, Association for Computational Linguistics, May 2022.
[9] K. Nguyen, D. Dey, C. Brockett, and B. Dolan, “Vision-based navigation with language-based assistance via imitation learning with indirect intervention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12527–12537, 2019.
[10] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Neural modular control for embodied question answering,” in Conference on Robot Learning, pp. 53–62, PMLR, 2018.
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755, Springer, 2014.
[13] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” in International Conference on Learning Representations, 2022.
[14] OpenAI, “Chatgpt.” https://openai.com/blog/chatgpt.
[15] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.
[16] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., “Habitat: A platform for embodied ai research,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 9339–9347, 2019.
[17] K. Jain, V. Chhangani, A. Tiwari, K. M. Krishna, and V. Gandhi, “Ground then navigate: Language-guided navigation in dynamic scenes,” arXiv preprint arXiv:2209.11972, 2022.