2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
August 28-31, 2023. Paradise Hotel, Busan, Korea

Instance-Level Semantic Maps for Vision Language Navigation

Laksh Nanwani¹, Anmol Agarwal¹, Kanishk Jain¹, Raghav Prabhakar¹, Aaron Monis¹, Aditya Mathur¹, Krishna Murthy Jatavallabhula², A. H. Abdul Hafez³, Vineet Gandhi¹, K. Madhava Krishna¹

¹ KCIS, International Institute of Information Technology, Hyderabad, India.
² CSAIL, MIT, Cambridge, United States.
³ Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey.

979-8-3503-3670-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/RO-MAN57019.2023.10309534
Authorized licensed use limited to: Hasan Kalyoncu Universitesi. Downloaded on March 20,2024 at 07:50:21 UTC from IEEE Xplore. Restrictions apply.

Fig. 1: VLMaps [1], being a semantic top-view map, cannot distinguish between different instances of the same object. On the other hand, SI Maps (ours) are directly amenable for handling such queries as they contain instance-specific information for all objects in the environment. For the scene on the extreme left, the instances of the object ‘chair’ as detected by SI Maps are shown in different colors in the rightmost figure.

Abstract: Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment and to navigate on demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, their representations are limited for practical applicability as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into the spatial map representation using a community detection algorithm and by utilizing the word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves the navigation performance by 233% (more than threefold) on realistic language commands with instance-specific descriptions compared to the baseline. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.

I. INTRODUCTION

Advancements in machine learning research have brought about rapid changes in the field of robotics, allowing for the development of sophisticated autonomous agents. However, making this technology practically viable for large-scale adoption requires a natural mechanism to interact with humans. Vision Language Navigation (VLN) research aims to achieve this goal by incorporating natural language understanding into autonomous agents so that they can navigate the environment based on linguistic commands. Prior approaches to VLN have addressed this task by harnessing the capabilities of visual grounding models, which allow the navigating agents to localize objects in the visual scene or directly ground navigable regions based on linguistic descriptions. However, these approaches fail to address linguistic commands which require spatial precision to identify the goal region. Furthermore, these approaches assume that the object referred to by the linguistic command is always visible in the current scene. Such an assumption rarely holds in realistic scenarios, where things can move in or out of the current scene as we navigate the environment.

Consider the example in Figure 1 with the language command, “walk to the fourth chair in your field of view”. To execute this command, we first need to explore the entire room to find all instances of chairs and then find the fourth instance from where the command was given. For visual grounding-based approaches, it is non-trivial to handle such scenarios as there is no way to rank the localized chairs based on distance. To counteract the above issues, geometric maps, which create a global mapping of the surrounding environment, provide a direct mechanism to ground all the objects present in the scene, including those not visible in the current view, and are additionally readily amenable for planning and navigation purposes. In this work, we propose a memory-efficient mechanism for creating a semantic spatial representation of the environment, which is directly applicable to robots navigating in real-world scenes.

Fig. 2: Our top-view map representation allows indoor embodied agents to perform complex instance-specific goal navigation in object-rich environments. The language queries can refer to individual instances based on spatial and viewpoint configuration with respect to other objects of the same type while preserving the navigation performance on standard language queries.

Recent works like VLMaps [1] and NLMap [2] propose a mechanism to build semantic spatial maps without any labeled data by fusing pre-trained vision-language features with the 3D point cloud of the physical world. They compute the similarity between visual and linguistic features in a common semantic space of a large-scale pre-trained vision-language model and utilize large language models to convert the natural language command to a sequence of navigation goals for planning. However, because the visual encodings are instance-agnostic, their map representation does not differentiate between different instances of the same object and hence cannot handle language queries that describe an instance-specific navigation goal, like the ones shown in Figure 2. Moreover, their mechanism is memory intensive as they require high-dimensional feature embeddings to make semantic associations for the objects in the visual scene.

Our work focuses on creating spatial maps of the environment with instance-level semantics. We achieve this in a memory-efficient manner, bypassing the use of feature embeddings altogether. We show that Semantic Instance Maps (SI Maps) are computationally efficient to construct and support a wide range of complex and realistic commands that prior works cannot handle.

II. RELATED WORK

A. Semantic Mapping

With the recent progress in computer vision and natural language processing literature, there has been considerable interest in augmenting the semantic understanding of traditional SLAM algorithms. Earlier works like SLAM++ [3] propose an object-oriented SLAM, which utilizes prior knowledge about the domain-specific objects and structures in the environment. Later works like [4] assign instance-level semantics using Mask-RCNN [5] to 3D volumetric maps. Some methods [1], [2] have also explored transferring predictions from CNNs in 2D pixel space to 3D space for 3D reconstruction. Concurrent to our work, [6] proposes a deep reinforcement learning-based approach for multi-object instance navigation, albeit without linguistic commands.

VLMaps [1] and NLMap-Saycan [2] propose a natural language queryable scene representation with Visual Language models (VLMs). These methods utilize large language models (LLMs) to parse language instructions and identify the involved objects to query the scene representation for object availability and location.

B. Instance Segmentation

The ability to identify and localize different instances of similar objects is crucial for visual perception tasks in robotics. In the Computer Vision literature, the task of instance segmentation serves to evaluate such capabilities formally. Earlier works [5] utilized region proposal networks to predict candidate bounding boxes, followed by a mask head to regress the instance-level segmentation mask for each proposal. While initial approaches designed task-specific architectures, more recent methods [7] have moved towards generalized architectures for different image segmentation tasks like semantic, instance, and panoptic segmentation. Mask2Former [7] employs an attention mechanism to extract localized object-centric features in an end-to-end manner. In this work, we utilize segmentation masks from Mask2Former to create instance-level semantic maps which are directly amenable for planning during autonomous navigation.

C. Vision Language Navigation

Most of the work in Vision Language Navigation (VLN) has focused on navigating in the environment using semantic perception based on the front camera view of the autonomous agent. Specifically, these works take the front camera image and the language command as input, and the navigation task is reduced to a sequence modeling task where, at each time step, the optimal action is predicted to complete the navigation task successfully. Subsequent works have tackled the VLN problem using sequence-to-sequence learning [8], reinforcement learning [9] or behavior cloning methods [10]. However, these methods are non-trivial to interpret, and recent works [8] have found that such methods are unable to utilize the visual modality effectively for the navigation task. Consequently, recent works [1], [2] on VLN have focused on creating a semantic map of the environment for motion planning and utilizing visual grounding capabilities of large-scale vision-language models [11] to ground the semantic concepts in a visual world. In this work, we focus on creating a semantic mapping representation of the environment using large-scale language models. Unlike prior works, we create these maps in an embedding-free manner, thus reducing the computational cost significantly.

III. METHOD

Fig. 3: In STEP 1, we create a semantic level map of the environment by back projecting the Mask2Former semantic labels of the RGB pixels across different images onto the grid map. In STEP 2, we extract the subgraph concerned with object o and run a community detection algorithm to break the grid cells containing object o into instances.

A. Problem Statement

In this work, we aim to create a semantic map of the surrounding environment containing instance-level information for the various objects. Maps containing both instance-level and semantic information are necessary to handle the kinds of linguistic commands used in everyday speech. For example, consider the command, “Go to the empty chair near the third table”. We are required to identify “which instance of the table” is being talked about and then point out the instance of the empty chair. Our approach is equipped to handle such scenarios through an instance-specific mapping representation of the environment. We build SI Maps using only RGB-D sensors, pose information, and an off-the-shelf panoptic segmentation model. SI Maps creation involves two steps: (1) Occupancy map creation with semantic labels and (2) Community detection to separate instances of a given semantic label. The whole pipeline is illustrated in Figure 3.

B. SI Map Creation

Building Occupancy Grid: We define SI Maps as $M \in \mathbb{R}^{\bar{H} \times \bar{W} \times 2}$, where $\bar{H}$ and $\bar{W}$ represent the size of the top-down grid map. Similar to VLMaps, the grid resolution is set by the scale parameter $s$, and $M_{i,j} = \langle o, t \rangle$ means that grid cell $(i, j)$ is occupied by the $t$-th instance of object $o$ in the environment. Since we are using the Mask2Former panoptic segmentation model trained on the COCO dataset [12], $o \in O$ (where $O$ is the set of objects present in the COCO dataset). To build our map, similar to VLMaps, for each RGB-D frame we back-project all the depth pixels $\mathbf{u} = (u, v)$ to form a local depth point cloud that we transform to the world frame using the pose information. For depth pixel $\mathbf{u} = (u, v)$ belonging to the $i$-th RGB-D frame, let $(p_x^{i,u,v}, p_y^{i,u,v})$ represent the coordinates of the projected point in the grid map $M$.

Integrating Instance-level information: With the occupancy map defined, we now utilize community detection algorithms to separate out the different instances in the environment. Specifically, we use the modularity-based Louvain method, a greedy, hierarchical optimization method that iteratively refines communities to maximize the modularity value. The modularity value is a measure of the density of links within communities compared to links between communities.

Let the output of the panoptic segmentation model for $\mathbf{u} = (u, v)$ be $\langle o_{i,u,v}, t_{i,u,v} \rangle$. This means that the $t_{i,u,v}$-th instance of object $o_{i,u,v}$ within the frame is present at pixel $\mathbf{u}$. We use this information to set the object label $o$ of $M_{p_x^{i,u,v},\, p_y^{i,u,v}}$ to $o_{i,u,v}$. When multiple 3D depth pixels project to the same grid location in the map, we retain the label of the pixel with the highest vertical height.

To divide the grid cells labeled with object $o$ into different instances, we construct an undirected weighted graph $G = (V, E, W)$, where each grid cell $(i, j)$ whose object label in $M_{i,j}$ is equal to $o$ is included as a node in the set of vertices $V$. Whenever two neighbouring pixels $\mathbf{u}_1 = (u_1, v_1)$ and $\mathbf{u}_2 = (u_2, v_2)$ belong to the same entity in the $i$-th RGB-D frame, their corresponding grid cells $(p_x^{i,u_1,v_1}, p_y^{i,u_1,v_1})$ and $(p_x^{i,u_2,v_2}, p_y^{i,u_2,v_2})$ should also belong to the same instance in the real world. Hence, whenever pixels $\mathbf{u}_1$ and $\mathbf{u}_2$ have the semantic label $o$ and belong to the same instance in the image, we increase the edge weight between grid cells $(p_x^{i,u_1,v_1}, p_y^{i,u_1,v_1})$ and $(p_x^{i,u_2,v_2}, p_y^{i,u_2,v_2})$ by one. This transfers the instance segmentation information present in the panoptic segmentation outputs of the RGB-D frames to our map and also helps us track the same instance across frames using the pose data. To prevent the frequency of visiting a particular area in the environment during mapping from unfairly affecting any edge weight, we normalize all the edge weights by the number of times their constituent nodes (grid cells) were observed across all RGB-D images for that scene.

Ideally, in our graph, all grid cells belonging to the same connected component should belong to the same real-world entity. But Mask2Former masks are not perfect at a pixel level; hence it is possible for spurious edges to be drawn between nodes belonging to different real-world entities. However, such edges are likely to be few in number. To disregard such spurious edges, we group the nodes in $V$ using community detection algorithms instead of naively breaking them into connected components.

We initialize the graph with a separate community for each node. We use the Louvain community detection method, which involves two phases: (1) Modularity optimization and (2) Community aggregation. During modularity optimization, for each node in the graph, we compute the change in modularity obtained by moving it to neighboring communities. The node is transferred to the community that results in the highest increase in modularity. This procedure is repeated for all nodes until no further improvement in modularity is possible. In the community aggregation phase, the communities formed in the modularity optimization phase are considered single nodes. The weights of the edges between the new nodes are determined by the sum of the weights of the edges between the nodes in the original communities. The two phases are iteratively repeated until the modularity value converges. After convergence, we get a labeled graph, where the nodes are grouped based on their community membership, i.e., occupancy grid cells belonging to the same instance are grouped together for all the objects in the environment. To correct the over-segmentation of communities, a post-processing step is applied to merge communities $C_1$ and $C_2$ if more than $K\%$ of the members of $C_1$ are neighbors of some member of $C_2$.

In contrast to VLMaps, our approach does not utilize the high-dimensional LSeg [13] feature embeddings for semantic map creation, which provides a memory-efficient mechanism to construct the instance-level semantic occupancy grid. For comparison, the VLMaps representation requires an average storage of about 2 gigabytes for a 1000×1000 map, whereas SI Maps need only about 16 megabytes for the same map size. Additionally, the proposed approach is flexible and adaptable, as it can easily incorporate other types of sensor data like LiDAR and IMU and plug in different segmentation models. The tunable hyper-parameter $K$ further provides controllability in our approach, which is a desired capability for real-world deployment. In the next section, we show how SI Maps can be directly used for language-conditioned navigation.

C. Language-based Navigation

The significance of Semantic Instance maps becomes apparent when dealing with commands that necessitate instance-level grounding. For a given language command, we would like to identify the region in SI Maps where the robot must navigate to execute the command successfully. Additionally, since different commands can refer to different navigational maneuvers, we must also determine the maneuvers required for a specific language query. To achieve this, we define function primitives for each possible maneuver, reducing the task to classifying the appropriate function primitive for each sub-command. For this classification, we utilize the large language model (LLM) ChatGPT [14] for motion planning.

Fig. 4: An example of the executable Python code generated by ChatGPT for the given language commands. The generated code includes an instance parameter in the function primitive call for navigating to the specified instance in the environment.

LLMs, trained on billions of lines of text and code, demonstrate advanced natural language understanding, reasoning, and coding capabilities. Similar to the approach with VLMaps, we repurpose LLMs to generate executable Python code for the robot. Specifically, we supply ChatGPT with the list of function primitives and their respective descriptions. We then prompt ChatGPT with several language queries accompanied by the corresponding ground truth Python code containing a sequence of function primitives based on the language command. During inference, for each language command, we provide ChatGPT with the list of objects present in the SI Maps and generate Python code that refers to the specific instances involved in the language command.

In Figure 4, we show a few examples of the Python executable code generated by ChatGPT for the given commands. ChatGPT successfully generates the correct executable code after prompting it with a few examples of language queries and corresponding ground truth Python executable code. To ground instances, our function primitive calls also include an instance parameter to handle instance-specific queries. The instance parameter is directly inferred from the language command by ChatGPT along with the object of interest. Overall, we define 23 function primitives for complex navigational maneuvers like moving between two objects, navigating to the nth closest object, etc., and the essential turning and moving primitives.

IV. EXPERIMENTS

A. Experimental Setup

We showcase the effectiveness of our approach on multiple scenes from the Matterport3D [15] dataset in the Habitat [16] simulator. Matterport3D is a commonly used dataset for evaluating the navigational capabilities of existing VLN agents in an indoor environment. The robot must maneuver in a continuous environment, performing navigational maneuvers specified by the natural language command. For top-view map creation, we collect 5,267 RGB-D frames from 5 different scenes and store the camera pose for each frame.

Baseline: We evaluate against a logical baseline where the semantic top-view maps from the VLMaps-based approach are separated into separate instances. If the objects in the environment are well separated, the semantic segmentation output should already contain the information required to separate different instances of similar objects by simply applying connected components. As a result, our baseline involves applying connected components over the VLMaps output. However, in realistic scenarios, different instances of the same object can be close to each other; for example, in a restaurant, chairs belonging to the same table are close to each other. In such a scenario, just computing connected components will not work, as multiple instances will get clubbed into a single instance.

Evaluation Metrics: Like prior approaches [1], [8], [17] in VLN literature, we use the gold standard Success Rate metric, also known as the Task Completion metric, to measure the success ratio for the navigation task. We compute the Success Rate metric through human and automatic evaluations. For automatic evaluation, we use the ground truth environment map and compute the Success Rate using a pre-defined heuristic where the navigation sub-goal is considered successful if we stop within a threshold distance of the ground truth object. For human evaluation, we verify if the agent ends up in a position desired according to the query.

B. Evaluation Results

In this section, we perform quantitative and qualitative comparisons of SI Maps against VLMaps and VLMaps with connected components. We compare the performance of each scene representation for the downstream language-based navigation task using the Success Rate in Table I. We use the same function primitives for all the methods. We include human evaluation because, for a few queries, the agent ended up close to the target object without completing the task in the desired way.

TABLE I: SI Maps outperform other baseline methods by significantly large margins on the Success Rate metric. The best values are 0.80 for human evaluation (SI Maps, K=5) and 0.88 for automatic evaluation (SI Maps, both settings of K).

                     Success Rate
Method               Human Evaluation    Automatic Evaluation
VL Maps              0.24                0.46
VL Maps with CC      0.34                0.48
SI Maps (K=5)        0.80                0.88
SI Maps (K=9)        0.76                0.88

We observe that SI Maps exhibit a large improvement in performance compared to other approaches. On human evaluation, SI Maps achieve a success rate of 0.80 compared to the 0.24 obtained by VLMaps, a 233% relative increase (more than threefold), demonstrating a substantial leap in instance-specific goal navigation. Since VLMaps only contain semantic information, they fail on queries that refer to specific instances of an object, like “navigate to the second counter”. Our logical baseline, VLMaps with connected components, can handle some instance-specific queries, resulting in a gain of 10 percentage points on human evaluation over vanilla VLMaps (0.34 versus 0.24). However, the success of this method is observed in scenes where neighboring instances of the same object have ample room between them. In contrast, real-life environments such as offices, restaurants, and hospitals often have objects in close proximity to each other. In these cases, instance-level information is essential for distinguishing between neighboring objects. SI Maps demonstrate robustness to object placement in the environment by directly utilizing the instance-level information provided by the instance segmentation model during the occupancy grid creation.

C. Qualitative Results

Fig. 5: The above figure shows the agent in different scenes in a simulated environment with three different queries. Images on the top show the RGB top-down view map, along with the segmented goal object instance. The corresponding images on the bottom represent the path taken by the agent to reach the desired object from the initial location.

In this section, we showcase qualitative examples of our approach for the vision language navigation task. The results are illustrated in Figure 5 with the corresponding A-star trajectory using SI Maps for navigation. SI Maps allow navigating to specific instances in the scene based on their relative distance with respect to other objects (left, center) and direction-based specification in the global map (right). The downstream navigation, as a consequence of SI Maps, is agnostic to the starting pose and orientation of the agent in the environment.

Fig. 6: Qualitative example of the instance-level semantics captured by different methods for all the chairs in the environment. SI Maps clearly localize the different instances in the map.

We also show qualitative comparisons of different methods on the quality of instance-level top-view maps in Figure 6 for different seating objects (chair, couch, sofa) in the simulated environment. Our approach effectively captures the instance-level semantics of objects in the environment: it detects 32 instances where 29 are present in the map, i.e., the 29 true instances plus 3 extra noisy segments. In contrast, the baseline of VLMaps with connected components detects 26 instances, but most of them are noisy segments, and it merges several separate instances (of the same object in close proximity) into a single instance. Our results are particularly strong in the middle region of the map, which corresponds to the dining area in the environment. Here, the chairs are in close proximity to each other, and the vanilla VLMaps approach fails when a particular instance of chair is queried. Similarly, applying connected components-based heuristics to separate instances is not enough, as the semantic segmentation masks of the chairs end up being connected with each other, resulting in multiple instances being merged.

The VLMaps-based approaches rely on alignment between per-pixel visual embeddings and linguistic feature embeddings, which can be sensitive to noise due to the unconstrained nature of the association. The benefit of our feature-embedding-free approach becomes evident as we directly constrain the occupancy grid creation with the instance segmentation masks. As a result, SI Maps have considerably less noise than derivative VLMaps approaches. Community detection further helps reduce noise by filtering out spurious communities formed due to noise, leading to a much cleaner map, which can also be observed in Figures 1 and 6.

V. CONCLUSION

In this study, we introduce a novel instance-focused scene representation for indoor settings, enabling language-based navigation across various environments. Our representation accommodates language commands that refer to specific instances within the environment. Furthermore, our map creation method is more memory-efficient, resulting in a 128-fold decrease in storage (about 16 megabytes versus about 2 gigabytes for a 1000×1000 map), as it does not rely on high-dimensional feature embeddings for visual and linguistic modalities. Additionally, our approach demonstrates robustness in relation to object placement in the environment and is less vulnerable to noise than previous methods. We showcase the practicality of the proposed SI Maps using success rate and panoptic quality metrics. Future research could investigate 3D instance segmentation techniques to incorporate instance-level semantics into the occupancy grid creation process directly.

ACKNOWLEDGEMENT

We acknowledge iHub-Data IIIT Hyderabad for their support to this work.

REFERENCES

[1] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), (London, UK), 2023.
[2] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler, “Open-vocabulary queryable scene representations for real world planning,” arXiv preprint arXiv:2209.09874, 2022.
[3] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359, 2013.
[4] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, “Fusion++: Volumetric object-level SLAM,” in 2018 International Conference on 3D Vision (3DV), pp. 32–41, IEEE, 2018.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.
[6] N. Gireesh, A. Agrawal, A. Datta, S. Banerjee, M. Sridharan, B. Bhowmick, and M. Krishna, “Sequence-agnostic multi-object navigation,” in IEEE International Conference on Robotics and Automation, 2023. (to be published).
[7] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” 2022.
[8] R. Schumann and S. Riezler, “Analyzing generalization of vision and language navigation to unseen outdoor areas,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Dublin, Ireland), pp. 7519–7532, Association for Computational Linguistics, May 2022.
[9] K. Nguyen, D. Dey, C. Brockett, and B. Dolan, “Vision-based navigation with language-based assistance via imitation learning with indirect intervention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12527–12537, 2019.
[10] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Neural modular control for embodied question answering,” in Conference on Robot Learning, pp. 53–62, PMLR, 2018.
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755, Springer, 2014.
[13] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” in International Conference on Learning Representations, 2022.
[14] OpenAI, “ChatGPT.” https://openai.com/blog/chatgpt.
[15] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3D: Learning from RGB-D data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.
[16] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., “Habitat: A platform for embodied AI research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9339–9347, 2019.
[17] K. Jain, V. Chhangani, A. Tiwari, K. M. Krishna, and V. Gandhi, “Ground then navigate: Language-guided navigation in dynamic scenes,” arXiv preprint arXiv:2209.11972, 2022.
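As an illustration of the occupancy-grid step in Section III-B (STEP 1 of Fig. 3), the following is a minimal sketch of how labelled points already transformed to the world frame could be binned into a top-down grid while retaining the label of the highest point per cell. The function name `build_semantic_grid`, the `origin` argument, and the cell-indexing convention are assumptions made for this sketch and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

# (x, y, height, semantic label) of a back-projected depth pixel in the world frame
Point = Tuple[float, float, float, str]
Cell = Tuple[int, int]


def build_semantic_grid(points: Iterable[Point], scale: float,
                        origin: Tuple[float, float] = (0.0, 0.0)) -> Dict[Cell, str]:
    """Bin labelled points into grid cells of side `scale` (metres per cell).

    When several points fall into the same cell, the label of the point with
    the greatest height is kept, mirroring the rule stated in Section III-B.
    """
    best_height: Dict[Cell, float] = {}
    labels: Dict[Cell, str] = {}
    for x, y, height, label in points:
        # Assumed convention: cell index = floor((coordinate - origin) / scale)
        cell = (int((x - origin[0]) // scale), int((y - origin[1]) // scale))
        if cell not in best_height or height > best_height[cell]:
            best_height[cell] = height
            labels[cell] = label
    return labels


if __name__ == "__main__":
    demo = [(1.2, 0.3, 0.0, "floor"), (1.3, 0.4, 0.45, "chair"), (3.1, 0.2, 0.75, "table")]
    print(build_semantic_grid(demo, scale=0.5))
```

The floor point and the chair point land in the same cell, so only the chair label survives; a real pipeline would obtain the points from depth, camera intrinsics, and pose rather than from a hand-written list.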
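The edge-weight rule of Section III-B (increase the weight between two grid cells by one whenever neighbouring pixels share the semantic label and the per-frame instance, then normalize by how often the cells were observed) can be sketched as follows. The paper does not spell out the exact normalization, so dividing by the mean observation count of the two endpoint cells is an assumption of this sketch, as are the `Frame` layout and the name `build_instance_graph`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Cell = Tuple[int, int]
# One frame: pixel -> (semantic label, per-frame instance id, projected grid cell)
Frame = Dict[Pixel, Tuple[str, int, Cell]]


def build_instance_graph(frames: List[Frame], target: str) -> Dict[Tuple[Cell, Cell], float]:
    """Accumulate and normalize edge weights between grid cells labelled `target`."""
    raw = defaultdict(float)   # un-normalized edge weights
    seen = defaultdict(int)    # number of frames in which each cell was observed
    for frame in frames:
        cells_in_frame = set()
        for (u, v), (label, inst, cell) in frame.items():
            if label != target:
                continue
            cells_in_frame.add(cell)
            # Right and down neighbours only, so each pixel pair is visited once
            for nb in ((u + 1, v), (u, v + 1)):
                if nb not in frame:
                    continue
                nb_label, nb_inst, nb_cell = frame[nb]
                if nb_label == target and nb_inst == inst and nb_cell != cell:
                    raw[tuple(sorted((cell, nb_cell)))] += 1.0
        for cell in cells_in_frame:
            seen[cell] += 1
    # Assumed normalization: divide by the mean observation count of both endpoints
    return {edge: w / (0.5 * (seen[edge[0]] + seen[edge[1]])) for edge, w in raw.items()}
```

With this normalization, an edge that is confirmed in every frame that observes its two cells ends up with a weight near one regardless of how many times the area was visited.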
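To make the argument for community detection over connected components concrete, the sketch below computes the standard modularity value, Q = Σ_c [ Σ_in(c) / 2m − (Σ_tot(c) / 2m)² ], for a toy graph of two chairs joined by one weak spurious edge. The graph, the node names, and the edge weights are invented for illustration; this is only the modularity objective, not a full Louvain implementation (library implementations such as NetworkX's `louvain_communities` exist for that).

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def modularity(edges: Dict[Tuple[Hashable, Hashable], float],
               community_of: Dict[Hashable, Hashable]) -> float:
    """Modularity of a partition of an undirected weighted graph without self-loops."""
    two_m = 2.0 * sum(edges.values())
    internal = defaultdict(float)  # intra-community weight, counted in both directions
    degree = defaultdict(float)    # total weighted degree of each community
    for (a, b), w in edges.items():
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 2.0 * w
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


# Two tightly connected groups of cells (two chairs) plus one weak spurious edge
edges = {
    ("a1", "a2"): 1.0, ("a2", "a3"): 1.0, ("a1", "a3"): 1.0,
    ("b1", "b2"): 1.0, ("b2", "b3"): 1.0, ("b1", "b3"): 1.0,
    ("a3", "b1"): 0.1,  # spurious link caused by an imperfect mask
}
one_component = {n: 0 for n in ("a1", "a2", "a3", "b1", "b2", "b3")}
two_instances = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1, "b3": 1}

if __name__ == "__main__":
    print(modularity(edges, one_component), modularity(edges, two_instances))
```

Connected components would report a single chair here, whereas the two-instance partition has clearly higher modularity, which is the partition a modularity-maximizing method is steered towards.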
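The post-processing rule of Section III-B (merge C1 into C2 if more than K% of the members of C1 are neighbors of some member of C2) might look like the sketch below. Treating "neighbor" as 8-connected grid adjacency and repeating the merge until no pair qualifies are assumptions of this sketch; the names `touches` and `merge_communities` are hypothetical.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def touches(cell: Cell, other: Set[Cell]) -> bool:
    """True if `cell` is 8-adjacent to at least one cell of `other`."""
    i, j = cell
    return any((i + di, j + dj) in other
               for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0))


def merge_communities(communities: List[Set[Cell]], k_percent: float) -> List[Set[Cell]]:
    """Repeatedly merge C1 into C2 while more than k_percent of C1 borders C2."""
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for a in range(len(merged)):
            for b in range(len(merged)):
                if a == b:
                    continue
                share = sum(touches(c, merged[b]) for c in merged[a]) / len(merged[a])
                if 100.0 * share > k_percent:
                    merged[b] |= merged[a]
                    del merged[a]
                    changed = True
                    break
            if changed:
                break  # indices shifted after deletion, so restart the scan
    return merged
```

A smaller K merges more aggressively, which fits the paper's description of K as a tunable hyper-parameter; the two values it reports are K=5 and K=9.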
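Section III-C mentions function primitives that take an instance parameter, for example navigating to the nth closest object. The paper does not list the primitives' signatures, so the sketch below uses a hypothetical helper, `nth_closest_instance`, and assumes instances are ranked by Euclidean distance from the robot's grid cell to each instance's centroid.

```python
import math
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def centroid(cells: List[Cell]) -> Tuple[float, float]:
    """Mean grid position of the cells belonging to one instance."""
    return (sum(c[0] for c in cells) / len(cells), sum(c[1] for c in cells) / len(cells))


def nth_closest_instance(instances: Dict[int, List[Cell]], robot_cell: Cell, n: int) -> int:
    """Return the id of the n-th closest instance (n = 1 is the closest).

    `instances` maps an instance id t to the grid cells carrying that id for one
    object class, i.e. the cells where the map stores <o, t>.
    """
    if not 1 <= n <= len(instances):
        raise ValueError(f"n must be between 1 and {len(instances)}")
    ranked = sorted(instances, key=lambda t: math.dist(centroid(instances[t]), robot_cell))
    return ranked[n - 1]


if __name__ == "__main__":
    chairs = {0: [(0, 5)], 1: [(0, 1), (0, 2)], 2: [(0, 9)]}
    print(nth_closest_instance(chairs, robot_cell=(0, 0), n=2))
```

Generated code in the style of Fig. 4 would then pass the returned instance id to a navigation primitive; that primitive and its planner are outside the scope of this sketch.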
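The storage figures quoted in Section III-B and the conclusion (about 2 gigabytes versus about 16 megabytes for a 1000×1000 map, a 128-fold decrease) can be reproduced with simple arithmetic under two assumptions that the paper does not state explicitly: the embedding map stores a 512-dimensional 32-bit float vector per cell, and the SI Map stores its two channels as 8-byte values.

```python
CELLS = 1000 * 1000            # 1000 x 1000 top-down grid

# Assumption: 512-dimensional float32 embedding per cell for the embedding-based map
EMBED_DIM, EMBED_BYTES = 512, 4
# Assumption: two 8-byte channels (object label o, instance id t) per cell for SI Maps
SI_CHANNELS, SI_BYTES = 2, 8

embedding_map_bytes = CELLS * EMBED_DIM * EMBED_BYTES
si_map_bytes = CELLS * SI_CHANNELS * SI_BYTES

if __name__ == "__main__":
    print(f"embedding map: {embedding_map_bytes / 1e9:.3f} GB")
    print(f"SI map:        {si_map_bytes / 1e6:.1f} MB")
    print(f"ratio:         {embedding_map_bytes / si_map_bytes:.0f}x")
```

Under these assumptions the ratio depends only on the per-cell payload (512 × 4 bytes against 2 × 8 bytes), so it stays the same for any map size.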