2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
2642-9381/23/$31.00 ©2023 IEEE. DOI 10.1109/WACV56688.2023.00446

IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes

Shubham Dokania1, A. H. Abdul Hafez2, Anbumani Subramanian1, Manmohan Chandraker3, C.V. Jawahar1
1IIIT Hyderabad, 2Hasan Kalyoncu University, 3UC San Diego
shubham.dokania@research.iiit.ac.in, abdul.hafez@hku.edu.tr, anbumani@iiit.ac.in, mkchandraker@eng.ucsd.edu, jawahar@iiit.ac.in

Abstract

Autonomous driving and assistance systems rely on annotated data from traffic and road scenarios to model and learn the various object relations in complex real-world scenarios. Preparing and training deployable deep learning architectures requires the models to suit different traffic scenarios and adapt to different situations. Existing datasets, while large-scale, lack such diversity and are geographically biased towards mainly developed cities. The unstructured and complex driving layout found in several developing countries such as India poses a challenge to these models due to the sheer degree of variation in object types, densities, and locations. To facilitate better research toward accommodating such scenarios, we build a new dataset, IDD-3D, which consists of multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames across various traffic scenarios. We discuss the need for this dataset through statistical comparisons with existing datasets and highlight benchmarks on standard 3D object detection and tracking tasks in complex layouts. Code and data are available at https://github.com/shubham1810/idd3d_kit.git.

Figure 1. Some examples from the dataset showing different traffic scenarios, LiDAR data with annotations, and a sample of LiDAR point clouds projected on camera data.

1. Introduction

Intelligent vehicles and autonomous driving systems have come a long way and keep becoming more sophisticated over time, owing to the rapid progress in deep learning and computer vision. However, the core component for all these increments is the availability of high-quality annotated data. Recently, many works have focused on data selection and quality improvement [34, 8, 47], on building high-quality and large-scale datasets, and on approaches built using these resources, which improve the state of autonomous driving [48, 16].

Existing datasets are usually collected in well-structured environments with proper traffic regulations and relatively evenly distributed traffic. In such situations, crowd behavior demonstrates low diversity and average densities. In South Asian countries such as India, the traffic densities and inter-object behaviors are much more complex. Such complexities have been studied in the past [39, 5, 4], but extensive data coverage and multi-modal systems are still unavailable for such scenes. Existing approaches therefore may not apply fully to cases where the distribution of object categories and types varies greatly.

In this paper, we propose a dataset on complex unstructured driving scenarios with multi-modal data, highlighting the capabilities of 3D sensors such as LiDAR for better scene perception in unstructured and sporadically chaotic traffic conditions. In the proposed dataset, we highlight a significantly different distribution of object types and categories compared to existing datasets collected in European or similar settings [24, 13, 38], due to the different nature of traffic scenes on Indian roads. Furthermore, the categories and annotations available in the proposed dataset vary greatly from existing datasets. Specifically, they cover objects in scenes that usually appear in still-developing
cities, for example, auto-rickshaws, hand carts, concrete mixer machines on roads, and animals on roads.

We provide data collected in Indian road scenes from high-quality LiDAR sensors and six cameras that cover the surrounding area of the ego-vehicle to enable sensor-fusion-based applications. We provide annotations for 15.5k frames in the dataset, spanning 10 primary categories, which we use for model training and evaluation, and 7 additional miscellaneous categories. Along with the annotations, we also provide extra unlabelled raw data from the sensors to facilitate further research, especially into self- and unsupervised learning over such traffic scenes. A unique feature of the proposed dataset, which stems from the unstructured environment, is the availability of highly complex trajectories. We show samples from the dataset which emphasize such cases and present experiments on object detection and tracking, which are possible due to the availability of instance-specific labels for each object bounding box per sequence.

Our main contributions can be summarised as follows: (i) we propose the IDD-3D dataset for driving in unstructured traffic scenarios on Indian roads with 3D information; (ii) we provide high-quality annotations for 3D object bounding boxes with 9-DoF data, and instance IDs to enable tracking; (iii) we analyse highly unstructured and diverse environments to accentuate the usefulness of the proposed dataset; and (iv) we provide 3D object detection and tracking benchmarks across popular methods in the literature.

Figure 2. Samples from the dataset highlighting different (a) RGB images and (b) LiDAR Bird-Eye-View (BEV) along with bounding box annotations. The samples visualized above are taken from different sequences of the dataset.

2. Related Work

Data plays a huge role in machine learning systems and, in this context, for autonomous vehicles and scene perception. There have been several efforts over the years to improve the state of available datasets and to increase the volume of high-quality and well-annotated datasets.

2D Driving: Among the early datasets for visual perception and driving scene understanding are CamVid [2] and Cityscapes [9, 10], which provide annotations for semantic segmentation and enable research in deeper scene understanding at pixel level. The KITTI [14, 15] dataset provided 2D object annotations for detection and tracking along with segmentation data. However, fusion of multiple modalities such as 3D LiDAR data enhances the performance on scene understanding benchmarks, as these provide a higher level of detail of a scene when combined with available 2D data. This multi-modal, sensor-fusion-based direction has been the motivation for the proposed dataset, which aims to alleviate the discrepancies in existing datasets for scene perception and autonomous driving.

Table 1. A comparison with existing popular 3D autonomous driving datasets. Our dataset showcases the highest diversity with the highest average number of bounding boxes per frame and a wide distribution. The statistical distribution is further studied in the following sections. (*) Number reported on train-val-test set, experiments/statistics reported on train-val set. (**) The 17 classes are the total of the 10 primary and 7 additional classes.

Dataset             3D Scenes  Cameras  Lidar  Images  Classes    3D Boxes  Traffic Diversity
KITTI [15]          15k        2        yes    15k     3          80k       Low
nuScenes [3]        40k        6        yes    1.4M    23         1.4M      Mid
Apolloscape [17]    20k        6        yes    0       6          475k      Low
KAIST [7]           8.9k       2        yes    8.9k    3          0         Low
Waymo Open [38]     230k       5        yes    1M      4          12M       Mid
ONCE [26]           1M (16k)   7        yes    7M      5          417k      Mid
Cityscapes-3D [13]  20k        -        no     490k    8          -         Low
A* 3D [29]          39k        1        yes    39k     7          230k      Mid
Ours                15.5k*     6        yes    93k     10 (17**)  223k*     High

Figure 3. Distribution of class labels in the proposed dataset. (a) The primary 10 classes are shown here along with the 3 super-categories (Vehicle, Pedestrian, and Rider), which are considered to make the proposed dataset more consistent with labels from existing datasets. (b) The additional 7 classes annotated in the dataset are shown in log-scale separately since they are currently not used for training the models. The Rider super-category covers two-wheeler motor vehicles both with and without riders. We do not consider the Miscellaneous classes for evaluation of the dataset currently.

Driving Datasets: Recent datasets such as nuScenes [3], Argoverse [5], and Argoverse 2 [42] provide HD maps for road scenes. This allows for improved perception and planning capabilities and for the construction of better metrics for object detection, such as in [38]. These large-scale datasets cover a variety of scenes and traffic densities and have enabled systems with high safety regulations in the area of driver assistance and autonomous driving. However, the drawback of a majority of these datasets arises from the fact that the collection happens in well-developed cities with clear and structured traffic flows. The proposed dataset bridges the gaps of varying environments by introducing more complex environments and extending the diversity of driving datasets.

Complex environments: There have been multiple efforts to build datasets for difficult environments such as variations in extreme weather [30, 35], night-time driving conditions [11], and safety-critical scenarios [1]. Recent works make use of different sensors, such as fisheye lenses to cover a larger area around the ego-vehicle [46, 24] and event cameras [33] for training models with faster reaction times. However, most of these datasets have been collected in environments with little to no change in the traffic patterns and with consistent background objects. Some works in the literature [39, 36, 19, 41] explore situations where the label distributions can vary significantly; however, these are limited either to mostly 2D modalities or to off-road environments. In this work, the proposed dataset enhances the availability of data for enabling research on autonomous driving in unconstrained traffic environments.

Figure 4. Samples of scenes of interest in our dataset (LiDAR and RGB samples) which especially differentiate our proposed dataset from those available in the literature. (Clockwise from top-left) (a) Complex traffic scenarios with vehicle orientations in a wide variety of directions, (b) perspective view of a scene with the ego-vehicle on an elevated flyover with the ground level visible and another highway over the vehicle path with pillars, (c) humans in the middle of traffic (shown in red boxes) and jaywalking near moving vehicles, resulting in a safety-critical scenario, (d) an example of a very high density traffic scenario. Such cases are abundant in the proposed dataset (rather than being special cases, as in other popular datasets) and hence require special attention for such unstructured environments. Refer to the supplementary material for more examples.

Object Detection and Tracking: Several popular methods in recent literature handle the task of 3D object detection for driving scenarios [49, 44, 43, 21, 45]. In our work, we specifically address 3D object detection from point clouds, while we do note the effectiveness of multi-modal approaches as well [6, 31, 37]. We have used approaches such as SECOND [43], which voxelizes the input point cloud and applies 3D convolution, leading to discrete geometric representations of the data. The CenterPoint [45] approach, which assigns centers, is known to perform well for smaller objects due to the fine level of detail for each point feature. We also explore PointPillars [21] for an analysis of pillar-based approaches, where the data is projected to Bird-Eye-View mode and then treated as an image. We highlight the performance of each in the experiments section and draw our inferences specific to the proposed dataset.

Many methods have been proposed for 3D Multi-Object Tracking (MOT) in the literature and have been shown to perform well across a multitude of datasets in different scenarios. There are various ways to model the tracking task, such as using the Bird-Eye View [25], approaches based on multi-sensor fusion [20], and simple tracking based on distance metrics and methods like the Kalman filter [28]. In this work, we utilise the method presented in [28] using the detections from our trained models on IDD-3D and present the evaluations based on popular MOT metrics such as the ones presented in [40].

3. Proposed Dataset

In the following sections we discuss and highlight the qualities of the proposed dataset, including the design choices and method for data collection, annotations, and analysis of the dataset over interesting scenarios.

3.1. Data Acquisition

The data collection for the proposed dataset was covered in two driving sessions with over 5 hours of collected data during daytime. Afterwards, we manually sampled scenes of interest in sequences of 100 frames at 10 fps, making 150 sequences of 10 s each. The data collection was performed in different regions of Hyderabad, India. We now provide details about the configuration and data preparation.

Sensors (Hardware configuration): The proposed dataset encompasses data from multiple sensors, which include six RGB cameras and one LiDAR (Ouster OS1) sensor. The details about the sensors and data processing used are mentioned in Table 1 in the supplementary material. The position and orientation of the sensors on the acquisition vehicle are shown in Fig. 5 along with the real-world image of the vehicle.

Figure 5. (a) Sensors on the vehicle (cameras, LiDAR) and their respective orientations, (b) image of the vehicle used along with the sensor rig. Please note that the real-world car image has been edited to preserve anonymity.

Data processing: For each driving sequence, all calibrations are performed through popular methods such as [18, 27]. We preserve the raw data from the sensors in rosbag format [32]. The current release of the dataset consists of 15.5k total frames, out of which 12k frames are from the train-val set.

Data Privacy: We ensure that all the faces and license plates in the dataset are blurred by first using automated approaches (such as [12, 22]) and then performing a manual quality inspection. For the automated approaches, we run the object detection pipeline and then perform an NMS-based matching across frames to find any missed boxes. The missing boxes are interpolated, and finally, we blur the regions in the images for data protection.

3.2. Dataset Analysis

Labels and Annotations: We provide 3D bounding box annotations for 15.5k (train-val-test) LiDAR frames with 223k 3D bounding boxes. We have used the annotation tool [23] for labeling data across 17 categories, shown in Fig. 3. Each object in a sequence has a unique ID, which enables tracking and re-identification. Furthermore, we provide the class-specific object distribution based on the number of frames for some of the prominent categories in Fig. 6. We note that out of the 17 available classes (primary and additional), we currently use the 10 primary classes for training, and validation is performed on the 10 classes and 3 super-classes (Vehicle, Pedestrian, and Rider).

Figure 6. Class-wise distribution of some common prominent classes (Car, MotorcycleRider, Pedestrian, TourCar) with respect to the number of frames, visualized to show traffic and crowd density in the proposed dataset. Distributions for all classes are shown in the supplementary material.

Data Statistics: We first highlight the bounding box distance distribution in IDD-3D and the comparison with existing popular datasets [3, 15, 26] in Fig. 7. In Fig. 7(a) we show that most of the annotations in IDD-3D are close to the ego-vehicle. This is caused by the small gaps between vehicles, which occlude LiDAR rays at longer distances. Nonetheless, it is crucial to highlight this feature of the proposed dataset because split-second decisions are important for safety, especially when other objects are close to the ego-vehicle. We also show better data density compared to KITTI, which is on a comparable scale to IDD-3D. Additionally, it can be seen that in the range of 0-25 m (where most of the proposed dataset's annotations exist), we show higher densities than both ONCE and nuScenes, as shown in Fig. 8(b).

Figure 7. (a) Distribution showing distances of all bounding boxes from the ego-vehicle. The short distance of vehicles and pedestrians provides motivation for the proposed dataset to facilitate modeling of shorter reaction times. (b) Cumulative distribution of the distances further highlights the differences in distance distributions, showing that most of the objects in the proposed dataset are close to the ego-vehicle compared to existing popular datasets.

Figure 8. Distributions of the number of bounding boxes per LiDAR frame. The number of objects in a scene is usually higher in the frames present in the proposed dataset. We filter the boxes to a distance of less than 30 m based on the data shown in Fig. 6. (a) shows statistics with the KITTI dataset, and (b) shows the same without the KITTI dataset to highlight the sparsity in the KITTI dataset. We note the heavier tail of our distribution, indicating a greater density of objects close to the ego-vehicle.

Interesting cases: While existing datasets provide high diversity in the type of traffic scenarios, these are usually restricted to controlled and well-structured environments with only a few anomalies. In IDD-3D, we show a large amount of diversity in the situations and also highlight some cases which could be of interest for progress in driving behaviour modeling, such as the samples shown in Fig. 4. For example, we see safety-critical cases where multiple pedestrians are seen jaywalking while vehicles are on the road. Existing datasets claim high-density traffic when there are 20-30 object bounding boxes in one frame, whereas our samples show 50-60 or more objects in the same frame, in close proximity. Considering the different variations of scenes in the proposed dataset, the potential applications in surveillance, road safety, traffic quality, and crowd behaviour are numerous, and the data patterns are likely to differ from those in other datasets.

4. Experiments and Benchmarks

We present an extensive analysis of IDD-3D with existing methods to highlight the diversity and usefulness of the data. We first discuss the experimental setup and then, based on the evaluations, report our understanding of the dataset properties and the behaviour of different approaches.

Proposed Dataset: We use the 10 primary categories highlighted in Fig. 3. However, since most datasets in the literature ordinarily provide a few categories as common labels (for example, Car, Truck, and Van as Vehicle), we combine our class labels into three super-categories, namely Vehicle, Pedestrian, and Rider. The network architectures are trained on 10 categories (Car, Bus, Truck, Scooter, Van, Motorcycle, Pedestrian, MotorcycleRider, ScooterRider, TourCar). For the 3D object detection task, we transform the annotations to a simpler format, a 7-dimensional vector (x, y, z, w, h, l, α), where (x, y, z) represents the object location, (w, h, l) represents the dimensions of the bounding box, and α represents the yaw angle.

3D Object detection: We discuss some of the popular datasets which have been considered for comparison with the proposed dataset and highlight their strengths and weaknesses in the complex setting of the presented driving scenarios. For fair comparison, we train the network architectures proposed in [43, 45, 21] for 3D object detection.
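The label handling described under "Proposed Dataset" above can be sketched roughly as follows. This is a minimal illustration, not the IDD-3D kit's actual loader: the annotation dictionary keys (`class`, `position`, `size`, `yaw`, `track_id`) are hypothetical, the yaw wrapping and the planar distance from the sensor origin are assumptions, and the bucket edges simply mirror the 0-10 m, 10-25 m, and >25 m ranges reported in the paper. The super-category grouping follows the one laid out in Table 2.

```python
import math
from typing import Dict, List, Tuple

# Grouping of the 10 primary classes into 3 super-categories (as in Table 2).
SUPER_CATEGORY: Dict[str, str] = {
    "Car": "Vehicle", "Bus": "Vehicle", "Truck": "Vehicle",
    "Van": "Vehicle", "TourCar": "Vehicle",
    "Pedestrian": "Pedestrian",
    "Motorcycle": "Rider", "Scooter": "Rider",
    "MotorcycleRider": "Rider", "ScooterRider": "Rider",
}


def to_box7(ann: dict) -> Tuple[float, ...]:
    """Flatten a (hypothetical) annotation dict into (x, y, z, w, h, l, alpha)."""
    x, y, z = ann["position"]
    w, h, l = ann["size"]  # assumed order: width, height, length
    # Assumption: wrap the yaw angle into [-pi, pi).
    alpha = (ann["yaw"] + math.pi) % (2.0 * math.pi) - math.pi
    return (x, y, z, w, h, l, alpha)


def planar_distance(box7: Tuple[float, ...]) -> float:
    """Distance from the ego-vehicle in the ground plane (assumed sensor origin)."""
    return math.hypot(box7[0], box7[1])


def distance_bucket(d: float) -> str:
    """Assign a distance to one of the reported ranges (edge handling is assumed)."""
    if d <= 10.0:
        return "0-10m"
    if d <= 25.0:
        return "10-25m"
    return ">25m"


def prepare(annotations: List[dict], max_range: float = 30.0) -> List[dict]:
    """Keep primary-class boxes within max_range and attach derived labels."""
    out = []
    for ann in annotations:
        name = ann["class"]
        if name not in SUPER_CATEGORY:  # additional classes are not used
            continue
        box = to_box7(ann)
        dist = planar_distance(box)
        if dist > max_range:
            continue
        out.append({
            "box": box,
            "class": name,
            "super": SUPER_CATEGORY[name],
            "bucket": distance_bucket(dist),
            "track_id": ann["track_id"],
        })
    return out
```

Keeping the per-class name next to the super-category lets the same records serve both the 10-class and the 3-class evaluation, and the retained `track_id` is what a tracker would need for sequence-level metrics.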
We show the results in Tables 2, 3 and 4. We report AP scores for the 3 combined categories (Vehicle, Pedestrian, and Rider) at four distance levels in Tables 3 and 4, and the overall AP score for each training class in Table 2. The scores reported in Table 2 are for a distance up to 30 m in the dataset, and the distances for the super-classes are divided as up to 30 m (denoted as Overall), 0-10 m, 10-25 m, and beyond 25 m. The small distance buckets are considered due to the data distribution (as shown in Fig. 7) in the proposed dataset.

Table 2. Results on IDD-3D with popular methods. We report AP scores across different categories on the validation set. This table shows the results on each training class. The scores are reported with different thresholds for each class (Vehicle @ 0.5, Rider @ 0.4, Pedestrian @ 0.3) and all objects are considered up to a distance of 30 m. Please see the supplementary material for more details and the full table.

Super-category  Category         CenterPoint  CenterPoint (nuScenes)  SECOND  SECOND (KITTI)  PointPillar
Vehicle         Car              65.28        66.97                   69.89   68.50           67.77
Vehicle         Bus              59.09        78.47                   59.12   49.69           43.70
Vehicle         Truck            68.79        72.18                   65.11   68.09           63.68
Vehicle         Van              9.58         12.71                   1.27    15.77           0.14
Vehicle         TourCar          76.94        77.40                   74.81   77.02           72.80
Pedestrian      Pedestrian       28.60        22.49                   19.54   23.74           22.72
Rider           Motorcycle       23.65        25.28                   21.69   22.79           16.97
Rider           Scooter          42.36        38.05                   26.98   23.73           16.81
Rider           MotorcycleRider  59.29        61.48                   53.39   48.90           46.52
Rider           ScooterRider     66.33        64.65                   52.27   50.62           41.60
                mAP              49.99        51.97                   44.31   44.89           39.27

Table 3. Experimental results on the proposed dataset with different popular methods. We report AP scores across different categories on the validation set. This table shows the results on the Vehicle and Rider categories from the proposed dataset.

                            Vehicle                          Rider
Approach     Pre-Training   Overall  0-10m  10-25m  >25m     Overall  0-10m  10-25m  >25m
CenterPoint  nuScenes       73.85    87.57  70.98   30.48    71.03    84.24  69.54   23.42
CenterPoint  -              71.20    88.84  67.62   26.32    69.51    83.66  67.49   19.76
SECOND       KITTI          72.51    88.60  68.99   28.07    71.60    83.25  70.98   24.32
SECOND       -              73.01    88.71  67.82   29.46    72.05    85.44  70.89   26.28
PointPillar  -              68.61    87.64  64.59   26.30    69.66    82.56  68.60   25.64

Table 4. Experimental results (continued) on the proposed dataset with different popular methods. We report AP scores across different categories on the validation set. This table shows the results on the Pedestrian category and the mAP score from the proposed dataset.

                            Pedestrian                       mAP
Approach     Pre-Training   Overall  0-10m  10-25m  >25m     Overall  0-10m  10-25m  >25m
CenterPoint  nuScenes       22.49    33.85  19.47   4.48     55.79    68.56  53.33   19.46
CenterPoint  -              28.60    44.89  24.39   3.48     56.43    72.46  53.17   16.52
SECOND       KITTI          23.74    33.67  21.05   5.58     55.95    68.51  53.67   19.32
SECOND       -              19.54    27.18  17.61   6.44     54.87    67.11  52.11   20.73
PointPillar  -              22.72    29.34  20.45   5.45     53.66    66.52  51.21   19.13

3D Object Tracking: A notable property of the proposed dataset is the existence of instance IDs for each 3D bounding box. In this work, we also show results on 3D object tracking and report important metrics such as AMOTA and AMOTP [40] in Table 5. We use SimpleTrack [28] for the task of object tracking and report the results based on the detections from CenterPoint [45], since it has the highest mAP score on the detection task. The MOT scores are reported for all 10 primary classes and overall.

Table 5. Experimental results for 3D object tracking for the 10 primary classes present in the proposed dataset. We use SimpleTrack [28] for the task of tracking, using detections from CenterPoint [45] in the presented table. For the ablation study, please refer to the supplementary material.

Category         AMOTA  AMOTP  Recall  MOTAR  MOTP   MOTA   lgd     tid    faf
Bus              0.831  0.679  0.812   0.907  0.589  0.736  3.045   2.659  13.805
Car              0.641  0.726  0.667   0.787  0.518  0.521  3.422   2.035  44.806
Motorcycle       0.202  0.826  0.242   0.941  0.356  0.228  2.000   2.000  2.321
MotorcycleRider  0.507  0.735  0.496   0.801  0.320  0.390  5.027   2.585  36.410
Pedestrian       0.254  0.912  0.319   0.737  0.363  0.225  9.918   6.731  34.557
Scooter          0.250  0.494  0.323   1.000  0.092  0.323  0.000   0.000  0.000
ScooterRider     0.540  0.536  0.581   0.742  0.258  0.427  3.868   2.274  35.251
TourCar          0.796  0.433  0.848   0.821  0.351  0.692  2.877   1.034  48.866
Truck            0.701  0.635  0.675   0.903  0.403  0.607  5.108   2.676  17.796
Van              0.000  1.677  0.275   0.000  0.563  0.000  14.500  0.000  75.163
Overall          0.472  0.765  0.524   0.764  0.381  0.415  4.977   2.199  30.898

Datasets: We use the KITTI [15, 14] and nuScenes [3] datasets for pre-training the 3D object detection methods, which are then fine-tuned on our proposed dataset. We note that cross-dataset training may not be fruitful in this scenario given the significantly different distribution of the categories and input data in the given datasets. The existing datasets usually utilise information such as LiDAR intensity, elongation, and timestamps as input to the model, which differs from the proposed dataset. However, considering the wide research available based on these datasets, it is important to highlight how using existing models trained on these datasets as pre-trained backbones can enhance performance. For this purpose, we use the models [43, 45] for pre-training by taking the weights for the common layers and fine-tuning for better performance.

Result Analysis: We note that the performance of the architectures for both the 10 categories and the 3 super-categories is consistent and aligns with our claims. The number of annotated instances plays a major role in achieving better AP scores; for example, classes such as Car achieve a high AP compared to classes such as Van or Scooter. Another major factor appears to be the object size, wherein larger and denser objects are easier to model and detect compared to smaller instances. An example of the variation in AP scores based on size is the difference between the Pedestrian and Bus/Truck categories, even though the Pedestrian category has the largest number of bounding box instances. From Table 2, we see that the CenterPoint approach generally performs better than SECOND or PointPillars on the proposed dataset. This could be due to the nature of the approach, which deals directly with point clouds to predict object centers instead of voxelizing the points (SECOND) or projecting the points to BEV (PointPillars). We also provide results on the super-classes for the same architectures over different distance ranges and show similar performance for all approaches.

For the object tracking results presented in Table 5, we notice a correlation between the detection scores and tracking scores (AP and AMOTA/MOTA) for classes such as Pedestrian and Car. The detection and tracking models perform adequately on the proposed dataset, achieving an overall AMOTA score of 0.472 (higher is better), while a similar configuration achieves an overall AMOTA of 0.668 on the nuScenes dataset (from the leaderboard). The complexity of the proposed dataset is especially highlighted in the results for the Pedestrian class, where the low scores point to the complex motion present in the dataset. We provide further results in the supplementary section, along with the tracking results using the SECOND and PointPillars models for completeness, with the corresponding visualizations.

5. Conclusion

In this work, we presented IDD-3D, a dataset for unstructured driving scenarios with complex road situations, together with a thorough statistical and experimental analysis. Through this dataset, and the future release, we aim to address the problem of generalizability across geographical locations and provide more diverse information in driving datasets and road scene analysis. We show interesting cases which cover a wide range of situations, including some safety-critical situations which are frequent in several cities. We justify our claims for the proposed dataset through a set of experiments for 3D object detection and tracking using state-of-the-art approaches which were available as open-source implementations. Future work on the dataset shall extend these tasks to a larger number of applications, further enhancing the applicability of the proposed dataset to autonomous driving.

6. Acknowledgements

The project is funded by iHub-data and mobility at IIIT Hyderabad. The authors would like to acknowledge the support from Radha Krishna B towards data collection and annotation. We would also like to thank the Government of Telangana for the permissions, encouragement, and enabling this effort.

References

[1] Wentao Bao, Qi Yu, and Yu Kong. Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2682–2690, 2020.
[2] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[4] Rohan Chandra, Mridul Mahajan, Rahul Kala, Rishitha Palugulla, Chandrababu Naidu, Alok Jain, and Dinesh Manocha. Meteor: A massive dense & heterogeneous behavior dataset for autonomous driving. arXiv preprint arXiv:2109.07648, 2021.
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.
[6] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
[7] Yukyung Choi, Namil Kim, Soonmin Hwang, Kibaek Park, Jae Shin Yoon, Kyounghwan An, and In So Kweon. Kaist multi-spectral day/night data set for autonomous and assisted driving. IEEE Transactions on Intelligent Transportation Systems, 19(3):934–948, 2018.
[8] Brendan Collins, Jia Deng, Kai Li, and Li Fei-Fei. Towards scalable dataset construction: An active learning approach. In European conference on computer vision, pages 86–98. Springer, 2008.
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Scharwächter, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset. In CVPR Workshop on The Future of Datasets in Vision, 2015.
[11] Dengxin Dai and Luc Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3819–3824. IEEE, 2018.
[12] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020.
[13] Nils Gählert, Nicolas Jourdan, Marius Cordts, Uwe Franke, and Joachim Denzler. Cityscapes 3d: Dataset and benchmark for 9 dof vehicle detection. arXiv preprint arXiv:2006.07864, 2020.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012.
[16] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
[17] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The apolloscape dataset for autonomous driving. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1067–10676, 2018.
[18] Ganesh Iyer, R Karnik Ram, J Krishna Murthy, and K Madhava Krishna. Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1110–1117. IEEE, 2018.
[19] Peng Jiang, Philip Osteen, Maggie Wigness, and Srikanth Saripalli. Rellis-3d dataset: Data, benchmarks and analysis. In 2021 IEEE international conference on robotics and automation (ICRA), pages 1110–1116. IEEE, 2021.
[20] Aleksandr Kim, Aljoša Ošep, and Laura Leal-Taixé. Eagermot: 3d multi-object tracking via sensor fusion. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 11315–11321. IEEE, 2021.
[21] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
[22] Rayson Laroca, Luiz A Zanlorensi, Gabriel R Gonçalves, Eduardo Todt, William Robson Schwartz, and David Menotti. An efficient and layout-independent automatic license plate recognition system based on the yolo detector. IET Intelligent Transport Systems, 15(4):483–503, 2021.
[23] E Li, Shuaijun Wang, Chengyang Li, Dachuan Li, Xiangbin Wu, and Qi Hao. Sustech points: A portable 3d point cloud interactive annotation platform system. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 1108–1115. IEEE, 2020.
[24] Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. arXiv preprint arXiv:2109.13410, 2021.
[25] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. arXiv preprint arXiv:2205.13542, 2022.
[26] Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. arXiv preprint arXiv:2106.11037, 2021.
ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
[39] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, [27] Gaurav Pandey, James R McBride, Silvio Savarese, and Manmohan Chandraker, and C.V. Jawahar. Idd: A dataset Ryan M Eustice. Automatic targetless extrinsic calibration for exploring problems of autonomous navigation in uncon- of a 3d lidar and camera by maximizing mutual information. strained environments. In 2019 IEEE Winter Conference on In Twenty-Sixth AAAI Conference on Artificial Intelligence, Applications of Computer Vision (WACV), pages 1743–1751, 2012. 2019. [28] Ziqi Pang, Zhichao Li, and Naiyan Wang. Simpletrack: Un- [40] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. derstanding and rethinking 3d multi-object tracking. arXiv 3d multi-object tracking: A baseline and new evaluation met- preprint arXiv:2111.09621, 2021. rics. In 2020 IEEE/RSJ International Conference on Intelli- [29] Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh gent Robots and Systems (IROS), pages 10359–10366. IEEE, Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin 2020. Mustafa, Vijay Chandrasekhar, and Jie Lin. A*3d dataset: [41] Maggie Wigness, Sungmin Eum, John G Rogers, David Han, Towards autonomous driving in challenging environments. and Heesung Kwon. A rugd dataset for autonomous naviga- In 2020 IEEE International Conference on Robotics and Au- tion and visual perception in unstructured outdoor environ- tomation (ICRA), pages 2267–2273. IEEE, 2020. ments. In 2019 IEEE/RSJ International Conference on Intel- [30] Matthew Pitropov, Danson Evan Garcia, Jason Rebello, ligent Robots and Systems (IROS), pages 5000–5007. IEEE, Michael Smart, Carlos Wang, Krzysztof Czarnecki, and 2019. Steven Waslander. Canadian adverse driving conditions [42] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lam- dataset. The International Journal of Robotics Research, bert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Rat- 40(4-5):681–690, 2021. 
nesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, [31] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J et al. Argoverse 2: Next generation datasets for self-driving Guibas. Frustum pointnets for 3d object detection from rgb- perception and forecasting. 2021. d data. In Proceedings of the IEEE conference on computer [43] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embed- vision and pattern recognition, pages 918–927, 2018. ded convolutional detection. Sensors, 18(10):3337, 2018. [32] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, [44] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real- Tully Foote, Jeremy Leibs, Rob Wheeler, Andrew Y Ng, time 3d object detection from point clouds. In Proceedings of et al. Ros: an open-source robot operating system. In ICRA the IEEE conference on Computer Vision and Pattern Recog- workshop on open source software, volume 3, page 5. Kobe, nition, pages 7652–7660, 2018. Japan, 2009. [45] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center- [33] Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide based 3d object detection and tracking. In Proceedings of Scaramuzza. High speed and high dynamic range video with the IEEE/CVF conference on computer vision and pattern an event camera. IEEE transactions on pattern analysis and recognition, pages 11784–11793, 2021. machine intelligence, 43(6):1964–1980, 2019. [46] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh [34] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Sistu, Padraig Varley, Derek O’Dea, Michal Uricár, Ste- Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang. fan Milz, Martin Simon, Karl Amende, et al. Woodscape: A survey of deep active learning. ACM Computing Surveys A multi-task, multi-camera fisheye dataset for autonomous (CSUR), 54(9):1–40, 2021. driving. In Proceedings of the IEEE/CVF International Con- [35] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: ference on Computer Vision, pages 9308–9318, 2019. 
The adverse conditions dataset with correspondences for se- [47] Donggeun Yoo and In So Kweon. Learning loss for active mantic driving scene understanding. In Proceedings of the learning. In Proceedings of the IEEE/CVF Conference on IEEE/CVF International Conference on Computer Vision, Computer Vision and Pattern Recognition, pages 93–102, pages 10765–10775, 2021. 2019. [36] Suvash Sharma, Lalitha Dabbiru, Tyler Hannis, George Ma- [48] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda. A sur- son, Daniel W Carruth, Matthew Doude, Chris Goodin, vey of autonomous driving: Common practices and emerg- Christopher Hudson, Sam Ozier, John E Ball, et al. Cat: ing technologies. IEEE Access, 8:58443–58469, 2020. Cavs traversability dataset for off-road autonomous driving. [49] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning IEEE Access, 10:24759–24768, 2022. for point cloud based 3d object detection. In Proceedings of [37] Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. Mvx- the IEEE conference on computer vision and pattern recog- net: Multimodal voxelnet for 3d object detection. In 2019 In- nition, pages 4490–4499, 2018. ternational Conference on Robotics and Automation (ICRA), pages 7276–7282. IEEE, 2019. [38] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceed- 4480 Authorized licensed use limited to: Hasan Kalyoncu Universitesi. Downloaded on August 11,2023 at 12:33:15 UTC from IEEE Xplore. Restrictions apply.