2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes
Shubham Dokania1, A. H. Abdul Hafez2, Anbumani Subramanian1,
Manmohan Chandraker3, C.V. Jawahar1
1IIIT Hyderabad, 2Hasan Kalyoncu University, 3UC San Diego
shubham.dokania@research.iiit.ac.in, abdul.hafez@hku.edu.tr,
anbumani@iiit.ac.in, mkchandraker@eng.ucsd.edu, jawahar@iiit.ac.in
Abstract
Autonomous driving and assistance systems rely on annotated data from traffic and road scenarios to model and learn the various object relations in complex real-world scenarios. Preparation and training of deployable deep learning architectures require the models to be suited to different traffic scenarios and to adapt to different situations. Existing datasets, while large-scale, lack such diversity and are geographically biased towards mainly developed cities. The unstructured and complex driving layout found in several developing countries such as India poses a challenge to these models due to the sheer degree of variation in object types, densities, and locations. To facilitate better research toward accommodating such scenarios, we build a new dataset, IDD-3D, which consists of multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames across various traffic scenarios. We discuss the need for this dataset through statistical comparisons with existing datasets and highlight benchmarks on standard 3D object detection and tracking tasks in complex layouts. Code and data are available at https://github.com/shubham1810/idd3d_kit.git.

Figure 1. Some examples from the dataset showing different traffic scenarios, LiDAR data with annotations, and a sample of LiDAR point clouds projected on camera data.

1. Introduction

Intelligent vehicles and autonomous driving systems have come a long way and keep becoming more sophisticated over time, owing to the rapid progress in deep learning and computer vision. However, the core component for all these improvements is the availability of high-quality annotated data. Recently, many works have focused on data selection and quality improvement [34, 8, 47], building high-quality and large-scale datasets, and approaches built using these resources, which improve the state of autonomous driving [48, 16].

Existing datasets are usually collected in well-structured environments with proper traffic regulations and relatively evenly distributed traffic. In such situations, crowd behavior demonstrates low diversity and average densities. In South Asian countries such as India, the traffic densities and inter-object behaviors are much more complex. Such complexities have been studied in the past [39, 5, 4], but extensive data coverage and multi-modal systems are still unavailable for such scenes. Existing work hence may not apply fully to cases where the distribution of object categories and types varies greatly.

In this paper, we propose a dataset on complex unstructured driving scenarios with multi-modal data, highlighting the capabilities of 3D sensors such as LiDAR for better scene perception in unstructured and sporadically chaotic traffic conditions. In the proposed dataset, we highlight a significantly different distribution of object types and categories compared to existing datasets collected in European or similar settings [24, 13, 38], due to the different nature of traffic scenes on Indian roads. Furthermore, the categories and annotations available in the proposed dataset vary greatly from existing datasets. Specifically, they cover objects in scenes that usually appear in still-developing
2642-9381/23/$31.00 ©2023 IEEE 4471
DOI 10.1109/WACV56688.2023.00446
Authorized licensed use limited to: Hasan Kalyoncu Universitesi. Downloaded on August 11,2023 at 12:33:15 UTC from IEEE Xplore.  Restrictions apply. 
2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) | 978-1-6654-9346-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/WACV56688.2023.00446
Figure 2. Samples from the dataset highlighting different (a) RGB images and (b) LiDAR Bird-Eye-View (BEV) along with bounding box
annotations. The samples visualized above are taken from different sequences of the dataset.
cities, for example, auto-rickshaws, hand carts, concrete mixer machines, and animals on roads.

We provide data collected in Indian road scenes from high-quality LiDAR sensors and six cameras that cover the surrounding area of the ego-vehicle to enable sensor-fusion-based applications. We provide annotations for 15.5k frames in the dataset, spanning 10 primary categories, which we use for model training and evaluation, and 7 additional miscellaneous categories. Along with the annotations, we also provide extra unlabelled raw data from the sensors to facilitate further research, especially into self- and unsupervised learning over such traffic scenes. A unique feature of the proposed dataset, which stems from the unstructured environment, is the availability of highly complex trajectories. We show samples from the dataset which emphasize such cases and present experiments on object detection and tracking, which are possible due to the availability of instance-specific labels for each object bounding box per sequence.

Our main contributions can be summarised as follows: (i) we propose the IDD-3D dataset for driving in unstructured traffic scenarios on Indian roads with 3D information, (ii) we provide high-quality annotations for 3D object bounding boxes with 9DoF data, and instance IDs to enable tracking, (iii) we analyse highly unstructured and diverse environments to accentuate the usefulness of the proposed dataset, and (iv) we provide 3D object detection and tracking benchmarks across popular methods in the literature.

2. Related Work

Data plays a huge role in machine learning systems and, in this context, for autonomous vehicles and scene perception. There have been several efforts over the years to improve the state of available datasets and to increase the volume of high-quality, well-annotated data.

2D Driving: Early datasets for visual perception and driving scene understanding include CamVid [2] and Cityscapes [9, 10], which provide annotations for semantic segmentation and enable research in deeper scene understanding at pixel level. The KITTI [14, 15] dataset provided 2D object annotations for detection and tracking along with segmentation data. However, fusion of multiple modalities such as 3D LiDAR data enhances the performance on scene understanding benchmarks, as these provide a higher level of detail of a scene when combined with available 2D data. This multi-modal sensor-fusion based direction has been the motivation for the proposed dataset to alleviate the discrepancies in existing datasets for scene
Dataset             3D Scenes  Cameras  Lidar  Images  Classes    3D Boxes  Traffic Diversity
KITTI [15]          15k        2        yes    15k     3          80k       Low
nuScenes [3]        40k        6        yes    1.4M    23         1.4M      Mid
Apolloscape [17]    20k        6        yes    0       6          475k      Low
KAIST [7]           8.9k       2        yes    8.9k    3          0         Low
Waymo Open [38]     230k       5        yes    1M      4          12M       Mid
ONCE [26]           1M (16k)   7        yes    7M      5          417k      Mid
Cityscapes-3D [13]  20k        -        no     490k    8          -         Low
A* 3D [29]          39k        1        yes    39k     7          230k      Mid
Ours                15.5k*     6        yes    93k     10 (17**)  223k*     High

Table 1. A comparison with existing popular 3D autonomous driving datasets. Our dataset showcases the highest diversity with the highest average number of bounding boxes per frame and a wide distribution. The statistical distribution is further studied in the following sections. (*) Number reported on train-val-test set, experiments/statistics reported on train-val set. (**) The 17 classes are the total of the 10 primary and 7 additional classes.
Figure 3. Distribution of class labels in the proposed dataset. (a) The primary 10 classes are shown here along with the 3 super-categories
(Vehicle, Pedestrian, and Rider) which are considered to make the proposed dataset more consistent with labels from existing datasets.
(b) The additional 7 classes annotated in the dataset are shown in log-scale separately since they are currently not used for training the
models. The Rider class covers both riders and non-riders on two-wheeler motor vehicles. We do not consider the Miscellaneous classes
for evaluation of the dataset currently.
perception and autonomous driving.

Driving Datasets: Recent datasets such as nuScenes [3], Argoverse [5], and Argoverse 2 [42] provide HD maps for road scenes. This allows for improved perception and planning capabilities and for the construction of better metrics for object detection, such as in [38]. These large-scale datasets cover a variety of scenes and traffic densities and have enabled systems with high safety regulations in the area of driver assistance and autonomous driving. However, the drawback of a majority of these datasets is that the collection happens in well-developed cities with clear and structured traffic flows. The proposed dataset bridges the gaps of varying environments by introducing more complex environments and extending the diversity of driving datasets.

Complex environments: There have been multiple efforts to build datasets for difficult environments such as variations in extreme weather [30, 35], night-time driving conditions [11], and safety-critical scenarios [1]. Recent works make use of different sensors, such as fisheye lenses to cover a larger area around the ego-vehicle [46, 24] and event cameras [33] for training models with faster reaction times. However, most of these datasets have been collected in environments with little to no change in the traffic patterns and with consistent background objects. Some works in literature
Figure 4. Samples of scenes of interest in our dataset (LiDAR and RGB samples) which especially differentiate our proposed dataset
from those available in literature. (Clockwise from top-left) (a) Complex traffic scenarios with vehicles orientations in a wide variety of
directions, (b) Perspective view of a scene with ego-vehicle on elevated flyover with ground level visible and another highway over the
vehicle path with pillars, (c) humans in the middle of traffic (shown in red boxes) and jaywalking near moving vehicles, resulting in a
safety critical scenario, (d) An example with very high density traffic scenario. Such cases are abundant in the proposed dataset (rather than
special cases when compared to other popular datasets) and hence require special attention for such unstructured environments. Refer to
supplementary material for more examples.
[39, 36, 19, 41] explore such situations where the label distributions can vary significantly; however, these are either limited to mostly 2D modalities or to off-road environments. In this work, the proposed dataset enhances the availability of data for enabling research on autonomous driving in unconstrained traffic environments.

Object Detection and Tracking: Several popular methods have been explored in recent literature which handle the task of 3D object detection for driving scenarios [49, 44, 43, 21, 45]. In our work, we specifically focus on 3D object detection from point clouds, while we do note the effectiveness of multi-modal approaches as well [6, 31, 37]. We have used approaches such as SECOND [43], which voxelizes the input point cloud and applies 3D convolution, leading to discrete geometric representations of the data. The CenterPoint [45] approach, which assigns centers, is known to perform well for smaller objects due to the fine level of detail for each point feature. We also explore PointPillars [21] for an analysis of pillar-based approaches, where the data is projected to Bird-Eye-View mode and then treated as an image. We highlight the performance of each in the experiments section and draw our inferences specific to the proposed dataset.

Many methods have been proposed towards 3D Multi-Object Tracking (MOT) in literature which have been
Figure 5. Figure showing (a) sensors on the vehicle (cameras, LiDAR) and their respective orientations, (b) image of the vehicle used along
with the sensor rig. Please note that the real-world car image has been edited to preserve anonymity.
shown to perform well across a multitude of datasets in different scenarios. There are various ways to model the tracking task, such as using the Bird-Eye-View [25], approaches based on multi-sensor fusion [20], and simple tracking based on distance metrics and methods like the Kalman filter [28]. In this work, we utilise the method presented in [28] using the detections from our trained models on IDD-3D and present the evaluations based on popular MOT metrics such as the ones presented in [40].

Figure 6. Class-wise distribution of some common prominent classes (Car, MotorcycleRider, Pedestrian, TourCar) with respect to number of frames, visualized to show traffic and crowd density in the proposed dataset. Distributions for all classes are shown in the supplementary material.

Figure 7. (a) Distribution showing distances of all bounding boxes from the ego-vehicle. The short distance of vehicles and pedestrians provides motivation for the proposed dataset to facilitate modeling of shorter reaction times. (b) Cumulative distribution of the distances further highlights the differences in distance distributions, showing that most of the objects in the proposed dataset are close to the ego-vehicle compared to existing popular datasets.

Figure 8. Distributions of number of bounding boxes per LiDAR frame. The number of objects in a scene is usually higher in the frames present in the proposed dataset. We filter the boxes to those at a distance of less than 30m, based on the data shown in Fig. 6. (a) shows statistics with the KITTI dataset, and (b) shows the same without the KITTI dataset to highlight the sparsity in the KITTI dataset. We note the heavier tail of our distribution, indicating a greater density of objects close to the ego-vehicle.
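The kind of statistics shown in Figures 7 and 8 can be illustrated with a short sketch. This is not the authors' analysis code: it assumes each frame is given as a list of box centers (x, y) in metres in the ego-vehicle frame, measures distance in the ground plane, and uses hypothetical function names and made-up toy data.

```python
import math
from bisect import bisect_right


def planar_distance(x, y):
    """Ground-plane distance of a box center from the ego-vehicle (assumed metric)."""
    return math.hypot(x, y)


def boxes_per_frame(frames, max_range=30.0):
    """Count the boxes closer than max_range metres in each frame."""
    return [
        sum(1 for (x, y) in frame if planar_distance(x, y) < max_range)
        for frame in frames
    ]


def cumulative_fraction(distances, thresholds):
    """Empirical CDF: fraction of boxes at or below each distance threshold."""
    ordered = sorted(distances)
    return [bisect_right(ordered, t) / len(ordered) for t in thresholds]


# Toy data (made up for illustration): two frames of (x, y) box centers.
frames = [
    [(3.0, 4.0), (5.0, 12.0), (30.0, 40.0)],  # distances 5, 13 and 50 m
    [(0.0, 12.0), (20.0, 21.0)],              # distances 12 and 29 m
]

counts = boxes_per_frame(frames)
distances = [planar_distance(x, y) for frame in frames for (x, y) in frame]
cdf = cumulative_fraction(distances, thresholds=[10.0, 25.0, 30.0])

print(counts)  # boxes within 30 m, per frame
print(cdf)     # fraction of boxes within 10 m, 25 m and 30 m
```

A histogram of `counts` over all frames corresponds to the per-frame view of Figure 8, and `cdf` evaluated on a fine grid of thresholds corresponds to the cumulative view of Figure 7(b).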
3. Proposed Dataset

In the following sections we discuss and highlight the qualities of the proposed dataset, including the design choices and method for data collection, annotations, and analysis of the dataset over interesting scenarios.

3.1. Data Acquisition

The data collection for the proposed dataset was covered in two driving sessions with over 5 hours of collected data during daytime. Afterwards, we manually sampled scenes of interest in sequences of 100 frames at 10fps, making 150 sequences, each of 10s. The data collection has been performed in different regions of Hyderabad, India. We now provide details about the configuration and data preparation.

Sensors (Hardware configuration): The proposed dataset encompasses data from multiple sensors, which include six RGB cameras and one LiDAR (Ouster OS1) sensor. The details about the sensors and data processing used are mentioned in Table 1 in the supplementary material. The position and orientation of the sensors on the acquisition vehicle are shown in Fig. 5 along with the real-world image of the vehicle.

Data processing: For each driving sequence, all calibrations are performed through popular methods such as [18, 27]. We preserve the raw data from the sensors in rosbag format [32]. The current release of the dataset consists of 15.5k total frames, out of which 12k frames are from the train-val set.

Data Privacy: We ensure that all the faces and license plates in the dataset are blurred by first using automated approaches (such as [12, 22]) and then performing a manual quality inspection. For the automated approaches, we run the object detection pipeline and then perform an NMS-based matching to find any missing boxes in between frames. The missing boxes are interpolated, and finally, we blur the regions in the images for data protection.

3.2. Dataset Analysis

Labels and Annotations: We provide 3D bounding box annotations for 15.5k (train-val-test) LiDAR frames with 223k 3D bounding boxes. We have used the annotation tool [23] for labeling data across 17 categories, shown in Fig. 3. Each object in a sequence has a unique ID, which enables tracking and re-identification. Furthermore, we provide the class-specific object distribution based on number of frames for some of the prominent categories in Fig. 6. We note that out of the 17 available classes (primary and additional), we currently use the 10 primary classes for training, and validation is performed on the 10 classes and 3 super-classes (Vehicle, Pedestrian, and Rider).

Data Statistics: We first highlight the bounding box distance distribution in IDD-3D and the comparison with existing popular datasets [3, 15, 26] in Fig. 7. In Fig. 7(a) we show that most of the annotations in IDD-3D are close to the ego-vehicle, because the small gaps between vehicles occlude LiDAR rays at longer distances. Nonetheless, it is crucial to highlight this feature of the proposed dataset because split-second decisions are important for safety, especially when other objects are close to the ego-vehicle. We also show better data density compared to KITTI, which is on a comparable scale to IDD-3D. Additionally, it can be seen that in the range of 0-25m (where most of the proposed dataset's annotations exist), we show higher densities than both ONCE and nuScenes, as shown in Fig. 8(b).

Interesting cases: While existing datasets provide high diversity in the type of traffic scenarios, these are usually restricted to controlled and well-structured environments with only a few anomalies. In IDD-3D, we show a large amount of diversity in the situations and also highlight some cases which could be of interest for progress in driving behaviour modeling, such as the samples shown in Fig. 4. For example, we see safety-critical cases where multiple pedestrians are seen jaywalking while vehicles are on the roads. Existing datasets claim high-density traffic when there are 20-30 object bounding boxes in one frame, whereas in our samples we show 50-60 or more objects in the same frame, and in close proximity. Considering the different variations of scenes in the proposed dataset, the applications for surveillance, road safety, traffic quality, and crowd behaviour are immense, and the data patterns have the potential to differ substantially from those of other datasets.

4. Experiments and Benchmarks

We present an extensive analysis of IDD-3D with existing methods to highlight the diversity and usefulness of the data. We first discuss the experimental setup and then, based on the evaluations, report our understanding of the dataset properties and the behaviour of different approaches.

Proposed Dataset: We use 10 primary categories, which are highlighted in Fig. 3. However, since most datasets in literature ordinarily provide a few categories as common labels (for example, Car, Truck, and Van as Vehicle), we combine our class labels into three super-categories, namely Vehicle, Pedestrian, and Rider. The network architectures are trained on 10 categories (Car, Bus, Truck, Scooter, Van, Motorcycle, Pedestrian, MotorcycleRider, ScooterRider, TourCar). We transform the annotations to a simpler format for the 3D object detection task: a 7-dimensional vector (x, y, z, w, h, l, α), where (x, y, z) represent the object location, (w, h, l) represent the dimensions of the bounding box, and α represents the yaw angle.

3D Object detection: We discuss some of the popular datasets which have been considered for comparison with the proposed dataset and highlight their strengths and weaknesses in the complex setting of the presented driving scenarios. For fair comparison, we train network architectures proposed in [43, 45, 21] for 3D object detection and
SuperCategory  Category         CenterPoint  CenterPoint (nuScenes)  SECOND  SECOND (KITTI)  PointPillar
Vehicle        Car              65.28        66.97                   69.89   68.50           67.77
Vehicle        Bus              59.09        78.47                   59.12   49.69           43.70
Vehicle        Truck            68.79        72.18                   65.11   68.09           63.68
Vehicle        Van              9.58         12.71                   1.27    15.77           0.14
Vehicle        TourCar          76.94        77.40                   74.81   77.02           72.80
Pedestrian     Pedestrian       28.60        22.49                   19.54   23.74           22.72
Rider          Motorcycle       23.65        25.28                   21.69   22.79           16.97
Rider          Scooter          42.36        38.05                   26.98   23.73           16.81
Rider          MotorcycleRider  59.29        61.48                   53.39   48.90           46.52
Rider          ScooterRider     66.33        64.65                   52.27   50.62           41.60
mAP                             49.99        51.97                   44.31   44.89           39.27

Table 2. Results on IDD-3D with popular methods. We report AP scores across different categories on the validation set. This table shows the results on each training class. The scores are reported with different thresholds for each class (Vehicles @ 0.5, Rider @ 0.4, Pedestrian @ 0.3) and all objects are considered up to a distance of 30m. Please see the supplementary material for more details and the full table.
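The bookkeeping behind Tables 2 to 4 can be sketched in a few lines. The class grouping, thresholds and AP values below are copied from Table 2; everything else is an assumption made for illustration: the caption does not name the quantity being thresholded, the distance of a 7-dimensional box (x, y, z, w, h, l, α) is taken here as the ground-plane distance from the ego origin, the range boundaries are treated as half-open, and the function names are hypothetical rather than part of the released toolkit.

```python
import math

# Per-class AP on the validation set, copied from the CenterPoint column
# (no pre-training) of Table 2.
CENTERPOINT_AP = {
    "Car": 65.28, "Bus": 59.09, "Truck": 68.79, "Van": 9.58, "TourCar": 76.94,
    "Pedestrian": 28.60,
    "Motorcycle": 23.65, "Scooter": 42.36,
    "MotorcycleRider": 59.29, "ScooterRider": 66.33,
}

# Grouping of the 10 training classes into super-categories, as laid out in Table 2.
SUPER_CATEGORY = {
    "Car": "Vehicle", "Bus": "Vehicle", "Truck": "Vehicle",
    "Van": "Vehicle", "TourCar": "Vehicle",
    "Pedestrian": "Pedestrian",
    "Motorcycle": "Rider", "Scooter": "Rider",
    "MotorcycleRider": "Rider", "ScooterRider": "Rider",
}

# Per-super-category thresholds from the Table 2 caption.
THRESHOLD = {"Vehicle": 0.5, "Rider": 0.4, "Pedestrian": 0.3}

MAX_RANGE_M = 30.0  # objects farther away than this are not evaluated


def threshold_for(class_name):
    """Look up the evaluation threshold of a class via its super-category."""
    return THRESHOLD[SUPER_CATEGORY[class_name]]


def distance_range(box):
    """Map a 7-D box (x, y, z, w, h, l, yaw) to a distance range of Tables 3 and 4.

    Assumption: planar distance from the ego origin, half-open boundaries.
    Returns None for boxes beyond the 30 m evaluation range.
    """
    r = math.hypot(box[0], box[1])
    if r > MAX_RANGE_M:
        return None
    if r < 10.0:
        return "0-10m"
    if r < 25.0:
        return "10-25m"
    return ">25m"


def mean_ap(ap_by_class):
    """mAP as the unweighted mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


box = (12.0, 16.0, 0.4, 1.8, 1.5, 4.2, 0.3)  # made-up box, 20 m from the ego-vehicle
print(distance_range(box))                # 10-25m
print(threshold_for("ScooterRider"))      # 0.4
print(round(mean_ap(CENTERPOINT_AP), 2))  # 49.99, the mAP entry of this column
```

Averaging the ten CenterPoint entries reproduces the 49.99 reported in the mAP row, which indicates that the row is an unweighted mean over the training classes.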
                            Vehicle                           Rider
Approach     Pre-Training   Overall  0-10m  10-25m  >25m      Overall  0-10m  10-25m  >25m
CenterPoint  nuScenes       73.85    87.57  70.98   30.48     71.03    84.24  69.54   23.42
CenterPoint  -              71.20    88.84  67.62   26.32     69.51    83.66  67.49   19.76
SECOND       KITTI          72.51    88.60  68.99   28.07     71.60    83.25  70.98   24.32
SECOND       -              73.01    88.71  67.82   29.46     72.05    85.44  70.89   26.28
PointPillar  -              68.61    87.64  64.59   26.30     69.66    82.56  68.60   25.64

Table 3. Experimental results on proposed dataset with different popular methods. We report AP scores across different categories on the validation set. This table shows the results on Vehicle and Rider categories from the proposed dataset.
show the results in Tables 2, 3 and 4. We report AP scores for the three combined categories (Vehicle, Pedestrian, and Rider) in Tables 3 and 4 over four distance ranges, and the overall AP score for each training class in Table 2. The scores reported in Table 2 are for a distance of up to 30m, and the distances for the super-classes are divided into up to 30m (denoted as Overall), 0-10m, 10-25m, and 25+m. The small distance buckets are considered due to the data distribution (as shown in Fig. 7) in the proposed dataset.

3D Object Tracking: A notable property of the proposed dataset is the existence of instance IDs for each 3D bounding box. In this work, we also show results on 3D object tracking and report important metrics such as AMOTA and AMOTP [40] in Table 5. We use SimpleTrack [28] for the task of object tracking and report the results based on the detections from CenterPoint [45] due to its highest mAP score on the detection task. The MOT scores are reported for all 10 primary classes and the overall categories.

Datasets: We use the KITTI [15, 14] and nuScenes [3] datasets for pre-training of 3D object detection methods, which we further fine-tune on our proposed dataset. We note that cross-dataset training may not be fruitful in this scenario given the significantly different distribution of the categories and input data in the given datasets. The existing datasets usually utilise information such as LiDAR intensity, elongation, and timestamp information as input to the model, which is different from the proposed dataset. However, considering the wide research available based on these datasets, it is imperative that we highlight how using the existing models trained on these datasets as pre-training backbones usually enhances performance. For this purpose, we consider using the models [43, 45] for pre-training by using the weights for the common layers and fine-tuning for better performance.

                            Pedestrian                        mAP
Approach     Pre-Training   Overall  0-10m  10-25m  >25m      Overall  0-10m  10-25m  >25m
CenterPoint  nuScenes       22.49    33.85  19.47   4.48      55.79    68.56  53.33   19.46
CenterPoint  -              28.60    44.89  24.39   3.48      56.43    72.46  53.17   16.52
SECOND       KITTI          23.74    33.67  21.05   5.58      55.95    68.51  53.67   19.32
SECOND       -              19.54    27.18  17.61   6.44      54.87    67.11  52.11   20.73
PointPillar  -              22.72    29.34  20.45   5.45      53.66    66.52  51.21   19.13

Table 4. Experimental results (continued) on proposed dataset with different popular methods. We report AP scores across different categories on the validation set. This table shows the results on Pedestrian category and the mAP score from the proposed dataset.

Category         AMOTA  AMOTP  Recall  MOTAR  MOTP   MOTA   lgd     tid    faf
Bus              0.831  0.679  0.812   0.907  0.589  0.736  3.045   2.659  13.805
Car              0.641  0.726  0.667   0.787  0.518  0.521  3.422   2.035  44.806
Motorcycle       0.202  0.826  0.242   0.941  0.356  0.228  2.000   2.000  2.321
MotorcycleRider  0.507  0.735  0.496   0.801  0.320  0.390  5.027   2.585  36.410
Pedestrian       0.254  0.912  0.319   0.737  0.363  0.225  9.918   6.731  34.557
Scooter          0.250  0.494  0.323   1.000  0.092  0.323  0.000   0.000  0.000
ScooterRider     0.540  0.536  0.581   0.742  0.258  0.427  3.868   2.274  35.251
TourCar          0.796  0.433  0.848   0.821  0.351  0.692  2.877   1.034  48.866
Truck            0.701  0.635  0.675   0.903  0.403  0.607  5.108   2.676  17.796
Van              0.000  1.677  0.275   0.000  0.563  0.000  14.500  0.000  75.163
Overall          0.472  0.765  0.524   0.764  0.381  0.415  4.977   2.199  30.898

Table 5. Experimental results for 3D object tracking for the 10 primary classes present in the proposed dataset. We use SimpleTrack [28] for the task of tracking using detections from CenterPoint [45] in the presented table. For the ablation study, please refer to the supplementary material.

Result Analysis: We note that the performance of the architectures for both the 10 categories and the 3 super-categories is consistent and aligns with our claims. It is clear that the number of annotated instances plays a major role in achieving better scores; for example, classes such as Car achieve a high AP compared to classes such as Van or Scooter. Another major factor appears to be the object size, wherein larger and denser objects are easier to model and detect compared to smaller instances. An example of the variation in scores based on size is the difference between the Pedestrian and Bus/Truck categories, even though the Pedestrian category has the most bounding box instances. From Table 2, we see that the CenterPoint approach generally performs better than SECOND or PointPillars on the proposed dataset. This could be due to the nature of the approach, where it deals directly with point clouds to predict object centers instead of voxelizing the points (SECOND) or projecting the points to BEV (PointPillars). We also provide results on the super-classes for the same architectures over different distance ranges and show similar performance for all approaches.

For the object tracking results presented in Table 5, we notice a correlation between the detection scores and tracking scores (AP and AMOTA/MOTA) for classes such as Pedestrian and Car. We highlight that the detection as well as tracking models perform adequately on the proposed dataset, achieving an overall AMOTA score of 0.472 (higher is better), while we also note that a similar configuration achieves an overall AMOTA of 0.668 on the nuScenes dataset (from the leaderboard). The complexity of the proposed dataset is especially highlighted in the results of the Pedestrian class, where the low scores reflect the complex motion present in the dataset. We provide further results in the supplementary section, along with the tracking results using SECOND and PointPillars models for completeness, with the corresponding visualizations.

5. Conclusion

In this work, we presented IDD-3D, a dataset for unstructured driving scenarios with complex road situations, together with thorough statistical and experimental analysis. Through this dataset, and its future releases, we aim to address the problem of generalizability across geographical locations and provide more diverse information in driving datasets and road scene analysis. We show interesting cases that cover a wide range of scenarios, including safety-critical situations which are frequent in several cities. We justify our claims for the proposed dataset through a set of experiments for 3D object detection and tracking using state-of-the-art approaches which were available as open-source implementations. Future work on the dataset shall extend these tasks to a vast number of applications, further enhancing the applicability of the proposed dataset to autonomous driving.

6. Acknowledgements

The project is funded by iHub-data and mobility at IIIT Hyderabad. The authors would like to acknowledge the support from Radha Krishna B towards data collection and annotation. We would also like to thank the Government of Telangana for the permissions, encouragement, and enabling this effort.
References

[1] Wentao Bao, Qi Yu, and Yu Kong. Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2682–2690, 2020.
[2] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi-modal dataset for autonomous driving. In CVPR, 2020.
[4] Rohan Chandra, Mridul Mahajan, Rahul Kala, Rishitha Palugulla, Chandrababu Naidu, Alok Jain, and Dinesh Manocha. Meteor: A massive dense & heterogeneous behavior dataset for autonomous driving. arXiv preprint arXiv:2109.07648, 2021.
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8748–8757, 2019.
[6] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
[7] Yukyung Choi, Namil Kim, Soonmin Hwang, Kibaek Park, Jae Shin Yoon, Kyounghwan An, and In So Kweon. Kaist multi-spectral day/night data set for autonomous and assisted driving. IEEE Transactions on Intelligent Transportation Systems, 19(3):934–948, 2018.
[8] Brendan Collins, Jia Deng, Kai Li, and Li Fei-Fei. Towards scalable dataset construction: An active learning approach. In European conference on computer vision, pages 86–98. Springer, 2008.
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[13] Nils Gählert, Nicolas Jourdan, Marius Cordts, Uwe Franke, and Joachim Denzler. Cityscapes 3d: Dataset and benchmark for 9 dof vehicle detection. arXiv preprint arXiv:2006.07864, 2020.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012.
[16] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
[17] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The apolloscape dataset for autonomous driving. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1067–10676, 2018.
[18] Ganesh Iyer, R Karnik Ram, J Krishna Murthy, and K Madhava Krishna. Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1110–1117. IEEE, 2018.
[19] Peng Jiang, Philip Osteen, Maggie Wigness, and Srikanth Saripalli. Rellis-3d dataset: Data, benchmarks and analysis. In 2021 IEEE international conference on robotics and automation (ICRA), pages 1110–1116. IEEE, 2021.
[20] Aleksandr Kim, Aljoša Ošep, and Laura Leal-Taixé. Eagermot: 3d multi-object tracking via sensor fusion. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 11315–11321. IEEE, 2021.
[21] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019.
[22] Rayson Laroca, Luiz A Zanlorensi, Gabriel R Gonçalves, Eduardo Todt, William Robson Schwartz, and David Menotti. An efficient and layout-independent automatic license plate recognition system based on the yolo detector. IET Intelligent Transport Systems, 15(4):483–503, 2021.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo [23] E Li, Shuaijun Wang, Chengyang Li, Dachuan Li, Xiangbin
Scharwächter, Markus Enzweiler, Rodrigo Benenson, Uwe Wu, and Qi Hao. Sustech points: A portable 3d point cloud
Franke, Stefan Roth, and Bernt Schiele. The cityscapes interactive annotation platform system. In 2020 IEEE In-
dataset. In CVPR Workshop on The Future of Datasets in telligent Vehicles Symposium (IV), pages 1108–1115. IEEE,
Vision, 2015. 2020.
[11] Dengxin Dai and Luc Van Gool. Dark model adaptation: [24] Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel
Semantic image segmentation from daytime to nighttime. dataset and benchmarks for urban scene understanding in 2d
In 2018 21st International Conference on Intelligent Trans- and 3d. arXiv preprint arXiv:2109.13410, 2021.
portation Systems (ITSC), pages 3819–3824. IEEE, 2018. [25] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang,
[12] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kot- Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-
sia, and Stefanos Zafeiriou. Retinaface: Single-shot multi- task multi-sensor fusion with unified bird’s-eye view repre-
level face localisation in the wild. In Proceedings of sentation. arXiv preprint arXiv:2205.13542, 2022.
the IEEE/CVF conference on computer vision and pattern [26] Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang,
recognition, pages 5203–5212, 2020. Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye,
Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. arXiv preprint arXiv:2106.11037, 2021.
[27] Gaurav Pandey, James R McBride, Silvio Savarese, and Ryan M Eustice. Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[28] Ziqi Pang, Zhichao Li, and Naiyan Wang. Simpletrack: Understanding and rethinking 3d multi-object tracking. arXiv preprint arXiv:2111.09621, 2021.
[29] Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, and Jie Lin. A*3d dataset: Towards autonomous driving in challenging environments. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 2267–2273. IEEE, 2020.
[30] Matthew Pitropov, Danson Evan Garcia, Jason Rebello, Michael Smart, Carlos Wang, Krzysztof Czarnecki, and Steven Waslander. Canadian adverse driving conditions dataset. The International Journal of Robotics Research, 40(4-5):681–690, 2021.
[31] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018.
[32] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, Andrew Y Ng, et al. Ros: an open-source robot operating system. In ICRA workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.
[33] Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019.
[34] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. ACM Computing Surveys (CSUR), 54(9):1–40, 2021.
[35] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10765–10775, 2021.
[36] Suvash Sharma, Lalitha Dabbiru, Tyler Hannis, George Mason, Daniel W Carruth, Matthew Doude, Chris Goodin, Christopher Hudson, Sam Ozier, John E Ball, et al. Cat: Cavs traversability dataset for off-road autonomous driving. IEEE Access, 10:24759–24768, 2022.
[37] Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. Mvx-net: Multimodal voxelnet for 3d object detection. In 2019 International Conference on Robotics and Automation (ICRA), pages 7276–7282. IEEE, 2019.
[38] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
[39] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and C.V. Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1743–1751, 2019.
[40] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. 3d multi-object tracking: A baseline and new evaluation metrics. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10359–10366. IEEE, 2020.
[41] Maggie Wigness, Sungmin Eum, John G Rogers, David Han, and Heesung Kwon. A rugd dataset for autonomous navigation and visual perception in unstructured outdoor environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5000–5007. IEEE, 2019.
[42] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. 2021.
[43] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[44] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
[45] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021.
[46] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O’Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9308–9318, 2019.
[47] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.
[48] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443–58469, 2020.
[49] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.