Modalities

We placed 12 cameras in the environment, collecting raw data including RGB images, depth images, semantic segmentation images, and camera poses. All 12 cameras produce pinhole images with 90$^\circ$ fields of view. They are arranged as two stereo sets of six cameras each, pointing in six directions to cover a 360$^\circ$ view (Figure~\ref{fig:sensors}). The raw data are processed to generate further modalities, including optical flow, fisheye images, LiDAR, and IMU.

Raw RGB image

For each time step, we provide 12 RGB images from the 12 cameras, covering a stereo set of 360$^\circ$ views. Each image is sampled with an FoV of 90$^\circ$ and a resolution of 640$\times$640. The stereo baseline is 0.25$m$. In addition, we provide tools for adding random noise and motion blur to the images to improve realism.
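
As an illustration of how the camera intrinsics follow from these settings, below is a minimal NumPy sketch (function and variable names are ours, not part of the released tooling): with a 90$^\circ$ FoV and a 640-pixel width, the focal length is $640 / (2\tan 45^\circ) = 320$ pixels.

\begin{verbatim}
import numpy as np

def pinhole_intrinsics(width=640, height=640, fov_deg=90.0):
    """Build the 3x3 intrinsic matrix of a square pinhole camera.

    With a 90-degree FoV and 640-pixel width, the focal length is
    width / (2 * tan(45 deg)) = 320 pixels.
    """
    f = width / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    cx, cy = width / 2.0, height / 2.0
    return np.array([[f, 0.0, cx],
                     [0.0, f, cy],
                     [0.0, 0.0, 1.0]])

K = pinhole_intrinsics()  # focal length = 320 px, principal point = (320, 320)
\end{verbatim}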

Raw depth image

The depth images are sampled with the same camera intrinsics and extrinsics as the RGB images, and are therefore perfectly aligned and synchronized with them. Each pixel of the depth image is represented as a float32 number. Depending on the environment setup, pixels that point at the sky usually return a very large value.
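
A minimal sketch of loading a depth frame and masking the far-away (sky) pixels is shown below. It assumes the depth is stored as a float32 array loadable with NumPy, and the cutoff value is illustrative rather than part of the dataset specification.

\begin{verbatim}
import numpy as np

def load_depth_with_sky_mask(path, max_depth=1000.0):
    """Load a float32 depth image and mask out far-away (sky) pixels.

    Pixels that see the sky return very large depth values; the
    max_depth cutoff here is illustrative and should be tuned per
    environment.
    """
    depth = np.load(path).astype(np.float32)   # H x W depth in meters
    valid = depth < max_depth                  # False where the ray hits the sky
    return depth, valid
\end{verbatim}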

Raw semantic segmentation

We provide category-level semantic labels. We overcome a drawback of AirSim, which assigns random semantic labels to each model type in the environment, by manually labeling the model types in all environments. As a result, each label in the semantic segmentation images maps to a semantic class. With 65 highly distinct environments, our data covers a wide range of semantic classes. However, because the data are collected as consecutive frames, large objects such as buildings, the ground, and the sky account for a much higher percentage of pixels. We provide statistics files for each environment, with which users can easily balance the data when training their semantic models.
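
As one way to use these statistics for balancing, the sketch below computes inverse-frequency class weights for a segmentation loss. It assumes the per-environment statistics are available as a mapping from class label to pixel count (a JSON file here, for illustration); the exact format shipped with the dataset may differ.

\begin{verbatim}
import json
import numpy as np

def class_weights_from_stats(stats_path, num_classes, eps=1e-6):
    """Compute inverse-frequency class weights for a segmentation loss.

    Assumes the statistics file is a JSON mapping {"class_id": pixel_count};
    the actual file format may differ.
    """
    with open(stats_path) as f:
        counts = json.load(f)
    freq = np.full(num_classes, eps, dtype=np.float64)
    for cls, count in counts.items():
        freq[int(cls)] += count
    freq /= freq.sum()
    weights = 1.0 / (freq + eps)
    return weights / weights.sum()   # rare classes receive larger weights
\end{verbatim}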

Raw camera pose

Camera poses are in the same format as V1. Each line of the pose file consists of 3 translation values and 4 orientation values in quaternion format, expressed in the NED frame.
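
A minimal parsing sketch is shown below. It assumes a whitespace-separated text file with one line per frame and a scalar-last quaternion ordering (qx, qy, qz, qw), matching the V1 convention; consult the dataset documentation if in doubt.

\begin{verbatim}
import numpy as np

def load_poses(pose_file):
    """Parse a pose file: one line per frame, 3 translation values
    followed by 4 quaternion values, expressed in the NED frame.
    Quaternion ordering is assumed scalar-last (qx, qy, qz, qw)."""
    data = np.loadtxt(pose_file)          # N x 7
    translations = data[:, 0:3]           # x (north), y (east), z (down)
    quaternions = data[:, 3:7]            # qx, qy, qz, qw
    return translations, quaternions
\end{verbatim}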

LiDAR

The LiDAR data is sampled from the raw depth images of the 6 left-side cameras, following the pattern of the Velodyne Puck (VLP-16). As a result, the LiDAR frame is perfectly aligned with the left camera frame. We do not use the AirSim LiDAR sensor because it relies on collision models and therefore misses many objects that lack one, such as branches and leaves. While sampling from the depth, we carefully balance accuracy and realism: we use linear interpolation everywhere except along object edges, to avoid ghost points there. We provide the processing script as well as other LiDAR models, such as the Velodyne Ultra Puck (VLP-32C), allowing users to create their own LiDAR data.
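
To make the scan pattern concrete, the sketch below generates VLP-16-style ray directions (16 elevation channels spaced 2$^\circ$ apart between $-15^\circ$ and $+15^\circ$). The azimuth step and the axis convention are illustrative assumptions, and the subsequent range lookup in the six depth images (with edge-aware interpolation) is omitted; the released script is the authoritative implementation.

\begin{verbatim}
import numpy as np

def vlp16_ray_directions(azimuth_step_deg=0.2):
    """Generate unit ray directions following a VLP-16-style pattern.

    16 elevation channels between -15 and +15 degrees, 2 degrees apart.
    The azimuth step and axis convention are illustrative only.
    """
    elevations = np.radians(np.arange(-15.0, 15.0 + 1e-6, 2.0))   # 16 channels
    azimuths = np.radians(np.arange(0.0, 360.0, azimuth_step_deg))
    el, az = np.meshgrid(elevations, azimuths, indexing="ij")
    dirs = np.stack([np.cos(el) * np.cos(az),   # x forward
                     np.cos(el) * np.sin(az),   # y right
                     np.sin(el)], axis=-1)      # z up
    return dirs.reshape(-1, 3)
\end{verbatim}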

Fisheye and Panorama

The fisheye and panorama data are sampled from the raw pinhole data and thus cover all three modalities of RGB, depth, and semantics. One of the biggest challenges for the fisheye modality is that real-world fisheye cameras have diverse FoVs and distortions. We take two steps to address this potential generalization issue. First, we define a standard model for fisheye images, the Linear Spherical model. To test real-world fisheye data with different intrinsics and distortions on a model trained on TartanAir-V2, one only needs to convert the real-world data into the Linear Spherical model. Second, we open-source our sampling code, together with a rich set of fisheye and pinhole camera models, which allows users to sample their own fisheye images (detailed in the \nameref{sec:costomize} section).
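
For orientation only, the sketch below shows a generic equidistant fisheye projection, in which the pixel radius grows linearly with the angle from the optical axis. This is a stand-in to illustrate the idea of a fixed reference model; it is not the exact Linear Spherical definition, which is specified in the released sampling code.

\begin{verbatim}
import numpy as np

def equidistant_fisheye_project(points_cam, f, cx, cy):
    """Project 3D points with a generic equidistant fisheye model
    (pixel radius r = f * theta). Illustrative stand-in only; the
    dataset's Linear Spherical model is defined in the released code."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    theta = np.arctan2(np.hypot(x, y), z)   # angle from the optical axis
    phi = np.arctan2(y, x)                  # azimuth around the axis
    r = f * theta                           # linear in theta
    u = cx + r * np.cos(phi)
    v = cy + r * np.sin(phi)
    return np.stack([u, v], axis=-1)
\end{verbatim}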


Optical flow

As in V1, the optical flow is calculated for static environments by image warping, using the camera poses and depth images. The biggest upgrades are a CUDA implementation that accelerates the computation and tools for generating optical flow across any pair of camera models (e.g., between pinhole and fisheye).
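
The warping idea is summarized in the NumPy sketch below for the pinhole case: each pixel of frame 1 is back-projected with its depth, transformed by the relative pose from frame 1 to frame 2, and re-projected; the flow is the pixel displacement. This is a simplified illustration, not the released CUDA tool, and it ignores occlusions and cross-model pairs.

\begin{verbatim}
import numpy as np

def flow_from_depth_and_pose(depth, K, R, t):
    """Dense optical flow for a static scene by warping.

    depth: H x W depth of frame 1; K: intrinsics; (R, t): relative pose
    from frame 1 to frame 2. Returns an H x W x 2 flow field (du, dv).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    rays = np.linalg.inv(K) @ pix              # back-project to unit-depth rays
    pts = rays * depth.reshape(1, -1)          # scale by depth
    pts2 = R @ pts + t.reshape(3, 1)           # move into frame 2
    proj = K @ pts2
    uv2 = proj[:2] / proj[2:3]                 # re-project to pixels
    return (uv2 - pix[:2]).T.reshape(h, w, 2)  # per-pixel displacement
\end{verbatim}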

IMU and noise model

The IMU ground-truth data is generated by interpolating the camera poses; as a result, the IMU frame is perfectly aligned and synchronized with the left camera data. Specifically, we fit splines to the poses, double-differentiating the translation to obtain acceleration and differentiating the orientation to obtain the angular rate. We provide code for customizing the data generation (such as changing the frequency) as well as code to add realistic noise.
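
A minimal SciPy sketch of this spline differentiation is shown below (function and argument names are ours). Expressing the signals in the body frame, adding gravity, and injecting sensor noise are omitted; the released code handles those steps.

\begin{verbatim}
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation, RotationSpline

def imu_groundtruth_from_poses(times, translations, quats_xyzw, imu_times):
    """Derive ground-truth IMU signals from camera poses.

    Translation is fit with a cubic spline and differentiated twice
    for linear acceleration; orientation is fit with a rotation
    spline and differentiated once for angular rate.
    """
    pos_spline = CubicSpline(times, translations)    # N x 3 positions
    accel_world = pos_spline(imu_times, 2)           # second derivative

    rot_spline = RotationSpline(times, Rotation.from_quat(quats_xyzw))
    angular_rate = rot_spline(imu_times, 1)          # rad/s

    return accel_world, angular_rate
\end{verbatim}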

Event Camera

Following the same trajectories as the other modalities, we recollect the front-facing camera data at 1000 Hz. We use ESIM~\cite{Rebecq18corl,Gehrig_2020_CVPR}, one of the fastest event camera simulators available, with performance close to the state of the art. Although newer simulators such as V2E and DVS Voltmeter produce events closer to those of real cameras, models trained on ESIM data perform comparably to those trained with the SoTA simulators~\cite{lin2022dvs}. We sample the 640$\times$640 RGB images at 1000 Hz and then generate the events using the simulator. To improve generalization across various event cameras, we use a wide range of contrast thresholds between 0.2 and 1.0. Because the event data is collected separately from the other modalities, some frames may be inconsistent with them in dynamic scenes.
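
To illustrate the role of the contrast threshold, the sketch below emits simplified events from two consecutive intensity frames whenever the log-intensity change exceeds the threshold. This is a crude per-frame approximation of what ESIM does with sub-frame interpolation, not the actual generation pipeline.

\begin{verbatim}
import numpy as np

def events_from_frames(prev_frame, frame, contrast_threshold=0.5, eps=1e-3):
    """Emit simplified events from two consecutive intensity frames.

    An event fires where the log-intensity change exceeds the contrast
    threshold; polarity is the sign of the change. Per-frame
    approximation for illustration only.
    """
    log_prev = np.log(prev_frame.astype(np.float64) + eps)
    log_cur = np.log(frame.astype(np.float64) + eps)
    diff = log_cur - log_prev
    ys, xs = np.nonzero(np.abs(diff) >= contrast_threshold)
    polarities = np.sign(diff[ys, xs]).astype(np.int8)
    return xs, ys, polarities
\end{verbatim}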

Occupancy Map

Occupancy grid maps are built during data collection, as detailed in the data collection pipeline section. The map resolution ranges from 0.125$m$ to 0.5$m$ depending on the size of the environment. The maps can be used to evaluate mapping algorithms.
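
As a small usage sketch, the code below maps world coordinates to voxel indices given a map origin and resolution, and queries occupancy from a boolean grid. The on-disk map format is not assumed here; the variable names are illustrative.

\begin{verbatim}
import numpy as np

def world_to_voxel(points, origin, resolution):
    """Map N x 3 world coordinates to occupancy-grid voxel indices.

    'origin' is the world position of voxel (0, 0, 0); 'resolution'
    is the voxel edge length (0.125-0.5 m depending on the environment).
    """
    return np.floor((points - origin) / resolution).astype(np.int64)

def occupied(grid, idx):
    """Look up voxel occupancy, treating out-of-bounds queries as free."""
    inside = np.all((idx >= 0) & (idx < np.array(grid.shape)), axis=-1)
    result = np.zeros(len(idx), dtype=bool)
    result[inside] = grid[tuple(idx[inside].T)]
    return result
\end{verbatim}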