B. Neural Network Architecture
The network used is the stacked hourglass architecture first presented by Chang et al. in the Pyramid Stereo Matching Network (PSMNet) [17]. This architecture was chosen for its ability to extract global context information and to infer features from noisy event-based images. The architecture is shown in Figure 3.
The stacked hourglass design allows the 3D CNN to regularize the cost volume [17].
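To make the design concrete, the sketch below shows one 3D hourglass block of the kind stacked in PSMNet, written in PyTorch. The layer widths, kernel sizes, and the module name Hourglass3D are illustrative assumptions for this sketch, not the exact configuration from [17].

```python
import torch
import torch.nn as nn

class Hourglass3D(nn.Module):
    """A minimal sketch of one 3D hourglass block in the spirit of the
    stacked-hourglass cost regularization of PSMNet [17]. Layer sizes
    are illustrative assumptions, not the paper's configuration."""
    def __init__(self, channels: int = 32):
        super().__init__()
        # Encoder: two strided 3D convolutions downsample the cost volume.
        self.down1 = nn.Sequential(
            nn.Conv3d(channels, channels * 2, 3, stride=2, padding=1),
            nn.BatchNorm3d(channels * 2), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(
            nn.Conv3d(channels * 2, channels * 2, 3, stride=2, padding=1),
            nn.BatchNorm3d(channels * 2), nn.ReLU(inplace=True))
        # Decoder: transposed convolutions restore the original resolution.
        self.up1 = nn.Sequential(
            nn.ConvTranspose3d(channels * 2, channels * 2, 3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm3d(channels * 2))
        self.up2 = nn.Sequential(
            nn.ConvTranspose3d(channels * 2, channels, 3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm3d(channels))

    def forward(self, cost):                 # cost: (N, C, D, H, W)
        skip = self.down1(cost)              # (N, 2C, D/2, H/2, W/2)
        x = self.down2(skip)                 # (N, 2C, D/4, H/4, W/4)
        x = torch.relu(self.up1(x) + skip)   # skip connection from the encoder
        return self.up2(x) + cost            # residual back onto the input volume

# Usage: the block preserves the cost-volume shape, so several can be stacked.
cost = torch.randn(1, 32, 48, 64, 64)
out = Hourglass3D()(cost)                    # same shape as the input
```

Stacking several such blocks, each refining the output of the previous one, yields the stacked-hourglass regularization of the cost volume.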
C. Training the Deep Neural Network
The process described above is used to create a dataset of around 2400 stereo pairs, each registered to a corresponding depth image. These are divided into batches of two, and the network is trained for 26 epochs (see Figure 4) on an NVIDIA GeForce RTX 3070 graphics processing unit.
Several strategies were used to prevent overfitting [18]. First, an early-stopping strategy was used to halt training before the validation error began to rise. Second, the training and validation sets were shuffled at each epoch.
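A minimal sketch of this training regime is shown below, assuming a PyTorch setup. The tiny random dataset and the single-layer stand-in model are placeholders for the real 2400-pair dataset and the PSMNet-style network, and the patience value is an assumption, since the paper does not state one.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model standing in for the real ~2400 stereo pairs
# and the PSMNet-style network; both are illustrative assumptions.
pairs = torch.randn(40, 2, 64, 64)            # stacked (left, right) event frames
depths = torch.randn(40, 1, 64, 64)           # registered depth images
train_set = TensorDataset(pairs[:32], depths[:32])
val_set = TensorDataset(pairs[32:], depths[32:])
model = torch.nn.Conv2d(2, 1, 3, padding=1)   # trivial stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# shuffle=True reshuffles the training data at the start of every epoch.
train_loader = DataLoader(train_set, batch_size=2, shuffle=True)
val_loader = DataLoader(val_set, batch_size=2)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(26):                       # 26 epochs, batches of two
    model.train()
    for pair, depth in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.smooth_l1_loss(model(pair), depth)
        loss.backward()
        optimizer.step()
    # Early stopping: track validation loss and halt once it stops improving.
    model.eval()
    with torch.no_grad():
        val = sum(torch.nn.functional.smooth_l1_loss(model(p), d).item()
                  for p, d in val_loader) / len(val_loader)
    if val < best_val:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                             # stop before overfitting sets in
```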
D. Extrinsic and Intrinsic Calibration
The sensor extrinsics are calibrated using computer vision techniques to compute the relative positions of the sensor axes. From this, we obtain the offset between the two optical axes (the baseline) as well as the mapping between the Lidar and each of the event cameras [7]. OpenCV's library [19] is leveraged to compute the principal point and focal lengths of each of the three sensors. The focal length, f, is used for pose computation and for the mapping between disparity and depth as in (1).
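The sketch below illustrates the resulting disparity-to-depth mapping for a rectified pair. The focal length and baseline values are illustrative assumptions; in practice, f and the principal point would come from an OpenCV calibration routine such as cv2.calibrateCamera [19].

```python
import numpy as np

# Minimal sketch of the disparity-to-depth mapping of (1) for a rectified
# stereo pair: depth = f * b / d. The values of f and b below are
# illustrative assumptions, not the calibrated values from this work.
f = 525.0                                     # focal length in pixels
b = 0.10                                      # baseline in metres

disparity = np.array([[8.0, 16.0],            # example disparity map (pixels)
                      [32.0, 64.0]])
depth = f * b / np.maximum(disparity, 1e-6)   # metres; guard against zero disparity
print(depth)                                  # larger disparity -> closer surface
```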
Figure 2. The process for training the network to predict depth images.
Figure 3. The architecture of the Pyramid Stereo Matching Network [17].