Roof bolting is generally considered to be one of the
most dangerous jobs in underground mines in the United
States [9]–[12]. This is due to two factors: the operator is
at risk of being injured by the bolting machine itself [13],
and there is a risk of becoming a casualty of a roof fall [14].
Accidents, particularly those affecting less-experienced
operators, are prevalent. Miner safety and productivity could be
greatly improved if the operator were not required to be in
situ, identifying areas of the roof to drill, positioning, and
operating the drill. Autonomous technologies in the mining
industry offer numerous advantages by minimizing workers’
exposure to dangerous conditions, increasing safety
standards, lowering costs, and enhancing efficiency [15].
There are several challenges to automating this process.
First, the system must semantically recognize which regions
of the roof are rock and which are support strap. It is critical for
the autonomous system to be able to differentiate bare areas
of the roof from straps that are already bolted to the roof.
Second, the system needs to create a 3D surface
representation of the scene during a drilling operation. This
allows the roof bolter to localize itself relative to the
roof. Both of these tasks, semantic segmentation and 3D
reconstruction, need to be performed in an active mining
environment. This means that the system has to be robust
to vibrational loads, dust clouds, and severe lighting
conditions. These factors make it difficult to apply
off-the-shelf computer vision products to these problems.
Traditional stereo-vision products exist for stereo pairs of
images. However, since this class of algorithms relies on
matching corresponding pixels in the images taken by the
left and right cameras using feature detectors [16], their
performance degrades significantly when used with noisy
event-based images.
Figure 1. An image of the roof with a bolted support
strap, taken at the Edgar Mine in Idaho Springs, CO.
The alternative is to train a convolutional neural network
(CNN) with event-based stereo pairs, labeled with
ground-truth depth data, to learn feature matching. With this
approach, the system learns how to match corresponding
points. The pyramid stereo matching network
[17] utilizes a 3D CNN and region-scale features, rather
than pixel-scale features, to incorporate global context
information. This is essential for predicting a disparity map
from a stereo pair of noisy, event-based images.
This paper presents a novel approach to solving the 3D
reconstruction problem in harsh lighting and environmental
conditions. Specifically, the contribution is a pipeline
for generating very large training data sets of event-based
images during active mining operations.
This paper is organized as follows: a discussion of the planned
approach, a description of the experimental setup, a
presentation of the results, a discussion grounding the results
in the context of the project, and a conclusion.
METHODS
The goal is to predict depth images given time-synced
accumulated event images. To do so, the dataset needs to
consist of event images captured in a mine, each with
an associated ground-truth depth image.
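As an illustration of what an accumulated event image can look like, the sketch below sums event polarities per pixel over a fixed time window. The event layout `(t, x, y, polarity)`, the function name `accumulate_events`, and the signed-count representation are assumptions made for illustration; the accumulation scheme actually used is not specified here.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (timestamp_s, x, y, polarity), polarity in {-1, +1}.
Event = Tuple[float, int, int, int]


def accumulate_events(events: Iterable[Event], width: int, height: int,
                      t_start: float, t_end: float) -> List[List[int]]:
    """Sum event polarities per pixel over the half-open window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        inside_window = t_start <= t < t_end
        inside_image = 0 <= x < width and 0 <= y < height
        if inside_window and inside_image:
            frame[y][x] += polarity
    return frame


# Four toy events; the last one falls outside the 10 ms window and is ignored.
toy_events = [(0.001, 0, 0, 1), (0.002, 0, 0, 1), (0.003, 1, 1, -1), (0.020, 1, 0, 1)]
toy_frame = accumulate_events(toy_events, width=2, height=2, t_start=0.0, t_end=0.010)
```

Accumulating the left and right event streams over the same window would be one way to obtain a time-synced stereo pair of event images.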
Disparity is used as a proxy feature to ensure model
modularity. Therefore, a disparity image is computed from
the corresponding ground-truth depth image using the
camera intrinsics of the sensor being used. For training, this
disparity image is combined with the time-synced stereo image
pair and input to a 3D CNN architecture. For testing, a time-synced
stereo pair of event images is input to the network. The
output is a disparity image, which can then be reprojected
into depth, given the focal length of the stereo sensors.
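The ground-truth step can be sketched as follows. This is a minimal example that assumes the standard rectified pinhole stereo relation, disparity = focal length × baseline / depth, and a depth image already registered to the left event camera's pixel grid. The function name, the units (depth and baseline in meters, focal length in pixels), and the rig values are hypothetical.

```python
from typing import List


def depth_to_disparity(depth_m: List[List[float]],
                       focal_px: float,
                       baseline_m: float) -> List[List[float]]:
    """Convert a depth image (meters) to a disparity image (pixels).

    Assumes d = f * B / Z for a rectified stereo pair. Non-positive depth is
    treated as a missing measurement and mapped to a disparity of 0.0.
    """
    disparity = []
    for row in depth_m:
        disparity.append(
            [focal_px * baseline_m / z if z > 0 else 0.0 for z in row]
        )
    return disparity


# Hypothetical rig: 500 px focal length, 0.10 m baseline.
toy_depth = [[2.0, 4.0], [0.0, 1.0]]
gt_disparity = depth_to_disparity(toy_depth, focal_px=500.0, baseline_m=0.10)
```

Mapping missing depth to zero disparity is one convention; a training loss would then typically mask those pixels out.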
A. Stereo-Disparity
The disparity ground-truth images are computed from the
depth images taken by the L515. To do so, the horizontal
distance between the event cameras (the stereo baseline) needs
to be calculated. Once this is known, the disparity is calculated
as in (1). Disparity is chosen as the quantity of interest because
it is hardware agnostic: depth can be inferred
from any disparity given the focal length, as in (1). Therefore,
the stereo-disparity training dataset consists of time-synced
stereo event pairs, each with a ground-truth disparity
image. When testing the network, a stereo pair of event
images is input and a disparity image is predicted. This predicted
disparity image is projected to depth using Equation (1).
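The reprojection at test time can be sketched in the same spirit. Equation (1) is not reproduced in this excerpt, so the example assumes the usual rectified stereo relation, depth = focal length × baseline / disparity. The function name and both sets of rig parameters are hypothetical and serve only to show the same predicted disparity image being converted for two different sensors.

```python
from typing import List


def disparity_to_depth(disparity_px: List[List[float]],
                       focal_px: float,
                       baseline_m: float) -> List[List[float]]:
    """Convert a disparity image (pixels) to a depth image (meters).

    Assumes Z = f * B / d. Non-positive disparity is treated as invalid
    and mapped to a depth of 0.0.
    """
    depth = []
    for row in disparity_px:
        depth.append(
            [focal_px * baseline_m / d if d > 0 else 0.0 for d in row]
        )
    return depth


# One predicted disparity image, reprojected for two hypothetical rigs.
predicted = [[25.0, 50.0], [0.0, 12.5]]
depth_rig_a = disparity_to_depth(predicted, focal_px=500.0, baseline_m=0.10)
depth_rig_b = disparity_to_depth(predicted, focal_px=1000.0, baseline_m=0.10)
```

The network output itself does not change between rigs in this sketch; only the focal length and baseline used in the final conversion differ.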