Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
but perform poorly on challenging real-world data sets
like MS COCO~\cite{Lin2014}, complicating any potential comparison between
them and object detection networks like \gls{SSD}.
Therefore, a comparison between model uncertainty with a network like
SSD and novelty detection with auto-encoders is considered out of scope
for this thesis.
Miller et al.~\cite{Miller2018} used an \gls{SSD} pre-trained on COCO
without further fine-tuning on the SceneNet RGB-D data
set~\cite{McCormac2017} and reported good results regarding
open set error for an \gls{SSD} variant with dropout sampling and entropy
thresholding.
If their results are generalisable, it should be possible to replicate
the relative difference between the variants on the COCO data set.
This leads to the following hypothesis: \emph{Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.}
For the purpose of this thesis, I will use the \gls{vanilla} \gls{SSD} (as in: the original SSD) as
baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses
a per-class confidence threshold of 0.01, an IOU threshold of 0.45
for the \gls{NMS}, and a top \(k\) value of 200. For this
thesis, the top \(k\) value was changed to 20, and a confidence threshold
of 0.2 was tried as well.
The effect of an entropy threshold is measured against this \gls{vanilla}
\gls{SSD} by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
Miller et al.). Dropout sampling is compared to \gls{vanilla} \gls{SSD}
with and without entropy thresholding.
\paragraph{Hypothesis} Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.
First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling.
Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
Bayesian \gls{SSD} extends \gls{vanilla} \gls{SSD}, and how the decoding pipelines are
structured.
Chapter \ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
be used to identify and reject these false positive cases.
\label{chap:methods}
This chapter explains the functionality of \gls{vanilla} \gls{SSD}, Bayesian \gls{SSD}, and the decoding pipelines.
\section{Vanilla SSD}
\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The \gls{vanilla} \gls{SSD} network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}
Vanilla \gls{SSD} is based upon the VGG-16 network (see figure
\ref{fig:vanilla-ssd}) and adds extra feature layers. The entire
image (always of size 300x300) is divided into anchor boxes. During
training, each of these boxes is mapped to a ground truth box or
the background class. The outputs of the \gls{SSD} network are the predictions with class confidences, offsets to the
anchor box, anchor box coordinates, and variance. The model loss is a
weighted sum of localisation and confidence loss. As the network
has a fixed number of anchor boxes, every forward pass creates the same
number of detections---8732 in the case of \gls{SSD} 300x300.
Notably, the object proposals are made in a single run for an image---single shot.
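To make these decoding steps concrete, the following is a minimal NumPy sketch of a per-class confidence threshold, greedy \gls{NMS}, and top \(k\) selection. Function and array names are illustrative assumptions; this is a simplified sketch, not the decoding code used in this thesis.
\begin{verbatim}
import numpy as np

def iou(box, boxes):
    # intersection over union of one box with an array of boxes,
    # all given as [xmin, ymin, xmax, ymax]
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def decode(scores, boxes, conf_thresh=0.01, iou_thresh=0.45, top_k=200):
    # scores: (8732, n_classes) softmax scores, boxes: (8732, 4)
    detections = []
    for c in range(1, scores.shape[1]):    # class 0 is background
        keep = scores[:, c] > conf_thresh  # per-class confidence threshold
        s, b = scores[keep, c], boxes[keep]
        order = np.argsort(-s)             # greedy NMS, highest score first
        while order.size > 0:
            best, order = order[0], order[1:]
            detections.append((c, s[best], b[best]))
            order = order[iou(b[best], b[order]) < iou_thresh]
    detections.sort(key=lambda d: -d[1])
    return detections[:top_k]              # keep the top k detections
\end{verbatim}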
Liu et al.~\cite{Liu2016}.
\section{Bayesian SSD for Model Uncertainty}
Networks trained with dropout are approximate Bayesian models~\cite{Gal2017}. As such, they can be used for everything a true
Bayesian model could be used for. This idea is applied to \gls{SSD} in this
thesis: two dropout layers are added to \gls{vanilla} \gls{SSD}, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian \gls{SSD} network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}
and very low confidences in other classes.
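To sketch the sampling procedure, the following hypothetical snippet runs several stochastic forward passes with dropout kept active at test time, averages the softmax scores per anchor box, and rejects detections whose entropy exceeds a threshold. The \texttt{model} callable is a placeholder assumption, and the actual pipeline groups detections into observations by overlap rather than by anchor index.
\begin{verbatim}
import numpy as np

def dropout_sampling(model, image, n_passes=10):
    # model(image, training=True) is a placeholder that keeps dropout
    # active and returns softmax scores of shape (8732, n_classes)
    runs = [model(image, training=True) for _ in range(n_passes)]
    return np.mean(runs, axis=0)        # average scores per anchor box

def entropy_test(scores, threshold=1.4):
    # Shannon entropy of each averaged class distribution;
    # near-uniform (uncertain) predictions have high entropy
    entropy = -np.sum(scores * np.log(scores + 1e-12), axis=1)
    return scores[entropy < threshold]  # keep confident detections only
\end{verbatim}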
\subsection{Implementation Details}
For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
\hline
\end{tabular}
\caption{Rounded results for micro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best-performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an
entropy threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best with 1.0,
and Bayesian \gls{SSD} with \gls{NMS} performed best with 1.4.
Bayesian \gls{SSD} with dropout enabled and a 0.9 keep ratio performed
best with an entropy threshold of 1.4; the run with a 0.5 keep ratio performed
best with 1.3.}
\label{tab:results-micro}
in the next chapter.
\end{minipage}
\end{figure}
Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
the \gls{vanilla} \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with
an entropy test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
the 0.2 variant also has the lowest number of open set errors (2939) and the
highest precision (0.372).
The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
shows no significant impact of an entropy test. Only the open set errors
are lower, but insignificantly so. The remaining performance metrics are
identical after rounding.
Bayesian \gls{SSD} with disabled dropout and without \gls{NMS}
has the worst performance of all tested variants (\gls{vanilla} and Bayesian)
with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants.
In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
With 2335 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
enabled \gls{NMS} offers the best performance with respect
to open set errors. It also has the best precision (0.378) of all tested
variants. Furthermore, it provides the best performance among all variants
with multiple forward passes.
in the lower \(F_1\) scores, higher open set errors, and lower precision
values. Both dropout variants have worse recall (0.363 and 0.342) than
the variant with disabled dropout.
However, all variants with multiple forward passes have lower open set
errors than all \gls{vanilla} \gls{SSD} variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} \gls{SSD}
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included.
reported figures, such as the ones in Miller et al.~\cite{Miller2018}
Forward & max & abs OSE & Recall & Precision\\
Passes &\(F_1\) Score &\multicolumn{3}{c}{at max \(F_1\) point}\\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for macro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best-performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an
entropy threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best with 1.5,
and Bayesian \gls{SSD} with \gls{NMS} performed best with 1.5.
Bayesian \gls{SSD} with dropout enabled and a 0.9 keep ratio performed
best with an entropy threshold of 1.7; the run with a 0.5 keep ratio performed
best with 2.0.}
\label{tab:results-macro}
reported figures, such as the ones in Miller et al.~\cite{Miller2018}
\end{minipage}
\end{figure}
Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
with an entropy test slightly outperforms the 0.2 variant with respect to
precision (0.425). Additionally, this is the best precision overall. Among
the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest
number of open set errors (1218).
The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
shows no significant impact of an entropy test. Only the open set errors
are lower, but insignificantly so. The remaining performance metrics are
almost identical after rounding.
The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: a maximum \(F_1\) score of 0.363 (with \gls{NMS}) versus 0.226
(without \gls{NMS}). Dropout was disabled in both cases, making them effectively a
\gls{vanilla} \gls{SSD} run with multiple forward passes.
With 809 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
without \gls{NMS} offers the best performance with respect
to open set errors. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
precision (0.420) and the best recall (0.321) of all Bayesian variants.
Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision and
recall values. However, all variants with multiple forward passes have lower open set errors than all \gls{vanilla} \gls{SSD}
variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} \gls{SSD}
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included.
they had the exact same performance before rounding.
Forward & max & Recall & Precision\\
Passes &\(F_1\) Score &\multicolumn{2}{c}{at max \(F_1\) point}\\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for the persons class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best-performing macro-averaging entropy threshold with respect to \(F_1\) score.}
\label{tab:results-persons}
\end{table}
It is clearly visible that the overall trend continues in the individual
classes (see tables \ref{tab:results-persons}, \ref{tab:results-cars}, \ref{tab:results-chairs}, \ref{tab:results-bottles}, and \ref{tab:results-giraffes}). However, the two \gls{vanilla} \gls{SSD} variants with only 0.01 confidence
threshold perform better than in the averaged results presented earlier.
Only in the chairs class does a Bayesian \gls{SSD} variant perform better (in
precision) than any of the \gls{vanilla} \gls{SSD} variants. Moreover, there are
multiple classes where two or all of the \gls{vanilla} \gls{SSD} variants perform
equally well. When compared with the macro averaged results,
giraffes and persons perform better across the board. Cars have a higher
precision than average but lower recall values for all but the Bayesian
\gls{SSD} variant without \gls{NMS} and dropout. Chairs and bottles perform
worse than average.
\begin{table}[tbp]
Forward & max & Recall & Precision\\
Passes &\(F_1\) Score &\multicolumn{2}{c}{at max \(F_1\) point}\\
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:stop-sign-truck-bayesian}
\end{minipage}
\end{figure}
The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are barely visible
(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian \gls{SSD}; instead, both detect a potted plant and a traffic light. The stop sign is detected by both variants.
This behaviour implies problems with detecting objects at the edge
that overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} \gls{SSD}.}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:cat-laptop-bayesian}
\end{minipage}
\end{figure}
Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right
side. Both variants detect a cat, but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected, but this is expected since
these classes were not trained.
\chapter{Discussion and Outlook}
questions will be addressed.
\section*{Discussion}
The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there
is no area where dropout sampling performs better than \gls{vanilla} \gls{SSD}. In the
remainder of this section, the individual results are interpreted.
\subsection*{Impact of Averaging}
of the plot in both the \(F_1\) versus absolute open set error graph (see figure
\ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
This behaviour is caused by a large imbalance of detections between
the classes. For \gls{vanilla} \gls{SSD} with a 0.2 confidence threshold there are
a total of 36,863 detections after \gls{NMS} and top \(k\).
The persons class contributes 14,640 detections or around 40\% to that number. Another strong class is cars with 2,252 detections or around
6\%. In third place are chairs with 1,352 detections or around 4\%. This means that three classes together have roughly as many detections
as the remaining 57 classes combined.
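The difference between the two averaging schemes can be made explicit with a short sketch: micro averaging pools true positives, false positives, and false negatives over all classes before computing precision and recall, whereas macro averaging computes both per class and then takes the unweighted mean. The counts below are made up for illustration and show how a dominant class pulls the micro average towards its own performance.
\begin{verbatim}
def micro_macro(tp, fp, fn):
    # micro: pool the counts over all classes first
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = (TP / (TP + FP), TP / (TP + FN))
    # macro: per-class precision/recall, then unweighted mean
    macro = (sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp),
             sum(tp[c] / (tp[c] + fn[c]) for c in tp) / len(tp))
    return micro, macro  # ((precision, recall), (precision, recall))

tp = {"person": 900, "bottle": 10}
fp = {"person": 300, "bottle": 40}
fn = {"person": 600, "bottle": 90}
print(micro_macro(tp, fp, fn))
# micro recall (0.57) is dominated by persons; macro recall is 0.35
\end{verbatim}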
averaging was not reported in their paper.
\subsection*{Impact of Entropy}
There is no visible impact of entropy thresholding on the object detection
performance for \gls{vanilla} \gls{SSD}. This indicates that the network has almost no
uniform or close to uniform predictions; the vast majority of predictions
have a high confidence in one class---including the background.
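For reference, the entropy of a categorical prediction \(p\) is
\[ H(p) = -\sum_{c} p_c \ln p_c. \]
A one-hot prediction has entropy \(0\), while a uniform prediction over \(C\) classes reaches the maximum \(\ln C\); with 60 trained classes plus background this would be \(\ln 61 \approx 4.11\). The tested thresholds of 0.1 to 2.4 therefore lie well below the uniform case.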
However, the entropy plays a larger role for the Bayesian variants---as
threshold indicates a worse performance.
variant & before & after & after \\
& entropy/NMS & entropy/NMS & top \(k\)\\
\hline
Bay. SSD, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\
no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\
\hline
\end{tabular}
\caption{Comparison of Bayesian \gls{SSD} variants without dropout with
respect to the number of detections before the entropy threshold,
after it and/or \gls{NMS}, and after top \(k\). The
entropy threshold 1.5 was used for both.}
\label{tab:effect-nms}
\end{table}
Miller et al.~\cite{Miller2018} supposedly did not use \gls{NMS}
in their implementation of dropout sampling. Therefore, a variant with
disabled \glslocalreset{NMS}\gls{NMS} was tested. The results are somewhat expected:
\gls{NMS} removes all non-maximum detections that overlap
with a maximum one. This reduces the number of multiple detections per
ground truth bounding box and therefore the number of false positives. Without it,
many more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected:
duplicate detections could stay and maxima boxes could be removed.
The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
\gls{NMS} and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and \gls{NMS}, the variant with \gls{NMS} has roughly 23\% of its observations left
(see table \ref{tab:effect-nms} for absolute numbers).
Without \gls{NMS}, 79\% of observations are left. Irrespective of the absolute
number, this discrepancy clearly shows the impact of \gls{NMS} and also explains the higher count of false positives:
more than 50\% of the original observations were removed with \gls{NMS} and
remained without it---all of these are very likely to be false positives.
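The following sketch shows how such stage-wise counts could be obtained per image; the filtering function is passed in so that the same code covers the variants with and without \gls{NMS}. Shapes and names are assumptions, and table \ref{tab:effect-nms} reports totals over the whole test set rather than per-image counts.
\begin{verbatim}
import numpy as np

def count_stages(scores, boxes, filter_fn, entropy_thresh=1.5, top_k=20):
    # scores: (n, n_classes) averaged scores per observation, boxes: (n, 4)
    n_before = len(scores)
    entropy = -np.sum(scores * np.log(scores + 1e-12), axis=1)
    keep = entropy < entropy_thresh           # entropy threshold
    scores, boxes = scores[keep], boxes[keep]
    scores, boxes = filter_fn(scores, boxes)  # greedy NMS or identity
    n_filtered = len(scores)
    kept = np.argsort(-scores.max(axis=1))[:top_k]
    return n_before, n_filtered, len(kept)    # counts per pipeline stage

# without NMS the filter is the identity, so duplicates survive
identity = lambda s, b: (s, b)
\end{verbatim}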
A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but goes down markedly with macro averaging (0.229). For micro averaging, it does
not matter which class the true positives belong to: every detection
counts the same way. This also means that top \(k\) will have only
a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however,
the class of the true positives matters a lot: for example, if two
true positives are removed from a class with only a few true positives
to begin with, their removal will have a drastic influence on
the class recall value and hence the overall result.
The impact of top \(k\) was measured by counting the number of observations
after top \(k\) has been applied: the variant with \gls{NMS} keeps about 94\%
of the observations left after \gls{NMS}; without \gls{NMS}, only about 59\% of observations
are kept. This shows a significant impact on the result by top \(k\)
in the case of disabled \gls{NMS}. Furthermore, some
classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of the observations but persons only 57\%.
This indicates that detected dogs are mostly on images with few detections
recall.
variant & after & after \\
& prediction & observation grouping \\
\hline
Bay. SSD, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\
\hline
\end{tabular}
\caption{Comparison of Bayesian \gls{SSD} variants without dropout and with
0.9 keep ratio of dropout with
respect to the number of detections directly after the network
predictions and after the observation grouping.}
dropout and the weights are not prepared for it.
Gal~\cite{Gal2017}
showed that networks \textbf{trained} with dropout are approximate Bayesian
models. The Bayesian variants of \gls{SSD} implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.
But dropout alone does not explain the difference in results. Both variants
with and without dropout have the exact same number of detections coming
has slightly fewer predictions left compared to the one without dropout.
After the grouping, the variant without dropout has on average between
10 and 11 detections grouped into an observation. This is expected as every
forward pass creates the exact same result and these 10 identical detections
per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
10 detections are grouped together could explain the marginally better precision
of the Bayesian variant without dropout compared to \gls{vanilla} \gls{SSD}.
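A simplified sketch of how detections from multiple forward passes might be grouped into observations is given below: each detection joins the first group whose first member it overlaps sufficiently, and scores and boxes are averaged per group. The overlap threshold and matching strategy are assumptions, not the exact procedure of Miller et al.~\cite{Miller2018}. It illustrates why identical detections from ten passes collapse into a single observation, while dropout spreads detections over more groups.
\begin{verbatim}
import numpy as np

def iou_pair(a, b):
    # intersection over union of two boxes [xmin, ymin, xmax, ymax]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def group_observations(detections, overlap=0.95):
    # detections: list of (score_vector, box) pooled from all passes
    groups = []
    for score, box in detections:
        for members in groups:
            if iou_pair(box, members[0][1]) >= overlap:
                members.append((score, box))  # join an observation
                break
        else:
            groups.append([(score, box)])     # start a new observation
    # one observation per group: average scores and boxes of the members
    return [(np.mean([s for s, _ in g], axis=0),
             np.mean([b for _, b in g], axis=0)) for g in groups]
\end{verbatim}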
However, on average only three detections are grouped together into an
observation if dropout with 0.9 keep ratio is enabled. This does not
negatively impact recall as true positives do not disappear but offers
from Miller et al. The complete source code or otherwise exhaustive
implementation details of Miller et al. would be required to attempt an answer.
Future work could explore the performance of this implementation when used
on an \gls{SSD} variant that was fine-tuned or trained with dropout. In this case, it
should also look into the impact of training with both dropout and batch
normalisation.
Other avenues include the application to other data sets or object detection