Added glossary

Signed-off-by: Jim Martens <github@2martens.de>
Jim Martens 2019-09-27 16:02:59 +02:00
parent e7ebea9ae8
commit 452d97b4b2
4 changed files with 200 additions and 184 deletions

body.tex

@@ -115,15 +115,15 @@ novelty score.
Auto-encoders work well for data sets like MNIST~\cite{Deng2012}
but perform poorly on challenging real world data sets
like MS COCO~\cite{Lin2014}, complicating any potential comparison between
them and object detection networks like \gls{SSD}.
Therefore, a comparison between model uncertainty with a network like
SSD and novelty detection with auto-encoders is considered out of scope
for this thesis.
Miller et al.~\cite{Miller2018} used an \gls{SSD} pre-trained on COCO
without further fine-tuning on the SceneNet RGB-D data
set~\cite{McCormac2017} and reported good results regarding
open set error for an \gls{SSD} variant with dropout sampling and entropy
thresholding.
If their results are generalisable it should be possible to replicate
the relative difference between the variants on the COCO data set.
@@ -131,15 +131,15 @@ This leads to the following hypothesis: \emph{Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.}
For the purpose of this thesis, I will use the \gls{vanilla} \gls{SSD} (as in: the original SSD) as
baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses
a per-class confidence threshold of 0.01, an IOU threshold of 0.45
for the \gls{NMS}, and a top \(k\) value of 200. For this
thesis, the top \(k\) value was changed to 20 and a confidence threshold
of 0.2 was tried as well.
The effect of an entropy threshold is measured against this \gls{vanilla}
SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
Miller et al.). Dropout sampling is compared to \gls{vanilla} SSD
with and without entropy thresholding.
\paragraph{Hypothesis} Dropout sampling
@@ -150,8 +150,8 @@ conditions compared to object detection without it.
First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling.
Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
Bayesian \gls{SSD} extends \gls{vanilla} SSD, and how the decoding pipelines are
structured.
Chapter \ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
@@ -421,19 +421,19 @@ be used to identify and reject these false positive cases.
\label{chap:methods}
This chapter explains the functionality of \gls{vanilla} SSD, Bayesian SSD, and the decoding pipelines.
\section{Vanilla SSD}
\begin{figure}
\centering
\includegraphics[scale=1.2]{vanilla-ssd}
\caption{The \gls{vanilla} \gls{SSD} network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the
corresponding confidences.}
\label{fig:vanilla-ssd}
\end{figure}
Vanilla \gls{SSD} is based upon the VGG-16 network (see figure
\ref{fig:vanilla-ssd}) and adds extra feature layers. The entire
image (always size 300x300) is divided up into anchor boxes. During
training, each of these boxes is mapped to a ground truth box or
@@ -443,7 +443,7 @@ SSD network are the predictions with class confidences, offsets to the
anchor box, anchor box coordinates, and variance. The model loss is a
weighted sum of localisation and confidence loss. As the network
has a fixed number of anchor boxes, every forward pass creates the same
number of detections---8732 in the case of \gls{SSD} 300x300.
Notably, the object proposals are made in a single run per image:
single shot.
@@ -454,13 +454,13 @@ Liu et al.~\cite{Liu2016}.
\section{Bayesian SSD for Model Uncertainty}
Networks trained with dropout are a general approximate Bayesian model~\cite{Gal2017}. As such, they can be used for everything a true
Bayesian model could be used for. The idea is applied to \gls{SSD} in this
thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
\begin{figure}
\centering
\includegraphics[scale=1.2]{bayesian-ssd}
\caption{The Bayesian \gls{SSD} network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6
and fc7 layers.}
\label{fig:bayesian-ssd}
\end{figure}
@@ -476,51 +476,52 @@ and very low confidences in other classes.
\subsection{Implementation Details}
For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and
Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
was used. It was modified to support entropy thresholding,
partitioning of observations, and dropout
layers in the \gls{SSD} model. Entropy thresholding takes place before
the per-class confidence threshold is applied.
The Bayesian variant was not fine-tuned and operates with the same
weights as \gls{vanilla} \gls{SSD}.
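To make the sampling mechanism concrete, the following minimal sketch (not the actual code of the modified ssd\_keras implementation) shows how dropout can be kept active at inference time in Keras, so that repeated forward passes over the same image yield varying predictions; \texttt{model} and the number of passes are placeholders.
\begin{verbatim}
import numpy as np
import tensorflow as tf

def sample_forward_passes(model, images, num_passes=10):
    # Calling a Keras model with training=True keeps dropout active,
    # so every pass uses a different dropout mask and the predictions
    # differ; the results are stacked along a new leading axis.
    outputs = [model(images, training=True).numpy()
               for _ in range(num_passes)]
    return np.stack(outputs, axis=0)
\end{verbatim}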
\section{Decoding Pipelines}
The raw output of \gls{SSD} is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; these need to be filtered out to
get any meaningful output of the network. The process of
filtering is called decoding and is presented for the three variants
of \gls{SSD} used in this thesis.
\subsection{Vanilla SSD}
Liu et al.~\cite{Liu2016} used Caffe for their original SSD
implementation. The decoding process consists largely of two
phases: decoding and filtering. Decoding transforms the relative
coordinates predicted by \gls{SSD} into absolute coordinates. At this point
the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
the four variances; there are 8732 boxes.
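As an illustrative sketch (not the exact code of the used implementation), the split of the raw output described above can be expressed in NumPy as follows; the function name is a placeholder.
\begin{verbatim}
import numpy as np

def split_raw_output(raw):
    # raw has shape (batch_size, nr_boxes, nr_classes + 12);
    # nr_boxes is 8732 for SSD 300x300.
    class_confidences = raw[..., :-12]    # softmax scores per class
    box_offsets       = raw[..., -12:-8]  # offsets to the anchor box
    anchor_boxes      = raw[..., -8:-4]   # anchor box coordinates
    variances         = raw[..., -4:]     # variances
    return class_confidences, box_offsets, anchor_boxes, variances
\end{verbatim}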
\glslocalreset{NMS}
Filtering of these boxes is first done per class:
only the class id, confidence of that class, and the bounding box
coordinates are kept per box. The filtering consists of
confidence thresholding and a subsequent \gls{NMS}.
All boxes that pass \gls{NMS} are added to a
per-image maxima list. One box could pass the confidence threshold
for multiple classes and, hence, be present multiple times in the
maxima list for the image. Lastly, a total of \(k\) boxes with the
highest confidences is kept per image across all classes. The
original implementation uses a confidence threshold of \(0.01\), an
IOU threshold for \gls{NMS} of \(0.45\), and a top \(k\)
value of 200.
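A simplified sketch of this per-class filtering (illustrative only, not the ssd\_keras code; \texttt{nms\_fn} stands in for the actual \gls{NMS} routine) could look as follows:
\begin{verbatim}
import numpy as np

def filter_detections(confidences, boxes, nms_fn,
                      conf_thresh=0.01, top_k=200):
    # confidences: (nr_boxes, nr_classes) softmax scores for one image,
    # boxes: (nr_boxes, 4) absolute coordinates, class 0 is background.
    maxima = []
    for class_id in range(1, confidences.shape[1]):
        scores = confidences[:, class_id]
        keep = scores > conf_thresh        # per-class confidence threshold
        kept_boxes, kept_scores = boxes[keep], scores[keep]
        for idx in nms_fn(kept_boxes, kept_scores):  # non-maximum suppression
            maxima.append((class_id, kept_scores[idx], kept_boxes[idx]))
    # keep only the top k detections per image across all classes
    maxima.sort(key=lambda det: det[1], reverse=True)
    return maxima[:top_k]
\end{verbatim}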
The \gls{vanilla} SSD
per-class confidence threshold and \gls{NMS} have one
weakness: even if \gls{SSD} correctly predicts all objects as the
background class with high confidence, the per-class confidence
threshold of 0.01 will consider predictions with very low
confidences; as background boxes are not present in the maxima
@@ -531,7 +532,7 @@ pass because the background class has high confidence. Subsequently,
a low per-class confidence threshold does not restrict the boxes
either. Therefore, the decoding output is worse than the actual
predictions of the network.
Bayesian \gls{SSD} cannot help in this situation because the network
is not actually uncertain.
SSD was developed with closed set conditions in mind. A well trained
@@ -543,8 +544,8 @@ confidence threshold is required.
\subsection{Vanilla SSD with Entropy Thresholding}
Vanilla \gls{SSD} with entropy thresholding adds an additional component
to the filtering already done for \gls{vanilla} SSD. The entropy is
calculated from all \(\#nr\_classes\) softmax scores in a prediction.
Only predictions with a low enough entropy pass the entropy
threshold and move on to the aforementioned per-class filtering.
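For reference, the entropy of a prediction with softmax scores \(p_1, \dots, p_{\#nr\_classes}\) is
\[
H = - \sum_{i=1}^{\#nr\_classes} p_i \log p_i ,
\]
and a prediction is only kept if \(H\) lies below the chosen entropy threshold.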
@@ -553,7 +554,7 @@ false positive or false negative cases with high confidence values.
\subsection{Bayesian SSD with Entropy Thresholding}
Bayesian \gls{SSD} has the speciality of multiple forward passes. Based
on the information in the paper, the detections of all forward passes
are grouped per image but not by forward pass. This leads
to the following shape of the network output after all
@@ -585,8 +586,8 @@ varying classifications are averaged into multiple lower confidence
values which should increase the entropy and, hence, flag an
observation for removal.
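A small worked example (two classes, natural logarithm) illustrates this effect: if one forward pass yields the softmax scores \((0.9, 0.1)\) and another yields \((0.1, 0.9)\), each individual prediction has an entropy of roughly \(0.33\), whereas their average \((0.5, 0.5)\) reaches the maximum entropy \(\ln 2 \approx 0.69\). Disagreement between the forward passes therefore pushes the averaged observation towards higher entropy values, where the entropy threshold can remove it.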
The remainder of the filtering follows the \gls{vanilla} \gls{SSD} procedure: per-class
confidence threshold, \gls{NMS}, and a top \(k\) selection
at the end.
\chapter{Experimental Setup and Results}
@@ -627,7 +628,7 @@ process. MS COCO contains landscape and portrait images with (640x480)
and (480x640) as the resolution. This led to a uniform distortion of the
portrait and landscape images respectively. Furthermore,
the colour channels were swapped from RGB to BGR in order to
comply with the \gls{SSD} implementation. The BGR requirement stems from
the usage of OpenCV in SSD: the internal channel order for
OpenCV is BGR.
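As a minimal illustration (not the exact preprocessing code), the swap amounts to reversing the channel axis of an image stored in height-width-channel order:
\begin{verbatim}
import numpy as np

def rgb_to_bgr(image: np.ndarray) -> np.ndarray:
    # Reverse the last (channel) axis: RGB becomes BGR.
    return image[..., ::-1]
\end{verbatim}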
@@ -653,28 +654,28 @@ between the classes in the data set.
This section explains the setup for the different experiments conducted.
Each comparison investigates one particular question.
As a baseline, \gls{vanilla} \gls{SSD} with a confidence threshold of 0.01
and a \gls{NMS} IOU threshold of 0.45 was used.
Due to the low number of objects per image in the COCO data set,
the top \(k\) value was set to 20. Vanilla \gls{SSD} with entropy
thresholding uses the same parameters; compared to \gls{vanilla} SSD
without entropy thresholding, it showcases the relevance of
entropy thresholding for \gls{vanilla} SSD.
Vanilla \gls{SSD} was also run with 0.2 confidence threshold and compared
to \gls{vanilla} \gls{SSD} with 0.01 confidence threshold; this comparison
investigates the effect of the per-class confidence threshold
on the object detection performance.
Bayesian \gls{SSD} was run with 0.2 confidence threshold and compared
to \gls{vanilla} \gls{SSD} with 0.2 confidence threshold. Coupled with the
entropy threshold, this comparison reveals how uncertain the network
is. If it is very certain, dropout sampling should have no
significant impact on the result. Furthermore, in two cases the
dropout was turned off to isolate the impact of \gls{NMS}
on the result.
Both \gls{vanilla} \gls{SSD} with entropy thresholding and Bayesian \gls{SSD} with
entropy thresholding were tested for entropy thresholds ranging
from 0.1 to 2.4 inclusive as specified in Miller et al.~\cite{Miller2018}.
@@ -701,25 +702,25 @@ in the next chapter.
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.376} & 2939 & \textbf{0.382} & 0.372 \\
\gls{SSD} with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
% entropy thresh: 2.4 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
\hline
\end{tabular}
\caption{Rounded results for micro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an
entropy threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as entropy
threshold.
Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
best for 1.4 as entropy threshold; the run with 0.5 keep ratio performed
best for 1.3 as threshold.}
\label{tab:results-micro}
@@ -739,26 +740,26 @@ in the next chapter.
\end{minipage}
\end{figure}
Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
the \gls{vanilla} \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with
an entropy test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,
the 0.2 variant also has the lowest number of open set errors (2939) and the
highest precision (0.372).
The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
shows no significant impact of an entropy test. Only the open set errors
are lower, but only marginally. The remaining performance metrics are
identical after rounding.
Bayesian \gls{SSD} with disabled dropout and without \gls{NMS}
has the worst performance of all tested variants (\gls{vanilla} and Bayesian)
with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants.
In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
With 2335 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
enabled \gls{NMS} offers the best performance with respect
to open set errors. It also has the best precision (0.378) of all tested
variants. Furthermore, it provides the best performance among all variants
with multiple forward passes.
@@ -768,11 +769,11 @@ in the lower \(F_1\) scores, higher open set errors, and lower precision
values. Both dropout variants have worse recall (0.363 and 0.342) than
the variant with disabled dropout.
However, all variants with multiple forward passes have lower open set
errors than all \gls{vanilla} \gls{SSD} variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} SSD
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included.
@@ -787,25 +788,25 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.375} & 1218 & \textbf{0.338} & 0.424 \\
\gls{SSD} with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for macro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an
entropy threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5,
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as entropy
threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
best for 1.7 as entropy threshold; the run with 0.5 keep ratio performed
best for 2.0 as threshold.}
\label{tab:results-macro}
@@ -825,36 +826,36 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
\end{minipage}
\end{figure}
Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
with an entropy test slightly outperforms the 0.2 variant with respect to
precision (0.425). Additionally, this is the best precision overall. Among
the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest
number of open set errors (1218).
The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01
shows no significant impact of an entropy test. Only the open set errors
are lower, but only marginally. The remaining performance metrics are
almost identical after rounding.
The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) versus 0.226
(without NMS). Dropout was disabled in both cases, making them effectively a
\gls{vanilla} \gls{SSD} run with multiple forward passes.
With 809 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and
without \gls{NMS} offers the best performance with respect
to open set errors. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best
precision (0.420), and the best recall (0.321) of all Bayesian variants.
Dropout decreases the performance of the network; this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision and
recall values. However, all variants with multiple forward passes have lower open set errors than all \gls{vanilla} SSD
variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} SSD
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included.
@@ -884,35 +885,35 @@ they had the exact same performance before rounding.
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.460 & \textbf{0.405} & 0.532 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.460} & \textbf{0.405} & \textbf{0.533} \\
\gls{SSD} with Entropy test - 0.01 conf & 0.460 & 0.405 & 0.532 \\
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.272 & 0.292 & 0.256 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.451 & 0.403 & 0.514 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.447 & 0.401 & 0.505 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.410 & 0.368 & 0.465 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for persons class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging entropy threshold with respect to \(F_1\) score.}
\label{tab:results-persons}
\end{table}
It is clearly visible that the overall trend continues in the individual
classes (see tables \ref{tab:results-persons}, \ref{tab:results-cars}, \ref{tab:results-chairs}, \ref{tab:results-bottles}, and \ref{tab:results-giraffes}). However, the two \gls{vanilla} \gls{SSD} variants with only 0.01 confidence
threshold perform better than in the averaged results presented earlier.
Only in the chairs class does a Bayesian \gls{SSD} variant perform better (in
precision) than any of the \gls{vanilla} \gls{SSD} variants. Moreover, there are
multiple classes where two or all of the \gls{vanilla} \gls{SSD} variants perform
equally well. When compared with the macro averaged results,
giraffes and persons perform better across the board. Cars have a higher
precision than average but lower recall values for all but the Bayesian
SSD variant without \gls{NMS} and dropout. Chairs and bottles perform
worse than average.
\begin{table}[tbp]
@@ -921,21 +922,21 @@ worse than average.
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.364 & \textbf{0.305} & 0.452 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.363 & 0.294 & \textbf{0.476} \\
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.364} & \textbf{0.305} & 0.453 \\
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.236 & 0.244 & 0.229 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.336 & 0.266 & 0.460 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.332 & 0.262 & 0.454 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.309 & 0.264 & 0.374 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for cars class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging entropy threshold with respect to \(F_1\) score.}
\label{tab:results-cars}
\end{table}
@@ -946,21 +947,21 @@ worse than average.
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.287 & \textbf{0.251} & 0.335 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.283 & 0.242 & 0.341 \\
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.288} & \textbf{0.251} & 0.338 \\
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.172 & 0.168 & 0.178 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.280 & 0.229 & \textbf{0.360} \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.274 & 0.228 & 0.343 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.240 & 0.220 & 0.265 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for chairs class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging entropy threshold with respect to \(F_1\) score.}
\label{tab:results-chairs}
\end{table}
@@ -972,21 +973,21 @@ worse than average.
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & 0.233 & \textbf{0.175} & 0.348 \\
\gls{vanilla} \gls{SSD} - 0.2 conf & 0.231 & 0.173 & \textbf{0.350} \\
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.233} & \textbf{0.175} & 0.350 \\
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.160 & 0.140 & 0.188 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.224 & 0.170 & 0.328 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.220 & 0.170 & 0.311 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.202 & 0.172 & 0.245 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for bottles class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging entropy threshold with respect to \(F_1\) score.}
\label{tab:results-bottles}
\end{table}
@@ -997,21 +998,21 @@ worse than average.
Forward & max & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{2}{c}{at max \(F_1\) point} \\
\hline
\gls{vanilla} \gls{SSD} - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
\gls{vanilla} \gls{SSD} - 0.2 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
\gls{SSD} with Entropy test - 0.01 conf & \textbf{0.650} & \textbf{0.647} & \textbf{0.655} \\
% entropy thresh: 1.7 for \gls{vanilla} \gls{SSD} is best
\hline
Bay. \gls{SSD} - no DO - 0.2 conf - no \gls{NMS} \; 10 & 0.415 & 0.414 & 0.417 \\
no dropout - 0.2 conf - \gls{NMS} \; 10 & 0.647 & 0.642 & 0.654 \\
0.9 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.637 & 0.634 & 0.642 \\
0.5 keep ratio - 0.2 conf - \gls{NMS} \; 10 & 0.586 & 0.578 & 0.596 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
% 1.7 for 8, 2.0 for 9
\hline
\end{tabular}
\caption{Rounded results for giraffe class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with
their best performing macro averaging entropy threshold with respect to \(F_1\) score.}
\label{tab:results-giraffes}
\end{table}
@@ -1020,47 +1021,47 @@ worse than average.
% TODO: expand
This subsection compares \gls{vanilla} SSD
with Bayesian \gls{SSD} with respect to specific images that illustrate
similarities and differences between both approaches. For this
comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian
SSD uses \gls{NMS} and dropout with 0.9 keep ratio.
\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}
\label{fig:stop-sign-truck-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:stop-sign-truck-bayesian}
\end{minipage}
\end{figure}
The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are barely visible
(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian SSD; instead, both detected a potted plant and a traffic light. The stop sign is detected by both variants.
This behaviour implies problems with detecting objects at the edge
that overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.
\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}
\label{fig:cat-laptop-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}
\label{fig:cat-laptop-bayesian}
\end{minipage}
\end{figure}
Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right
side. Both variants detect a cat, but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected, but this is expected since
these classes were not trained.
\chapter{Discussion and Outlook}
@@ -1073,7 +1074,7 @@ questions will be addressed.
\section*{Discussion}
The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there
is no area where dropout sampling performs better than \gls{vanilla} SSD. In the
remainder of the section the individual results will be interpreted.
\subsection*{Impact of Averaging}
@@ -1085,8 +1086,8 @@ of the plot in both the \(F_1\) versus absolute open set error graph (see figure
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
This behaviour is caused by a large imbalance of detections between
the classes. For \gls{vanilla} \gls{SSD} with a 0.2 confidence threshold, there are
a total of 36,863 detections after \gls{NMS} and top \(k\).
The persons class contributes 14,640 detections or around 40\% to that number. Another strong class is cars with 2,252 detections or around
6\%. In third place come chairs with 1,352 detections or around 4\%. This means that three classes together have roughly as many detections
as the remaining 57 classes combined.
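For reference, with the usual definitions the two averaging schemes differ as follows (shown for precision; recall is analogous):
\[
\mathit{Precision}_{\mathrm{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} TP_c + \sum_{c} FP_c},
\qquad
\mathit{Precision}_{\mathrm{macro}} = \frac{1}{C} \sum_{c} \frac{TP_c}{TP_c + FP_c},
\]
where \(c\) runs over the \(C\) classes. Micro averaging pools all detections and is therefore dominated by large classes such as persons, whereas macro averaging weights every class equally.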
@@ -1119,7 +1120,7 @@ averaging was not reported in their paper.
\subsection*{Impact of Entropy}
There is no visible impact of entropy thresholding on the object detection
performance for \gls{vanilla} SSD. This indicates that the network has almost no
uniform or close to uniform predictions; the vast majority of predictions
has a high confidence in one class---including the background.
However, the entropy plays a larger role for the Bayesian variants---as
@@ -1144,52 +1145,52 @@ threshold indicates a worse performance.
variant & before & after & after \\
& entropy/NMS & entropy/NMS & top \(k\) \\
\hline
Bay. SSD, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\
no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\
\hline
\end{tabular}
\caption{Comparison of Bayesian \gls{SSD} variants without dropout with
respect to the number of detections before the entropy threshold,
after it and/or \gls{NMS}, and after top \(k\). The
entropy threshold 1.5 was used for both.}
\label{tab:effect-nms}
\end{table}
Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression Miller et al.~\cite{Miller2018} supposedly did not use \gls{NMS}
in their implementation of dropout sampling. Therefore, a variant with disabled in their implementation of dropout sampling. Therefore, a variant with disabled \glslocalreset{NMS}
non-maximum suppression (NMS) was tested. The results are somewhat expected: \gls{NMS} was tested. The results are somewhat expected:
non-maximum suppression removes all non-maximum detections that overlap \gls{NMS} removes all non-maximum detections that overlap
with a maximum one. This reduces the number of multiple detections per with a maximum one. This reduces the number of multiple detections per
ground truth bounding box and therefore the false positives. Without it, ground truth bounding box and therefore the false positives. Without it,
a lot more false positives remain and have a negative impact on precision. a lot more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected: In combination with top \(k\) selection, recall can be affected:
duplicate detections could stay while the actual maxima could be removed. duplicate detections could stay while the actual maxima could be removed.
The number of observations was measured before and after the combination of entropy threshold and NMS filter: both Bayesian SSD without The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without
NMS and without dropout, and Bayesian SSD with NMS and disabled dropout NMS and without dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with NMS has roughly 23\% of its observations left have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
(see table \ref{tab:effect-nms} for absolute numbers). (see table \ref{tab:effect-nms} for absolute numbers).
Without NMS, 79\% of observations are left. Irrespective of the absolute Without \gls{NMS}, 79\% of observations are left. Irrespective of the absolute
number, this discrepancy clearly shows the impact of non-maximum suppression and also explains a higher count of false positives: number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:
more than 50\% of the original observations were removed with NMS but more than 50\% of the original observations were removed with \gls{NMS} but
stayed without it---all of these are very likely to be false positives. stayed without it---all of these are very likely to be false positives.
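For reference, the quantity this threshold of 1.5 operates on is the entropy of an observation's class distribution \(\bar{q}\); a standard formulation (the notation is assumed here, not taken from the implementation) keeps an observation only if
\[
H(\bar{q}) = -\sum_{c} \bar{q}_c \log \bar{q}_c \le 1.5 .
\]
\(H\) is zero for a one-hot prediction and reaches its maximum \(\log K\) for a uniform distribution over \(K\) classes, which is why confident detections pass the filter while near-uniform ones are removed.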
A clear distinction between micro and macro averaging can be observed: A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but goes down equally with macro averaging (0.229). For micro averaging, it does recall is hardly affected with micro averaging (0.300) but goes down equally with macro averaging (0.229). For micro averaging, it does
not matter which class the true positives belong to: every detection not matter which class the true positives belong to: every detection
counts the same way. This also means that top \(k\) will have only counts the same way. This also means that top \(k\) will have only
a marginal effect: some true positives might be removed without NMS but overall that does not have a big impact. With macro averaging, however, a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however,
the class of the true positives matters a lot: for example, if two the class of the true positives matters a lot: for example, if two
true positives are removed from a class with only a few true positives true positives are removed from a class with only a few true positives
to begin with, then their removal will have a drastic influence on to begin with, then their removal will have a drastic influence on
the class recall value and hence the overall result. the class recall value and hence the overall result.
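A small, purely illustrative calculation (the numbers are invented, not taken from the evaluation) makes this concrete: assume class A has \(10{,}000\) true positives out of \(12{,}000\) ground truth boxes and class B has \(5\) out of \(10\). Removing two true positives from class B changes micro recall only from \(10{,}005/12{,}010 \approx 0.833\) to \(10{,}003/12{,}010 \approx 0.833\), but it lowers the recall of class B from \(0.5\) to \(0.3\) and with it the macro average over the two classes from roughly \(0.67\) to \(0.57\).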
The impact of top \(k\) was measured by counting the number of observations The impact of top \(k\) was measured by counting the number of observations
after top \(k\) has been applied: the variant with NMS keeps about 94\% after top \(k\) has been applied: the variant with \gls{NMS} keeps about 94\%
of the observations left after NMS; without NMS, only about 59\% of observations of the observations left after NMS; without \gls{NMS}, only about 59\% of observations
are kept. This shows a significant impact of top \(k\) on the result are kept. This shows a significant impact of top \(k\) on the result
in the case of disabled non-maximum suppression. Furthermore, some in the case of disabled \gls{NMS}. Furthermore, some
classes are hit harder by top \(k\) than others: for example, classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of the observations but persons only 57\%. dogs keep around 82\% of the observations but persons only 57\%.
This indicates that detected dogs are mostly on images with few detections This indicates that detected dogs are mostly on images with few detections
@ -1211,12 +1212,12 @@ recall.
variant & after & after \\ variant & after & after \\
& prediction & observation grouping \\ & prediction & observation grouping \\
\hline \hline
Bay. SSD, no dropout, NMS & 1,677,050 & 155,250 \\ Bay. SSD, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\
keep rate 0.9, NMS & 1,617,675 & 549,166 \\ keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\
\hline \hline
\end{tabular} \end{tabular}
\caption{Comparison of Bayesian SSD variants without dropout and with \caption{Comparison of Bayesian \gls{SSD} variants without dropout and with
0.9 keep ratio of dropout with 0.9 keep ratio of dropout with
respect to the number of detections directly after the network respect to the number of detections directly after the network
predictions and after the observation grouping.} predictions and after the observation grouping.}
@ -1229,7 +1230,7 @@ dropout and the weights are not prepared for it.
Gal~\cite{Gal2017} Gal~\cite{Gal2017}
showed that networks \textbf{trained} with dropout are approximate Bayesian showed that networks \textbf{trained} with dropout are approximate Bayesian
models. The Bayesian variants of SSD implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models. models. The Bayesian variants of \gls{SSD} implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.
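For context, the approximation in question (stated here in its textbook form, not as the exact equation used in this implementation) replaces the predictive distribution by an average over \(T\) stochastic forward passes with sampled dropout masks \(\hat{W}_t\):
\[
p(y = c \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\left(f^{\hat{W}_t}(x)\right)_c ,
\]
and this average is only justified as an approximate Bayesian posterior if the \(\hat{W}_t\) come from the same dropout distribution the network was trained with.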
But dropout alone does not explain the difference in results. Both variants But dropout alone does not explain the difference in results. Both variants
with and without dropout have the exact same number of detections coming with and without dropout have the exact same number of detections coming
@ -1252,9 +1253,9 @@ has slightly fewer predictions left compared to the one without dropout.
After the grouping, the variant without dropout has on average between After the grouping, the variant without dropout has on average between
10 and 11 detections grouped into an observation. This is expected as every 10 and 11 detections grouped into an observation. This is expected as every
forward pass creates the exact same result and these 10 identical detections forward pass creates the exact same result and these 10 identical detections
per vanilla SSD detection perfectly overlap. The fact that slightly more than per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
10 detections are grouped together could explain the marginally better precision 10 detections are grouped together could explain the marginally better precision
of the Bayesian variant without dropout compared to vanilla SSD. of the Bayesian variant without dropout compared to \gls{vanilla} SSD.
However, on average only three detections are grouped together into an However, on average only three detections are grouped together into an
observation if dropout with 0.9 keep ratio is enabled. This does not observation if dropout with 0.9 keep ratio is enabled. This does not
negatively impact recall as true positives do not disappear but offers negatively impact recall as true positives do not disappear but offers
@ -1276,7 +1277,7 @@ from Miller et al. The complete source code or otherwise exhaustive
implementation details of Miller et al. would be required to attempt an answer. implementation details of Miller et al. would be required to attempt an answer.
Future work could explore the performance of this implementation when used Future work could explore the performance of this implementation when used
on an SSD variant that was fine-tuned or trained with dropout. In this case, it on an \gls{SSD} variant that was fine-tuned or trained with dropout. In this case, it
should also look into the impact of training with both dropout and batch should also look into the impact of training with both dropout and batch
normalisation. normalisation.
Other avenues include the application to other data sets or object detection Other avenues include the application to other data sets or object detection

12
glossary.tex Normal file
View File

@ -0,0 +1,12 @@
% acronyms
\newacronym{NMS}{NMS}{non-maximum suppression}
\newacronym{SSD}{SSD}{Single Shot MultiBox Detector}
% terms
\newglossaryentry{vanilla}
{
name={vanilla},
description={
describes the original, unmodified version of something
}
}
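A minimal sketch of how these entries behave in a document (the file below is hypothetical and only illustrates the long-short acronym style; it is not part of the commit):
\documentclass{article}
\usepackage[toc]{glossaries}  % xindy omitted in this sketch; makeindex is the default backend
\setacronymstyle{long-short}  % first use: "long form (short form)", later uses: short form only
\makeglossaries
\input{glossary.tex}
\begin{document}
\gls{SSD} ...                 % first use prints "Single Shot MultiBox Detector (SSD)"
\gls{SSD} with \gls{NMS} ...  % later uses print "SSD"; first NMS use prints the long form
\glslocalreset{NMS}           % locally resets the first-use flag, as done in body.tex above
the \gls{vanilla} \gls{SSD}   % plain glossary entries such as vanilla print their name
\printglossaries
\end{document}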

View File

@ -102,7 +102,8 @@
\usepackage{makeidx} \usepackage{makeidx}
\makeindex \makeindex
\usepackage[xindy]{glossaries} % for \printglossary \usepackage[xindy,toc]{glossaries} % for \printglossary
\setacronymstyle{long-short}
\makeglossaries \makeglossaries
%%% conditional includes %%% conditional includes
@ -183,7 +184,7 @@
\newcommand{\finish}{% \newcommand{\finish}{%
%\clearpage %\clearpage
\printglossary \printglossaries
%\clearpage %\clearpage
\printindex \printindex
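With the xindy option the sorting step has to go through the makeglossaries wrapper between LaTeX runs; a plausible build sequence is sketched below (the main file name ma is an assumption derived from ma.bib, not confirmed by the commit):
% assumed build sequence; the file name "ma" is a guess based on ma.bib
%   pdflatex ma
%   makeglossaries ma   % picks xindy because of the [xindy] package option
%   pdflatex ma
%   pdflatex ma         % resolves the table of contents entry added by [toc]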

View File

@ -33,6 +33,8 @@
% specify bib resource % specify bib resource
\addbibresource{ma.bib} \addbibresource{ma.bib}
\input{glossary.tex}
\makeatletter \makeatletter
\g@addto@macro\appendix{% \g@addto@macro\appendix{%
\renewcommand*{\chapterformat}{% \renewcommand*{\chapterformat}{%