Auto-encoders work well for data sets like MNIST~\cite{Deng2012}

but perform poorly on challenging real-world data sets

like MS COCO~\cite{Lin2014}, complicating any potential comparison between

them and object detection networks like \gls{SSD}.

Therefore, a comparison between model uncertainty with a network like

SSD and novelty detection with auto-encoders is considered out of scope

for this thesis.

Miller et al.~\cite{Miller2018} used an \gls{SSD} pre-trained on COCO

without further fine-tuning on the SceneNet RGB-D data

set~\cite{McCormac2017} and reported good results regarding

open set error for an \gls{SSD} variant with dropout sampling and entropy

thresholding.

If their results are generalisable, it should be possible to replicate

the relative difference between the variants on the COCO data set.

This leads to the following hypothesis: \emph{Dropout sampling

delivers better object detection performance under open set

conditions compared to object detection without it.}

For the purpose of this thesis, I will use the \gls{vanilla} \gls{SSD} (as in: the original SSD) as

baseline to compare against. In particular, \gls{vanilla} \gls{SSD} uses

a per-class confidence threshold of 0.01, an IOU threshold of 0.45

for the \gls{NMS}, and a top \(k\) value of 200. For this

thesis, the top \(k\) value was changed to 20 and the confidence threshold

of 0.2 was tried as well.
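The interaction of the confidence threshold and the top \(k\) selection can be sketched as follows; this is an illustrative NumPy fragment, not the decoding code actually used in this thesis:

```python
import numpy as np

def filter_detections(confidences, conf_thresh=0.2, top_k=20):
    """Sketch of the per-class confidence threshold and top-k selection.

    confidences: array of shape (n_boxes,) with scores for one class.
    Returns indices of the kept boxes, highest scores first.
    """
    # Keep only boxes whose class confidence exceeds the threshold.
    kept = np.where(confidences > conf_thresh)[0]
    # Sort the survivors by confidence and keep at most top_k of them.
    order = kept[np.argsort(confidences[kept])[::-1]]
    return order[:top_k]

scores = np.array([0.9, 0.05, 0.3, 0.15, 0.6])
print(filter_detections(scores))  # [0 4 2]: boxes over 0.2, best first
```

Lowering the threshold to 0.01 lets far more low-confidence boxes through, which is why the 0.01 and 0.2 variants behave so differently later on.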

The effect of an entropy threshold is measured against this \gls{vanilla}

SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from

Miller et al.). Dropout sampling is compared to \gls{vanilla} SSD

with and without entropy thresholding.
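The entropy test itself is a comparison of the Shannon entropy of a detection's class distribution against the chosen threshold; a minimal sketch, assuming entropy in nats over the softmax scores:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete class distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log(0) is treated as 0
    return float(-np.sum(p * np.log(p)))

# A confident prediction has low entropy, a near-uniform one high entropy.
confident = [0.97, 0.01, 0.01, 0.01]
uniform   = [0.25, 0.25, 0.25, 0.25]

threshold = 1.0                        # example value from the 0.1-2.4 range
assert entropy(confident) < threshold  # detection is kept
assert entropy(uniform) > threshold    # detection is rejected
```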

\paragraph{Hypothesis} Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it.

First, chapter \ref{chap:background} presents related works and

provides the background for dropout sampling.

Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how

Bayesian \gls{SSD} extends \gls{vanilla} SSD, and how the decoding pipelines are

structured.

Chapter \ref{chap:experiments-results} presents the data sets,

the experimental setup, and the results. This is followed by

be used to identify and reject these false positive cases.

\label{chap:methods}

This chapter explains the functionality of \gls{vanilla} SSD, Bayesian SSD, and the decoding pipelines.

\section{Vanilla SSD}

\begin{figure}

\centering

\includegraphics[scale=1.2]{vanilla-ssd}

\caption{The \gls{vanilla} \gls{SSD} network as defined by Liu et al.~\cite{Liu2016}. VGG-16 is the base network, extended with extra feature layers. These predict offsets to anchor boxes with different sizes and aspect ratios. Furthermore, they predict the

corresponding confidences.}

\label{fig:vanilla-ssd}

\end{figure}

Vanilla \gls{SSD} is based upon the VGG-16 network (see figure

\ref{fig:vanilla-ssd}) and adds extra feature layers. The entire

image (always size 300x300) is divided up into anchor boxes. During

training, each of these boxes is mapped to a ground truth box or

SSD network are the predictions with class confidences, offsets to the

anchor box, anchor box coordinates, and variance. The model loss is a

weighted sum of localisation and confidence loss. As the network

has a fixed number of anchor boxes, every forward pass creates the same

number of detections---8732 in the case of \gls{SSD} 300x300.

Notably, the object proposals for an image are made in a single run: single shot.
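The fixed count of 8732 detections follows directly from the feature map resolutions and the number of anchor boxes per cell that Liu et al. define for SSD 300x300; a quick sanity check:

```python
# Feature map side lengths and anchor boxes per cell for SSD 300x300,
# as defined by Liu et al.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Every cell of every feature map predicts its fixed set of anchor boxes.
total = sum(side * side * boxes for side, boxes in feature_maps)
print(total)  # 8732
```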

Liu et al.~\cite{Liu2016}.

\section{Bayesian SSD for Model Uncertainty}

Networks trained with dropout are a general approximate Bayesian model~\cite{Gal2017}. As such, they can be used for everything a true

Bayesian model could be used for. The idea is applied to \gls{SSD} in this

thesis: two dropout layers are added to \gls{vanilla} SSD, after the layers fc6 and fc7 respectively (see figure \ref{fig:bayesian-ssd}).
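Dropout sampling then amounts to keeping dropout active at test time and averaging the softmax outputs over several forward passes. A toy sketch of this idea (the tiny linear "network" is a stand-in for illustration only, not the SSD architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_pass_with_dropout(x, keep_ratio=0.9):
    """Stand-in for one forward pass with dropout active at test time.

    The 'network' is a fixed linear layer; only the dropout mask changes
    between calls, so repeated passes give different softmax outputs.
    """
    weights = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
    mask = rng.random(x.shape) < keep_ratio        # Bernoulli dropout mask
    logits = weights @ (x * mask / keep_ratio)     # inverted dropout scaling
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                         # softmax class scores

x = np.array([0.8, 0.3])
samples = [forward_pass_with_dropout(x) for _ in range(10)]
mean_scores = np.mean(samples, axis=0)  # averaged class scores over 10 passes
print(mean_scores)
```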

\begin{figure}

\centering

\includegraphics[scale=1.2]{bayesian-ssd}

\caption{The Bayesian \gls{SSD} network as defined by Miller et al.~\cite{Miller2018}. It adds dropout layers after the fc6

and fc7 layers.}

\label{fig:bayesian-ssd}

\end{figure}

and very low confidences in other classes.

\subsection{Implementation Details}

For this thesis, an \gls{SSD} implementation based on TensorFlow~\cite{Abadi2015} and

% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3

% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9

\hline

\end{tabular}

\caption{Rounded results for micro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with

their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an

entropy threshold of 2.4, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.0,

and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as entropy

threshold.

Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed

best for 1.4 as entropy threshold, the run with 0.5 keep ratio performed

best for 1.3 as threshold.}

\label{tab:results-micro}

in the next chapter.

\end{minipage}

\end{figure}

Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see

table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score

(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither

the \gls{vanilla} \gls{SSD} variant with a confidence threshold of 0.01 nor the \gls{SSD} with

an entropy test can outperform the 0.2 variant. Among the \gls{vanilla} \gls{SSD} variants,

the 0.2 variant also has the lowest number of open set errors (2939) and the

highest precision (0.372).

The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01

shows no significant impact of an entropy test. Only the open set errors

are slightly lower, but not significantly so. The remaining performance metrics are

identical after rounding.

Bayesian \gls{SSD} with disabled dropout and without \gls{NMS}

has the worst performance of all tested variants (\gls{vanilla} and Bayesian)

with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower than that of all other variants.

In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.

With 2335 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and

enabled \gls{NMS} offers the best performance with respect

to open set errors. It also has the best precision (0.378) of all tested

variants. Furthermore, it provides the best performance among all variants

with multiple forward passes.

Dropout decreases the performance of the network; this can be seen in the lower \(F_1\) scores, higher open set errors, and lower precision

values. Both dropout variants have worse recall (0.363 and 0.342) than

the variant with disabled dropout.

However, all variants with multiple forward passes have lower open set

errors than all \gls{vanilla} \gls{SSD} variants.

The relation of \(F_1\) score to absolute open set error can be observed

in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants

can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} SSD

variants with 0.01 confidence threshold reach much higher open set errors

and a higher recall. This behaviour is expected as more and worse predictions

are included.

reported figures, such as the ones in Miller et al.~\cite{Miller2018}

Forward & max & abs OSE & Recall & Precision\\

Passes &\(F_1\) Score &\multicolumn{3}{c}{at max \(F_1\) point}\\

% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3

% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7

% 1.7 for 8, 2.0 for 9

\hline

\end{tabular}

\caption{Rounded results for macro averaging. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with

their best performing entropy threshold with respect to \(F_1\) score. Vanilla \gls{SSD} with Entropy test performed best with an

entropy threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5,

and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as entropy

threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed

best for 1.7 as entropy threshold, the run with 0.5 keep ratio performed

best for 2.0 as threshold.}

\label{tab:results-macro}

\end{minipage}

\end{figure}

Vanilla \gls{SSD} with a per-class confidence threshold of 0.2 performs best (see

table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score

(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD

with an entropy test slightly outperforms the 0.2 variant with respect to

precision (0.425). Additionally, this is the best precision overall. Among

the \gls{vanilla} \gls{SSD} variants, the 0.2 variant also has the lowest

number of open set errors (1218).

The comparison of the \gls{vanilla} \gls{SSD} variants with a confidence threshold of 0.01

shows no significant impact of an entropy test. Only the open set errors

are slightly lower, but not significantly so. The remaining performance metrics are

almost identical after rounding.

The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: a maximum \(F_1\) score of 0.363 (with \gls{NMS}) versus 0.226

(without \gls{NMS}). Dropout was disabled in both cases, making them effectively a

\gls{vanilla} \gls{SSD} run with multiple forward passes.

With 809 open set errors, the Bayesian \gls{SSD} variant with disabled dropout and

without \gls{NMS} offers the best performance with respect

to open set errors. The variant without dropout and enabled \gls{NMS} has the best \(F_1\) score (0.363), the best

precision (0.420) and the best recall (0.321) of all Bayesian variants.

Dropout decreases the performance of the network; this can be seen

in the lower \(F_1\) scores, higher open set errors, and lower precision and

recall values. However, all variants with multiple forward passes have lower open set errors than all \gls{vanilla} SSD

variants.

The relation of \(F_1\) score to absolute open set error can be observed

in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants

can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} SSD

variants with 0.01 confidence threshold reach much higher open set errors

and a higher recall. This behaviour is expected as more and worse predictions

are included.

they had the exact same performance before rounding.

Forward & max & Recall & Precision\\

Passes &\(F_1\) Score &\multicolumn{2}{c}{at max \(F_1\) point}\\

% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3

% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7

% 1.7 for 8, 2.0 for 9

\hline

\end{tabular}

\caption{Rounded results for persons class. \gls{SSD} with Entropy test and Bayesian \gls{SSD} are represented with

their best performing macro averaging entropy threshold with respect to \(F_1\) score.}

\label{tab:results-persons}

\end{table}

It is clearly visible that the overall trend continues in the individual

classes (see tables \ref{tab:results-persons}, \ref{tab:results-cars}, \ref{tab:results-chairs}, \ref{tab:results-bottles}, and \ref{tab:results-giraffes}). However, the two \gls{vanilla} \gls{SSD} variants with only 0.01 confidence

threshold perform better than in the averaged results presented earlier.

Only in the chairs class does a Bayesian \gls{SSD} variant perform better (in

precision) than any of the \gls{vanilla} \gls{SSD} variants. Moreover, there are

multiple classes where two or all of the \gls{vanilla} \gls{SSD} variants perform

equally well. When compared with the macro averaged results,

giraffes and persons perform better across the board. Cars have a higher

precision than average but lower recall values for all but the Bayesian

SSD variant without NMS and dropout. Chairs and bottles perform

SSD variant without \gls{NMS} and dropout. Chairs and bottles perform

worse than average.

\begin{table}[tbp]

Forward & max & Recall & Precision\\

Passes &\(F_1\) Score &\multicolumn{2}{c}{at max \(F_1\) point}\\

\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}

\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}

\label{fig:stop-sign-truck-bayesian}

\end{minipage}

\end{figure}

The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are barely visible

(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is detected by neither \gls{vanilla} nor Bayesian SSD; instead, both detected a potted plant and a traffic light. The stop sign is detected by both variants.

This behaviour implies problems with detecting objects at the edge

that overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.

\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from \gls{vanilla} SSD.}

\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian \gls{SSD} with 0.9 keep ratio.}

\label{fig:cat-laptop-bayesian}

\end{minipage}

\end{figure}

Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right

side. Both variants detect a cat, but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected, but this is expected since

the network was not trained on these classes.

\chapter{Discussion and Outlook}

questions will be addressed.

\section*{Discussion}

The results clearly do not support the hypothesis: \textit{Dropout sampling delivers better object detection performance under open set conditions compared to object detection without it}. With the exception of open set errors, there

is no area where dropout sampling performs better than \gls{vanilla} SSD. In the

remainder of the section the individual results will be interpreted.

\subsection*{Impact of Averaging}

of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and

the precision-recall curve (see figure \ref{fig:precision-recall-micro}).

This behaviour is caused by a large imbalance of detections between

the classes. For \gls{vanilla} \gls{SSD} with 0.2 confidence threshold there are

a total of 36,863 detections after \gls{NMS} and top \(k\).

The persons class contributes 14,640 detections or around 40\% to that number. Another strong class is cars with 2,252 detections or around

6\%. In third place come chairs with 1,352 detections or around 4\%. This means that these three classes together have roughly as many detections

as the remaining 57 classes combined.
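A small worked example shows how such an imbalance pulls micro and macro averaged metrics apart (the counts here are invented purely for illustration, not taken from the experiments):

```python
# Hypothetical per-class counts: true positives and ground-truth totals,
# with one dominant class (like persons) and one small class.
classes = {
    "person": {"tp": 900, "gt": 1000},
    "bottle": {"tp": 10,  "gt": 100},
}

# Micro averaging pools all detections before computing recall.
micro_recall = (sum(c["tp"] for c in classes.values())
                / sum(c["gt"] for c in classes.values()))

# Macro averaging computes recall per class, then takes the mean.
macro_recall = sum(c["tp"] / c["gt"] for c in classes.values()) / len(classes)

print(round(micro_recall, 3))  # 0.827 -- dominated by the large class
print(round(macro_recall, 3))  # 0.5   -- both classes weigh equally
```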

averaging was not reported in their paper.

\subsection*{Impact of Entropy}

There is no visible impact of entropy thresholding on the object detection

performance for \gls{vanilla} SSD. This indicates that the network has almost no

uniform or close to uniform predictions; the vast majority of predictions

has a high confidence in one class---including the background.

However, the entropy plays a larger role for the Bayesian variants---as

threshold indicates a worse performance.

variant & before & after & after \\

& entropy/NMS & entropy/NMS & top \(k\)\\

\hline

Bay. SSD, no dropout, no \gls{NMS} & 155,251 & 122,868 & 72,207 \\

no dropout, \gls{NMS} & 155,250 & 36,061 & 33,827 \\

\hline

\end{tabular}

\caption{Comparison of Bayesian \gls{SSD} variants without dropout with

respect to the number of detections before the entropy threshold,

after it and/or \gls{NMS}, and after top \(k\). The

entropy threshold 1.5 was used for both.}

\label{tab:effect-nms}

\end{table}

Miller et al.~\cite{Miller2018} supposedly did not use \gls{NMS}

in their implementation of dropout sampling. Therefore, a variant with disabled\glslocalreset{NMS}

\gls{NMS} was tested. The results are somewhat expected:

\gls{NMS} removes all non-maximum detections that overlap

with a maximum one. This reduces the number of multiple detections per

ground truth bounding box and therefore the false positives. Without it,

a lot more false positives remain and have a negative impact on precision.

In combination with top \(k\) selection, recall can be affected:

duplicate detections could remain while maximum boxes are removed.
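A minimal greedy NMS sketch illustrates this behaviour; this is an illustrative fragment, not the implementation used in the experiments:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it more than iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) <= iou_thresh],
                         dtype=int)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too strongly
```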

The number of observations was measured before and after the combination of entropy threshold and \gls{NMS} filter: both Bayesian \gls{SSD} without

\gls{NMS} and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout

have the same number of observations before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and \gls{NMS}, the variant with \gls{NMS} has roughly 23\% of its observations left

(see table \ref{tab:effect-nms} for absolute numbers).

Without \gls{NMS}, 79\% of observations are left. Irrespective of the absolute

number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:

more than 50\% of the original observations were removed with \gls{NMS} but remained without it---all of these are very likely to be false positives.

A clear distinction between micro and macro averaging can be observed:

recall is hardly affected with micro averaging (0.300) but goes down noticeably with macro averaging (0.229). For micro averaging, it does

not matter which class the true positives belong to: every detection

counts the same way. This also means that top \(k\) will have only

a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however,

the class of the true positives matters a lot: for example, if two

true positives are removed from a class with only a few true positives

to begin with, then their removal will have a drastic influence on

the class recall value and hence the overall result.

The impact of top \(k\) was measured by counting the number of observations

after top \(k\) has been applied: the variant with \gls{NMS} keeps about 94\%

of the observations left after \gls{NMS}; without \gls{NMS}, only about 59\% of observations

are kept. This shows a significant impact on the result by top \(k\)

in the case of disabled \gls{NMS}. Furthermore, some

classes are hit harder by top \(k\) than others: for example,

dogs keep around 82\% of the observations but persons only 57\%.

This indicates that detected dogs are mostly on images with few detections

variant & after & after \\

& prediction & observation grouping \\

\hline

Bay. SSD, no dropout, \gls{NMS} & 1,677,050 & 155,250 \\

keep rate 0.9, \gls{NMS} & 1,617,675 & 549,166 \\

\hline

\end{tabular}

\caption{Comparison of Bayesian \gls{SSD} variants without dropout and with

0.9 keep ratio of dropout with

respect to the number of detections directly after the network

predictions and after the observation grouping.}

dropout and the weights are not prepared for it.

Gal~\cite{Gal2017}

showed that networks \textbf{trained} with dropout are approximate Bayesian

models. The Bayesian variants of \gls{SSD} implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.

But dropout alone does not explain the difference in results. Both variants

with and without dropout have the exact same number of detections coming

has slightly fewer predictions left compared to the one without dropout.

After the grouping, the variant without dropout has on average between

10 and 11 detections grouped into an observation. This is expected as every

forward pass creates the exact same result and these 10 identical detections

per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than

10 detections are grouped together could explain the marginally better precision

of the Bayesian variant without dropout compared to \gls{vanilla} SSD.
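The grouping step can be sketched as follows; the IoU criterion and the 0.95 threshold here are illustrative assumptions, not necessarily the exact procedure of the implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union

def group_observations(detections, iou_thresh=0.95):
    """Group detections from repeated forward passes into observations.

    detections: list of (box, score) pairs; a detection joins the first
    group whose representative box it overlaps strongly, otherwise it
    starts a new group. Scores within a group are averaged.
    """
    groups = []  # each entry: (representative box, list of scores)
    for box, score in detections:
        for rep, group_scores in groups:
            if iou(rep, box) > iou_thresh:
                group_scores.append(score)
                break
        else:
            groups.append((box, [score]))
    return [(rep, sum(s) / len(s)) for rep, s in groups]

# Ten identical passes (dropout disabled) collapse into one observation.
dets = [((0.0, 0.0, 10.0, 10.0), 0.8)] * 10
obs = group_observations(dets)
print(len(obs), obs[0][1])  # 1 0.8
```

With dropout enabled, the boxes jitter between passes, so fewer detections clear the overlap criterion and more, smaller groups form, matching the drop from ten to about three detections per observation described above.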

However, on average only three detections are grouped together into an

observation if dropout with 0.9 keep ratio is enabled. This does not

negatively impact recall as true positives do not disappear but offers

from Miller et al. The complete source code or otherwise exhaustive

implementation details of Miller et al. would be required to attempt an answer.

Future work could explore the performance of this implementation when used

on an \gls{SSD} variant that was fine-tuned or trained with dropout. In this case, it

should also look into the impact of training with both dropout and batch

normalisation.

Other avenues include the application to other data sets or object detection