Improved a variety of smaller things

Signed-off-by: Jim Martens <github@2martens.de>
Jim Martens 2019-09-24 12:22:13 +02:00
parent 7813cafc32
commit 2f9623e3d5
1 changed file with 54 additions and 57 deletions

body.tex

@@ -7,9 +7,9 @@ providing technical details.
\subsection*{Motivation}
Famous examples like the automatic soap dispenser which does not
Famous examples like the automatic soap dispenser, which does not
recognise the hand of a black person but dispenses soap when presented
with a paper towel raise the question of bias in computer
with a paper towel, raise the question of bias in computer
systems~\cite{Friedman1996}. Related to this ethical question regarding
the design of so called algorithms is the question of
algorithmic accountability~\cite{Diakopoulos2014}.
@@ -132,26 +132,28 @@ conditions compared to object detection without it.}
For the purpose of this thesis, I will use the vanilla SSD (as in: the original SSD) as
baseline to compare against. In particular, vanilla SSD uses
a per-class confidence threshold of 0.01, an IOU threshold of 0.45
for the non-maximum suppression, and a top k value of 200.
for the non-maximum suppression, and a top \(k\) value of 200. For this
thesis, the top \(k\) value was changed to 20 and the confidence threshold
of 0.2 was tried as well.
The effect of an entropy threshold is measured against this vanilla
SSD by applying entropy thresholds from 0.1 to 2.4 inclusive (limits taken from
Miller et al.). Dropout sampling is compared to vanilla SSD, both
Miller et al.). Dropout sampling is compared to vanilla SSD
with and without entropy thresholding.
\paragraph{Hypothesis} Dropout sampling
delivers better object detection performance under open set
conditions compared to object detection without it.
\subsection*{Reader's guide}
\subsection*{Reader's Guide}
First, chapter \ref{chap:background} presents related works and
provides the background for dropout sampling a.k.a Bayesian SSD.
Afterwards, chapter \ref{chap:methods} explains how the Bayesian SSD
works and how the decoding pipelines are structured.
provides the background for dropout sampling.
Afterwards, chapter \ref{chap:methods} explains how vanilla SSD works, how
Bayesian SSD extends vanilla SSD, and how the decoding pipelines are
structured.
Chapter \ref{chap:experiments-results} presents the data sets,
the experimental setup, and the results. This is followed by
chapter \ref{chap:discussion}, focusing on
the discussion and closing.
chapter \ref{chap:discussion}, focusing on the discussion and closing.
Therefore, the contribution is found in chapters \ref{chap:methods},
\ref{chap:experiments-results}, and \ref{chap:discussion}.
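The entropy thresholding swept in the hunk above (0.1 to 2.4 inclusive) can be sketched in a few lines of Python. This is a minimal sketch, not code from the thesis: the `detections` list, its `softmax` field, and the helper names are illustrative assumptions.

    import numpy as np

    def entropy(softmax_scores):
        """Shannon entropy of a softmax probability vector (natural log)."""
        p = np.clip(softmax_scores, 1e-12, 1.0)  # guard against log(0)
        return float(-np.sum(p * np.log(p)))

    def filter_by_entropy(detections, threshold):
        """Discard detections whose class distribution is too uncertain,
        i.e. whose entropy exceeds the threshold (swept from 0.1 to 2.4)."""
        return [d for d in detections if entropy(d["softmax"]) <= threshold]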
@@ -162,8 +164,7 @@ Therefore, the contribution is found in chapters \ref{chap:methods},
This chapter will begin with an overview over previous works
in the field of this thesis. Afterwards the theoretical foundations
of the work of Miller et al.~\cite{Miller2018} will
be explained.
of dropout sampling will be explained.
\section{Related Works}
@@ -176,7 +177,7 @@ reconstruction-based novelty detection as it deals only with neural network
approaches. Therefore, the other types of novelty detection will only be
briefly introduced.
\subsection{Overview over types of novelty detection}
\subsection{Overview over types of Novelty Detection}
Probabilistic approaches estimate the generative probability density function (pdf)
of the data. It is assumed that the training data is generated from an underlying
@@ -208,7 +209,7 @@ difference in the metric when removed from the data set. This subset is consider
to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide
a recent approach.
\subsection{Reconstruction-based novelty detection}
\subsection{Reconstruction-based Novelty Detection}
Reconstruction-based approaches use the reconstruction error in one form
or another to calculate the novelty score. This can be auto-encoders that
@@ -224,7 +225,7 @@ Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigated the correlation between
the degree of novel input data and the reliability of network
outputs.
outputs, and introduced a quantitative way to measure novelty.
The Bayesian approach provides a theoretical foundation for
modelling uncertainty \cite{Ghahramani2015}.
@@ -259,20 +260,17 @@ Li et al.~\cite{Li2019} investigated the problem of poor performance
when combining dropout and batch normalisation: Dropout shifts the variance
of a neural unit when switching from train to test, batch normalisation
does not change the variance. This inconsistency leads to a variance shift which
can have a larger or smaller impact based on the network used. For example,
adding dropout layers to SSD \cite{Liu2016} and applying MC dropout, like
Miller et al.~\cite{Miller2018} did, causes such a problem because SSD uses
batch normalisation.
can have a larger or smaller impact based on the network used.
Non-Bayesian approaches have been developed as well. Usually, they compare with
MC dropout and show better performance.
Postels et al.~\cite{Postels2019} provided a sampling-free approach for
uncertainty estimation that does not affect training and approximates the
sampling on test time. They compared it to MC dropout and found less computational
sampling at test time. They compared it to MC dropout and found less computational
overhead with better results.
Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implemented a predictive uncertainty estimation using deep ensembles.
Compared to MC dropout, it showed better results.
Compared to MC dropout, it shows better results.
Geifman et al.~\cite{Geifman2018}
introduced an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
@@ -288,10 +286,10 @@ are important as well. Mukhoti and Gal~\cite{Mukhoti2018}
contributed metrics to measure uncertainty for semantic
segmentation. Wu et al.~\cite{Wu2019} introduced two innovations
that turn variational Bayes into a robust tool for Bayesian
networks: a novel deterministic method to approximate
networks: first, a novel deterministic method to approximate
moments in neural networks which eliminates gradient variance, and
a hierarchical prior for parameters and an empirical Bayes procedure to select
prior variances.
second, a hierarchical prior for parameters and an empirical Bayes
procedure to select prior variances.
\section{Background for Dropout Sampling}
@@ -342,7 +340,7 @@ over the network weights, for example a Gaussian prior distribution:
\(\mathbf{W}\) are the weights and \(I\) symbolises that every
weight is drawn from an independent and identical distribution. The
training of the network determines a plausible set of weights by
evaluating the posterior (probability output) over the weights given
evaluating the probability output (posterior) over the weights given
the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
However, this
evaluation cannot be performed in any reasonable
@@ -369,7 +367,7 @@ training data \(\mathbf{T}\):
p(y|\mathcal{I}, \mathbf{T}) = \int p(y|\mathcal{I}, \mathbf{W}) \cdot p(\mathbf{W}|\mathbf{T})d\mathbf{W} \approx \frac{1}{n} \sum_{i=1}^{n}\mathbf{s}_i
\end{equation}
With this dropout sampling technique \(n\) model weights
With this dropout sampling technique, \(n\) model weights
\(\widetilde{\mathbf{W}}_i\) are sampled from the posterior
\(p(\mathbf{W}|\mathbf{T})\). The class probability
\(p(y|\mathcal{I}, \mathbf{T})\) is a probability vector
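The approximation in the equation above maps directly onto a short sampling loop. A minimal sketch, assuming a Keras-style `model` whose dropout layers stay active when called with `training=True` and which returns per-box softmax score vectors; both assumptions are illustrative, not taken from the thesis code.

    import numpy as np

    def mc_dropout_predict(model, image, n=10):
        """Approximate p(y|I, T) by averaging n stochastic forward passes,
        i.e. (1/n) * sum of the sampled score vectors s_1, ..., s_n."""
        samples = [model(image, training=True) for _ in range(n)]  # s_i
        return np.mean(samples, axis=0)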
@@ -437,8 +435,8 @@ Vanilla SSD is based upon the VGG-16 network (see figure
\ref{fig:vanilla-ssd}) and adds extra feature layers. The entire
image (always size 300x300) is divided up into anchor boxes. During
training, each of these boxes is mapped to a ground truth box or
background. For every anchor box the offset to
the object, and the class confidences are calculated. The output of the
background. For every anchor box both the offset to
the object and the class confidences are calculated. The output of the
SSD network are the predictions with class confidences, offsets to the
anchor box, anchor box coordinates, and variance. The model loss is a
weighted sum of localisation and confidence loss. As the network
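The weighted sum named above follows the multibox loss of the SSD paper. A sketch under that assumption, where `alpha` balances the two terms and `num_matched` counts the anchor boxes matched to ground truth:

    def ssd_multibox_loss(conf_loss, loc_loss, num_matched, alpha=1.0):
        """Weighted sum of confidence (softmax cross-entropy) and
        localisation (smooth L1) loss, normalised by matched anchors."""
        n = max(num_matched, 1)  # avoid division by zero when nothing matches
        return (conf_loss + alpha * loc_loss) / n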
@@ -567,7 +565,7 @@ Additionally, all detections with a background prediction of 0.8 or higher are d
The remaining detections are partitioned into observations to
further reduce the size of the output, and
to identify uncertainty. This is accomplished by calculating the
mutual IOU of every detection with all other detections. Detections
mutual IOU score of every detection with all other detections. Detections
with a mutual IOU score of 0.95 or higher are partitioned into an
observation. Next, the softmax scores and bounding box coordinates of
all detections in an observation are averaged.
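A sketch of this partitioning step; the text does not spell out the grouping order, so a greedy seed-and-absorb scheme is assumed here, and the `iou` helper is written out for self-containment.

    import numpy as np

    def iou(a, b):
        """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def partition_into_observations(boxes, scores, iou_threshold=0.95):
        """Group detections with mutual IOU >= 0.95 into observations and
        average the box coordinates and softmax scores of each group."""
        remaining = list(range(len(boxes)))
        observations = []
        while remaining:
            seed = remaining.pop(0)
            group = [seed] + [i for i in remaining
                              if iou(boxes[seed], boxes[i]) >= iou_threshold]
            remaining = [i for i in remaining if i not in group]
            observations.append((np.mean([boxes[i] for i in group], axis=0),
                                 np.mean([scores[i] for i in group], axis=0)))
        return observations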
@@ -596,7 +594,7 @@ at the end.
This chapter explains the used data sets, how the experiments were
set up, and what the results are.
\section{Data sets}
\section{Data Sets}
This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
80 classes, from airplanes to toothbrushes many classes are present.
@@ -615,7 +613,7 @@ impossible values: bounding box height or width lower than zero,
\(x_{max}\) and \(y_{max}\) coordinates lower than or equal to zero, \(x_{min}\) greater than \(x_{max}\),
\(y_{min}\) greater than \(y_{max}\), image width lower than \(x_{max}\),
and image height lower than \(y_{max}\). In the last two cases the
bounding box width or height was set to (image width - \(x_{min}\)) or
bounding box width and height were set to (image width - \(x_{min}\)) and
(image height - \(y_{min}\)) respectively;
in the other cases the annotation was skipped.
If the bounding box width or height afterwards is
@@ -623,8 +621,9 @@ lower than or equal to zero the annotation was skipped.
SSD accepts 300x300 input images, the MS COCO data set images were
resized to this resolution; the aspect ratio was not kept in the
process. As all images of MS COCO have the same resolution,
this led to a uniform distortion of the images. Furthermore,
process. MS COCO contains landscape and portrait images with (640x480)
and (480x640) as the resolution. This led to a uniform distortion of the
portrait and landscape images respectively. Furthermore,
the colour channels were swapped from RGB to BGR in order to
comply with the SSD implementation. The BGR requirement stems from
the usage of Open CV in SSD: the internal channel order for
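Both preparation steps described in the two hunks above (annotation sanitisation and image preprocessing) fit in a short sketch. The helper names are illustrative; the clipping mirrors the rules listed in the previous hunk, and the channel swap follows the stated OpenCV (BGR) convention.

    import cv2

    def sanitise_annotation(x_min, y_min, x_max, y_max, img_w, img_h):
        """Return a corrected box, or None if the annotation is skipped."""
        if x_max <= 0 or y_max <= 0 or x_min > x_max or y_min > y_max:
            return None  # irreparable values: skip the annotation
        x_max = min(x_max, img_w)  # width becomes img_w - x_min
        y_max = min(y_max, img_h)  # height becomes img_h - y_min
        if x_max - x_min <= 0 or y_max - y_min <= 0:
            return None  # degenerate after clipping: skip
        return x_min, y_min, x_max, y_max

    def preprocess(image_rgb):
        """Resize to the 300x300 SSD input (aspect ratio not kept) and
        swap the channel order from RGB to BGR for the OpenCV-based SSD."""
        resized = cv2.resize(image_rgb, (300, 300))
        return resized[..., ::-1]  # RGB -> BGR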
@@ -661,7 +660,7 @@ on the object detection performance.
Bayesian SSD was run with 0.2 confidence threshold and compared
to vanilla SSD with 0.2 confidence threshold. Coupled with the
entropy threshold, this comparison shows how uncertain the network
entropy threshold, this comparison reveals how uncertain the network
is. If it is very certain the dropout sampling should have no
significant impact on the result. Furthermore, in two cases the
dropout was turned off to isolate the impact of non-maximum suppression
@@ -701,7 +700,7 @@ in the next chapter.
\hline
Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.209 & 2709 & 0.300 & 0.161 \\
no dropout - 0.2 conf - NMS \; 10 & 0.371 & \textbf{2335} & 0.365 & \textbf{0.378} \\
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.360 & 2595 & 0.367 & 0.353 \\
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.359 & 2584 & 0.363 & 0.357 \\
0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.325 & 2759 & 0.342 & 0.311 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% 0.5 for Bayesian - 6, 1.4 for 7, 1.4 for 8, 1.3 for 9
@@ -754,15 +753,14 @@ With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
enabled non-maximum suppression offers the best performance with respect
to open set errors. It also has the best precision (0.378) of all tested
variants. Furthermore, it provides the best performance among all variants
with multiple forward passes except for recall.
with multiple forward passes.
Dropout decreases the performance of the network, this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision
values. The variant with 0.9 keep ratio outperforms all other Bayesian
variants with respect to recall (0.367). The variant with 0.5 keep
ratio has worse recall (0.342) than the variant with disabled dropout.
However, all variants with multiple forward passes have lower open set errors
than all vanilla SSD variants.
values. Both dropout variants have worse recall (0.363 and 0.342) than
the variant with disabled dropout.
However, all variants with multiple forward passes have lower open set
errors than all vanilla SSD variants.
The relation of \(F_1\) score to absolute open set error can be observed
in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
@@ -788,7 +786,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
\hline
Bay. SSD - no DO - 0.2 conf - no NMS \; 10 & 0.226 & \textbf{809} & 0.229 & 0.224 \\
no dropout - 0.2 conf - NMS \; 10 & 0.363 & 1057 & 0.321 & 0.420 \\
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.354 & 1150 & 0.321 & 0.396 \\
0.9 keep ratio - 0.2 conf - NMS \; 10 & 0.355 & 1137 & 0.320 & 0.399 \\
0.5 keep ratio - 0.2 conf - NMS \; 10 & 0.322 & 1264 & 0.307 & 0.340 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
% entropy thresh: 0.7 for Bayesian - 6 is best, 1.5 for 7
@@ -838,14 +836,12 @@ vanilla SSD run with multiple forward passes.
With 809 open set errors, the Bayesian SSD variant with disabled dropout and
without non-maximum suppression offers the best performance with respect
to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best
precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio
variant on recall (0.321).
to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363), the best
precision (0.420) and the best recall (0.321) of all Bayesian variants.
Dropout decreases the performance of the network, this can be seen
in the lower \(F_1\) scores, higher open set errors, and lower precision and
recall values. However, all variants with multiple forward passes and
non-maximum suppression have lower open set errors than all vanilla SSD
recall values. However, all variants with multiple forward passes have lower open set errors than all vanilla SSD
variants.
The relation of \(F_1\) score to absolute open set error can be observed
@@ -864,19 +860,19 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
This subsection compares vanilla SSD
with Bayesian SSD with respect to specific images that illustrate
similarities and differences between both approaches. For this
comparison, 0.2 confidence threshold is applied. Furthermore, Bayesian
comparison, a 0.2 confidence threshold is applied. Furthermore, Bayesian
SSD uses non-maximum suppression and dropout with 0.9 keep ratio.
\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_vanilla}
\caption{Image with stop sign and truck at right edge. Ground truth in blue and predictions in red. Predictions are from vanilla SSD.}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from vanilla SSD.}
\label{fig:stop-sign-truck-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000336587_bboxes_bayesian}
\caption{Image with stop sign and truck at right edge. Ground truth in blue and predictions in red. Predictions are from Bayesian SSD with 0.9 keep ratio.}
\caption{Image with stop sign and truck at right edge. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.}
\label{fig:stop-sign-truck-bayesian}
\end{minipage}
\end{figure}
@@ -889,13 +885,13 @@ that overwhelmingly lie outside the image frame. Furthermore, the predictions ar
\begin{figure}
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_vanilla}
\caption{Image with a cat and laptop/TV. Ground truth in blue and predictions in red and rounded to three digits. Predictions are from vanilla SSD.}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from vanilla SSD.}
\label{fig:cat-laptop-vanilla}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{COCO_val2014_000000403817_bboxes_bayesian}
\caption{Image with a cat and laptop/TV. Ground truth in blue and predictions in red and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.}
\caption{Image with a cat and laptop/TV. Ground truth in blue, predictions in red, and rounded to three digits. Predictions are from Bayesian SSD with 0.9 keep ratio.}
\label{fig:cat-laptop-bayesian}
\end{minipage}
\end{figure}
@@ -917,7 +913,7 @@ The results clearly do not support the hypothesis: \textit{Dropout sampling deli
is no area where dropout sampling performs better than vanilla SSD. In the
remainder of the section the individual results will be interpreted.
\subsection*{Impact of averaging}
\subsection*{Impact of Averaging}
Micro and macro averaging create largely similar results. Notably, micro
averaging has a significant performance increase towards the end
@@ -945,7 +941,7 @@ threshold is not the largest threshold tested. A lower threshold likely
eliminated some false positives from the result set. On the other hand a
too low threshold likely eliminated true positives as well.
\subsection*{Non-maximum suppression and top \(k\)}
\subsection*{Non-Maximum Suppression and Top \(k\)}
Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
in their implementation of dropout sampling. Therefore, a variant with disabled
@@ -957,7 +953,7 @@ a lot more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected:
duplicate detections could stay and maxima boxes could be removed.
The number of observations was measured before and after the entropy threshold/NMS filter: both Bayesian SSD without
The number of observations was measured before and after the combination of entropy threshold and NMS filter: both Bayesian SSD without
NMS and dropout, and Bayesian SSD with NMS and disabled dropout
have the same number of observations everywhere before the entropy threshold. After the entropy threshold (the value 1.5 was used for both) and NMS, the variant with NMS has roughly 23\% of its observations left.
Without NMS 79\% of observations are left. Irrespective of the absolute
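The interaction of NMS and top \(k\) discussed in this hunk can be seen in a short sketch; a minimal illustration, with the function name assumed rather than taken from the implementation.

    import numpy as np

    def top_k(detections, scores, k=20):
        """Keep the k highest-scoring detections (vanilla SSD uses k=200,
        this thesis k=20). Without prior NMS, duplicate boxes can occupy
        slots and push out maxima of other objects, hurting recall."""
        order = np.argsort(scores)[::-1][:k]
        return [detections[i] for i in order]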
@@ -988,7 +984,8 @@ kept by top \(k\). However, persons are likely often on images
with many detections and/or have too low confidences.
In this example, the likelihood for true positives to be removed in
the person category is quite high. For dogs, the probability is far lower.
This goes back to micro and macro averaging and their impact on recall.
This is a good example for micro and macro averaging, and their impact on
recall.
\subsection*{Dropout Sampling and Observations}
@@ -1043,7 +1040,7 @@ questions that cannot be answered in this thesis. This thesis offers
one possible implementation of dropout sampling that technically works.
However, this thesis cannot answer why this implementation differs significantly
from Miller et al. The complete source code or otherwise exhaustive
implementation details would be required to attempt an answer.
implementation details of Miller et al. would be required to attempt an answer.
Future work could explore the performance of this implementation when used
on an SSD variant that was fine-tuned or trained with dropout. In this case, it