Second pass over thesis

- added more glossary terms
- added crucial information
- improved language

Signed-off-by: Jim Martens <github@2martens.de>
2019-09-30 13:44:38 +02:00
parent 7ec1cfce36
commit d1ead9e613
3 changed files with 105 additions and 40 deletions


@@ -25,11 +25,11 @@ Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
-CCTV cameras to discover suspicious activity.
+\gls{CCTV} cameras to discover suspicious activity.
This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
-of technology to allow adoption in mass. Obviously this setting
+of technology to allow mass adoption. Obviously, this setting
poses the question of how such an endeavour can be achieved.
For neural networks there are fundamentally two types of tasks:
@@ -55,7 +55,7 @@ class of any given input. In this thesis, I will work with both.
More specifically, I will look at object detection under open set
conditions (see figure \ref{fig:open-set}).
In non-technical words, this effectively describes
-the kind of situation you encounter with CCTV cameras or robots
+the kind of situation you encounter with \gls{CCTV} or robots
outside of a laboratory. Both use cameras that record
images. Subsequently, a neural network analyses the image
and returns a list of detected and classified objects that it
@@ -63,8 +63,8 @@ found in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained with, as happens frequently in real
environments, it will still classify the object and might even
-have a high confidence in doing so. Such an example would be
-a false positive. Anyone who uses the results of
+have a high confidence in doing so. This is an example of a
+false positive. Anyone who uses the results of
such a network could falsely assume that a high confidence always
means the classification is very likely correct. If one uses
a proprietary system one might not even be able to find out
@@ -73,7 +73,7 @@ Therefore, it would be impossible for one to identify the output
of the network as a false positive.
This reaffirms the need for automatic explanation. Such a system
-should by itself recognise that the given object is unknown and
+should recognise by itself that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically, there are two slightly different approaches that deal
with this type of task: model uncertainty and novelty detection.
@@ -117,7 +117,7 @@ but perform poorly on challenging real world data sets
like MS COCO~\cite{Lin2014}, complicating any potential comparison between
them and object detection networks like \gls{SSD}.
Therefore, a comparison between model uncertainty with a network like
-SSD and novelty detection with auto-encoders is considered out of scope
+\gls{SSD} and novelty detection with auto-encoders is considered out of scope
for this thesis.
Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO
@@ -177,7 +177,7 @@ distance-based, reconstruction-based, domain-based, and information-theoretic
novelty detection. Based on their categorisation, this thesis falls under
reconstruction-based novelty detection as it deals only with neural network
approaches. Therefore, the other types of novelty detection will only be
-briefly introduced.
+introduced briefly.
\subsection{Overview of the Types of Novelty Detection}
@@ -215,19 +215,19 @@ a recent approach.
Reconstruction-based approaches use the reconstruction error in one form
or another to calculate the novelty score. This can be auto-encoders that
-literally reconstruct the input but it also includes MLP networks which try
+literally reconstruct the input but it also includes \gls{MLP} networks which try
to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
between neural network-based approaches and subspace methods. The former are
-further differentiated between MLPs, Hopfield networks, autoassociative networks,
+further differentiated between MLPs, \glspl{Hopfield network}, autoassociative networks,
radial basis function networks, and self-organising networks.
-The remainder of this section focuses on MLP-based works, a particular focus will
+The remainder of this section focuses on \gls{MLP}-based works; a particular focus will
be on the task of object detection and Bayesian networks.
Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigates the correlation between
the degree of novel input data and the reliability of network
-outputs, and introduced a quantitative way to measure novelty.
+outputs, and introduces a quantitative way to measure novelty.
The Bayesian approach provides a theoretical foundation for
modelling uncertainty~\cite{Ghahramani2015}.
@@ -235,7 +235,7 @@ MacKay~\cite{MacKay1992} provides a practical Bayesian
framework for backpropagation networks. Neal~\cite{Neal1996} builds upon
the work of MacKay and explores Bayesian learning for neural networks.
However, these Bayesian neural networks do not scale well. Over the course
-of time, two major Bayesian approximations were introduced: one based
+of time, two major Bayesian approximations have been introduced: one based
on dropout and one based on batch normalisation.
Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
@@ -243,9 +243,9 @@ Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
shows that dropout training actually corresponds to a general approximate
Bayesian model. This means every network trained with dropout is an
approximate Bayesian model. During inference, the dropout remains active;
-this form of inference is called Monte Carlo Dropout (MCDO).
+this form of inference is called \gls{MCDO}.
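As a concrete illustration of \gls{MCDO}, a minimal PyTorch sketch follows; it is not the implementation used in this thesis, and the model, number of passes, and shapes are illustrative assumptions. Dropout layers stay stochastic at test time and the softmax outputs of several forward passes are averaged.

```python
import torch
import torch.nn as nn

def enable_dropout(model: nn.Module) -> None:
    # Re-activate only the dropout layers; the rest stays in eval mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 20) -> torch.Tensor:
    model.eval()
    enable_dropout(model)
    # Each pass samples a new dropout mask, approximating samples from
    # the posterior over the network weights.
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    return probs.mean(dim=0)  # averaged predictive distribution
```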
Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
-use MC dropout under open-set conditions for object detection.
+use \gls{MCDO} under open-set conditions for object detection.
In a second paper \cite{Miller2018a}, Miller et al. continue their work and
compare merging strategies for sampling-based uncertainty techniques in
object detection.
@@ -256,7 +256,7 @@ introduce batch normalisation, which has been adopted widely in the
meantime. Teye et al.
show how batch normalisation training is similar to dropout and can be
viewed as approximate Bayesian inference. Estimates of the model uncertainty
-can be gained with a technique named Monte Carlo Batch Normalisation (MCBN).
+can be gained with a technique named \gls{MCBN}.
Consequently, this technique can be applied to any network that utilises
standard batch normalisation.
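The following minimal PyTorch sketch shows one way \gls{MCBN} could be realised; model, loader, and pass count are illustrative assumptions, not the method's reference implementation. Batch normalisation layers are kept in training mode so that each forward pass uses the statistics of a random training mini-batch.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mcbn_predict(model: nn.Module, x: torch.Tensor, train_loader,
                 passes: int = 20) -> torch.Tensor:
    model.eval()
    # Only the batch normalisation layers go back to training mode, so they
    # use the statistics of the current batch instead of running averages.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
    loader = iter(train_loader)  # assumed to yield at least `passes` batches
    preds = []
    for _ in range(passes):
        train_x, _ = next(loader)
        # Forward the test input together with a random training mini-batch;
        # its batch statistics act as the stochastic noise source.
        joint = torch.cat([x, train_x], dim=0)
        preds.append(model(joint)[: x.size(0)].softmax(dim=-1))
    return torch.stack(preds).mean(dim=0)
```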
Li et al.~\cite{Li2019} investigate the problem of poor performance
@@ -266,21 +266,21 @@ does not change the variance. This inconsistency leads to a variance shift which
can have a larger or smaller impact based on the network used.
Non-Bayesian approaches have been developed as well. Usually, they compare with
-MC dropout and show better performance.
+\gls{MCDO} and show better performance.
Postels et al.~\cite{Postels2019} provide a sampling-free approach for
uncertainty estimation that does not affect training and approximates the
-sampling at test time. They compare it to MC dropout and find less computational
+sampling at test time. They compare it to \gls{MCDO} and find less computational
overhead with better results.
Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implement a predictive uncertainty estimation using deep ensembles.
-Compared to MC dropout, it shows better results.
+Compared to \gls{MCDO}, it shows better results.
Geifman et al.~\cite{Geifman2018}
introduce an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model and improves,
among others, the approach introduced by Lakshminarayanan et al.
Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
-a Dirichlet distribution is placed over the class probabilities. Consequently,
+a \gls{Dirichlet distribution} is placed over the class probabilities. Consequently,
the predictions of a neural network are treated as subjective opinions.
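To make the idea tangible, here is a minimal NumPy sketch of the prediction step in this formulation, assuming the network emits non-negative evidence per class (all names are illustrative):

```python
import numpy as np

def dirichlet_prediction(evidence: np.ndarray):
    """evidence: shape (#classes,), non-negative outputs of the network."""
    alpha = evidence + 1.0               # Dirichlet parameters
    strength = alpha.sum()               # Dirichlet strength S
    probs = alpha / strength             # expected class probabilities
    uncertainty = len(alpha) / strength  # u = K / S, high when evidence is low
    return probs, uncertainty
```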
In addition to the aforementioned Bayesian and non-Bayesian works,
@@ -492,18 +492,18 @@ The raw output of \gls{SSD} is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; these need to be filtered out to
get any meaningful output of the network. The process of
-filtering is called decoding and presented for the three variants
-of \gls{SSD} used in the thesis.
+filtering is called decoding and is presented for the three structural
+variants of \gls{SSD} used in this thesis.
\subsection{Vanilla SSD}
-Liu et al.~\cite{Liu2016} used \gls{Caffe} for their original \gls{SSD}
+Liu et al.~\cite{Liu2016} use \gls{Caffe} for their original \gls{SSD}
implementation. The decoding process consists largely of two
phases: decoding and filtering. Decoding transforms the relative
-coordinates predicted by \gls{SSD} into absolute coordinates. At this point
-the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
+coordinates predicted by \gls{SSD} into absolute coordinates. Before decoding, the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
-the four variances; there are 8732 boxes.
+the four variances; there are 8732 boxes. After decoding, of the twelve
+elements only four remain: the absolute coordinates of the bounding box.
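For illustration, a NumPy sketch of this decoding step follows, assuming the common \gls{SSD} convention of \((cx, cy, w, h)\) offsets relative to the anchor boxes and scaled by the variances; the array names are illustrative:

```python
import numpy as np

def decode_boxes(offsets, anchors, variances):
    """All arguments have shape (#nr_boxes, 4) with boxes as (cx, cy, w, h)."""
    cx = anchors[:, 0] + offsets[:, 0] * variances[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + offsets[:, 1] * variances[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2] * variances[:, 2])
    h = anchors[:, 3] * np.exp(offsets[:, 3] * variances[:, 3])
    # Only the four absolute corner coordinates remain after decoding.
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
```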
\glslocalreset{NMS}
Filtering of these boxes is first done per class:
@@ -600,7 +600,7 @@ set up, and what the results are.
\section{Data Sets}
This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
-80 classes, from airplanes to toothbrushes many classes are present.
+80 classes, ranging from airplanes to toothbrushes.
The images are taken by camera from the real world; ground truth
is provided for all images. The data set supports object detection,
keypoint detection, and panoptic segmentation (scene segmentation).
@@ -636,13 +636,13 @@ For this thesis, weights pre-trained on the sub data set trainval35k of the
COCO data set are used. These weights have been created with closed set
conditions in mind; therefore, they have been sub-sampled to create
an open set condition. To this end, the weights for the last
-20 classes have been thrown away, making them effectively unknown.
+20 classes have been thrown away, making these classes effectively unknown.
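A hypothetical NumPy sketch of this sub-sampling for a single confidence head follows; the class-last weight layout, background as class 0, and four boxes per location are assumptions for illustration, not the actual weight format:

```python
import numpy as np

def subsample_classes(kernel, n_boxes=4, n_total=81, n_keep=61):
    """kernel: (..., n_boxes * n_total) weights of one confidence predictor.
    Keeps background plus the first 60 classes and drops the last 20."""
    shaped = kernel.reshape(kernel.shape[:-1] + (n_boxes, n_total))
    kept = shaped[..., :n_keep]
    return kept.reshape(kernel.shape[:-1] + (n_boxes * n_keep,))
```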
All images of the minival2014 data set are used but only ground truth
belonging to the first 60 classes is loaded. The remaining 20
classes are considered ``unknown'' and no ground truth bounding
-boxes for them is provided during the inference phase.
-A total of 31,991 detections remains after this exclusion. Of these
+boxes for them are provided during the inference phase.
+A total of 31,991 detections remain after this exclusion. Of these
detections, a staggering 10,988 or 34.3\% belong to the persons
class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%,
and bottles with 1,021 or 3.2\%. Together, these four classes make up
@@ -721,7 +721,7 @@ in the next chapter.
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy}
threshold.
Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
-best for 1.4 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
+best for 1.4 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed
best for 1.3 as threshold.}
\label{tab:results-micro}
\end{table}
@@ -807,7 +807,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
\gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5,
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy}
threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
-best for 1.7 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
+best for 1.7 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed
best for 2.0 as threshold.}
\label{tab:results-macro}
\end{table}
@@ -840,8 +840,8 @@ is lower but in an insignificant way. The rest of the performance metrics are
almost identical after rounding.
The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: a maximum \(F_1\) score of 0.363 (with NMS) versus 0.226
-(without NMS). Dropout was disabled in both cases, making them effectively a
-\gls{vanilla} \gls{SSD} run with multiple forward passes.
+(without NMS). Dropout was disabled in both cases, making them effectively
+\gls{vanilla} \gls{SSD} with multiple forward passes.
With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and
without \gls{NMS} offers the best performance with respect
@@ -1133,7 +1133,7 @@ threshold indicates a worse performance.
\subsection*{Non-Maximum Suppression and Top \(k\)}
-\begin{table}[htbp]
+\begin{table}[tbp]
\centering
\begin{tabular}{rccc}
\hline
@@ -1166,10 +1166,10 @@ The number of observations has been measured before and after the combination of
NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
have the same number of observations everywhere before the \gls{entropy} threshold. After the \gls{entropy} threshold (the value 1.5 has been used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
(see table \ref{tab:effect-nms} for absolute numbers).
-Without \gls{NMS} 79\% of observations are left. Irrespective of the absolute
+Without \gls{NMS}, 79\% of observations are left. Moreover, many classes have more observations after the entropy threshold and per-class confidence threshold than before, which is expected since the background observations make up around 70\% of the initial observations and only 21\% of the initial observations are removed. Irrespective of the absolute
number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:
more than 50\% of the original observations are removed with \gls{NMS} and
-stayed without---all of these are very likely to be false positives.
+stay without it---all of these are very likely to be false positives.
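As a sketch of the \gls{entropy} threshold step discussed here, assuming each observation carries an averaged softmax distribution over the classes (the default threshold mirrors the 1.5 used above):

```python
import numpy as np

def entropy_filter(class_probs, threshold=1.5):
    """class_probs: shape (#observations, #classes)."""
    clipped = np.clip(class_probs, 1e-12, 1.0)
    entropy = -np.sum(clipped * np.log(clipped), axis=-1)
    # Keep only observations below the threshold; high entropy marks
    # uncertain detections, for example objects of unknown classes.
    return class_probs[entropy < threshold]
```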
A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but goes down equally with macro averaging (0.229). For micro averaging, it does
@@ -1185,8 +1185,8 @@ The impact of top \(k\) has been measured by counting the number of observations
after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\%
of the observations left after NMS; without \gls{NMS}, only about 59\% of observations
are kept. This shows a significant impact on the result by top \(k\)
-in the case of disabled \gls{NMS}. Furthermore, some
-classes are hit harder by top \(k\) then others: for example,
+in the case of disabled \gls{NMS}. Furthermore, with disabled \gls{NMS}
+some classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of the observations but persons only 57\%.
This indicates that detected dogs are mostly on images with few detections
overall and/or have a high enough prediction confidence to be
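A minimal NumPy sketch of the top \(k\) step discussed above; the per-image arrays and \(k = 200\) (the usual \gls{SSD} default) are assumptions for illustration:

```python
import numpy as np

def top_k(confidences, boxes, k=200):
    """Keep at most the k most confident observations of one image."""
    if len(confidences) <= k:
        return confidences, boxes
    order = np.argsort(confidences)[::-1][:k]  # descending by confidence
    return confidences[order], boxes[order]
```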
@@ -1200,7 +1200,7 @@ recall.
\subsection*{Dropout Sampling and Observations}
-\begin{table}[htbp]
+\begin{table}[tbp]
\centering
\begin{tabular}{rccc}
\hline