Second pass over thesis
- added more glossary terms
- added crucial information
- improved language

Signed-off-by: Jim Martens <github@2martens.de>
body.tex | 80
@@ -25,11 +25,11 @@ Is a correlation enough to bring forth negative consequences
 for a particular person? And if so, what is the possible defence
 against math? Similar questions can be raised when looking at computer
 vision networks that might be used together with so-called smart
-CCTV cameras to discover suspicious activity.
+\gls{CCTV} cameras to discover suspicious activity.
 
 This leads to the need for neural networks to explain their results.
 Such an explanation must come from the network or an attached piece
-of technology to allow adoption in mass. Obviously this setting
+of technology to allow adoption en masse. Obviously, this setting
 poses the question of how such an endeavour can be achieved.
 
 For neural networks there are fundamentally two types of tasks:
@@ -55,7 +55,7 @@ class of any given input. In this thesis, I will work with both.
 More specifically, I will look at object detection in open set
 conditions (see figure \ref{fig:open-set}).
 In non-technical words, this effectively describes
-the kind of situation you encounter with CCTV cameras or robots
+the kind of situation you encounter with \gls{CCTV} or robots
 outside of a laboratory. Both use cameras that record
 images. Subsequently, a neural network analyses the image
 and returns a list of detected and classified objects that it
@@ -63,8 +63,8 @@ found in the image. The problem here is that networks can only
 classify what they know. If presented with an object type that
 the network was not trained with, as happens frequently in real
 environments, it will still classify the object and might even
-have a high confidence in doing so. Such an example would be
-a false positive. Anyone who uses the results of
+have a high confidence in doing so. This is an example of a
+false positive. Anyone who uses the results of
 such a network could falsely assume that a high confidence always
 means the classification is very likely correct. If one uses
 a proprietary system one might not even be able to find out
@@ -73,7 +73,7 @@ Therefore, it would be impossible for one to identify the output
 of the network as a false positive.
 
 This reaffirms the need for automatic explanation. Such a system
-should by itself recognise that the given object is unknown and
+should recognise by itself that the given object is unknown and
 hence mark any classification result of the network as meaningless.
 Technically, there are two slightly different approaches that deal
 with this type of task: model uncertainty and novelty detection.
@@ -117,7 +117,7 @@ but perform poorly on challenging real world data sets
 like MS COCO~\cite{Lin2014}, complicating any potential comparison between
 them and object detection networks like \gls{SSD}.
 Therefore, a comparison between model uncertainty with a network like
-SSD and novelty detection with auto-encoders is considered out of scope
+\gls{SSD} and novelty detection with auto-encoders is considered out of scope
 for this thesis.
 
 Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO
@@ -177,7 +177,7 @@ distance-based, reconstruction-based, domain-based, and information-theoretic
 novelty detection. Based on their categorisation, this thesis falls under
 reconstruction-based novelty detection as it deals only with neural network
 approaches. Therefore, the other types of novelty detection will only be
-briefly introduced.
+introduced briefly.
 
 \subsection{Overview of Types of Novelty Detection}
 
@@ -215,19 +215,19 @@ a recent approach.
 
 Reconstruction-based approaches use the reconstruction error in one form
 or another to calculate the novelty score. This can be auto-encoders that
-literally reconstruct the input but it also includes MLP networks which try
+literally reconstruct the input, but it also includes \gls{MLP} networks which try
 to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
 between neural network-based approaches and subspace methods. The first are
-further differentiated between MLPs, Hopfield networks, autoassociative networks,
+further differentiated between MLPs, \glspl{Hopfield network}, autoassociative networks,
 radial basis function, and self-organising networks.
-The remainder of this section focuses on MLP-based works, a particular focus will
+The remainder of this section focuses on \gls{MLP}-based works; a particular focus will
 be on the task of object detection and Bayesian networks.
 
 Novelty detection for object detection is intricately linked with
 open set conditions: the test data can contain unknown classes.
 Bishop~\cite{Bishop1994} investigates the correlation between
 the degree of novel input data and the reliability of network
-outputs, and introduced a quantitative way to measure novelty.
+outputs, and introduces a quantitative way to measure novelty.
 
 The Bayesian approach provides a theoretical foundation for
 modelling uncertainty \cite{Ghahramani2015}.
@@ -235,7 +235,7 @@ MacKay~\cite{MacKay1992} provides a practical Bayesian
 framework for backpropagation networks. Neal~\cite{Neal1996} builds upon
 the work of MacKay and explores Bayesian learning for neural networks.
 However, these Bayesian neural networks do not scale well. Over the course
-of time, two major Bayesian approximations were introduced: one based
+of time, two major Bayesian approximations have been introduced: one based
 on dropout and one based on batch normalisation.
 
 Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
@@ -243,9 +243,9 @@ Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
 shows that dropout training actually corresponds to a general approximate
 Bayesian model. This means every network trained with dropout is an
 approximate Bayesian model. During inference the dropout remains active;
-this form of inference is called Monte Carlo Dropout (MCDO).
+this form of inference is called \gls{MCDO}.
 Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
-use MC dropout under open-set conditions for object detection.
+use \gls{MCDO} under open-set conditions for object detection.
 In a second paper \cite{Miller2018a}, Miller et al. continue their work and
 compare merging strategies for sampling-based uncertainty techniques in
 object detection.
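The sampling procedure behind \gls{MCDO} can be sketched in a few lines. The layer size, keep ratio, and number of forward passes below are illustrative assumptions, not the thesis configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # toy single-layer classifier: 4 inputs, 3 classes

def forward(x, keep_ratio=0.9):
    # Dropout stays active at inference: a fresh mask is drawn on every pass.
    mask = (rng.random(x.shape) < keep_ratio) / keep_ratio
    logits = (x * mask) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax class probabilities

x = np.array([1.0, 0.5, -0.3, 2.0])
samples = np.stack([forward(x) for _ in range(50)])  # 50 stochastic passes
mean_pred = samples.mean(axis=0)  # averaged prediction
spread = samples.var(axis=0)      # disagreement between passes = uncertainty
```

The mean over passes is the prediction; the variance across passes is the uncertainty estimate that a single deterministic pass cannot provide.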
@@ -256,7 +256,7 @@ introduce batch normalisation which has been adapted widely in the
 meantime. Teye et al.
 show how batch normalisation training is similar to dropout and can be
 viewed as an approximate Bayesian inference. Estimates of the model uncertainty
-can be gained with a technique named Monte Carlo Batch Normalisation (MCBN).
+can be gained with a technique named \gls{MCBN}.
 Consequently, this technique can be applied to any network that utilises
 standard batch normalisation.
 Li et al.~\cite{Li2019} investigate the problem of poor performance
@@ -266,21 +266,21 @@ does not change the variance. This inconsistency leads to a variance shift which
 can have a larger or smaller impact based on the network used.
 
 Non-Bayesian approaches have been developed as well. Usually, they compare with
-MC dropout and show better performance.
+\gls{MCDO} and show better performance.
 Postels et al.~\cite{Postels2019} provide a sampling-free approach for
 uncertainty estimation that does not affect training and approximates the
-sampling at test time. They compare it to MC dropout and find less computational
+sampling at test time. They compare it to \gls{MCDO} and find less computational
 overhead with better results.
 Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
 implement a predictive uncertainty estimation using deep ensembles.
-Compared to MC dropout, it shows better results.
+Compared to \gls{MCDO}, it shows better results.
 Geifman et al.~\cite{Geifman2018}
 introduce an uncertainty estimation algorithm for non-Bayesian deep
 neural classification that estimates the uncertainty of highly
 confident points using earlier snapshots of the trained model and improves,
 among others, the approach introduced by Lakshminarayanan et al.
 Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
-a Dirichlet distribution is placed over the class probabilities. Consequently,
+a \gls{Dirichlet distribution} is placed over the class probabilities. Consequently,
 the predictions of a neural network are treated as subjective opinions.
 
 In addition to the aforementioned Bayesian and non-Bayesian works,
@@ -492,18 +492,18 @@ The raw output of \gls{SSD} is not very useful: it contains thousands of
 boxes per image. Among them are many boxes with very low confidences
 or background classifications; those need to be filtered out to
 get any meaningful output of the network. The process of
-filtering is called decoding and presented for the three variants
-of \gls{SSD} used in the thesis.
+filtering is called decoding and presented for the three structural
+variants of \gls{SSD} used in the thesis.
 
 \subsection{Vanilla SSD}
 
-Liu et al.~\cite{Liu2016} used \gls{Caffe} for their original \gls{SSD}
+Liu et al.~\cite{Liu2016} use \gls{Caffe} for their original \gls{SSD}
 implementation. The decoding process largely consists of two
 phases: decoding and filtering. Decoding transforms the relative
-coordinates predicted by \gls{SSD} into absolute coordinates. At this point
-the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
+coordinates predicted by \gls{SSD} into absolute coordinates. Before decoding, the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
 the four bounding box offsets, the four anchor box coordinates, and
-the four variances; there are 8732 boxes.
+the four variances; there are 8732 boxes. After decoding, of the twelve
+elements only four remain: the absolute coordinates of the bounding box.
 
 \glslocalreset{NMS}
 Filtering of these boxes is first done per class:
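The decoding step described in this hunk can be sketched as follows. The (cx, cy, w, h) anchor parameterisation and the way the variances scale the offsets follow the common SSD convention; the variable names are my own, not the thesis code:

```python
import numpy as np

def decode_boxes(offsets, anchors, variances):
    """Turn relative SSD predictions into absolute corner coordinates.

    offsets, anchors, variances: arrays of shape (nr_boxes, 4);
    anchors and offsets use the (cx, cy, w, h) parameterisation.
    """
    cx = offsets[:, 0] * variances[:, 0] * anchors[:, 2] + anchors[:, 0]
    cy = offsets[:, 1] * variances[:, 1] * anchors[:, 3] + anchors[:, 1]
    w = anchors[:, 2] * np.exp(offsets[:, 2] * variances[:, 2])
    h = anchors[:, 3] * np.exp(offsets[:, 3] * variances[:, 3])
    # Of the twelve extra elements, only four values remain per box:
    # xmin, ymin, xmax, ymax.
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
```

With zero offsets the decoded box is exactly the anchor box converted to corner form, which is a handy sanity check.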
@@ -600,7 +600,7 @@ set up, and what the results are.
 \section{Data Sets}
 
 This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
-80 classes, from airplanes to toothbrushes many classes are present.
+80 classes; their range is illustrated by two classes: airplanes and toothbrushes.
 The images are taken by camera from the real world; ground truth
 is provided for all images. The data set supports object detection,
 keypoint detection, and panoptic segmentation (scene segmentation).
@@ -636,13 +636,13 @@ For this thesis, weights pre-trained on the sub data set trainval35k of the
 COCO data set are used. These weights have been created with closed set
 conditions in mind; therefore, they have been sub-sampled to create
 an open set condition. To this end, the weights for the last
-20 classes have been thrown away, making them effectively unknown.
+20 classes have been thrown away, making these classes effectively unknown.
 
 All images of the minival2014 data set are used but only ground truth
 belonging to the first 60 classes is loaded. The remaining 20
 classes are considered ``unknown'' and no ground truth bounding
-boxes for them is provided during the inference phase.
-A total of 31,991 detections remains after this exclusion. Of these
+boxes for them are provided during the inference phase.
+A total of 31,991 detections remain after this exclusion. Of these
 detections, a staggering 10,988 or 34,3\% belong to the persons
 class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5,6\%,
 and bottles with 1,021 or 3,2\%. Together, these four classes make up
@@ -721,7 +721,7 @@ in the next chapter.
 and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy}
 threshold.
 Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
-best for 1.4 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
+best for 1.4 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed
 best for 1.3 as threshold.}
\label{tab:results-micro}
 \end{table}
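The \gls{entropy} thresholds in the caption refer to the Shannon entropy of a detection's class probability vector; a minimal sketch (the probability vector below is made up):

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy (natural log) of a class probability vector;
    # eps guards against log(0).
    return float(-(p * np.log(p + eps)).sum())

# A detection is kept only if its entropy stays at or below the threshold:
probs = np.array([0.7, 0.2, 0.1])  # confident detection, low entropy
keep = entropy(probs) <= 1.4
```

A uniform distribution over \(n\) classes has the maximum entropy \(\ln n\); confident detections sit far below that, so the threshold separates confident from uncertain outputs.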
@@ -807,7 +807,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 \gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5,
 and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy}
 threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
-best for 1.7 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
+best for 1.7 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed
 best for 2.0 as threshold.}
\label{tab:results-macro}
 \end{table}
@@ -840,8 +840,8 @@ is lower but in an insignificant way. The rest of the performance metrics are
 almost identical after rounding.
 
 The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) to 0.226
-(without NMS). Dropout was disabled in both cases, making them effectively a
-\gls{vanilla} \gls{SSD} run with multiple forward passes.
+(without NMS). Dropout was disabled in both cases, making them effectively
+\gls{vanilla} \gls{SSD} with multiple forward passes.
 
 With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and
 without \gls{NMS} offers the best performance with respect
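The maximum \(F_1\) scores quoted in this hunk combine precision and recall in the usual way; a quick sketch with made-up values:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; zero when either is zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, \(F_1\) is dragged down by whichever of precision or recall is worse, which is why the NMS-free variant's extra false positives cost it so much here.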
@@ -1133,7 +1133,7 @@ threshold indicates a worse performance.
 
 \subsection*{Non-Maximum Suppression and Top \(k\)}
 
-\begin{table}[htbp]
+\begin{table}[tbp]
 \centering
 \begin{tabular}{rccc}
 \hline
@@ -1166,10 +1166,10 @@ The number of observations has been measured before and after the combination of
 NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
 have the same number of observations everywhere before the \gls{entropy} threshold. After the \gls{entropy} threshold (the value 1.5 has been used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
 (see table \ref{tab:effect-nms} for absolute numbers).
-Without \gls{NMS} 79\% of observations are left. Irrespective of the absolute
+Without \gls{NMS}, 79\% of observations are left. Moreover, many classes have more observations after the entropy threshold and per-class confidence threshold than before, which is expected, since the background observations make up around 70\% of the initial observations and only 21\% of the initial observations are removed. Irrespective of the absolute
 number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:
 more than 50\% of the original observations are removed with \gls{NMS} and
-stayed without---all of these are very likely to be false positives.
+stay without---all of these are very likely to be false positives.
 
 A clear distinction between micro and macro averaging can be observed:
 recall is hardly affected with micro averaging (0.300) but goes down equally with macro averaging (0.229). For micro averaging, it does
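The greedy per-class \gls{NMS} whose effect is measured in this hunk can be sketched as follows; the IoU threshold of 0.45 is a common SSD default and an assumption here, not a value from the thesis:

```python
import numpy as np

def iou(a, b):
    # Intersection over union of two (xmin, ymin, xmax, ymax) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    # Keep the highest-scoring box, drop every remaining box that overlaps
    # it too strongly, then repeat with what is left.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Every suppressed box is a near-duplicate of a stronger detection, which is why skipping this step leaves so many likely false positives in the output.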
@@ -1185,8 +1185,8 @@ The impact of top \(k\) has been measured by counting the number of observations
 after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\%
 of the observations left after NMS; without \gls{NMS} only about 59\% of observations
 are kept. This shows a significant impact on the result by top \(k\)
-in the case of disabled \gls{NMS}. Furthermore, some
-classes are hit harder by top \(k\) then others: for example,
+in the case of disabled \gls{NMS}. Furthermore, with disabled \gls{NMS}
+some classes are hit harder by top \(k\) than others: for example,
 dogs keep around 82\% of the observations but persons only 57\%.
 This indicates that detected dogs are mostly on images with few detections
 overall and/or have a high enough prediction confidence to be
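The micro/macro distinction drawn in the surrounding hunks can be made concrete; the per-class counts below are invented purely for illustration:

```python
import numpy as np

# True positives and false negatives per class (illustrative counts:
# one dominant class, two small ones).
tp = np.array([90, 5, 5])
fn = np.array([10, 15, 15])

# Micro averaging pools all classes before computing recall,
# so the dominant class drives the result.
micro_recall = tp.sum() / (tp.sum() + fn.sum())

# Macro averaging computes recall per class first, then averages,
# giving every class equal weight.
macro_recall = (tp / (tp + fn)).mean()
```

With a class distribution as skewed as the person-heavy one in this thesis, the two averages can diverge sharply, which mirrors why recall drops more under macro than under micro averaging above.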
@@ -1200,7 +1200,7 @@ recall.
 
 \subsection*{Dropout Sampling and Observations}
 
-\begin{table}[htbp]
+\begin{table}[tbp]
 \centering
 \begin{tabular}{rccc}
 \hline