Second pass over thesis

- added more glossary terms
- added crucial information
- improved language

Signed-off-by: Jim Martens <github@2martens.de>
2019-09-30 13:44:38 +02:00
parent 7ec1cfce36
commit d1ead9e613
3 changed files with 105 additions and 40 deletions


@@ -25,11 +25,11 @@ Is a correlation enough to bring forth negative consequences
for a particular person? And if so, what is the possible defence
against math? Similar questions can be raised when looking at computer
vision networks that might be used together with so-called smart
-CCTV cameras to discover suspicious activity.
+\gls{CCTV} cameras to discover suspicious activity.
This leads to the need for neural networks to explain their results.
Such an explanation must come from the network or an attached piece
-of technology to allow adoption in mass. Obviously this setting
+of technology to allow mass adoption. Obviously, this setting
poses the question of how such an endeavour can be achieved.
For neural networks there are fundamentally two types of tasks:
@@ -55,7 +55,7 @@ class of any given input. In this thesis, I will work with both.
More specifically, I will look at object detection under open set
conditions (see figure \ref{fig:open-set}).
In non-technical words, this effectively describes
-the kind of situation you encounter with CCTV cameras or robots
+the kind of situation you encounter with \gls{CCTV} or robots
outside of a laboratory. Both use cameras that record
images. Subsequently, a neural network analyses the image
and returns a list of detected and classified objects that it
@@ -63,8 +63,8 @@ found in the image. The problem here is that networks can only
classify what they know. If presented with an object type that
the network was not trained with, as happens frequently in real
environments, it will still classify the object and might even
-have a high confidence in doing so. Such an example would be
-a false positive. Anyone who uses the results of
+have a high confidence in doing so. This is an example of a
+false positive. Anyone who uses the results of
such a network could falsely assume that a high confidence always
means the classification is very likely correct. If one uses
a proprietary system one might not even be able to find out
@@ -73,7 +73,7 @@ Therefore, it would be impossible for one to identify the output
of the network as a false positive.
This reaffirms the need for automatic explanation. Such a system
-should by itself recognise that the given object is unknown and
+should recognise by itself that the given object is unknown and
hence mark any classification result of the network as meaningless.
Technically, there are two slightly different approaches that deal
with this type of task: model uncertainty and novelty detection.
@@ -117,7 +117,7 @@ but perform poorly on challenging real world data sets
like MS COCO~\cite{Lin2014}, complicating any potential comparison between
them and object detection networks like \gls{SSD}.
Therefore, a comparison between model uncertainty with a network like
-SSD and novelty detection with auto-encoders is considered out of scope
+\gls{SSD} and novelty detection with auto-encoders is considered out of scope
for this thesis.
Miller et al.~\cite{Miller2018} use an \gls{SSD} pre-trained on COCO
@@ -177,7 +177,7 @@ distance-based, reconstruction-based, domain-based, and information-theoretic
novelty detection. Based on their categorisation, this thesis falls under
reconstruction-based novelty detection as it deals only with neural network
approaches. Therefore, the other types of novelty detection will only be
-briefly introduced.
+introduced briefly.
\subsection{Overview of the Types of Novelty Detection}
@@ -215,19 +215,19 @@ a recent approach.
Reconstruction-based approaches use the reconstruction error in one form
or another to calculate the novelty score. This can be auto-encoders that
-literally reconstruct the input but it also includes MLP networks which try
+literally reconstruct the input but it also includes \gls{MLP} networks which try
to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
between neural network-based approaches and subspace methods. The former are
-further differentiated between MLPs, Hopfield networks, autoassociative networks,
+further differentiated between MLPs, \glspl{Hopfield network}, autoassociative networks,
radial basis function networks, and self-organising networks.
-The remainder of this section focuses on MLP-based works, a particular focus will
+The remainder of this section focuses on \gls{MLP}-based works; a particular focus will
be on the task of object detection and Bayesian networks.
Novelty detection for object detection is intricately linked with
open set conditions: the test data can contain unknown classes.
Bishop~\cite{Bishop1994} investigates the correlation between
the degree of novel input data and the reliability of network
-outputs, and introduced a quantitative way to measure novelty.
+outputs, and introduces a quantitative way to measure novelty.
The Bayesian approach provides a theoretical foundation for
modelling uncertainty~\cite{Ghahramani2015}.
@@ -235,7 +235,7 @@ MacKay~\cite{MacKay1992} provides a practical Bayesian
framework for backpropagation networks. Neal~\cite{Neal1996} builds upon
the work of MacKay and explores Bayesian learning for neural networks.
However, these Bayesian neural networks do not scale well. Over the course
-of time, two major Bayesian approximations were introduced: one based
+of time, two major Bayesian approximations have been introduced: one based
on dropout and one based on batch normalisation.
Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
@@ -243,9 +243,9 @@ Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
shows that dropout training actually corresponds to a general approximate
Bayesian model. This means every network trained with dropout is an
approximate Bayesian model. During inference, the dropout remains active;
-this form of inference is called Monte Carlo Dropout (MCDO).
+this form of inference is called \gls{MCDO}.
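As a concrete illustration of \gls{MCDO}, a minimal PyTorch sketch follows; it is not the implementation used in this thesis, and the model, number of passes, and shapes are illustrative assumptions. Dropout layers stay stochastic at test time and the softmax outputs of several forward passes are averaged.

```python
import torch
import torch.nn as nn

def enable_dropout(model: nn.Module) -> None:
    # Re-activate only the dropout layers; the rest stays in eval mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 20) -> torch.Tensor:
    model.eval()
    enable_dropout(model)
    # Each pass samples a new dropout mask, approximating samples from
    # the posterior over the network weights.
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    return probs.mean(dim=0)  # averaged predictive distribution
```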
Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
-use MC dropout under open-set conditions for object detection.
+use \gls{MCDO} under open-set conditions for object detection.
In a second paper \cite{Miller2018a}, Miller et al. continue their work and
compare merging strategies for sampling-based uncertainty techniques in
object detection.
@@ -256,7 +256,7 @@ introduce batch normalisation, which has been adopted widely in the
meantime. Teye et al.
show how batch normalisation training is similar to dropout and can be
viewed as approximate Bayesian inference. Estimates of the model uncertainty
-can be gained with a technique named Monte Carlo Batch Normalisation (MCBN).
+can be gained with a technique named \gls{MCBN}.
Consequently, this technique can be applied to any network that utilises
standard batch normalisation.
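The following minimal PyTorch sketch shows one way \gls{MCBN} could be realised; model, loader, and pass count are illustrative assumptions, not the method's reference implementation. Batch normalisation layers are kept in training mode so that each forward pass uses the statistics of a random training mini-batch.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mcbn_predict(model: nn.Module, x: torch.Tensor, train_loader,
                 passes: int = 20) -> torch.Tensor:
    model.eval()
    # Only the batch normalisation layers go back to training mode, so they
    # use the statistics of the current batch instead of running averages.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
    loader = iter(train_loader)  # assumed to yield at least `passes` batches
    preds = []
    for _ in range(passes):
        train_x, _ = next(loader)
        # Forward the test input together with a random training mini-batch;
        # its batch statistics act as the stochastic noise source.
        joint = torch.cat([x, train_x], dim=0)
        preds.append(model(joint)[: x.size(0)].softmax(dim=-1))
    return torch.stack(preds).mean(dim=0)
```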
Li et al.~\cite{Li2019} investigate the problem of poor performance
@@ -266,21 +266,21 @@ does not change the variance. This inconsistency leads to a variance shift which
can have a larger or smaller impact based on the network used.
Non-Bayesian approaches have been developed as well. Usually, they compare with
-MC dropout and show better performance.
+\gls{MCDO} and show better performance.
Postels et al.~\cite{Postels2019} provide a sampling-free approach for
uncertainty estimation that does not affect training and approximates the
-sampling at test time. They compare it to MC dropout and find less computational
+sampling at test time. They compare it to \gls{MCDO} and find less computational
overhead with better results.
Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
implement a predictive uncertainty estimation using deep ensembles.
-Compared to MC dropout, it shows better results.
+Compared to \gls{MCDO}, it shows better results.
Geifman et al.~\cite{Geifman2018}
introduce an uncertainty estimation algorithm for non-Bayesian deep
neural classification that estimates the uncertainty of highly
confident points using earlier snapshots of the trained model and improves,
among others, the approach introduced by Lakshminarayanan et al.
Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
-a Dirichlet distribution is placed over the class probabilities. Consequently,
+a \gls{Dirichlet distribution} is placed over the class probabilities. Consequently,
the predictions of a neural network are treated as subjective opinions.
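To make the idea tangible, here is a minimal NumPy sketch of the prediction step in this formulation, assuming the network emits non-negative evidence per class (all names are illustrative):

```python
import numpy as np

def dirichlet_prediction(evidence: np.ndarray):
    """evidence: shape (#classes,), non-negative outputs of the network."""
    alpha = evidence + 1.0               # Dirichlet parameters
    strength = alpha.sum()               # Dirichlet strength S
    probs = alpha / strength             # expected class probabilities
    uncertainty = len(alpha) / strength  # u = K / S, high when evidence is low
    return probs, uncertainty
```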
In addition to the aforementioned Bayesian and non-Bayesian works,
@@ -492,18 +492,18 @@ The raw output of \gls{SSD} is not very useful: it contains thousands of
boxes per image. Among them are many boxes with very low confidences
or background classifications; these need to be filtered out to
get any meaningful output of the network. The process of
-filtering is called decoding and presented for the three variants
-of \gls{SSD} used in the thesis.
+filtering is called decoding and is presented for the three structural
+variants of \gls{SSD} used in this thesis.
\subsection{Vanilla SSD}
-Liu et al.~\cite{Liu2016} used \gls{Caffe} for their original \gls{SSD}
+Liu et al.~\cite{Liu2016} use \gls{Caffe} for their original \gls{SSD}
implementation. The decoding process consists largely of two
phases: decoding and filtering. Decoding transforms the relative
-coordinates predicted by \gls{SSD} into absolute coordinates. At this point
-the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
+coordinates predicted by \gls{SSD} into absolute coordinates. Before decoding, the shape of the output per batch is \((batch\_size, \#nr\_boxes, \#nr\_classes + 12)\). The last twelve elements are split into
the four bounding box offsets, the four anchor box coordinates, and
-the four variances; there are 8732 boxes.
+the four variances; there are 8732 boxes. After decoding, of the twelve
+elements only four remain: the absolute coordinates of the bounding box.
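For illustration, a NumPy sketch of this decoding step follows, assuming the common \gls{SSD} convention of \((cx, cy, w, h)\) offsets relative to the anchor boxes and scaled by the variances; the array names are illustrative:

```python
import numpy as np

def decode_boxes(offsets, anchors, variances):
    """All arguments have shape (#nr_boxes, 4) with boxes as (cx, cy, w, h)."""
    cx = anchors[:, 0] + offsets[:, 0] * variances[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + offsets[:, 1] * variances[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2] * variances[:, 2])
    h = anchors[:, 3] * np.exp(offsets[:, 3] * variances[:, 3])
    # Only the four absolute corner coordinates remain after decoding.
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
```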
\glslocalreset{NMS}
Filtering of these boxes is first done per class:
@@ -600,7 +600,7 @@ set up, and what the results are.
\section{Data Sets}
This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
-80 classes, from airplanes to toothbrushes many classes are present.
+80 classes, ranging from airplanes to toothbrushes.
The images are taken by camera from the real world; ground truth
is provided for all images. The data set supports object detection,
keypoint detection, and panoptic segmentation (scene segmentation).
@@ -636,13 +636,13 @@ For this thesis, weights pre-trained on the sub data set trainval35k of the
COCO data set are used. These weights have been created with closed set
conditions in mind; therefore, they have been sub-sampled to create
an open set condition. To this end, the weights for the last
-20 classes have been thrown away, making them effectively unknown.
+20 classes have been thrown away, making these classes effectively unknown.
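A hypothetical NumPy sketch of this sub-sampling for a single confidence head follows; the class-last weight layout, background as class 0, and four boxes per location are assumptions for illustration, not the actual weight format:

```python
import numpy as np

def subsample_classes(kernel, n_boxes=4, n_total=81, n_keep=61):
    """kernel: (..., n_boxes * n_total) weights of one confidence predictor.
    Keeps background plus the first 60 classes and drops the last 20."""
    shaped = kernel.reshape(kernel.shape[:-1] + (n_boxes, n_total))
    kept = shaped[..., :n_keep]
    return kept.reshape(kernel.shape[:-1] + (n_boxes * n_keep,))
```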
All images of the minival2014 data set are used but only ground truth
belonging to the first 60 classes is loaded. The remaining 20
classes are considered ``unknown'' and no ground truth bounding
-boxes for them is provided during the inference phase.
-A total of 31,991 detections remains after this exclusion. Of these
+boxes for them are provided during the inference phase.
+A total of 31,991 detections remain after this exclusion. Of these
detections, a staggering 10,988 or 34.3\% belong to the persons
class, followed by cars with 1,932 or 6\%, chairs with 1,791 or 5.6\%,
and bottles with 1,021 or 3.2\%. Together, these four classes make up
@@ -721,7 +721,7 @@ in the next chapter.
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.4 as \gls{entropy}
threshold.
Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
-best for 1.4 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
+best for 1.4 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed
best for 1.3 as threshold.}
\label{tab:results-micro}
\end{table}
@@ -807,7 +807,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
\gls{entropy} threshold of 1.7, Bayesian \gls{SSD} without \gls{NMS} performed best for 1.5,
and Bayesian \gls{SSD} with \gls{NMS} performed best for 1.5 as \gls{entropy}
threshold. Bayesian \gls{SSD} with dropout enabled and 0.9 keep ratio performed
-best for 1.7 as \gls{entropy} threshold, the run with 0.5 keep ratio performed
+best for 1.7 as \gls{entropy} threshold; the variant with 0.5 keep ratio performed
best for 2.0 as threshold.}
\label{tab:results-macro}
\end{table}
@@ -840,8 +840,8 @@ is lower but in an insignificant way. The rest of the performance metrics are
almost identical after rounding.
The results for Bayesian \gls{SSD} show a significant impact of \gls{NMS} or the lack thereof: a maximum \(F_1\) score of 0.363 (with NMS) versus 0.226
-(without NMS). Dropout was disabled in both cases, making them effectively a
-\gls{vanilla} \gls{SSD} run with multiple forward passes.
+(without NMS). Dropout was disabled in both cases, making them effectively
+\gls{vanilla} \gls{SSD} with multiple forward passes.
With an open set error of 809, the Bayesian \gls{SSD} variant with disabled dropout and
without \gls{NMS} offers the best performance with respect
@@ -1133,7 +1133,7 @@ threshold indicates a worse performance.
\subsection*{Non-Maximum Suppression and Top \(k\)}
-\begin{table}[htbp]
+\begin{table}[tbp]
\centering
\begin{tabular}{rccc}
\hline
@@ -1166,10 +1166,10 @@ The number of observations has been measured before and after the combination of
NMS and dropout, and Bayesian \gls{SSD} with \gls{NMS} and disabled dropout
have the same number of observations everywhere before the \gls{entropy} threshold. After the \gls{entropy} threshold (the value 1.5 has been used for both) and NMS, the variant with \gls{NMS} has roughly 23\% of its observations left
(see table \ref{tab:effect-nms} for absolute numbers).
-Without \gls{NMS} 79\% of observations are left. Irrespective of the absolute
+Without \gls{NMS}, 79\% of observations are left. Moreover, many classes have more observations after the entropy threshold and per-class confidence threshold than before, which is expected since the background observations make up around 70\% of the initial observations and only 21\% of the initial observations are removed. Irrespective of the absolute
number, this discrepancy clearly shows the impact of \gls{NMS} and also explains a higher count of false positives:
more than 50\% of the original observations are removed with \gls{NMS} and
-stayed without---all of these are very likely to be false positives.
+stay without it---all of these are very likely to be false positives.
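As a sketch of the \gls{entropy} threshold step discussed here, assuming each observation carries an averaged softmax distribution over the classes (the default threshold mirrors the 1.5 used above):

```python
import numpy as np

def entropy_filter(class_probs, threshold=1.5):
    """class_probs: shape (#observations, #classes)."""
    clipped = np.clip(class_probs, 1e-12, 1.0)
    entropy = -np.sum(clipped * np.log(clipped), axis=-1)
    # Keep only observations below the threshold; high entropy marks
    # uncertain detections, for example objects of unknown classes.
    return class_probs[entropy < threshold]
```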
A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but goes down equally with macro averaging (0.229). For micro averaging, it does
@@ -1185,8 +1185,8 @@ The impact of top \(k\) has been measured by counting the number of observations
after top \(k\) is applied: the variant with \gls{NMS} keeps about 94\%
of the observations left after NMS; without \gls{NMS}, only about 59\% of observations
are kept. This shows a significant impact on the result by top \(k\)
-in the case of disabled \gls{NMS}. Furthermore, some
-classes are hit harder by top \(k\) then others: for example,
+in the case of disabled \gls{NMS}. Furthermore, with disabled \gls{NMS}
+some classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of the observations but persons only 57\%.
This indicates that detected dogs are mostly on images with few detections
overall and/or have a high enough prediction confidence to be
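A minimal NumPy sketch of the top \(k\) step discussed above; the per-image arrays and \(k = 200\) (the usual \gls{SSD} default) are assumptions for illustration:

```python
import numpy as np

def top_k(confidences, boxes, k=200):
    """Keep at most the k most confident observations of one image."""
    if len(confidences) <= k:
        return confidences, boxes
    order = np.argsort(confidences)[::-1][:k]  # descending by confidence
    return confidences[order], boxes[order]
```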
@@ -1200,7 +1200,7 @@ recall.
\subsection*{Dropout Sampling and Observations}
-\begin{table}[htbp]
+\begin{table}[tbp]
\centering
\begin{tabular}{rccc}
\hline