Improved thesis based upon feedback

Signed-off-by: Jim Martens <github@2martens.de>

parent bca14cd8b4
commit dc976932f8

@@ -1,12 +1,12 @@
 \clearpage
 \section*{Acknowledgement}
 
-I would like to thank for the continued support, suggestions, and advise
-from my super-visor Prof. Dr. Simone Frintrop and co-supervisor Dr.
+I would like to thank for the continued support, suggestions, and advice
+from my supervisor Prof. Dr. Simone Frintrop and co-supervisor Dr.
 Mikko Lauri.
 
-Additionally, I would like to thank my friends and family for the continued
-support and sometimes helpful questions. Especially in some hard times
+Additionally, I would like to thank my friends and family for their continued
+support and helpful questions. Especially during some hard times
 their support was invaluable.
 
 Furthermore, I am grateful for the Fridays for Future movement
appendix.tex (22 changed lines)
@@ -1,23 +1,23 @@
 \chapter{Software and Source Code Design}
 
 The source code of many published papers is either not available
-or seems like an afterthought: it is poorly documented, difficult
+or is of bad quality: it is poorly documented, difficult
 to integrate into your own work, and often does not follow common
 software development best practices. Moreover, with Tensorflow,
 PyTorch, and Caffe there are at least three machine learning
-frameworks. Every research team seems to prefer another framework
-and sometimes even develops their own; this makes it difficult
+frameworks. Every research team seems to prefer another framework,
+and, occasionally, even develops their own; this makes it difficult
 to combine the work of different authors.
-In addition to all this, most papers do not contain proper information
-regarding the implementation details, making it difficult to
-accurately replicate them if their source code is not available.
+In addition to this, most papers do not contain proper information
+regarding implementation details, making it difficult to
+accurately replicate their results, if their source code is not available.
 
-Therefore, it was clear to me: I will release my source code and
-make it available as Python package on the PyPi package index.
+Therefore, I will release my source code and
+make it available as a Python package on the PyPi package index.
 This makes it possible for other researchers to simply install
 a package and use the API to interact with my code. Additionally,
-the code has been designed to be future proof and work with
-the announced Tensorflow 2.0 by supporting eager mode.
+the code has been designed to be future proof, and work with
+the announced Tensorflow 2.0, by supporting eager mode.
 
 Furthermore, it is configurable, well documented, and conforms largely
 to the clean code guidelines: evolvability and extendability among
@@ -38,7 +38,7 @@ can be found in plotting.py, and the ssd.py module contains
 code to train the SSD and later predict with it.
 
 Lastly, the SSD implementation from a third party repository
-has been modified to work inside a Python package architecture and
+has been modified to work inside a Python package architecture, and
 with eager mode. It is stored as a Git submodule inside the package
 repository.
 
body.tex (159 changed lines)
@@ -21,7 +21,7 @@ black boxes and prevents any answers to questions of causality.
 
 However, these questions of causality are of enormous consequence when
 results of neural networks are used to make life changing decisions:
-Is a correlation enough to bring forth negative consequences
+is a correlation enough to bring forth negative consequences
 for a particular person? And if so, what is the possible defence
 against math? Similar questions can be raised when looking at computer
 vision networks that might be used together with so called smart
@@ -29,14 +29,14 @@ vision networks that might be used together with so called smart
 
 This leads to the need for neural networks to explain their results.
 Such an explanation must come from the network or an attached piece
-of technology to allow adoption in mass. Obviously, this setting
-poses the question, how such an endeavour can be achieved.
+of technology to allow mass adoption. Obviously, this setting
+poses the question of how such an endeavour can be achieved.
 
-For neural networks there are fundamentally two types of tasks:
+For neural networks there are fundamentally two types of problems:
 regression and classification. Regression deals with any case
 where the goal for the network is to come close to an ideal
 function that connects all data points. Classification, however,
-describes tasks where the network is supposed to identify the
+describes problems where the network is supposed to identify the
 class of any given input. In this thesis, I will work with both.
 
 \subsection*{Object Detection in Open Set Conditions}
@@ -54,53 +54,51 @@ class of any given input. In this thesis, I will work with both.
 
 More specifically, I will look at object detection in the open set
 conditions (see figure \ref{fig:open-set}).
-In non-technical words this effectively describes
-the kind of situation you encounter with \gls{CCTV} or robots
-outside of a laboratory. Both use cameras that record
-images. Subsequently, a neural network analyses the image
-and returns a list of detected and classified objects that it
-found in the image. The problem here is that networks can only
+In non-technical terms this effectively describes
+the conditions \gls{CCTV} and robots outside of a laboratory operate in. In both cases images are recorded with cameras. In order to detect objects, a neural network has to analyse the images
+and return a list of detected and classified objects that it
+finds in the images. The problem here is that networks can only
 classify what they know. If presented with an object type that
 the network was not trained with, as happens frequently in real
 environments, it will still classify the object and might even
 have a high confidence in doing so. This is an example for a
 false positive. Anyone who uses the results of
-such a network could falsely assume that a high confidence always
+such a network could falsely assume that a high confidence
 means the classification is very likely correct. If one uses
 a proprietary system one might not even be able to find out
 that the network was never trained on a particular type of object.
 Therefore, it would be impossible for one to identify the output
-of the network as false positive.
+of the network as a false positive.
 
 This reaffirms the need for automatic explanation. Such a system
 should recognise by itself that the given object is unknown and
-hence mark any classification result of the network as meaningless.
+mark any classification result of the network as meaningless.
 Technically there are two slightly different approaches that deal
 with this type of task: model uncertainty and novelty detection.
 
 Model uncertainty can be measured, for example, with dropout sampling.
-Dropout layers are usually used only during training but
-Miller et al.~\cite{Miller2018} use them also during testing
-to achieve different results for the same image making use of
+Dropout layers are usually used only during training, but
+Miller et al.~\cite{Miller2018} also use them during testing
+to achieve different results for the same image---making use of
 multiple forward passes. The output scores for the forward passes
 of the same image are then averaged. If the averaged class
 probabilities resemble a uniform distribution (every class has
 the same probability) this symbolises maximum uncertainty. Conversely,
 if there is one very high probability with every other being very
-low this signifies a low uncertainty. An unknown object is more
-likely to cause high uncertainty which allows for an identification
+low, this signifies a low uncertainty. An unknown object is more
+likely to cause high uncertainty, which allows for an identification
 of false positive cases.
 
-Novelty detection is another approach to solve the task.
+Novelty detection is another approach to solve the problem.
 In the realm of neural networks it is usually done with the help of
-auto-encoders that solve a regression task of finding an
-identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have
-internally at least two components: an encoder, and a decoder or
+auto-encoders that try to solve a regression problem of finding an
+identity function that reconstructs the given input~\cite{Pimentel2014}. Auto-encoders have,
+internally, at least two components: an encoder, and a decoder or
 generator. The job of the encoder is to find an encoding that
-compresses the input as good as possible while simultaneously
+compresses the input as well as possible, while simultaneously
 being as loss-free as possible. The decoder takes this latent
-representation of the input and has to find a decompression
-that reconstructs the input as accurate as possible. During
+representation of the input, and has to find a decompression
+that reconstructs the input as accurately as possible. During
 training these auto-encoders learn to reproduce a certain group
 of object classes. The actual novelty detection takes place
 during testing: given an image, and the output and loss of the
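As an editor's illustration of the dropout-sampling procedure described in this hunk (multiple forward passes with dropout active, averaged softmax scores, uniform distribution meaning maximum uncertainty), a minimal sketch; `forward_pass` and the pass count are placeholders, not the thesis implementation:

```python
import numpy as np

def mc_dropout_scores(forward_pass, image, n_passes=10):
    """Average softmax scores over n stochastic forward passes.

    forward_pass: callable returning one softmax vector; stands in for
    a network evaluated with dropout kept active at test time.
    """
    scores = np.stack([forward_pass(image) for _ in range(n_passes)])
    return scores.mean(axis=0)

def predictive_entropy(mean_scores):
    """Entropy of the averaged class probabilities: maximal for a
    uniform distribution, low when one class dominates."""
    p = np.clip(mean_scores, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())
```

An unknown object would tend to produce a higher `predictive_entropy` over the averaged scores than a well-known one.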
@@ -148,25 +146,24 @@ conditions compared to object detection without it.
 
 \subsection*{Reader's Guide}
 
-First, chapter \ref{chap:background} presents related works and
+First, chapter \ref{chap:background} presents related works, and
 provides the background for dropout sampling.
-Afterwards, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
+Thereafter, chapter \ref{chap:methods} explains how \gls{vanilla} \gls{SSD} works, how
 Bayesian \gls{SSD} extends \gls{vanilla} \gls{SSD}, and how the decoding pipelines are
 structured.
 Chapter \ref{chap:experiments-results} presents the data sets,
 the experimental setup, and the results. This is followed by
 chapter \ref{chap:discussion}, focusing on the discussion and closing.
 
-Therefore, the contribution is found in chapters \ref{chap:methods},
+The contribution of this thesis is found in chapters \ref{chap:methods},
 \ref{chap:experiments-results}, and \ref{chap:discussion}.
 
 \chapter{Background}
 
 \label{chap:background}
 
-This chapter begins with an overview over previous works
-in the field of this thesis. Afterwards the theoretical foundations
-of dropout sampling are explained.
+This chapter begins with an overview of previous works, followed by an explanation of the theoretical
+foundations of dropout sampling.
 
 \section{Related Works}
 
@@ -176,7 +173,7 @@ methods published over the previous decade. They showcase probabilistic,
 distance-based, reconstruction-based, domain-based, and information-theoretic
 novelty detection. Based on their categorisation, this thesis falls under
 reconstruction-based novelty detection as it deals only with neural network
-approaches. Therefore, the other types of novelty detection will only be
+approaches. The other types of novelty detection will, therefore, only be
 introduced briefly.
 
 \subsection{Overview over types of Novelty Detection}
@@ -197,16 +194,16 @@ Both methods are similar to estimating the
 \gls{pdf} of data, they use well-defined distance metrics to compute the distance
 between two data points.
 
-Domain-based novelty detection describes the boundary of the known data, rather
-than the data itself. Unknown data is identified by its position relative to
-the boundary. A common implementation for this are support vector machines
-(e.g. implemented by Song et al. \cite{Song2002}).
+Domain-based novelty detection describes the boundary of the known data,
+rather than the data itself. Unknown data is identified by its position
+relative to the boundary. Support vector machines (e.g. implemented by
+Song et al. \cite{Song2002}) are a common implementation of this.
 
 Information-theoretic novelty detection computes the information content
 of a data set, for example, with metrics like \gls{entropy}. Such metrics assume
 that novel data inside the data set significantly alters the information
 content of an otherwise normal data set. First, the metrics are calculated over the
-whole data set. Afterwards, a subset is identified that causes the biggest
+whole data set. Second, a subset is identified that causes the biggest
 difference in the metric when removed from the data set. This subset is considered
 to consist of novel data. For example, Filippone and Sanguinetti \cite{Filippone2011} provide
 a recent approach.
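The information-theoretic scheme in this hunk (compute a metric such as entropy over the whole data set, then find the subset whose removal changes it most) can be sketched for the simplest case of single-element subsets; the function names and the label-entropy metric are illustrative assumptions, not the cited methods:

```python
import math
from collections import Counter

def dataset_entropy(labels):
    """Shannon entropy of the label distribution of a data set."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def most_novel_index(labels):
    """Index whose removal changes the data-set entropy the most:
    a one-element version of the subset search described above."""
    base = dataset_entropy(labels)
    diffs = [abs(base - dataset_entropy(labels[:i] + labels[i + 1:]))
             for i in range(len(labels))]
    return max(range(len(labels)), key=diffs.__getitem__)
```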
@@ -214,7 +211,7 @@ a recent approach.
 \subsection{Reconstruction-based Novelty Detection}
 
 Reconstruction-based approaches use the reconstruction error in one form
-or another to calculate the novelty score. This can be auto-encoders that
+or another to calculate the novelty score. These can be auto-encoders that
 literally reconstruct the input but it also includes \gls{MLP} networks which try
 to reconstruct the ground truth. Pimentel et al.~\cite{Pimentel2014} differentiate
 between neural network-based approaches and subspace methods. The first are
@@ -242,7 +239,7 @@ Gal and Ghahramani~\cite{Gal2016} show that dropout training is a
 Bayesian approximation of a Gaussian process. Subsequently, Gal~\cite{Gal2017}
 shows that dropout training actually corresponds to a general approximate
 Bayesian model. This means every network trained with dropout is an
-approximate Bayesian model. During inference the dropout remains active,
+approximate Bayesian model. During inference the dropout remains active:
 this form of inference is called \gls{MCDO}.
 Miller et al.~\cite{Miller2018} build upon the work of Gal and Ghahramani: they
 use \gls{MCDO} under open-set conditions for object detection.
@@ -261,14 +258,13 @@ Consequently, this technique can be applied to any network that utilises
 standard batch normalisation.
 Li et al.~\cite{Li2019} investigate the problem of poor performance
 when combining dropout and batch normalisation: dropout shifts the variance
-of a neural unit when switching from train to test, batch normalisation
+of a neural unit when switching from train to test; batch normalisation
 does not change the variance. This inconsistency leads to a variance shift which
 can have a larger or smaller impact based on the network used.
 
-Non-Bayesian approaches have been developed as well. Usually, they compare with
-\gls{MCDO} and show better performance.
+Non-Bayesian approaches have also been developed. Usually they are compared with \gls{MCDO} and show better performance.
 Postels et al.~\cite{Postels2019} provide a sampling-free approach for
-uncertainty estimation that does not affect training and approximates the
+uncertainty estimation that does not affect training, and approximates the
 sampling at test time. They compare it to \gls{MCDO} and find less computational
 overhead with better results.
 Lakshminarayanan et al.~\cite{Lakshminarayanan2017}
@@ -279,7 +275,7 @@ introduce an uncertainty estimation algorithm for non-Bayesian deep
 neural classification that estimates the uncertainty of highly
 confident points using earlier snapshots of the trained model and improves,
 among others, the approach introduced by Lakshminarayanan et al.
-Sensoy et al.~\cite{Sensoy2018} explicitely model prediction uncertainty:
+Sensoy et al.~\cite{Sensoy2018} explicitly model prediction uncertainty:
 a \gls{Dirichlet distribution} is placed over the class probabilities. Consequently,
 the predictions of a neural network are treated as subjective opinions.
 
@@ -348,21 +344,21 @@ training of the network determines a plausible set of weights by
 evaluating the probability output (\gls{posterior}) over the weights given
 the training data \(\mathbf{T}\): \(p(\mathbf{W}|\mathbf{T})\).
 However, this
-evaluation cannot be performed in any reasonable
+evaluation cannot be performed in any reasonable amount of
 time. Therefore approximation techniques are
 required. In those techniques the \gls{posterior} is fitted with a
 simple distribution \(q^{*}_{\theta}(\mathbf{W})\). The original
 and intractable problem of averaging over all weights in the network
-is replaced with an optimisation task, where the parameters of the
-simple distribution are optimised over~\cite{Kendall2017}.
+is replaced with an optimisation task: the parameters of the
+simple distribution are optimised~\cite{Kendall2017}.
 
 \subsubsection*{Dropout Variational Inference}
 
 Kendall and Gal~\cite{Kendall2017} show an approximation for
 classfication and recognition tasks. Dropout variational inference
 is a practical approximation technique by adding dropout layers
-in front of every weight layer and using them also during test
-time to sample from the approximate \gls{posterior}. Effectively, this
+in front of every weight layer and also using them during test
+time to sample from the approximate \gls{posterior}. In effect, this
 results in the approximation of the class probability
 \(p(y|\mathcal{I}, \mathbf{T})\) by performing \(n\) forward
 passes through the network and averaging the so obtained softmax
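The averaging over \(n\) forward passes described in this hunk can be written out as the approximate predictive distribution; the notation follows the hunk, while the sampled-weights symbol \(\widehat{\mathbf{W}}_i\) is an editorial addition for clarity:

```latex
p(y \mid \mathcal{I}, \mathbf{T}) \approx \frac{1}{n}
  \sum_{i=1}^{n} \operatorname{softmax}\!\left(f^{\widehat{\mathbf{W}}_i}(\mathcal{I})\right),
\qquad \widehat{\mathbf{W}}_i \sim q^{*}_{\theta}(\mathbf{W})
```

Each forward pass draws an effective weight configuration from the approximate posterior \(q^{*}_{\theta}(\mathbf{W})\) via the active dropout layers.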
@@ -479,7 +475,7 @@ and very low confidences in other classes.
 \subsection{Implementation Details}
 
 For this thesis, an \gls{SSD} implementation based on Tensorflow~\cite{Abadi2015} and
-Keras\footnote{\url{https://github.com/pierluigiferrari/ssd\_keras}}
+Keras~\cite{Chollet2015}
 is used. It has been modified to support \gls{entropy} thresholding,
 partitioning of observations, and dropout
 layers in the \gls{SSD} model. Entropy thresholding takes place before
@@ -517,7 +513,7 @@ confidence thresholding and a subsequent \gls{NMS}.
 All boxes that pass \gls{NMS} are added to a
 per image maxima list. One box could make the confidence threshold
 for multiple classes and, hence, be present multiple times in the
-maxima list for the image. Lastly, a total of \(k\) boxes with the
+maxima list for the image. In the end, a total of \(k\) boxes with the
 highest confidences is kept per image across all classes. The
 original implementation uses a confidence threshold of \(0.01\), an
 IOU threshold for \gls{NMS} of \(0.45\) and a top \(k\)
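The final top-\(k\) step in this hunk (keep the \(k\) highest-confidence boxes per image across all classes) can be sketched as follows; the function name is a placeholder and the sketch omits the preceding confidence thresholding and NMS:

```python
import numpy as np

def top_k_detections(confidences, k):
    """Keep the k highest-confidence detections of one image across
    all classes. `confidences` holds one score per surviving box."""
    if len(confidences) <= k:
        return np.arange(len(confidences))
    # indices of the k largest confidences, highest first
    return np.argsort(confidences)[::-1][:k]
```

In the decoding pipeline described above, this would run after confidence thresholding (0.01) and NMS (IOU 0.45).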
@@ -548,7 +544,7 @@ confidence threshold is required.
 
 \subsection{Vanilla SSD with Entropy Thresholding}
 
-Vanilla \gls{SSD} with \gls{entropy} tresholding adds an additional component
+Vanilla \gls{SSD} with \gls{entropy} thresholding adds an additional component
 to the filtering already done for \gls{vanilla} \gls{SSD}. The \gls{entropy} is
 calculated from all \(\#nr\_classes\) softmax scores in a prediction.
 Only predictions with a low enough \gls{entropy} pass the \gls{entropy}
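The entropy thresholding described in this hunk (entropy over all \(\#nr\_classes\) softmax scores; only low-entropy predictions pass) can be sketched as below; the function names are placeholders, not the thesis code:

```python
import numpy as np

def entropy(softmax_scores):
    """Shannon entropy of one prediction's softmax scores
    (all #nr_classes values, including the background class)."""
    p = np.clip(np.asarray(softmax_scores, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def entropy_filter(predictions, threshold):
    """Keep only predictions whose entropy is below the threshold;
    near-uniform (high-entropy) predictions are discarded."""
    return [p for p in predictions if entropy(p) < threshold]
```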
@@ -558,8 +554,8 @@ false positive or false negative cases with high confidence values.
 
 \subsection{Bayesian SSD with Entropy Thresholding}
 
-Bayesian \gls{SSD} has the speciality of multiple forward passes. Based
-on the information in the paper, the detections of all forward passes
+Bayesian \gls{SSD} uses multiple forward passes. Based
+on the information from Miller et al.~\cite{Miller2018}, the detections of all forward passes
 are grouped per image but not by forward pass. This leads
 to the following shape of the network output after all
 forward passes: \((batch\_size, \#nr\_boxes \, \cdot \, \#nr\_forward\_passes, \#nr\_classes + 12)\). The size of the output
@@ -576,7 +572,7 @@ mutual IOU score of every detection with all other detections. Detections
 with a mutual IOU score of 0.95 or higher are partitioned into an
 observation. Next, the softmax scores and bounding box coordinates of
 all detections in an observation are averaged.
-There can be a different number of observations for every image which
+There can be a different number of observations for every image, which
 destroys homogenity and prevents batch-wise calculation of the
 results. The shape of the results is per image: \((\#nr\_observations,\#nr\_classes + 4)\).
 
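The partitioning described in this hunk (detections with mutual IOU of 0.95 or higher grouped into an observation, then softmax scores and box coordinates averaged) can be sketched as follows; the greedy grouping strategy and the function names are assumptions, as the thesis does not specify the exact partitioning algorithm here:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def group_observations(boxes, scores, threshold=0.95):
    """Greedily partition detections whose mutual IOU is >= threshold
    into observations; average boxes and softmax scores per group."""
    remaining = list(range(len(boxes)))
    observations = []
    while remaining:
        group = [remaining.pop(0)]
        for i in remaining[:]:
            if all(iou(boxes[i], boxes[j]) >= threshold for j in group):
                group.append(i)
                remaining.remove(i)
        observations.append((boxes[group].mean(axis=0),
                             scores[group].mean(axis=0)))
    return observations
```

The per-image result then has one averaged box and score vector per observation, matching the \((\#nr\_observations, \#nr\_classes + 4)\) shape given above.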
@@ -598,14 +594,14 @@ at the end.
 
 \label{chap:experiments-results}
 
-This chapter explains the used data sets, how the experiments have been
-set up, and what the results are.
+This chapter explains the data sets used, and how the experiments have been
+set up. Furthermore, it presents the results.
 
 \section{Data Sets}
 
 This thesis uses the MS COCO~\cite{Lin2014} data set. It contains
 80 classes, their range is illustrated by two classes: airplanes and toothbrushes.
-The images are taken by camera from the real world, ground truth
+The images are real world images, ground truth
 is provided for all images. The data set supports object detection,
 keypoint detection, and panoptic segmentation (scene segmentation).
 
@@ -779,7 +775,7 @@ The relation of \(F_1\) score to absolute open set error can be observed
 in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-micro}. Both \gls{vanilla} \gls{SSD}
 variants with 0.01 confidence threshold reach a much higher open set error
-and a higher recall. This behaviour is expected as more and worse predictions
+and a higher recall. This behaviour is to be expected as more and worse predictions
 are included.
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
@@ -861,7 +857,7 @@ The relation of \(F_1\) score to absolute open set error can be observed
 in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-macro}. Both \gls{vanilla} \gls{SSD}
 variants with 0.01 confidence threshold reach a much higher open set error
-and a higher recall. This behaviour is expected as more and worse predictions
+and a higher recall. This behaviour is to be expected as more and worse predictions
 are included.
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
@@ -878,9 +874,9 @@ only 0.7\% of the ground truth. With this share, it is below
 the average of roughly 0.9\% for each of the 56 classes that make up the
 second half of the ground truth.
 
-In some cases, multiple variants have seemingly the same performance
-but only one or some of them are marked bold. This is informed by
-differences prior to rounding. If two or more variants are marked bold
+In some cases, multiple variants have apparently the same performance
+but only one or some of them are marked bold. This is caused by
+differences prior to rounding: if two or more variants are marked bold
 they had the exact same performance before rounding.
 
 \begin{table}[tbp]
@@ -909,11 +905,9 @@ they had the exact same performance before rounding.
 \end{table}
 
 The vanilla \gls{SSD} variant with 0.2 per class confidence threshold performs
-best in the persons class with a max \(F_1\) score of 0.460, as well as
-recall of 0.405 and precision of 0.533 at the max \(F_1\) score.
-It shares the first place in recall with the \gls{vanilla} \gls{SSD}
-variant using 0.01 confidence threshold. All Bayesian \gls{SSD} variants
-perform worse than the \gls{vanilla} \gls{SSD} variants (see table
+best in the persons class: it has a max \(F_1\) score of 0.460, consisting of a recall of 0.405 and a precision of 0.533.
+The variant shares the first place in recall with the \gls{vanilla} \gls{SSD}
+variant that uses a 0.01 confidence threshold. All Bayesian \gls{SSD} variants perform worse than the \gls{vanilla} \gls{SSD} variants (see table
 \ref{tab:results-persons}). With respect to the macro averaged result,
 all variants perform better than the average of all classes.
 
@@ -951,7 +945,7 @@ variant with \gls{NMS} and disabled dropout, and the one with 0.9 keep
 ratio have a better precision (0.460 and 0.454 respectively) than the
 \gls{vanilla} \gls{SSD} variants with 0.01 confidence threshold (0.452 and
 0.453). With respect to the macro averaged result, all variants have
-a better precision than the average and the Bayesian variant without
+a better precision than the average. The Bayesian variant without
 \gls{NMS} and dropout also has a better recall and \(F_1\) score.
 
 \begin{table}[tbp]
@@ -983,7 +977,7 @@ The best \(F_1\) score (0.288) and recall (0.251) for the chairs class
 belongs to \gls{vanilla} \gls{SSD} with \gls{entropy} threshold. Precision
 is mastered by Bayesian \gls{SSD} with \gls{NMS} and disabled dropout (0.360).
 The variant with 0.9 keep ratio has the second-highest precision (0.343)
-of all variants. Both in \(F_1\) score and recall all Bayesian variants
+of all variants. Both in \(F_1\) score and recall, all Bayesian variants
 are worse than the \gls{vanilla} variants. Compared with the macro averaged
 results, all variants perform worse than the average.
 
@@ -1077,7 +1071,7 @@ ratio.
 \end{figure}
 
 The ground truth only contains a stop sign and a truck. The differences between \gls{vanilla} \gls{SSD} and Bayesian \gls{SSD} are almost not visible
-(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is neither detected by \gls{vanilla} nor Bayesian \gls{SSD}, instead both detected a pottet plant and a traffic light. The stop sign is detected by both variants.
+(see figures \ref{fig:stop-sign-truck-vanilla} and \ref{fig:stop-sign-truck-bayesian}): the truck is neither detected by \gls{vanilla} nor Bayesian \gls{SSD}, instead both detected a "potted plant" and a traffic light. The stop sign is detected by both variants.
 This behaviour implies problems with detecting objects at the edge
 that overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.
 
@@ -1095,9 +1089,11 @@ that overwhelmingly lie outside the image frame. Furthermore, the predictions are usually identical.
 \end{minipage}
 \end{figure}
 
-Another example (see figures \ref{fig:cat-laptop-vanilla} and \ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background on the right
-side. Both variants detect a cat but the \gls{vanilla} variant detects a dog as well. The laptop and TV are not detected but this is expected since
-these classes have not been trained.
+Another example (see figures \ref{fig:cat-laptop-vanilla} and
+\ref{fig:cat-laptop-bayesian}) is a cat with a laptop/TV in the background
+on the right side. Both variants detect a cat but the \gls{vanilla}
+variant detects a dog as well. The laptop and TV are not detected but this
+is to be expected since these classes have not been trained.
 
 \chapter{Discussion and Outlook}
 
@@ -1153,7 +1149,7 @@ open set error continues to rise a bit.
 There is no visible impact of \gls{entropy} thresholding on the object detection
 performance for \gls{vanilla} \gls{SSD}. This indicates that the network has almost no
 uniform or close to uniform predictions, the vast majority of predictions
-has a high confidence in one class---including the background.
+have a high confidence in one class---including the background.
 However, the \gls{entropy} plays a larger role for the Bayesian variants---as
 expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
 and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
@@ -1190,7 +1186,7 @@ threshold indicates a worse performance.
 
 Miller et al.~\cite{Miller2018} supposedly do not use \gls{NMS}
 in their implementation of dropout sampling. Therefore, a variant with disabled \glslocalreset{NMS}
-\gls{NMS} has been tested. The results are somewhat expected:
+\gls{NMS} has been tested. The results are somewhat as expected:
 \gls{NMS} removes all non-maximum detections that overlap
 with a maximum one. This reduces the number of multiple detections per
 ground truth bounding box and therefore the false positives. Without it,
@@ -1208,7 +1204,7 @@ more than 50\% of the original observations are removed with \gls{NMS} and
 stay without---all of these are very likely to be false positives.
 
 A clear distinction between micro and macro averaging can be observed:
-recall is hardly effected with micro averaging (0.300) but goes down equally with macro averaging (0.229). For micro averaging, it does
+recall is hardly affected with micro averaging (0.300) but goes down noticeably with macro averaging (0.229). For micro averaging, it does
 not matter which class the true positives belong to: every detection
 counts the same way. This also means that top \(k\) will have only
 a marginal effect: some true positives might be removed without \gls{NMS} but overall that does not have a big impact. With macro averaging, however,
@@ -1256,7 +1252,7 @@ recall.
 \end{table}
 
 The dropout variants have largely worse performance than the Bayesian variants
-without dropout. This is expected as the network was not trained with
+without dropout. This is to be expected as the network was not trained with
 dropout and the weights are not prepared for it.
 
 Gal~\cite{Gal2017}
@@ -1282,7 +1278,7 @@ more than 430 million detections remain (see table \ref{tab:effect-dropout} for
 has slightly fewer predictions left compared to the one without dropout.
 
 After the grouping, the variant without dropout has on average between
-10 and 11 detections grouped into an observation. This is expected as every
+10 and 11 detections grouped into an observation. This is to be expected as every
 forward pass creates the exact same result and these ten identical detections
 per \gls{vanilla} \gls{SSD} detection perfectly overlap. The fact that slightly more than
 ten detections are grouped together could explain the marginally better precision
@@ -1316,5 +1312,4 @@ networks.
 
 To facilitate future work based on this thesis, the source code will be
 made available and an installable Python package will be uploaded to the
-PyPi package index. In the appendices can be found more details about the
-source code implementation as well as more figures.
+PyPi package index. More details about the source code implementation and additional figures can be found in the appendices.
ma.bib (9 changed lines)
@@ -909,4 +909,13 @@ to construct explicit models for non-normal classes. Application includes infere
   timestamp = {2019.09.09},
 }
 
+@Misc{Chollet2015,
+  author       = {Chollet, Fran\c{c}ois and others},
+  title        = {Keras},
+  year         = {2015},
+  howpublished = {\url{https://keras.io}},
+  owner        = {jim},
+  timestamp    = {2019.10.04},
+}
+
 @Comment{jabref-meta: databaseType:biblatex;}