From 20de5336f08fa81cac84d84b426ad3e239463685 Mon Sep 17 00:00:00 2001
From: Jim Martens
Date: Sun, 15 Sep 2019 13:21:53 +0200
Subject: [PATCH] Updated thesis wrt new results and insights

Signed-off-by: Jim Martens
---
 body.tex | 119 ++++++++++++++++++++++++++++---------------------------
 1 file changed, 60 insertions(+), 59 deletions(-)

diff --git a/body.tex b/body.tex
index 739375a..96b45df 100644
--- a/body.tex
+++ b/body.tex
@@ -670,6 +670,9 @@ However, in case of a class imbalance the macro averaging favours classes
 with few detections whereas micro averaging benefits classes
 with many detections.
 
+This section only presents the results. Interpretation and discussion are
+found in the next chapter.
+
 \subsection{Micro Averaging}
 \begin{table}[ht]
 \begin{tabular}{rcccc}
@@ -691,7 +694,7 @@ with many detections.
 \hline
 \end{tabular}
 \caption{Rounded results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
-        their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
+        their best-performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
         entropy threshold of 2.4, Bayesian SSD without non-maximum suppression performed
         best for 1.0, and Bayesian SSD with non-maximum suppression performed best
         for 1.4 as entropy threshold.
@@ -728,13 +731,10 @@ shows no significant impact of an entropy test.
 Only the open set errors are
 lower but in an insignificant way. The rest of the performance metrics is
 identical after rounding.
 
-The results for Bayesian SSD show a massive impact of the existance of
-non-maximum suppression: maximum \(F_1\) score of 0.371 (with NMS) to 0.006
-(without NMS). Dropout was disabled in both cases, making them effectively a
-vanilla SSD run with multiple forward passes.
-Therefore, the low number of open set errors with
-micro averaging (164 without NMS) does not qualify as a good result and is not
-marked bold, although it is the lowest number.
+Bayesian SSD with disabled dropout and without non-maximum suppression
+has the worst performance of all tested variants (vanilla and Bayesian)
+with respect to \(F_1\) score (0.209) and precision (0.161). Its precision is not only the worst but also significantly lower than that of every other variant.
+Compared with all variants using a 0.2 confidence threshold, it has the worst recall (0.300) as well.
 
 With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
 enabled non-maximum suppression offers the best performance with respect
@@ -755,8 +755,7 @@ in figure \ref{fig:ose-f1-micro}.
 Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
 variants with 0.01 confidence threshold reach much higher open set errors
 and a higher recall. This behaviour is expected as more and worse predictions
-are included. The Bayesian variant without non-maximum suppression was not
-plotted.
+are included.
 
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
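+To make the difference between the two averaging schemes concrete, the
+following Python sketch computes micro- and macro-averaged metrics from
+per-class counts of true positives, false positives, and false negatives.
+The counts are invented for illustration, and the macro \(F_1\) score is
+taken here as the mean of the per-class \(F_1\) scores; the sketch is a
+simplified illustration, not the evaluation code used for the experiments:
+
+\begin{verbatim}
+def precision_recall_f1(tp, fp, fn):
+    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
+    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
+    f1 = (2 * precision * recall / (precision + recall)
+          if precision + recall > 0 else 0.0)
+    return precision, recall, f1
+
+# invented per-class counts: (true positives, false positives, false negatives)
+counts = {"person": (900, 300, 250),   # class with many detections
+          "toaster": (3, 1, 40)}       # class with few detections
+
+# micro averaging: sum the counts over all classes, then compute once;
+# classes with many detections dominate the result
+totals = [sum(c[i] for c in counts.values()) for i in range(3)]
+print("micro:", precision_recall_f1(*totals))
+
+# macro averaging: compute the metrics per class, then average the scores;
+# every class carries the same weight
+scores = [precision_recall_f1(*c) for c in counts.values()]
+print("macro:", tuple(sum(s[i] for s in scores) / len(scores) for i in range(3)))
+\end{verbatim}
+
+With these counts the micro-averaged recall stays high because the class
+with many detections dominates, while the macro-averaged recall collapses
+because of the poorly detected class - exactly the imbalance effect
+described at the beginning of this section.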
@@ -783,7 +782,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 \hline
 \end{tabular}
 \caption{Rounded results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
-        their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
+        their best-performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
         entropy threshold of 1.7, Bayesian SSD without non-maximum suppression performed
         best for 1.5, and Bayesian SSD with non-maximum suppression performed best
         for 1.5 as entropy threshold. Bayesian SSD with dropout enabled and 0.9 keep ratio performed
@@ -819,14 +818,13 @@ shows no significant impact of an entropy test.
 Only the open set errors are
 lower but in an insignificant way. The rest of the performance metrics is
 almost identical after rounding.
 
-The results for Bayesian SSD show a massive impact of the existance of
-non-maximum suppression: maximum \(F_1\) score of 0.363 (with NMS) to 0.006
+The results for Bayesian SSD show a significant impact of non-maximum suppression: the maximum \(F_1\) score drops from 0.363 (with NMS) to 0.226
 (without NMS). Dropout was disabled in both cases, making them effectively a
 vanilla SSD run with multiple forward passes.
 
-With 1057 open set errors, the Bayesian SSD variant with disabled dropout and
-enabled non-maximum suppression offers the best performance with respect
-to open set errors. It also has the best \(F_1\) score (0.363) and best
+With 809 open set errors, the Bayesian SSD variant with disabled dropout and
+without non-maximum suppression offers the best performance with respect
+to open set errors. The variant with disabled dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best
 precision (0.420) of all Bayesian variants, and ties with the 0.9 keep
 ratio variant on recall (0.321).
 
@@ -841,8 +839,7 @@ in figure \ref{fig:ose-f1-macro}.
 Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
 variants with 0.01 confidence threshold reach much higher open set errors
 and a higher recall. This behaviour is expected as more and worse predictions
-are included. The Bayesian variant without non-maximum suppression was not
-plotted.
+are included.
 
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
@@ -881,7 +878,7 @@ performance for vanilla SSD. This indicates that the network has almost no
 uniform or close to uniform predictions, the vast majority of predictions
 has a high confidence in one class - including the background.
 However, the entropy plays a larger role for the Bayesian variants - as
-expected: the best performing thresholds are 1.3 and 1.4 for micro averaging,
+expected: the best-performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
 and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
 threshold is not the largest threshold tested. A lower threshold likely
 eliminated some false positives from the result set. On the other hand a
 too low threshold likely eliminated true positives as well.
@@ -891,19 +888,20 @@ too low threshold likely eliminated true positives as well.
 
 Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
 in their implementation of dropout sampling. Therefore, a variant with disabled
-non-maximum suppression (NMS) was tested. The disastrous results heavily imply
-that NMS is crucial and pose serious questions about the implementation of
-Miller et al., who still have not released source code.
+non-maximum suppression (NMS) was tested. The results are somewhat expected:
+non-maximum suppression removes all non-maximum detections that overlap
+with a maximum one. This reduces the number of multiple detections per
+ground truth bounding box and therefore the number of false positives.
+Without NMS, far more false positives remain and drag down precision.
+In combination with top \(k\) selection, recall can suffer as well:
+duplicate detections of one object can fill the top \(k\) slots while the
+maximum boxes of other objects are pushed out.
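+The following minimal Python sketch illustrates this interaction between
+NMS and top \(k\) selection. The boxes, scores, the IOU threshold of 0.45,
+and the value of \(k\) are invented for illustration; this is not the SSD
+implementation used in this thesis:
+
+\begin{verbatim}
+def iou(a, b):
+    """Intersection over union of two (x1, y1, x2, y2) boxes."""
+    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
+    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
+    inter = max(0, x2 - x1) * max(0, y2 - y1)
+    union = ((a[2] - a[0]) * (a[3] - a[1])
+             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
+    return inter / union if union > 0 else 0.0
+
+def nms(detections, threshold=0.45):
+    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
+    kept = []
+    for det in sorted(detections, key=lambda d: d[1], reverse=True):
+        if all(iou(det[2], k[2]) < threshold for k in kept):
+            kept.append(det)
+    return kept
+
+# three near-identical detections of one object plus one other object
+detections = [("cat", 0.95, (10, 10, 50, 50)),
+              ("cat", 0.94, (11, 10, 51, 50)),
+              ("cat", 0.93, (10, 11, 50, 51)),
+              ("dog", 0.60, (80, 80, 120, 120))]
+
+top_k = 2
+by_score = sorted(detections, key=lambda d: d[1], reverse=True)
+print("without NMS:", [d[:2] for d in by_score[:top_k]])        # two cats
+print("with NMS:   ", [d[:2] for d in nms(detections)[:top_k]])  # cat and dog
+\end{verbatim}
+
+Without NMS, the near-identical detections of the first object fill both
+top \(k\) slots: one of them is a false positive and the second object is
+lost. With NMS, the duplicates collapse and both objects survive.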
-Without NMS all detections passing the per-class confidence threshold are
-directly ordered in descending order by their confidence value. Afterwards the
-top \(k\) detections are kept. This enables the following scenario:
-the first top \(k\) detections all belong to the same class and potentially
-object. Detections of other classes and objects could be discarded, reducing
-recall in the process. Multiple detections of the same object also increase
-the number of false positives, further reducing the \(F_1\) score.
+A clear distinction between micro and macro averaging can be observed:
+recall is hardly affected under micro averaging (0.300) but drops clearly
+under macro averaging (0.229). A plausible reason is that micro averaging
+is dominated by classes with many detections, which suffer least from the
+missing suppression, whereas macro averaging weights all classes equally.
+% TODO: explain why micro and macro differ in result
+% TODO: give evidence for claim that more false positives are left without NMS
 
-\subsection*{Dropout}
+\subsection*{Dropout Sampling and Observations}
 
 The dropout variants have largely worse performance than the Bayesian
 variants without dropout. This is expected as the network was not trained with
@@ -911,39 +909,42 @@ dropout and the weights are not prepared for it.
 
 Gal~\cite{Gal2017} showed that networks \textbf{trained} with dropout are approximate Bayesian
-models. Miller et al. never fine-tuned or properly trained SSD after
-the dropout layers were inserted. Therefore, the Bayesian variant of SSD
-implemented in this thesis is not guaranteed to be such an approximate
-model.
+models. The Bayesian variants of SSD implemented in this thesis are neither
+fine-tuned nor trained with dropout; therefore, they are not guaranteed to
+be such approximate models.
 
-These results further question the reported results of Miller et al., who
-reported significantly better results of dropout sampling compared to vanilla
-SSD. Admittedly, they used the network not on COCO but SceneNet RGB-D~\cite{McCormac2017}. However, they also claim that no fine-tuning
-for SceneNet took place. Applying SSD to an unknown data set should result
-in overall worse performance. Attempts to replicate their work on SceneNet RGB-D
-failed with miserable results even for vanilla SSD, further attempts for this
-thesis were not made. But Miller et al. used
-a different implementation of SSD, therefore, it is possible that their
-implementation worked on SceneNet without fine-tuning.
+But dropout alone does not explain the difference in results. Both variants
+with and without dropout have exactly the same number of detections coming
+out of the network (8732 per image per forward pass). With 16 images per
+batch and 10 forward passes, each batch yields 1,397,120 detections; across
+308 batches this amounts to an astounding 430,312,960 detections. As such a
+number cannot be held in memory at once, only one batch is processed at a
+time. The detections of a batch have to be grouped into observations, which
+requires a quadratic number of pairwise IOU computations. Therefore, the
+detections are first filtered by removing all those with a background
+confidence of 0.8 or higher.
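+The following Python sketch illustrates these two steps on invented data:
+first the background-confidence filter, then a greedy grouping of mutually
+overlapping detections into observations. The detections and the exact
+grouping strategy are illustrative assumptions rather than the
+implementation used for the experiments; the IOU threshold of 0.95 is the
+one named at the end of this section:
+
+\begin{verbatim}
+def iou(a, b):
+    """Intersection over union of two (x1, y1, x2, y2) boxes."""
+    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
+    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
+    inter = max(0, x2 - x1) * max(0, y2 - y1)
+    union = ((a[2] - a[0]) * (a[3] - a[1])
+             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
+    return inter / union if union > 0 else 0.0
+
+# (background confidence, box) pairs from several forward passes
+detections = [(0.05, (10, 10, 50, 50)),   # object A, pass 1
+              (0.10, (10, 10, 50, 50)),   # object A, pass 2
+              (0.95, (10, 10, 50, 50)),   # confidently background
+              (0.20, (30, 30, 90, 90))]   # object B
+
+# step 1: drop detections with a background confidence of 0.8 or higher
+remaining = [d for d in detections if d[0] < 0.8]
+
+# step 2: group detections whose mutual IOU is at least 0.95 into
+# observations (pairwise IOU checks, hence the quadratic cost)
+observations = []
+for conf, box in remaining:
+    for group in observations:
+        if all(iou(box, other) >= 0.95 for other in group):
+            group.append(box)
+            break
+    else:
+        observations.append([box])
+
+print(len(remaining), "of", len(detections), "detections remain")
+print("detections per observation:", [len(g) for g in observations])
+\end{verbatim}
+
+In the sketch, the two surviving detections of the first object form one
+observation, while the confidently-background detection never enters the
+grouping - mirroring the pipeline described above.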
 
-\subsection*{Sampling and Observations}
+The number of detections per class was measured before and after the
+detections were grouped into observations. To this end, the stored
+predictions were unbatched and summed together. After the aforementioned
+filter and before the grouping, roughly 0.4\% (in fact slightly less) of
+the more than 430 million detections remain. The variant with dropout
+retains slightly fewer predictions than the one without dropout.
 
-It is remarkable that the Bayesian variant with disabled dropout and
-non-maximum suppression performed better than vanilla SSD with respect to
-open set errors. This indicates a relevant impact of multiple forward
-passes and the grouping of observations on the result. With disabled
-dropout, the ten forward passes should all produce the same results,
-resulting in ten identical detections for every detection in vanilla SSD.
-The variation in the result can only originate from the grouping into
-observations.
+After the grouping, the variant without dropout has on average between
+10 and 11 detections per observation. This is expected: every forward pass
+produces exactly the same result, and these 10 identical detections per
+vanilla SSD detection overlap perfectly. That slightly more than 10
+detections are grouped together could explain the marginally better
+precision of the Bayesian variant without dropout compared to vanilla SSD.
+However, on average only three detections are grouped into an observation
+when dropout with a 0.9 keep ratio is enabled. This does not hurt recall,
+as true positives do not disappear, but it increases the chance of false
+positives. The results reflect this: recall does not degrade between the
+variant without dropout and the one with a 0.9 keep ratio.
 
-All detections that overlap by at least 95\% with each other
-are grouped into an observation. For every ten identical detections one
-observation should be the result. However, due to the 95\% overlap rather than
-100\%, more than ten detections could be grouped together. This would result
-in fewer overall observations compared to the number of detections
-in vanilla SSD. Such a lower number reduces the chance for the network
-to make mistakes.
+This behaviour implies that even a slight use of dropout perturbs the
+anchor box offsets enough that the resulting detections from multiple
+forward passes no longer have a mutual IOU score of 0.95 or higher.
 
 \section*{Outlook}