Updated thesis with regard to new results and insights
Signed-off-by: Jim Martens <github@2martens.de>
body.tex
@@ -670,6 +670,9 @@ However, in case of a class imbalance the macro averaging
 favours classes with few detections whereas micro averaging benefits classes
 with many detections.
 
+This section only presents the results. Interpretation and discussion are found
+in the next chapter.
+
 \subsection{Micro Averaging}
 \begin{table}[ht]
 \begin{tabular}{rcccc}
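Since the distinction between micro and macro averaging underpins all of the following tables, a small illustrative sketch may help. This is not thesis code, and the counts below are made up solely to show the mechanics:

```python
from dataclasses import dataclass

@dataclass
class ClassCounts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives

def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def micro_scores(counts):
    # Micro averaging: pool the counts over all classes first,
    # then compute the metrics once on the pooled counts.
    tp = sum(c.tp for c in counts)
    fp = sum(c.fp for c in counts)
    fn = sum(c.fn for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1(precision, recall)

def macro_scores(counts):
    # Macro averaging: compute the metrics per class,
    # then average them with equal weight per class.
    precisions = [c.tp / (c.tp + c.fp) for c in counts]
    recalls = [c.tp / (c.tp + c.fn) for c in counts]
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    return precision, recall, f1(precision, recall)

# Imbalanced toy example: one frequent class, one rare class.
counts = [ClassCounts(tp=900, fp=100, fn=100), ClassCounts(tp=1, fp=9, fn=1)]
print(micro_scores(counts))  # dominated by the frequent class
print(macro_scores(counts))  # the rare class gets equal weight
```

With these made-up counts, micro precision is roughly 0.89 while macro precision is 0.5: the rare class with its poor precision pulls the macro average down, which is exactly the class-imbalance sensitivity described above.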
@@ -691,7 +694,7 @@ with many detections.
 \hline
 \end{tabular}
 \caption{Rounded results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
-their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
+their best performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
 entropy threshold of 2.4, Bayesian SSD without non-maximum suppression performed best for 1.0,
 and Bayesian SSD with non-maximum suppression performed best for 1.4 as entropy
 threshold.
@@ -728,13 +731,10 @@ shows no significant impact of an entropy test. Only the open set errors
 are lower but in an insignificant way. The rest of the performance metrics is
 identical after rounding.
 
-The results for Bayesian SSD show a massive impact of the existance of
-non-maximum suppression: maximum \(F_1\) score of 0.371 (with NMS) to 0.006
-(without NMS). Dropout was disabled in both cases, making them effectively a
-vanilla SSD run with multiple forward passes.
-Therefore, the low number of open set errors with
-micro averaging (164 without NMS) does not qualify as a good result and is not
-marked bold, although it is the lowest number.
+Bayesian SSD with disabled dropout and without non-maximum suppression
+has the worst performance of all tested variants (vanilla and Bayesian)
+with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants.
+In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
 
 With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
 enabled non-maximum suppression offers the best performance with respect
@@ -755,8 +755,7 @@ in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
 variants with 0.01 confidence threshold reach much higher open set errors
 and a higher recall. This behaviour is expected as more and worse predictions
-are included. The Bayesian variant without non-maximum suppression was not
-plotted.
+are included.
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 
@@ -783,7 +782,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 \hline
 \end{tabular}
 \caption{Rounded results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
-their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
+their best performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
 entropy threshold of 1.7, Bayesian SSD without non-maximum suppression performed best for 1.5,
 and Bayesian SSD with non-maximum suppression performed best for 1.5 as entropy
 threshold. Bayesian SSD with dropout enabled and 0.9 keep ratio performed
@@ -819,14 +818,13 @@ shows no significant impact of an entropy test. Only the open set errors
 are lower but in an insignificant way. The rest of the performance metrics is
 almost identical after rounding.
 
-The results for Bayesian SSD show a massive impact of the existance of
-non-maximum suppression: maximum \(F_1\) score of 0.363 (with NMS) to 0.006
+The results for Bayesian SSD show a significant impact of non-maximum suppression or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) to 0.226
 (without NMS). Dropout was disabled in both cases, making them effectively a
 vanilla SSD run with multiple forward passes.
 
-With 1057 open set errors, the Bayesian SSD variant with disabled dropout and
-enabled non-maximum suppression offers the best performance with respect
-to open set errors. It also has the best \(F_1\) score (0.363) and best
+With 809 open set errors, the Bayesian SSD variant with disabled dropout and
+without non-maximum suppression offers the best performance with respect
+to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best
 precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio
 variant on recall (0.321).
 
@@ -841,8 +839,7 @@ in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
 variants with 0.01 confidence threshold reach much higher open set errors
 and a higher recall. This behaviour is expected as more and worse predictions
-are included. The Bayesian variant without non-maximum suppression was not
-plotted.
+are included.
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 
@@ -881,7 +878,7 @@ performance for vanilla SSD. This indicates that the network has almost no
 uniform or close to uniform predictions, the vast majority of predictions
 has a high confidence in one class - including the background.
 However, the entropy plays a larger role for the Bayesian variants - as
-expected: the best performing thresholds are 1.3 and 1.4 for micro averaging,
+expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
 and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
 threshold is not the largest threshold tested. A lower threshold likely
 eliminated some false positives from the result set. On the other hand a
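The entropy test these thresholds refer to can be sketched as follows. This is an illustrative reading rather than the thesis implementation: the Shannon entropy (in nats) of each detection's predicted class distribution is compared against a fixed threshold, and near-uniform (uncertain) detections are discarded:

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a discrete class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_test(detections, threshold):
    # Keep a detection only if the entropy of its class distribution
    # is below the threshold; close-to-uniform detections are dropped.
    return [d for d in detections if entropy(d["probs"]) < threshold]

# Hypothetical detections with three classes (made-up distributions).
confident = {"probs": [0.9, 0.05, 0.05]}   # low entropy (~0.39), kept
uncertain = {"probs": [0.34, 0.33, 0.33]}  # near uniform (~1.10), dropped
kept = entropy_test([confident, uncertain], threshold=1.0)
```

A low threshold removes uncertain detections (often false positives) but, if set too low, also removes true positives, which matches the trade-off described above.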
@@ -891,19 +888,20 @@ too low threshold likely eliminated true positives as well.
 
 Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
 in their implementation of dropout sampling. Therefore, a variant with disabled
-non-maximum suppression (NMS) was tested. The disastrous results heavily imply
-that NMS is crucial and pose serious questions about the implementation of
-Miller et al., who still have not released source code.
+non-maximum suppression (NMS) was tested. The results are somewhat expected:
+non-maximum suppression removes all non-maximum detections that overlap
+with a maximum one. This reduces the number of multiple detections per
+ground truth bounding box and therefore the false positives. Without it,
+a lot more false positives remain and have a negative impact on precision.
+In combination with top \(k\) selection, recall can be affected:
+duplicate detections could stay and maxima boxes could be removed.
 
-Without NMS all detections passing the per-class confidence threshold are
-directly ordered in descending order by their confidence value. Afterwards the
-top \(k\) detections are kept. This enables the following scenario:
-the first top \(k\) detections all belong to the same class and potentially
-object. Detections of other classes and objects could be discarded, reducing
-recall in the process. Multiple detections of the same object also increase
-the number of false positives, further reducing the \(F_1\) score.
+A clear distinction between micro and macro averaging can be observed:
+recall is hardly affected with micro averaging (0.300) but goes down with
+macro averaging (0.229). % TODO: explain why micro and macro differ in result
+% TODO: give evidence for claim that more false positives are left without NMS
 
-\subsection*{Dropout}
+\subsection*{Dropout Sampling and Observations}
 
 The dropout variants have largely worse performance than the Bayesian variants
 without dropout. This is expected as the network was not trained with
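The mechanism described above can be illustrated with a standard greedy NMS sketch. This is the textbook algorithm, not necessarily the exact thesis implementation; the IOU threshold and box format are assumptions:

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.45):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop all
    # lower-scoring boxes that overlap it by more than the threshold.
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(d["box"], best["box"]) <= iou_threshold]
    return kept

dets = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (1, 1, 10, 10), "score": 0.8},    # near-duplicate of the first
    {"box": (20, 20, 30, 30), "score": 0.7},  # separate object
]
print(len(nms(dets)))  # 2: the duplicate is suppressed
```

Without this step, the near-duplicate survives as an extra false positive, and under a fixed top \(k\) budget such duplicates can crowd out detections of other objects.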
@@ -911,39 +909,42 @@ dropout and the weights are not prepared for it.
 
 Gal~\cite{Gal2017}
 showed that networks \textbf{trained} with dropout are approximate Bayesian
-models. Miller et al. never fine-tuned or properly trained SSD after
-the dropout layers were inserted. Therefore, the Bayesian variant of SSD
-implemented in this thesis is not guaranteed to be such an approximate
-model.
+models. The Bayesian variants of SSD implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.
 
-These results further question the reported results of Miller et al., who
-reported significantly better results of dropout sampling compared to vanilla
-SSD. Admittedly, they used the network not on COCO but SceneNet RGB-D~\cite{McCormac2017}. However, they also claim that no fine-tuning
-for SceneNet took place. Applying SSD to an unknown data set should result
-in overall worse performance. Attempts to replicate their work on SceneNet RGB-D
-failed with miserable results even for vanilla SSD, further attempts for this
-thesis were not made. But Miller et al. used
-a different implementation of SSD, therefore, it is possible that their
-implementation worked on SceneNet without fine-tuning.
+But dropout alone does not explain the difference in results. Both variants
+with and without dropout have the exact same number of detections coming
+out of the network (8732 per image per forward pass). With 16 images in a batch,
+308 batches, and 10 forward passes, the total number of detections is
+an astounding 430,312,960 detections. As such a large number could not be
+handled in memory, only one batch is calculated at a time. That
+still leaves 1,397,120 detections per batch. These have to be grouped into
+observations, including a quadratic calculation of mutual IOU scores.
+Therefore, these detections are filtered by removing all those with background
+confidence levels of 0.8 or higher.
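The detection counts quoted above follow directly from the batch configuration; a quick check, with all numbers taken from the text:

```python
detections_per_image = 8732  # fixed number of SSD prior boxes per image
images_per_batch = 16
batches = 308
forward_passes = 10

# Detections that must be grouped for one batch across all forward passes.
per_batch = detections_per_image * images_per_batch * forward_passes
# Total detections over the whole run.
total = per_batch * batches
print(per_batch)  # 1397120
print(total)      # 430312960
```

The quadratic cost of computing mutual IOU scores on over a million detections per batch is what motivates the background-confidence filter described above.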
 
-\subsection*{Sampling and Observations}
+The number of detections per class was measured before and after the
+detections were grouped into observations. To this end, the stored predictions
+were unbatched and summed together. After the aforementioned filter
+and before the grouping, roughly 0.4\% (in fact less than that) of the
+more than 430 million detections are remaining. The variant with dropout
+has slightly fewer predictions left compared to the one without dropout.
 
-It is remarkable that the Bayesian variant with disabled dropout and
-non-maximum suppression performed better than vanilla SSD with respect to
-open set errors. This indicates a relevant impact of multiple forward
-passes and the grouping of observations on the result. With disabled
-dropout, the ten forward passes should all produce the same results,
-resulting in ten identical detections for every detection in vanilla SSD.
-The variation in the result can only originate from the grouping into
-observations.
+After the grouping, the variant without dropout has on average between
+10 and 11 detections grouped into an observation. This is expected as every
+forward pass creates the exact same result and these 10 identical detections
+per vanilla SSD detection perfectly overlap. The fact that slightly more than
+10 detections are grouped together could explain the marginally better precision
+of the Bayesian variant without dropout compared to vanilla SSD.
+However, on average only three detections are grouped together into an
+observation if dropout with 0.9 keep ratio is enabled. This does not
+negatively impact recall as true positives do not disappear but offers
+a higher chance of false positives. This can be observed in the results, which
+clearly show no negative impact on recall between the variants without
+dropout and dropout with 0.9 keep ratio.
 
-All detections that overlap by at least 95\% with each other
-are grouped into an observation. For every ten identical detections one
-observation should be the result. However, due to the 95\% overlap rather than
-100\%, more than ten detections could be grouped together. This would result
-in fewer overall observations compared to the number of detections
-in vanilla SSD. Such a lower number reduces the chance for the network
-to make mistakes.
+This behaviour implies that even a slight usage of dropout creates such
+diverging anchor box offsets that the resulting detections from multiple
+forward passes no longer have a mutual IOU score of 0.95 or higher.
 
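The grouping into observations can be sketched as a greedy clustering by mutual IOU. This is illustrative only; the thesis implementation may differ in details such as how a group's representative box is chosen:

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def group_observations(boxes, threshold=0.95):
    # Greedy grouping: a box joins the first group whose seed box it
    # overlaps by at least the threshold, otherwise it starts a new group.
    groups = []
    for box in boxes:
        for group in groups:
            if iou(box, group[0]) >= threshold:
                group.append(box)
                break
        else:
            groups.append([box])
    return groups

# Ten identical detections (dropout disabled) collapse into one observation.
identical = [(0.0, 0.0, 10.0, 10.0)] * 10
print(len(group_observations(identical)))  # 1
```

A box that diverges only slightly, e.g. `(0.0, 0.0, 10.0, 9.0)` against `(0.0, 0.0, 10.0, 10.0)`, already has an IOU of 0.9 and falls below the 0.95 threshold, which illustrates why even a 0.9 keep ratio breaks up the groups.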
 \section*{Outlook}
 