Updated thesis with respect to new results and insights

Signed-off-by: Jim Martens <github@2martens.de>
This commit is contained in:
2019-09-15 13:21:53 +02:00
parent 27d5f5c402
commit 20de5336f0

body.tex

@ -670,6 +670,9 @@ However, in case of a class imbalance the macro averaging
favours classes with few detections whereas micro averaging benefits classes
with many detections.
This section only presents the results. Interpretation and discussion are found
in the next chapter.
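The difference between the two averaging schemes can be sketched as follows. This is a minimal illustration with made-up per-class counts (not the thesis data): micro averaging sums the counts over all classes before computing one score, while macro averaging scores each class separately and averages the scores.

```python
# Sketch: micro vs. macro averaged F1 from per-class detection counts.
# The counts below are hypothetical and only illustrate the averaging schemes.

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# per-class (TP, FP, FN): one class with many detections, one with few
counts = {"person": (900, 100, 200), "toaster": (2, 1, 1)}

# micro: sum the counts over all classes first, then compute F1 once
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

# macro: compute F1 per class, then average the per-class scores
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

print(round(micro_f1, 3), round(macro_f1, 3))
```

Here the micro score is dominated by the large "person" class, while the macro score gives the small "toaster" class equal weight and is pulled down by it, matching the behaviour described above.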
\subsection{Micro Averaging}
\begin{table}[ht]
\begin{tabular}{rcccc}
@ -691,7 +694,7 @@ with many detections.
\hline
\end{tabular}
\caption{Rounded results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
their best performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
entropy threshold of 2.4, Bayesian SSD without non-maximum suppression performed best for 1.0,
and Bayesian SSD with non-maximum suppression performed best for 1.4 as entropy
threshold.
@ -728,13 +731,10 @@ shows no significant impact of an entropy test. Only the open set errors
are lower, but insignificantly so. The remaining performance metrics are
identical after rounding.
Bayesian SSD with disabled dropout and without non-maximum suppression
has the worst performance of all tested variants (vanilla and Bayesian)
with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower than that of all other variants.
In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
enabled non-maximum suppression offers the best performance with respect
@ -755,8 +755,7 @@ in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included.
All plotted variants show a similar behaviour that is in line with previously
reported figures, such as the ones in Miller et al.~\cite{Miller2018}
@ -783,7 +782,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
\hline
\end{tabular}
\caption{Rounded results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
their best performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
entropy threshold of 1.7, Bayesian SSD without non-maximum suppression performed best for 1.5,
and Bayesian SSD with non-maximum suppression performed best for 1.5 as entropy
threshold. Bayesian SSD with dropout enabled and 0.9 keep ratio performed
@ -819,14 +818,13 @@ shows no significant impact of an entropy test. Only the open set errors
are lower, but insignificantly so. The remaining performance metrics are
almost identical after rounding.
The results for Bayesian SSD show a significant impact of non-maximum suppression or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) to 0.226
(without NMS). Dropout was disabled in both cases, making them effectively a
vanilla SSD run with multiple forward passes.
With 809 open set errors, the Bayesian SSD variant with disabled dropout and
without non-maximum suppression offers the best performance with respect
to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best
precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio
variant on recall (0.321).
@ -841,8 +839,7 @@ in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
variants with 0.01 confidence threshold reach much higher open set errors
and a higher recall. This behaviour is expected as more and worse predictions
are included.
All plotted variants show a similar behaviour that is in line with previously
reported figures, such as the ones in Miller et al.~\cite{Miller2018}
@ -881,7 +878,7 @@ performance for vanilla SSD. This indicates that the network has almost no
uniform or close to uniform predictions; the vast majority of predictions
have a high confidence in one class, including the background.
However, the entropy plays a larger role for the Bayesian variants - as
expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
threshold is not the largest threshold tested. A lower threshold likely
eliminated some false positives from the result set. On the other hand a
@ -891,19 +888,20 @@ too low threshold likely eliminated true positives as well.
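The entropy test described above can be sketched as follows. This is a minimal illustration: a detection is kept only if the entropy of its class distribution stays below the threshold, so near-uniform predictions are discarded. The probability vectors below are made up; only the thresholding logic reflects the text.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def passes_entropy_test(probs, threshold):
    # keep a detection only if its class distribution is sufficiently peaked
    return entropy(probs) < threshold

confident = [0.9, 0.05, 0.05]  # peaked distribution, low entropy
uniform = [1/3, 1/3, 1/3]      # maximum entropy for 3 classes: ln 3 ≈ 1.099

print(passes_entropy_test(confident, 1.0))  # True
print(passes_entropy_test(uniform, 1.0))    # False
```

This also illustrates why a very low threshold removes true positives along with false ones: any correct detection whose distribution is not sharply peaked falls below the bar as well.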
Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
in their implementation of dropout sampling. Therefore, a variant with disabled
non-maximum suppression (NMS) was tested. The results are somewhat expected:
non-maximum suppression removes all non-maximum detections that overlap
with a maximum one. This reduces the number of multiple detections per
ground truth bounding box and therefore the number of false positives. Without it,
many more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected as well:
duplicate detections could stay and maxima boxes could be removed.
Without NMS, all detections passing the per-class confidence threshold are
sorted in descending order by their confidence value. Afterwards the
top \(k\) detections are kept. This enables the following scenario:
the first top \(k\) detections all belong to the same class and potentially
the same object. Detections of other classes and objects could be discarded, reducing
recall in the process. Multiple detections of the same object also increase
the number of false positives, further reducing the \(F_1\) score.
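The interaction between NMS and top \(k\) selection can be sketched as follows. This is a simplified greedy NMS over hypothetical boxes, not the implementation used in the experiments: without NMS the top \(k\) slots fill up with near-duplicate detections of one object, while NMS lets a second object survive the selection.

```python
# Sketch: greedy NMS followed by top-k selection. Boxes are
# (x1, y1, x2, y2); all coordinates and scores are hypothetical.

def iou(a, b):
    # intersection-over-union of two axis-aligned boxes
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(dets, iou_threshold=0.45):
    """Keep maxima; drop lower-scored boxes overlapping a kept box."""
    dets = sorted(dets, key=lambda d: d["score"], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(d)
    return kept

dets = [  # three near-identical boxes on one object, one distinct object
    {"box": (10, 10, 50, 50), "score": 0.95},
    {"box": (11, 11, 51, 51), "score": 0.94},
    {"box": (12, 10, 52, 50), "score": 0.93},
    {"box": (200, 200, 240, 240), "score": 0.60},
]

k = 2
without_nms = sorted(dets, key=lambda d: d["score"], reverse=True)[:k]
with_nms = nms(dets)[:k]
# without_nms holds two duplicates of one object; with_nms keeps
# one detection per object, so the second object survives top-k.
```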
A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but goes down with
macro averaging (0.229). % TODO: explain why micro and macro differ in result
% TODO: give evidence for claim that more false positives are left without NMS
\subsection*{Dropout Sampling and Observations}
The dropout variants have largely worse performance than the Bayesian variants
without dropout. This is expected as the network was not trained with
@ -911,39 +909,42 @@ dropout and the weights are not prepared for it.
Gal~\cite{Gal2017}
showed that networks \textbf{trained} with dropout are approximate Bayesian
models. The Bayesian variants of SSD implemented in this thesis are neither fine-tuned nor trained with dropout; therefore, they are not guaranteed to be such approximate models.
These results further question the findings of Miller et al., who
reported significantly better results for dropout sampling compared to vanilla
SSD. Admittedly, they used the network not on COCO but on SceneNet RGB-D~\cite{McCormac2017}. However, they also claim that no fine-tuning
for SceneNet took place. Applying SSD to an unknown data set should result
in overall worse performance. Attempts to replicate their work on SceneNet RGB-D
failed with miserable results even for vanilla SSD; further attempts for this
thesis were not made. However, Miller et al. used
a different implementation of SSD, so it is possible that their
implementation worked on SceneNet without fine-tuning.
Dropout alone, however, does not explain the difference in results. Both variants
with and without dropout have the exact same number of detections coming
out of the network (8732 per image per forward pass). With 16 images in a batch,
308 batches, and 10 forward passes, the total number of detections is
an astounding 430,312,960. As such a large number could not be
handled in memory, only one batch is calculated at a time. That
still leaves 1,397,120 detections per batch. These have to be grouped into
observations, which includes a quadratic calculation of mutual IOU scores.
Therefore, the detections are first filtered by removing all those with background
confidence levels of 0.8 or higher.
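The detection counts above follow directly from the batch configuration, and the background filter can be sketched as follows. The score-array layout (softmax scores with the background class at index 0, 60 object classes) is an assumption for illustration only.

```python
import numpy as np

# Detection counts from the text.
per_image = 8732        # SSD detections per image per forward pass
images_per_batch = 16
batches = 308
forward_passes = 10

per_batch = per_image * images_per_batch * forward_passes
total = per_batch * batches
print(per_batch, total)  # 1397120 430312960

# Sketch of the background filter: drop every detection whose background
# confidence is 0.8 or higher. Layout assumption: column 0 is background.
rng = np.random.default_rng(0)
scores = rng.dirichlet(np.ones(61), size=10_000)  # small random stand-in
kept = scores[scores[:, 0] < 0.8]
```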
The number of detections per class was measured before and after the
detections were grouped into observations. To this end, the stored predictions
were unbatched and summed together. After the aforementioned filter
and before the grouping, roughly 0.4\% (in fact slightly less) of the
more than 430 million detections remain. The variant with dropout
has slightly fewer predictions left compared to the one without dropout.
It is remarkable that the Bayesian variant with disabled dropout and
non-maximum suppression performed better than vanilla SSD with respect to
open set errors. This indicates a relevant impact of the multiple forward
passes and the grouping into observations on the result. With disabled
dropout, the ten forward passes should all produce the same results,
resulting in ten identical detections for every detection in vanilla SSD.
The variation in the result can therefore only originate from the grouping into
observations.
After the grouping, the variant without dropout has on average between
10 and 11 detections grouped into an observation. This is expected as every
forward pass creates the exact same result and these 10 identical detections
per vanilla SSD detection perfectly overlap. The fact that slightly more than
10 detections are grouped together could explain the marginally better precision
of the Bayesian variant without dropout compared to vanilla SSD.
However, on average only three detections are grouped together into an
observation if dropout with a 0.9 keep ratio is enabled. This does not
negatively impact recall, as true positives do not disappear, but it offers
a higher chance of false positives. This can be observed in the results, which
clearly show no negative impact on recall between the variants without
dropout and with dropout at a 0.9 keep ratio.
All detections that overlap by at least 95\% with each other
are grouped into an observation. Every ten identical detections should
result in one observation. However, due to the 95\% overlap rather than
100\%, more than ten detections could be grouped together. This would result
in fewer overall observations compared to the number of detections
in vanilla SSD. Such a lower number reduces the chance for the network
to make mistakes.
This behaviour implies that even a slight use of dropout creates such
diverging anchor box offsets that the resulting detections from multiple
forward passes no longer have a mutual IOU score of 0.95 or higher.
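The grouping criterion can be sketched as a greedy clustering over mutual IOU. This is a simplified stand-in for the thesis implementation: each detection joins the first group whose representative box it overlaps by at least 95\%, otherwise it starts a new group. The box coordinates are hypothetical.

```python
# Sketch: group detections whose boxes overlap by >= 95% IoU into
# observations, greedily against each group's first box.

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def group_observations(boxes, threshold=0.95):
    groups = []
    for box in boxes:
        for g in groups:
            if iou(box, g[0]) >= threshold:
                g.append(box)
                break
        else:
            groups.append([box])
    return groups

# Ten identical boxes (dropout disabled) collapse into one observation;
# a slightly shifted box, as dropout might produce, starts a new group
# because its IoU with the identical boxes is only about 0.85.
identical = [(10, 10, 110, 110)] * 10
shifted = [(18, 10, 118, 110)]
obs = group_observations(identical + shifted)
print(len(obs), [len(g) for g in obs])  # 2 [10, 1]
```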
\section*{Outlook}