Updated thesis with regard to new results and insights
Signed-off-by: Jim Martens <github@2martens.de>
body.tex
@@ -670,6 +670,9 @@ However, in case of a class imbalance the macro averaging
 favours classes with few detections whereas micro averaging benefits classes
 with many detections.
 
+This section only presents the results. Interpretation and discussion are found
+in the next chapter.
+
 \subsection{Micro Averaging}
 \begin{table}[ht]
 \begin{tabular}{rcccc}
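Since the distinction between micro and macro averaging underpins all of the following tables, a small illustrative sketch may help. This is not thesis code, and the counts below are made up solely to show the mechanics:

```python
from dataclasses import dataclass

@dataclass
class ClassCounts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives

def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def micro_scores(counts):
    # Micro averaging: pool the counts over all classes first,
    # then compute the metrics once on the pooled counts.
    tp = sum(c.tp for c in counts)
    fp = sum(c.fp for c in counts)
    fn = sum(c.fn for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1(precision, recall)

def macro_scores(counts):
    # Macro averaging: compute the metrics per class,
    # then average them with equal weight per class.
    precisions = [c.tp / (c.tp + c.fp) for c in counts]
    recalls = [c.tp / (c.tp + c.fn) for c in counts]
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    return precision, recall, f1(precision, recall)

# Imbalanced toy example: one frequent class, one rare class.
counts = [ClassCounts(tp=900, fp=100, fn=100), ClassCounts(tp=1, fp=9, fn=1)]
print(micro_scores(counts))  # dominated by the frequent class
print(macro_scores(counts))  # the rare class gets equal weight
```

With these made-up counts, micro precision is roughly 0.89 while macro precision is 0.5: the rare class with its poor precision pulls the macro average down, which is exactly the class-imbalance sensitivity described above.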
@@ -691,7 +694,7 @@ with many detections.
 \hline
 \end{tabular}
 \caption{Rounded results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
-their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
+their best performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
 entropy threshold of 2.4, Bayesian SSD without non-maximum suppression performed best for 1.0,
 and Bayesian SSD with non-maximum suppression performed best for 1.4 as entropy
 threshold.
@@ -728,13 +731,10 @@ shows no significant impact of an entropy test. Only the open set errors
 are lower but in an insignificant way. The rest of the performance metrics is
 identical after rounding.
 
-The results for Bayesian SSD show a massive impact of the existance of
-non-maximum suppression: maximum \(F_1\) score of 0.371 (with NMS) to 0.006
-(without NMS). Dropout was disabled in both cases, making them effectively a
-vanilla SSD run with multiple forward passes.
-Therefore, the low number of open set errors with
-micro averaging (164 without NMS) does not qualify as a good result and is not
-marked bold, although it is the lowest number.
+Bayesian SSD with disabled dropout and without non-maximum suppression
+has the worst performance of all tested variants (vanilla and Bayesian)
+with respect to \(F_1\) score (0.209) and precision (0.161). The precision is not only the worst but also significantly lower compared to all other variants.
+In comparison to all variants with 0.2 confidence threshold, it has the worst recall (0.300) as well.
 
 With 2335 open set errors, the Bayesian SSD variant with disabled dropout and
 enabled non-maximum suppression offers the best performance with respect
@@ -755,8 +755,7 @@ in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
 variants with 0.01 confidence threshold reach much higher open set errors
 and a higher recall. This behaviour is expected as more and worse predictions
-are included. The Bayesian variant without non-maximum suppression was not
-plotted.
+are included.
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 
@@ -783,7 +782,7 @@ reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 \hline
 \end{tabular}
 \caption{Rounded results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
-their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
+their best performing entropy threshold with respect to \(F_1\) score. Vanilla SSD with Entropy test performed best with an
 entropy threshold of 1.7, Bayesian SSD without non-maximum suppression performed best for 1.5,
 and Bayesian SSD with non-maximum suppression performed best for 1.5 as entropy
 threshold. Bayesian SSD with dropout enabled and 0.9 keep ratio performed
@@ -819,14 +818,13 @@ shows no significant impact of an entropy test. Only the open set errors
 are lower but in an insignificant way. The rest of the performance metrics is
 almost identical after rounding.
 
-The results for Bayesian SSD show a massive impact of the existance of
-non-maximum suppression: maximum \(F_1\) score of 0.363 (with NMS) to 0.006
+The results for Bayesian SSD show a significant impact of non-maximum suppression or the lack thereof: maximum \(F_1\) score of 0.363 (with NMS) to 0.226
 (without NMS). Dropout was disabled in both cases, making them effectively a
 vanilla SSD run with multiple forward passes.
 
-With 1057 open set errors, the Bayesian SSD variant with disabled dropout and
-enabled non-maximum suppression offers the best performance with respect
-to open set errors. It also has the best \(F_1\) score (0.363) and best
+With 809 open set errors, the Bayesian SSD variant with disabled dropout and
+without non-maximum suppression offers the best performance with respect
+to open set errors. The variant without dropout and enabled non-maximum suppression has the best \(F_1\) score (0.363) and best
 precision (0.420) of all Bayesian variants, and ties with the 0.9 keep ratio
 variant on recall (0.321).
 
@@ -841,8 +839,7 @@ in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
 can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
 variants with 0.01 confidence threshold reach much higher open set errors
 and a higher recall. This behaviour is expected as more and worse predictions
-are included. The Bayesian variant without non-maximum suppression was not
-plotted.
+are included.
 All plotted variants show a similar behaviour that is in line with previously
 reported figures, such as the ones in Miller et al.~\cite{Miller2018}
 
@@ -881,7 +878,7 @@ performance for vanilla SSD. This indicates that the network has almost no
 uniform or close to uniform predictions, the vast majority of predictions
 has a high confidence in one class - including the background.
 However, the entropy plays a larger role for the Bayesian variants - as
-expected: the best performing thresholds are 1.3 and 1.4 for micro averaging,
+expected: the best performing thresholds are 1.0, 1.3, and 1.4 for micro averaging,
 and 1.5, 1.7, and 2.0 for macro averaging. In all of these cases the best
 threshold is not the largest threshold tested. A lower threshold likely
 eliminated some false positives from the result set. On the other hand a
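The entropy test these thresholds refer to can be sketched as follows. This is an illustrative reading rather than the thesis implementation: the Shannon entropy (in nats) of each detection's predicted class distribution is compared against a fixed threshold, and near-uniform (uncertain) detections are discarded:

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a discrete class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_test(detections, threshold):
    # Keep a detection only if the entropy of its class distribution
    # is below the threshold; close-to-uniform detections are dropped.
    return [d for d in detections if entropy(d["probs"]) < threshold]

# Hypothetical detections with three classes (made-up distributions).
confident = {"probs": [0.9, 0.05, 0.05]}   # low entropy (~0.39), kept
uncertain = {"probs": [0.34, 0.33, 0.33]}  # near uniform (~1.10), dropped
kept = entropy_test([confident, uncertain], threshold=1.0)
```

A low threshold removes uncertain detections (often false positives) but, if set too low, also removes true positives, which matches the trade-off described above.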
@@ -891,19 +888,20 @@ too low threshold likely eliminated true positives as well.
 
 Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
 in their implementation of dropout sampling. Therefore, a variant with disabled
-non-maximum suppression (NMS) was tested. The disastrous results heavily imply
-that NMS is crucial and pose serious questions about the implementation of
-Miller et al., who still have not released source code.
+non-maximum suppression (NMS) was tested. The results are somewhat expected:
+non-maximum suppression removes all non-maximum detections that overlap
+with a maximum one. This reduces the number of multiple detections per
+ground truth bounding box and therefore the false positives. Without it,
+a lot more false positives remain and have a negative impact on precision.
+In combination with top \(k\) selection, recall can be affected:
+duplicate detections could stay and maxima boxes could be removed.
 
-Without NMS all detections passing the per-class confidence threshold are
-directly ordered in descending order by their confidence value. Afterwards the
-top \(k\) detections are kept. This enables the following scenario:
-the first top \(k\) detections all belong to the same class and potentially
-object. Detections of other classes and objects could be discarded, reducing
-recall in the process. Multiple detections of the same object also increase
-the number of false positives, further reducing the \(F_1\) score.
+A clear distinction between micro and macro averaging can be observed:
+recall is hardly affected with micro averaging (0.300) but goes down with
+macro averaging (0.229). % TODO: explain why micro and macro differ in result
+% TODO: give evidence for claim that more false positives are left without NMS
 
-\subsection*{Dropout}
+\subsection*{Dropout Sampling and Observations}
 
 The dropout variants have largely worse performance than the Bayesian variants
 without dropout. This is expected as the network was not trained with
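The mechanism described above can be illustrated with a standard greedy NMS sketch. This is the textbook algorithm, not necessarily the exact thesis implementation; the IOU threshold and box format are assumptions:

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.45):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop all
    # lower-scoring boxes that overlap it by more than the threshold.
    remaining = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(d["box"], best["box"]) <= iou_threshold]
    return kept

dets = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (1, 1, 10, 10), "score": 0.8},    # near-duplicate of the first
    {"box": (20, 20, 30, 30), "score": 0.7},  # separate object
]
print(len(nms(dets)))  # 2: the duplicate is suppressed
```

Without this step, the near-duplicate survives as an extra false positive, and under a fixed top \(k\) budget such duplicates can crowd out detections of other objects.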
@@ -911,39 +909,42 @@ dropout and the weights are not prepared for it.
 
 Gal~\cite{Gal2017}
 showed that networks \textbf{trained} with dropout are approximate Bayesian
-models. Miller et al. never fine-tuned or properly trained SSD after
-the dropout layers were inserted. Therefore, the Bayesian variant of SSD
-implemented in this thesis is not guaranteed to be such an approximate
-model.
+models. The Bayesian variants of SSD implemented in this thesis are not fine-tuned or trained with dropout; therefore, they are not guaranteed to be such approximate models.
 
-These results further question the reported results of Miller et al., who
-reported significantly better results of dropout sampling compared to vanilla
-SSD. Admittedly, they used the network not on COCO but SceneNet RGB-D~\cite{McCormac2017}. However, they also claim that no fine-tuning
-for SceneNet took place. Applying SSD to an unknown data set should result
-in overall worse performance. Attempts to replicate their work on SceneNet RGB-D
-failed with miserable results even for vanilla SSD, further attempts for this
-thesis were not made. But Miller et al. used
-a different implementation of SSD, therefore, it is possible that their
-implementation worked on SceneNet without fine-tuning.
+But dropout alone does not explain the difference in results. Both variants
+with and without dropout have the exact same number of detections coming
+out of the network (8732 per image per forward pass). With 16 images in a batch,
+308 batches, and 10 forward passes, the total number of detections is
+an astounding 430,312,960 detections. As such a large number could not be
+handled in memory, only one batch is calculated at a time. That
+still leaves 1,397,120 detections per batch. These have to be grouped into
+observations, including a quadratic calculation of mutual IOU scores.
+Therefore, these detections are filtered by removing all those with background
+confidence levels of 0.8 or higher.
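The detection counts quoted above follow directly from the batch configuration; a quick check, with all numbers taken from the text:

```python
detections_per_image = 8732  # fixed number of SSD prior boxes per image
images_per_batch = 16
batches = 308
forward_passes = 10

# Detections that must be grouped for one batch across all forward passes.
per_batch = detections_per_image * images_per_batch * forward_passes
# Total detections over the whole run.
total = per_batch * batches
print(per_batch)  # 1397120
print(total)      # 430312960
```

The quadratic cost of computing mutual IOU scores on over a million detections per batch is what motivates the background-confidence filter described above.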
 
-\subsection*{Sampling and Observations}
+The number of detections per class was measured before and after the
+detections were grouped into observations. To this end, the stored predictions
+were unbatched and summed together. After the aforementioned filter
+and before the grouping, roughly 0.4\% (in fact less than that) of the
+more than 430 million detections are remaining. The variant with dropout
+has slightly fewer predictions left compared to the one without dropout.
 
-It is remarkable that the Bayesian variant with disabled dropout and
-non-maximum suppression performed better than vanilla SSD with respect to
-open set errors. This indicates a relevant impact of multiple forward
-passes and the grouping of observations on the result. With disabled
-dropout, the ten forward passes should all produce the same results,
-resulting in ten identical detections for every detection in vanilla SSD.
-The variation in the result can only originate from the grouping into
-observations.
+After the grouping, the variant without dropout has on average between
+10 and 11 detections grouped into an observation. This is expected as every
+forward pass creates the exact same result and these 10 identical detections
+per vanilla SSD detection perfectly overlap. The fact that slightly more than
+10 detections are grouped together could explain the marginally better precision
+of the Bayesian variant without dropout compared to vanilla SSD.
+However, on average only three detections are grouped together into an
+observation if dropout with 0.9 keep ratio is enabled. This does not
+negatively impact recall as true positives do not disappear but offers
+a higher chance of false positives. This can be observed in the results, which
+clearly show no negative impact on recall between the variants without
+dropout and dropout with 0.9 keep ratio.
 
-All detections that overlap by at least 95\% with each other
-are grouped into an observation. For every ten identical detections one
-observation should be the result. However, due to the 95\% overlap rather than
-100\%, more than ten detections could be grouped together. This would result
-in fewer overall observations compared to the number of detections
-in vanilla SSD. Such a lower number reduces the chance for the network
-to make mistakes.
+This behaviour implies that even a slight usage of dropout creates such
+diverging anchor box offsets that the resulting detections from multiple
+forward passes no longer have a mutual IOU score of 0.95 or higher.
 
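The grouping into observations can be sketched as a greedy clustering by mutual IOU. This is illustrative only; the thesis implementation may differ in details such as how a group's representative box is chosen:

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def group_observations(boxes, threshold=0.95):
    # Greedy grouping: a box joins the first group whose seed box it
    # overlaps by at least the threshold, otherwise it starts a new group.
    groups = []
    for box in boxes:
        for group in groups:
            if iou(box, group[0]) >= threshold:
                group.append(box)
                break
        else:
            groups.append([box])
    return groups

# Ten identical detections (dropout disabled) collapse into one observation.
identical = [(0.0, 0.0, 10.0, 10.0)] * 10
print(len(group_observations(identical)))  # 1
```

A box that diverges only slightly, e.g. `(0.0, 0.0, 10.0, 9.0)` against `(0.0, 0.0, 10.0, 10.0)`, already has an IOU of 0.9 and falls below the 0.95 threshold, which illustrates why even a 0.9 keep ratio breaks up the groups.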
 \section*{Outlook}
 