Added graphs and interpretation for averaging

Signed-off-by: Jim Martens <github@2martens.de>
Jim Martens 2019-09-11 14:10:04 +02:00
parent 29c2ebbda4
commit f083224b87
5 changed files with 63 additions and 0 deletions

@ -702,6 +702,20 @@ with many detections.
\label{tab:results-micro}
\end{table}
\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-micro}
\caption{Micro-averaged \(F_1\) score versus absolute open set error (OSE) for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
\label{fig:ose-f1-micro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-micro}
\caption{Micro-averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-micro}
\end{minipage}
\end{figure}
Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
(0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
@ -737,6 +751,16 @@ ratio has worse recall (0.342) than the variant with disabled dropout.
However, all variants with multiple forward passes have lower open set errors
than all vanilla SSD variants.
The relation of \(F_1\) score to absolute open set error is shown
in figure \ref{fig:ose-f1-micro}, and the precision-recall curves for all
variants in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
variants with a confidence threshold of 0.01 reach much higher open set errors
and a higher recall. This behaviour is expected because the lower threshold
admits more predictions, including lower-confidence ones. The Bayesian variant
without non-maximum suppression was not plotted.
All plotted variants behave similarly, in line with previously
reported figures such as the ones in Miller et al.~\cite{Miller2018}.
\subsection{Macro Averaging}
\begin{table}[t]
@ -769,6 +793,20 @@ than all vanilla SSD variants.
\label{tab:results-macro}
\end{table}
\begin{figure}[ht]
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{ose-f1-all-macro}
\caption{Macro-averaged \(F_1\) score versus absolute open set error (OSE) for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
\label{fig:ose-f1-macro}
\end{minipage}%
\hfill
\begin{minipage}[t]{0.48\textwidth}
\includegraphics[width=\textwidth]{precision-recall-all-macro}
\caption{Macro-averaged precision-recall curves for each variant tested.}
\label{fig:precision-recall-macro}
\end{minipage}
\end{figure}
Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
(0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
@ -799,6 +837,16 @@ recall values. However, all variants with multiple forward passes and
non-maximum suppression have lower open set errors than all vanilla SSD
variants.
The relation of \(F_1\) score to absolute open set error is shown
in figure \ref{fig:ose-f1-macro}, and the precision-recall curves for all
variants in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
variants with a confidence threshold of 0.01 reach much higher open set errors
and a higher recall. This behaviour is expected because the lower threshold
admits more predictions, including lower-confidence ones. The Bayesian variant
without non-maximum suppression was not plotted.
All plotted variants behave similarly, in line with previously
reported figures such as the ones in Miller et al.~\cite{Miller2018}.
\chapter{Discussion and Outlook}
\label{chap:discussion}
@ -812,6 +860,21 @@ The results clearly do not support the hypothesis: \textit{Dropout sampling deli
is no area where dropout sampling performs better than vanilla SSD. In the
remainder of the section the individual results will be interpreted.
\subsection*{Impact of averaging}
Micro and macro averaging produce largely similar results. Notably, with micro
averaging the performance increases markedly towards the end
of the list of predictions. This shows as a nearly horizontal segment
of the curve in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
A possible explanation is that true positive detections of one class improve
recall significantly when counted against all detections across the classes but
are insignificant when compared only to other detections of their own class.
Furthermore, the plotted behaviour suggests that Miller et al.~\cite{Miller2018}
use macro averaging, as their paper does not report this behaviour, which is
unique to micro averaging.
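To make the difference between the two averaging modes concrete, the following
sketch states the usual textbook definitions of recall. It assumes \(C\) classes
with per-class counts of true positives \(TP_c\) and false negatives \(FN_c\);
these symbols are introduced here for illustration only, and the evaluation code
may differ in detail.
\[
R_{\mathrm{micro}} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} \left(TP_c + FN_c\right)},
\qquad
R_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}
\]
As a hypothetical example, consider two classes: the first has 90 ground truth
objects of which 81 are detected, the second has 10 of which 1 is detected. Then
\(R_{\mathrm{micro}} = 82/100 = 0.82\), whereas
\(R_{\mathrm{macro}} = (0.9 + 0.1)/2 = 0.5\). Under these definitions a class with
many objects dominates the micro average, while every class carries the same
weight in the macro average.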
\subsection*{Impact of Entropy}
There is no visible impact of entropy thresholding on the object detection

BIN
images/ose-f1-all-macro.png Normal file

Binary file not shown. Size: 60 KiB

BIN
images/ose-f1-all-micro.png Normal file

Binary file not shown. Size: 58 KiB

Binary file not shown. Size: 57 KiB

Binary file not shown. Size: 54 KiB