Added graphs and interpretation for averaging

Signed-off-by: Jim Martens <github@2martens.de>
2019-09-11 14:10:04 +02:00
parent 29c2ebbda4
commit f083224b87
5 changed files with 63 additions and 0 deletions
--- a/body.tex
+++ b/body.tex
@ -702,6 +702,20 @@ with many detections.
    \label{tab:results-micro}
 \end{table}

+\begin{figure}[ht]
+    \begin{minipage}[t]{0.48\textwidth}
+        \includegraphics[width=\textwidth]{ose-f1-all-micro}
+        \caption{Micro-averaged \(F_1\) score versus absolute open set error (OSE) for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
+        \label{fig:ose-f1-micro}
+    \end{minipage}%
+    \hfill
+    \begin{minipage}[t]{0.48\textwidth}
+        \includegraphics[width=\textwidth]{precision-recall-all-micro}
+        \caption{Micro-averaged precision-recall curves for each variant tested.}
+        \label{fig:precision-recall-micro}
+    \end{minipage}
+\end{figure}
+
 Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
 table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
 (0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
@ -737,6 +751,16 @@ ratio has worse recall (0.342) than the variant with disabled dropout.
 However, all variants with multiple forward passes have lower open set errors
 than all vanilla SSD variants.

+The relation of the \(F_1\) score to the absolute open set error is shown
+in figure \ref{fig:ose-f1-micro}, and the precision-recall curves for all
+variants in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
+variants with a confidence threshold of 0.01 reach much higher open set
+errors and a higher recall. This behaviour is expected because more, and
+lower-confidence, predictions are included. The Bayesian variant without
+non-maximum suppression was not plotted.
+All plotted variants behave similarly, in line with previously
+reported figures such as the ones in Miller et al.~\cite{Miller2018}.
+
 \subsection{Macro Averaging}

 \begin{table}[t]
@ -769,6 +793,20 @@ than all vanilla SSD variants.
    \label{tab:results-macro}
 \end{table}

+\begin{figure}[ht]
+    \begin{minipage}[t]{0.48\textwidth}
+        \includegraphics[width=\textwidth]{ose-f1-all-macro}
+        \caption{Macro-averaged \(F_1\) score versus absolute open set error (OSE) for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
+        \label{fig:ose-f1-macro}
+    \end{minipage}%
+    \hfill
+    \begin{minipage}[t]{0.48\textwidth}
+        \includegraphics[width=\textwidth]{precision-recall-all-macro}
+        \caption{Macro-averaged precision-recall curves for each variant tested.}
+        \label{fig:precision-recall-macro}
+    \end{minipage}
+\end{figure}
+
 Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
 table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
 (0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
@ -799,6 +837,16 @@ recall values. However, all variants with multiple forward passes and
 non-maximum suppression have lower open set errors than all vanilla SSD
 variants.

+The relation of the \(F_1\) score to the absolute open set error is shown
+in figure \ref{fig:ose-f1-macro}, and the precision-recall curves for all
+variants in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
+variants with a confidence threshold of 0.01 reach much higher open set
+errors and a higher recall. This behaviour is expected because more, and
+lower-confidence, predictions are included. The Bayesian variant without
+non-maximum suppression was not plotted.
+All plotted variants behave similarly, in line with previously
+reported figures such as the ones in Miller et al.~\cite{Miller2018}.
+
 \chapter{Discussion and Outlook}

 \label{chap:discussion}
@ -812,6 +860,21 @@ The results clearly do not support the hypothesis: \textit{Dropout sampling deli
 is no area where dropout sampling performs better than vanilla SSD. In the
 remainder of the section the individual results will be interpreted.

+\subsection*{Impact of averaging}
+
+Micro and macro averaging produce largely similar results. Notably, micro
+averaging shows a marked performance increase towards the end
+of the list of predictions. This is visible as the near-horizontal movement
+of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
+the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
+A possible explanation is that true positive detections of one class noticeably
+improve recall when counted against all detections across the classes but are
+insignificant when compared only to other detections of their own class.
+
+Furthermore, the plotted behaviour suggests that Miller et al.~\cite{Miller2018}
+use macro averaging, as the behaviour unique to micro
+averaging is not reported in their paper.
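+
+The difference between the two averaging schemes can be sketched for
+recall, assuming the standard definitions; the symbols \(C\) (number of
+classes), \(TP_c\) and \(FN_c\) (true positives and false negatives of
+class \(c\)) are introduced here only for illustration:
+\[
+    R_{\mathrm{micro}} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} \left(TP_c + FN_c\right)},
+    \qquad
+    R_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}.
+\]
+Under these definitions, a class with many ground truth instances dominates
+the micro average, whereas every class contributes equally to the macro
+average.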
+
 \subsection*{Impact of Entropy}

 There is no visible impact of entropy thresholding on the object detection
--- a/images/ose-f1-all-macro.png
+++ b/images/ose-f1-all-macro.png
--- a/images/ose-f1-all-micro.png
+++ b/images/ose-f1-all-micro.png
--- a/images/precision-recall-all-macro.png
+++ b/images/precision-recall-all-macro.png
--- a/images/precision-recall-all-micro.png
+++ b/images/precision-recall-all-micro.png