diff --git a/body.tex b/body.tex
index c8c6b76..4853c12 100644
--- a/body.tex
+++ b/body.tex
@@ -702,6 +702,20 @@ with many detections.
   \label{tab:results-micro}
 \end{table}
 
+\begin{figure}[ht]
+  \begin{minipage}[t]{0.48\textwidth}
+    \includegraphics[width=\textwidth]{ose-f1-all-micro}
+    \caption{Micro-averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
+    \label{fig:ose-f1-micro}
+  \end{minipage}%
+  \hfill
+  \begin{minipage}[t]{0.48\textwidth}
+    \includegraphics[width=\textwidth]{precision-recall-all-micro}
+    \caption{Micro-averaged precision-recall curves for each variant tested.}
+    \label{fig:precision-recall-micro}
+  \end{minipage}
+\end{figure}
+
 Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
 table \ref{tab:results-micro}) with respect to the maximum \(F_1\) score
 (0.376) and recall at the maximum \(F_1\) point (0.382). In comparison, neither
@@ -737,6 +751,16 @@ ratio has worse recall (0.342) than the variant with disabled dropout.
 However, all variants with multiple forward passes have lower open set errors
 than all vanilla SSD variants.
 
+The relation of \(F_1\) score to absolute open set error can be observed
+in figure \ref{fig:ose-f1-micro}. Precision-recall curves for all variants
+can be seen in figure \ref{fig:precision-recall-micro}. Both vanilla SSD
+variants with a 0.01 confidence threshold reach much higher open set errors
+and a higher recall. This behaviour is expected as more and worse predictions
+are included. The Bayesian variant without non-maximum suppression was not
+plotted.
+All plotted variants show similar behaviour that is in line with previously
+reported figures, such as the ones in Miller et al.~\cite{Miller2018}.
+
 \subsection{Macro Averaging}
 
 \begin{table}[t]
@@ -769,6 +793,20 @@ than all vanilla SSD variants.
   \label{tab:results-macro}
 \end{table}
 
+\begin{figure}[ht]
+  \begin{minipage}[t]{0.48\textwidth}
+    \includegraphics[width=\textwidth]{ose-f1-all-macro}
+    \caption{Macro-averaged \(F_1\) score versus open set error for each variant. Perfect performance is an \(F_1\) score of 1 and an absolute OSE of 0.}
+    \label{fig:ose-f1-macro}
+  \end{minipage}%
+  \hfill
+  \begin{minipage}[t]{0.48\textwidth}
+    \includegraphics[width=\textwidth]{precision-recall-all-macro}
+    \caption{Macro-averaged precision-recall curves for each variant tested.}
+    \label{fig:precision-recall-macro}
+  \end{minipage}
+\end{figure}
+
 Vanilla SSD with a per-class confidence threshold of 0.2 performs best (see
 table \ref{tab:results-macro}) with respect to the maximum \(F_1\) score
 (0.375) and recall at the maximum \(F_1\) point (0.338). In comparison, the SSD
@@ -799,6 +837,16 @@ recall values. However, all variants with multiple forward passes and
 non-maximum suppression have lower open set errors than all vanilla SSD
 variants.
 
+The relation of \(F_1\) score to absolute open set error can be observed
+in figure \ref{fig:ose-f1-macro}. Precision-recall curves for all variants
+can be seen in figure \ref{fig:precision-recall-macro}. Both vanilla SSD
+variants with a 0.01 confidence threshold reach much higher open set errors
+and a higher recall. This behaviour is expected as more and worse predictions
+are included. The Bayesian variant without non-maximum suppression was not
+plotted.
+All plotted variants show similar behaviour that is in line with previously
+reported figures, such as the ones in Miller et al.~\cite{Miller2018}.
+
 \chapter{Discussion and Outlook}
 \label{chap:discussion}
 
@@ -812,6 +860,21 @@ The results clearly do not support the hypothesis: \textit{Dropout sampling deli
 is no area where dropout sampling performs better than vanilla SSD. In the
 remainder of the section the individual results will be interpreted.
 
+\subsection*{Impact of Averaging}
+
+Micro and macro averaging produce largely similar results. Notably, micro
+averaging shows a significant performance increase towards the end
+of the list of predictions. This is signalled by the near-horizontal movement
+of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
+the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
+There are potentially true positive detections of one class that significantly
+improve recall when compared to all detections across the classes but are
+insignificant when solely compared to other detections of their own class.
+
+Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
+use macro averaging in their paper, as the distinctive behaviour of micro
+averaging is not reported there.
+
 \subsection*{Impact of Entropy}
 
 There is no visible impact of entropy thresholding on the object detection
diff --git a/images/ose-f1-all-macro.png b/images/ose-f1-all-macro.png
new file mode 100644
index 0000000..7a2ff68
Binary files /dev/null and b/images/ose-f1-all-macro.png differ
diff --git a/images/ose-f1-all-micro.png b/images/ose-f1-all-micro.png
new file mode 100644
index 0000000..e529f4a
Binary files /dev/null and b/images/ose-f1-all-micro.png differ
diff --git a/images/precision-recall-all-macro.png b/images/precision-recall-all-macro.png
new file mode 100644
index 0000000..1814d38
Binary files /dev/null and b/images/precision-recall-all-macro.png differ
diff --git a/images/precision-recall-all-micro.png b/images/precision-recall-all-micro.png
new file mode 100644
index 0000000..8a96d23
Binary files /dev/null and b/images/precision-recall-all-micro.png differ