Added results tables and text about vanilla SSD results

Signed-off-by: Jim Martens <github@2martens.de>
This commit is contained in:
Jim Martens 2019-09-05 16:11:43 +02:00
parent 1e7ef8d8e6
commit 8a9dacb5e1
1 changed file with 75 additions and 1 deletion


@@ -550,6 +550,9 @@ increases linearly with more forward passes.
These detections have to be decoded first. Afterwards,
all detections that do not pass a confidence
threshold for the class with the highest prediction probability are discarded.
Additionally, all detections whose winning prediction is the background class are discarded.
However, to test the impact of this design decision, an alternative has been implemented as well:
only detections with a background prediction of 0.8 or higher are discarded.
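The two discarding strategies described above can be sketched as follows (a minimal NumPy sketch, not the actual implementation; the array layout, the background class sitting at index 0, and the 0.2 confidence threshold are illustrative assumptions):

```python
import numpy as np

def filter_detections(scores, conf_threshold=0.2, bg_variant=False):
    """Filter decoded detections by confidence and background score.

    scores: (N, C) softmax scores per detection; column 0 is assumed
    to hold the background class (an illustrative convention).
    Returns a boolean mask over the N detections.
    """
    # confidence test on the class with the highest prediction probability
    keep = scores.max(axis=1) >= conf_threshold
    if bg_variant:
        # alternative: discard only near-certain background detections
        keep &= scores[:, 0] < 0.8
    else:
        # default: discard every detection won by the background class
        keep &= scores.argmax(axis=1) != 0
    return keep

scores = np.array([[0.9, 0.05, 0.05],   # background wins with 0.9
                   [0.1, 0.7, 0.2],     # class 1 wins
                   [0.5, 0.3, 0.2]])    # background wins, but below 0.8
print(filter_detections(scores).tolist())                   # [False, True, False]
print(filter_detections(scores, bg_variant=True).tolist())  # [False, True, True]
```

The third detection illustrates the difference: the default variant drops it because background wins, while the alternative keeps it because the background score stays below 0.8.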
The remaining detections are partitioned into observations to
further reduce the size of the output, and
to identify uncertainty. This is accomplished by calculating the
@@ -643,7 +646,7 @@ to vanilla SSD with 0.01 confidence threshold; this comparison
investigates the effect of the per class confidence threshold
on the object detection performance.
Bayesian SSD was run with 0.2 confidence threshold and compared
to vanilla SSD with 0.2 confidence threshold. Coupled with the
entropy threshold, this comparison shows how uncertain the network
is. If it is very certain the dropout sampling should have no
@@ -655,6 +658,77 @@ from 0.1 to 2.5 as specified in Miller et al.~\cite{Miller2018}.
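The entropy used for this threshold test is the Shannon entropy of a detection's softmax output; a minimal sketch (the natural logarithm and the epsilon guard against \(\log 0\) are assumptions):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a softmax distribution p."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

# a confident prediction has low entropy ...
print(round(entropy([0.97, 0.01, 0.01, 0.01]), 3))  # 0.168
# ... while a uniform prediction reaches the maximum log(4) = 1.386
print(round(entropy([0.25, 0.25, 0.25, 0.25]), 3))  # 1.386
```

Detections whose entropy exceeds the chosen threshold would then count as too uncertain; sweeping the threshold from 0.1 to 2.5 covers the range up to roughly the uniform-distribution maximum.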
\section{Results}
Results in this section are presented for both micro and macro averaging.
In macro averaging, for example, the precision values of each class are added up
and then divided by the number of classes. Conversely, for micro averaging the
precision is calculated across all classes directly. Each method has
a specific impact: macro averaging weighs every class the same, while micro
averaging weighs every detection the same. The two are largely identical
when the classes are balanced and have about the same number of detections.
In the case of a class imbalance, however, macro averaging
favours classes with few detections, whereas micro averaging benefits classes
with many detections.
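The difference between the two averaging schemes can be illustrated with hypothetical per-class detection counts (the numbers below are made up for illustration):

```python
import numpy as np

# hypothetical true/false positive counts: one frequent class, one rare class
tp = np.array([90, 1])
fp = np.array([10, 1])

# micro averaging: pool all detections, then compute precision once
micro = tp.sum() / (tp.sum() + fp.sum())
# macro averaging: compute precision per class, then average over classes
macro = np.mean(tp / (tp + fp))

print(round(micro, 3))  # 0.892 -- dominated by the frequent class
print(round(macro, 3))  # 0.7   -- the rare class weighs as much as the frequent one
```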
In both cases, vanilla SSD with a per-class confidence threshold of 0.2
performs best (see tables \ref{tab:results-micro} and \ref{tab:results-macro}),
with a maximum \(F_1\) score of 0.376/0.375 (micro/macro throughout), compared to both vanilla SSD with a per-class
threshold of 0.01 (0.255/0.370) and vanilla SSD with entropy thresholding (0.255/0.370).
It has the fewest open set errors (2939/1218 versus 3176/1426 and 3168/1373) and the best recall
(0.382/0.338 versus 0.214/0.328 and 0.214/0.329). For micro averaging it also has the best precision (0.372 versus 0.318 and 0.318); for macro averaging, vanilla SSD with entropy test has the best precision (0.425 versus 0.424 and 0.424).
This shows that a higher per-class confidence threshold removes many bad detections and therefore
clearly improves the end result. These comparisons also show that the network is
not very uncertain: the best performing entropy threshold is no better than
the corresponding vanilla SSD without an entropy threshold. Therefore, in this
case the per-class confidence threshold is far more important for the result.
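As a sanity check, the \(F_1\) score is the harmonic mean of precision and recall; plugging the micro-averaged precision and recall of vanilla SSD with 0.2 confidence threshold back in reproduces the tabulated maximum \(F_1\) score up to rounding of the inputs:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# micro-averaged precision/recall of vanilla SSD with 0.2 confidence
# threshold, taken from the results table
print(round(f1(0.372, 0.382), 3))  # 0.377 -- matches the tabulated 0.376 up to rounding
```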
\begin{table}
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
vanilla SSD - 0.01 conf & 0.255 & 3176 & 0.214 & 0.318 \\
vanilla SSD - 0.2 conf & \textbf{0.376} & \textbf{2939} & \textbf{0.382} & \textbf{0.372} \\
SSD with Entropy test - 0.01 conf & 0.255 & 3168 & 0.214 & 0.318 \\
% entropy thresh: 2.4 for vanilla SSD is best
\hline
Bayesian SSD - no bg - 0.2 conf \; 10 & 0.003 & 2145 & 0.005 & 0.002 \\
no bg \(>\) 0.8 conf - 0.2 conf \; 10 & 0.003 & 151 & 0.004 & 0.003 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
\hline
\end{tabular}
\caption{Results for micro averaging. SSD with Entropy test and Bayesian SSD are represented with
their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
entropy threshold of 2.4, Bayesian SSD with no background performed best for 1.2,
and Bayesian SSD with no background prediction higher than 0.8 performed best for 0.4 as entropy
threshold.}
\label{tab:results-micro}
\end{table}
\begin{table}
\begin{tabular}{rcccc}
\hline
Forward & max & abs OSE & Recall & Precision\\
Passes & \(F_1\) Score & \multicolumn{3}{c}{at max \(F_1\) point} \\
\hline
vanilla SSD - 0.01 conf & 0.370 & 1426 & 0.328 & 0.424 \\
vanilla SSD - 0.2 conf & \textbf{0.375} & \textbf{1218} & \textbf{0.338} & 0.424 \\
SSD with Entropy test - 0.01 conf & 0.370 & 1373 & 0.329 & \textbf{0.425} \\
% entropy thresh: 1.7 for vanilla SSD is best
\hline
Bayesian SSD - no bg - 0.2 conf \; 10 & 0.002 & 1784 & 0.005 & 0.002 \\
no bg \(>\) 0.8 conf - 0.2 conf \; 10 & 0.002 & 122 & 0.003 & 0.002 \\
% entropy thresh: 1.2 for Bayesian - 2 is best, 0.4 for 3
\hline
\end{tabular}
\caption{Results for macro averaging. SSD with Entropy test and Bayesian SSD are represented with
their best performing entropy threshold. Vanilla SSD with Entropy test performed best with an
entropy threshold of 1.7, Bayesian SSD with no background performed best for 1.2,
and Bayesian SSD with no background prediction higher than 0.8 performed best for 0.4 as entropy
threshold.}
\label{tab:results-macro}
\end{table}
\chapter{Discussion}
\label{chap:discussion}