Added explanation for averaging

Signed-off-by: Jim Martens <github@2martens.de>
Jim Martens 2019-09-24 15:02:43 +02:00
parent 79acbf78bf
commit 4a608bcef6
1 changed file with 28 additions and 3 deletions


@@ -920,9 +920,34 @@ averaging has a significant performance increase towards the end
of the list of predictions. This is signalled by the near-horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
There may be true positive detections of one class that significantly
improve recall when compared against all detections across the classes, yet are
insignificant when compared only with the other detections of their own class.
This behaviour is caused by a large imbalance of detections between
the classes. For vanilla SSD with a confidence threshold of 0.2, there are
a total of 36,863 detections after non-maximum suppression and top \(k\).
The persons class alone contributes 14,640 detections, or around 40\%, to that
number. Another strong class is cars with 2,252 detections, or around
6\%. Together, these two classes account for almost as many detections
as the remaining 58 classes combined.
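To make the imbalance explicit, a quick check of these numbers:
\[
14{,}640 + 2{,}252 = 16{,}892
\quad\text{versus}\quad
36{,}863 - 16{,}892 = 19{,}971,
\]
so the two largest classes contribute \(16{,}892 / 36{,}863 \approx 46\,\%\) of all detections.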
In macro averaging, the cumulative precision and recall values are
calculated per class and then averaged across all classes. Smaller
classes quickly reach high recall values because their total number of
ground truth boxes is small as well. The last recall and precision values
of the smaller classes are repeated to achieve homogeneity with the largest
class. As a consequence, the average recall is quite high early on.
Later on, only the values of the largest class still change, which has only
a small impact on the overall result.
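Written out (a standard formulation; the notation is introduced here only
for illustration: \(C\) classes, \(TP_c(k)\) and \(FP_c(k)\) the cumulative
true and false positives of class \(c\) among its first \(k\) detections,
and \(GT_c\) its number of ground truth boxes), the macro-averaged values are
\begin{align*}
\mathit{recall}_{\mathrm{macro}}(k) &= \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c(k)}{GT_c}\,, &
\mathit{precision}_{\mathrm{macro}}(k) &= \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c(k)}{TP_c(k) + FP_c(k)}\,,
\end{align*}
where a class whose detection list is already exhausted keeps its last
value, which is exactly the repetition described above.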
Conversely, in micro averaging the cumulative true positives
are added up across classes and then divided by the total number of
ground truth boxes. Here, the effect is the opposite: the total number of
ground truth boxes is very large, which means the combined true positives
of the 58 smaller classes have only a small impact on the average recall.
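In the same illustrative notation, micro averaging pools the counts before
dividing:
\begin{align*}
\mathit{recall}_{\mathrm{micro}}(k) &= \frac{\sum_{c=1}^{C} TP_c(k)}{\sum_{c=1}^{C} GT_c}\,, &
\mathit{precision}_{\mathrm{micro}}(k) &= \frac{\sum_{c=1}^{C} TP_c(k)}{\sum_{c=1}^{C} \left(TP_c(k) + FP_c(k)\right)}\,.
\end{align*}
A small class can therefore shift \(\mathit{recall}_{\mathrm{macro}}\) by up
to \(1/C\) on its own but barely moves \(\mathit{recall}_{\mathrm{micro}}\),
since the denominator \(\sum_{c} GT_c\) is dominated by the largest classes.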
As a result, the open set error rises more quickly than the \(F_1\) score
in micro averaging, creating the sharp rise of open set error at a lower
\(F_1\) score than in macro averaging. The open set error
reaches a high value early on and changes little afterwards. This allows
the \(F_1\) score to catch up and produces the almost horizontal line
in the graph. Eventually, the \(F_1\) score decreases again while the
open set error rises a bit further.
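For reference, the \(F_1\) score combines precision and recall as their
harmonic mean,
\[
F_1 = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}\,,
\]
so a late gain in recall at little cost in precision raises \(F_1\) while
the open set error stays flat, which matches the horizontal stretch of
the curve.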
Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
use macro averaging in their paper, as the unique behaviour of micro