Added explanation for averaging
Signed-off-by: Jim Martens <github@2martens.de>
body.tex
@@ -920,9 +920,34 @@ averaging has a significant performance increase towards the end
of the list of predictions. This is signaled by the near horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
-There are potentially true positive detections of one class that significantly
-improve recall when compared to all detections across the classes but are
-insignificant when solely compared to other detections of their own class.
+This behaviour is caused by a large imbalance of detections between
+the classes. For vanilla SSD with a 0.2 confidence threshold, there are
+a total of 36,863 detections after non-maximum suppression and top \(k\).
+The persons class contributes 14,640 detections, or around 40\%, to that
+number. Another strong class is cars with 2,252 detections, or around 6\%.
+This means that two classes together have almost as many detections
+as the remaining 58 classes combined.
+
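+% Editorial sketch: a quick check of the shares quoted above, using only
+% those numbers; the rounding is mine.
+\[
+  14{,}640 + 2{,}252 = 16{,}892 \approx 0.46 \cdot 36{,}863
+\]
+% so persons and cars alone account for roughly 46\% of all detections,
+% against 54\% for the remaining 58 classes.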
+In macro averaging, the cumulative precision and recall values are
+calculated per class and then averaged across all classes. Smaller
+classes quickly reach high recall values because their total number of
+ground truth boxes is small as well. The last recall and precision value
+of each smaller class is repeated to achieve homogeneity with the largest
+class. As a consequence, the average recall is quite high early on.
+Later on, only the values of the largest class still change, which has
+only a small impact on the overall result.
+
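+% Editorial sketch: the standard macro-averaged definitions, assuming
+% C classes with per-class true positives TP_c, false positives FP_c,
+% and ground truth count G_c; this notation is mine, not the thesis's.
+\[
+  P_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c},
+  \qquad
+  R_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{G_c}
+\]
+% With C = 60 here, a small class that saturates its recall early keeps
+% contributing its full, repeated value to the average, which is exactly
+% the effect described above.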
+Conversely, in micro averaging the cumulative true positives
+are added up across classes and then divided by the total number of
+ground truth boxes. Here, the effect is the opposite: the total number
+of ground truth boxes is very large, which means the combined true
+positives of 58 classes have a comparatively small impact on the
+average recall. As a result, the open set error rises more quickly
+than the \(F_1\) score in micro averaging, creating the sharp rise of
+open set error at a lower \(F_1\) score than in macro averaging. The
+open set error reaches a high value early on and changes little
+afterwards. This allows the \(F_1\) score to catch up and produces the
+almost horizontal line in the graph. Eventually, the \(F_1\) score
+decreases again while the open set error rises a bit further.
+
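+% Editorial sketch: the micro-averaged counterparts under the same
+% assumed notation, pooling the counts over all classes before dividing.
+\[
+  P_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)},
+  \qquad
+  R_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} G_c}
+\]
+% Here the persons class dominates both sums, so the remaining classes
+% move the averages only slightly, driving the behaviour described above.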
Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
use macro averaging in their paper as the unique behaviour of micro