Added explanation for averaging

Signed-off-by: Jim Martens <github@2martens.de>
Jim Martens 2019-09-24 15:02:43 +02:00
parent 79acbf78bf
commit 4a608bcef6
1 changed file with 28 additions and 3 deletions


@@ -920,9 +920,34 @@ averaging has a significant performance increase towards the end
of the list of predictions. This is signalled by the near-horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
There may be true positive detections of one class that significantly
improve recall when compared against all detections across the classes, yet are
insignificant when compared only with the other detections of their own class.
This behaviour is caused by a large imbalance of detections between
the classes. For vanilla SSD with a confidence threshold of 0.2, there are
a total of 36,863 detections after non-maximum suppression and top \(k\).
The persons class alone contributes 14,640 detections, or around 40\%, to that
number. Another strong class is cars with 2,252 detections, or around
6\%. Together, these two classes account for almost as many detections
as the remaining 58 classes combined.
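To make the imbalance explicit, a quick check of these numbers:
\[
14{,}640 + 2{,}252 = 16{,}892
\quad\text{versus}\quad
36{,}863 - 16{,}892 = 19{,}971,
\]
so the two largest classes contribute \(16{,}892 / 36{,}863 \approx 46\,\%\) of all detections.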
In macro averaging, the cumulative precision and recall values are
calculated per class and then averaged across all classes. Smaller
classes quickly reach high recall values because their total number of
ground truth boxes is small as well. The last recall and precision values
of the smaller classes are repeated to achieve homogeneity with the largest
class. As a consequence, the average recall is quite high early on.
Later on, only the values of the largest class still change, which has only
a small impact on the overall result.
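Written out (a standard formulation; the notation is introduced here only
for illustration: \(C\) classes, \(TP_c(k)\) and \(FP_c(k)\) the cumulative
true and false positives of class \(c\) among its first \(k\) detections,
and \(GT_c\) its number of ground truth boxes), the macro-averaged values are
\begin{align*}
\mathit{recall}_{\mathrm{macro}}(k) &= \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c(k)}{GT_c}\,, &
\mathit{precision}_{\mathrm{macro}}(k) &= \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c(k)}{TP_c(k) + FP_c(k)}\,,
\end{align*}
where a class whose detection list is already exhausted keeps its last
value, which is exactly the repetition described above.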
Conversely, in micro averaging the cumulative true positives
are added up across classes and then divided by the total number of
ground truth boxes. Here, the effect is the opposite: the total number of
ground truth boxes is very large, which means the combined true positives
of the 58 smaller classes have only a small impact on the average recall.
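In the same illustrative notation, micro averaging pools the counts before
dividing:
\begin{align*}
\mathit{recall}_{\mathrm{micro}}(k) &= \frac{\sum_{c=1}^{C} TP_c(k)}{\sum_{c=1}^{C} GT_c}\,, &
\mathit{precision}_{\mathrm{micro}}(k) &= \frac{\sum_{c=1}^{C} TP_c(k)}{\sum_{c=1}^{C} \left(TP_c(k) + FP_c(k)\right)}\,.
\end{align*}
A small class can therefore shift \(\mathit{recall}_{\mathrm{macro}}\) by up
to \(1/C\) on its own but barely moves \(\mathit{recall}_{\mathrm{micro}}\),
since the denominator \(\sum_{c} GT_c\) is dominated by the largest classes.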
As a result, the open set error rises more quickly than the \(F_1\) score
in micro averaging, creating the sharp rise of open set error at a lower
\(F_1\) score than in macro averaging. The open set error
reaches a high value early on and changes little afterwards. This allows
the \(F_1\) score to catch up and produces the almost horizontal line
in the graph. Eventually, the \(F_1\) score decreases again while the
open set error rises a bit further.
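For reference, the \(F_1\) score combines precision and recall as their
harmonic mean,
\[
F_1 = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}\,,
\]
so a late gain in recall at little cost in precision raises \(F_1\) while
the open set error stays flat, which matches the horizontal stretch of
the curve.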
Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
use macro averaging in their paper, as the unique behaviour of micro