From 4a608bcef640368763fb108431d1a5a4163bb78b Mon Sep 17 00:00:00 2001
From: Jim Martens
Date: Tue, 24 Sep 2019 15:02:43 +0200
Subject: [PATCH] Added explanation for averaging

Signed-off-by: Jim Martens
---
 body.tex | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/body.tex b/body.tex
index bad9280..f471eec 100644
--- a/body.tex
+++ b/body.tex
@@ -920,9 +920,34 @@ averaging has a significant performance increase towards the end of the
 list of predictions. This is signaled by the near horizontal movement of the
 plot in both the \(F_1\) versus absolute open set error graph (see figure
 \ref{fig:ose-f1-micro}) and the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
-There are potentially true positive detections of one class that significantly
-improve recall when compared to all detections across the classes but are
-insignificant when solely compared to other detections of their own class.
+
+This behaviour is caused by a large imbalance of detections between the
+classes. For vanilla SSD with a confidence threshold of 0.2, there are
+36,863 detections after non-maximum suppression and top-\(k\) selection.
+The persons class contributes 14,640 detections, around 40\%, to that
+number. Another strong class is cars with 2,252 detections, around 6\%.
+Together, these two classes account for almost as many detections as the
+remaining 58 classes combined.
+
+In macro averaging, the cumulative precision and recall values are
+calculated per class and then averaged across all classes. Smaller
+classes quickly reach high recall values because their total number of
+ground truth objects is small as well. The last precision and recall
+values of the smaller classes are repeated to achieve homogeneity with
+the largest class. As a consequence, the average recall is quite high
+early on. Later, only the values of the largest class still change,
+which has only a small impact on the overall result.
+
+Conversely, in micro averaging the cumulative true positives are added up across
+classes and then divided by the total number of ground truth objects. Here, the
+effect is the opposite: the total number of ground truth objects is very large,
+so the combined true positives of the 58 smaller classes have only a small
+impact on the average recall. As a result, the open set error rises more
+quickly than the \(F_1\) score, creating the sharp rise of the open set error
+at a lower \(F_1\) score than in macro averaging. The open set error reaches
+a high value early on and changes little afterwards. This allows the \(F_1\)
+score to catch up, producing the almost horizontal line in the graph. Eventually,
+the \(F_1\) score decreases again while the open set error rises a bit further.
 
 Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
 use macro averaging in their paper as the unique behaviour of micro
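
As a rough illustration of the macro averaging described in the added
paragraphs, the following Python sketch computes cumulative precision and
recall per class and pads the shorter curves by repeating their last value.
The function names and the assumption that detections arrive as
confidence-sorted true/false positive flags are illustrative only and are
not taken from the thesis code.

    def cumulative_pr(tp_flags, num_ground_truth):
        # Running precision/recall after each detection of a single class.
        precisions, recalls, tp = [], [], 0
        for i, is_tp in enumerate(tp_flags, start=1):
            tp += is_tp
            precisions.append(tp / i)
            recalls.append(tp / num_ground_truth)
        return precisions, recalls

    def macro_average(per_class):
        # per_class maps class name -> (tp_flags, num_ground_truth).
        curves = [cumulative_pr(flags, gt) for flags, gt in per_class.values()]
        length = max(len(p) for p, _ in curves)
        avg_precision, avg_recall = [], []
        for k in range(length):
            # Smaller classes run out of detections first; repeating their
            # last value ("padding") is what lets the average recall climb
            # early, as described above.
            ps = [p[min(k, len(p) - 1)] for p, _ in curves]
            rs = [r[min(k, len(r) - 1)] for _, r in curves]
            avg_precision.append(sum(ps) / len(ps))
            avg_recall.append(sum(rs) / len(rs))
        return avg_precision, avg_recall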
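
For comparison, a matching sketch of micro averaging under the same
illustrative assumptions: the detections of all classes are merged into one
confidence-sorted list and the cumulative true positives are divided by the
combined ground truth count. With one class holding roughly 40\% of all
detections, that class dominates the merged curve, which is the imbalance
effect discussed in the patch.

    def micro_average(merged_tp_flags, total_ground_truth):
        # merged_tp_flags: TP/FP flags of all classes in a single list,
        # sorted by descending confidence across classes.
        precisions, recalls, tp = [], [], 0
        for i, is_tp in enumerate(merged_tp_flags, start=1):
            tp += is_tp
            precisions.append(tp / i)
            recalls.append(tp / total_ground_truth)
        return precisions, recalls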