Added explanation for averaging
Signed-off-by: Jim Martens <github@2martens.de>
body.tex
@@ -920,9 +920,34 @@ averaging has a significant performance increase towards the end
of the list of predictions. This is signaled by the near horizontal movement
of the plot in both the \(F_1\) versus absolute open set error graph (see figure \ref{fig:ose-f1-micro}) and
the precision-recall curve (see figure \ref{fig:precision-recall-micro}).
-There are potentially true positive detections of one class that significantly
-improve recall when compared to all detections across the classes but are
-insignificant when solely compared to other detections of their own class.
+This behaviour is caused by a large imbalance of detections between
+the classes. For vanilla SSD with a 0.2 confidence threshold, there are
+a total of 36,863 detections after non-maximum suppression and top \(k\).
+The persons class contributes 14,640 detections, or around 40\%, to that
+number. Another strong class is cars with 2,252 detections, or around 6\%.
+This means that two classes together have almost as many detections
+as the remaining 58 classes combined.
+
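+% Editorial sketch: a quick check of the shares quoted above, using only
+% those numbers; the rounding is mine.
+\[
+  14{,}640 + 2{,}252 = 16{,}892 \approx 0.46 \cdot 36{,}863
+\]
+% so persons and cars alone account for roughly 46\% of all detections,
+% against 54\% for the remaining 58 classes.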
+In macro averaging, the cumulative precision and recall values are
+calculated per class and then averaged across all classes. Smaller
+classes quickly reach high recall values because their total number of
+ground truth boxes is small as well. The last recall and precision value
+of each smaller class is repeated to achieve homogeneity with the largest
+class. As a consequence, the average recall is quite high early on.
+Later on, only the values of the largest class still change, which has
+only a small impact on the overall result.
+
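+% Editorial sketch: the standard macro-averaged definitions, assuming
+% C classes with per-class true positives TP_c, false positives FP_c,
+% and ground truth count G_c; this notation is mine, not the thesis's.
+\[
+  P_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c},
+  \qquad
+  R_{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{G_c}
+\]
+% With C = 60 here, a small class that saturates its recall early keeps
+% contributing its full, repeated value to the average, which is exactly
+% the effect described above.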
+Conversely, in micro averaging the cumulative true positives
+are added up across classes and then divided by the total number of
+ground truth boxes. Here, the effect is the opposite: the total number
+of ground truth boxes is very large, which means the combined true
+positives of 58 classes have a comparatively small impact on the
+average recall. As a result, the open set error rises more quickly
+than the \(F_1\) score in micro averaging, creating the sharp rise of
+open set error at a lower \(F_1\) score than in macro averaging. The
+open set error reaches a high value early on and changes little
+afterwards. This allows the \(F_1\) score to catch up and produces the
+almost horizontal line in the graph. Eventually, the \(F_1\) score
+decreases again while the open set error rises a bit further.
+
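+% Editorial sketch: the micro-averaged counterparts under the same
+% assumed notation, pooling the counts over all classes before dividing.
+\[
+  P_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)},
+  \qquad
+  R_{\text{micro}} = \frac{\sum_{c} TP_c}{\sum_{c} G_c}
+\]
+% Here the persons class dominates both sums, so the remaining classes
+% move the averages only slightly, driving the behaviour described above.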
Furthermore, the plotted behaviour implies that Miller et al.~\cite{Miller2018}
use macro averaging in their paper as the unique behaviour of micro