Added explanation for NMS and top k

Signed-off-by: Jim Martens <github@2martens.de>
2019-09-18 15:19:56 +02:00
parent 075c35b7d8
commit 8ba241a8d7

@@ -931,7 +931,7 @@ threshold is not the largest threshold tested. A lower threshold likely
eliminated some false positives from the result set. On the other hand, a
threshold that is too low likely eliminated true positives as well.
\subsection*{Non-maximum suppression and top \(k\)}
Miller et al.~\cite{Miller2018} supposedly did not use non-maximum suppression
in their implementation of dropout sampling. Therefore, a variant with disabled
@@ -943,10 +943,39 @@ a lot more false positives remain and have a negative impact on precision.
In combination with top \(k\) selection, recall can be affected:
duplicate detections could stay while the actual maximum boxes are removed.
The number of observations was measured before and after the entropy
threshold/NMS filter: both Bayesian SSD without NMS and dropout, and
Bayesian SSD with NMS and disabled dropout, have the same number of
observations everywhere before the entropy threshold. After the entropy
threshold (a value of 1.5 was used for both) and NMS, the variant with NMS
retains roughly 23\% of its observations; without NMS, 79\% remain.
Irrespective of the absolute numbers, this discrepancy clearly shows the
impact of non-maximum suppression and also explains the higher count of
false positives: more than 50\% of the original observations were removed
with NMS but remained without it, and these surviving observations are
very likely false positives.
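The greedy, IoU-based non-maximum suppression that SSD-style detectors apply can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the corner-coordinate box format, and the IoU threshold of 0.45 are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring remaining box
    and drop every box that overlaps it by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep
```

With NMS disabled, every duplicate box around a maximum survives into the later filtering stages, which matches the much larger fraction of observations left after the entropy threshold in the no-NMS variant.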
A clear distinction between micro and macro averaging can be observed:
recall is hardly affected with micro averaging (0.300) but drops markedly
with macro averaging (0.229). For micro averaging, it does
not matter which class the true positives belong to: every detection
counts the same way. This also means that top \(k\) has only
a marginal effect: some true positives might be removed without NMS,
but overall this does not have a big impact. With macro averaging, however,
the class of the true positives matters a lot: for example, if two
true positives are removed from a class with only few true positives
to begin with, their removal will have a drastic influence on
the class recall value and hence the overall result.
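The difference between the two averaging schemes can be illustrated with a small sketch. The helper names and the per-class counts are invented for illustration and are not taken from the thesis experiments:

```python
def micro_recall(tp, fn):
    """Micro averaging: pool true positives and false negatives
    over all classes before computing recall."""
    return sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))

def macro_recall(tp, fn):
    """Macro averaging: compute recall per class, then take the
    unweighted mean, so small classes count as much as large ones."""
    recalls = [tp[c] / (tp[c] + fn[c]) for c in tp]
    return sum(recalls) / len(recalls)

# Hypothetical counts: one large class, one class with only two TPs.
tp = {"person": 100, "dog": 2}
fn = {"person": 20, "dog": 0}

# Removing the two dog TPs barely moves micro recall but halves the
# dog class recall, dragging macro recall down drastically.
tp_after = {"person": 100, "dog": 0}
fn_after = {"person": 20, "dog": 2}
```

Evaluating the sketch, micro recall only falls from about 0.84 to 0.82, while macro recall collapses from about 0.92 to 0.42, which mirrors the asymmetry described above.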
The impact of top \(k\) was measured by counting the number of observations
after top \(k\) has been applied: the variant with NMS keeps about 94\%
of the observations left after NMS, whereas without NMS only about 59\%
of observations are kept. This shows that top \(k\) has a significant
impact on the result when non-maximum suppression is disabled. Furthermore,
some classes are hit harder by top \(k\) than others: for example,
dogs keep around 82\% of their observations but persons only 57\%.
This indicates that detected dogs mostly appear in images with few detections
overall and/or have a high enough prediction confidence to be
kept by top \(k\), whereas persons likely often appear in images
with many detections and/or have confidences that are too low.
In this example, the likelihood that true positives are removed is
therefore quite high for the person category and far lower for dogs.
This ties back to micro and macro averaging and their impact on recall.
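Per-image top \(k\) selection as discussed above can be sketched as follows. This is a minimal illustration under assumed conventions; the function name and the per-image score list are not the thesis implementation:

```python
def top_k(scores, k):
    """Indices of the k highest-confidence detections on one image.
    Everything below the k-th score is discarded, regardless of class."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical crowded image: many low-confidence person detections
# compete with a few high-confidence ones for the k slots.
scores = [0.9, 0.2, 0.8, 0.1]
kept = top_k(scores, k=2)
```

Because the cut is purely by confidence within an image, classes that tend to appear in crowded scenes or with lower scores lose disproportionately many detections, consistent with the person-versus-dog discrepancy described above.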
\subsection*{Dropout Sampling and Observations}