uni/masterproj/seminar_report.tex

\documentclass[12pt]{scrartcl}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Languages:

% Falls die Ausarbeitung in Deutsch erfolgt:
% \usepackage[german]{babel}
% \usepackage[T1]{fontenc}
% \usepackage[latin1]{inputenc}
% \usepackage[latin9]{inputenc}
% \selectlanguage{german}

% If the thesis is written in English:
\usepackage[spanish,english]{babel}
\selectlanguage{english}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bind packages:
\usepackage[utf8]{inputenc} % Unicode funktioniert unter Windows, Linux und Mac
\usepackage[T1]{fontenc}
\usepackage{acronym}                    % Acronyms
\usepackage{algorithmic}								% Algorithms and Pseudocode
\usepackage{algorithm}									% Algorithms and Pseudocode
\usepackage{amsfonts}                   % AMS Math Packet (Fonts)
\usepackage{amsmath}                    % AMS Math Packet
\usepackage{amssymb}                    % Additional mathematical symbols
\usepackage{amsthm}
\usepackage{booktabs}                   % Nicer tables
%\usepackage[font=small,labelfont=bf]{caption} % Numbered captions for figures
\usepackage{color}                      % Enables defining of colors via \definecolor
\definecolor{uhhRed}{RGB}{254,0,0}		  % Official Uni Hamburg Red
\definecolor{uhhGrey}{RGB}{122,122,120} % Official Uni Hamburg Grey
\usepackage{fancybox}                   % Gleichungen einrahmen
%\usepackage{fancyhdr}										% Packet for nicer headers
\usepackage[automark]{scrlayer-scrpage}
\usepackage[hidelinks]{hyperref}\urlstyle{rm}
%\usepackage{fancyheadings}             % Nicer numbering of headlines

%\usepackage[outer=3.35cm]{geometry} 	  % Type area (size, margins...) !!!Release version
%\usepackage[outer=2.5cm]{geometry} 		% Type area (size, margins...) !!!Print version
%\usepackage{geometry} 									% Type area (size, margins...) !!!Proofread version
\usepackage[outer=3.15cm]{geometry} 	  % Type area (size, margins...) !!!Draft version
\geometry{a4paper,body={5.8in,9in}}

\usepackage{graphicx}                   % Inclusion of graphics
%\usepackage{latexsym}                  % Special symbols
\usepackage{longtable}									% Allow tables over several parges
\usepackage{listings}                   % Nicer source code listings
\usepackage{multicol}										% Content of a table over several columns
\usepackage{multirow}										% Content of a table over several rows
\usepackage{rotating}										% Alows to rotate text and objects
\usepackage{gensymb}
\usepackage[hang]{subfigure}            % Allows to use multiple (partial) figures in a fig
%\usepackage[font=footnotesize,labelfont=rm]{subfig}	% Pictures in a floating environment
\usepackage{tabularx}										% Tables with fixed width but variable rows
\usepackage{url,xspace,boxedminipage}   % Accurate display of URLs

\usepackage{csquotes}
\usepackage[
backend=biber,
bibstyle=ieee,
citestyle=ieee,
minnames=1,
maxnames=2
]{biblatex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Configurationen:

\hyphenation{whe-ther} 									% Manually use: "\-" in a word: Staats\-ver\-trag

%\lstloadlanguages{C}                   % Set the default language for listings
\DeclareGraphicsExtensions{.pdf,.svg,.jpg,.png,.eps} % first try pdf, then eps, png and jpg
\graphicspath{{./src/}} 								% Path to a folder where all pictures are located
%\pagestyle{fancy} 											% Use nicer header and footer
\pagestyle{scrheadings}

\addbibresource{bib.bib}
\MakeOuterQuote{"}

\begin{document}

\title{Master project: seminar report template}
\author{Jim Martens}

\maketitle
\section*{Abstract}

The short abstract (100-150 words) is intended to give the reader an overview of the paper and your general opinion about the paper.


% Lists:
\setcounter{tocdepth}{2} 					% depth of the table of contents (for Seminars 2 is recommented)
\tableofcontents
\pagenumbering{arabic}
\clearpage

\section{Introduction}
Use this template as a starting point for preparing your seminar report.
For more information on \LaTeX, please consult, e.g., the online book at \url{https://en.wikibooks.org/wiki/LaTeX}.
Refer also to material on scientific writing.
The length of the report should not exceed 10 pages (excluding the reference list).

This part contains the introduction to the topic.
It introduces the general problem area of the paper, and leads the reader to the next section that provides more details.
This part should also cite other related work (not only the seminar paper you are working on) and compare the approaches on a high level.

\section{Method description}
% This section describes the proposed approach in the paper in more detail.
% Do not take sections directly from the paper, provide your own understanding and description.

Deep Sliding Shapes\cite{Song2016} is using both a Regional Proposal Network (RPN) and an
Object Recognition Network (ORN). The raw 3D data is encoded by a directional
Truncated Signed Distance Function (TSDF) and then presented to the RPN.
The RPN is working with multiple scales and only a small subset of the overall
predicted regions (2000 in number) is forwarded to the ORN.

For each of the forwarded proposals the TSDF is used again to encode the geometric
shape of the object. As part of the ORN the points inside the proposal box
are projected into 2D and the resulting 2D bounding box is given to VGGnet\cite{Simonyan2015}
to extract colour features. The results from both the 3D ORN and the 2D part
are concatenated and via two fully connected layers the object label and 3D box
are predicted.

\subsection{Encoding 3D Representation and Normalization}

Deep Sliding Shapes do not use the raw 3D data. Instead the raw data is
encoded in a certain way and then used by the networks. The raw 3D space
is divided into an equally spaced 3D voxel grid. Each voxel has an associated
value which is the shortest distance between the center of the voxel and
the surface from the input depth map. In addition to this relative distance
the direction of each surface point is encoded as well. To this end the
aforementioned Truncated Signed Distance Function is used. It stores a
three-dimensional vector \([dx, dy, dz]\) in each voxel. Each of these
values records the distance in the respective direction to the closest
surface point. These values are clipped at \(2\delta\) where \(\delta\) represents
the grid size in each dimension. Lastly the sign of these values indicates
whether the cell is in front of or behind the surface.

Furthermore every scene is rotated to align it with the gravity direction.
In addition only a subset of the 3D space is targeted. Horizontally the range
is from \(-2.6\) meters to \(2.6\) meters. Vertically it ranges from \(-1.5\)
meters to \(1\) meter. The depth is limited to the range \(0.4\) to \(5.6\)
meters. Within this 3D range the scene is encoded by a volumetric TSDF with
grid size \(0.025\) meters, which results in a \(208 \times 208 \times 100\)
volume that functions as the input to the 3D Region Proposal Network.

The major directions of the room are used for the orientations of the proposals.
RANSAC plane fitting is used unter the Manhattan world assumption to calculate
the proposal box orientations.

\subsection{Multi-scale 3D Region Proposal Network}

At the start of the pipeline stands the 3D Region Proposal network. It uses the
normalized input and has the high-level task to reduce the number of potential
regions so that the Joint Amodal Object Recognition Network only has to work on
a relatively small number of regions.

To this end it utilizes so called anchor boxes. \(N\) region proposals are predicted
for each sliding window. Each of the region proposals corresponds to one of the
\(N\) anchor boxes. There are \(N = 19\) anchor boxes. For anchors with non-square
horizontal aspect ratios another anchor is defined, which is rotated by \(90 \degree\).

The size of the anchor boxes varies quite a bit (from \(0.3\) meters to \(2\)
meters). A region proposal network on one scale would therefore not really work.
As a consequence the RPN works with two different scales. The list of anchors
is split into two lists (one for each scale) based on how close their physical
sizes are to the receptive fields of the output layers.

A fully 3D convolutional architecture is used for the RPN. The stride for the last
convolution layer is one, which resembles \(0.1\) meters in 3D. The last layer
predicts the objectness score and the bounding box regression. For the first level
of anchors the filter size is 2x2x2 and for the second layer it is 5x5x5. The
receptive fields are \(0.4 \text{m}^3\) for level one and \(1 \text{m}^3\) for
level two respectively.

After the anchor boxes have been calculated, the anchor boxes with a point density
lower than \(0.005\) points per cubic centimeter are removed using the integral
image technique. On average there are \(107674\) boxes remaining after this step.
For the remaining anchors an objectness score is calculated, which are essentially
two probabilities (being an object and not being an object).

In addition to this classification step a box regression is applied to all
anchor boxes. This regression calculates the center and size of each
box, whereas the size is given in three major directions of the box.
The overall output is therefore containing both the objectness score (classification)
and the 6-element vector describing the center and size of the box.

Lastly 3D non-maximum suppression is used to remove redundancies. It works with
an Intersection-over-Union (IOU) threshold of \(0.35\). From the remaining
boxes only the top \(2000\) boxes are selected as input to the next network.

The multi-task loss function is the sum of the classification loss and the
regression loss. Cross entropy is used for the classification loss.
The labels for the classification loss are obtained by calculating the 3D
Intersection-over-Union value of every anchor box with respect to the ground truth.
If this value is larger than \(0.35\) the anchor box is considered positive. If
it is below \(0.15\) then the box is considered negative.

The regression loss is only used for all positive examples. It utilizes a smooth
\(L_1\) loss as it was used by Fast-RCNN\cite{Girshick2015} for 2D box regression.
At the core of the loss function stands the difference of the centers and sizes
between the anchor box and the corresponding ground truth. The orientation of
the box is not used for simplicity. The center offset is represented
by the difference of the anchor box center and the ground truth center in the
camera coordinate system. The size difference is a bit more complicated to calculate.
First the major directions have to be determined by using the closest match of
the major directions between both boxes. Next the difference is calculated in
each of the major directions. Lastly the size difference is normalized by the
anchor size.

\subsection{Joint Amodal Object Recognition Network}

The object recognition network is \(>-<\)-shaped. It starts with both a 3D and a 2D
object recognition network which are then combined for the joint recognition.

For the 3D object recognition every proposal bounding box is padded with \(12.5\%\)
of the size in each direction to encode contextual information. The space is divided
into a 30x30x30 voxel grid and TSDF is used to encode the geometric shape of
the object. This network part contains two max pooling layers which use stride 2
and a kernel size of 2x2x2. The three convolution layers use kernel sizes
5x5x5, 3x3x3 and 3x3x3 respectively with a stride of 1 each. Between the fully connected
layers are ReLU and dropout layers (dropout ratio 0.5). The fully connected
layer produces a 4096 dimensional feature vector.

The 2D object recognition part projects the points inside each 3D proposal box
to the 2D image plane. Afterwards the tightest box that contains all these points
is determined. A VGGnet that is pre-trained on ImageNet (without fine-tuning)
is used to extract colour features from the image. The output of VGGnet is then
funneled into a fully connected layer that results in a 4096 dimensional feature
vector.

After both object recognition parts the two feature vectors are concatenated.
Another fully connected layer reduces this feature vector to 1000 dimensions.
These features are used by two separate fully connected layers to predict the
object label and the 3D box surrounding the object.

For every detected box the box size in each direction and the aspect ratio of
each pair of box edges is calculated. These numbers are then compared with a
distribution collected from all the training examples of the same category.
If any of the values falls outside the first to 99th percentile the score
of the box is decreased by \(2\).

The multi-task loss is a sum of classification and regression loss. Cross entropy
is used for the classification loss. The output of the network consists of 20
probabilities (one for each object category). For the regression loss nothing
changes in comparison to the region proposal network. Only difference is the
element-wise normalization of the labels with the object category specific
mean and standard deviation.

After the training of the network concluded the features are extracted from the last
fully connected layer. A Support Vector Machine (SVM) is trained for each object
category. During the testing of the object recognition network a 3D non-maximum
suppression is applied on the results with a treshold of \(0.1\) using the SVM
scores for every box. In case of the box regressions the results from the network
are used directly.

\section{Experimental result and evaluation}

The regional proposal network was trained for 10 hours and the object recognition
network was trained for 17 hours. In both cases an Nvidia K40 GPU was used.
During testing phase it took the RPN \(5.62\) seconds per image and the ORN
\(13.93\) seconds per image. Both networks were evaluated on the NYUv2\cite{Silberman2012}
and SUN RGB-D\cite{Song2015} data sets.
A threshold of \(0.25\) was used to calculate the average recall for the proposal
generation and the average precision for the detection. The SUN RGB-D data set
was used to obtain the ground truth amodal bounding boxes.

For the evaluation of the proposal generation a single-scale RPN, a multi-scale RPN
and a multi-scale RPN with RGB colour added to the 3D TSDF were compared with
each other and the baselines using the NYU data set. 3D selective search
and a naive 2D to 3D conversion were used as baselines. The naive conversion used the
2D region proposal to retrieve the 3D points within that region. Afterwards the
outermost 2 percentiles in each direction were removed and a tight 3D bounding
box calculated. The values of recall averaged over all object categories were
\(34.4\) for the naive approach, \(74.2\) for 3D selective search, \(75.2\) for
the single-scale RPN, \(84.4\) for the multi-scale RPN and \(84.9\) for the
multi-scale RPN with added colour. The last value is used as the final region
proposal result.

Another experiment tested the detection results for the same ORN architecture
given different region proposals. Comparing the 3D selective search with
RPN gave mean average precisions of \(27.4\) and \(32.3\) respectively. Hence
the RPN provides a better solution. Planar objects (e.g. doors) seem to work
better with 3D selective search. Boxes, monitors and TVs don't work for the RPN,
where the presumed reason for boxes is the high variance and for monitors and TVs
the missing depth information is likely responsible.

The detection evaluation was structured differently. First the feature encodings
were compared with other (the same experiment that was mentioned in previous
paragraph), then the design was justified and lastly the results were compared
with state-of-the-art methods. The feature encoding experiment provided better
results for encoding the directions directly compared to a single distance.
An accurate TSDF measured better than a projective one. The usage of the 2D image
VGGnet proved to be better than the direct encoding of colour on 3D voxels.
Lastly it didn't help to include HHA (horizontal disparity, height above ground
and the angle the pixel's local surface normal makes with the inferred gravity
direction).

The same experiment was used to help with design choices. It was found that
bounding box regression helps significantly (increase in mAP of 4.4 and 4.1
for 3D selective search and RPN respectively compared to the case without this
regression). SVM was found to outperform the softmax slightly (increase of 0.5 mAP)
which presumably is the case, because it can better handle the unbalanced number of
training samples for each category in the NYUv2 data set. Size pruning was identified
as helping (increase of mAP per category of 0.1 up to 7.8).

For the comparison with state-of-the-art methods Song and Xiao used 3D Sliding
Shapes\cite{Song2014} and 2D Depth-RCNN\cite{Gupta2015} and the same test set
that was used for the 2D Depth-RCNN (intersection of NYUv2 test set and Sliding
Shapes test set for the five categories bed, chair, table, sofa/couch and toilet).
The comparison shows that 3D Deep Sliding Shapes outperforms the chosen state-of-the-art
methods in all categories. The toilet is the only example where it is relevant
for the result that the 2D data is used. With only 3D data used the 2D Depth-RCNN
performs better on the estimated model if it uses 2D and 3D.

All in all 3D Deep Sliding Shapes works well on non-planar objects that have depth
information. The 2D component helps in distinguishing similar shaped objects.

\section{Discussion} % (fold)
\label{sec:discussion}

Deep Sliding Shapes offers a seemingly powerful new approach for object detection
in a 3D environment.

\subsection{Paper Strengths} % (fold)
\label{sub:paper_strengths}

The paper is written in a clearly structured way and uses sub headlines to
better guide the reader. The authors apparently tried to minimize repetition
in the sentences and are using some elements of novelized storytelling like
rhetorical questions that soften up the paper and make it less dry. The introduction
in particular is giving a very good motivation for the paper and ends with a cliff
hanger that creates excitement to continue reading beyond the detour that is
the section about related works.

Overall the paper provides many illustrating figures that make it far easier
to imagine the results of the introduced method and quite simply hydrate the
paper and make it friendlier to the eyes compared to an all text paper.

Lastly the paper provides many evaluation results that are understandable
largely without the main paper text and give a good overview over the performance
of the proposed method compared to others.

% subsection positive_aspect (end)

\subsection{Paper Weaknesses} % (fold)
\label{sub:paper_weaknesses}

That said there are things to criticize about this paper. The information about
the network structure is spread over two figures and some sections of the paper
with no guarantees that no information is missing. Furthermore no information
regarding the training, validation and testing data split were available. While
this implementation information does not have to be inside the paper proper it
should have been inside appendices to make an independent replication of results
easier. Not directly a problem with the paper itself the decision to implement
a software framework from scratch rather than using a proven existing one like
Tensorflow makes it more difficult to utilize the pretrained models which are
indeed available.

The evaluation sections are inconsistent in their structure. The first section
about object proposal evaluation follows the rest of the paper and is written
in continuous text. It describes the compared methods and then discusses the
results. The second section regarding the object detecion evaluation however
is written completely different. There is no continuous text and the compared
methods are not really described. Instead the section is largely used to justify
the chosen design. This would not even be a problem if there were a introductory
text explaining their motivations for this kind of evaluation and guiding the
reader through the process. Currently there is no explanation given why
the detection evaluation starts with feature encoding and is followed by
design justification.

Furthermore the motivations for the used data sets NYUv2 and SUN RGB-D are
not quite clear. Which data set is used for what purpose and why? The text
mentions in one sentence that the amodal bounding boxes are obtained from
SUN RGB-D without further explanation. It would have been advantageous
if the actual process of this "obtaining" were explained.

% subsection negitive (end)

% section review (end)

\section{Conclusion}

Deep Sliding Shapes introduces a 3D convolutional network pipeline for
amodal 3D object detection. This pipeline consists of a regional proposal
network and a joint 2D and 3D object recognitioin network. Experimental
results show that this approach delivers better results than previous
state-of-the-art methods.

In future work this method should be compared to other 3D centric object detection
approaches like Frustum Point Net\cite{Qi2017}.

\newpage
\printbibliography
\addcontentsline{toc}{section}{Bibliography}% Add to the TOC

\end{document}