\documentclass[12pt]{scrartcl}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Languages:
% If the report is written in German:
% \usepackage[german]{babel}
% \usepackage[T1]{fontenc}
% \usepackage[latin1]{inputenc}
% \usepackage[latin9]{inputenc}
% \selectlanguage{german}

% If the thesis is written in English:
\usepackage[english]{babel}
\selectlanguage{english}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bind packages:
\usepackage[utf8]{inputenc}   % Unicode works on Windows, Linux and Mac
\usepackage[T1]{fontenc}
\usepackage{acronym}          % Acronyms
\usepackage{algorithmic}      % Algorithms and pseudocode
\usepackage{algorithm}        % Algorithms and pseudocode
\usepackage{amsfonts}         % AMS math package (fonts)
\usepackage{amsmath}          % AMS math package
\usepackage{amssymb}          % Additional mathematical symbols
\usepackage{amsthm}
\usepackage{booktabs}         % Nicer tables
%\usepackage[font=small,labelfont=bf]{caption} % Numbered captions for figures
\usepackage{color}            % Enables defining of colors via \definecolor
\usepackage{xcolor}
\definecolor{uhhRed}{RGB}{254,0,0}       % Official Uni Hamburg red
\definecolor{uhhGrey}{RGB}{122,122,120}  % Official Uni Hamburg grey
\definecolor{conv}{RGB}{160,206,31}
\definecolor{relu}{RGB}{102,136,205}
\definecolor{fc}{RGB}{212,71,87}
\definecolor{softmax}{RGB}{191,178,144}
\definecolor{gray2}{RGB}{211,210,210}
\usepackage{fancybox}         % Frame equations
%\usepackage{fancyhdr}        % Package for nicer headers
\usepackage[automark]{scrlayer-scrpage}
\usepackage[hidelinks]{hyperref}\urlstyle{rm}
%\usepackage{fancyheadings}   % Nicer numbering of headlines
%\usepackage[outer=3.35cm]{geometry} % Type area (size, margins...) !!!Release version
%\usepackage[outer=2.5cm]{geometry}  % Type area (size, margins...) !!!Print version
%\usepackage{geometry}               % Type area (size, margins...) !!!Proofread version
\usepackage[outer=3.15cm]{geometry}  % Type area (size, margins...) !!!Draft version
\geometry{a4paper,body={5.8in,9in}}
\usepackage{tikz}
\usetikzlibrary{backgrounds,calc,positioning,quotes}
\usepackage{graphicx}         % Inclusion of graphics
%\usepackage{latexsym}        % Special symbols
\usepackage{longtable}        % Allow tables over several pages
\usepackage{listings}         % Nicer source code listings
\usepackage{multicol}         % Content of a table over several columns
\usepackage{multirow}         % Content of a table over several rows
\usepackage{rotating}         % Allows to rotate text and objects
\usepackage{textcomp}
\usepackage{gensymb}
\usepackage[hang]{subfigure}  % Allows to use multiple (partial) figures in a figure
%\usepackage[font=footnotesize,labelfont=rm]{subfig} % Pictures in a floating environment
\usepackage{tabularx}         % Tables with fixed width but variable rows
\usepackage{url,xspace,boxedminipage} % Accurate display of URLs
\usepackage{csquotes}
\usepackage[
backend=biber,
bibstyle=ieee,
citestyle=ieee,
minnames=1,
maxnames=2
]{biblatex}
\usepackage{epstopdf}
\epstopdfDeclareGraphicsRule{.tif}{png}{.png}{convert #1 \OutputFile}
\AppendGraphicsExtensions{.tif}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Configuration:
\hyphenation{whe-ther} % Manually use: "\-" in a word: Staats\-ver\-trag
%\lstloadlanguages{C}  % Set the default language for listings
\DeclareGraphicsExtensions{.pdf,.svg,.jpg,.png,.eps,.tif} % first try pdf, then eps, png and jpg
\graphicspath{{./images/}} % Path to a folder where all pictures are located
%\pagestyle{fancy} % Use nicer header and footer
\pagestyle{scrheadings}
\addbibresource{bib.bib}
\MakeOuterQuote{"}

\begin{document}

\title{Deep Sliding Shapes: A Review}
\author{Jim Martens}
\maketitle

\section*{Abstract}
Deep Sliding Shapes is an approach that uses 3D data in a region proposal network to limit the search space before both 3D and 2D data are used in an object recognition network to find the actual objects. It produces 3D bounding boxes and outperforms 3D selective search as well as other state-of-the-art solutions. The approach introduces a noteworthy high-level structure that is used in more recent networks as well. However, the custom code implementation and the sparse implementation details make an independent reproduction of the results, or an adaptation to other problems, very difficult if not impossible.

% Lists:
%\setcounter{tocdepth}{2} % depth of the table of contents (for seminars 2 is recommended)
%\tableofcontents
%\pagenumbering{arabic}
\clearpage

\section{Introduction}
Object detection is a central task in computer vision. It combines classification and localization: the aim is to classify and locate objects inside an image. It may be restricted to certain classes of interest, so that not every stone or leaf of a tree is detected as an object. The output of object detection networks is usually a collection of bounding boxes, one for each detected object, together with the corresponding classifications.

The area of 2D object detection has matured over many years. The Single Shot MultiBox Detector\cite{Liu2016} uses a convolutional neural network (CNN) and the RGB data of an image to detect objects. The result is a 2D bounding box and the classification for each object. With the increasing availability of depth cameras, images gain a depth component and approaches utilizing this depth become more relevant. Depth-RCNN\cite{Gupta2015} uses the depth as a fourth channel of a 2D image. After the bounding box is calculated, it fits a 3D model to the points within the bounding box.

Deep Sliding Shapes\cite{Song2016} utilizes the depth for actual 3D deep learning but also uses the RGB channels of an RGB-D image to benefit from the strengths of 2D object detectors. The results of the 3D and 2D parts are combined into a 3D bounding box and a classification.

Section 2 explains the method used by Deep Sliding Shapes. The experimental results are presented and evaluated in section 3. Strengths and weaknesses of the paper are discussed in section 4 before concluding in section 5.

\section{Method description}
% This section describes the proposed approach in the paper in more detail.
% Do not take sections directly from the paper, provide your own understanding and description.
Deep Sliding Shapes\cite{Song2016} uses both a Region Proposal Network (RPN) and an Object Recognition Network (ORN). The raw 3D data is encoded and then presented to the RPN. The proposed regions of the RPN are filtered and the remaining regions are given to the ORN. The ORN projects the points inside each proposal box into 2D and gives the resulting 2D bounding box to VGGnet\cite{Simonyan2015} to extract colour features. In parallel the depth data is used by the 3D part of the ORN. The results of the 3D part and the 2D part are concatenated, and two fully connected layers predict the object label and the 3D box.

\subsection{Encoding 3D Representation and Normalization}
Deep Sliding Shapes does not use the raw 3D data. Instead the raw data is encoded and the encoding is used by the networks. The raw 3D space is divided into an equally spaced 3D voxel grid. Each voxel stores the shortest distance between its center and the surface obtained from the input depth map. In addition to this distance, the direction towards the surface is encoded as well. To this end a directional Truncated Signed Distance Function (TSDF) is used. It stores a three-dimensional vector \([dx, dy, dz]\) in each voxel, where each component records the distance in the respective direction to the closest surface point. These values are truncated at \(2\delta\), where \(\delta\) is the grid size in each dimension. Lastly, the sign of these values indicates whether the voxel is in front of or behind the surface.
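To make this encoding concrete, the following Python sketch computes such a directional TSDF grid. It is a minimal illustration under simplifying assumptions: the surface points are assumed to be already back-projected from the depth map, the sign computation (front of or behind the surface) is omitted, and the function name and the use of \texttt{scipy} are illustrative choices, not the paper's implementation.

\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import numpy as np
from scipy.spatial import cKDTree

def directional_tsdf(surface_points, origin, delta, dims):
    """Minimal directional TSDF sketch (hypothetical helper).

    surface_points: (N, 3) points from the depth map.
    origin: position of the grid corner; delta: voxel size;
    dims: grid resolution, e.g. (208, 208, 100).
    The sign would come from ray casting against the depth map
    and is omitted here.
    """
    # Centers of all voxels in the grid.
    idx = np.stack(np.meshgrid(*[np.arange(d) for d in dims],
                               indexing="ij"), axis=-1)
    centers = (origin + (idx + 0.5) * delta).reshape(-1, 3)
    # Closest surface point for every voxel center.
    _, nearest = cKDTree(surface_points).query(centers)
    diff = surface_points[nearest] - centers
    # Truncate each component at 2 * delta, as described above.
    tsdf = np.clip(diff, -2 * delta, 2 * delta)
    return tsdf.reshape(*dims, 3)
\end{lstlisting}
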
Furthermore every scene is rotated to align it with the gravity direction. In addition only a subset of the 3D space is targeted. Horizontally the range spans \(-2.6\) to \(2.6\) meters, vertically \(-1.5\) to \(1\) meter, and the depth is limited to the range of \(0.4\) to \(5.6\) meters. Within this 3D range the scene is encoded by a volumetric TSDF with a grid size of \(0.025\) meters, which results in a \(208 \times 208 \times 100\) volume that serves as the input to the 3D Region Proposal Network. The major directions of the room are used for the orientations of the proposals: RANSAC plane fitting under the Manhattan world assumption yields the proposal box orientations.

\subsection{Multi-scale 3D Region Proposal Network}
At the start of the pipeline stands the 3D Region Proposal Network. It uses the normalized input and has the high-level task of reducing the number of potential regions, so that the Joint Amodal Object Recognition Network only has to work on a relatively small number of regions. To this end it utilizes so-called anchor boxes: \(N\) region proposals are predicted for each sliding window, each corresponding to one of the \(N = 19\) anchor boxes. For anchors with non-square horizontal aspect ratios another anchor is defined, rotated by \(90 \degree\). The size of the anchor boxes varies considerably (from \(0.3\) meters to \(2\) meters), so a region proposal network on a single scale would not work well. As a consequence the RPN works on two different scales. The list of anchors is split into two lists (one per scale) based on how close their physical sizes are to the receptive fields of the output layers.

A fully 3D convolutional architecture is used for the RPN. The stride of the last convolution layer is one, which corresponds to \(0.1\) meters in 3D. The last layer predicts the objectness score and the bounding box regression. For the first level of anchors the filter size is \(2 \times 2 \times 2\) and for the second level it is \(5 \times 5 \times 5\). The receptive fields are \(0.4 \text{m}^3\) for level one and \(1 \text{m}^3\) for level two respectively.

After the anchor boxes have been computed, anchor boxes with a point density lower than \(0.005\) points per cubic centimeter are removed using the integral image technique. On average \(107674\) boxes remain after this step.
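The integral image (here a summed volume table) reduces each density check to a constant-time lookup. The following sketch illustrates the idea; the data layout and names are assumptions for illustration, not the authors' code.

\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import numpy as np

def filter_anchors(point_counts, anchors, min_density=0.005,
                   voxel_cm=2.5):
    """Sketch of the integral-image density filter.

    point_counts: voxel grid with the number of 3D points per cell.
    anchors: (x0, y0, z0, x1, y1, z1) voxel index ranges, upper
    bounds exclusive. min_density is in points per cubic
    centimeter; voxel_cm assumes the 0.025 m grid from above.
    """
    # Summed volume table with a zero border for easy lookups.
    I = np.zeros(tuple(s + 1 for s in point_counts.shape))
    I[1:, 1:, 1:] = point_counts.cumsum(0).cumsum(1).cumsum(2)
    keep = []
    for (x0, y0, z0, x1, y1, z1) in anchors:
        # Inclusion-exclusion over the 8 corners: O(1) per anchor.
        n = (I[x1, y1, z1] - I[x0, y1, z1] - I[x1, y0, z1]
             - I[x1, y1, z0] + I[x0, y0, z1] + I[x0, y1, z0]
             + I[x1, y0, z0] - I[x0, y0, z0])
        volume = (x1 - x0) * (y1 - y0) * (z1 - z0) * voxel_cm ** 3
        if n / volume >= min_density:
            keep.append((x0, y0, z0, x1, y1, z1))
    return keep
\end{lstlisting}
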
For the remaining anchors an objectness score is calculated, which consists of two probabilities (being an object and not being an object). In addition to this classification a box regression is applied to all anchor boxes. This regression predicts the center and size of each box, where the size is given in the three major directions of the box. The overall output therefore contains both the objectness score (classification) and a 6-element vector describing the center and size of the box. Lastly, 3D non-maximum suppression with an Intersection-over-Union (IoU) threshold of \(0.35\) is used to remove redundancies. From the remaining boxes only the top \(2000\) are selected as input to the next network.

The multi-task loss function is the sum of the classification loss and the regression loss. Cross entropy is used for the classification loss. The labels for the classification loss are obtained by calculating the 3D Intersection-over-Union of every anchor box with the ground truth. If this value is larger than \(0.35\) the anchor box is considered positive; if it is below \(0.15\) the box is considered negative. The regression loss is only applied to positive examples. It utilizes a smooth \(L_1\) loss as used by Fast-RCNN\cite{Girshick2015} for 2D box regression. At the core of the loss function stands the difference of centers and sizes between the anchor box and the corresponding ground truth box; the orientation of the box is omitted for simplicity. The center offset is the difference between the anchor box center and the ground truth center in the camera coordinate system. To calculate the size difference, first the major directions have to be matched by finding the closest correspondence of major directions between both boxes. Then the difference is calculated in each of the major directions. Lastly the size difference is normalized by the anchor size.
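The following sketch summarizes the regression target and the smooth \(L_1\) loss for a single positive anchor; it assumes the major directions have already been matched and is a simplified illustration, not the authors' code.

\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import numpy as np

def smooth_l1(x):
    """Smooth L1 as in Fast-RCNN: quadratic near zero, linear beyond."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_box_loss(anchor_center, anchor_size, gt_center, gt_size):
    """6-d regression target for one positive anchor (sketch).

    The center offset is taken in camera coordinates and the size
    difference is normalized by the anchor size, as described above.
    """
    target = np.concatenate([
        gt_center - anchor_center,             # center offset
        (gt_size - anchor_size) / anchor_size  # normalized size diff
    ])
    return smooth_l1(target).sum()
\end{lstlisting}
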
\subsection{Joint Amodal Object Recognition Network}
\begin{figure}
\centering
\includegraphics{orn-system-drawing}
\caption{\textbf{Joint Object Recognition Network:} For each 3D region proposal, the 3D volume from depth is fed to a 3D ConvNet and the 2D projection of the 3D proposal is fed to a 2D ConvNet. Jointly they learn the object category and 3D box regression.}
\label{fig:system}
\end{figure}

The structure of the object recognition network can be seen in figure \ref{fig:system}. It starts with separate 3D and 2D object recognition parts, which are then combined for the joint recognition.

For the 3D object recognition every proposal bounding box is padded with \(12.5\%\) of its size in each direction to encode contextual information. The space is divided into a \(30 \times 30 \times 30\) voxel grid and the TSDF is used to encode the geometric shape of the object. This network part contains two max pooling layers with stride 2 and a kernel size of \(2 \times 2 \times 2\). The three convolution layers use kernel sizes \(5 \times 5 \times 5\), \(3 \times 3 \times 3\) and \(3 \times 3 \times 3\) respectively, each with a stride of 1. Between the fully connected layers are ReLU and dropout layers (dropout ratio 0.5). The last fully connected layer produces a 4096-dimensional feature vector.

The 2D object recognition part projects the points inside each 3D proposal box onto the 2D image plane and determines the tightest 2D box that contains all these points. A VGGnet pre-trained on ImageNet (without fine-tuning) is used to extract colour features from this image region. The output of VGGnet is then fed into a fully connected layer that produces a 4096-dimensional feature vector.
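A minimal sketch of this projection step, assuming a simple pinhole camera model with hypothetical intrinsics \(f_x\), \(f_y\), \(c_x\), \(c_y\) (the paper does not spell out this detail):

\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import numpy as np

def tightest_2d_box(points, fx, fy, cx, cy):
    """Project 3D points (camera coordinates, z > 0) onto the image
    plane and return the tightest enclosing 2D box as
    (u_min, v_min, u_max, v_max)."""
    u = fx * points[:, 0] / points[:, 2] + cx
    v = fy * points[:, 1] / points[:, 2] + cy
    return u.min(), v.min(), u.max(), v.max()
\end{lstlisting}
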
After both object recognition parts the two feature vectors are concatenated. Another fully connected layer reduces the combined vector to 1000 dimensions. These features are used by two separate fully connected layers to predict the object label and the 3D box surrounding the object.

For every detected box the box size in each direction and the aspect ratio of each pair of box edges are calculated. These numbers are then compared with a distribution collected from all training examples of the same category. If any of the values falls outside the range from the first to the 99th percentile, the score of the box is decreased by \(2\).
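A sketch of this size pruning check; the feature set (three edge lengths plus the three pairwise aspect ratios) is one plausible reading of the description above, and the names are hypothetical.

\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import numpy as np

def size_pruning_penalty(box_dims, train_dims, penalty=2.0):
    """box_dims: the three edge lengths of one detected box.
    train_dims: (N, 3) edge lengths of training boxes of the same
    category. Returns the score penalty (0 or `penalty`)."""
    def features(d):
        d = np.atleast_2d(d)
        ratios = np.stack([d[:, 0] / d[:, 1],
                           d[:, 0] / d[:, 2],
                           d[:, 1] / d[:, 2]], axis=1)
        # Edge lengths plus pairwise aspect ratios.
        return np.concatenate([d, ratios], axis=1)
    train = features(train_dims)
    lo = np.percentile(train, 1, axis=0)
    hi = np.percentile(train, 99, axis=0)
    f = features(box_dims)[0]
    # Outside the 1st-99th percentile range of any feature: penalize.
    return penalty if np.any((f < lo) | (f > hi)) else 0.0
\end{lstlisting}
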
The multi-task loss is again a sum of a classification loss and a regression loss. Cross entropy is used for the classification loss. The output of the network consists of 20 probabilities (one for each object category). For the regression loss nothing changes in comparison to the region proposal network, except for an element-wise normalization of the labels with the object-category-specific mean and standard deviation.

After the training of the network has concluded, the features of the last fully connected layer are extracted and a Support Vector Machine (SVM) is trained for each object category. During testing of the object recognition network, a 3D non-maximum suppression with a threshold of \(0.1\) is applied to the results, using the SVM scores for every box. The box regression results from the network are used directly.
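Greedy 3D non-maximum suppression appears twice in the pipeline (IoU threshold \(0.35\) in the RPN, \(0.1\) here). The following sketch uses axis-aligned boxes, a simplification since the proposal boxes are oriented.

\begin{lstlisting}[language=Python,basicstyle=\ttfamily\small]
import numpy as np

def nms_3d(boxes, scores, iou_threshold=0.1):
    """Greedy 3D non-maximum suppression sketch. `boxes` is (N, 6)
    with axis-aligned (x0, y0, z0, x1, y1, z1) corners and `scores`
    are, at test time, the per-category SVM scores. Returns the
    indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection volume between box i and all remaining boxes.
        lo = np.maximum(boxes[i, :3], boxes[rest, :3])
        hi = np.minimum(boxes[i, 3:], boxes[rest, 3:])
        inter = np.prod(np.clip(hi - lo, 0, None), axis=1)
        vol_i = np.prod(boxes[i, 3:] - boxes[i, :3])
        vol_r = np.prod(boxes[rest, 3:] - boxes[rest, :3], axis=1)
        iou = inter / (vol_i + vol_r - inter)
        # Drop boxes that overlap the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep
\end{lstlisting}
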
\section{Experimental result and evaluation}
The region proposal network was trained for 10 hours and the object recognition network for 17 hours, in both cases on an Nvidia K40 GPU. During the testing phase the RPN took \(5.62\) seconds per image and the ORN \(13.93\) seconds per image. Both networks were evaluated on the NYUv2\cite{Silberman2012} and SUN RGB-D\cite{Song2015} data sets. An IoU threshold of \(0.25\) was used to calculate the average recall for the proposal generation and the average precision for the detection. The ground truth amodal bounding boxes were obtained from the SUN RGB-D data set.

\begin{table}
\centering
\includegraphics[scale=0.85]{results-drawing-1}
\includegraphics[scale=0.85]{results-table-1}
\caption{\textbf{Evaluation for Amodal 3D Object Proposal:} [All Anchors] shows the performance upper bound when using all anchors.}
\label{tab:results-object-proposal}
\end{table}

For the evaluation of the proposal generation a single-scale RPN, a multi-scale RPN and a multi-scale RPN with RGB colour added to the 3D TSDF were compared with each other and with the baselines on the NYU data set. 3D selective search and a naive 2D-to-3D conversion were used as baselines. The naive conversion used the 2D region proposal to retrieve the 3D points within that region; afterwards the outermost 2 percentiles in each direction were removed and a tight 3D bounding box was calculated. The results can be seen in table \ref{tab:results-object-proposal}. The recall averaged over all object categories was \(34.4\) for the naive approach, \(74.2\) for 3D selective search, \(75.2\) for the single-scale RPN, \(84.4\) for the multi-scale RPN and \(84.9\) for the multi-scale RPN with added colour. The last value is used as the final region proposal result.

\begin{table}
\centering
\includegraphics[scale=0.85]{results-table-2}
\caption{\textbf{Control Experiments on NYUv2 Test Set.} Not working: box (too much variance), door (planar), monitor and tv (no depth).}
\label{tab:results-control-experiments}
\end{table}

Another experiment tested the detection results for the same ORN architecture given different region proposals (see table \ref{tab:results-control-experiments}). Comparing 3D selective search with the RPN gave mean average precisions of \(27.4\) and \(32.3\) respectively; hence the RPN provides the better proposals. Planar objects (e.g. doors) seem to work better with 3D selective search. Boxes, monitors and TVs do not work with the RPN, where the presumed reason is the high shape variance for boxes and the missing depth information for monitors and TVs.

The detection evaluation was structured differently: first the feature encodings were compared with each other (the same experiment as in the previous paragraph), then the design choices were justified, and lastly the results were compared with state-of-the-art methods. The feature encoding experiment showed better results for encoding the directions directly rather than a single distance, and an accurate TSDF measured better than a projective one. Using VGGnet on the 2D image proved better than directly encoding colour on the 3D voxels. Lastly, it did not help to include HHA features (horizontal disparity, height above ground, and the angle between the pixel's local surface normal and the inferred gravity direction).

The same experiment also justified the design choices. Bounding box regression helps significantly (an increase in mAP of 4.4 and 4.1 for 3D selective search and the RPN respectively, compared to the case without regression). The SVM slightly outperforms the softmax (an increase of 0.5 mAP), presumably because it can better handle the unbalanced number of training samples per category in the NYUv2 data set. Size pruning was identified as helpful as well (an increase in mAP per category of 0.1 up to 7.8).

\begin{table}
\centering
\includegraphics{results-table-3}
\caption{\textbf{Comparison on 3D Object Detection.}}
\label{tab:results-object-detection}
\end{table}

For the comparison with state-of-the-art methods, Song and Xiao used 3D Sliding Shapes\cite{Song2014} and 2D Depth-RCNN\cite{Gupta2015} on the same test set that was used for 2D Depth-RCNN (the intersection of the NYUv2 test set and the Sliding Shapes test set for the five categories bed, chair, table, sofa/couch and toilet). The comparison in table \ref{tab:results-object-detection} shows that Deep Sliding Shapes outperforms the chosen state-of-the-art methods in all categories. The toilet is the only category where the use of the 2D data is decisive: with 3D data alone, Deep Sliding Shapes performs worse than the 2D Depth-RCNN variant that fits an estimated model using both 2D and 3D data. All in all, Deep Sliding Shapes works well on non-planar objects for which depth information is available, and the 2D component helps in distinguishing similarly shaped objects.

\section{Discussion} % (fold)
\label{sec:discussion}
Deep Sliding Shapes offers a seemingly powerful new approach for object detection in a 3D environment.

\subsection{Paper Strengths} % (fold)
\label{sub:paper_strengths}
The paper is written in a clearly structured way and uses subheadings to guide the reader. The authors apparently tried to minimize repetition and use some elements of novelistic storytelling, such as rhetorical questions, which soften up the paper and make it less dry. The introduction in particular gives a very good motivation for the paper and ends with a cliffhanger that creates excitement to continue reading beyond the detour that is the related works section. Overall the paper provides many illustrative figures that make it far easier to imagine the results of the introduced method and simply make the paper friendlier to the eyes than an all-text paper. Furthermore the paper provides many evaluation results that are largely understandable without the main paper text and give a good overview of the performance of the proposed method compared to others.

Aside from the writing skills the authors clearly possess, the presented approach itself is also very good. It is an elegant idea to first reduce the search volume with a region proposal network and then let an object recognition network do the heavy lifting. The usage of the 2D data is well thought out as well. This abstract idea of dealing with 3D data has persisted and is partly repeated by Frustum PointNets\cite{Qi2017}, which uses the results of a 2D object detection network to determine the region in which the 3D object detection takes place. The 2D detection network not only provides the region in the form of bounding boxes but also the classification of the detected objects in the form of a \(k\)-dimensional class vector. Though the specific implementations vary greatly, the abstract idea of region proposal, usage of 2D data and object detection/recognition at the end is visible in both Deep Sliding Shapes and Frustum PointNets.
% subsection positive_aspect (end)

\subsection{Paper Weaknesses} % (fold)
\label{sub:paper_weaknesses}
That said, there are things to criticize about this paper. The information about the network structure is spread over two figures and several sections of the paper, with no guarantee that no information is missing. The evaluation sections are inconsistent in their structure. The first section, about the object proposal evaluation, follows the rest of the paper and is written in continuous text: it describes the compared methods and then discusses the results. The second section, regarding the object detection evaluation, is written completely differently: there is no continuous text and the compared methods are not really described. Instead the section is largely used to justify the chosen design. If there were an introductory text explaining the motivation for this kind of evaluation and guiding the reader through the process, this would not be a problem. Currently, however, no explanation is given why the detection evaluation starts with the feature encoding and is followed by the design justification.

Furthermore the motivations for using the NYUv2 and SUN RGB-D data sets are not quite clear. Which data set is used for what purpose and why? The text mentions in one sentence that the amodal bounding boxes are obtained from SUN RGB-D, without further explanation. It would have been advantageous if the actual process of this "obtaining" had been explained. Lastly, no information regarding the training, validation and testing data split is available. While this implementation information does not have to be inside the paper proper, it should at least have been included in an appendix to make an independent replication of the results possible.

While not directly a problem with the paper itself, the decision to implement a software framework from scratch (the Marvin framework) rather than using a proven existing one like Tensorflow makes it more difficult to utilize the pretrained models, which are indeed available, and more importantly to adapt Deep Sliding Shapes to other data sets and problems. To top it all off, the available Matlab "glue" code is not well documented.
% subsection negative (end)
% section review (end)

\section{Conclusion}
Deep Sliding Shapes introduces a 3D convolutional network pipeline for amodal 3D object detection. This pipeline consists of a region proposal network and a joint 2D and 3D object recognition network. Experimental results show that this approach delivers better results than previous state-of-the-art methods.

The proposed approach introduced an important general structure for networks working with 3D data, which is visible, on a high level, in more recent networks utilizing 3D data as well. On the practical side, the custom code framework and the badly documented code make it very difficult to replicate the results independently or to adapt Deep Sliding Shapes to other problems. In short: good theory, bad practical implementation.

In future work this method should be compared to other 3D-centric object detection approaches like Frustum PointNets\cite{Qi2017}. Especially a structural comparison with other 3D approaches would be interesting, to see whether a best-practice structure for handling 3D data is emerging.

\newpage
\printbibliography
\addcontentsline{toc}{section}{Bibliography} % Add to the TOC
\end{document}