An Evaluation of Deep Learning Methods for Small Object Detection

One-stage methods such as YOLO use soft sampling, which updates the parameters with the whole dataset rather than with a selected subset of the training data. First of all, small objects can appear in many more places than other objects because of their size, so detectors are easily confused when they have to spot these objects among the many others located around them, some of which have the same size or appearance. Therefore, to partly fix this problem, the one-stage approach allows us to choose a fixed input size for training and testing, but how much this helps still depends on the characteristics of the evaluated datasets and on the image size. The final output is created by applying a 1 × 1 kernel on a feature map. YOLO is the only one which is able to run in real time. Illustration of (a) objects such as a bus, planes, or cars that are physically large but occupy small parts of an image taken from [. In addition, we tried to increase the input resolution of Darknet-53 from 608 to 1024, and the mAP decreases when the resolution is over 608 × 608. An Evaluation of Deep Learning Methods for Small Object Detection, University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam. If a traffic sign is square, it is a small object when the width of its bounding box is less than 20% of the image width and the height of the bounding box is less than the height of the image. Each location applies 3 anchor boxes; hence, there are more bounding boxes per image.
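To make the remark about 3 anchor boxes per location concrete, the following is a minimal sketch that counts candidate boxes for a square input. The three strides (32, 16, 8) are an assumption taken from the usual YOLOv3 configuration and are not stated in the text above; the function name is hypothetical.

```python
def yolov3_box_count(input_size, strides=(32, 16, 8), anchors_per_cell=3):
    """Count candidate boxes for a square input image.

    The strides are an assumption (typical YOLOv3 setup); the text only
    states that each location applies 3 anchor boxes.
    """
    total = 0
    for stride in strides:
        cells = input_size // stride  # side length of the grid at this scale
        total += cells * cells * anchors_per_cell
    return total


# Number of candidate boxes at two common input resolutions.
for size in (416, 608):
    print(size, yolov3_box_count(size))
```

Under these assumptions the count grows quadratically with the input resolution, which is one reason higher resolutions cost more time per image.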
Therefore, in this work, we choose the small object dataset [13] and our filtered dataset for the evaluation because these datasets contain common objects and a large number of images, which makes the evaluation objective. This also holds in the context of the small object dataset. In the one-stage approach, methods which allow multiple input sizes, like YOLO and SSD, fall into 2 kinds: those that can run in real time and those that cannot, the latter being the case when the resolution is over 640 for YOLO or 512 for SSD. Case in point, ... use a 3 × 3 convolutional filter to evaluate a small set of default bounding boxes. VOC2007_WH_0.2 contains objects whose width and height are less than 20% of an image's width and height. After deep features are obtained from the early convolutional layers, the RPN comes into play and slides windows over the feature map to extract features for each region proposal. In R-CNN, the low-level image features (e.g., HOG) are replaced with CNN features, which are arguably more discriminative representations. When humans look at images or video, we can recognize and locate objects of interest within a matter of moments. As a result, these problems increase the number of false positives. The major contribution of Fast R-CNN is a new training method that fixes the drawbacks of R-CNN and SPP-net while improving on their speed and accuracy. We provide a profound assessment of the advantages and limitations of the models.
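The size criterion described above for VOC2007_WH_0.2 can be sketched as a simple filter. This is only an illustration under assumptions: the box format `(xmin, ymin, xmax, ymax)`, the annotation dictionary layout, and the function names are hypothetical and not taken from the paper.

```python
def is_small(box, image_size, ratio=0.2):
    """Return True if both box sides are below `ratio` of the image sides.

    box: (xmin, ymin, xmax, ymax) in pixels (assumed format).
    image_size: (width, height) in pixels.
    """
    xmin, ymin, xmax, ymax = box
    img_w, img_h = image_size
    return (xmax - xmin) < ratio * img_w and (ymax - ymin) < ratio * img_h


def filter_small(annotations, image_size, ratio=0.2):
    """Keep only annotations whose 'bbox' passes the size criterion."""
    return [a for a in annotations if is_small(a["bbox"], image_size, ratio)]


# Hypothetical annotations for a 500 x 375 image.
annotations = [
    {"label": "bottle", "bbox": (10, 10, 60, 50)},   # 50 x 40 pixels
    {"label": "sofa", "bbox": (0, 0, 300, 200)},     # 300 x 200 pixels
]
small_only = filter_small(annotations, (500, 375))
```

With the default ratio of 0.2, only the first annotation is kept, since the second box is wider than 20% of the image width.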
Currently, the datasets most commonly used in object detection are PASCAL VOC [11] and COCO [12]. However, Fast RCNN and Faster RCNN with two kinds of RoIs are much better. Like the original YOLO, YOLOv2 runs on different fixed sizes of an input image, but it introduced several new training methods for object detection and classification, such as batch normalization, multiscale training with higher-resolution input images, predicting the final detection on a higher spatial output, and using good default bounding boxes instead of fully connected layers. Following this idea, we conducted a small survey of existing datasets and found that PASCAL VOC shares categories with the COCO and SUN datasets, which contain small objects of various categories. When a network becomes deeper, that is, when it has more layers than usual, it has a massive number of parameters to train. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Because softmax forces the class scores to sum to 1 and therefore assumes that each box has exactly one label, YOLOv3 replaces softmax with independent logistic classifiers for class prediction, each of which estimates the likelihood that the input belongs to a specific label. Although RetinaNet belongs to the one-stage approach, it is not fast enough for real-time detection. When we switch to the two-stage approaches, Faster RCNN improves significantly over Fast RCNN at most scales, except for objects in VOC_MRA_0.20, where the two have the same accuracy.
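The difference between a softmax classifier and independent logistic classifiers, as mentioned above for YOLOv3, can be illustrated with a short sketch. The logits and the label names in the comment are made up for illustration; this is not YOLOv3's actual prediction code.

```python
import math


def softmax(logits):
    """Mutually exclusive class probabilities; they always sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent logistic score for a single label."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logits for overlapping labels, e.g. "person", "woman", "car".
logits = [2.0, 1.8, -3.0]

softmax_scores = softmax(logits)
logistic_scores = [sigmoid(z) for z in logits]

print(sum(softmax_scores))   # forced to sum to 1
print(sum(logistic_scores))  # can exceed 1 when labels overlap
```

With independent logistic scores, two overlapping labels can both receive a high score, whereas softmax has to split the probability mass between them.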
Although SSD significantly improves object detection by integrating the parts above, it is not good at detecting small objects; this can be improved by adding deconvolution layers with skip connections to introduce additional large-scale context [28]. Besides, we choose RetinaNet to make comparisons between models in the same approach. First, among the two-stage approaches, Faster RCNN, which is an improvement of Fast RCNN, outperforms Fast RCNN by only about 1–2%, and only with ResNeXT backbones; it equals Fast RCNN for the rest. Similarly, SSD consists of 2 parts, namely, the extraction of feature maps and the use of convolution filters to detect objects. This is because YOLOv3 has 3 detection locations with more ratios of default boxes, and combining the results from the 3 locations leads to a significant gain. RetinaNet was proposed to deal with the imbalance between foreground and background by means of the focal loss. When it comes to backbones, we have to consider the data in order to choose a reasonable backbone to combine with the methods. Although Faster RCNN is the only model that was evaluated in our previous work, we want to evaluate it with different backbones to see how well they work when combined with Faster RCNN. Fig. 2 shows an example of such a model, where a model is trained on a dataset of closely cropped images of a car and predicts the probability of an image being a car. Especially in the automotive industry, smart cars, army projects, and smart transportation, data must be processed promptly and precisely to make sure that safety comes first. However, it is not as common as the others, so it is not included here. The comparison of consumption on small object dataset. For the task of detection, 53 more layers are stacked onto it, giving a 106-layer fully convolutional underlying architecture for YOLOv3.
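The focal loss that RetinaNet uses against the foreground/background imbalance can be sketched for a single binary prediction as follows. The form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) and the defaults γ = 2 and α = 0.25 come from the focal loss paper rather than from the text above, so treat them as assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the foreground class, in (0, 1).
    y: 1 for foreground, 0 for background.
    gamma, alpha: assumed defaults, not stated in the surrounding text.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background example contributes far less than a hard one.
easy_background = focal_loss(0.01, 0)
hard_background = focal_loss(0.90, 0)
print(easy_background, hard_background)
```

The factor (1 - p_t)^γ shrinks the loss of well-classified examples, so the many easy background locations no longer dominate training. Setting γ = 0 reduces the expression to an α-weighted cross entropy.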
These features are aggregates of the image. Copyright © 2020 Nhat-Duy Nguyen et al. After the VGG16 base network extracts the feature maps, SSD applies 3 × 3 convolution filters to each cell to predict objects. Object detection models are usually trained on a fixed set of classes, so the model would locate and classify only those classes in … Comparative results on small object dataset. We evaluated the models from 30k to 70k iterations, and generally, the performance of the models was not stable after 40k iterations. Hence, a lot of data is needed to fine-tune these parameters reasonably. When it comes to the backbones, we realized that Darknet-53 is the best among the one-stage and real-time methods and is even far better than ResNet-50, although it has a similar number of layers. Specifically, the convolutional network takes an image of any size and several RoIs as input. In terms of real-time detection, the one-stage methods such as YOLO and SSD use local information to predict objects, instead of using object proposals to get RoIs before moving to the classifier as two-stage approaches such as Faster R-CNN do. As a result, we have presented an in-depth evaluation of existing deep learning models in detecting small objects in our prior work [16]. Two of them have the same number of classes as PASCAL VOC 2007; the exception is VOC_MRA_0.58, which has four fewer classes: dining table, dog, sofa, and train. This trade-off is also partly affected by the resolution, since we change it when training or testing our models.
Besides, the features that come from the early layers of ResNet are not well generalized, because the accuracy improves by about 2–3% when they are combined with FPN. Small object detection, therefore, is a challenging task in computer vision because, apart from the small representations of objects, the diversity of input images also makes the task more difficult. In our previous work, we mentioned that we have to choose the right resolution for our models to work properly. Although the accuracy of VGG16 is not better than that of the other architectures, the difference is that its accuracy does not change much. Residual blocks and skip connections are very popular in ResNet and related approaches, and upsampling has recently also improved the recall, precision, and IoU metrics for object detection [25]. However, for the bigger objects in VOC_MRA_0.20, one-stage methods achieve better results than two-stage ones. The bounding boxes show that ResNet-50 is more sensitive than Darknet-53 to areas which resemble the objects of interest. Instead of applying RoIs to the input image and warping them to feed into the network at the first step like RCNN, Fast RCNN applies these RoIs to a feature map after several convolutional layers of the base network. The following methods, such as [2, 3, 15], are improved forms of R-CNN.
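The idea of applying RoIs on a feature map, as Fast RCNN does above, can be sketched with a small RoI max pooling routine that turns a region of any size into a fixed-size output. This is a simplified illustration, not the implementation used in the paper: the RoI is assumed to be already expressed in feature-map cells, and the function name is hypothetical.

```python
def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one RoI of a 2D feature map into an output_size x output_size grid.

    feature_map: list of rows of numbers.
    roi: (x0, y0, x1, y1) in feature-map cells, end-exclusive, at least 1 x 1.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(output_size):
        # Row range of bin i (floor for the start, ceil for the end).
        ys = y0 + (i * h) // output_size
        ye = max(ys + 1, y0 + ((i + 1) * h + output_size - 1) // output_size)
        row = []
        for j in range(output_size):
            # Column range of bin j.
            xs = x0 + (j * w) // output_size
            xe = max(xs + 1, x0 + ((j + 1) * w + output_size - 1) // output_size)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 4 x 4 feature map with values 0..15, shared by both RoIs below.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 4))    # 4 x 4 region -> 2 x 2 output
partial = roi_max_pool(fmap, (1, 0, 4, 3))  # 3 x 3 region -> 2 x 2 output
```

Both RoIs read from the same feature map, so the convolutional computation is done once per image, and each RoI still yields an output of the same fixed size for the later layers.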
To do this, these layers use fixed sliding windows that compute a predefined statistic, such as the maximum or the average of the values. The most important feature of RoI is the sharing of computation and memory in the forward and backward passes for RoIs from the same image. Following this visualization, the dominance of classes such as mouse or faucet results in misdetections in areas whose appearance is similar to theirs. This does not affect small objects if there are just a few such layers, but a CNN has many layers like this, which makes it very hard for small objects. Besides, most of the state-of-the-art detectors, in both one-stage and two-stage approaches, have struggled with detecting small objects. One reason for these problems is the difference in the way deep networks are trained [33]. The explanation is that YOLOv3 with Darknet-53 has several improvements over Darknet-19: YOLOv3 predicts objects at 3 scales, one of which is specialized in small objects, instead of only one as with Darknet-19, and it also integrates cutting-edge components such as residual blocks and shortcut connections. Specifically, Faster RCNN with the ResNeXT-101-64×4d-FPN backbone achieved the top mAP among the two-stage approaches and in the whole table, 41.2%. The deeper the architecture is, the higher the accuracy of detection is. Journal of Electrical and Computer Engineering.
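The effect of many downsampling layers on small objects, described above, can be made concrete with simple arithmetic. The sketch assumes that every layer halves the spatial resolution (stride 2), which is typical but not stated in the text; the object sizes are made up for illustration.

```python
def extent_after_downsampling(size_px, num_stride2_layers):
    """Approximate side length of an object on the feature map.

    Assumes each layer halves the spatial resolution (stride 2).
    """
    return size_px / (2 ** num_stride2_layers)


# A 20-pixel object versus a 320-pixel object after 0 to 5 stride-2 layers.
for n in range(6):
    print(n, extent_after_downsampling(20, n), extent_after_downsampling(320, n))
```

Under this assumption, a 20-pixel object covers less than one feature-map cell after five such layers, while a 320-pixel object still covers 10 cells, which is consistent with the observation that deep stacks of such layers are hard on small objects.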
The main advances in object detection were achieved thanks to improvements in object representations and machine learning models. Although its accuracy is lower than that of the two stronger backbones, VGG16 is still better on objects in VOC_WH20, and its accuracy changes only slightly when moving to objects of larger sizes. Improved from [1], Fast R-CNN [3] applies regions of interest (RoIs) to extract a fixed-length feature from the feature maps for each proposal.