Abstract:Abstract: Automatic fruit recognition is one of the most important steps in fruit picking robots. In this study, a novel fruit recognition was proposed using improved YOLOv3, in order to identify the fruit quickly and accurately for the picking robot in the complex environment of the orchard (different light, occlusion, adhesion, large field of view, bagging, whether the fruit was mature or not). The specific procedure was as follows. 1) 4000 Apple images were captured under the complex environment via the orchard shooting and Internet collection. After labeling with LabelImg software, 3200 images were randomly selected as training set, 400 as verification set, and 400 as a test set. Mosaic data enhancement was also embedded in the model to improve the input images for the better generalization ability and robustness of model. 2) The network model was also improved. First, the residual module in the DarkNet53 network was combined with the CSPNet to reduce the amount of network calculation, while maintaining the detection accuracy. Second, the SPP module was added to the detection network of the original YOLOv3 model, further to fuse the global and local characteristics of fruits, in order to enhance the recall rate of model to the minimal fruit target. Third, a soft NMS was used to replace the traditional for better recognition ability of model, particularly for the overlapping fruits. Forth, the joint loss function using Focal and CIoU Loss was used to optimize the model for higher accuracy of recognition. 3) The model was finally trained in the deep learning environment of a server, thereby analyzing the training process after the dataset production and network construction. Optimal weights and parameters were achieved, according to the loss curve and various performance indexes of verification set. The results showed that the best performance was achieved, when training to the 109th epoch, where the obtained weight in this round was taken as the final model weight, precision was 94.1%, recall was 90.6%, F1 was 92.3%, mean average precision was 96.1%. Then, the test set is used to test the optimal model. The experimental results show that the Mean Average Precision value reached 96.3%, which is higher than 92.5% of the original model; F1 value reached 91.8%, higher than 88.0% of the original model; The average detection speed of video stream under GPU is 27.8 frame/s, which is higher than 22.2 frame/s of the original model. Furthermore, it was found that the best comprehensive performance was achieved to verify the effectiveness of the improvement compared with four advanced detection of Faster RCNN, RetinaNet, YOLOv5 and CenterNet. A comparison experiment was conducted under different fruit numbers and various lighting environments, further to verify the effectiveness and feasibility of the improved model. Correspondingly, the detection performance of model was significantly better for small target apples and severely occluded overlapping apples, compared with the improved YOLOv3 model, indicating the high effectiveness. In addition, the target detection using deep learning was robust to illumination, where the illumination change presented little impact on the detection performance. Consequently, the excellent detection, robustness and real-time performance can widely be expected to serve as an important support for accurate fruit recognition in complex environment.