Abstract: The complex working environment of picking robots limits picking speed and on-device memory in the intelligent harvesting of Lingwu long jujubes. The visual recognition system therefore needs a lighter network structure together with higher detection accuracy. At present, almost all object detection models load a pre-trained model because it offers good initialization and fast convergence. However, two challenges remain: 1) the network structure cannot be changed to suit the limited memory resources of the device; 2) the ImageNet dataset may differ greatly from the dataset to be trained, which degrades training. Taking the SSD model as the basic framework, this research proposes a lightweight object detection method for images of Lingwu long jujubes that achieves excellent performance without loading a pre-trained model. Firstly, data augmentation was performed on the 1 000 collected images to obtain 5 000 images. The augmentation operations included random cropping, random vertical or horizontal flipping, and random adjustment of brightness, contrast, and saturation. Secondly, the Lingwu long jujube dataset was established, comprising 3 500 training images and 1 500 test images. The image resolutions were 3 016×4 032, 4 068×3 456, and 2 448×3 264. The images were acquired with HUAWEI TRT-AL00A, Vivo Y79A, and Xiaomi 2014501 smartphones. The images were uniformly scaled to 300×300 to meet the input size requirement of SSD object detection. The PASCAL VOC dataset format was adopted.
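The augmentation and rescaling steps described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names, the minimum crop fraction, and the adjustment range of 0.8 to 1.2 are assumptions, and nearest-neighbour sampling stands in for whatever interpolation was actually used. The sketch transforms pixels only, so any bounding-box annotations would need the same crop and flip applied.

```python
import random

# An image is a list of rows; each row is a list of (r, g, b) tuples in 0..255.

def clamp(v):
    """Round a channel value and clip it to the 0..255 range."""
    return max(0, min(255, int(round(v))))

def random_crop(img, rng, min_frac=0.6):
    """Crop a random window keeping at least min_frac of each side (assumed value)."""
    h, w = len(img), len(img[0])
    ch = rng.randint(max(1, int(h * min_frac)), h)
    cw = rng.randint(max(1, int(w * min_frac)), w)
    top = rng.randint(0, h - ch)
    left = rng.randint(0, w - cw)
    return [row[left:left + cw] for row in img[top:top + ch]]

def random_flip(img, rng):
    """Flip vertically or horizontally, chosen with equal probability."""
    if rng.random() < 0.5:
        return img[::-1]                   # vertical flip: reverse the row order
    return [row[::-1] for row in img]      # horizontal flip: reverse each row

def adjust_brightness(img, factor):
    """Scale every channel by factor."""
    return [[tuple(clamp(c * factor) for c in px) for px in row] for row in img]

def adjust_contrast(img, factor):
    """Scale each channel's distance from the mean gray level of the image."""
    pixels = [px for row in img for px in row]
    mean = sum(sum(px) / 3 for px in pixels) / len(pixels)
    return [[tuple(clamp(mean + (c - mean) * factor) for c in px) for px in row]
            for row in img]

def adjust_saturation(img, factor):
    """Scale each channel's distance from the pixel's own gray value."""
    out = []
    for row in img:
        new_row = []
        for px in row:
            gray = sum(px) / 3
            new_row.append(tuple(clamp(gray + (c - gray) * factor) for c in px))
        out.append(new_row)
    return out

def resize_nearest(img, size=300):
    """Rescale to size x size with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def augment(img, rng, low=0.8, high=1.2):
    """Apply the augmentation operations in sequence, then scale to 300x300."""
    img = random_crop(img, rng)
    img = random_flip(img, rng)
    img = adjust_brightness(img, rng.uniform(low, high))
    img = adjust_contrast(img, rng.uniform(low, high))
    img = adjust_saturation(img, rng.uniform(low, high))
    return resize_nearest(img, 300)
```

Passing a seeded `random.Random` instance as `rng` makes each augmented variant reproducible, which helps when regenerating a dataset.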
Labelling software was used to label the images, and the annotations were stored in the label folder in XML format. Thirdly, the improved DenseNet used Convolutional Block Attention Modules and two dense blocks with 6 and 8 convolution groups. With the improved DenseNet as the backbone network, the improved SSD model was obtained by combining it with a multi-level fusion structure and replacing the first three additional layers of the SSD model with the Inception module. Without loading a pre-trained model, the improved SSD model achieved an mAP of 96.60%, a detection speed of 28.05 frames/s, and 1.99×10^6 parameters. Its mAP was 2.02 and 0.05 percentage points higher than those of the SSD model and the SSD model with a pre-trained model loaded, respectively, and it had 11.14×10^6 fewer parameters than the SSD model, meeting the requirements of a lightweight network without loading a pre-trained model. This finding can provide visual technical support for the intelligent harvesting of Lingwu long jujubes, and may extend to detection tasks on medical and multispectral images.
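The reported comparison can be unpacked with a little arithmetic. The sketch below takes only the figures stated in the abstract as inputs; the baseline values it produces are derived by subtraction or addition and are not quoted from the paper. The variable names are illustrative.

```python
# Figures reported in the abstract for the improved SSD model.
improved_map = 96.60         # mAP in percent
improved_params = 1.99e6     # number of parameters
fps = 28.05                  # detection speed in frames per second

# Reported differences relative to the baselines.
gain_vs_ssd = 2.02           # percentage points over SSD
gain_vs_pretrained = 0.05    # percentage points over SSD with a pre-trained model
param_saving = 11.14e6       # parameters saved relative to SSD

# Derived values (implied by the figures above, not stated in the abstract).
ssd_map = improved_map - gain_vs_ssd
pretrained_ssd_map = improved_map - gain_vs_pretrained
ssd_params = improved_params + param_saving
param_reduction = param_saving / ssd_params   # fraction of SSD parameters removed
latency_ms = 1000.0 / fps                     # average time per frame

print(f"Implied SSD mAP: {ssd_map:.2f}%")
print(f"Implied pre-trained SSD mAP: {pretrained_ssd_map:.2f}%")
print(f"Implied SSD parameters: {ssd_params / 1e6:.2f} x 10^6")
print(f"Parameter reduction: {param_reduction:.1%}")
print(f"Per-frame latency: {latency_ms:.2f} ms")
```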