Abstract: Automated and intelligent harvesting is an urgent task for the grape industry. However, current fruit recognition models struggle to balance accuracy against real-time performance. In this study, a lightweight, real-time semantic segmentation model based on a channel feature pyramid was proposed for field grape harvesting. Firstly, a publicly available field grape instance segmentation dataset was used as the experimental object. A total of 300 grape images were collected, covering different pruning periods, lighting conditions, and maturity levels. The LabelMe annotation tool was used to build the field grape dataset. Four classes were annotated: background, leaves, grapes, and stems. The dataset was then expanded using random data augmentation, resulting in a total of 1200 images. Because the original images were too large to be used for training directly, they were uniformly resized to 512×512 pixels to improve the training efficiency of the network. Secondly, because grapes vary greatly in size and location, convolutions with receptive fields of different sizes were required. The channel feature pyramid module was therefore used for feature extraction. Within a single channel, skip-connected 1×3 and 3×1 dilated (atrous) convolutions achieved multi-scale feature extraction at the 3×3, 5×5, and 7×7 scales. As such, multi-scale and contextual features were effectively extracted from the grape images while the number of model parameters was reduced. To add trainable parameters and reduce information loss, a pooling-convolution fusion structure was used for down-sampling instead of the traditional max pooling structure. Skip connections were employed in the decoder to fuse information from different feature layers and recover image details.
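The multi-scale idea described above can be illustrated with a little arithmetic. The sketch below is a minimal illustration, not the configuration used in this study: it assumes stride-1 convolutions stacked in sequence, equal input and output channel counts, no bias terms, and a hypothetical channel width of 16. The function names are introduced here for illustration only. It shows how one, two, and three stacked 3-wide convolutions cover 3×3, 5×5, and 7×7 receptive fields, and how factorizing each 3×3 kernel into a 1×3 and a 3×1 kernel lowers the weight count.

```python
def stacked_receptive_field(num_layers, kernel=3, dilation=1):
    """Receptive field along one axis of stride-1 convolutions stacked in sequence."""
    effective_kernel = dilation * (kernel - 1) + 1
    return 1 + num_layers * (effective_kernel - 1)


def conv_weights(kernel_h, kernel_w, channels):
    """Weight count of one convolution with `channels` inputs and outputs (bias ignored)."""
    return kernel_h * kernel_w * channels * channels


def factorized_pair_weights(kernel, channels):
    """Weight count of a 1 x k convolution followed by a k x 1 convolution."""
    return conv_weights(1, kernel, channels) + conv_weights(kernel, 1, channels)


CHANNELS = 16  # hypothetical channel width, chosen only for illustration

# One, two, and three stacked 3-wide stages give three scales.
fields = [stacked_receptive_field(n) for n in (1, 2, 3)]

# Three factorized stages (reused along the stack) versus three separate full kernels.
pyramid_weights = 3 * factorized_pair_weights(3, CHANNELS)
separate_weights = sum(conv_weights(k, k, CHANNELS) for k in (3, 5, 7))

print(fields, pyramid_weights, separate_weights)
```

Under these assumptions the stacked, factorized stages need far fewer weights than separate 3×3, 5×5, and 7×7 kernels. Passing a `dilation` greater than 1 enlarges each stage's reach without adding weights.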
Finally, the improved model was tested on a grape test set. The experimental results showed that the Mean Intersection over Union (MIoU) was 78.8%, the Mean Pixel Accuracy (MPA) was 90.3%, and the real-time processing speed was 68.56 frames/s. The model size was only 4.88 MB. Compared with the real-time semantic segmentation networks BiSeNet, ENet, and DFANet, the MIoU improved by 7.9, 5.7, and 10.5 percentage points, respectively. Compared with lightweight networks using MobileNetV3 and Inception as encoders, the accuracy of the improved model increased by 1.2 and 8.8 percentage points, respectively. Therefore, the proposed network showed a clear advantage in segmentation accuracy over the real-time and lightweight networks. Compared with the classical networks DeepLabv3+, SegNet, and UNet, the MIoU was lower by 2.3, 2.0, and 3.7 percentage points, respectively, but the model size was only 12.3%, 4.1%, and 7.4% of theirs, respectively. The improved model thus met the real-time requirement while achieving a good trade-off between speed and accuracy. It can be expected to serve for the segmentation and recognition of field grapes in smart agriculture. The findings can also provide technical support for the visual recognition systems of grape-picking robots.
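For readers unfamiliar with the two reported metrics, the sketch below computes MIoU and MPA from a confusion matrix using their standard definitions. It is a minimal illustration under stated assumptions, not the evaluation code of this study: the label order (0 background, 1 leaves, 2 grapes, 3 stems) and the eight-pixel toy example are hypothetical, and classes absent from both the ground truth and the prediction are skipped when averaging.

```python
def confusion_matrix(truth, pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix):
    """Average over classes of TP / (TP + FP + FN)."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(n)) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both truth and prediction
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


def mean_pixel_accuracy(matrix):
    """Average over classes of correctly labelled pixels / ground-truth pixels."""
    accuracies = []
    for c, row in enumerate(matrix):
        total = sum(row)
        if total > 0:  # skip classes with no ground-truth pixels
            accuracies.append(row[c] / total)
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


# Hypothetical flattened label maps: 0 background, 1 leaves, 2 grapes, 3 stems.
truth = [0, 0, 1, 1, 2, 2, 3, 3]
pred = [0, 0, 1, 2, 2, 2, 3, 0]

matrix = confusion_matrix(truth, pred, num_classes=4)
miou = mean_iou(matrix)
mpa = mean_pixel_accuracy(matrix)
print(f"MIoU = {miou:.4f}, MPA = {mpa:.4f}")
```

MIoU penalizes both false positives and false negatives for each class, whereas MPA only measures how many ground-truth pixels of each class were recovered. This is why a model's MPA is typically higher than its MIoU, as in the results reported here.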