Abstract: Accurate detection of young fruits is critical for obtaining growth data, particularly in the high-throughput, automatic acquisition of phenotypic information that underpins fruit tree breeding. Young fruits are small and close in color to the surrounding leaves, which makes them difficult to detect with deep learning models. In this study, an improved YOLOv4 network (YOLOv4-SENL) was proposed to achieve highly efficient detection of young apples in the natural environment. Squeeze-and-Excitation (SE) and Non-Local (NL) attention blocks were combined in the network. The backbone of YOLOv4 was used to extract high-level features, while the SE block reorganized and consolidated these features along the channel dimension to enhance channel information. NL blocks were added to the three paths of the improved path aggregation network (PAN), combining non-local information with the local information obtained by convolution to strengthen the features. The two visual attention mechanisms (SE and NL blocks) thus re-integrated high-level features from both the channel and non-local perspectives, emphasizing channel information and long-range dependencies in the features and improving the network's ability to distinguish fruit from background. Finally, bounding-box regression and classification were performed on feature maps of different sizes to locate young apples. During training, the backbone was initialized with weights pre-trained on the MS COCO dataset, and stochastic gradient descent (SGD) was used to update the parameters. The hyperparameters were set as follows: initial learning rate 0.01, 350 training epochs, weight decay 0.000484, and momentum 0.937.
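The two attention mechanisms described above can be sketched as follows. These are minimal NumPy illustrations, not the paper's implementation: the function names, weight shapes (a bottleneck ratio for the SE block, embedding projections for the NL block), and the embedded-Gaussian form of the NL block are assumptions for the sketch.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation: re-weight the channels of a (C, H, W) map.
    w1: (C//r, C) and w2: (C, C//r) form the bottleneck MLP (r = reduction)."""
    z = x.mean(axis=(1, 2))                        # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z + b1, 0.0)               # excitation: ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))    # per-channel sigmoid weights in (0, 1)
    return x * gate[:, None, None]                 # scale: re-weight each channel

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Non-Local (embedded-Gaussian) block: each position attends to all others.
    w_theta/w_phi/w_g: (C2, C) embeddings; w_out: (C, C2) output projection."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                     # (C, N) with N = H*W positions
    theta, phi, g = w_theta @ flat, w_phi @ flat, w_g @ flat
    attn = theta.T @ phi                           # (N, N) pairwise similarities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over all positions
    y = g @ attn.T                                 # aggregate features from every position
    return x + (w_out @ y).reshape(C, H, W)        # residual connection keeps local features
```

The SE gate captures the channel re-weighting, and the residual non-local output models the long-range spatial dependencies that plain convolutions miss.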
A total of 3,000 images covering young fruits at different stages and under various interference factors were collected in the natural environment, providing abundant samples. Four metrics were selected to evaluate the models: precision, recall, F1 score, and average precision (AP). The network was trained on 1,920 images; on the 600-image test set its average precision reached 96.9%, which was 6.9, 1.5, and 0.2 percentage points higher than that of the SSD, Faster R-CNN, and YOLOv4 models, respectively. The YOLOv4-SENL model was 69 M larger than the SSD model, 59 M smaller than the Faster R-CNN model, and 11 M larger than the YOLOv4 model, indicating that young apples were detected accurately at a moderate model size. Ablation experiments on the 480-image validation set showed that retaining only the SE block in YOLOv4-SENL improved precision by 3.8 percentage points over YOLOv4, retaining only the three NL blocks improved it by 2.7 percentage points, and retaining both the SE and NL blocks improved it by 4.1 percentage points. These results indicated that the two visual attention mechanisms significantly improved the network's perception of young apples with only a small increase in parameters. The findings can provide a reference for obtaining growth information in fruit breeding.
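The first three evaluation metrics reduce to simple formulas over true positives, false positives, and false negatives; AP additionally averages precision over the recall curve. The sketch below is illustrative (the function name and the IoU threshold mentioned in the comment are assumptions, not taken from the paper).

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts. A prediction is a
    true positive when its IoU with a ground-truth box exceeds a threshold
    (0.5 is a common choice); AP would further integrate precision over recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, 8 true positives with 2 false positives and 2 missed fruits give precision = recall = F1 = 0.8.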