Abstract: Automatic driving of agricultural machinery has drawn increasing attention in recent years, particularly with the development of precision farming and the improvement of sensor technologies. An autonomous driving system consists of four parts: positioning, perception, decision-making, and control. In perception, road recognition aims to extract the drivable area for the safe driving of agricultural machinery. However, field roads have no obvious lane markings or signs, and their borders are irregular in shape and often shaded by trees. These features make field roads harder to identify than structured urban roads. In road recognition, semantic segmentation classifies each pixel of the collected road images as either background or road, a binary classification task, to extract the drivable area. In this study, data were collected in spring and summer in Yufa Town, Daxing District, Beijing, China. A stereo camera was fixed on the agricultural machine at a height of 1.2 m to collect image data. The mounting position ensured that the camera was firm, reliable, and unobscured during driving. The driving speed of the agricultural machinery was about 5 km/h during data collection. The field roads included semi-structured and unstructured roads. Data were collected in sunny weather. Collection took about 4 hours, and a total of 1 600 images were captured. The images were divided into training and test sets at a ratio of 4:1. The open-source software Labelme was used for image labeling. UNet was selected as the base network because of its simplicity, its suitability for binary classification, and its good performance when trained on a small data set. Three improvements to UNet were proposed. 1) An identity mapping channel was established between every two convolutions, and the residual was constructed by pixel-wise addition.
The residual connection alleviated vanishing and exploding gradients during training and eased the training of the deep neural network. 2) A structure fusing convolution and max pooling was established to replace the max pooling layer in UNet. It retained as much useful information from the original image as possible when halving the feature map, which significantly improved the segmentation of small-area features. However, the additional convolution operations increased the number of trainable parameters, so the inference time of the model became much longer. 3) ACBlock uses an asymmetric convolution structure that increases the weight of the "skeleton" of the convolution kernel to improve the efficiency of feature extraction. Inspired by ACBlock, DACBlock was proposed using dilated convolution, which further expanded the receptive field of the convolution feature map. ACBlock and DACBlock were used to replace the 3×3 convolution kernels in UNet, which significantly improved the segmentation accuracy of road edge shapes. Hierarchical fusion and batch normalization were applied at the inference stage so that the number of parameters and the inference time stayed the same as those of the original structure. The improved UNet achieved an IoU of 85.03% for field road segmentation, higher than the original UNet, ResUNet, and UNet3+. Recognition accuracy was relatively lower at road junctions under cloudy weather, because of insufficient light and occlusion. After rain, water remained in the middle of the road, and specular reflection from the water surface increased the road segmentation error. Under good light, as well as under weak light in the evening and in shade, the road segmentation performed well, which supports the safe driving of agricultural machinery. The segmentation accuracies of distant roads and road edges were also significantly better than those of other networks.
Moreover, the average inference time of the model was 163 ms, meeting the time requirements of automatic driving in agricultural machinery.
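The IoU figure reported above can be illustrated with a minimal sketch. It assumes the metric is the intersection over union of the road class between a predicted and a labeled binary mask; the abstract does not state how the value was averaged over images or classes. The function name `binary_iou` and the toy masks are hypothetical.

```python
def binary_iou(pred, target):
    """Intersection over union of the road class (value 1) for two
    same-shaped binary masks given as nested lists of 0/1.
    Assumption: an empty union (no road in either mask) counts as IoU 1.0."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


# Toy 3x4 masks (hypothetical data): 1 = road, 0 = background.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]
target = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]

# 7 pixels are road in both masks, 9 pixels are road in at least one.
print(round(binary_iou(pred, target), 3))  # 0.778
```

Under this definition a pixel counts against the score whenever the prediction and the label disagree on it, which is consistent with the abstract's observation that reflections from standing water raise the segmentation error.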
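The statement that fusion at the inference stage leaves the parameter count and inference time unchanged rests on the linearity of convolution. The sketch below assumes the usual ACBlock layout of a 3×3 branch plus 1×3 and 3×1 branches whose outputs are summed; batch normalization and the dilation used in DACBlock are omitted for brevity, and all function names are hypothetical. It checks numerically that the three-branch output equals a single 3×3 convolution with a fused kernel.

```python
import random


def conv2d_same(image, kernel):
    """Cross-correlate a 2-D image with an odd-sized kernel, zero padding,
    stride 1, so the output has the same size as the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    y, x = i + u - ph, j + v - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[u][v]
            out[i][j] = acc
    return out


def fuse_branches(k_square, k_horizontal, k_vertical):
    """Add the 1x3 kernel onto the middle row and the 3x1 kernel onto the
    middle column of the 3x3 kernel (the "skeleton" positions)."""
    fused = [row[:] for row in k_square]
    for v in range(3):
        fused[1][v] += k_horizontal[0][v]
    for u in range(3):
        fused[u][1] += k_vertical[u][0]
    return fused


def add_maps(*maps):
    """Element-wise sum of equally sized 2-D feature maps."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = random_matrix(rng, 6, 7)   # hypothetical single-channel feature map
k3x3 = random_matrix(rng, 3, 3)
k1x3 = random_matrix(rng, 1, 3)
k3x1 = random_matrix(rng, 3, 1)

# Training-time view: three parallel branches, outputs summed.
three_branch = add_maps(
    conv2d_same(image, k3x3),
    conv2d_same(image, k1x3),
    conv2d_same(image, k3x1),
)

# Inference-time view: one 3x3 convolution with the fused kernel.
fused_kernel = fuse_branches(k3x3, k1x3, k3x1)
single_branch = conv2d_same(image, fused_kernel)

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(three_branch, single_branch)
    for a, b in zip(row_a, row_b)
)
print(max_diff < 1e-9)  # True
```

Because the fused kernel is still 3×3, the deployed layer has the same shape and cost as a plain 3×3 convolution, while the extra weight placed on the middle row and column reflects the strengthened "skeleton" described in the abstract.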