Abstract: Wheat is one of the most widely cultivated staple food crops in the world, and accurate yield prediction has profound implications for food security. Deep learning can be used to detect and count wheat ears and thereby rapidly predict wheat yields. However, challenges remain, including low detection accuracy and large numbers of model parameters in complex agricultural environments. This study proposes a lightweight wheat ear detection model, RT-WEDT (Real-Time Wheat Ear Detection Transformer), based on RT-DETR. Firstly, EfficientFormerV2 was selected as the backbone of RT-WEDT to capture both the long-range and local features of wheat ear images with high computational efficiency. Secondly, a multiscale enhanced hybrid encoder (MSEHE) was introduced, taking as input the feature maps at four scales output by the four downsampling stages of the backbone. The MSEHE consisted of three sub-modules: the attention-based intra-scale feature interaction (AIFI) module acted on the smallest feature map to extract global features of the image; the scale sequence feature fusion (SSFF) module used multiscale fusion and 3D convolution to extract information about wheat ear targets at different scales; and the outputs of these two modules were fed into the enhanced feature fusion module (EFFM), which integrated the global and local information of the wheat ear image. Additionally, the WIoUv3 loss function was adopted as the bounding box regression loss to improve anchor box quality and the localization accuracy for wheat ear targets. Experiments were conducted on the Global Wheat Head Detection dataset. The results demonstrate that the RT-WEDT model had 12M parameters, a computational cost of 33.1 GFLOPs, an average precision of 90.2%, and a detection speed of 79.7 frames/s. Compared with RT-DETR, RT-WEDT had 62.5% fewer parameters and 68% fewer floating-point operations, with an AP50-95 increase of 0.6%, an AP50 increase of 0.5%, and a detection speed increase of 22.4%. Compared with YOLOv5, YOLOv8, and YOLOX models of similar parameter scale, the AP50-95 was improved by 8.2%, 2.4%, and 1.7%, and the AP50 by 4.6%, 1.1%, and 0.7%, respectively. Furthermore, samples from the Global Wheat Head Detection dataset were categorized by scenario, and performance was evaluated on wheat ear targets in each scenario. The experimental results indicate that dense and overlapping wheat ears were the most significant factor affecting model performance, followed by image blurriness, while the light intensity during image capture had a minimal effect on detection. To verify the robustness of the improved RT-WEDT, a drone-perspective wheat spike dataset (DPWSD) covering two growth stages was constructed, and RT-WEDT was tested directly on it: 60.2% AP50-95 and 97.4% AP50 were achieved at the filling stage, and 61.0% AP50-95 and 96.1% AP50 at the maturity stage. Counting experiments were then conducted on the test sets of the Global Wheat Head Detection dataset and the self-built DPWSD to validate the counting effectiveness of RT-WEDT.
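For illustration, the following is a minimal PyTorch sketch of the MSEHE wiring described above. The module names (AIFI, SSFF, EFFM) come from this paper, but every internal detail, including the channel width, attention heads, nearest-neighbor resizing, and the concat-then-project fusion in EFFM, is an assumption made for the sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AIFI(nn.Module):
    """Attention-based intra-scale feature interaction: global self-attention
    applied to the smallest (lowest-resolution) feature map."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        out, _ = self.attn(seq, seq, seq)         # intra-scale self-attention
        seq = self.norm(seq + out)                # residual + layer norm
        return seq.transpose(1, 2).reshape(b, c, h, w)


class SSFF(nn.Module):
    """Scale sequence feature fusion: resize the multiscale maps to a common
    size, stack them along a new 'scale' axis, and mix with a 3D convolution."""
    def __init__(self, dim, num_scales=4):
        super().__init__()
        self.conv3d = nn.Conv3d(dim, dim, kernel_size=(num_scales, 3, 3),
                                padding=(0, 1, 1))

    def forward(self, feats):                     # list of (B, dim, Hi, Wi) maps
        size = feats[0].shape[-2:]                # spatial size of the largest map
        resized = [F.interpolate(f, size=size, mode="nearest") for f in feats]
        stacked = torch.stack(resized, dim=2)     # (B, dim, S, H, W)
        return self.conv3d(stacked).squeeze(2)    # collapse the scale axis


class EFFM(nn.Module):
    """Enhanced feature fusion module, here a simple concat-and-project
    stand-in that merges the global (AIFI) and multiscale (SSFF) streams."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, global_feat, scale_feat):
        g = F.interpolate(global_feat, size=scale_feat.shape[-2:], mode="nearest")
        return self.proj(torch.cat([g, scale_feat], dim=1))


# Toy forward pass: four backbone maps, assumed already projected to 128 channels.
dim = 128
feats = [torch.randn(1, dim, s, s) for s in (80, 40, 20, 10)]
fused = EFFM(dim)(AIFI(dim)(feats[-1]), SSFF(dim)(feats))
print(fused.shape)                                # torch.Size([1, 128, 80, 80])
```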
The R2 values of RT-WEDT on the Global Wheat Head Detection dataset and the DPWSD were 0.94 and 0.95, respectively, indicating an excellent fit between predicted and actual counts. Therefore, RT-WEDT is highly accurate for wheat ear detection and counting. The improved model significantly reduced model complexity while maintaining a high average precision, enabling real-time detection of wheat ears. These findings can provide technical support for the efficient and rapid estimation of wheat yields in smart agriculture.
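As a companion illustration of the counting evaluation, the sketch below shows one common way to turn per-image detections into counts and score them against manual annotations with R2. The confidence threshold of 0.5 and the example counts are assumptions for the sketch, not values reported in the paper.

```python
import numpy as np


def count_ears(scores: np.ndarray, conf_thresh: float = 0.5) -> int:
    """Count the detections whose confidence exceeds the threshold."""
    return int((scores > conf_thresh).sum())


def r_squared(pred: np.ndarray, actual: np.ndarray) -> float:
    """Coefficient of determination between predicted and actual counts."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Example: per-image predicted vs. manually annotated counts (made-up data).
pred = np.array([38, 41, 55, 29, 47], dtype=float)
actual = np.array([40, 39, 57, 30, 45], dtype=float)
print(f"R^2 = {r_squared(pred, actual):.3f}")
```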