Abstract: The evaluation of tea quality directly determines its market value in the tea industry. In recent years, sensory evaluation has been widely combined with physicochemical analysis to assess tea quality. However, sensory evaluation is subjective and prone to high error and poor repeatability. Physicochemical approaches are also limited in tea quality assessment, as they are costly, time-consuming, and destructive. In this study, a computer vision-based approach was proposed to assess the appearance and sensory quality of Xinchang's renowned flat green tea. Characteristics of tea that complicate detection, such as small target scale, high density, and weak feature saliency, were also considered. A sensory quality detection model for tea shape was established using the YOLOv5s deep learning network and machine vision. A three-band dilated convolution (TDC) structure with enlarged receptive fields was introduced into the backbone network to enhance the extraction of tea features. Additionally, the Convolutional Block Attention Module (CBAM) was introduced to locate the attention area in dense scenes using channel and spatial attention, which improved the local perception of the network and the detection accuracy of small-scale tea targets. Furthermore, the Swin Transformer structure was introduced in the feature fusion stage, where windowed self-attention enhanced the semantic information and feature representation of small targets. Finally, positive sample matching was improved by dynamically allocating positive samples with the SimOTA strategy, which assigned an optimal matching box to each positive tea sample to raise the training efficiency and detection accuracy of the network. Ablation experiments were performed on a self-built tea dataset. The results show that the modified model significantly improved the average accuracy of target detection on tea images. The improved YOLOv5 produced higher confidence scores in tea quality detection than the conventional model and achieved higher localization accuracy. Detection accuracy increased by 3.8 percentage points on the applied dataset, greatly reducing false detections. The mean Average Precision (mAP) across classes reached 91.9%, and the frame rate reached 51 frames/s, an improvement of 7 frames/s. Compared with current mainstream target detection models, the proposed model achieved higher recognition accuracy and speed with excellent real-time performance, demonstrating its feasibility and superiority. These findings can provide a strong reference for improving quality detection in the tea market. In conclusion, the YOLOv5s-based computer vision approach can be expected to serve as a novel and effective way to assess the appearance and sensory quality of tea, with better accuracy, speed, and efficiency for the tea industry.
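
To make the attention mechanism mentioned above concrete, the following is a minimal PyTorch sketch of a CBAM-style block, in which channel attention is applied first and spatial attention second to refine a backbone feature map. This is an illustrative sketch under common CBAM conventions, not the code released with this study; the module names, reduction ratio, kernel size, and tensor shapes are assumptions for illustration.

    # Minimal CBAM-style attention sketch (illustrative only; not the authors' code).
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
            self.max_pool = nn.AdaptiveMaxPool2d(1)
            # Shared MLP applied to both pooled channel descriptors.
            self.mlp = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1, bias=False),
            )

        def forward(self, x):
            attn = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
            return x * attn

    class SpatialAttention(nn.Module):
        def __init__(self, kernel_size: int = 7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

        def forward(self, x):
            # Pool along the channel axis, then learn a 2-D attention map.
            avg_map = torch.mean(x, dim=1, keepdim=True)
            max_map, _ = torch.max(x, dim=1, keepdim=True)
            attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
            return x * attn

    class CBAM(nn.Module):
        """Channel attention followed by spatial attention."""
        def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
            super().__init__()
            self.channel_attention = ChannelAttention(channels, reduction)
            self.spatial_attention = SpatialAttention(kernel_size)

        def forward(self, x):
            return self.spatial_attention(self.channel_attention(x))

    if __name__ == "__main__":
        # Example: refine a hypothetical 256-channel backbone feature map (1 x 256 x 80 x 80).
        feats = torch.randn(1, 256, 80, 80)
        refined = CBAM(256)(feats)
        print(refined.shape)  # torch.Size([1, 256, 80, 80])

In a YOLOv5-style backbone, such a block would typically be inserted after selected convolutional stages so that the attended feature map keeps the same shape and can be passed on to the neck unchanged.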