Abstract: Sweet taste is one of the most important contributors to food flavor and quality, and sweet molecules that could serve as new sweeteners are actively sought in food processing. However, traditional discovery methods are time-consuming, laborious, and inefficient, and cannot keep pace with market demand. An effective and reliable strategy for finding sweet molecules is therefore needed. Machine learning combined with structure-activity relationships can provide accurate predictions of sweet molecules for the food industry. In this study, a new database of sweeteners and non-sweeteners with sweetness scores was established, based on the correlation between molecular structure and sweetness. MOE software was used to compute molecular descriptors that characterize the properties of each molecule. These descriptors were then filtered by neighborhood variance screening, collinearity removal (discarding highly correlated descriptors), and principal component contribution rate screening. The dataset was split into a training set (80%) for model construction and a test set (20%) for model validation. Random forest and support vector machine models were used to establish a qualitative structure-activity relationship for identifying potential sweet molecules. The area under the receiver operating characteristic curve (AUC) and the accuracy were used as evaluation metrics, with higher values indicating better classification, and the best-performing model was selected. Subsequently, principal component regression, K-nearest neighbors, random forest, and partial least squares regression were used to establish a quantitative structure-activity relationship for predicting the sweetness of sweet molecules. 
The coefficient of determination (R²) and the root mean square error (RMSE) were used as evaluation metrics for the quantitative structure-activity model, with a higher R² and a lower RMSE indicating a better model; the optimal model was selected by comparing performance. The models were then applied to the food composition database (FooDB) to predict potentially sweet food ingredients and their sweetness. As a result, a publicly accessible, manually curated, and continuously updated dataset of sweeteners, non-sweeteners, and sweetness values was established. The random forest model for identifying sweet molecules achieved an accuracy of 0.966 on the test set and an area under the ROC curve of 0.987, indicating excellent predictive ability. The sweetness prediction model, also based on random forest, achieved an R² of 0.82 and an RMSE of 0.60. The manually curated dataset supports combined qualitative and quantitative sweetener prediction. In total, 542 potential sweetener molecules, including lycopene, were identified in the food composition database. All data and code are available at https://gitee.com/wang_lab/EMMSM to facilitate further extension. The new model thus shows broad applicability and practical value in the search for new sweet molecules.
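The four evaluation metrics named above can be made concrete with a small sketch. This is a minimal illustration in plain Python, not the authors' code: the function names and the toy labels, probabilities, and sweetness values are assumptions chosen only for demonstration and do not come from the study's dataset.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of class labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Hypothetical classifier output: 1 = sweet, 0 = non-sweet.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
preds = [int(p >= 0.5) for p in probs]  # threshold at 0.5

# Hypothetical regression output: true vs. predicted sweetness values.
sweet_true = [1.0, 2.0, 3.0, 4.0]
sweet_pred = [1.5, 2.0, 2.5, 4.0]

print("accuracy:", accuracy(labels, preds))
print("AUC:", roc_auc(labels, probs))
print("R2:", r2_score(sweet_true, sweet_pred))
print("RMSE:", rmse(sweet_true, sweet_pred))
```

Accuracy depends on the chosen decision threshold, whereas AUC is computed from the ranking of scores alone, which is one reason the two are commonly reported together for classifiers.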